Musings on Language: natural language processing

Monday, December 3, 2012

Project: Synesthesia

Recently in Hacker School, I finished a project using Python I called "Synesthesia." The concept was devised by my housemate Natan, but it seemed to be an interesting engineering problem that would allow me to revisit my psycholinguistics knowledge. And then we can hang our art in our living room.

The concept: A generative art program into which you feed an image and you get back another image such that all the colored regions in the original are replaced with words that are related to that colors. So, a previously red patch might sport the words "cherry," "communism," and "balloon." Here's what happens:

This program has several parts, the logic behind each I will explain in more detail later. They are:

Making the image's color palette simpler based on what the user's preferences. For the image above, I chose green, blue, orange, and black.
Figuring out which words are related to those selected colors
Putting the words where the colors are

1. Simplifying the Image: imagesimpler.py

In order to best figure out what words should go where, I needed to simplify the color space of the image. So, we as humans know that the blue section in the original above is all, well, blue. However, neighboring pixels in image are a) different colors and b) defined not by their names but by the amount of red, green, and blue they comprise. This trio is called the "RGB value."

To do this, I used a nice module called webcolors that allowed me to make a list of the RGB value for each of the colors specified by the user. Then, I calculated the Euclidean distance between each pixel in the original image and the list of RGB values I'd created. Subsequently, I replaced the pixel with its nearest approximate, creating a simplified image.

2. Related Words: cooccurrence.py

In order to get a list of words that are related with the colors in the image, I appealed to the idea that related words co-occur with one another. I created a module that will find words that co-occur with any target word. To do this, I first scraped Wikipedia articles that come up when you do a search for the target word, compiling them all into a single corpus. I did this using Beautiful Soup. Following that, I created a frequency distribution of words that occur within a certain distance of the target word within it's scraped corpus, seeing how many times each word appeared. I then used this distribution to determine which words co-occur significantly with the target.

3. Creating the Art: synesthesizer.py

To create the "synesthesized" image, I started at the top left corner, figured out what color the pixel was and popped the first word related to that color from the list of related words, and put in a new image in the corresponding space. Each letter in the word stood in for a pixel, and each word is separated by a space, which is also a pixel. The "color" of the word is determined by the color of the pixel-space it begins on. So, if the word "cat" were placed, the next word would begin 4 pixel-spaces after "cat" began. If the next word in line for the color doesn't fit on the line, it's put back in the stack and I start again at the new line. I know that's a bit confusing, so here's a visualization:

I admit that this is a lazy approach, which both muddles the image and makes the right side of the image have a terribly jagged border. I should perhaps create a dynamic programming algorithm (it's a knapsack-y problem) that will pick words based on which ones with fit in the amount of the color that is left.

EDIT 12/5/12:
Writing this post and having to defend all my lazy design decisions made me realize that well, I was just being lazy and I could learn so much from actually solving the problem instead of just saying "close enough." I wanted to fill in each line of color without wasting space or running off the edge of the color's space. Sounds like a linear packing problem to me! A month of Hacker School has passed since I made my original (lazy) design decision, and in the past month, I've realized that I can solve any hard problem if I just work enough at it. Even an NP-hard problem!

I paired up with another Hacker Schooler, Betsy, because two brains are better than one, and we set to making a better painting algorithm. After joking a bit about solving the bin-packing problem and then getting a Field's medal, we decided that just making this algorithm better would be sufficient, resigning to our O(n!) fate. The way to find the most optimal solution is to find all the solutions and choose the best one.

Quite expensive yes, but we know storage is our best friend. Because the corpuses from which the co-occurrences are calculated aren't scraped freshly every time, we realized that we could store the results from our combinatorics in a pickled dictionary keyed by color and then by length, greatly reducing the online runtime. However, we realized that we didn't need to calculate all of the combinations. We found only all combinations that are shorter than a certain number of characters (25), realizing that anything longer than that could be recursively split into two shorter segments which would have solutions within the dictionary. Then, when filling in the images line by line, we see how long each color segment is and then pop off the next word in from the "reconstituted" dictionary. In the event that a segment of color is shorter than any of the words, we put in exclamation points as a place holder. This method allowed us to reproduce the image extremely precisely, but we are going to look into different solutions.

Another little interesting problem I'll touch on was that there were originally vertical bands of spaces running down the entirely orange segment in the Rothko you can see as the sample image above (the orange and blue one). We realized that this was because the line was always being split in half at the same place, leading to there being a space in the same place on each line at the junctions between the spliced segments. We added a random shift from 0 to 4 when splitting longer segments so to add noise and eliminate the band. Sweet!

It's really exciting to me now that more detailed images actually look better now than ones that are less so, like the above Rothko. Look, it's me!

Check it all out on github!

Tuesday, May 24, 2011

Psychological Word Space

This is the final project for the Computational Cognitive Science class I took this semester. The graph is a comparison of two mental words space approximators...

Step 1: Latent Semantic Analysis

The first was a purely text-based analysis called Latent Semantic Analysis. The data is represented by the red x's in the picture. What this algorithm does, in short, is sees how many times a bunch of words appear in a bunch of documents and from that can give a pretty good picture of the relationship between words. Cool!

To implement LSA, we needed a huge corpus of text to evaluate. Websites about LSA recommended using over 1000 documents, and our specific algorithm needed all these texts in a single plain text document. To get this, we first used a pretty simple bash script to collect 1000 URLs from Google Blog Search for "education" using a text-based browser called Lynx. Be aware that this is actually against Google’s Terms of Service and we ran into a little trouble with Google noticing some funny activity from our IP address. (All in the name of science!). (If you are going to attempt this project, I encourage you to plan ahead and look into this) Then, we ran another pretty simple script that again used Lynx and went through the URL list, pulling the text from each of the websites, and putting them all into a single plain text document. (The code was not developed by me, so I do not feel comfortable sharing it here!) The next step was to pass the corpus to a script from from a MATLAB toolbox that made from it a document-term matrix. In doing this, we also asked the script to disregard stop words, high-frequency words low in content that help us to form meaningful syntax, as is common practice in natural language processing. From this matrix, we were able to determine the 100 most prevalent words in the corpus. Despite fears we had about the quality of a corpus gathered via automated script from the Internet, the most popular was indeed our search term, “education.” From these 100 words, we pulled 10 that were salient in the list and that we thought would make for meaningful comparisons for the second part of our experiment. To make the matrix manageable, we created a smaller document-term matrix with just these 10 terms, then ran a dimensionality reduction algorithm called singular value decomposition (also in that MATLAB toolbox) on it. We then plotted the data using the second singular value as the X and third singular value as the Y, resulting in a word space. We chose not to use the first singular value because it actually indicates the number of times that word has been used in all documents, and thus it made more sense to use the second and third.

Step 2: Multidimensional Scaling

We also wanted to compare the findings using LSA to a human psychological space. To do this, we first gathered all unique pairwise similarity ratings of salient terms from the list of the top 100 most popular words in the LSA corpus: college, university, state, students, teacher, school, research, business, government, job. This resulted in 45 unique comparisons. We gathered this data from 116 subjects using Google forms. After translating this data to a similarity matrix, we ran a different dimensionality reduction algorithm, called multidimensional scaling, on it to make a plot-able data set that supposedly represents the psychological word space surrounding the concept of education (The the green o's on the graph).

Step 3: Comparison!

Upon comparing the output of SVD and MDS run just for the ten terms and then graphed, we found similarities between the two graphs. Yay! Our project worked! Most striking was the fact that “state” was far away from the rest of the words, off to one side, but on opposite sides in each graph. Also in both plots, “students” was on the opposite side of the graph from “state”. Upon realizing this, we realized that both LSA and MDS had created a similar spectrum from learning (“student”) to bureaucracy (“state”). LSA showed that “job” and “business” were similar concepts, as well as “college” and “university”. Unfortunately, everything in MDS was close to each other and therefore we cannot comment as meaningfully on it as we can on the LSA plot. Due to the striking similarities we mentioned earlier, we decided that, legitimately or not, we would superimpose the graphs, flipping the LSA graph so that “state” and “student” matched in sides. This was accomplished by multiplying the SVD matrix by -1. This plot can be seen in Figure 3. In doing this, we realized that the more bureaucracy-oriented terms matched pretty closely while the learning-oriented ones did not.

Overall, we found that our intuitions about the similarity of 10 terms was actually captured better by the LSA plot than by the MDS plot. We think this might have had something to do with the fact that people had trouble with the task, treating it often as an all or nothing rating rather than a scale that would be appropriate for approximating a psychological space. If we were to make adjustments to this part, we would choose instead a three-way comparison task, forcing people to make judgements that our previous two-way comparison task did not properly encourage.

Yeah, it looks like a measly graph with x's and o's and twenty words, but it's a guess at our mental lexicon. It was quite fun to develop and code. Here's to a future in computational psycholinguistics!