Musings on Language: 2011

This is the final project for the Computational Cognitive Science class I took this semester. The graph is a comparison of two approximations of a mental word space...

Step 1: Latent Semantic Analysis

The first was a purely text-based analysis called Latent Semantic Analysis (LSA). Its data is represented by the red x's in the picture. In short, the algorithm counts how many times each of a bunch of words appears in each of a bunch of documents, and from those counts it can give a pretty good picture of the relationships between words. Cool!

To implement LSA, we needed a huge corpus of text to evaluate. Websites about LSA recommended using over 1000 documents, and our specific algorithm needed all these texts in a single plain text document. To get this, we first used a pretty simple bash script to collect 1000 URLs from Google Blog Search for "education" using a text-based browser called Lynx. Be aware that this is actually against Google’s Terms of Service, and we ran into a little trouble with Google noticing some funny activity from our IP address. (All in the name of science! If you are going to attempt this project, I encourage you to plan ahead and look into this.) Then we ran another pretty simple script that again used Lynx to go through the URL list, pull the text from each of the websites, and put it all into a single plain text document. (The code was not developed by me, so I do not feel comfortable sharing it here!)

The next step was to pass the corpus to a script from a MATLAB toolbox that built a document-term matrix from it. In doing this, we also asked the script to disregard stop words (high-frequency words that are low in content but help us form meaningful syntax), as is common practice in natural language processing. From this matrix, we were able to determine the 100 most prevalent words in the corpus. Despite fears we had about the quality of a corpus gathered via automated script from the Internet, the most popular word was indeed our search term, “education.” From these 100 words, we pulled 10 that were salient in the list and that we thought would make for meaningful comparisons in the second part of our experiment.

To make the matrix manageable, we created a smaller document-term matrix with just these 10 terms, then ran a dimensionality reduction algorithm called singular value decomposition (SVD, also in that MATLAB toolbox) on it. We then plotted the data using the second SVD dimension as the X axis and the third as the Y axis, resulting in a word space. We chose not to use the first dimension because, on raw counts, it mostly reflects how many times each word is used across all documents, and thus it made more sense to use the second and third.
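For readers who want to see the shape of this step, here is a minimal sketch in Python with NumPy (not the MATLAB toolbox we actually used, and not our original code). The toy corpus, the stop-word list, and all variable names are made up for illustration; the only things taken from the project are the steps themselves: count terms per document while skipping stop words, find the most frequent terms, run SVD, and keep the second and third dimensions for plotting.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = [
    "the students at the university study education",
    "the state funds education and the school",
    "a teacher at the school helps students",
    "the government and the state fund research",
]
stop_words = {"the", "a", "at", "and"}


def term_counts(doc):
    """Count the non-stop-word tokens in one document."""
    return Counter(w for w in doc.lower().split() if w not in stop_words)


counts = [term_counts(d) for d in docs]
vocab = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term.
X = np.array([[c[t] for t in vocab] for c in counts], dtype=float)

# Most prevalent terms across the whole corpus (stable sort keeps
# alphabetical order among ties).
totals = X.sum(axis=0)
top_terms = [vocab[i] for i in np.argsort(-totals, kind="stable")[:3]]

# SVD: X = U @ diag(s) @ Vt. Each row of Vt.T * s gives one term's
# coordinates in the reduced space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_coords = Vt.T * s

# Skip the first dimension and keep the second and third for plotting.
xy = term_coords[:, 1:3]
```

Each row of `xy` is one term's (X, Y) position in the word space. With a corpus this tiny the picture means nothing; it only shows the mechanics.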

Step 2: Multidimensional Scaling

We also wanted to compare the findings from LSA to a human psychological space. To do this, we first gathered similarity ratings for every unique pair of the ten salient terms we had picked from the top 100 most popular words in the LSA corpus: college, university, state, students, teacher, school, research, business, government, job. This resulted in 45 unique comparisons. We gathered this data from 116 subjects using Google Forms. After translating this data into a similarity matrix, we ran a different dimensionality reduction algorithm, called multidimensional scaling (MDS), on it to make a plottable data set that supposedly represents the psychological word space surrounding the concept of education (the green o's on the graph).
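To make the numbers concrete, here is a rough sketch of this step in Python with NumPy. It rests on several assumptions: the similarity values are random placeholders (not our survey data), similarities are turned into dissimilarities as 1 minus similarity, and the MDS variant is classical (Torgerson) MDS, which may not be the one we actually ran.

```python
from itertools import combinations

import numpy as np

terms = ["college", "university", "state", "students", "teacher",
         "school", "research", "business", "government", "job"]

# Every unique unordered pair of the ten terms: 10 * 9 / 2 = 45.
pairs = list(combinations(terms, 2))

# Placeholder similarity matrix in [0, 1]; real values would be the
# mean rating for each pair across subjects.
rng = np.random.default_rng(0)
n = len(terms)
S = np.eye(n)
for a, b in pairs:
    i, j = terms.index(a), terms.index(b)
    S[i, j] = S[j, i] = rng.uniform(0.1, 0.9)

# Assumed conversion from similarity to dissimilarity.
D = 1.0 - S

# Classical MDS: double-center the squared dissimilarities, then use
# the top two eigenvectors scaled by the square roots of their eigenvalues.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```

Each row of `coords` is one term's position in a two-dimensional space where terms rated as similar should end up close together.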

Step 3: Comparison!

Upon comparing the output of SVD and MDS, each run on just the ten terms and then graphed, we found similarities between the two graphs. Yay! Our project worked! Most striking was that “state” sat far away from the rest of the words, off to one side, in both graphs, though on opposite sides in each. Also in both plots, “students” was on the opposite side of the graph from “state”. Seeing this, we realized that both LSA and MDS had created a similar spectrum from learning (“students”) to bureaucracy (“state”). LSA showed that “job” and “business” were similar concepts, as were “college” and “university”. Unfortunately, everything in the MDS plot was clustered close together, so we cannot comment on it as meaningfully as we can on the LSA plot. Due to the striking similarities we mentioned earlier, we decided that, legitimately or not, we would superimpose the graphs, flipping the LSA graph so that “state” and “students” fell on matching sides. This was accomplished by multiplying the SVD matrix by -1. This plot can be seen in Figure 3. In doing this, we realized that the more bureaucracy-oriented terms matched pretty closely while the learning-oriented ones did not.
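The flip itself is tiny. The sketch below uses made-up coordinates for three of the terms (not our results) to show what multiplying by -1 does. One point in favor of its legitimacy: the sign of each SVD dimension is arbitrary, so negating the coordinates changes which side a word sits on without changing any distance between words.

```python
import numpy as np

terms = ["state", "students", "job"]

# Hypothetical 2-D coordinates, for illustration only.
lsa = np.array([[2.0, 0.1], [-1.5, 0.3], [0.2, -0.8]])
mds = np.array([[-0.9, 0.0], [0.7, 0.2], [-0.1, -0.3]])

# Multiplying by -1 negates both axes (a 180-degree rotation).
lsa_flipped = -1 * lsa


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


i_state = terms.index("state")
i_students = terms.index("students")

# After the flip, both anchor words fall on the same side of the
# X axis in the two spaces.
state_matches = np.sign(lsa_flipped[i_state, 0]) == np.sign(mds[i_state, 0])
students_match = np.sign(lsa_flipped[i_students, 0]) == np.sign(mds[i_students, 0])
```

The two plots are still on different scales after the flip, so a superimposed graph like ours is only good for eyeballing which side of the space a word falls on.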

Overall, we found that our intuitions about the similarity of the 10 terms were actually captured better by the LSA plot than by the MDS plot. We think this might have had something to do with the fact that people had trouble with the task, often treating it as an all-or-nothing rating rather than the graded scale that would be appropriate for approximating a psychological space. If we were to make adjustments to this part, we would choose instead a three-way comparison task, forcing people to make the kind of judgements that our two-way comparison task did not properly encourage.

Yeah, it looks like a measly graph with x's and o's and twenty words, but it's a guess at our mental lexicon. It was quite fun to develop and code. Here's to a future in computational psycholinguistics!

http://www.gapminder.org/videos/the-joy-of-stats/

At about 40:30, this sensational documentary by Hans Rosling talks about the efforts Google is making in real-time translation so people speaking different languages can communicate in a flash. I'm still parsing through my thoughts on it. On the one hand, it is fascinating and mind-blowing that this technology can exist.

My torn-ness revolves around the place of all the different languages that exist. Does language difference serve a purpose? Sapir, Whorf, and the theory of linguistic relativity would say that because people speak different languages, they think differently, at least to an extent. So does it go the other way too? Does diversity of language reflect the nuances in culture? This is getting a little out of my purview, but it makes me think: is there any advantage to having the thousands of different languages that people across the globe speak? In this global age, is it just a burden to progress? The idea of getting people across the world able to communicate has been around for ages, probably not even starting with the creators of Esperanto. But why don't we speak Esperanto these days? I really don't know. It seems awesome to me. Did it fail because it was a synthetic language?

But this Google project seems a little different. It lets people keep their cultural differences and the nuances that their language provides them, and then translates. There is a reason why the term "lost in translation" exists. Can Google's gizmo get good enough at translating the extra-linguistic nuances? This becomes super relevant with real-time audio translation. Suprasegmental features of language (inflection, intonation, etc.) differ across languages. Will Google's translator-voice take that into account? I mean, even human translation is not perfect...

I'm excited to see what comes of this. Imagine the globalization opportunities!

Tuesday, May 24, 2011

Post Script...

Psychological Word Space

Monday, January 24, 2011

Taking a Wrecking Ball to the Tower of Babel