Earlier, I played around with topic modeling/recommendation engines in Apache Spark. Since then, I’ve been curious to see if I could make any gains by swapping topic modeling for another text processing approach: word2vec. For those who don’t know word2vec, it takes individual words and maps them into a vector space where the vector weights are determined by a neural network that trains on a corpus of text documents.
I won’t go into major depth on neural networks here (a gentle introduction for those who are interested), except to say that they are considered by many to be the bleeding edge of artificial intelligence. Personally, I like word2vec because you don’t necessarily have to train the vectors yourself. Google has pre-trained vectors derived from a massive corpus of news documents they’ve indexed. These vectors are rich in semantic meaning, so it’s pretty cool that you can leverage their value with no extra work. All you have to do is download the (admittedly large, 1.5 gig) file onto your computer and you’re good to go.
Almost. Originally, I had wanted to do this on top of my earlier Spark project, using the same pseudo-distributed Docker cluster on my old-ass laptop. But when I tried to load the pre-trained Google word vectors into memory, I got a big fat MemoryError, which I actually thought was pretty generous because it was nice enough to tell me exactly what it was.
I had three options: commandeer some computers in the cloud on Amazon, try to finagle Spark’s configuration like I did last time, or finally, try running Spark in local mode. Since I am still operating on the cheap, I wasn’t gonna go with option one. And since futzing around with Spark’s configuration led me to a dead end last time, I decided to ditch the pseudo-cluster and try running Spark in local mode.
Although local mode was way slower on some tasks, it could still load Google’s pre-trained word2vec model, so I was in business. Similar to my approach with topic modeling, I created a representative vector (or ‘profile’) for each user in the MovieLens dataset. But whereas with the topic model I created a profile vector by taking the max value in each topic across a user’s top-rated movies, here I instead averaged the vectors I derived from each movie (which were themselves averages of word vectors).
Let’s make this a bit more clear. First you take a plot summary scraped from Wikipedia, and then you remove common stop words (‘the’, ‘a’, ‘my’, etc.). Then you pass those words through the pre-trained word2vec model. This maps each word to a vector of length 300 (a word vector can in principle be of any length, but Google’s are of length 300). Now you have D vectors of length 300, where D is the number of words in a plot summary. If you average the values in those D vectors, you arrive at a single vector that represents one movie’s plot summary.
Note: there are other ways of aggregating word vectors into a single document representation (including doc2vec), but I proceeded with averages because I was curious to see whether I could make any gains by using the most dead simple approach.
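Here's a minimal sketch of that averaging step. It isn't the code from my project: the tiny `TOY_VECTORS` dictionary stands in for the pre-trained model (3 dimensions instead of 300), the stop word list is abbreviated, and skipping words the model doesn't know is an assumption made for illustration.

```python
# Toy stand-in for the pre-trained model: a dict mapping word -> vector.
# Real Google vectors have 300 dimensions; 3 are used here for readability.
TOY_VECTORS = {
    "spy": [1.0, 0.0, 0.0],
    "agent": [0.8, 0.2, 0.0],
    "chase": [0.0, 1.0, 0.0],
    "love": [0.0, 0.0, 1.0],
}

# Abbreviated, illustrative stop word list (an assumption, not the real one).
STOP_WORDS = {"the", "a", "my", "an", "of", "and"}


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def movie_vector(plot_summary, word_vectors, stop_words=STOP_WORDS):
    """Average the vectors of every non-stop word found in word_vectors."""
    # Lowercase each token and strip surrounding punctuation.
    words = [w.strip(".,!?;:\"'()").lower() for w in plot_summary.split()]
    kept = [
        word_vectors[w]
        for w in words
        if w and w not in stop_words and w in word_vectors
    ]
    if not kept:
        return None  # nothing usable in this summary
    return average_vectors(kept)


print(movie_vector("The spy and the agent chase.", TOY_VECTORS))
```

With real vectors you'd probably reach for NumPy instead of list comprehensions, but the arithmetic is the same: one mean per dimension.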
Once you have an average vector for each movie, you can get a profile vector for each user by averaging (again) across a user’s top-rated movies. At this point, recommendations can be made by ranking the cosine similarity between a user’s profile and the average vectors for each movie. This could power a recommendation engine on its own, or supplement explicit ratings for (user, movie) pairs that aren’t observed in the training data.
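To make the profile-and-ranking idea concrete, here's a rough sketch under some assumptions: the movie titles and 3-dimensional vectors in `MOVIE_VECTORS` are made up, `recommend` is a hypothetical helper name, and leaving the user's already top-rated movies out of the results is a choice made for this example rather than something the approach requires.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def recommend(top_rated_titles, movie_vectors, k=2):
    """Rank movies the user has not top-rated by similarity to their profile."""
    # The profile is the average of the user's top-rated movie vectors.
    profile = average_vectors([movie_vectors[t] for t in top_rated_titles])
    candidates = [
        (title, cosine_similarity(profile, vec))
        for title, vec in movie_vectors.items()
        if title not in top_rated_titles
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Hypothetical hand-made movie vectors (real ones would have 300 dimensions).
MOVIE_VECTORS = {
    "spy_thriller_a": [0.9, 0.1, 0.0],
    "spy_thriller_b": [0.8, 0.2, 0.1],
    "spy_thriller_c": [0.7, 0.3, 0.0],
    "romance_a": [0.0, 0.1, 0.9],
}

for title, score in recommend(["spy_thriller_a", "spy_thriller_b"], MOVIE_VECTORS):
    print(title, round(score, 3))
```

A user who top-rated two spy thrillers ends up with a profile pointing in the "spy" direction, so the third spy thriller ranks well ahead of the romance.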
Cognizant of the hardware limitations I ran up against last time, I opted for the same approach I adopted then, which was to pretend I knew less about users and their preferences than I really did. My main goal was to see whether word2vec could beat out the topic modeling approach, and in fact it did. With 25% of the data covered up, the two algorithms performed roughly the same against the covered-up data. But with 75% of the data covered up, word2vec resulted in an 8% performance boost (as compared with the 3% gained from topic modeling).
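For anyone wondering what "covering up" data looks like in practice, here's one way it could be sketched. This is an illustration under assumptions, not the exact procedure from my experiment: `cover_up` is a hypothetical helper, the ratings are synthetic rather than MovieLens, and the split is a plain seeded random shuffle. It only shows the hiding step, not how the hidden ratings get scored.

```python
import random


def cover_up(ratings, fraction, seed=0):
    """Split ratings into (visible, hidden), hiding about `fraction` of them."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(ratings)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_hidden = round(len(shuffled) * fraction)
    return shuffled[n_hidden:], shuffled[:n_hidden]


# Synthetic (user, movie, rating) triples, purely for illustration.
RATINGS = [(u, m, (u + m) % 5 + 1) for u in range(4) for m in range(5)]

visible, hidden = cover_up(RATINGS, 0.75)
print(len(visible), len(hidden))  # 5 15
```

The model only gets to see `visible`; the `hidden` ratings are what its predictions are checked against.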
So with very little extra work (simple averaging and pre-trained word vectors), word2vec has pretty encouraging out-of-the-box performance. It definitely makes me eager to use word2vec in the future.
Also a point in word2vec’s favor: when I sanity-checked the cosine similarity scores of word2vec’s average vectors across different movies, The Ipcress File shot to the top of the list of movies most similar to The Bourne Ultimatum. Still don’t know what The Ipcress File is? Then I don’t feel bad re-using the same joke as a meme sign-off.