Word2vec and all that jazz.

Wikipedia assignment

Last week we had an assignment: use word2vec and see what it could be used for. Since it was a group assignment, we wasted a bunch of our time on it, but in the end we pulled through. It was actually very interesting. First, though:

What the hell is word2vec?

Word2vec is a group of related models, but you can think of it as a program. A program that takes a shitton of text, like say, all of Wikipedia, and gives every word a vector. A vector is an arrow with a length and a direction (you should know this from high school math). Think of an x, y grid, like the ones used in graphs: put in two points, draw a line between them, and give that line a direction, heading up, or down, or whatever. That is a vector. The vectors word2vec produces work the same way, just with hundreds of dimensions instead of two, and words that show up in similar contexts end up pointing in similar directions. All of this was created by researchers at Google led by Tomas Mikolov.
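
To make the "words as vectors" idea a bit more concrete, here is a minimal sketch in Python. The numbers are made up by hand for illustration, they are not real word2vec output, and the `cosine` helper is just the standard way to compare the direction of two vectors.

```python
import math

# Hand-made 2D toy vectors (an assumption for illustration only).
# Real word2vec vectors are learned from text and are much longer.
vectors = {
    "cat": (0.9, 0.8),
    "dog": (0.8, 0.9),
    "car": (0.9, -0.7),
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "cat" and "dog" point in nearly the same direction, "car" does not.
print("cat vs dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat vs car:", cosine(vectors["cat"], vectors["car"]))
```

The first number comes out much higher than the second, which is all "similar words get similar vectors" really means.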


With all these words turned into vectors, we can start using word2vec. At first we were given examples of how to use the program. If you input "man is to king as woman is to BLANK", the program spits out "queen" as the most likely answer. Now you may look at that and think that's not really impressive, but the kicker is that no one taught the program this; it picked it up from the text alone. We can do more complex things with word2vec too: if you give it four words and ask it to pick the one that is least like the others, more often than not it will select the misplaced one.
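
Both tricks boil down to simple vector arithmetic. The sketch below is not the program we used. It is a toy version with hand-made three-dimensional vectors (roughly "royalty", "maleness" and "food", all invented for this example), and the function names `analogy` and `odd_one_out` are my own. It only shows the idea: the analogy is "king minus man plus woman", and the odd one out is the word that points least like the average of the group.

```python
import math

# Invented toy vectors: (royalty, maleness, food). Not real word2vec output.
vectors = {
    "king":   (0.9, 0.9, 0.0),
    "queen":  (0.9, -0.9, 0.0),
    "man":    (0.1, 0.9, 0.0),
    "woman":  (0.1, -0.9, 0.0),
    "prince": (0.7, 0.8, 0.0),
    "apple":  (0.0, 0.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by computing b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The input words are excluded, otherwise they tend to win trivially.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def odd_one_out(words):
    """Return the word whose vector is least similar to the group average."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words)
            for i in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

print(analogy("man", "king", "woman"))                  # queen
print(odd_one_out(["king", "prince", "man", "apple"]))  # apple
```

With toy numbers the answer is always right. With real vectors learned from Wikipedia it is only right "more often than not", as we found out.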

What we tried to do in this project was find out how word2vec could be used in translation programs, a very lofty goal if you ask me, and we failed. We readjusted our ambitions to something more reasonable: how to use word2vec to study something in Wikipedia. And it worked out quite well, even though we wanted to do more. Time is the cruelest mistress.

In the end.

We did work out something useful in the end, but I still think that looking at word2vec for translation is exciting, and way too BIG for us to look at right now. The most relevant thing we worked out was the clear limitations of the word2vec program: two-word terms and case-sensitive words did not work out too well. All in all it was a fun assignment. I could post our findings, but since I'm not very proud of our failure to look at the translation value of word2vec, I'm not going to.
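
For the curious, here is a rough guess at why two-word terms and capital letters cause trouble. The sketch assumes the vocabulary is a plain lookup table keyed by exact lowercase tokens, with phrases joined by an underscore. That is an assumption about how such a model might be built, not a description of our actual setup, and the `normalize` helper is hypothetical.

```python
# Toy vocabulary (assumed: lowercased, phrases joined with "_").
vocab = {
    "paris": (0.2, 0.7),
    "new_york": (0.3, 0.6),
}

def lookup(word):
    """Exact-match lookup. Returns None when the key is not in the vocabulary."""
    return vocab.get(word)

def normalize(text):
    """Hypothetical helper: lowercase the text and join words with underscores."""
    return "_".join(text.lower().split())

for query in ["Paris", "New York"]:
    # The raw query misses, the normalized one hits.
    print(query, "->", lookup(query), "|", normalize(query), "->", lookup(normalize(query)))
```

If the model only knows "paris" and "new_york", then asking for "Paris" or "New York" finds nothing, which looks a lot like what we ran into.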


