
Word Vectors Workshop

Word Vector CMU Workshop w/ Ben Schmidt 2018-06-01

word embedding: a general strategy for treating words as numbers; the meanings of words are converted to numbers so they can be computed with.

Word2Vec: an algorithm, released in 2013, that performs particularly useful word embedding. (Its developer became famous enough from it to move straight from Google to Facebook!) It did a better job of producing a useful set of word embeddings than previous approaches: a better optimization function and a better description of words and spaces. It runs fast, but not extremely fast.

To train a model, you need about a million words; 20-50 books is about the bare minimum, and training takes some time. (We won't do that in this workshop.) The complete run of a journal is about right; the complete works of a single author is not enough.

You need the one-in-a-million chance of a word to come up, e.g. the word "hazard"; it might not come up otherwise. (It's not unusual for an ordinary word to have a one-in-a-million chance of showing up, so you need lots and lots of text to develop sufficient richness in the model.)

wordVectors: an actual program, the R package we just installed, that you can run inside RStudio to do this yourself. (If you prefer Python, use gensim.) We're working with a streamlined version: an R wrapper around the word2vec code that Google released.
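To make the R workflow concrete, here is a minimal sketch of loading and exploring an already-trained model with the wordVectors package. The file name is a placeholder, and the calls follow the package's README; treat this as a sketch rather than the workshop's exact steps.

    # Load a pre-trained model and explore it (R, wordVectors package).
    # "demo_vectors.bin" is a placeholder file name.
    library(wordVectors)
    library(magrittr)   # for the %>% pipe used in the package's examples

    model <- read.vectors("demo_vectors.bin")

    # Nearest neighbors by cosine similarity
    model %>% closest_to("hazard", 10)

    # The raw vector for a single word
    model[["hazard"]]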

Word embeddings place words into space that:

  1. approximate semantic distance as spatial distance for computation.
  2. approximate frequent relationships as vectors in space.
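As a small, hedged illustration of "semantic distance as spatial distance": the package exposes cosine similarity directly. This assumes the model loaded in the sketch above and that all of the words are in its vocabulary.

    # Cosine similarity between word vectors: higher means semantically closer.
    cosineSimilarity(model[["cat"]], model[["tiger"]])
    cosineSimilarity(model[["cat"]], model[["carburetor"]])   # expect a lower value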

A vector drawn between two words indicates their relationship. Parallel vectors indicate parallel relationships: the vector drawn from cat to tiger would be parallel to the one from dog to wolf.

Vector directions and angles indicate kinds of relationships. (Think of a domesticity axis…)

    wolf        tiger
    dog         cat

(example of parallel vectors: imagine vectors running from dog to wolf and from cat to tiger)
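The parallel-vector idea is what makes analogy arithmetic work. A hedged sketch, again assuming the model loaded earlier contains all four animal words:

    # "cat is to tiger as dog is to ?": add the cat-to-tiger offset to dog
    # and look for the nearest words to the result; ideally "wolf" ranks high.
    model %>% closest_to(model[["tiger"]] - model[["cat"]] + model[["dog"]], 5)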

How do you get words to occupy vector space? Words that have similar meanings end up in similar regions of the space: word meanings are predicted from context.

window size = the number of words before and after a word from which to calculate vectors. Larger windows produce more semantic relationships. People tend to set a window size of 5 to get a good distribution; 5 or 10 is often good (5 to 10 words on each side). Relatively small windows get you better results than "huge" (whole-document) windows. [ebb Discussion: you might want a smaller window for Dr. Seuss than for German sentences or Shakespeare plays]

A million-word corpus gives you a million sliding windows' worth of relationships, which can give you very fine-grained results.

The EEBO corpus might take about 12 hours to train a model; 20-30 books might take 15 minutes or so.
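Training your own model (which we are not doing in the workshop) looks roughly like this with wordVectors. The folder and file names are placeholders, the parameters echo the guidance above (a 5-10 word window, a few hundred dimensions), and the calls follow the package's README.

    # Sketch: train a model from a folder of plain-text files.
    library(wordVectors)

    # Clean and bundle the texts into a single training file.
    # Lowercasing helps with the bcuz/becuz kind of problem mentioned below.
    prep_word2vec(origin = "my_corpus", destination = "my_corpus.txt",
                  lowercase = TRUE)

    # Train: 300 dimensions, a 10-word window on each side.
    model <- train_word2vec("my_corpus.txt", "my_corpus_vectors.bin",
                            vectors = 300, window = 10, threads = 4, iter = 5)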

This is a little like topic modeling in its way of clustering related words. (See http://benschmidt.org/profCloud/ for a 3-D example.) Each distinct(ly spelled) word appears only once in the model. You can do pre-processing to deal with the complexity this can generate (e.g. the distance we saw in his sample graph between bcuz and becuz).

300 dimensions! Word2Vec uses a few hundred dimensions for each word, i.e. a 300-dimensional space. This is NOT like topic models, because all the vectors/clusters are calculated and clustered on the same basis.

When you build your model, you specify the number of dimensions you want. The tool doesn’t give you much guidance on this.
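The dimension count is the vectors argument in the training sketch above; once a model is loaded, you can check it, assuming (as in the package's documentation) that the model object behaves like a matrix.

    # rows = vocabulary size, columns = the number of dimensions chosen at training
    dim(model)
    ncol(model)   # just the dimension count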

Word embeddings and Topic models:

  • Word embeddings try to create a detailed picture of language in a corpus.
  • By contrast, topic models try to simplify the vocab in each text into a higher level abstraction than an individual word.
  • Word embedding models are much more unstructured in the assumptions they make…they are just optimizing relationships in a space in order to position them relative to each other.
  • In topic modeling, the number of topics (dimensions) you choose alters what you can see in the model. By contrast, in word embedding, 100-300 dimensions is optimal. There is a degradation that occurs with too many. Is it "overfitting"? Ben thinks not; he hasn't yet seen a convincing paper about why, for example, 1000 dimensions are problematic.

Topic models are better than word embeddings because:

  1. They represent documents as important elements and let you make comparisons across them.
  2. Statisticians really like topic models—the math is easier to understand. Word vectors are challenging to mathematicians.
  3. Word embedding models have NO notion of a document at all; they just give you sliding windows of 10-15 words.
  4. Topic models can handle words with multiple meanings sensibly.
  5. Topic models let you abstract away from words to understand documents using broad measures—to give you sense of what’s in a pile of articles.

Word embeddings are better than topic models because:

  1. They retain words as core elements and let you make comparisons among them or generalize out from a few to many others.
  2. They make it possible (not easy) to define your own clusters of interest in terms of words.
  3. Topic models fit words into topic bins. Word embeddings let you abstract away from documents to understand words.

The embedding strategy is "representation learning," another idea from machine learning. (It is not "deep learning," which would mean training neural networks with multiple layers; that said, Word2Vec is itself a neural network technique!)

  • Ascendant in Machine Learning today
  • Define a generally useful task, like
    • predicts word from context
    • identify the content of an image
    • predict the classification of a text
  • Find a transformation function that creates a representation of about 100 to 50,000 numbers that works well at that task.
  • The same function may be generally useful in other contexts.

A convolutional net (don't worry about defining this for this workshop; look it up later) has something to do with the way the relationships are calculated.

WEMs (word embedding models): word similarities

Google Translate runs on embedded word vectors. Translate a Turkish sentence meaning "She is a doctor; he is a nurse" into English: because Turkish has no gendered pronouns (gender is conveyed in other ways) and Google Translate works from predictive models built on word embeddings, the translation comes back as "He is a doctor; she is a nurse."

CLOSING QUOTE: "Semantics derived automatically from language corpora necessarily contain human biases" (Caliskan-Islam et al.)

"We demonstrate here for the first time what some have long suspected (Quine 1960)—that semantics, the meaning of words, necessarily reflects regularities latent in our culture, some of which we now know to be prejudiced."

Try pulling out the closest word to "bossy" in a gendered vector space, and then try setting it in a non-gendered vector space. Try vectors like a misspelling vector or a capitalization vector. Make a vector from "Frankenstein" vs. "frankenstein" (if your model has both). The linear algebra operation is "rejection", which is basically subtraction.
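A hedged sketch of the "rejection" idea, using the reject() helper that the wordVectors package provides: build a rough gender direction, then remove it from the space and compare the neighbors of "bossy" before and after. Using the he/she difference as the gender vector is an illustrative choice, not the only one.

    # A rough gender direction: the difference between two gendered pronouns.
    gender_vector <- model[["he"]] - model[["she"]]

    # Neighbors of "bossy" in the original (gendered) space.
    model %>% closest_to("bossy", 10)

    # Reject (subtract out) the gender component from every word vector,
    # then look at the neighbors of "bossy" again.
    degendered <- reject(model, gender_vector)
    degendered %>% closest_to("bossy", 10)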

There are ways of aligning models to see how the vector space changes over time, i.e. how words shift in their context: gay in the 1900s vs. gay in the 1950s vs. gay in the 1990s.

Similarly, try broadcast and awful.
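Fully aligning models is its own technique and beyond this sketch, but a simple, hedged way to see this kind of drift is to train separate models on corpora from different periods and compare their nearest-neighbor lists, which do not require the spaces to be aligned. The file names are placeholders.

    # Compare a word's neighborhood across two period-specific models.
    model_1900s <- read.vectors("corpus_1900s_vectors.bin")
    model_1990s <- read.vectors("corpus_1990s_vectors.bin")

    model_1900s %>% closest_to("awful", 10)
    model_1990s %>% closest_to("awful", 10)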

Instructions on how to train your model are on Ben’s site: benschmidt.org/pittsburgh/

Q/A: Ryan Heuser is someone doing LOTS of work in this area, looking at changes in language between 1750 and 1850 in a long-term project: looking for major shifts in semantics around the time of the French Revolution. What are the ways that batches of words change in similar directions over 100 years?

Ryan is also working on effective ways to visualize stuff, and to work with multiple models and average the differences, etc.

People (Dan Evans) talk of running this on EEBO (it runs fine on a laptop).

Ryan’s been working with ECCO (1701-1800)

Chris (from 6 degrees): it's possible to join 2-grams, 3-grams, and 4-grams and search across them; there's a function for that (see the sketch below). Metadata can be added as tags (as in "this is poetry", "this is fiction").
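The n-gram joining mentioned here corresponds (per the package README) to the bundle_ngrams option of prep_word2vec; a hedged sketch:

    # Join common multi-word phrases (up to 3-grams) into single tokens,
    # e.g. "new_york", so they get vectors of their own before training.
    prep_word2vec(origin = "my_corpus", destination = "my_corpus_ngrams.txt",
                  lowercase = TRUE, bundle_ngrams = 3)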

Community is probably stronger in Python than in R. One reason is that you can drop the results more easily into other NLP pipelines than you can with R.