Text Analysis: Training a Word2Vec deep learning model

Some months ago I built a Twitter sentiment analyzer. It was designed to determine automatically whether a tweet is positive or negative. Word embeddings and word2vec were a crucial part of the pipeline, so I thought I would write a bit about this building block.

What is word embedding and word2vec?

If you are not familiar with word embeddings, I would advise you to take a look at this overview of word2vec. The overview is enlightening and a really good place to get started. The actual model in this post is Gensim’s Word2Vec implementation (w2v for short).

Word2vec works as a bridge between text (messages, emails, chats) and machine learning. You can also think of it as a feature extractor. Basically, it “just” translates words into vectors and therefore makes word-based observations machine learnable. With appropriate preprocessing, a word embedding model can be trained on any language.
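As a rough illustration of the feature extractor idea, the sketch below averages per-word vectors into one fixed-length vector for a whole tweet. The tiny three-dimensional `toy_vectors` table and the `tweet_to_features` helper are made up for this example. A trained model would supply real vectors, and averaging is only one common way to combine them, not necessarily what the analyzer in this post did.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings; a trained word2vec model
# would provide vectors with many more dimensions.
toy_vectors: Dict[str, List[float]] = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.0],
    "bad": [-0.9, 0.1, 0.0],
}


def tweet_to_features(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(known) for column in zip(*known)]


features = tweet_to_features(["what", "a", "great", "good", "day"], toy_vectors)
print(features)
```

Whatever the length of the tweet, the result has the same number of dimensions, which is what makes it usable as input to a classifier.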

How to preprocess?

I did some pretty basic preprocessing for the tweets before feeding them to the model. The most important part of the preprocessing pipeline was to turn all tweets into sentences. To train the model you need sentences, because the model uses context to determine relationships between words. You can find ready-made tokenizers for separating sentences; NLTK, for example, is a good one for Python.
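A minimal sketch of this step, assuming a simple regex splitter as a stand-in for a real sentence tokenizer such as the one in NLTK. The `tweet_to_sentences` helper and its rules are illustrative only. The result is a list of token lists, one per sentence, which is the shape of input that word2vec trainers typically consume.

```python
import re
from typing import List

# Split after ., ! or ? followed by whitespace. Real tokenizers handle
# abbreviations, URLs and emoticons far better than this.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Keep words (with an optional apostrophe part), @mentions and #hashtags.
_TOKEN = re.compile(r"[@#]?\w+(?:'\w+)?")


def tweet_to_sentences(tweet: str) -> List[List[str]]:
    """Turn one tweet into a list of lowercased, tokenized sentences."""
    sentences = []
    for raw in _SENTENCE_END.split(tweet.strip()):
        tokens = _TOKEN.findall(raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


example = tweet_to_sentences("I love this phone! Battery isn't great though. #mixed")
print(example)
```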

The newly created sentences are then used to train the model. Notice that later on you use the model with individual words, not sentences. This is a bit different from most models, where the training data looks like the data you later make predictions on.
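To picture why training needs whole sentences even though lookups are per word, the sketch below lists the (center, context) pairs that a skip-gram style model would draw from one sentence. The `context_pairs` helper and the window sizes are assumptions for illustration; this post does not say which word2vec variant or window size was used.

```python
from typing import List, Tuple


def context_pairs(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair every word with its neighbours up to `window` positions away."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


pairs = context_pairs(["the", "battery", "died", "again"], window=1)
print(pairs)
```

A single isolated word yields no pairs at all, which is why the training data has to keep words inside their sentences.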

Training the w2v model

To train the model I used a database of labeled tweets. The database had 1,570,000 entries.

Before the actual training, w2v scans through the whole dataset. During the scan it counts word frequencies and decides which words to keep in the model's vocabulary. Here is the terminal output from the scan:



The report is pretty self-explanatory. Let’s look at what it found:

  • 441,755 word types (unique words)
  • 14,830,060 raw words in the corpus (collection of words)
  • 1,981,440 sentences
  • 12,201 words with a min count of 40 (~2.8% of 441,755)
  • 62 words with a frequency of 0.001 or higher
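The scan amounts to a frequency count followed by a cutoff. Below is a minimal sketch of that idea in plain Python, not Gensim's actual implementation. `scan_vocab` is a hypothetical helper, and the tiny corpus and `min_count=2` are chosen only so the example prints something readable.

```python
from collections import Counter
from typing import Dict, List, Tuple


def scan_vocab(sentences: List[List[str]], min_count: int) -> Tuple[Dict[str, int], int]:
    """Count every word and keep those seen at least `min_count` times.

    Returns the retained vocabulary and the number of unique words seen.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return kept, len(counts)


corpus = [
    ["good", "morning", "twitter"],
    ["good", "night", "twitter"],
    ["what", "a", "good", "day"],
]
kept, unique = scan_vocab(corpus, min_count=2)
print(kept, unique)

# The same ratio for the figures reported above: 12201 of 441755 word types.
retained_share = 12201 / 441755
print(f"{retained_share:.1%}")
```

Rare words dominate the list of unique words, so even a modest cutoff removes most of them while keeping the words that make up the bulk of the text.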

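Training the model was surprisingly fast. It took only 100 seconds to train over 74 million raw words, or roughly 740,000 words per second. That total is about five times the 14.8 million raw words in the corpus, which is consistent with five passes over the data. You can see that it used 4 cores for the training. Below is output from the end of the process: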



Now we have a fitted model ready for machine learning! The Twitter sentiment analyzer built on this model achieved an accuracy of ~75%.
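For reference, the training step described above might look roughly like the sketch below. It assumes Gensim is installed and uses only the settings that this post reports (a minimum count of 40, a sampling threshold of 0.001, and 4 worker threads). Every other setting is left at Gensim's default, and the `train_word2vec` wrapper is a hypothetical name, not the original code.

```python
from typing import Dict, List, Union

# Settings taken from the figures reported in this post.
W2V_PARAMS: Dict[str, Union[int, float]] = {
    "min_count": 40,  # ignore words seen fewer than 40 times
    "sample": 0.001,  # threshold for downsampling very frequent words
    "workers": 4,     # number of training threads
}


def train_word2vec(sentences: List[List[str]]):
    """Scan the vocabulary and train a Gensim Word2Vec model in one call.

    `sentences` must be a list of token lists (not a one-shot generator),
    because the corpus is read once for the scan and again for training.
    """
    # Imported here so the settings above can be inspected without Gensim.
    from gensim.models import Word2Vec

    return Word2Vec(sentences, **W2V_PARAMS)
```

After `model = train_word2vec(sentences)`, the vector for a single word is available as `model.wv["good"]`, provided the word survived the minimum count. With a cutoff of 40, a small toy corpus would leave the vocabulary empty, so this only makes sense on a large dataset.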