The sentiment analysis in this project aimed to determine whether a tweet was positive or negative. The starting point was a dataset of over 1.5 million tweets, each labeled as positive or negative (0 or 1). A simple classifier was fitted on the processed and transformed tweets. And it worked!😀
Let’s do a rough overview of what was done:
- Cleaning and preparing the data (text)
- Fitting a Gensim Word2Vec model on the data (text)
- Mapping text through the Word2Vec model to a 300-dimensional space (text to vector)
- Using the newly created 300-dimensional vectors for machine learning (vector)
- Optimizing the model's hyperparameters with 5-fold cross-validation (CV)
- Evaluating the model performance
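To make the cleaning and text-to-vector steps more concrete, here is a minimal sketch using only the standard library. It is not the project's actual code. The function names `clean_tweet` and `tweet_to_vector` are hypothetical, the cleaning rules are one plausible choice, and averaging the word vectors of a tweet is an assumption (a common way to get one fixed-size vector per text). A tiny 3-dimensional dictionary stands in for the trained 300-dimensional Word2Vec model.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only (also removes '#')
    return text.split()

def tweet_to_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known.

    Assumption: averaging is used here for illustration only.
    `vectors` is any word -> vector lookup that supports `in` and indexing.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Toy 3-dimensional lookup standing in for a trained 300-dimensional model.
toy_vectors = {
    "love": [1.0, 0.0, 2.0],
    "coffee": [3.0, 2.0, 0.0],
}

tokens = clean_tweet("@bob I LOVE coffee!!! http://example.com #mornings")
vec = tweet_to_vector(tokens, toy_vectors, dim=3)
print(tokens)
print(vec)
```

Words missing from the lookup ("i" and "mornings" here) are simply skipped, so every tweet ends up as a vector of the same length that a classifier can consume.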
The modelling was done in Python with the SciPy stack, Gensim, scikit-learn and the Natural Language Toolkit (NLTK). The final modelling pipeline was about 1000 lines of code.
The final model reached a 5-fold CV accuracy of about 0.75, which means that it classified roughly 75% of held-out tweets correctly. Accuracy could be increased, for example, by using more sophisticated models or by tuning the hyperparameters more thoroughly. A natural upper limit for accuracy in sentiment analysis is roughly 0.9, since some opinions are debatable.
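For readers unfamiliar with the term, the sketch below shows what a 5-fold CV accuracy measures: the data is split into five folds, a model is trained on four of them and scored on the fifth, and the five scores are averaged. Everything here is illustrative. The nearest-centroid classifier is a stand-in chosen because it fits in a few lines of plain Python, not the classifier used in the project, and the toy data is made up.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector (centroid) per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean accuracy over k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k

# Made-up toy data: two clearly separated groups of 2-dimensional points.
X = [[j * 0.1, 0.0] for j in range(50)] + [[10 + j * 0.1, 0.0] for j in range(50)]
y = [0] * 50 + [1] * 50
accuracy = cross_val_accuracy(X, y)
print(accuracy)
```

Because each tweet is scored only by a model that never saw it during training, the averaged number is a fairer estimate of performance on new tweets than accuracy on the training data would be.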
This kind of sentiment analysis has a variety of potential applications. The model could be reused on Twitter streams with certain hashtags. One could, for example, stream tweets containing “Nordea” or “Kone”. Afterwards, sentiment analysis could show how certain actions affected popular opinion about the companies.
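As a rough illustration of that idea, the sketch below tracks the share of positive tweets per day for one keyword. It was not part of the project. The function name `daily_positive_share`, the example tweets and the dates are all hypothetical, and the code assumes the model's predictions are encoded with 1 for positive and 0 for negative.

```python
from collections import defaultdict

def daily_positive_share(tweets, keyword):
    """Share of positive predictions per day among tweets mentioning keyword."""
    counts = defaultdict(lambda: [0, 0])  # day -> [positive count, total count]
    for day, text, is_positive in tweets:
        if keyword.lower() in text.lower():
            counts[day][0] += is_positive
            counts[day][1] += 1
    return {day: pos / total for day, (pos, total) in sorted(counts.items())}

# Hypothetical stream: (date, text, predicted label), assuming 1 = positive.
tweets = [
    ("2024-01-01", "Nordea app is down again", 0),
    ("2024-01-01", "Thanks Nordea for the quick help", 1),
    ("2024-01-02", "Kone elevators everywhere", 1),
    ("2024-01-02", "Loving the new Nordea app", 1),
]
shares = daily_positive_share(tweets, "Nordea")
print(shares)
```

Comparing these daily shares before and after a company announcement would give a first, rough picture of how opinion moved.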
Live and prosper!🙂