Does pre-processing affect the correlation indicator between Twitter message volume and stock market trading volume?

Joanna Michalak



Motivation: A growing number of authors empirically test the relationship between the volume of tweets and stock market indicators. The patterns extracted from Twitter most often take the form of time series that represent users' activity at different levels of granularity (moods, emotions, relevant topics or query-related messages). Sentiment analysis is a technique used to transform text data into information on mood and related behavioral categories, and supervised machine learning is the most commonly used approach to it. The results of an empirical analysis of the relationship between social media and the stock market therefore depend on the quality of the classification results, in which the quality of the features used to train the classifier plays a key role. The feature space is modified by various data pre-processing scenarios that aim to increase classification accuracy. The impact of data pre-processing on classification quality is often discussed in the literature, but very few authors discuss its impact on the correlation indicator between Twitter and the stock market.

Aim: To analyze the impact of tweet pre-processing on the Pearson correlation indicator between the mood of Twitter users and stock market trading volume.
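As a rough illustration of the indicator named in the aim, the following minimal sketch computes the Pearson correlation between two daily series. The function name `pearson_r` and the sample values are assumptions made for illustration; they are not data or code from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each series.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts, invented for illustration only.
negative_tweets = [120, 95, 143, 180, 110]
trading_volume = [1.9e6, 1.6e6, 2.2e6, 2.8e6, 1.8e6]

r = pearson_r(negative_tweets, trading_volume)
print(f"Pearson r = {r:.3f}")
```

In such a setup, each pre-processing scenario would yield its own tweet series (for example a different count of tweets classified as negative per day), and the coefficient would be recomputed per scenario against the same trading volume series.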

Results: The correlation between stock market trading volume and the volume of tweets was empirically confirmed. An effect of pre-processing on the correlation index was noted for the variables ‘all_tweets’ and ‘negative_tweets’, because the training set contains a significant number of tweets with negation. However, the results are not conclusive: the differences between the Pearson correlation index calculated for scenario one and scenario four are not significant. They nevertheless indicate that noisy data may reduce the quality and precision of conclusions, especially when a certain category of noise is repeated frequently.
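To make the role of negation concrete, here is a minimal sketch of how a pre-processing step can change the features a classifier sees. The abstract does not define the scenarios, so the cleaning steps, the `_NEG` suffix convention, and the names `preprocess`, `NEGATIONS` and `PUNCT` are all assumptions chosen for illustration, not the author's procedure.

```python
import re

# Hypothetical, deliberately small list of negation cues.
NEGATIONS = {"not", "no", "never", "cannot"}
PUNCT = set(".,!?;")


def preprocess(tweet, mark_negation=False):
    """Tokenize a tweet; optionally tag tokens that follow a negation cue."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("n't", " not")         # crude expansion of "don't", "isn't"
    tokens = re.findall(r"[a-z]+|[.,!?;]", text)

    out, negated = [], False
    for tok in tokens:
        if tok in PUNCT:
            negated = False  # the negation scope ends at punctuation
            continue
        out.append(tok + "_NEG" if (mark_negation and negated) else tok)
        if tok in NEGATIONS:
            negated = True
    return out


tweet = "@bob I don't like this stock! http://t.co/x Buy now"
print(preprocess(tweet))
print(preprocess(tweet, mark_negation=True))
```

With negation marking switched on, "like" and "like_NEG" become distinct features, so a tweet set rich in negation can be classified differently from one scenario to another, which in turn shifts a series such as ‘negative_tweets’.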


Keywords: Twitter sentiment analysis; behavioral economy; data mining

ISSN 1898-2255 (print)
ISSN 2392-1625 (online)
