Polish sentiment analysis that works

Sentiment analysis of natural language is a classic problem in machine learning, and there are plenty of good solutions for English. Unfortunately, other languages are greatly neglected by researchers. Compared to English, Polish lacks high-quality training data. Moreover, algorithms developed for English perform poorly when applied directly to morphologically rich languages.

Our goal is to advance research on natural language processing for Polish and other European languages. We started with language modelling, a basic building block of efficient NLP models. Our model is based on Universal Language Model Fine-tuning (ULMFiT) by J. Howard and S. Ruder and uses subword tokenisation to better handle the rich inflection of Polish. We won first place in the PolEval 2018 competition.
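To illustrate why subword tokenisation suits an inflected language, here is a minimal sketch that splits several forms of the Polish adjective "dobry" into a shared stem plus an ending. It is a toy only: the `segment` function and the hand-written `VOCAB` are hypothetical, and greedy longest-match is a stand-in for illustration, not the tokeniser used in our model, where the subword vocabulary is learned from data.

```python
from typing import List, Set


def segment(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match split of a word into known subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a piece matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical hand-written vocabulary: one stem and a few case endings.
VOCAB = {"dobr", "y", "a", "e", "ego", "emu", "ym"}

if __name__ == "__main__":
    for form in ["dobry", "dobra", "dobrego", "dobremu", "dobrym"]:
        print(form, segment(form, VOCAB))
```

With word-level tokens, each of these five forms would need its own vocabulary entry. With subwords, they share the piece "dobr", so what the model learns about one form carries over to the others.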

Of course, the language model for Polish is just the beginning. We are testing its performance as a backbone for sentiment analysis, and the preliminary results look promising. Moreover, we have shown that our approach can be successfully applied to German (see sentiment analysis for German).