English version and some details below the image keep scrolling!

Lista tagów zawartych w tweetach zebranych w okresie wakacji (Czerwiec – Sierpień) w roku 2017. Tweety były filtrowane słowem ‘polityka’.

Clustering twitter data is not easy. I know that now :).  A single tweet is mostly a noise. So at the end you will get a sparse matrix of corpus with density far below 1%.  Simply if all corpus contains around 50 000 words (dimensions) but a single tweet has only 10 words (and probably only one or two has some meaning). Lots of noise and sparsity. And finally clustering like k-means put all tweets to one big cluster.

So I decided to change approach and try get something different from collected data.

On my first attempt was to try make words cloud with hashtags.

To do it I used nice python module:

  • website: http://amueller.github.io/word_cloud/
  • blog: http://peekaboo-vision.blogspot.co.uk/2012/11/a-wordcloud-in-python.html

This image was done on twitter data collected during June, July and August in 2017. The data was filter by polish word ‘polityka’.