October 2017 – Szymon Lis Portfolio

More details below the image, keep scrolling!

A list of hashtags found in tweets collected over the summer holidays (June to August) of 2017. The tweets were filtered by the word ‘polityka’.

Clustering Twitter data is not easy. I know that now :). A single tweet is mostly noise, so you end up with a sparse document-term matrix whose density is far below 1%. Put simply: the whole corpus contains around 50 000 distinct words (dimensions), but a single tweet has only about 10 words (and probably only one or two of them carry any meaning), so each row fills roughly 10 / 50 000 = 0.02% of its columns. Lots of noise and sparsity. In the end, a clustering algorithm like k-means puts all the tweets into one big cluster.
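To get a feel for how sparse such a matrix is, here is a minimal sketch in plain Python. The function name `corpus_density` and the three toy tweets are made up for illustration, and splitting on whitespace is a simplifying assumption, not the tokenisation used for the real data.

```python
def corpus_density(tweets):
    """Return (n_tweets, vocab_size, density) for a binary bag-of-words matrix.

    Density is the share of non-zero cells: each tweet contributes one
    non-zero cell per distinct word it contains.
    """
    # Naive tokenisation (assumption): lowercase and split on whitespace.
    tokenised = [set(text.lower().split()) for text in tweets]
    vocabulary = set().union(*tokenised)
    nonzero = sum(len(tokens) for tokens in tokenised)
    cells = len(tokenised) * len(vocabulary)
    density = nonzero / cells if cells else 0.0
    return len(tokenised), len(vocabulary), density


# Toy data, invented for this example.
toy_tweets = [
    "polityka to nie sport",
    "sport to zdrowie",
    "wakacje nad morzem",
]
n_tweets, vocab_size, density = corpus_density(toy_tweets)
print(f"{n_tweets} tweets, {vocab_size} words, density {density:.1%}")

# Back-of-envelope figure from the text: about 10 words per tweet
# against a vocabulary of about 50 000 words.
row_density = 10 / 50_000
print(f"estimated density for the real corpus: {row_density:.2%}")
```

The toy corpus is still fairly dense because it is tiny. With a vocabulary of tens of thousands of words the same ratio collapses, and with so few shared words between tweets, distance-based methods such as k-means have very little to separate them by.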

So I decided to change my approach and try to get something different out of the collected data.

My first attempt was to make a word cloud from the hashtags.

To do it I used a nice Python module:

website: http://amueller.github.io/word_cloud/
blog: http://peekaboo-vision.blogspot.co.uk/2012/11/a-wordcloud-in-python.html
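A rough sketch of how hashtags could be counted and passed to that module is shown below. It is not the exact code used for the image: the helper names `hashtag_frequencies` and `render_cloud`, the sample tweets, the output file name and the image size are all assumptions. Only `WordCloud`, `generate_from_frequencies` and `to_file` come from the word_cloud package.

```python
import re
from collections import Counter

# A hashtag is '#' followed by word characters; in Python 3, \w also
# matches Polish letters such as 'ą' or 'ł'.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_frequencies(tweets):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def render_cloud(frequencies, path="hashtags.png"):
    """Draw a word cloud from a {word: count} mapping and save it as an image."""
    # Third-party dependency, imported here so the counting part
    # works without it: pip install wordcloud
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Sample tweets, invented for this example.
sample = [
    "Wakacje i #polityka nie idą w parze #Sejm",
    "Znowu #polityka w #sejm",
    "Tweet bez tagów",
]
freqs = hashtag_frequencies(sample)
print(freqs.most_common())

# Uncomment to write the image (requires the wordcloud package):
# render_cloud(freqs)
```

Feeding the module counts instead of raw text keeps each hashtag as a single token and lets the font size follow how often a tag was used.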

This image was made from Twitter data collected during June, July and August 2017. The data was filtered by the Polish word ‘polityka’.


Word Cloud