There are two primary sources of code (both written in Python): twitterscraper by Ahmet Taspinar and zhihu_spider by Liu Ruoyu. The code can be downloaded from GitHub through the links.
Text Analysis
The code in this part is written in R.
Build Corpus
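The original script is not shown here; as a minimal sketch, assuming the scraped tweets have been saved to a CSV file with a "text" column (both the file name and column name are assumptions), a corpus can be built with the tm package:

library(tm)

# Hypothetical input: a CSV of scraped tweets with a "text" column
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)

# Build a corpus where each tweet is one document
corpus <- VCorpus(VectorSource(tweets$text))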
Clean The Corpus
Several things need to be done before we can run the analysis:
Convert all text to lower case
Remove punctuation
Remove common English words (stopwords)
Transform words to their root form (stemming)
Take out URLs (since these are tweets, there are a lot of "https" links)
Strip the extra spaces left by the previous steps
Using the tweets as an example, the code should look roughly like the sketch below:
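This is a minimal sketch using the tm and SnowballC packages; the exact function names and ordering are assumptions, not the original script.

library(tm)
library(SnowballC)

# Remove URLs first (tweets contain many "https" links)
removeURL <- content_transformer(function(x) gsub("http\\S+", "", x))
corpus <- tm_map(corpus, removeURL)

corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # common English words
corpus <- tm_map(corpus, stemDocument)                       # reduce to root words
corpus <- tm_map(corpus, stripWhitespace)                    # clean leftover spaces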
Make Term Document Matrix
Check the frequency of each word, then keep only the words that appear more than 7 times, as sketched below:
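A minimal sketch, assuming the cleaned corpus from the previous step:

# Term-document matrix from the cleaned corpus
tdm <- TermDocumentMatrix(corpus)
m   <- as.matrix(tdm)

# Word frequencies, sorted from most to least common
freq <- sort(rowSums(m), decreasing = TRUE)

# Keep only words that appear more than 7 times
freq <- freq[freq > 7]
head(freq)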
Generate a Wordcloud
R packages: typically wordcloud, with RColorBrewer for the color palette.
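A minimal sketch, assuming the frequency table freq built above; the palette choice and min.freq cutoff are illustrative assumptions:

library(wordcloud)
library(RColorBrewer)

set.seed(123)  # reproducible layout
wordcloud(words = names(freq), freq = freq,
          min.freq = 7,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))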
Another way of making a word cloud is to use chartjs-chart-wordcloud by Samuel Gratzl, which is written in TypeScript and JavaScript.
Sentiment Analysis
Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
In R, we can use the syuzhet package:
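A minimal sketch, assuming the raw tweet text is in tweets$text as above; get_nrc_sentiment scores each text on eight NRC emotions plus positive/negative.

library(syuzhet)

# Score each tweet on the NRC emotion and polarity categories
sentiments <- get_nrc_sentiment(tweets$text)

# Overall counts per category across all tweets
colSums(sentiments)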
There is also a good Python package available on GitHub called pattern, from the CLiPS Computational Linguistics Research Group.