Web Scraper + Text Analysis + Sentiment Analysis

Web Scraper

The scraping code comes from two projects, both written in Python: twitterscraper by Ahmet Taspinar and zhihu_spider by Liu Ruoyu. Both can be downloaded from GitHub.

Text Analysis

Code in this part is written in R.

Build Corpus

install.packages("tm")
install.packages("SnowballC")
library("tm")
library("SnowballC")
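
The cleaning steps further down operate on a corpus object called TweetCorpus. As a minimal sketch of how such an object might be built, assuming the scraped tweet text is held in a character vector (the `tweets` vector here is made-up sample data, not scraper output):

```r
library(tm)

# Hypothetical sample data; in practice this would be the text field
# of the scraper output.
tweets <- c("Thank you Michigan! https://t.co/abc123",
            "Great town hall today with 200 voters.")

# VectorSource treats each element of the vector as one document.
TweetCorpus <- VCorpus(VectorSource(tweets))

length(TweetCorpus)              # number of documents
as.character(TweetCorpus[[1]])   # text of the first document
```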


Clean The Corpus

Several things need to be done before we can run the analysis:

  • Convert all text to lower case
  • Remove numbers and punctuation
  • Remove common English words (stop words)
  • Reduce words to their root form (stemming)
  • Remove links (since these are tweets, many of them contain https URLs)
  • Strip the extra whitespace left behind by the previous steps

Using tweets as an example, the code should look like:

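# Remove numbers, punctuation, and extra whitespace
TweetCorpus <- tm_map(TweetCorpus, removeNumbers)
TweetCorpus <- tm_map(TweetCorpus, removePunctuation)
TweetCorpus <- tm_map(TweetCorpus, stripWhitespace)
# Lower-case, then drop English stop words and stem to root words
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower))
TweetCorpus <- tm_map(TweetCorpus, removeWords, stopwords("english"))
TweetCorpus <- tm_map(TweetCorpus, stemDocument)
# Remove links; punctuation is already gone, so each link is one "http..." token
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
TweetCorpus <- tm_map(TweetCorpus, content_transformer(removeURL))
# Strip the whitespace left behind by the removals
TweetCorpus <- tm_map(TweetCorpus, stripWhitespace)
inspect(TweetCorpus[1:5])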
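
The order of the steps matters for the link removal. The pattern `http[[:alnum:]]*` only matches letters and digits, so it relies on punctuation having been stripped first. A small base R sketch on a made-up tweet, using a plain `gsub` on `[[:punct:]]` as a stand-in for the punctuation step:

```r
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

tweet <- "Thank you all! https://t.co/abc123"   # made-up example

# Applied to the raw text, the match stops at the first ':' and
# leaves the rest of the link behind.
raw_result <- removeURL(tweet)

# Once punctuation is stripped, the link is a single alphanumeric
# token, so the same pattern removes all of it.
no_punct <- gsub("[[:punct:]]", "", tweet)
clean_result <- removeURL(no_punct)

print(raw_result)
print(clean_result)
```

If links should be removed before punctuation instead, the pattern would need to be widened to cover the characters that appear in a URL.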


Make Term Document Matrix

Tweetdm <- TermDocumentMatrix(TweetCorpus)
Tweetdm <- as.matrix(Tweetdm)
Tweetdm[1:10, 1:20]
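
To make the structure concrete: a term-document matrix has one row per term and one column per document, and each cell holds how often that term occurs in that document. The base R sketch below builds one by hand for three made-up documents. It only illustrates the idea and is not how tm implements it.

```r
# Made-up, already-cleaned documents.
docs <- c("good morning michigan", "good night", "michigan vote vote")

# Split each document into words and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Count every term in every document: rows are terms, columns are documents.
tdm <- vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = terms))),
  integer(length(terms))
)
dimnames(tdm) <- list(terms, paste0("doc", seq_along(docs)))

tdm
```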

Compute the frequency of each word, then keep only the words that appear at least 5 times:

eachword <- rowSums(Tweetdm)
eachword
subofeach <- subset(eachword, eachword >= 5)
subofeach
# Bar plot with the word labels drawn vertically
barplot(subofeach, las = 2)
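
The bar plot is usually easier to read when the bars are ordered by frequency. A small sketch of the filter plus a sort, using made-up counts in place of the real row sums:

```r
# Made-up term counts standing in for rowSums(Tweetdm).
eachword <- c(today = 3, vote = 12, bus = 1, michigan = 5, thank = 9)

# Keep terms seen at least 5 times, most frequent first.
frequent <- sort(eachword[eachword >= 5], decreasing = TRUE)
frequent
```

The sorted vector can be passed to `barplot` in the same way as before.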


Generate a Wordcloud

R packages:

install.packages("wordcloud")
install.packages("RColorBrewer")
library(RColorBrewer)
library(wordcloud)


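# Word frequencies, most frequent first
TweetCloud <- sort(rowSums(Tweetdm), decreasing = TRUE)
set.seed(123)  # makes the random word placement reproducible
# Plot at most 30 words, skipping any that appear fewer than 5 times
wordcloud(words = names(TweetCloud),
          freq = TweetCloud,
          min.freq = 5,
          max.words = 30,
          colors = brewer.pal(8, "Dark2"))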

Another way to make a word cloud is chartjs-chart-wordcloud by Samuel Gratzl, which is written in TypeScript and JavaScript.


Sentiment Analysis

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

In R, we can use the syuzhet package:

install.packages("syuzhet")
library(syuzhet)
# Slotkinonly is a character vector holding the original (uncleaned) tweet text
scores <- get_nrc_sentiment(Slotkinonly)
head(scores)

Slotkinonly[4]
get_nrc_sentiment('thank')
get_nrc_sentiment('president')

# Using the original tweets, sum each sentiment column and plot the totals as a bar chart
barplot(colSums(scores),
        las = 2,
        col = rainbow(10),
        ylab = "counts",
        main = "Sentiment Score for Paul Junge Tweets")
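
Lexicon-based scoring of this kind looks each word up in a dictionary that maps words to categories and counts the hits per category. The base R sketch below shows the idea with a tiny lexicon invented for illustration. It is not the NRC lexicon and not syuzhet's implementation, and `score_text` is a hypothetical helper name.

```r
# Toy lexicon, invented for illustration; the real NRC lexicon is far larger
# and also covers eight emotions.
lexicon <- data.frame(
  word     = c("thank", "great", "proud", "fear", "fail"),
  category = c("positive", "positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count lexicon hits per category in one text.
score_text <- function(text) {
  cleaned <- tolower(gsub("[[:punct:]]", "", text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  hits <- lexicon$category[match(words, lexicon$word)]
  table(factor(hits, levels = c("negative", "positive")))
}

scores_toy <- score_text("Thank you, so proud of this great team. No fear!")
scores_toy
```

Words that are missing from the lexicon contribute nothing, which is why the coverage of the lexicon largely determines the quality of the scores.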

There is also a good Python package available on GitHub: pattern, from the Computational Linguistics Research Group.