Week0 Project
Summary:
We worked with data from Twitter (a global dataset and an African dataset) and created a repository on GitHub to host the code used to analyse it.
After extracting and cleaning the data, we loaded it into a pandas DataFrame.
We also added new features built on the data, particularly visualisations, to better understand it and to prepare it for appropriate modelling.
Then, we created a code base that connected the steps above with as much clarity as possible.
Using another code base, we put together the modules involved in data analysis, namely topic modelling and sentiment analysis.
Finally, we built a dashboard to showcase the models and plots generated from the data, and tied all the steps together in a single machine learning pipeline.
The aim was to better understand the US-China-Taiwan situation as it was happening and to make informed decisions based on the questions we are able to answer.
We also wanted to understand the limitations inherent in the data so that we could account for them, and work around them, in decision-making.
Report on Twitter Data
Insights
From both the Topic Modelling and the Sentiment Analysis of the global Twitter data, the Taiwan-China conflict appears to be widely discussed at the moment.
The content of the tweets is mostly not subjective, polarising or sensitive, which suggests that facts, and perhaps resources, are what is being shared on global Twitter at the moment.
Topic Modelling Insights
From the LDA model that was built, we found considerable overlap among the 5 main topics parsed from the Twitter data, with most of them mentioning Taiwan and/or China.
This leads us to conclude that the Taiwan-China conflict is a very hot topic at the moment.
The analysis of the results the LDA model gave us provides visual confirmation that the topics do overlap significantly, as shown below:
Topic0 concerns mostly: taiwan-china-military
Topic1 concerns mostly: taiwan-china
Topic2 concerns mostly: china-taiwan-ukraine
Topic3 concerns mostly: in-island-china-taiwan
Topic4 concerns mostly: taiwan-china-hotel
Sentiment Analysis Insights
-Using the sentiment classifier, we were able to predict the sentiment of a tweet by categorising its text into predefined categories.
-From the EDA, we see that most of the tweets contained information that was not subjective at all. (They may therefore be objective statements about the Taiwan-China conflict.)
-This is shown below:
-Most of the tweets also contained information that was not polarising at all, as shown below:
-We can also see, from the volume of tweets collected, that the Taiwan-China conflict is a hot topic all over the world.
-More of the tweets came from India, the USA, Hong Kong and Vietnam than from other places.
-This can be seen below:
Analysis Process
Topic Modelling
LDA Model
-The relevant Python libraries were imported before starting the modelling.
-They included:
-warnings
-matplotlib.pyplot
-seaborn
-gensim
-gensim.models
-pandas
-pprint
-string
-os
-re
-The raw dataset (in JSON) was then loaded.
-We then extracted the columns we would be using and created a pandas DataFrame ready for preprocessing.
-A sample of the dataframe looks as follows:
-During preprocessing, the tweets are converted to lists of words for feature engineering, and a corpus is created that contains each word's id and its frequency in each document.
-A sample of the corpus is shown below:
-We then build a Latent Dirichlet Allocation (LDA) Model to do the topic modelling.
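The preprocessing step described above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the sample tweets and the `tokenise` and `build_corpus` helpers are hypothetical. The (word id, frequency) pairs it produces have the same bag-of-words shape that gensim's `Dictionary.doc2bow` returns, which is the input an LDA model expects.

```python
import re
from collections import Counter


def tokenise(tweet):
    """Lower-case a tweet, drop URLs and split it into alphabetic words."""
    no_urls = re.sub(r"https?://\S+", " ", tweet.lower())
    return re.findall(r"[a-z]+", no_urls)


def build_corpus(tweets):
    """Return (word_to_id, corpus).

    Each document in the corpus is a sorted list of (word id, frequency) pairs.
    """
    word_to_id = {}
    corpus = []
    for tweet in tweets:
        counts = Counter(tokenise(tweet))
        for word in counts:
            # Ids are assigned in order of first appearance
            # (an arbitrary choice made for this sketch).
            word_to_id.setdefault(word, len(word_to_id))
        corpus.append(sorted((word_to_id[w], n) for w, n in counts.items()))
    return word_to_id, corpus


# Hypothetical sample tweets, used only to show the output shape.
sample_tweets = [
    "Taiwan and China: military drills near Taiwan https://example.com/a",
    "China responds to Taiwan visit",
]
word_to_id, corpus = build_corpus(sample_tweets)
print(corpus[0])  # [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
```

In the first document, "taiwan" has id 0 and appears twice, so its entry is (0, 2).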
Model Analysis
-We also measure the perplexity of the model to see how well it predicts the sample.
-This is calculated on the LDA model we just built.
-We got a coherence score of 0.58 from the model, which suggests it performed reasonably well at topic modelling.
-A sample of the results is shown below:
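As background on the perplexity measure mentioned in this section: perplexity is conventionally the exponential of the negative average log-likelihood per word, so a lower value means the model is less "surprised" by the text. The function below is a generic sketch of that definition, with a hypothetical name and toy inputs. It is not how the scores in this report were produced, and gensim's own `log_perplexity` method returns a per-word likelihood bound rather than this value directly.

```python
import math


def perplexity(word_log_probs):
    """Perplexity = exp(-(mean natural-log probability per word))."""
    if not word_log_probs:
        raise ValueError("need at least one word log-probability")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# Toy check: a model that spreads probability evenly over 4 words
# has a perplexity of 4 on any text drawn from those words.
uniform_over_four = [math.log(0.25)] * 10
print(round(perplexity(uniform_over_four), 6))  # 4.0
```

Perplexity and coherence answer different questions: the first measures how well the model predicts words, the second how well the top words of each topic hang together.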
Sentiment Analysis
Building a Sentiment Classifier using Scikit-Learn
-We build a classifier of tweets taken during the Taiwan-China Conflict.
-We first import the required libraries to understand the data:
-warnings
-gensim
-gensim.models
-pprint
-string
-numpy
-pandas
-re
-nltk
-matplotlib
-seaborn
-os
-We then load and read the data using a dataloader class.
-We get the data in the following format:
Creating a basic Exploratory Data Analysis (EDA)
-We first plot the parameters of the dataframe.
-Getting the number of tweets for each subjectivity value gives the following pie chart:
-Plotting the number of tweets for each subjectivity value gives the following pie chart:
-Plotting the number of tweets per place recorded gives the following pie chart:
-The distribution of possibly sensitive tweets among all the tweets is shown below:
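The counting behind charts like these can be sketched as follows. This is an illustration under assumptions: the polarity scores are made-up values assumed to lie between -1 and 1, and `polarity_category` and `category_shares` are hypothetical helper names. The shares it returns are the values a pie chart would plot.

```python
from collections import Counter


def polarity_category(score):
    """Map a polarity score to one of three predefined categories."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def category_shares(scores):
    """Return each category's share of the tweets, as a percentage."""
    counts = Counter(polarity_category(s) for s in scores)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up polarity scores, one per tweet.
sample_scores = [0.0, 0.0, 0.5, -0.25, 0.0, 0.1, 0.0, 0.0]
print(category_shares(sample_scores))
# {'neutral': 62.5, 'positive': 25.0, 'negative': 12.5}
```

With matplotlib, the dictionary's values could be passed to `plt.pie` with its keys as the labels.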
Progress Made
●Created a repository on GitHub to house the data and its dependencies
●Utilised GitHub Projects and Issues to outline the ongoing work
●Extracted the data from the Twitter global_data dataset
●Partially cleaned the extracted data
●Conducted topic modelling
●Conducted sentiment analysis
●Analysed the models to see how effective the results they gave us were
●Understood the general sentiment of the tweets surrounding the Taiwan-China Conflict
●Created a database to store the dataframe extracted from the tweets
Challenges Faced
-When conducting the sentiment analysis on the Twitter data, we could not produce a bar plot of the subjectivity and polarity of tweets against the places they originated from.
-This was because the Jupyter kernel took more than 10 minutes to evaluate the data.
-This is possibly due to the dataset being too large.
-Creating a Streamlit app.
-Connecting the Streamlit application to MySQL, which did not work despite the module being installed in the VS Code virtual environment, as shown below:
-Deploying the dashboard.
-Understanding how to use Docker from a Windows environment.
Future Work
-Create more aggregate columns on the DataFrame from the available data, so as to try out more exploratory analyses.
-Split the data and evaluate how well both models perform.
-Carry out more sentiment analysis using a combination of more and different columns.
-Carry out more topic modelling using a combination of more and different columns.
-Explore the different functions available in the libraries above to find out which models would be optimal in topic modelling as well as in sentiment analysis.
-Look into other ways of deploying apps onto the web.
-Explore dockerisation.
-Explore the Streamlit application creation process and how it connects to MySQL.
-Try to run the project on a Linux virtual machine to see if the dependencies are easier to resolve.