Project repository here.
Social network analysis is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures as nodes and edges: in this case, users and the tweets they direct at others. This technique, along with others such as natural language processing (NLP), can be leveraged to help understand communication and behavior in online contexts.
This project was developed for the Analysis of Complex Data course from the Master's in Data Science and Engineering at the Faculty of Engineering of the University of Porto.
The goal of this project was to explore unstructured social media data from Twitter - tweets - using social network analysis and natural language processing in order to uncover relationships between users and help identify important actors in the network, as well as gain insights about the topics being discussed and the overall sentiment surrounding them.
Using the TweetGrab module, two searches for tweets were performed: one using the keyword "Lula" and the second using the keyword "Bolsonaro". The data was retrieved, stored in a SQLite database, read into a pandas DataFrame, and then preprocessed. Preprocessing included cleaning the text itself, formatting datetimes, parsing the JSON structure, extracting hashtags, etc.
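To make the preprocessing steps more concrete, here is a minimal sketch of how a single raw tweet could be cleaned. It uses only the Python standard library so that it stays self-contained; the sample record, its field names (modelled on the classic Twitter API JSON) and the `preprocess` helper are assumptions for illustration, not the project's actual schema or code.

```python
import json
import re
from datetime import datetime

# Hypothetical raw record; field names are assumed, not taken from the project.
RAW = (
    '{"created_at": "Wed Oct 10 20:19:24 +0000 2018", '
    '"user": {"screen_name": "alice"}, '
    '"text": "RT @bob: Debate hoje! #Eleicoes2018 https://t.co/abc"}'
)

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def preprocess(raw_json):
    """Parse one tweet and return the fields used in later analysis."""
    tweet = json.loads(raw_json)
    # Drop URLs first so they are not mistaken for hashtags or mentions.
    no_urls = URL_RE.sub("", tweet["text"])
    # Build the cleaned text: no retweet marker, mentions or hashtags.
    clean = re.sub(r"^RT\s+", "", no_urls)
    clean = re.sub(r"[@#]\w+:?", "", clean)
    clean = re.sub(r"\s+", " ", clean).strip().lower()
    return {
        "user": tweet["user"]["screen_name"],
        # Assumed timestamp layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
        "created_at": datetime.strptime(
            tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
        ),
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(no_urls)],
        "mentions": MENTION_RE.findall(no_urls),
        "clean_text": clean,
    }


record = preprocess(RAW)
print(record["hashtags"], record["mentions"], record["clean_text"])
```

A list of such records can be passed directly to `pandas.DataFrame` if a tabular view is preferred.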
For the sake of simplicity, a relationship between two users was defined as one person being mentioned in another's tweet. This is a very basic model, but acceptable in the context of this work. The network analysis was then carried out with the NetworkX module for Python.
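The mention-based definition can be sketched as a graph in which each edge points from the author of a tweet to a user mentioned in it. The sample data is made up, and the choices to use a directed graph, to count repeated mentions as an edge weight, and to keep only the largest weakly connected component are assumptions rather than details confirmed by the project.

```python
import networkx as nx

# Hypothetical (author, mentioned users) pairs extracted from tweets.
tweets = [
    ("alice", ["bob", "carol"]),
    ("bob", ["carol"]),
    ("dave", ["carol"]),
    ("erin", ["frank"]),
    ("alice", ["bob"]),
]

graph = nx.DiGraph()
for author, mentions in tweets:
    for mentioned in mentions:
        if mentioned == author:
            continue  # ignore self-mentions
        if graph.has_edge(author, mentioned):
            # Repeated mentions strengthen the existing edge.
            graph[author][mentioned]["weight"] += 1
        else:
            graph.add_edge(author, mentioned, weight=1)

# Keep the largest group of users connected to each other,
# ignoring edge direction when deciding who is connected.
largest_nodes = max(nx.weakly_connected_components(graph), key=len)
largest_subgraph = graph.subgraph(largest_nodes).copy()

print(sorted(largest_subgraph.nodes))
```

Restricting the analysis to one connected component keeps distance-based metrics such as closeness comparable across nodes.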
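Some network metrics, such as degree, closeness and betweenness centrality, were calculated in order to identify the central nodes.

    from operator import itemgetter
    import networkx as nx

    # Degree centrality: fraction of other nodes each node is connected to.
    graph_centrality = nx.degree_centrality(largest_subgraph)
    max_de = max(graph_centrality.items(), key=itemgetter(1))
    # Closeness centrality: how close a node is to the nodes it is connected to.
    graph_closeness = nx.closeness_centrality(largest_subgraph)
    max_clo = max(graph_closeness.items(), key=itemgetter(1))
    # Betweenness centrality: share of shortest paths passing through a node.
    graph_betweenness = nx.betweenness_centrality(largest_subgraph, normalized=True, endpoints=False)
    max_bet = max(graph_betweenness.items(), key=itemgetter(1))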
To find the sentiment embedded in the text, a classifier was trained on a dataset containing 300,000 tweets in Portuguese with binary sentiment labels (pos, neg).
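The write-up does not say which classifier was used. As a stand-in, the sketch below shows the general idea with a small multinomial Naive Bayes model written in plain Python; the four training sentences and the `tokenize`, `train` and `predict` helpers are hypothetical and far too small to represent the real 300,000-tweet dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def train(samples):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)  # class prior
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = word_counts[label][token] + 1
                score += math.log(count / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled tweets in Portuguese.
samples = [
    ("adorei o discurso, muito bom", "pos"),
    ("que dia feliz e ótimo", "pos"),
    ("péssimo debate, muito ruim", "neg"),
    ("odiei, que coisa horrível", "neg"),
]
model = train(samples)
print(predict(model, "discurso ótimo"), predict(model, "debate horrível"))
```

On a dataset of realistic size, a library implementation with proper train/test evaluation would be the more sensible choice.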
"A relationship was defined as one person being mentioned in another's tweet."
There are several other - more complex and often better - ways to model online conversations. This was supposed to be a simple study that could illustrate the potential of NLP and social network analysis.
In spite of these simplifications, we were able to identify central nodes in both networks and take a peek at the topics flowing through them.
There are many ways in which this work can be expanded. First, one could use a different definition for interaction between users, e.g. follower-followee, retweets, replies, etc.
The NetworkX documentation itself recommends dedicated tools over the library's own basic drawing functions for network visualization. A tool such as Gephi, Graphviz or Cytoscape could therefore be used for this purpose.
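One possible way to hand the network over to such a tool is to export it to a file format the tool can read. The sketch below writes a small made-up graph to GEXF, a format Gephi can open, using NetworkX's `write_gexf`; the graph contents, the attribute name and the output path are assumptions for illustration.

```python
import os
import tempfile

import networkx as nx

# Hypothetical mention graph standing in for the real network.
graph = nx.DiGraph()
graph.add_edge("alice", "bob", weight=2)
graph.add_edge("bob", "carol", weight=1)

# Store a centrality score on each node so the visualization tool
# can use it, for example to scale node sizes.
nx.set_node_attributes(graph, nx.degree_centrality(graph), "degree_centrality")

# Write the graph to a GEXF file in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "mentions.gexf")
nx.write_gexf(graph, out_path)

# Read the file back as a quick sanity check of the export.
reloaded = nx.read_gexf(out_path)
print(sorted(reloaded.nodes), reloaded.number_of_edges())
```

Layout, coloring and filtering can then be done interactively in the visualization tool instead of in code.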
Also, the tweet collection period could be extended, and the number of tweets analyzed could be increased.
Photographs by Unsplash.