Discourse and sentiment analysis using social networking websites
State of the Nation Program
Esteban Durán Monge
CBS-UT Data Camp 2016
The experience and challenges in the State of the Nation Program
-Data: population census, agricultural census, employment surveys, administrative records, science and technology indicators, exports, patents...
-No "Big Data" data sets
-Big Data and Sustainable Human Development
-Experience: text mining
Party manifesto analysis using text mining
[Figure: Number of references by topic in manifestos, by political party, 2014. Parties: FA, ML, PAC, PASE, PLN, PUSC, RC. Topics: Poverty and inequality, Productivity and employment, Politics, Environment, Fiscal. Legend values: 30 and 953.]
Social Network Analysis
-Relationship analysis
-Edge density
-Presence of topics in parties' proposals
[Figure: network graph; recovered topic labels: Productivity and employment, Fiscal.]
Data camp project proposal: Discourse and sentiment analysis using social networking websites
Data levels: Government, Opinion makers, Citizens
-Monitor the speech and sentiments of the government and opinion makers
-Monitor people's sentiments and reactions to the discourse
-Real-time data
-Data source: Facebook and Twitter
-Text mining
Research questions
-What are the main topics included in the political discourse of the government and opinion makers?
-What is the attitude of these actors to specific issues?
-What are the trends in people's reactions to sensitive topics for the country?
-Is it possible to use information from social networks to analyze the tones of political discourse over time?
Methods and data
Text mining with R: Rfacebook, SocialMediaLab, tm, SnowballC, ggplot2, wordcloud
Data source:
-President's Facebook page
-Collection of 3519 posts for the 2012-2016 period
Data set structure
Variables:
-Id
-User name
-Message
-Creation time
-Type
-Link
-Likes count
-Comments count
-Shares count
-Reactions
Gathering and processing text data
1. Access the Facebook API: Rfacebook
2. Subset the data needed for the analysis
3. Clean the data (remove punctuation, special characters, extra white space and numbers; convert to lowercase)
4. Word stemming: collapse words to a common root to aid vocabulary comparison
5. Transform the data into a term-document matrix
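Steps 3 to 5 above can be sketched in a few lines. The slides do this in R with tm and SnowballC; what follows is a plain-Python stand-in, not the authors' code. The sample posts, the function names, and the crude suffix-stripping `naive_stem` (a placeholder for a real Snowball stemmer) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample posts; the real data came from the Facebook API.
posts = [
    "Hoy inauguramos 2 nuevas rutas en San José!",
    "Más empleo y más inversión para las familias costarricenses.",
]

def clean(text):
    """Lowercase, then drop punctuation, digits and extra white space."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|\d|_", " ", text)  # punctuation, digits, underscores
    return re.sub(r"\s+", " ", text).strip()

def naive_stem(word):
    """Crude stand-in for a Snowball stemmer: strip a plural ending."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(naive_stem(w) for w in clean(d).split()) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

terms, tdm = term_document_matrix(posts)
```

Rows of `tdm` are terms and columns are posts, which mirrors the orientation of a term-document matrix. A real stemmer would handle far more endings than this placeholder does.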
First results and findings
Facebook page main trends
[Figure: Comments, shares and likes on the page over time, 2013-2017; vertical axis 0K to 14K. Annotations: "President assumed office"; "Popularity increase 3 months prior to elections".]
Text mining first results: wordcloud
-First glance
-Lots of spare words
-Issues with the standard tools available in R
Word frequency plot
[Figure: bar chart of the most frequent stemmed terms; frequency axis 0 to 1,500. Terms shown: má, costa, rica, paí, añ, nacion, gobierno, hoy, millon, proyecto, costarricens, persona, desarrollo, toda, cada, nuevo, trabajo, mejor, inversión, nueva, vamo, compromiso, esta, mañana, familia, dí, social, san, accion, gran, mujer, gracia, educación, obra, mayor, pública, infraestructura, ruta, alegrí, comunidad, zona, esfuerzo, cambio, centro, mucha, empleo, día, colon, ley, propuesta, semana, vez, con, política, pobreza, derecho, gent, primera, sector, seguridad.]
Word frequency plot
-Identify words with higher semantic charge
-Focus on important data
[Figure: the same bar chart of the most frequent stemmed terms; frequency axis 0 to 1,500.]
Creation of a reference dictionary
-Identify words with high semantic content
-Create a reference dictionary by topic
-Focus on specific information based on context
-Discover text and discourse meaning
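A reference dictionary by topic can be as simple as a mapping from topic to a set of terms, against which each post is tallied. The sketch below is only an illustration in Python: the topic names follow the slides, but the terms assigned to each topic and the `topic_counts` helper are hypothetical, not the dictionary actually used in the project.

```python
import re
from collections import Counter

# Hypothetical topic dictionary: term-to-topic assignments are illustrative only.
TOPIC_DICTIONARY = {
    "Poverty and inequality": {"pobreza", "desigualdad"},
    "Productivity and employment": {"empleo", "trabajo", "inversión"},
    "Environment": {"ambiente"},
    "Fiscal": {"impuestos", "fiscal"},
}

def topic_counts(text, dictionary=TOPIC_DICTIONARY):
    """Count how many tokens of `text` fall under each topic in `dictionary`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        for topic, words in dictionary.items():
            if token in words:
                counts[topic] += 1
    return counts

example = "Más empleo y más inversión para reducir la pobreza."
example_counts = topic_counts(example)
```

In practice the lookup would run on stemmed tokens, so the dictionary entries would need to be stemmed the same way as the posts.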
Political discourse by topic
First approach: using the political program dictionary
Topics: Poverty and inequality, Productivity and employment, Environment, Politics, Fiscal
-Main topics and priorities in the discourse: information flows
-Semantics at a general level: discourse intention
-Possibilities for improvement
Semantic sense and trends over time
[Figure: time-series panels, 2013-2017; vertical axes 0 to 15 and 0 to 10. Recovered panel labels: Fiscal, Social. Annotation: "Change".]
-Select specific tokens or combinations
-Analyse political discourse over time
-Are information flows weaker or stronger for some topics or tokens?
-Words versus actions
-Abstract versus policy
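Tracking a token over time amounts to grouping posts by period and counting occurrences. The following Python sketch assumes each post is a (creation time, message) pair with an ISO-format timestamp. The sample posts and the `yearly_token_counts` helper are made up for illustration and do not reproduce the figures in the slides.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (creation_time, message) pairs; timestamps and text are made up.
posts = [
    ("2014-05-08T10:00:00", "reforma fiscal y empleo"),
    ("2014-11-02T09:30:00", "empleo para las familias"),
    ("2015-03-15T18:45:00", "proyecto de reforma fiscal"),
]

def yearly_token_counts(posts, token):
    """Count occurrences of `token` per calendar year of the post timestamp."""
    per_year = defaultdict(int)
    for created, message in posts:
        year = datetime.fromisoformat(created).year
        per_year[year] += message.lower().split().count(token)
    return dict(sorted(per_year.items()))

fiscal_trend = yearly_token_counts(posts, "fiscal")
empleo_trend = yearly_token_counts(posts, "empleo")
```

Grouping by month instead of year only requires keying on `(year, month)`; the yearly version is used here to keep the example short.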
Further steps and challenges
-Create an optimized dictionary
-Use machine learning to build the dictionary through an automated procedure
-Sentiment and discourse analysis by topic, token or combination of tokens
-Extend this analysis to the other levels: citizens and opinion makers
-Create data visualizations to present results: combine all the information in a single dashboard