Topic models allow you to find topics and relationship among documents in large collections of texts. There is a (growing) number of approaches to explore and make statistical inferences about texts. This session will provide a hands-on introduction to text analysis and topic modelling using stm: An R package for Structural Topic Models.
stm is a comprehensive and highly regarded package to prepare, model and visualise textual data. The session will be mainly practical but I will also provide a short theoretical introduction to topic models and present some applications of
stm we can currently find in the literature. In this session, we will walk the length of a standard pipeline for textual analysis. We will explore different techniques and methods to import texts and associated metadata into R from a variety of sources such as PDFs, webpages, spreadsheets and APIs. We will prepare the data and clean the text using Hadley Wickham’s “tidy” approach. We will compute document term matrices and discuss the benefits of different weighting techniques. Finally, we will estimate, evaluate and visualise topic models to facilitate their interpretation and to communicate the result of the analysis.
Requirements for attendance
- Bring your own laptop;
- Preinstall R and RStudio;
- Make sure you have installed all the packages listed here;
- The day before the session, download the entire repository of the session materials from GitHub to have the most recent version of the scripts and data, uncompress the archive file
ws-201812-master.zip and open
ws-201812.Rproj with RStudio to load the project.
- Munksgaard, R., & Demant, J. (2016). Mixing politics and crime – The prevalence and decline of political discourse on the cryptomarket. International Journal of Drug Policy, 35, 77–83. doi.org/10.1016/j.drugpo.2016.04.021
- Roberts, M., Stewart, B., & Tingley, D. (forthcoming). stm: R Package for structural topic models. Journal of Statistical Software. doi.org/10.18637
- Robinson, J. S. D. (2017). Text mining with R: A tidy approach. Sebastopol, CA: O’Reilly. tidytextmining.com.
- Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245–265. doi.org/10.1080/19312458.2017.1387238
- Wickham, H., & Grolemund, G. (2017). R for data science. Sebastopol, CA: O’Reilly. r4ds.had.co.nz and tidyverse.org
Additional readings on text analysis
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY: Cambridge University Press. nlp.stanford.edu/IR-book
Data used in the workshop
9-10.30 Block 1: Preparing your text
10.30-11 Morning tea
11-12.30 Block 2: Model your text
Oriental Room S204, The Quadrangle, University of Sydney
||+61 2 8627 6895