Large text document collections are increasingly important in a variety of domains; examples of such collections include news articles, streaming social media, scientific research papers, and digitized literary documents. Existing methods for searching and exploring these collections focus on surface-level matches to user queries, ignoring higher-level thematic structure. Probabilistic topic modeling is a machine learning technique for finding themes that recur across a corpus, but there has been little work on how it can support end users in exploratory analysis. In this talk I will survey the topic modeling literature and describe our ongoing work on using topic models to support digital humanities research. In the second half of the talk, I will describe TopicViz, an interactive environment that combines traditional search and citation-graph exploration with a dust-and-magnet layout that links documents to the latent themes discovered by the topic model.
This work is in collaboration with:
Polo Chau, Jaegul Choo, Niki Kittur, Chang-Hyun Lee, Lauren Klein, Jarek Rossignac, Haesun Park, Eric P. Xing, and Tina Zhou
Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on social media analysis, discourse, and latent variable models. Jacob was a postdoctoral researcher at Carnegie Mellon University and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.