The optimize_topics.sh Bash Script

optimize_topics.sh is a script that runs under the bash shell on Unix-like operating systems. It automates multiple topic modeling runs of the MALLET package with the aim of assisting a user with finding an optimal number of topics for a given search term. MALLET is a machine-learning toolkit, released by the College of Information and Computer Science at the University of Massachusetts, Amherst, that deals with natural language; it can perform multiple functions, but the one that I'm concerned with is topic modeling, a process that aims to algorithmically identify topics in passages of natural-language texts. (If you would like a short introduction to what topic modeling does and how it works, there is a short list of suggested reading below.)

Topic modeling is a machine-reading process that allows a computer to identify bags of related words that tend to co-occur in the texts in question, and is an entirely unsupervised process once the texts have been imported into the program; that is to say, the program does not necessarily identify the same topics that a human scholar would identify in the text, and does not require that (or provide an opportunity for) humans to pre-identify topics for which it's searching. One of the parameters that a human does need to specify before MALLET starts searching for topics, however, is how many topics the algorithm will search for. MALLET is perfectly happy to search for as many topics as you would like it to within a particular text.

The question How many topics should the topic-modeling algorithm search for? is still a question that does not necessarily have a clearly specified answer. Though this choice has a number of implications for the results of the modeling run, it's hard to say in advance what the optimal number of topics to search for in a given text might be, and a common approach is to try runs with different numbers of topics to see what kinds of results obtain under different conditions. Lisa Rhody has given a quick explanation of this problem and the standard response to it:

The optimize_topics.sh script is an attempt to deal with this problem by brute-forcing my way through it: the script runs MALLET multiple times for each number of topics in a given (user-specifiable) range, saving the results of each run, then providing some summary statistics about each individual run that are intended to be a useful overview of which runs are likely to represent a given search term in the way that the user is looking for.

optimize_topics.sh does not attempt to determine what an optimum number of topics for a given search term actually means, but simply presents some summary statistics related to where the specified search term actually appears in the results of these runs. The statistics it outputs are, hopefully, helpful in providing a starting point for further investigation of this issue; however, they are actually rather limited at the moment. Theoretically, the script could give different types of information, but would need to be extended in order to do so.

Files and Downloading

You can download optimize_topics.sh from its SourceForge project page. You should probably read through this documentation at least once to determine whether this script is likely to be helpful to you. Anyone interested in collaborating is encouraged to contact me through the SourceForge project site.

Status and Limitations

This script is a quick hack that's intended to provide a first approach to a thorny practical problem. It's a less-than-ideal approach in a lot of ways, but may be useful for some people despite this. If you have suggestions for how it can better do its job, or would like to collaborate, please let me know by getting in touch with me through the SourceForge project page.

It is particularly worth noting that the script currently has the following limitations, some of which are rather substantial:

Installation

Download the optimize_topics.sh script from the SourceForge project page. You should save it in the directory that is the current working directory when you execute MALLET in your terminal. (For instance, I have installed MALLET in ~/bin/mallet, which is my current working directory when I execute bin/mallet in my terminal, so I would save optimize_topics.sh in ~/bin/mallet/.) Make it executable (chmod +x optimize_topics.sh). Once you've imported a set of texts into a .mallet file, you can then run it by typing ./optimize_topics.sh, with an appropriate set of command-line parameters, from that directory.

If you are not already familiar with the basic usage of MALLET, you may find it helpful to take a look at the MALLET quick start guide.

Usage

First, import a text with a command along the lines of bin/mallet --import-dir [something]. If you're dealing with multiple texts, you'll need to find a way to combine them into a single file first; from a Linux terminal, something along the lines of cat *txt > all-files.txt might work before you import all-files.txt into MALLET. Once you've created a .mallet file from your text(s), you can then run optimize_topics.sh by typing something like:

This will run MALLET 41 times over the file input.mallet, producing runs which search for 20 topics, 21 topics, 22 topics ... 60 topics. It will create a folder, based on the name of the input file, in the current directory (e.g., on my system, this is ~/bin/mallet/input in ~/bin/mallet, because optimize_topics.sh is located in ~/bin/mallet and the name of the input .mallet file is input.mallet). This folder will contain a series of subfolders, input/20/, input/21/, input/22/ ... input/60/. Each of these subfolders contains the results of a MALLET run for that number of topics. That is, the folder input/60/ contains the results of the topic modeling run over input.mallet that searches for 60 topics, each of which contains the files input_composition.txt, input_keys.txt, and input-state.bz that are generated by each MALLET run. (Again, if you're not familiar with basic usage of MALLET for topic modeling, you may find it helpful to look at the MALLET quick start guide.)

optimize_topics.sh then goes through the results of each topic modeling run, searching for the term specified with the -t parameter. It creates a spreadsheet (well, really, a tab-separated values file), input_results.tsv, that can be read by Excel or OpenOffice Calc or LibreOffice Calc. This file contains, for each each time the search term was found as one of the top 19 words in a topic run, the following information:

Command-line Arguments

Mandatory arguments

Optional parameters

A brief help message can be seen by typing ./optimize_topics.sh --help from the working directory in which optimize_topics.sh is located.

Feature Wishlist

There are a number of ways in which optimize_topics.sh is a quick and suboptimal hack, and a number of things that I would like it to do differently. It's unlikely that any of these will happen in the immediate future unless someone else finds the script helpful and want to contribute, though.

Version and Development History

The current version of optimize_topics.sh is revision 3. This version, and previous versions, can be found at the SourceForge project page.

Revision history

(Very) Selected Bibliography

Background Reading on Topic Modeling

Burton, Matt. The Joy of Topic Modeling. 21 May 2013. Web. 5 Nov. 2014. < http://mcburton.net/blog/joy-of-tm/ >

Goldstone, Andrew, and Ted Underwood. Quiet Transformations: A Topic Model of Literary Studies Journals. N.p., n.d. Web. 5 Nov. 2014. < http://www.rci.rutgers.edu/~ag978/quiet/ >

Jockers, Matthew L. The LDA Buffet Is Now Open; Or, Latent Dirichlet Allocation for English Majors. 29 Sept. 2011. Web. 5 Nov. 2014. < http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/ >

Templeton, Clay. Topic Modeling in the Humanities: An Overview. Maryland Institute for Technology in the Humanities 1 Aug. 2011. Web. 5 Nov. 2014. < http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/ >

Underwood, Ted. Topic Modeling Made Just Simple Enough. The Stone and the Shell 7 Apr. 2012. Web. 5 Nov. 2014.< http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/ >

Weingart, Scott. Topic Modeling for Humanists: A Guided Tour. the scottbot irregular. N.p., 25 July 2012. Web. 5 Nov. 2014. < http://www.scottbot.net/HIAL/?p=19113 >

Software Referenced

McCallum, Andrew Kachites. MALLET: A Machine Learning for Language Toolkit. N.p., 2002. Web. 3 Nov. 2014 < http://mallet.cs.umass.edu/ >

Mooney, Patrick. optimize_topics.sh. 29 Apr. 2015. Web. 30 Apr. 2015. < https://sourceforge.net/projects/optimize-topics/ >

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see < http://www.gnu.org/licenses/ >

The `optimize_topics.sh` Script

Overview

Files and Downloading

Status and Limitations

Installation

Usage

Command-line Arguments

Mandatory arguments

Optional parameters

Feature Wishlist

Version and Development History

Revision history

(Very) Selected Bibliography

Background Reading on Topic Modeling

Software Referenced

Copyright Notice

The optimize_topics.sh Script

Overview

Files and Downloading

Status and Limitations

Installation

Usage

Command-line Arguments

Mandatory arguments

Optional parameters

Feature Wishlist

Version and Development History

Revision history

(Very) Selected Bibliography

Background Reading on Topic Modeling

Software Referenced

Copyright Notice

The `optimize_topics.sh` Script