The `optimize_topics.sh` Script
===============================

Overview
--------

`optimize_topics.sh` is a script that runs under the
[`bash`](http://en.wikipedia.org/wiki/Bash_(Unix_shell)) shell on
Unix-like operating systems. It automates multiple topic modeling runs
of [the MALLET package](http://mallet.cs.umass.edu/download.php) with
the aim of assisting a user with finding an optimal number of topics for
a given search term. MALLET is a machine-learning toolkit, released by
the [College of Information and Computer
Science](https://www.cs.umass.edu/) at the University of Massachusetts,
Amherst, that deals with natural language; it can perform multiple
functions, but the one that I'm concerned with is topic modeling, a
process that aims to algorithmically identify “topics” in passages of
natural-language texts. (If you would like a short introduction to what
topic modeling does and how it works, there is [a short list of
suggested reading](#intro-reading) below.)

Topic modeling is a machine-reading process that allows a computer to
identify “bags” of related words that tend to co-occur in the texts in
question, and is an entirely unsupervised process once the texts have
been imported into the program; that is to say, the program does not
necessarily identify the same “topics” that a human scholar would
identify in the text, and does not require (or provide an opportunity
for) humans to pre-identify topics for which it's searching. One of
the parameters that a human *does* need to specify before MALLET starts
searching for topics, however, is how many topics the algorithm will
search for. MALLET is perfectly happy to search for as many topics as
you would like it to within a particular text.

The question “How many topics should the topic-modeling algorithm search
for?” still has no clearly
specified answer. Though this choice has a number of implications for
the results of the modeling run, it's hard to say in advance what the
optimal number of topics to search for in a given text might be, and a
common approach is to try runs with different numbers of topics to see
what kinds of results obtain under different conditions. Lisa Rhody has
given [a quick explanation of this
problem](http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/#topic-modeling-and-figurative-language-by-lisa-m-rhody-n-2)
and the standard response to it:

> The process of determining the number of topics to tell the model to
> use is not, as of yet, a standardized procedure. The measure for the
> “right” topic number is often derived through trial and error. After
> starting with one number (usually between 40 and 60) one determines
> how “actionable” and “coherent” the topics that the model produces
> are, adjusting up and down in subsequent iterations until there is
> agreement that the best model has been produced.

The `optimize_topics.sh` script is an attempt to deal with this problem
by brute-forcing my way through it: the script runs MALLET repeatedly,
once for each number of topics in a given (user-specifiable) range,
saving the results of each run, then providing some summary statistics
about each individual run that are intended to be a useful overview of
which runs are likely to represent a given search term in the way that
the user is looking for.
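
The brute-force idea can be sketched in a few lines of `bash`. This is a minimal illustration, not code taken from the script itself: the function name `plan_topic_runs` is hypothetical, and the sketch only prints the MALLET commands it would run (one per topic count) instead of executing them. The output file names follow the ones described later in this document; the `train-topics` options shown are standard MALLET options.

```bash
#!/usr/bin/env bash
# Sketch only: print one MALLET command per topic count (nothing is executed).
# The function name plan_topic_runs is an illustrative assumption.

plan_topic_runs() {
    local input=$1 lower=$2 upper=$3
    local base n
    # Strip the .mallet extension to get the name of the results folder.
    base=$(basename "$input" .mallet)
    for (( n = lower; n <= upper; n++ )); do
        printf 'bin/mallet train-topics --input %s --num-topics %d --output-topic-keys %s/%d/%s_keys.txt --output-doc-topics %s/%d/%s_composition.txt\n' \
            "$input" "$n" "$base" "$n" "$base" "$base" "$n" "$base"
    done
}

# Show the commands for a three-run range.
plan_topic_runs input.mallet 20 22
```

Printing the commands first makes it easy to check the range and the file names before committing to a long series of runs.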

`optimize_topics.sh` does not attempt to determine what “an optimum
number of topics for a given search term” actually means, but simply
presents some summary statistics related to where the specified search
term actually appears in the results of these runs. The statistics it
outputs are, hopefully, helpful in providing a starting point for
further investigation of this issue; however, they are actually rather
limited at the moment. Theoretically, the script could give different
types of information, but would need to be extended in order to do so.

Files and Downloading
---------------------

You can download `optimize_topics.sh` from its [SourceForge project
page](https://sourceforge.net/projects/optimize-topics/). You should
probably read through this documentation at least once to determine
whether this script is likely to be helpful to you. Anyone interested in
collaborating is encouraged to contact me through the SourceForge
project site.

This documentation file is available [from my personal
website](http://patrickbrianmooney.nfshost.com/~patrick/projects/optimize_topics/README.html)
and [from
SourceForge](http://sourceforge.net/projects/optimize-topics/files/README.html/download).
There is [a stub overview of the
project](http://patrickbrianmooney.nfshost.com/~patrick/projects/optimize_topics/)
on my personal website, but this file that you're reading has much more
information. This README document is also available as [plain
text](http://sourceforge.net/projects/optimize-topics/files/README.txt/download)
and
[markdown](http://sourceforge.net/projects/optimize-topics/files/README.markdown/download)
at the SourceForge project page.

Status and Limitations
----------------------

This script is a quick hack that's intended to provide a first approach
to a thorny practical problem. It's a less-than-ideal approach in a lot
of ways, but may be useful for some people despite this. If you have
suggestions for how it can better do its job, or would like to
collaborate, please let me know by getting in touch with me through [the
SourceForge project
page](https://sourceforge.net/projects/optimize-topics/).

It is particularly worth noting that the script currently has the
following limitations, some of which are rather substantial:

-   The script will only run under Linux and other Unix-like operating
    systems. (It has been developed and tested under Linux; I would love
    to hear about your experiences using it under BSD or OS X or other
    Unix-like operating systems.) In particular, it will probably never
    work under Windows (except perhaps under a compatibility layer such
    as [Cygwin](https://www.cygwin.com/)); much of the work of the
    script is actually done by standard Unix-like system utilities.
-   The script requires that you have the `bash` shell installed;
    `optimize_topics.sh` needs to be interpreted by this particular
    shell. (It is quite likely that your Unix-like operating system has
    `bash` installed, but not completely certain. Notably, some BSD
    systems do not install `bash` by default.) Again, if you
    are able to use the script under another shell or under other
    circumstances than those specified, I would love to hear about it.
-   The script does *not* deal directly with text (or HTML) files; it
    requires that you use MALLET to preprocess the files and create a
    .mallet file from them before you can use the script.
-   The current version of the script *only* handles .mallet files that
    are created from a *single* input text file; it cannot effectively
    handle .mallet files that are generated from multiple input files.
    *This is admittedly a serious limitation* to the script's usability,
    but is not a trivial problem to solve. I would like to say again
    that collaborators are welcome.
-   The script doesn't perform particularly sophisticated analysis of
    the results of individual topic runs; it just finds and spits out
    some basic statistics that are already available from the topic runs
    themselves. More sophisticated analysis is possible but hasn't been
    implemented yet.
-   This kind of brute-force approach is substantially less practical
    when analyzing large corpora, for which each run will take a long time.
    There are probably ways to ameliorate this, such as running the
    script in small increments and combining the intermediate results
    files. Another approach might be to run the script over a subset of
    the corpus in question in order to use those results as a starting
    point for the investigation of the full corpus. Other suggestions
    for improvement of these matters (or other matters) would be welcome.
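
The "small increments" idea from the last item in the list above could look something like the following sketch. It is only an illustration under stated assumptions: the function name `plan_increments` and the increment size are hypothetical, and the sketch prints the invocations without running them. This README does not say whether successive invocations overwrite the results file, so it is probably wise to move or rename the results of each increment before starting the next.

```bash
#!/usr/bin/env bash
# Sketch only: split a topic range into smaller increments and print one
# optimize_topics.sh invocation per increment (nothing is executed).
# The function name plan_increments and the step size are assumptions.

plan_increments() {
    local input=$1 term=$2 lower=$3 upper=$4 step=$5
    local start end
    for (( start = lower; start <= upper; start += step )); do
        end=$(( start + step - 1 ))
        # The last increment may be shorter than the others.
        (( end > upper )) && end=$upper
        printf './optimize_topics.sh -i %s -t "%s" -l %d -u %d\n' \
            "$input" "$term" "$start" "$end"
    done
}

# Break the range 20..60 into increments of ten topic counts each.
plan_increments input.mallet "target phrase" 20 60 10
```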

Installation
------------

[Download](https://sourceforge.net/projects/optimize-topics/files/) the
`optimize_topics.sh` script from [the SourceForge project
page](https://sourceforge.net/projects/optimize-topics/). You should
save it in the directory that is the current working directory when you
execute MALLET in your terminal. (For instance, I have installed MALLET
in `~/bin/mallet`, which is my current working directory when I execute
`bin/mallet` in my terminal, so I would save `optimize_topics.sh` in
`~/bin/mallet/`.) Make it executable (`chmod +x optimize_topics.sh`).
Once you've imported a set of texts into a .mallet file, you can then
run it by typing `./optimize_topics.sh`, with an appropriate set of
command-line parameters, from that directory.
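
A quick way to confirm that the pieces are in place is a small check along these lines. It is a sketch, not part of the script: the function name `check_setup` is hypothetical, and the `~/bin/mallet` location is just the example from the paragraph above, so substitute your own MALLET directory.

```bash
#!/usr/bin/env bash
# Sketch only: check that a MALLET directory looks ready for the script.
# The function name check_setup and the messages are assumptions.

check_setup() {
    local dir=$1 status=0
    # The MALLET launcher is expected at bin/mallet inside the directory.
    if [ ! -x "$dir/bin/mallet" ]; then
        echo "missing or not executable: $dir/bin/mallet" >&2
        status=1
    fi
    # The script should sit next to it and carry the executable bit.
    if [ ! -x "$dir/optimize_topics.sh" ]; then
        echo "missing or not executable: $dir/optimize_topics.sh (try chmod +x)" >&2
        status=1
    fi
    return "$status"
}

if check_setup "$HOME/bin/mallet"; then
    echo "ready"
else
    echo "not ready"
fi
```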

If you are not already familiar with the basic usage of MALLET, you may
find it helpful to take a look at [the MALLET quick start
guide](http://mallet.cs.umass.edu/quick-start.php).

Usage
-----

First, import a text with a command along the lines of
`bin/mallet import-dir [something]`. If you're dealing with multiple
texts, you'll need to find a way to combine them into a single file
first; from a Linux terminal, something along the lines of
`cat *txt > all-files.txt` might work before you import `all-files.txt`
into MALLET. Once you've created a .mallet file from your text(s), you
can then run `optimize_topics.sh` by typing something like:

> `./optimize_topics.sh -i input.mallet -t "target phrase" -l 20 -u 60`

This will run MALLET 41 times over the file `input.mallet`, producing
runs which search for 20 topics, 21 topics, 22 topics ... 60 topics. It
will create a folder, based on the name of the input file, in the
current directory (e.g., on my system, this is `~/bin/mallet/input` in
`~/bin/mallet`, because `optimize_topics.sh` is located in
`~/bin/mallet` and the name of the input .mallet file is
`input.mallet`). This folder will contain a series of subfolders,
`input/20/`, `input/21/`, `input/22/` ... `input/60/`. Each of these
subfolders contains the results of a MALLET run for that number of
topics. That is, the folder `input/60/` contains the results of the
topic modeling run over `input.mallet` that searches for 60 topics. Each
subfolder contains the files `input_composition.txt`, `input_keys.txt`,
and `input-state.bz` that are generated by each MALLET run. (Again, if
you're not familiar with basic usage of MALLET for topic modeling, you
may find it helpful to look at [the MALLET quick start
guide](http://mallet.cs.umass.edu/quick-start.php).)
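
Because the folder layout is this regular, it is easy to check which runs have produced output. The following is a minimal sketch under the layout described above; the function name `missing_runs` is hypothetical, and the check looks only for the keys file in each subfolder, relative to the current directory.

```bash
#!/usr/bin/env bash
# Sketch only: list the run folders that do not yet contain a keys file.
# The function name missing_runs is an illustrative assumption.

missing_runs() {
    local input=$1 lower=$2 upper=$3
    local base n
    # input.mallet -> input, the name of the top-level results folder.
    base=$(basename "$input" .mallet)
    for (( n = lower; n <= upper; n++ )); do
        [ -f "$base/$n/${base}_keys.txt" ] || echo "$base/$n/"
    done
    return 0
}

# Report any of the runs for 20, 21 and 22 topics that are still missing.
missing_runs input.mallet 20 22
```

Run from the directory that holds the `input/` folder, this prints nothing once every run in the range has written its keys file.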

`optimize_topics.sh` then goes through the results of each topic
modeling run, searching for the term specified with the `-t` parameter.
It creates a spreadsheet (well, really, a tab-separated values file),
`input_results.tsv`, that can be read by Excel or OpenOffice Calc or
LibreOffice Calc. This file contains, for each time the search term
was found as one of the top 19 words in a topic run, the following
information:

-   The number of topics searched for in that run.
-   The number of the topic in that run in which the search term was
    found.
-   The percentage of the whole corpus that is occupied, in the run
    being described, by the topic in which the search term was found.
-   The Dirichlet parameter for the topic in which the search term was
    found in the run being described.
-   The top words in the topic in which the search term was found.
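
To make the search step concrete, here is a rough sketch of how most of one such row could be pulled out of a single MALLET keys file. It is not the script's own code. It assumes the usual three-column, tab-separated keys layout (topic number, Dirichlet parameter, space-separated top words), it handles only a single-word search term, and it leaves out the percentage column, which would have to come from the composition file. The function name `find_term_in_keys` and the sample data are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: find the topics in one keys file whose top words include a
# single target word, and print tab-separated rows for them.
# The function name find_term_in_keys is an illustrative assumption.

find_term_in_keys() {
    local keys_file=$1 term=$2 num_topics=$3
    awk -F '\t' -v term="$term" -v n="$num_topics" '
        {
            count = split($3, words, " ")
            for (i = 1; i <= count; i++) {
                if (words[i] == term) {
                    # topics in run, topic number, Dirichlet parameter, top words
                    printf "%s\t%s\t%s\t%s\n", n, $1, $2, $3
                    break
                }
            }
        }
    ' "$keys_file"
}

# Made-up sample data, for illustration only.
sample=$(mktemp)
printf '0\t0.25\tsea ship whale captain\n1\t0.10\tland farm wheat\n' > "$sample"
find_term_in_keys "$sample" whale 2
rm -f "$sample"
```

Matching whole words rather than substrings keeps a search for "whale" from also matching "whales"; whether that is the behavior you want depends on your search term.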

### Command-line Arguments

#### Mandatory arguments

`-i filename.mallet`  
specifies the file generated by MALLET when importing textual data. It
**HAS TO** end with a .mallet extension.

`-t target_phrase`  
specifies the target phrase for which to search as the script iterates
over its runs of MALLET.

`-l lower_bound`  
specifies the smallest number of topics to model.

`-u upper_bound`  
specifies the largest number of topics to model.

#### Optional parameters

`-k`  
makes the script keep the temporary files it generates instead of
deleting them. This is probably only useful if you're trying to debug
this script.

`-v`  
makes optimize\_topics give more information about what it is doing. You
can specify it multiple times (currently, there's no point to specifying
it more than twice) to make the script even more verbose.

`-y`  
specifies a number to pass to MALLET as the hyperoptimization parameter.
If not specified, it defaults to 20.

A brief help message can be seen by typing `./optimize_topics.sh --help`
from the working directory in which `optimize_topics.sh` is located.
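
For anyone who wants to extend the script, the options documented above could be parsed with the `bash` `getopts` builtin roughly as follows. This is a sketch under stated assumptions: the function and variable names are hypothetical and are not taken from the script, and the validation shown covers only the rules this section states (four mandatory arguments, the `.mallet` extension, and the default of 20 for `-y`), plus a sanity check on the bounds.

```bash
#!/usr/bin/env bash
# Sketch only: parse the documented options with the getopts builtin.
# Function and variable names are assumptions, not the script's own.

parse_args() {
    local opt OPTIND=1
    input="" target="" lower="" upper=""
    keep_temp=0 verbosity=0 hyperopt=20
    while getopts ":i:t:l:u:y:kv" opt; do
        case $opt in
            i) input=$OPTARG ;;
            t) target=$OPTARG ;;
            l) lower=$OPTARG ;;
            u) upper=$OPTARG ;;
            y) hyperopt=$OPTARG ;;
            k) keep_temp=1 ;;
            v) verbosity=$(( verbosity + 1 )) ;;   # -v may be repeated
            *) echo "unrecognized or incomplete option" >&2; return 1 ;;
        esac
    done
    # All four mandatory arguments must be present.
    if [ -z "$input" ] || [ -z "$target" ] || [ -z "$lower" ] || [ -z "$upper" ]; then
        echo "need -i, -t, -l and -u" >&2
        return 1
    fi
    # The input file name has to end in .mallet.
    case $input in
        *.mallet) ;;
        *) echo "input must end in .mallet" >&2; return 1 ;;
    esac
    if (( lower > upper )); then
        echo "lower bound exceeds upper bound" >&2
        return 1
    fi
    return 0
}

parse_args -i input.mallet -t "target phrase" -l 20 -u 60 -v -v &&
    echo "$input | $target | $lower-$upper | y=$hyperopt | v=$verbosity"
```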

Feature Wishlist
----------------

There are a number of ways in which `optimize_topics.sh` is a quick and
suboptimal hack, and a number of things that I would like it to do
differently. It's unlikely that any of these will happen in the
immediate future unless someone else finds the script helpful and wants
to contribute, though.

-   Probably the most useful change over the long term would be to
    re-code the script in a language more amenable to processing text in
    columns than `bash` scripting. Python seems ideal, but I'm not even
    a sufficiently good Python coder to claim to be a bad Python coder.
    Doing this would lay the groundwork for making the script capable of
    doing more sophisticated analysis and would likely go a long way
    toward creating a script that's useful to Windows users. I initially
    wrote the script as a `bash` script because I know my way around the
    language much better than I know Python. Still, making meaningful
    changes and extensions to this script will become prohibitively
    complex unless the script is re-written in a language that has
    better built-in numeric and text-processing capabilities.
-   Another high-priority change would be to make the script deal
    effectively with .mallet files that have been created from multiple
    input files. This is itself a feature that would be complex enough
    to implement in a `bash` script to justify re-coding in another
    language on its own. It would also involve a number of decisions
    about how to report (certain) summary statistics for runs over these
    more complex .mallet files (what is the overall Dirichlet parameter
    for topics in these files, how is it determined?). Still, it would
    potentially make the script much more useful and flexible.
-   There are a number of other ways that results could be analyzed,
    theoretically, but the script doesn't currently make any attempt to
    do this. Again, opening up other types of analyses would be much
    easier if the script were re-coded in Python or another language
    that better supported text and numeric processing.
-   Other suggestions would be welcome. Interested volunteers who would
    like to contribute to this project would also be welcome to do so.

Version and Development History
-------------------------------

The current version of `optimize_topics.sh` is revision 3. This version,
and previous versions, can be found at [the SourceForge project
page](https://sourceforge.net/projects/optimize-topics).

`optimize_topics.sh` was written by [Patrick
Mooney](http://patrickbrianmooney.nfshost.com/~patrick/), a graduate
student in the [Department of English](http://english.ucsb.edu) at [the
University of California, Santa Barbara](http://www.ucsb.edu). It was
developed as part of my work as a research assistant for the
[WhatEvery1Says](http://4humwhatevery1says.pbworks.com/) project.
Questions and problem reports are best submitted (and conversations are
best initiated) through SourceForge, but you can also [find me in other
places online](http://patrickbrianmooney.nfshost.com/~patrick/personal.html#other-web)
or reach me by email at patrick DOT brian DOT mooney AT gmail DOT com.

### Revision history

revision 3  
A small update that cleans up a few lines of code from revision 2.

revision 2  
Cleans up some of the unbearable uglinesses of the first version; it
prints a better help message, uses `getopts` to process command-line
parameters, and allows some additional parameters to be specified on the
command line instead of requiring the user to edit the script to
configure these settings.

revision 1  
Initial write-up. An even quicker and uglier hack than the current
version.

Have I mentioned that interested collaborators would be welcome?

(Very) Selected Bibliography
----------------------------

### Background Reading on Topic Modeling

Burton, Matt. “The Joy of Topic Modeling.” 21 May 2013. Web. 5 Nov.
2014.
\< [http://mcburton.net/blog/joy-of-tm/](http://mcburton.net/blog/joy-of-tm/) \>

Goldstone, Andrew, and Ted Underwood. “Quiet Transformations: A Topic
Model of Literary Studies Journals.” N.p., n.d. Web. 5 Nov. 2014.
\< [http://www.rci.rutgers.edu/\~ag978/quiet/](http://www.rci.rutgers.edu/~ag978/quiet/) \>

Jockers, Matthew L. “The LDA Buffet Is Now Open; Or, Latent Dirichlet
Allocation for English Majors.” 29 Sept. 2011. Web. 5 Nov. 2014.
\< [http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/) \>

Templeton, Clay. “Topic Modeling in the Humanities: An Overview.”
Maryland Institute for Technology in the Humanities 1 Aug. 2011. Web. 5
Nov. 2014.
\< [http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/](http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/) \>

Underwood, Ted. “Topic Modeling Made Just Simple Enough.” The Stone and
the Shell 7 Apr. 2012. Web. 5 Nov. 2014.
\< [http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/](http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) \>

Weingart, Scott. “Topic Modeling for Humanists: A Guided Tour.” the
scottbot irregular. N.p., 25 July 2012. Web. 5 Nov. 2014.
\< [http://www.scottbot.net/HIAL/?p=19113](http://www.scottbot.net/HIAL/?p=19113) \>

### Software Referenced

McCallum, Andrew Kachites. MALLET: A Machine Learning for Language
Toolkit. N.p., 2002. Web. 3 Nov. 2014.
\< [http://mallet.cs.umass.edu/](http://mallet.cs.umass.edu/) \>

Mooney, Patrick. optimize\_topics.sh. 29 Apr. 2015. Web. 30 Apr. 2015.
\< [https://sourceforge.net/projects/optimize-topics/](https://sourceforge.net/projects/optimize-topics/) \>

Copyright Notice
----------------

`optimize_topics.sh` is copyright © 2015 Patrick Mooney.

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see
\< [http://www.gnu.org/licenses/](http://www.gnu.org/licenses/) \>

This page copyright © 2015 by Patrick Mooney. [This README file](#) was
last updated 3 May 2015.