How My Automated Irish Literature Social Media Accounts Are Generated

If you're looking for non-technical information about Irish Lit AutoTweets, you should go here, instead. The document you're reading right now contains technical information about how the tweets are generated. If you'd like a good overview on how Markov chains work, Jeff Atwood has written a highly accessible introduction.

The Short Version

I exported all of the course-related email that I wrote during the quarter into a folder on my local hard drive, saved the relevant web pages as text in the same folder, copied and pasted the course Twitter stream into a text file in that folder, and saved a text-only version of student paper comments (yes, into the same folder). Then I massaged the data in various ways to make it more amenable to text processing and to remove sensitive information (primarily student names and grades).

The tweets themselves (and the longer Tumblr discourses) are generated using my Markov chain-based text generator (forked from Harry R. Schwartz's text generator), which generates text based on a model of an existing text that analyzes the structure of that existing text by transforming creating Markov chains that model it. I opened a new Twitter account, wrote a script to generate sentences based on that compiled data and upload individual selections to that Twitter account, installed it as a cron job on my laptop that runs six times a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

After this account had been running for about two years, I went through one of my periodic cycles of looking comparatively closely at what the account was doing as it ran along on its merry way, and it occurred to me that it's generating an awful lot of text that simply disappears into the void, mostly because the text-generation process often generates text that doesn't fit into Twitter's 140-character limit. So I arranged for the wasted text to be captured by another script, which periodically posts longer discourses based on the same textual corpus to a new Tumblr account that I opened.

You can download and use either script, if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me. And you should tweet at me (or find another way to reach me) if you do anything interesting with it or have input or ideas.

The (Much) Longer Version

This script is run under Crunchbang++, a Linux distribution based on Debian. (But this process for generating Twitter content should work, with minimal adaptations, on any Unix-like operating system.)

Preparing the Base Corpus

The text generator needs something to analyze and use as the basis of the text it generates; we'll call this input text that the generator models the base corpus. Text used as the base corpus for this project consisted of all of the writing I did that I had captured during the quarter, and it came from several sources:

All of these documents were aggregated into a single folder, then concatenated with the standard POSIX cat command to produce a the first draft of the base corpus. After that, there was a long and boring period of searching and replacing to eliminate undesirable text that I didn't want used as the basis of the text that the script generates (URLs, student names, smart quotes, grade-related information, carriage returns, double spaces, etc. etc. etc.) Searching, replacing, and other editing were largely done with Bluefish, which is a pretty good text editor that supports searching for regular expressions and understands POSIX backslash sequences. I called the single-file base corpus 150.txt because the course for which I did the teaching was, of course, English 150.

I compiled the textual corpus into a set of Markov chains: ./sentence_generator.py -i 150.txt -c 20 -m 2 -o 2chains.dat outputs 2chains.dat, which is then the precompiled Markov chains that the script uses to generate chunks of text during its regular runs several times a day. (This isn't, strictly speaking, necessary; it's possible to just have the script re-model the entire base corpus from scratch each time it runs. But, since the base corpus is almost 625K of text, it saves some time on each script run to prepare the set of Markov chains once, in advance, instead of redoing the same work multiple times per day.)

Automating Text Generation

Once the textual corpus was assembled, I signed up for a new Twitter account for the project, logged in, and set up the profile. Then I wrote and tested a Python script that uses my Markov chain-based text generator to generate chunks of text until the generated chunk is acceptable, then sends that acceptable chunk to Twitter. Interacting with Twitter in Python is easy using any of several libraries; I decided to go with tweepy. It's not installed by default, but pip install tweepy takes care of that easily enough. The script runs through a few steps every time it's run:

  1. Parses the command line that's passed to it. It understands several command-line options; try running it as ./generate.py --help to see an explanation.
  2. Generates a chunk of text.
  3. Checks to see if the resulting text is within acceptable length parameters. Of course, Twitter's famous 140-character limit is a hard upper limit that winds up being kind of annoying sometimes, but I also enforce a minimal limit (currently 46 characters) because very short tweets tend to be rather dull, in my opinion. (Early tests of the script produced the tweet History. about one out of every five times.) If the automatically generated tweet isn't an acceptable length, it just keeps trying until it generates one that is. (Since v1.2, if the script is run with the -x or --extra-material flags, it saves rejected tweets in a separate file.)
  4. Checks to see if the tweet has been posted before. The random tweets generated by this script are in theory repeatable, so this script keeps a list of tweets that have been posted before (I refer to this as the tweet archive). If the tweet's not new, the script starts over and tries again. (Repeated tweets that are generated are not saved to the extra material archive, even if too-long and and too-short tweets are being saved.)
  5. Sends the resulting new tweet of acceptable length to Twitter.
  6. Saves the new tweet to the tweet archive. As far as I can tell, there's no good reason to sort the tweet archive — it doesn't, for instance, seem to speed up the search that determines whether the tweet has been posted before — but sorting alphabetically is possible by running ./generate.py --sort-archive, if you're the kind of person who likes things to be sorted whether there's a benefit to it or not.

You can see the script on GitHub, and previous versions are available as branches; I (in)tend to tweak it from time to time. You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a the GNU GPL v3 or, at your option, any later version. You'll need to make it executable (chmod +x generate.py), of course, or else to invoke the script name as an argument to Python on the command line, perhaps as python3 generate.py. If you find this script useful, I'd be grateful if you'd Flattr me, though this is (of course!) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements.

Automating the Script's Runs

Once the script was set up and worked effectively when I ran it manually, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on five times a day, and later bumped that up to six). Here's the line from my cron file that I use as of the time of this writing:

45 0,4,8,12,16,20 * * * python3 /150/generate.py -v -v -v -a /150/tweets.txt -x /150/extras.txt

Which is to say that the script runs on my laptop at 12:45, 4:45, and 8:45 a.m., and 12:45, 4:45, and 8:45 p.m., posting a new tweet each time, saving that tweet to the tweet archive at /150/tweets.txt, and saving rejected material to /150/extras.txt. Provided that my laptop is on and connected to the Internet, of course. Once in a while, the extra material that's being collected is posted to the Automated Irish Lit Discourses Tumblr account ... about which I will say more in a bit.

I also installed it as an anacron job to make sure that it runs at least once a day if my laptop is turned on. Here is the relevant line from /etc/anacrontab:

1 20 IrishLitTweets.daily /150/generate.py

I often specify full paths in the script because cron jobs may not have a properly set-up environment that guarantees environment variables, such as $PATH, are properly set. I find it easier to just specify full paths than to keep a crontab PATH declaration in sync with one that's maintained in a .bashrc.

Longer Discourses on Tumblr

Periodically, the extra material that builds up in the extra material archive gets posted to Automated Irish Literature Discourses on Tumblr. This is accomplished by another Python script, discourse-generate.py, which is also available on GitHub.

discourse-generate.py lives in /IrishLitDiscourses on my hard drive; it's a simpler, less-robust, less-developed script that just looks at /150/extras.txt twice a day (4 a.m. and 4 p.m., Pacific time) and makes a decision about whether to post it. The script doesn't take command-line arguments or do enough error-checking; it just rolls the dice and, sometimes, posts. My initial observation was that it took too long to build up material for posting to the Tumblr account, so what I did was tweak the Twitter script so that (as of v1.3), instead of asking for a single sentence, it generates anywhere between one and six sentences when the -x or --extra-material-archive switches are specified. This means that tweets of two or more sentences now sometimes appear in the Twitter stream; but it also has the side effect of dumping much more material into the extra material archive, since collections of two to six sentences are more likely to be too long for Twitter than single sentences are.

More specifically, the Tumblr script decides whether to post by comparing a random number to a probability calculated based on the current length of the extra material archive. The script uses an exponential decay curve to affect how likely it is that the script posts: the more material is built up, the more likely it is that the material will be posted when the script rolls the dice to determine whether or not the material gets shipped out. If there are fewer than three thousand characters waiting to go out, the chances of them doing so on any iteration of the script are zero, but the probability doesn't actually reach certainty until there are a bit over 1.5 million characters waiting to go out. (Practically speaking, of course, running the script, and therefore rolling the dice, twice a day with gradually increasing probability of posting means that the extra material archive should never get anywhere near that large.)

When the script decides to post, it decides on a title (currently Discourse of [Date]), a slug (the end of the Tumblr post's URL), the text of a Twitter annoucement, and a set of tags for the post. Then it sends them off to Tumblr using the pytumblr library.

Overall, then, the process looks like this:

  1. The the Twitter script, /150/generate.py, runs six times a day, posting tweets to @IrishLitTweets. Extra material that's rejected for being the wrong length accumulates on my hard drive in /150/extras.txt.
  2. Twice a day, assuming my laptop is on and connected to the Internet at the right times, the Tumblr script runs and decides whether a random number between zero and one is or is not less than 1 − e (len − 3000) × 14000, where len is the current length of the extra material archive, in bytes. If the random number is less than the number from the probability calculation, then:
    1. The accumulated text gets shipped off to the Tumblr account with an appropriate title, slug, and set of tags.
    2. A tweet is posted to @IrishLitTweets that points to the new long discourse on Tumblr.
    3. The extra material archive gets emptied out.
    4. The extra material archive gradually gets filled up again as the Twitter script runs. The process repeats.

Here's the crontab line that runs the script twice a day:

0 4,16 * * * /home/patrick/.Enthought/User/bin/python /IrishLitDiscourses/discourse-generate.py

And here's the line from /etc/anacrontab line that makes sure the script runs at least once on any day that my laptop is on for at least half an hour:

1 30 IrishLitDiscourses.daily /home/patrick/.Enthought/User/bin/python /IrishLitDiscourses/discourse-generate.py

Guest Lecturers

I'm planning on posting other DadaDodo-generated discourses from time to time based on statistical analyses of other texts about Irish literature. It will be a bit before this happens, because there are other tweaks that I want to make first, but there's a lot of material out there that could be run through DadaDodo and used as the basis for these guest lectures by historical figures. It's likely, for instance, that an early set of guest lectures will be based on Matthew Arnold's The Study of Celtic Literature. Suggestions are welcome, especially if you can point to source texts for analysis; let me know on Twitter if you've got a good idea.

Reservations About the Current Setup

All of which is to say, again, that I'd love feedback if people have thoughts or ideas about how this could be done better.

Change History

All updates by Patrick Mooney. The latest stable version of the tweet-producing script is always on the master branch on the GitHub project. The version of the script actually producing the text is always the highest version-numbered branch on GitHub (it is pushed to GitHub when it first goes into testing; currently this is v1.3). The same is true of the master branch and highest-numbered branch (currently v1.1) of the script posting longer sections of text on Tumblr.