How the The Worst of Bad Lovecraft Tumblr Account's Content Is Generated

If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.

The Short Version

In overview, this project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.

I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using my Python-based text analysis tool, saved the compiled data, opened a new Tumblr account, wrote a script to generate stories based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful or entertaining, I'd be grateful if you'd Flattr me.

The (Much) Longer Version

The basic setup practice was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.

All of this happens under Crunchbang++, or whatever Linux distro I'm currently using. With some minimal adaptation, though, this should work on more or less any POSIX-compliant operating system.

I took the collection of Lovecraft stories and converted it to plain text, then parsed it with my Python-based Markov text generator, which I've developed in tandem with another of my automatic text blogs. I initially performed practically no pre-processing on the text, unlike with the Irish Lit AutoTweets account, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. (As the blog continues to produce content, though, the corpus is gradually being edited; you can see the source texts for the blog on GitHub, if you'd like.)

I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Then I created a symbolic link to the Markov chain–based generator inside the /lovecraft folder so I can easily use the text generator from the script that interacts with Tumblr. Once that was done, I compiled the textual corpus into a useful statistical set with:

text_generator.py -m 2 -c 5 -i corpora/entire-corpus.txt -o corpora/chains.dat
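For readers unfamiliar with the technique, here is a minimal sketch of what "compiling" a corpus into Markov chain data and then generating text from it can look like. This is not the code in text_generator.py: it assumes a word-level chain of order 2, and the function names (build_chains, generate) and the toy corpus are made up for illustration.

```python
import random
from collections import defaultdict


def build_chains(words, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chains = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chains[state].append(words[i + order])
    return dict(chains)


def generate(chains, length=20, seed=None):
    """Walk the chain from a random state until `length` words or a dead end."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chains))
    output = list(state)
    while len(output) < length:
        followers = chains.get(state)
        if not followers:  # this state only occurs at the very end of the corpus
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Toy corpus (hypothetical); the real input would be the full story collection.
corpus = "the old house stood on the hill and the old house watched the town".split()
chains = build_chains(corpus, order=2)
sample = generate(chains, length=12, seed=1)
print(sample)
```

Because a follower that appears more often in the corpus appears more often in the list, picking uniformly from the list reproduces the observed probabilities without storing them explicitly.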

After that, it was a matter of writing and testing a Python 3 script that generates chunks of text that are within acceptable length parameters (currently defined as 25 to 70 sentences, though that's subject to change) and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several "about this blog" documents on the blog.

Some of the work of communicating with Tumblr is done by my social media–handling module, but that module itself relies on Michael Helmick's python-tumblpy module. It also relies on a set of authentication constants that I store in yet another importable Python module, which I don't make public; they're my private authentication constants, after all, and sharing them with the world would allow anyone to post to the account.

In more detail, the script:

  1. Uses text_generator.py to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters, but in the past I've found that allowing for short titles tends to result in very short titles that pop up over and over, so I lengthened the requirement to between 40 and 90 characters, which seemed to be about right. (And since 11 June 2014, the script checks its title against a list of all titles that have been used before, so this is no longer a problem, and from that date the script generates titles of 10 to 70 characters.)
  2. Picks a number of sentences that the story should contain (currently, between 25 and 55, though this may change in the future), and calls text_generator.py to generate the story.
  3. Sends the resulting story, its title, and a set of tags defined in the script to Tumblr using python-tumblpy.
  4. Adds the title for the story to the list of story titles that have been used so far, titles.txt.
  5. Along the way, it echoes the title, story, and some debugging information to standard output, in case I'm running the script manually.
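The steps above can be sketched roughly as follows. This is an illustration, not the contents of generate.py: the three callables (make_title, make_story, post) are hypothetical stand-ins for the calls to text_generator.py and python-tumblpy, the TAGS value is a placeholder, and the demo titles are invented. Only the length limits, the sentence range, and the titles.txt file name come from the description above.

```python
import random
import tempfile
from pathlib import Path

TAGS = ["example tag"]  # placeholder; the real tag list lives in the script


def load_titles(path):
    """Return the set of titles already used, or an empty set if no file yet."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pick_title(make_title, used, min_len=10, max_len=70, max_tries=1000):
    """Keep generating titles until one is unused and within the length limits."""
    for _ in range(max_tries):
        title = make_title().strip()
        if min_len <= len(title) <= max_len and title not in used:
            return title
    raise RuntimeError("no acceptable title found")


def run_once(make_title, make_story, post, titles_path="titles.txt", rng=random):
    used = load_titles(titles_path)
    title = pick_title(make_title, used)                 # step 1
    sentence_count = rng.randint(25, 55)                 # step 2
    story = make_story(sentence_count)
    post(title, story, TAGS)                             # step 3
    with open(titles_path, "a", encoding="utf-8") as f:  # step 4
        f.write(title + "\n")
    print(title, story, sep="\n\n")                      # step 5
    return title, sentence_count


# Demo with stub generators and a throwaway titles file (nothing is posted).
with tempfile.TemporaryDirectory() as tmp:
    titles_file = Path(tmp) / "titles.txt"
    titles_file.write_text("The Shadow Over Something\n", encoding="utf-8")
    posted = []
    candidates = iter(["Short", "The Shadow Over Something",
                       "The Colour Out of the Cellar"])
    title, count = run_once(
        make_title=lambda: next(candidates),
        make_story=lambda n: " ".join(["It was dark."] * n),
        post=lambda t, s, tags: posted.append((t, s, tags)),
        titles_path=titles_file,
    )
    saved = load_titles(titles_file)
```

Passing the generators and the posting function in as arguments keeps the control flow testable without a network connection or Tumblr credentials.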

Note that running this script can actually take quite a while — up to a minute, on my laptop. This is substantially longer than running the generate.py script for the Irish Lit AutoTweets account.

Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:

0 9,21 * * * /lovecraft/generate.py

9 a.m. and 9 p.m. in my time zone, of course, are noon and midnight where Lovecraft lived for most of his life.
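The arithmetic can be checked with a few lines of Python. The sketch assumes the laptop's clock is three hours behind Providence, Rhode Island (for example, US Pacific time); the time zone isn't actually named above, so that offset is inferred from the sentence itself. Fixed standard-time offsets are used for simplicity, and the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=-8), "PST")       # assumed zone of the laptop
PROVIDENCE = timezone(timedelta(hours=-5), "EST")  # Providence, Rhode Island


def providence_hour(local_hour):
    """Hour of day in Providence when the cron job fires at `local_hour`."""
    local = datetime(2014, 1, 15, local_hour, 0, tzinfo=LOCAL)
    return local.astimezone(PROVIDENCE).hour


for hour in (9, 21):  # the two hours in the cron line
    print(f"{hour:02d}:00 local -> {providence_hour(hour):02d}:00 in Providence")
```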

I also install an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on, even if the computer's not turned on at the times when the cron job is supposed to run. The line in /etc/anacrontab looks like this:

1 25 cron.daily /lovecraft/generate.py

You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under the GNU GPL, either version 3 or, at your option, any later version. You'll need to make it executable (chmod +x generate.py), of course. You'll also need to download several other modules I've written on which it depends; URLs for these modules are in source code comments. If you find the script useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements.

Historical note: the underlying process generating the stories has changed greatly since the first post on the blog: content used to be written by Jamie Zawinski's DadaDodo and shipped to Tumblr via email and/or the tumblr-rb Ruby gem. The glue code used to be a Linux bash script rather than Python. In fact, really the only thing that hasn't changed since the beginning is that the underlying operating system is a Linux distribution. If you want to trace back through the history of changes to the toolchain generating the blog content, read through Change History, below.

Future Improvements; Other Thoughts

Change History

All updates by Patrick Mooney. The latest stable version of the story-posting script is always available on GitHub.