How The Worst of Bad Lovecraft Tumblr Account's Content WAS Generated

If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.

The Short Version

In overview, this project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.

I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using DadaDodo, saved the compiled data, opened a new Tumblr account, wrote a script to generate "stories" based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me.

The (Much) Longer Version

Due to a (no longer quite so) recent change, much of this information is no longer accurate. I've struck through the parts that were already outdated when I made this announcement, but there are other inconsistencies as of this writing (8 October 2015). I hope to have a chance to write up a new description soon. Short version: I'm not using the tumblr-rb rubygem from a bash script anymore; I've rewritten the script in Python, and it interacts with Tumblr via the pytumblr library.

The basic setup practice was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.

All of this happens under Crunchbang Linux. I tend to use the Openbox window manager and Tilda terminal emulator, though that's not really germane. Really, with some minimal modification, this should work on more or less any POSIX-compliant operating system.

I took the collection of Lovecraft stories and parsed it with DadaDodo, which I already had installed. DadaDodo can process HTML as well as plain text. Unlike with the Irish Lit AutoTweets account, I did practically no pre-processing, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. A long-term goal is to massage this corpus in various ways, though I have not actually done any real work in this regard yet.

I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Once that was done, I compiled the textual corpus into a useful statistical set with:

dadadodo -w 10000 -o chains.dat -c 1 CollectedStories.html

As part of my long-term goal to make this process more flexible, I created a sub-folder, corpora, and made a symbolic link (/lovecraft/chains.dat, which points to /lovecraft/corpora/chains.dat). I've imported the CollectedStories.html file into LibreOffice and am gradually spell-checking and otherwise massaging it, though, again, this ongoing process hasn't yet borne fruit.
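That layout can be sketched in shell as follows. This is a minimal illustration rather than a record of the commands actually run: the function name relocate_chains is hypothetical, and it assumes chains.dat already sits in the base folder.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): move the compiled
# data into a corpora/ sub-folder and leave a symbolic link under the old name.
relocate_chains() {
    base="$1"                                    # e.g. /lovecraft
    mkdir -p "$base/corpora"
    mv "$base/chains.dat" "$base/corpora/chains.dat"
    # The link keeps the old path working for anything that already uses it.
    ln -s "$base/corpora/chains.dat" "$base/chains.dat"
}
```

Calling relocate_chains /lovecraft would reproduce the arrangement described above; a differently prepared corpus could later be swapped in by repointing the link, without touching anything that reads /lovecraft/chains.dat.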

After that, it was a matter of writing and testing a script that generates chunks of text that are within acceptable length parameters and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several about this blog documents.

The actual work of communicating with Tumblr is done by a RubyGem called tumblr-rb. Tumblr-rb sends a YAML file to Tumblr, as specified in the Tumblr API. YAML (which, annoyingly, is a recursive acronym for YAML Ain't Markup Language, as if anyone still thought that something Scott Adams was mocking as overused twenty years ago is edgy, and in order to point out that the syntax of the language marks up not documents, but data) is simply a plain-text Unicode file that allows for a limited amount of markup to highlight the data syntax.

To install tumblr-rb and rubygems:

sudo apt-get install rubygems
sudo gem install tumblr-rb

Easy enough, right?

The first time tumblr-rb is called, it needs to be called like this:

tumblr authorize

to get it to store an API key in ~/.tumblr/. Unlike with Irish Lit AutoTweets, I'm fairly sure that I don't want to use tumblr-rb for anything other than this script, so I haven't bothered to go through the process of moving the API key elsewhere and specifying it in the script's tumblr-rb command line, though this would be easy enough to do.

Once all the pieces were in place, all I needed to do was to put the finishing touches on the script, which is just a quick Bash script that does a few things:

  1. Calls dadadodo to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters long, but in the past I found that allowing short titles meant that DadaDodo repeated titles a great deal, so I raised the requirement to between 40 and 90 characters, which seemed to be about right. Since 11 June 2014, the script has checked each title against a list of all titles that have been used before, so repetition is no longer a problem, and since that date it has generated titles of 10 to 70 characters. After generating an acceptable new title, the script uses sed to strip out the leading whitespace that dadadodo inserts.
  2. Creates the first part of a YAML file (post.txt) with some metadata:

    ---
    type: text
    tags: H.P. Lovecraft,automatically generated text,Patrick Mooney,dadadodo
    title: {THE POST'S TITLE}

    Yes, there are two blank lines at the end of that front matter.

  3. Picks the number of sentences that the story should contain (currently between 25 and 55, though this may change in the future), and calls dadadodo to generate the story, appending it to the temporary file that already contains the YAML front matter for the post.
  4. Calls sed again to strip out leading whitespace from the post.
  5. Sends the resulting story and its metadata to Tumblr using tumblr-rb.
  6. Adds the title for the story to the list of story titles that have been used so far, titles.txt.
  7. Along the way, it echoes the title, title length (in characters), and story length (in sentences) to standard output, in case I'm running the script manually.
  8. Next time the script runs, it overwrites the post.txt file with a new one.
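The steps above can be sketched roughly as follows. This is not the downloadable script itself, only a minimal outline under stated assumptions: the function names and the locations of titles.txt and post.txt are illustrative, the -c switch for the sentence count is borrowed from the compile command earlier on this page, the awk call is just one portable way to pick a random number, and the tumblr-rb upload (step 5) is left as a comment rather than guessed at.

```shell
#!/bin/bash
# Sketch only: the names and paths below are illustrative assumptions.
CHAINS=/lovecraft/chains.dat
TITLES=/lovecraft/titles.txt
POST=/lovecraft/post.txt

# Ask DadaDodo for $1 sentences; sed strips the leading whitespace it inserts.
generate() {
    dadadodo -l "$CHAINS" -c "$1" -w 10000 | sed -e 's/^[[:space:]]*//'
}

# Succeed only if the title is 10 to 70 characters and not in the used list.
title_ok() {
    [ "${#1}" -ge 10 ] && [ "${#1}" -le 70 ] || return 1
    if [ -f "$TITLES" ] && grep -qxF -- "$1" "$TITLES"; then
        return 1
    fi
    return 0
}

# Start a fresh post file (overwriting the previous one) with the front matter.
write_front_matter() {
    {
        echo '---'
        echo 'type: text'
        echo 'tags: H.P. Lovecraft,automatically generated text,Patrick Mooney,dadadodo'
        echo "title: $1"
        echo                      # the two blank lines that end the front matter
        echo
    } > "$POST"
}

main() {
    # Step 1: keep asking for a one-sentence title until one passes the checks.
    title=$(generate 1)
    until title_ok "$title"; do
        title=$(generate 1)
    done
    write_front_matter "$title"                                      # step 2
    # Step 3: pick a story length between 25 and 55 sentences.
    length=$(awk 'BEGIN { srand(); print int(rand() * 31) + 25 }')
    generate "$length" >> "$POST"                                    # steps 3 and 4
    # Step 5, sending $POST to Tumblr with tumblr-rb, is omitted from this sketch.
    echo "$title" >> "$TITLES"                                       # step 6
    echo "Title: $title (${#title} characters); story: $length sentences"   # step 7
}
```

Nothing runs until main is called, so the individual functions can be tried out on their own first. Keeping the title check in its own function also makes it easy to change the length limits later without touching the rest.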

Note that running this script can actually take quite a while — up to several minutes, on my laptop. This is substantially longer than running the script for the Irish Lit AutoTweets account, because DadaDodo takes a much longer time to generate a bigger chunk of text based on a bigger corpus.

You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. You'll need to make it executable (chmod +x, of course). If you find it useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements. I'm happy to answer questions, but can't provide extensive support or bring you up to speed if the explanation here is unintelligible to you.

Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:

0 9,21 * * * /lovecraft/

9 a.m. and 9 p.m. in my time zone, of course, is noon and midnight where Lovecraft lived for most of his life.

I also installed an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on. The line in /etc/anacrontab looks like this:

1 25 cron.daily /lovecraft/

Some Notes on the Script

I often specify full paths in the script because cron jobs may not have a properly set-up environment that guarantees environment variables, such as $PATH, are properly set.
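One way to apply that idea is sketched below. The install locations shown are assumptions, not taken from the actual script (running command -v dadadodo in an ordinary terminal shows where a tool really lives), and require_tool is a hypothetical helper name.

```shell
#!/bin/bash
# Assumed install locations; adjust them to match your own system.
DADADODO=/usr/bin/dadadodo
SED=/bin/sed

# Fail early, with a message that cron can mail to you, if a tool is missing
# from the place the script expects to find it.
require_tool() {
    if [ ! -x "$1" ]; then
        echo "required tool not found: $1" >&2
        return 1
    fi
}
```

A script could then run require_tool "$DADADODO" || exit 1 near the top and call "$DADADODO" instead of the bare command name, so that it behaves the same under cron as it does in an interactive shell.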

Switches for the dadadodo invocation (that generates the story):

-l /lovecraft/chains.dat
Don't use the original corpus; use the manually compiled statistical data instead. This is faster, especially when generating up to seventy sentences based on 3 MB of HTML-encoded Lovecraft stories.
-c $LENGTH
Generate the number of sentences previously stored in the variable $LENGTH.
-w 10000
Use a really wide text wrap amount to keep DadaDodo from wrapping text.

Switches for the tumblr (i.e., the tumblr-rb executable) invocation:

Create a post (as opposed to doing one of the other things that the tumblr executable can do.)
Specifies which Tumblr account to connect to.
Specifies where the Tumblr API key is.
This last parameter gives the location of the YAML file.

Some Thoughts on the Stories

Fix me: write this section.

Future Improvements; Other Thoughts

Change History