How The Worst of Bad Lovecraft Tumblr Account's Content Is Generated

If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.

The Short Version

In overview, this project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.

I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using DadaDodo, saved the compiled data, opened a new Tumblr account, wrote a script to generate "stories" based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me.

The (Much) Longer Version

Due to a (no longer quite so) recent change, much of this information is no longer accurate. I've struck through the parts that were already outdated when I made this announcement, but there are other inconsistencies as of this writing (8 October 2015). I hope to have a chance to write up a new description soon. Short version: I'm not using the tumblr-rb rubygem from a bash script anymore; I've rewritten the script in Python, and it interacts with Tumblr via the pytumblr library.

The basic setup practice was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.

All of this happens under Crunchbang Linux. I tend to use the Openbox window manager and Tilda terminal emulator, though that's not really germane. Really, with some minimal modification, this should work on more or less any POSIX-compliant operating system.

I took the collection of Lovecraft stories and parsed it with DadaDodo, which I already had installed. DadaDodo can process HTML as well as plain text. Unlike with the Irish Lit AutoTweets account, I did practically no pre-processing, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. A long-term goal is to massage this corpus in various ways, though I have not actually done any real work in this regard yet.

I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Once that was done, I compiled the textual corpus into a useful statistical set with:

dadadodo -w 10000 -o chains.dat -c 1 CollectedStories.html

As part of my long-term goal to make this process more flexible, I created a sub-folder, corpora, and made a symbolic link (/lovecraft/chains.dat, which points to /lovecraft/corpora/chains.dat). I've imported the CollectedStories.html file into LibreOffice and am gradually spell-checking and otherwise massaging it, though, again, this ongoing process hasn't yet borne fruit.
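For anyone setting up something similar, the symbolic-link arrangement can be sketched roughly like this. It's a minimal illustration, not the commands from my own setup: the helper name point_chains_at and its argument order are made up for the example, and it assumes an ln that supports the -n switch (GNU and BSD versions do).

```shell
#!/bin/sh
# Illustrative sketch only: point_chains_at is a hypothetical helper,
# not something taken from generate.sh.

# Repoint the symbolic link LINK at the compiled data file TARGET.
# Usage: point_chains_at TARGET LINK
point_chains_at() {
    target=$1
    link=$2
    # Refuse to link to a file that isn't there.
    [ -f "$target" ] || { echo "no such file: $target" >&2; return 1; }
    # -s: make a symbolic link; -f: replace an existing link;
    # -n: treat LINK itself as the thing to replace, even if it is a link.
    ln -sfn "$target" "$link"
}

# Example, using the paths from this page:
# point_chains_at /lovecraft/corpora/chains.dat /lovecraft/chains.dat
```

Because the script only ever reads the link, swapping in a different compiled file later is a one-line change that doesn't touch the script itself.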

After that, it was a matter of writing and testing a script that generates chunks of text that are within acceptable length parameters and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several about this blog documents.

The actual work of communicating with Tumblr is done by a RubyGem called tumblr-rb. Tumblr-rb sends a YAML file to Tumblr, as specified in the Tumblr API. YAML (which, annoyingly, is a recursive acronym for YAML Ain't Markup Language, as if anyone still thought that something Scott Adams was mocking as overused twenty years ago is edgy, and in order to point out that the syntax of the language marks up not documents, but data) is simply a plain-text Unicode file that allows for a limited amount of markup to highlight the data syntax.

To install tumblr-rb and rubygems:

sudo apt-get install rubygems
sudo gem install tumblr-rb

Easy enough, right?

The first time tumblr-rb is called, it needs to be called like this:

tumblr authorize

to get it to store an API key in ~/.tumblr/. Unlike with Irish Lit AutoTweets, I'm fairly sure that I don't want to use tumblr-rb for anything other than this script, so I haven't bothered to go through the process of moving the API key elsewhere and specifying it in the script's tumblr-rb command line, though this would be easy enough to do.

Once all the pieces were in place, all I needed to do was to put the finishing touches on the script, which is just a quick Bash script that does a few things:

Calls dadadodo to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters, but in the past I've found that allowing for short titles means that DadaDodo repeats titles a great deal, so I raised the requirement to between 40 and 90 characters, which seemed to be about right. Since 11 June 2014, the script checks its title against a list of all titles that have been used before, so this is no longer a problem, and from that date generate.sh generates titles of 10 to 70 characters. After generating an acceptable new title, the script uses sed to strip out the leading whitespace that dadadodo inserts.
Creates the first part of a YAML file (post.txt) with some metadata:

---
type: text
tags: H.P. Lovecraft,automatically generated text,Patrick Mooney,dadadodo
title: {THE POST'S TITLE}

Yes, there are two blank lines at the end of that front matter.
Picks a number of sentences that the story should contain (currently, between 25 and 55, though this may change in the future), and calls dadadodo to generate the story, copying it to the end of the temporary file ~~containing the YAML markdown for the post.~~
Calls sed again to strip out leading whitespace from the post.
Sends the resulting story and its metadata to Tumblr using tumblr-rb.
Adds the title for the story to the list of story titles that have been used so far, titles.txt.
Along the way, it echoes the title, title length (in characters), and story length (in sentences) to standard output, in case I'm running the script manually.
Next time the script runs, it overwrites the post.txt file with a new one.
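The steps above can be sketched in shell roughly as follows. This is a reconstruction for illustration under stated assumptions, not a copy of generate.sh: all of the function names are invented, the location of titles.txt is a guess, and the title loop simply asks DadaDodo for one sentence at a time until one passes the length and uniqueness checks. The length limits and sentence counts are the ones given on this page.

```shell
#!/bin/sh
# Sketch of the steps listed above. Helper names are illustrative
# assumptions, not code copied from generate.sh.

CHAINS=/lovecraft/chains.dat     # compiled DadaDodo data (path from this page)
TITLES=/lovecraft/titles.txt     # titles used so far (location assumed)
POST=/lovecraft/post.txt         # temporary file sent to Tumblr

# Remove leading whitespace from every line on standard input.
strip_leading_ws() {
    sed 's/^[[:space:]]*//'
}

# Succeed if TITLE is MIN..MAX characters long and not already in LISTFILE.
# Usage: title_ok TITLE MIN MAX LISTFILE
title_ok() {
    len=${#1}
    [ "$len" -ge "$2" ] && [ "$len" -le "$3" ] || return 1
    # -F: fixed string; -x: match the whole line; -q: no output
    if [ -f "$4" ] && grep -Fxq -- "$1" "$4"; then
        return 1
    fi
    return 0
}

# Print a random whole number between MIN and MAX, inclusive.
# Usage: random_between MIN MAX
random_between() {
    awk -v min="$1" -v max="$2" \
        'BEGIN { srand(); print min + int(rand() * (max - min + 1)) }'
}

# Print the metadata block for a post, ending with two blank lines.
# Usage: front_matter TITLE
front_matter() {
    printf '%s\n' '---' 'type: text' \
        'tags: H.P. Lovecraft,automatically generated text,Patrick Mooney,dadadodo' \
        "title: $1" '' ''
}

# Keep asking DadaDodo for a single sentence until one works as a title.
new_title() {
    while :; do
        candidate=$(dadadodo -l "$CHAINS" -c 1 -w 10000 | strip_leading_ws)
        if title_ok "$candidate" 10 70 "$TITLES"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
}

# Build the post file: metadata first (overwriting the old file),
# then the story (appending), then record the title as used.
build_post() {
    title=$(new_title)
    length=$(random_between 25 55)
    front_matter "$title" > "$POST"
    dadadodo -l "$CHAINS" -c "$length" -w 10000 | strip_leading_ws >> "$POST"
    printf '%s\n' "$title" >> "$TITLES"
    echo "Title: $title (${#title} characters); story: $length sentences"
}
```

Nothing runs until build_post is called, so the helpers can be tried one at a time in a terminal; sending the finished file to Tumblr is a separate step.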

Note that running this script can actually take quite a while — up to several minutes, on my laptop. This is substantially longer than running the generate.sh script for the Irish Lit AutoTweets account, because DadaDodo takes a much longer time to generate a bigger chunk of text based on a bigger corpus.

You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. You'll need to make it executable (chmod +x generate.sh), of course. If you find it useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements. I'm happy to answer questions, but can't provide extensive support or bring you up to speed if the explanation here is unintelligible to you.

Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:

0 9,21 * * * /lovecraft/generate.sh

9 a.m. and 9 p.m. in my time zone, of course, is noon and midnight where Lovecraft lived for most of his life.

I also installed an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on. The line in /etc/anacrontab looks like this:

1 25 cron.daily /lovecraft/generate.sh

Some Notes on the Script

I often specify full paths in the script because cron jobs may not have a properly set-up environment that guarantees environment variables, such as $PATH, are properly set.
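One way to make that habit explicit is a small check near the top of a script. This is a minimal sketch, assuming nothing about how generate.sh itself does it: require_tool is a hypothetical helper, and the tool location in the commented-out usage is only a guess that will vary by system.

```shell
#!/bin/sh
# Sketch: make a script independent of whatever environment cron provides.
# The helper name and the tool location below are assumptions.

# Succeed only if PATHNAME is an absolute path to an executable file.
# Usage: require_tool PATHNAME
require_tool() {
    case $1 in
        /*) ;;   # absolute path: carry on to the executable check
        *)  echo "not an absolute path: $1" >&2; return 1 ;;
    esac
    [ -x "$1" ] || { echo "not executable: $1" >&2; return 1; }
}

# Typical use near the top of a script (locations vary by system):
# DADADODO=/usr/bin/dadadodo
# require_tool "$DADADODO" || exit 1
# cd /lovecraft || exit 1      # don't rely on cron's working directory
```

Failing early with a clear message is easier to diagnose from cron's mail than a script that quietly produces an empty file.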

Switches for the dadadodo invocation (that generates the story):

-l /lovecraft/chains.dat: Don't use the original corpus; use the manually compiled statistical data instead. This is faster, especially when generating up to seventy sentences based on 3 MB of HTML-encoded Lovecraft stories.
-c $LENGTH: Generate the number of sentences previously stored in the variable $LENGTH.
-w 10000: Use a really wide text wrap amount to keep DadaDodo from wrapping text.
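Put together, those three switches give a command line along these lines. The wrapper function and the DADADODO override variable are hypothetical conveniences added for the example (the override makes it possible to try the function without DadaDodo installed); only the switches and the chains.dat path come from this page.

```shell
#!/bin/sh
# Sketch: the three switches above combined into one invocation.
# generate_story and the DADADODO variable are illustrative names.

# Usage: generate_story SENTENCE_COUNT [CHAINS_FILE]
generate_story() {
    LENGTH=$1
    chains=${2:-/lovecraft/chains.dat}
    # -l: load precompiled data; -c: number of sentences; -w: wrap width
    "${DADADODO:-dadadodo}" -l "$chains" -c "$LENGTH" -w 10000
}

# Example: generate_story 30
```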

Switches for the tumblr (i.e., the tumblr-rb executable) invocation:

post: Create a post (as opposed to doing one of the other things that the tumblr executable can do).
--host=autolovecraft.tumblr.com: Specifies which Tumblr account to connect to.
--credentials=/lovecraft/.tumblr: Specifies where the Tumblr API key is.
/lovecraft/post.txt: This last parameter gives the location of the YAML file.
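Assembled, the parameters above make up a single call to the tumblr executable. As a hedged sketch: send_post and the TUMBLR override variable are names invented for this example, while the subcommand, switches, and paths are exactly the ones listed above.

```shell
#!/bin/sh
# Sketch: the tumblr-rb parameters above combined into one invocation.
# send_post and the TUMBLR variable are illustrative names.

# Usage: send_post [POST_FILE]
send_post() {
    post=${1:-/lovecraft/post.txt}
    "${TUMBLR:-tumblr}" post \
        --host=autolovecraft.tumblr.com \
        --credentials=/lovecraft/.tumblr \
        "$post"
}

# Example: send_post /lovecraft/post.txt
```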

Some Thoughts on the Stories

Fix me: write this section.

Future Improvements

DadaDodo really doesn't understand English grammar as well as I'd like. I've written about this more extensively in relation to the Irish Lit AutoTweets account, so I'll just say again here that if anyone has a suggestion for a Linux program that can be run from a bash script and does a better job of doing what DadaDodo does (i.e., generating text more or less grammatically, but randomly, and based on an existing corpus of text), I'd love to hear about it.
Massaging the text to make it more amenable to processing by DadaDodo is a long-term goal, and this would likely involve, as a first step, normalizing and standardizing Lovecraft's spelling. One current problem is that DadaDodo doesn't "understand" that, for instance, the use of nonstandard diction is limited to the speech of certain characters, and so these words appear at random in the program's output. One way to address this would be to spell-check the text, and I'm working through this, but it's going to take quite a while to spell-check nearly 700 pages of text. A downside is that this would likely move Lovecraft's text more toward contemporary American usage, which is kind of a shame; but I suspect that it's better than random irruptions of dialect associated with characters like Zadok Allen in The Shadow over Innsmouth.
It would also be helpful to remove Roman numeral headings, which DadaDodo is currently parsing as if they were words (you can see an example of the result of this problem in the first sentence of this story).
I'd like to find a way to prevent DadaDodo from repeating the same word or set of words over and over and over, as it does occasionally (for example; another example).
Eventually, my goal is to be able to easily switch the .dat file to which the symbolic link points (i.e., the corpus read by the script) to be a .dat file generated from a subset of the whole collection, so that I can announce, e.g., Call of Cthulhu week, in which all stories are based on that particularly famous Lovecraft story; or perhaps the more broadly defined Great Old Ones week, based on a large subset; or Arthur Jermyn week, in which all stories are based on that particular story, which I've taught and which appears on my dissertation prospectus. (Have I mentioned that I'm a grad student in English?) But this is a long-term goal and won't be happening any time soon.
Again, I'd love feedback if people have thoughts or ideas about how this could be done better.
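As one small example of the kind of massaging mentioned above, Roman numeral headings could be filtered out before compiling. This is only a sketch, and it assumes a plain-text export of the corpus in which each heading sits on a line of its own; the function name is made up, and headings buried inside HTML tags would need a different approach.

```shell
#!/bin/sh
# Sketch: drop lines that consist solely of a Roman numeral, optionally
# followed by a period. Assumes plain text with headings on their own lines.
# strip_roman_headings is a hypothetical helper.

strip_roman_headings() {
    # -E: extended regular expressions; -v: print lines that do NOT match
    grep -Ev '^[[:space:]]*[IVXLC]+\.?[[:space:]]*$'
}

# Example (file names are placeholders):
# strip_roman_headings < stories.txt > stories-clean.txt
```

Note that a line containing nothing but the word "I" would be removed as well, which seems an acceptable price in a corpus of this kind.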

Change History

2014-01-04: First post on The Worst of Bad Lovecraft.
2014-06-11: generate.sh updated to make sure that post titles aren't repeated.
2014-06-15: generate.sh updated to adjust the allowable title length: now titles can be 10 to 70 characters. Allowable story length has also been shortened: stories are now 25 to 55 sentences.
2014-10-21: Last post successfully made by the old version of the script.
2014-11-04: Admin post about technical problems.
2014-12-01: The script updated to post via email. An admin post was made.
2014-12-01:
Several updates were made to the main corpus before it was recompiled into Markov chains:
- The text was massaged in several presentational ways. This doesn't affect the way that it compiles, because I export it to plain text to do so anyway, and that erases styling information, but it does make it easier for me to work with. For instance, I turned off widow/orphan control on all paragraphs.
- Roman and Arabic numeral section numbering was removed.
- Titles were moved to title case.
Hopefully, this all produces a small but perceptible increase in the quality of the output generated by the algorithm.
2014-12-01: Announced Arthur Jermyn week in an admin post.
2014-12-01: Found a couple of small errors that were causing problems: the first write to the temporary file wasn't overwriting the old contents (it needed the bash redirection > instead of >> on the first write in the script, which didn't happen with the new version); and the script needed to wait for data to be written before emailing it off (added ; sync to the last write to the temporary file before emailing).
2014-12-02: I've still been getting occasional blank posts to the blog (this is a recurring problem since I started posting by email), so I tweaked the script again: now, after writing, the script runs sync, waits 30 seconds, and runs sync again. This is probably overkill, but it's really a small amount of extra time to have a script that runs in the background wait, and hopefully it eliminates the problem.
2014-12-05: I think I've finally figured out the actual reason why I've been getting blank posts: the script was insufficiently specific about file location. I've tweaked it a bit to actually fix the problem. (I think. But then, I've said this before.) Here's the story: (what is now) line 58 of the script is the line that actually posts the generated story to Tumblr; it currently reads cat /lovecraft/post.txt | mail -s "$TITLE" [your post-to-Tumblr-by-email email address] (yes, I've redacted my own email address here). What's been tweaked in this script? The full pathname to post.txt has been added (by adding /lovecraft/). The problem causing the blank posts, I think, was that the script, when executed as a cron job, was not being run with /lovecraft as the current working directory ... as it typically has been when I test the script in a terminal.
Facepalm. Hopefully, I've actually fixed the problem now. Let's see. If I don't get any blank posts over the next few days, I'll remove some of the other solution attempts gradually and make sure I don't reintroduce the problem.
2014-12-12: Announced a week of posts based on Lovecraft's novella The Shadow over Innsmouth. (I taught Innsmouth in my summer 2013 course on Science Fiction; you can see slide shows here and here.) Because I felt that Innsmouth is such a rich text (and because, you know, end of the quarter in grad school), I let Innsmouth week run until 27 December.
2014-12-20: Removed some of the superfluous code in the script that was added when I was trying to find the cause of the empty posts. No behavior changes should have happened in relation to this.
2014-12-27: Announced a week (or whatever; we'll see) of posts based on Lovecraft's story The Doom that Came to Sarnath, the first of the Dream Cycle stories to have its own week.
2015-01-06: Announced a week (or so) of stories based on Lovecraft's stories The Terrible Old Man and The Strange High House in the Mist.
2015-01-07: Discovered that Tumblr tags have a maximum length ... at least, tags on Tumblr posts that are created by sending an email to your automatic post-to-Tumblr address. The length requirement seems not to apply to posts created directly in Tumblr's web interface. This limitation is, of course, (a) totally sensible, and (b) extremely well documented. As one would, of course, expect from Tumblr. Incidentally, that maximum length is 32 characters, not counting any opening hashtag/pound sign.
Tweeted about this. Twice.

Anyway, this is why all posts since the announcement of Terrible Old Man week have ended with a trailing st: they were tagged #The Strange High House in the Mist, which is 34 characters, not including the opening hashtag/pound sign, so Tumblr just silently chopped off the last two characters from the hashtag, converting it to #The Strange High House in the Mi, and adding the left-over st to the end of the story itself. Which is, once again, totally sensible, what with Tumblr being Tumblr. So I altered the script so that future stories written during this period would be tagged #Strange High House in the Mist instead of #The Strange High House in the Mi and manually altered all of the automatically generated stories written so far (1; 2; 3; 4; 5; 6; 7; 8), and the initial announcement so that they're consistent. Which was totally not a problem at all.
2015-06-07: Announced a week (or so) of stories based on Lovecraft's story The Thing on the Doorstep. I'll try not to let this week last five months, but, you know, dissertation.
2015-10-07: Announced a week (or so) of stories based on Lovecraft and R.H. Barlow's story Till A' the Seas. Also, summer totally happened and I played with this.
2015-10-08: Announced that a new script is generating this content, which means that this page that you're reading right now is pretty badly out of date. A note to that effect was placed at the top. This document will hopefully be revised soon.
2015-10-08: This post is the last that will be made on the document describing the AutoLovecraft posting mechanism based on the bash script; that document is being relegated to archival status. Updates from this point forward will be posted in the updated technical description.
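A footnote to the 2015-01-07 entry: a check along these lines would catch an over-long tag before it gets chopped. It's a minimal sketch, assuming the 32-character limit observed for post-by-email still holds; tag_fits and MAX_TAG_LENGTH are names invented for the example, and nothing like this is claimed to be in the actual script.

```shell
#!/bin/sh
# Sketch: refuse tags longer than the limit observed for post-by-email.
# tag_fits and MAX_TAG_LENGTH are illustrative names.

MAX_TAG_LENGTH=32   # per the 2015-01-07 entry; not counting a leading #

# Succeed if TAG (with or without a leading #) fits within the limit.
# Usage: tag_fits TAG
tag_fits() {
    tag=${1#"#"}                       # drop one leading # if present
    [ "${#tag}" -le "$MAX_TAG_LENGTH" ]
}

# Example:
# tag_fits 'The Strange High House in the Mist' || echo "tag too long" >&2
```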