How the Worst of Bad Lovecraft Tumblr Account's Content Is Generated

If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.

The Short Version

In overview, this project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.

I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using my Python-based text analysis tool, saved the compiled data, opened a new Tumblr account, wrote a script to generate stories based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful or entertaining, I'd be grateful if you'd Flattr me.

The (Much) Longer Version

The basic setup process was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.

All of this happens under Crunchbang++, or whatever Linux distro I'm currently using. With some minimal adaptation, though, this should work on more or less any POSIX-compliant operating system.

I took the collection of Lovecraft stories and converted it to plain text, then parsed it with my Python-based Markov text generator, which I've developed in tandem with another of my automatic text blogs. I initially performed practically no pre-processing on the text, unlike with the Irish Lit AutoTweets account, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. (As the blog continues to produce content, though, the corpus is getting gradually edited; you can see the source texts for the blog on GitHub, if you'd like.)

I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Then I created a symbolic link to the Markov chain–based generator inside the /lovecraft folder so I can easily use the text generator from the script that interacts with Tumblr. Once that was done, I compiled the textual corpus into a useful statistical set with:

text_generator.py -m 2 -c 5 -i corpora/entire-corpus.txt -o corpora/chains.dat
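For readers who want a feel for what "compiling" a corpus into chain data means, here is a minimal sketch of a generic word-level Markov model. It is not the actual text_generator.py: the function names build_chains and generate are made up for this example, and it assumes that -m 2 means a chain length of two words.

```python
import random
from collections import defaultdict


def build_chains(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chains = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chains[state].append(words[i + order])
    return dict(chains)


def generate(chains, length=30, seed=None):
    """Walk the chains from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chains))
    output = list(state)
    while len(output) < length:
        followers = chains.get(state)
        if not followers:
            # Dead end: this state was only ever seen at the very end of the text.
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


if __name__ == "__main__":
    sample = "the old man saw the old house and the old man fled the old house"
    chains = build_chains(sample, order=2)
    print(generate(chains, length=12, seed=1))
```

Repeated followers are kept in the lists on purpose, so that a word that follows a given pair more often in the corpus is proportionally more likely to be picked.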

After that, it was a matter of writing and testing a Python 3 script that generates chunks of text that are within acceptable length parameters (currently defined as 25 to 55 sentences, though that's subject to change) and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several "about this blog" documents on the blog.

Some of the work of communicating with Tumblr is done by my social media–handling module, but that module itself relies on Michael Helmick's python-tumblpy module. It also relies on a set of authentication constants that I store in yet another importable Python module, which I don't make public; they're my private authentication constants, after all, and sharing them with the world would allow anyone to post to the account.
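A rough sketch of what a posting call through python-tumblpy can look like follows. It is not the social media module itself: the function names and the keys of the credentials dictionary are hypothetical, and the Tumblpy constructor and post() call are written as I understand them from that library's README, so check them against the version you install.

```python
def build_text_post(title, body, tags):
    """Assemble the parameters for a Tumblr text post."""
    return {
        "type": "text",
        "title": title,
        "body": body,
        # The Tumblr API takes tags as a single comma-separated string.
        "tags": ",".join(tags),
    }


def post_story(credentials, blog_url, title, body, tags):
    """Send one story to Tumblr.

    `credentials` is a hypothetical dict holding the four OAuth strings; in the
    arrangement described above they would come from a private, unpublished module.
    """
    # Imported here so the rest of the sketch runs without the library installed.
    from tumblpy import Tumblpy  # third-party: pip install python-tumblpy

    client = Tumblpy(
        app_key=credentials["consumer_key"],
        app_secret=credentials["consumer_secret"],
        oauth_token=credentials["oauth_token"],
        oauth_token_secret=credentials["oauth_token_secret"],
    )
    return client.post("post", blog_url=blog_url,
                       params=build_text_post(title, body, tags))


if __name__ == "__main__":
    print(build_text_post("A Title", "<p>A story.</p>", ["H.P. Lovecraft", "automatic text"]))
```

Keeping the parameter-building step separate from the network call makes it possible to inspect exactly what would be sent without posting anything.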

In more detail, the script:

Uses text_generator.py to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters, but in the past I've found that allowing for short titles tends to result in very short titles that pop up over and over, so I lengthened the requirement to between 40 and 90 characters, which seemed to be about right. (Since 11 June 2014, the script has checked its title against a list of all titles that have been used before, so this is no longer a problem, and since 15 June 2014 the script has generated titles of 10 to 70 characters.)
Picks a number of sentences that the story should contain (currently, between 25 and 55, though this may change in the future), and calls text_generator.py to generate the story.
Sends the resulting story, its title, and a set of tags defined in the script to Tumblr using python-tumblpy.
Adds the title for the story to the list of story titles that have been used so far, titles.txt.
Along the way, it echoes the title, story, and some debugging information to standard output, in case I'm running the script manually.
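The steps above can be sketched roughly as follows. This is an illustration rather than the real generate.py: the generator and the Tumblr call are passed in as plain functions (make_title, make_story, and post are hypothetical names), and the retry limit is an addition of this sketch. The default limits are the ones given in the list.

```python
import random


def load_titles(path):
    """Return the set of titles already used; an absent file means none yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()


def record_title(path, title):
    """Append a newly used title to the list of used titles."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")


def pick_unused_title(make_title, used, min_len=10, max_len=70, max_tries=1000):
    """Ask make_title() for candidates until one fits the limits and is new."""
    for _ in range(max_tries):
        candidate = make_title().strip()
        if min_len <= len(candidate) <= max_len and candidate not in used:
            return candidate
    raise RuntimeError("no acceptable unused title found")


def run_once(make_title, make_story, post, titles_path, rng=random):
    """Generate one titled story, post it, and record the title."""
    used = load_titles(titles_path)
    title = pick_unused_title(make_title, used)
    story = make_story(rng.randint(25, 55))   # number of sentences
    post(title, story)
    record_title(titles_path, title)
    print(title)                              # echoed for manual runs
    print(story)
    return title, story
```

Giving up after a bounded number of attempts is a design choice worth considering: with a very short source text there may be no unused title left that fits the limits, and a script that retries forever will simply pile up behind the next scheduled run.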

Note that running this script can actually take quite a while — up to a minute, on my laptop. This is substantially longer than running the generate.py script for the Irish Lit AutoTweets account.

Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:

0 9,21 * * * /lovecraft/generate.py

9 a.m. and 9 p.m. in my time zone, of course, are noon and midnight where Lovecraft lived for most of his life.

I also install an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on, even if the computer's not turned on at the times when the cron job is supposed to run. The line in /etc/anacrontab looks like this:

1 25 cron.daily /lovecraft/generate.py

You're welcome to download, adapt, and use the script for your own purposes if you'd like, subject to certain conditions: it's licensed under the GNU GPL, either version 3 or, at your option, any later version. You'll need to make it executable (chmod +x generate.py), of course, and you'll also need to download several other modules I've written on which it depends; URLs for these modules are in source code comments. If you find the script useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements.

Historical note: the underlying process generating the stories has changed greatly since the first post on the blog: content used to be written by Jamie Zawinski's DadaDodo and shipped to Tumblr via email and/or the tumblr-rb rubygem. The glue code used to be a Linux bash script rather than Python. In fact, really the only thing that hasn't changed since the beginning is that the underlying operating system is a Linux distribution. If you want to trace back through the history of changes to the toolchain generating the blog content, read through Change History, below.

Future Improvements; Other Thoughts

Massaging the text to make it more amenable to processing by the text generator is a long-term goal, and the first step involves normalizing and standardizing Lovecraft's spelling. One current problem is that the Markov chain-based parser doesn't understand that, for instance, the use of nonstandard diction is limited to the speech of certain characters, and so these words appear at random in the program's output. One way to do this would be to spell-check the text, and I'm working through this, but it's going to take quite a while to spell-check nearly 700 pages of text. A downside is that this would likely move Lovecraft's text more toward contemporary American usage, which is kind of a shame; but I suspect that it's better than random irruptions of dialect associated with characters like Zadok Allen in The Shadow over Innsmouth.
There used to be more stuff in this list, but those ideas have been implemented.
Again, I'd love feedback if people have thoughts or ideas about how better automatic Lovecraft stories could be generated.
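As a small illustration of the spelling-normalization idea, here is a sketch that applies a hand-built table of replacements to a text, matching whole words only and keeping the original capitalization. The table and the function name are hypothetical; a real table would be built up gradually while working through the corpus.

```python
import re

# Hypothetical starter table: archaic spelling -> modern spelling.
REPLACEMENTS = {
    "shew": "show",
    "shewed": "showed",
    "shewn": "shown",
}


def normalize_spelling(text, replacements=REPLACEMENTS):
    """Replace whole-word matches from the table, preserving capitalization."""
    # Longest entries first, so a longer word is never shadowed by a shorter one.
    words = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        new = replacements[word.lower()]
        if word.isupper():
            return new.upper()
        if word[0].isupper():
            return new.capitalize()
        return new

    return pattern.sub(swap, text)


if __name__ == "__main__":
    print(normalize_spelling("Shew me what was shewn."))
```

A table like this only handles spellings someone has already noticed, so it complements a spell-checking pass rather than replacing it.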

Change History

2014-01-04: First post on The Worst of Bad Lovecraft.
2014-06-11: generate.sh updated to make sure that post titles aren't repeated.
2014-06-15: generate.sh updated to adjust the allowable title length: now titles can be 10 to 70 characters. Allowable story length has also been shortened: stories are now 25 to 55 sentences.
2014-10-21: Last post successfully made by the initial version of the generate.sh script, which was a bash script that posted to Tumblr via the tumblr-rb rubygem.
2014-11-04: Admin post about technical problems.
2014-12-01: The script updated to post via email instead of by using the tumblr-rb rubygem. An admin post was made.
2014-12-01: Several updates were made to the main corpus before it was recompiled into Markov chains:
- The text was massaged in several presentational ways. This doesn't affect the way that it compiles, because I export it to plain text to do so anyway, and that erases styling information, but it does make it easier for me to work with. For instance, I turned off widow/orphan control on all paragraphs.
- Roman and Arabic numeral section numbering was removed.
- Titles were moved to title case.
Hopefully, all of this results in a small but perceptible increase in the quality of the output generated by the algorithm.
2014-12-01: Announced Arthur Jermyn week in an admin post.
2014-12-01: Found a couple of small errors in the generate.sh script that were causing problems: the first write to the temporary file wasn't overwriting the file's previous contents (needed bash redirection > instead of >> on the first write to the file, which didn't happen with the new version); and needed to wait for data to be written before emailing it off (added ; sync to the last write to the temporary file before emailing).
2014-12-02: I've still been getting occasional blank posts to the blog (this is a recurring problem since I started posting by email), so I tweaked the script again: now, after writing, the script runs sync, waits 30 seconds, and runs sync again. This is probably overkill, but it's really a small amount of extra time to have a script that runs in the background wait, and hopefully it eliminates the problem.
2014-12-05: I think I've finally figured out the actual reason why I've been getting blank posts: the script was insufficiently specific about file location. I've tweaked it a bit to actually fix the problem. (I think. But then, I've said this before.) Here's the story: (what is now) line 58 of the script is the line that actually posts the generated story to Tumblr; it currently reads cat /lovecraft/post.txt | mail -s "$TITLE" [your post-to-Tumblr-by-email email address] (yes, I've redacted my own email address here). What's been tweaked in this script? The full pathname to post.txt has been added (by adding /lovecraft/). The problem causing the blank posts, I think, was that the script, when executed as a cron job, was not being run with /lovecraft as the current working directory ... as it typically has been when I test the script in a terminal.
Facepalm. Hopefully, I've actually fixed the problem now. Let's see. If I don't get any blank posts over the next few days, I'll remove some of the other solution attempts gradually and make sure I don't reintroduce the problem.
2014-12-12: Announced a week of posts based on Lovecraft's novella The Shadow over Innsmouth. (I've taught Innsmouth in my summer 2013 course on Science Fiction; you can see slide shows here and here.) Because I felt that Innsmouth is such a rich text (and because, you know, end of the quarter in grad school), I let Innsmouth week run until 27 December.
2014-12-20: Removed some of the superfluous code in the script that was added when I was trying to find the cause of the empty posts. No behavior changes should have happened in relation to this.
2014-12-27: Announced a week (or whatever; we'll see) of posts based on Lovecraft's story The Doom that Came to Sarnath, the first of the Dream Cycle stories to have its own week.
2015-01-06: Announced a week (or so) of stories based on Lovecraft's stories The Terrible Old Man and The Strange High House in the Mist.
2015-01-07: Discovered that Tumblr tags have a maximum length ... at least, Tumblr posts that are created by sending an email to your automatic post-to-Tumblr address do. The length requirement seems to not apply to posts created directly in Tumblr's web interface. This limitation is, of course, as one would expect from Tumblr, (a) totally sensible, and (b) extremely well documented. Incidentally, that maximum length is 32 characters, not counting any opening hashtag/pound sign.
Tweeted about this. Twice.

Anyway, this is why all posts since the announcement of Terrible Old Man week have ended with a trailing st: they were tagged #The Strange High House in the Mist, which is 34 characters, not including the opening hashtag/pound sign, so Tumblr just silently chopped off the last two characters from the hashtag, converting it to #The Strange High House in the Mi, and adding the left-over st to the end of the story itself. Which is, once again, totally sensible, what with Tumblr being Tumblr. So I altered the script so that future stories written during this period would be tagged #Strange High House in the Mist instead of #The Strange High House in the Mi, and manually altered all of the automatically generated stories written so far (1; 2; 3; 4; 5; 6; 7; 8) and the initial announcement so that they're consistent. Which was totally not a problem at all.
2015-06-07: Announced a week (or so) of stories based on Lovecraft's story The Thing on the Doorstep. I'll try not to let this week last five months, but, you know, dissertation.
2015-10-07: Announced a week (or so) of stories based on Lovecraft and R.H. Barlow's story Till A' the Seas. Also, summer totally happened and I played with this.
2015-10-08: Announced that a new script is generating this content, which means that the technical description is pretty badly out of date. A note to that effect was placed at the top. This document will hopefully be revised soon.
2015-10-08: This post is the last that will be made on the document describing the AutoLovecraft posting mechanism based on the bash script; that document is being relegated to archival status. Updates from this point forward will be posted in the updated technical description.
2015-10-16: Announced a week (or so) of stories based on Lovecraft's story The Tomb.
2015-10-24: Announced a week (or so) of stories based on Lovecraft's story The Temple.
2015-10-31: Announced a week (or whatever) of stories based on Lovecraft's story The Dunwich Horror.
2015-11-13: Announced a month (or whatever) of stories based on what is probably Lovecraft's most famous story, The Call of Cthulhu.
2015-11-24: Re-tagged twenty-nine stories that had been tagged The Call of Cthulhu but not The Call of Cthulhu month (an error on my part in setting up the script). This is the (chronologically) first story that needed to be re-tagged; this is the last. Also re-marked the HTML in those blog posts, because Tumblr takes the HTML generated by the script and displays it properly, but it gets munged if it's edited without re-marking paragraphs with <p> tags. Because Tumblr's interface is increasingly terrible, you know.
2015-11-25: Under-the-hood updates to the Python script that generates the posts. Hopefully, no changes are visible. However, here are the details: it was converted to run under Python 3.X instead of 2.X. Because the Pytumblr library doesn't run under Python 3, the script now uses the comparable Tumblpy library, instead. More substantially, though, there's a change to the text-handling logic: individual paragraphs are now wrapped with <p> tags, which I've never bothered to do before. The motive for this is yesterday's ninety-minute experience in editing tags: it turned out to be necessary to wrap each paragraph in paragraph tags manually, because Tumblr recognizes double-returns as marking paragraphs when the post is initially shipped off, but without paragraph tags, the story gets munged into one long paragraph if it's edited again unless the paragraph tags are manually added. So now the script wraps paragraphs properly in <p> tags in case future editing is necessary.
2015-12-13: Announced a week (or whatever) of stories based on Lovecraft's 1927 story The Very Old Folk.
2015-12-26: Announced a week (or whatever) of stories based on Lovecraft's 1935 story The Haunter in the Dark.
2016-02-04: Corrected about 90 stories that had been incorrectly tagged The Haunter in the Dark Week week. Sigh.
2016-02-04: Announced that the engine generating the text will henceforth be my Markov chain–based text generator instead of DadaDodo. This is a big change, and I'd like feedback on it.
Simply because this is such a big change, I'm also retiring the technical write-up I was maintaining at the time, marking it of historical interest (it describes a viable way to maintain the AutoLovecraft blog, but not the one I'm currently using) and creating a new version. In any case, the current version of this document is out of date, so this is a good time to make it current.
2016-02-09: Announced the rapid-fire posting of twenty stories based on Lovecraft's very short story Azathoth. It turned out not to be so rapid-fire: the story is so short that there wasn't enough variation in the sentence structure for the algorithm to find plausible, unused titles. Title length restrictions eventually had to be loosened substantially, though they've now been restored to what they were 24 hours earlier.
2016-02-09: Announced a week (or whatever) of stories based on Lovecraft's Beyond the Wall of Sleep.
2016-02-16: Announced a month (or whatever) of stories based on Lovecraft's novella At the Mountains of Madness.
2016-04-22: At the Mountains of Madness month having been so much fun, I let it stretch on; now I'm announcing a week (or whatever) of stories based on Lovecraft's Cool Air.
2016-08-18: Finally announced a new source text: Dagon. There have been some technical problems, briefly discussed in that same announcement.
2016-09-06: Finally announced a new source text: Ex Oblivione.
2016-09-14: Announced a new source text: From Beyond.
2016-10-15: Announced a new source text: The Nameless City.
2017-01-03: Announced a new source text: Nyarlathotep.
2017-01-18: Announced a new source text: The Colour out of Space.
2017-03-23: Announced a new source text: The Festival.
2017-04-10: Announced a new source text: The Unnamable.
2017-04-12: Increasing the Markov chain length to 3 as an experiment. Not bothering to announce the change on the blog. (But that's what motivated posting this story, this story, and this story.) And yes, we're still working with The Unnamable.
2017-04-13: Finally adding the source texts to the GitHub repository so changes can be easily tracked. Also, finally added hAtom markup to this page's update list.
2017-04-19: Announced a new source text: The Dreams in the Witch-House.
2017-04-21: I'm experimenting with using a new version of the text generator to generate the text on the blog. Didn't announce this in an admin post; hopefully, the change will be non-detectable without reading through these notes.
2017-04-27: Announced a new source text: The Case of Charles Dexter Ward.
2017-05-05: Announced a new source text: The Dream-Quest of Unknown Kadath.
2017-05-20: Announced a pair of new source texts: The Silver Key and its sequel, Through the Gates of the Silver Key.
2017-06-05: Announced a new source text: Polaris.
2017-06-19: Saw that the blog has not been posting for the last two and a half days or so: it turns out on investigation that Polaris, like Azathoth before it, is very very short, and the minimalist Markov chains that it generates have few branches, and all of the story titles that could meet basic required parameters for the title had already been generated: there were five copies of the script running, each trying to produce a title. Loosened the title requirements temporarily and generated enough stories to make up for the missing days.
2017-06-19: Announced a new source text: The Quest of Iranon.
2017-07-12: Announced a new source text: The Cats of Ulthar.
2017-09-22: Announced a new source text: The Other Gods.
2017-12-04: Announced a new source text: The Hound.
2018-04-02: Announced a new source text: Celephaïs.
2018-06-03: Announced a new source text: The White Ship.
2018-07-18: Announced a new source text: The Alchemist. (Relevant Git commit.)
2018-12-31: Announced a new source text: Memory. (Relevant Git commit.)
2020-05-06: Announced a new source text: In the Vault. (Relevant Git commit.)
2023-12-05: Announced a new source text: Memory. (Relevant Git commit.)
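
The 2015-11-25 entry above describes wrapping each paragraph of a story in <p> tags before it is sent to Tumblr. A minimal sketch of that step follows, assuming paragraphs are separated by blank lines; the function name is hypothetical, and the HTML escaping is an addition of this sketch that the entry doesn't mention.

```python
import html
import re


def wrap_paragraphs(story):
    """Turn blank-line-separated paragraphs into <p>...</p> blocks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", story)]
    return "\n".join(
        # Escape &, < and > so stray characters can't be read as markup.
        "<p>{}</p>".format(html.escape(p, quote=False))
        for p in paragraphs
        if p
    )


if __name__ == "__main__":
    print(wrap_paragraphs("First paragraph.\n\nSecond paragraph.\n"))
```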