How The Worst of Bad Lovecraft Tumblr Account's Content Is Generated
If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.
This project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.
I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using my Python-based text analysis tool, saved the compiled data, opened a new Tumblr account, wrote a script to generate stories based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.
The basic setup practice was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.
All of this happens under Crunchbang++, or whatever Linux distro I'm currently using. With some minimal adaptation, though, this should work on more or less any POSIX-compliant operating system.
I took the collection of Lovecraft stories and converted it to plain text, then parsed it with my Python-based Markov text generator, which I've developed in tandem with another of my automatic text blogs. I initially performed practically no pre-processing on the text, unlike with the Irish Lit AutoTweets account, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. (As the blog continues to produce content, though, the corpus is gradually being edited; you can see the source texts for the blog on GitHub, if you'd like.)
I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Then I created a symbolic link to the Markov chain–based generator inside the /lovecraft folder so I can easily use the text generator from the script that interacts with Tumblr. Once that was done, I compiled the textual corpus into a useful statistical set with:
text_generator.py -m 2 -c 5 -i corpora/entire-corpus.txt -o corpora/chains.dat
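To give a rough idea of what "compiling" a corpus into Markov chain data can involve, here is a minimal word-level sketch. It is not the code in text_generator.py, and it makes no claim about what the -m and -c flags above mean; the function names, the order of 2, and the use of pickle as a stand-in for saving compiled data are all assumptions made for illustration.

```python
import pickle
import random
from collections import defaultdict


def build_chains(words, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    chains = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chains[state].append(words[i + order])
    return dict(chains)


def generate(chains, length=20, seed=None):
    """Walk the chains from a random starting state, up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chains))
    output = list(state)
    while len(output) < length:
        followers = chains.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Tiny stand-in corpus; a real run would read a large plain-text file instead.
corpus = "the old ones were the old ones are and the old ones shall be".split()
chains = build_chains(corpus, order=2)

# Hypothetical persistence step: serialize the compiled data so later runs can reuse it.
blob = pickle.dumps(chains)
sample = generate(pickle.loads(blob), length=12, seed=1)
```

The point of saving the compiled data is that the expensive pass over a large corpus happens once, while each later run only loads the data and walks it.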
After that, it was a matter of writing and testing a Python 3 script that generates chunks of text that are within acceptable length parameters (currently defined as 25 to 70 sentences, though that's subject to change) and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several "about this blog" documents on the blog.
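The length check described above can be sketched as a simple retry loop. This is only an illustration under stated assumptions: `make_text` is a hypothetical callable standing in for whatever produces a draft, and the regex-based sentence count is a crude approximation, not necessarily how the real script counts sentences.

```python
import re

MIN_SENTENCES, MAX_SENTENCES = 25, 70  # the range quoted above


def count_sentences(text):
    """Crude count: runs of ., ! or ? followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?:\s|$)", text))


def generate_story(make_text, max_attempts=1000):
    """Keep asking `make_text` (hypothetical) for drafts until one fits the range."""
    for _ in range(max_attempts):
        candidate = make_text()
        if MIN_SENTENCES <= count_sentences(candidate) <= MAX_SENTENCES:
            return candidate
    raise RuntimeError("no draft of acceptable length after %d attempts" % max_attempts)
```

Capping the number of attempts means a run that can never succeed fails loudly instead of looping forever in the background.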
Some of the work of communicating with Tumblr is done by my social media–handling module, but that module itself relies on Michael Helmick's python-tumblpy module. It also relies on a set of authentication constants that I store in yet another importable Python module, which I don't make public; they're my private authentication constants, after all, and sharing them with the world would allow anyone to post to the account.
In more detail, the script:
Uses text_generator.py to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters, but in the past I've found that allowing for short titles tends to result in very short titles that pop up over and over, so I lengthened the requirement to between 40 and 90 characters, which seemed to be about right. (Since 11 June 2014, the script checks its title against a list of all titles that have been used before, so this is no longer a problem, and from that date the script generates titles of 10 to 70 characters.)
Uses text_generator.py to generate the story.
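The title step, with its length limits and its check against previously used titles, might look something like the sketch below. The names `pick_title`, `make_title`, and `used_titles` are hypothetical, and how the real script stores its list of used titles is not shown here; the 10 to 70 character limits are the ones quoted above.

```python
def pick_title(make_title, used_titles, min_len=10, max_len=70, max_attempts=1000):
    """Return the first candidate that fits the length limits and is not already used.

    `make_title` is a hypothetical callable that returns one candidate title;
    `used_titles` is a set of titles already posted, updated in place.
    """
    for _ in range(max_attempts):
        title = make_title().strip()
        if min_len <= len(title) <= max_len and title not in used_titles:
            used_titles.add(title)
            return title
    raise RuntimeError("no acceptable unused title found")
```

With a very small source text, every title that fits the limits can eventually be used up, which is why the sketch gives up after a fixed number of attempts rather than retrying indefinitely.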
Note that running this script can actually take quite a while: up to a minute, on my laptop. This is substantially longer than running the generate.py script for the Irish Lit AutoTweets account.
Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:
0 9,21 * * * /lovecraft/generate.py
I also installed an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on, even if the computer's not turned on at the times when the cron job is supposed to run. The line in /etc/anacrontab looks like this:
1 25 cron.daily /lovecraft/generate.py
You're welcome to download, adapt, and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under the GNU GPL, either version 3 or, at your option, any later version. You'll need to make it executable (chmod +x generate.py), of course. You'll also need to download several other modules I've written on which it depends; URLs for these modules are in source code comments. If you find the script useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements.
Historical note: the underlying process generating the stories has changed greatly since the first post on the blog: content used to be written by Jamie Zawinski's DadaDodo and shipped to Tumblr via email and/or the tumblr-rb rubygem. The glue code used to be a Linux bash script rather than Python. In fact, really the only thing that hasn't changed since the beginning is that the underlying operating system is a Linux distribution. If you want to trace back through the history of changes to the toolchain generating the blog content, read through Change History, below.
The text generator doesn't understand that, for instance, the use of nonstandard diction is limited to the speech of certain characters, and so these words appear at random in the program's output. One way to fix this would be to spell-check the text, and I'm working through this, but it's going to take quite a while to spell-check nearly 700 pages of text. A downside is that this would likely move Lovecraft's text more toward contemporary American usage, which is kind of a shame; but I suspect that it's better than random irruptions of dialect associated with characters like Zadok Allen in The Shadow over Innsmouth.
generate.sh updated to make sure that post titles aren't repeated.
generate.sh updated to adjust the allowable title length: now titles can be 10 to 70 characters. Allowable story length has also been shortened: stories are now 25 to 55 sentences.
generate.sh script, which was a bash script that posted to Tumblr via the tumblr-rb rubygem. An admin post was made.
Hopefully, all of this produces a small but perceptible increase in the quality of the output generated by the algorithm.
Arthur Jermyn week in an admin post.
generate.sh script that were causing problems: the first write to the temporary file wasn't overwriting it (needed bash redirection > on the first write in the script, which didn't happen with the new version); and the script needed to wait for data to be written before emailing it off (added ; sync to the last write to the temporary file before emailing).
sync, waits 30 seconds, and runs sync again. This is probably overkill, but it's really a small amount of extra time to have a script that runs in the background wait, and hopefully it eliminates the problem.
cat /lovecraft/post.txt | mail -s "$TITLE" [your post-to-Tumblr-by-email email address] (yes, I've redacted my own email address here). What's been tweaked in this script? The full pathname to post.txt has been added (by adding /lovecraft/). The problem causing the blank posts, I think, was that the script, when executed as a cron job, was not being run with /lovecraft as the current working directory ... as it typically has been when I test the script in a terminal.
Facepalm. Hopefully, I've actually fixed the problem now. Let's see. If I don't get any blank posts over the next few days, I'll remove some of the other solution attempts gradually and make sure I don't reintroduce the problem.
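The working-directory problem described above is a common one with scheduled jobs, and the same idea as the full-pathname fix carries over to Python. The sketch below is illustrative only: `data_path` is a hypothetical helper, not something from the actual scripts, and it assumes the data files live in the same directory as the script.

```python
from pathlib import Path


def data_path(script_file, name):
    """Resolve `name` relative to the directory containing the script,
    rather than relative to whatever the current working directory is."""
    return Path(script_file).resolve().parent / name


# In a real script this would typically be data_path(__file__, "post.txt");
# the literal path below is only a stand-in for illustration.
post_file = data_path("/lovecraft/generate.py", "post.txt")
```

Because the result is always absolute, it makes no difference whether the script is started from a terminal inside the project folder or by cron from some other directory.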
Anyway, this is why all posts since the announcement of Terrible Old Man week have ended with a trailing st: they were tagged #The Strange High House in the Mist, which is 34 characters, not including the opening hashtag/pound sign, so Tumblr silently chopped the last two characters off the hashtag, converting it to #The Strange High House in the Mi, and added the left-over st to the end of the story itself. Which is, once again, totally sensible, what with Tumblr being Tumblr. So I altered the script so that future stories written during this period would be tagged #Strange High House in the Mist instead of #The Strange High House in the Mi, and manually altered all of the automatically generated stories written so far (1; 2; 3; 4; 5; 6; 7; 8) and the initial announcement so that they're consistent. Which was totally not a problem at all.
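A small guard in the posting script could catch over-long tags before they are sent. The sketch below assumes a 32-character limit, which is only inferred from the behavior described above (a 34-character tag lost its last two characters); it is not taken from Tumblr's documentation, and the helper names are hypothetical.

```python
TAG_LIMIT = 32  # assumption inferred from the truncation described above


def split_tag(tag, limit=TAG_LIMIT):
    """Show what survives truncation at `limit` characters and what is left over."""
    return tag[:limit], tag[limit:]


def check_tags(tags, limit=TAG_LIMIT):
    """Raise an error listing any tags that are longer than `limit` characters."""
    too_long = [tag for tag in tags if len(tag) > limit]
    if too_long:
        raise ValueError("tags longer than %d characters: %r" % (limit, too_long))
    return list(tags)


kept, leftover = split_tag("The Strange High House in the Mist")
```

Failing before posting is arguably preferable here, since a silently truncated tag only shows up after the post is already live.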
week last five months, but, you know, dissertation.
Simply because this is such a big change, I'm also retiring the technical write-up I was maintaining at the time, marking it "of historical interest" (it describes a viable way to maintain the AutoLovecraft blog, but not the one I'm currently using) and creating a new version. In any case, the current version of this document is out of date, so this is a good time to make it current.
At the Mountains of Madness month having been so much fun, I let it stretch on; now I'm announcing a week (or whatever) of stories based on Lovecraft's Cool Air.
hAtom markup to this page's update list.
Azathoth before it, is very, very short, and the minimalist Markov chains that it generates have few branches, and all of the story titles that could meet basic required parameters for the title had already been generated: there were five copies of the script running, each trying to produce a title. Loosened the title requirements temporarily and generated enough stories to make up for the missing days.