How The Worst of Bad Lovecraft Tumblr Account's Content Is Generated
If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.
In broad terms, this project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.
I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using DadaDodo, saved the compiled data, opened a new Tumblr account, wrote a script to generate "stories" based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.
You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me.
Due to a (no longer quite so) recent change, much of this information is no longer accurate. I've struck through the parts that were already outdated when I made this announcement, but there are other inconsistencies as of this writing (8 October 2015). I hope to have a chance to write up a new description soon. Short version: I'm not using the tumblr-rb rubygem from a bash script anymore; I've rewritten the script in Python, and it interacts with Tumblr via the pytumblr library.
The basic setup practice was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.
All of this happens under Crunchbang Linux. I tend to use the Openbox window manager and Tilda terminal emulator, though that's not really germane. Really, with some minimal modification, this should work on more or less any POSIX-compliant operating system.
I took the collection of Lovecraft stories and parsed it with DadaDodo, which I already had installed. DadaDodo can process HTML as well as plain text. Unlike with the Irish Lit AutoTweets account, I did practically no pre-processing, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. A long-term goal is to massage this corpus in various ways, though I have not actually done any real work in this regard yet.
I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Once that was done, I compiled the textual corpus into a useful statistical set with:
dadadodo -w 10000 -o chains.dat -c 1 CollectedStories.html
As part of my long-term goal to make this process more flexible, I created a sub-folder, corpora, and made a symbolic link (/lovecraft/chains.dat, which points to /lovecraft/corpora/chains.dat). I've imported the CollectedStories.html file into LibreOffice and am gradually spell-checking and otherwise massaging it, though, again, this ongoing process hasn't yet borne fruit.
After that, it was a matter of writing and testing a script that generates chunks of text that are within acceptable length parameters and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several "about this blog" documents.
The actual work of communicating with Tumblr is done by a RubyGem called tumblr-rb. Tumblr-rb sends a YAML file to Tumblr, as specified in the Tumblr API. YAML (which, annoyingly, is a recursive acronym for "YAML Ain't Markup Language," as if anyone still thought that something Scott Adams was mocking as overused twenty years ago is edgy, and in order to point out that the syntax of the language marks up not documents, but data) is simply a plain-text Unicode file that allows for a limited amount of markup to highlight the data syntax.
To install tumblr-rb and rubygems:
sudo apt-get install rubygems
sudo gem install tumblr-rb
Easy enough, right?
The first time tumblr-rb is called, it needs to be called like this:
tumblr authorize
to get it to store an API key in ~/.tumblr/. Unlike with Irish Lit AutoTweets, I'm fairly sure that I don't want to use tumblr-rb for anything other than this script, so I haven't bothered to go through the process of moving the API key elsewhere and specifying it in the script's tumblr-rb command line, though this would be easy enough to do.
Once all the pieces were in place, all I needed to do was to put the finishing touches on the script, which is just a quick Bash script that does a few things:
- Calls dadadodo to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters, but in the past I've found that allowing for short titles means that DadaDodo repeats titles a great deal, so I lengthened the requirement to between 40 and 90 characters, which seemed to be about right. Since 11 June 2014, the script checks its title against a list of all titles that have been used before, so this is no longer a problem, and from that date generate.sh generates titles of 10 to 70 characters. After generating an acceptable new title, the script uses sed to strip out the leading whitespace that dadadodo inserts.
- Creates a temporary file (post.txt) with some metadata:
---
type: text
tags: H.P. Lovecraft,automatically generated text,Patrick Mooney,dadadodo
title: {THE POST'S TITLE}
Yes, there are two blank lines at the end of that front matter.
- Calls dadadodo to generate the story, copying it to the end of the temporary file.
- Uses sed again to strip out leading whitespace from the post.
- Sends the post to Tumblr with tumblr-rb.
- Adds the new title to titles.txt.
- Replaces the old post.txt file with a new one.
Note that running this script can actually take quite a while: up to several minutes on my laptop. This is substantially longer than running the generate.sh script for the Irish Lit AutoTweets account, because DadaDodo takes much longer to generate a bigger chunk of text based on a bigger corpus.
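The title and front-matter steps in the list above can be sketched in Python. This is only an illustration under stated assumptions, not the actual generate.sh or its Python successor: pick_title, build_post, and the generate callable (a stand-in for one dadadodo run) are hypothetical names, and the front matter simply mirrors the layout shown above, including the two blank lines.

```python
from typing import Callable, Iterable

# Title length bounds quoted in the text (10 to 70 characters).
MIN_TITLE, MAX_TITLE = 10, 70

# Tag list copied from the front matter shown in the text.
TAGS = "H.P. Lovecraft,automatically generated text,Patrick Mooney,dadadodo"


def pick_title(generate: Callable[[], str], used: Iterable[str],
               max_tries: int = 1000) -> str:
    """Return the first candidate that fits the bounds and is not in `used`.

    `generate` is a hypothetical stand-in for a single dadadodo call.
    """
    seen = set(used)
    for _ in range(max_tries):
        # The text strips leading whitespace with sed; strip() also drops
        # any trailing newline from the generator's output.
        title = generate().strip()
        if MIN_TITLE <= len(title) <= MAX_TITLE and title not in seen:
            return title
    raise RuntimeError("no acceptable title found")


def build_post(title: str, story: str) -> str:
    """Assemble front matter, two blank lines, and the story text."""
    front_matter = f"---\ntype: text\ntags: {TAGS}\ntitle: {title}\n"
    return front_matter + "\n\n" + story.lstrip() + "\n"
```

Passing the generator in as a callable keeps the length and repetition checks testable without DadaDodo installed; a real run would wrap a subprocess call instead.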
You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. You'll need to make it executable (chmod +x generate.sh), of course. If you find it useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements. I'm happy to answer questions, but can't provide extensive support or bring you up to speed if the explanation here is unintelligible to you.
Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:
0 9,21 * * * /lovecraft/generate.sh
9 a.m. and 9 p.m. in my time zone, of course, are noon and midnight where Lovecraft lived for most of his life.
I also installed an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on. The line in /etc/anacrontab looks like this:
1 25 cron.daily /lovecraft/generate.sh
I often specify full paths in the script because cron jobs may not run in an environment where variables such as $PATH are set as expected.
Switches for the dadadodo invocation (that generates the story):
-l /lovecraft/chains.dat
-c $LENGTH
-w 10000
Switches for the tumblr (i.e., the tumblr-rb executable) invocation:
post (... the tumblr executable can do.)
--host=autolovecraft.tumblr.com
--credentials=/lovecraft/.tumblr
/lovecraft/post.txt
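As a rough illustration, the two invocations listed above could be assembled as argument lists in Python. The function names here are hypothetical, the executables are given by bare name rather than the full paths the script reportedly uses, and the 25 to 55 sentence range is taken from the change notes further down this page rather than from the switch list itself.

```python
import random
from typing import List

CHAINS = "/lovecraft/chains.dat"   # compiled DadaDodo data, as named in the text
POST_FILE = "/lovecraft/post.txt"  # temporary post file, as named in the text


def random_story_length(rng: random.Random) -> int:
    """Pick a sentence count in the 25-55 range mentioned in the change notes."""
    return rng.randint(25, 55)


def story_command(length: int) -> List[str]:
    """Argument list for the dadadodo call that generates the story."""
    return ["dadadodo", "-l", CHAINS, "-c", str(length), "-w", "10000"]


def post_command() -> List[str]:
    """Argument list for the tumblr (tumblr-rb) call that uploads the post."""
    return ["tumblr", "post",
            "--host=autolovecraft.tumblr.com",
            "--credentials=/lovecraft/.tumblr",
            POST_FILE]
```

Either list can be handed to subprocess.run; building the commands as lists avoids shell quoting problems.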
Fix me: write this section.
DadaDodo really doesn't understand English grammar as well as I'd like. I've written about this more extensively in relation to the Irish Lit AutoTweets account, so I'll just say again here that if anyone has a suggestion for a Linux program that can be run from a bash script and does a better job of doing what DadaDodo does (i.e., generating text more or less grammatically, but randomly, and based on an existing corpus of text), I'd love to hear about it.
... "Call of Cthulhu" week, in which all stories are based on that particularly famous Lovecraft story; or perhaps the more broadly defined "Great Old Ones" week, based on a large subset; or "Arthur Jermyn" week, in which all stories are based on that particular story, which I've taught and which appears on my dissertation prospectus. (Have I mentioned that I'm a grad student in English?) But this is a long-term goal and won't be happening any time soon.
generate.sh updated to make sure that post titles aren't repeated.
generate.sh updated to adjust the allowable title length: now titles can be 10 to 70 characters. Allowable story length has also been shortened: stories are now 25 to 55 sentences.
Hopefully, this all produces a small but perceptible increase in the quality of the output generated by the algorithm.
... "Arthur Jermyn" week in an admin post.
... << instead of < on the first write to the script, which didn't happen with the new version); and needed to wait for data to be written before emailing it off (added ; sync to the last write to the temporary file before emailing).
... sync, waits 30 seconds, and runs sync again. This is probably overkill, but it's really a small amount of extra time to have a script that runs in the background wait, and hopefully it eliminates the problem.
... cat /lovecraft/post.txt | mail -s "$TITLE" [your post-to-Tumblr-by-email email address] (yes, I've redacted my own email address here). What's been tweaked in this script? The full pathname to post.txt has been added (by adding /lovecraft/). The problem causing the blank posts, I think, was that the script, when executed as a cron job, was not being run with /lovecraft as the current working directory ... as it typically has been when I test the script in a terminal.
Facepalm. Hopefully, I've actually fixed the problem now. Let's see. If I don't get any blank posts over the next few days, I'll remove some of the other solution attempts gradually and make sure I don't reintroduce the problem.
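The working-directory problem described above is easy to sidestep by building every file path from a fixed base directory. Here is a minimal Python sketch of that idea; project_path is a hypothetical helper, and /lovecraft is the symbolic link described earlier.

```python
from pathlib import Path

# Project directory named in the text; everything else is resolved against it.
BASE_DIR = Path("/lovecraft")


def project_path(name: str, base: Path = BASE_DIR) -> Path:
    """Return the path to a project file without relying on the current directory."""
    return base / name


# A cron job may start in some other directory, so a bare "post.txt" could
# point at the wrong (or a nonexistent) file; these do not depend on that.
POST_FILE = project_path("post.txt")
TITLES_FILE = project_path("titles.txt")
```

Because the paths never consult the current working directory, the script should behave the same from a terminal and from cron.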
Anyway, this is why all posts since the announcement of Terrible Old Man week have ended with a trailing st: they were tagged #The Strange High House in the Mist, which is 34 characters, not including the opening hashtag/pound sign, so Tumblr just silently chopped off the last two characters from the hashtag, converting it silently to #The Strange High House in the Mi, and adding the left-over st to the end of the story itself. Which is, once again, totally sensible, what with Tumblr being Tumblr. So I altered the script so that future stories written during this period would be tagged #Strange High House in the Mist instead of #The Strange High House in the Mi and manually altered all of the automatically generated stories written so far (1; 2; 3; 4; 5; 6; 7; 8), and the initial announcement so that they're consistent. Which was totally not a problem at all.
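A small check run before posting could catch this kind of truncation in advance. The sketch below assumes a 32-character limit, which is inferred only from the behavior described above (a 34-character tag losing its last two characters), not from Tumblr's documentation; overlong_tags is a hypothetical helper name.

```python
# Assumed limit, inferred from the truncation described in the text.
TAG_LIMIT = 32


def overlong_tags(tags, limit=TAG_LIMIT):
    """Return the tags that are longer than `limit` characters."""
    return [tag for tag in tags if len(tag) > limit]


tags = [
    "The Strange High House in the Mist",  # 34 characters: would be cut short
    "Strange High House in the Mist",      # 30 characters: fits
]
too_long = overlong_tags(tags)  # contains only the 34-character tag
```

Refusing to post when the returned list is non-empty would surface the problem before it reaches the blog.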
... week last five months, but, you know, dissertation.