How The Worst of Bad Lovecraft Tumblr Account's Content Is Generated
If you're looking for non-technical information about The Worst of Bad Lovecraft, you should go here, instead. The document you're reading right now contains technical information about how the blog posts are generated.
This project operates much like my earlier Irish Lit AutoTweets account, and if you get the basic idea there, you probably get the basic idea here, too. Still, here's an overview of what happens.
I downloaded a freely available collection of H.P. Lovecraft's stories from Project Gutenberg. I generated compiled probability data from this corpus using my Python-based text analysis tool, saved the compiled data, opened a new Tumblr account, wrote a script to generate stories based on that compiled data and upload individual selections to that Tumblr account, installed it as a cron job on my laptop that runs twice a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.
The basic setup practice was very similar to that described for the Irish Lit AutoTweets project, and this page primarily documents differences from that project. You might want to take a look at the earlier project for some background information.
All of this happens under Linux Lite, or whatever Linux distro I'm currently using. I tend to use the Openbox window manager and Tilda terminal emulator, though that's not really germane. Really, with some minimal modification, this should work on more or less any POSIX-compliant operating system.
I took the collection of Lovecraft stories and converted it to plain text, then parsed it with the Python-based Markov text generator, which I've developed in tandem with another automatic text blog. I've done practically no pre-processing, unlike with the Irish Lit AutoTweets account, mostly because (a) the amount of text is huge, and (b) the corpus didn't contain any information that I would consider to be private, such as student names, grades, etc. A long-term goal is to massage this corpus in various ways, though this process has been going very slowly (I tend to pre-process one story immediately before switching to using it as the basis for new stories, and often the necessity to do this is a strong motivator for me to let the current setup run longer than I had intended).
I moved the corpus (which I renamed CollectedStories.html) into a folder, created a symbolic link with a short name (/lovecraft) on the root level of my drive to save myself some typing, and moved to /lovecraft in my terminal. Then I created a symbolic link to the Markov chain–based generator inside the /lovecraft folder so I can easily use the text generator from the script that interacts with Tumblr. Once that was done, I compiled the textual corpus into a useful statistical set with:
markov_sentence_generator/sentence_generator.py -m 2 -c 5 -i corpora/entire-corpus.txt -o corpora/chains.dat
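For readers who haven't met Markov chain text generation before, here is a minimal sketch of the general idea. It is not the code in sentence_generator.py, and it assumes (the page doesn't confirm this) that the -m 2 option refers to a chain length of two words. The function names build_chains and generate and the sample sentence are invented for illustration.

```python
import random
from collections import defaultdict


def build_chains(words, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    chains = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chains[state].append(words[i + order])
    return dict(chains)


def generate(chains, length=20, seed=None):
    """Walk the chains from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chains))
    order = len(state)
    output = list(state)
    for _ in range(length - order):
        followers = chains.get(state)
        if not followers:
            break  # dead end: this state was only seen at the end of the text
        output.append(rng.choice(followers))
        state = tuple(output[-order:])
    return " ".join(output)


# Hypothetical toy corpus; the real corpus is the full story collection.
sample = "the old ones were the old ones are and the old ones shall be".split()
chains = build_chains(sample, order=2)
text = generate(chains, length=12, seed=1)
print(text)
```

Repeated followers are kept in the lists on purpose: a word that follows a given pair more often in the corpus is proportionally more likely to be chosen.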
As part of my long-term goal to make this process more flexible, I created a sub-folder, corpora, and made a symbolic link (/lovecraft/chains.dat, which points to /lovecraft/corpora/chains.dat). I've imported the CollectedStories.html file into LibreOffice and am gradually spell-checking and otherwise massaging it, though, again, this ongoing process hasn't yet borne much fruit.
After that, it was a matter of writing and testing a Python 3 script that generates chunks of text that are within acceptable length parameters (currently defined as 25 to 55 sentences, though that's subject to change) and sends them to Tumblr. I signed up for a new Tumblr account for the purpose, downloaded several out-of-copyright images from Wikipedia (full credits here), and created several "about this blog" documents on the blog.
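One simple way to enforce a length window like this is to keep generating candidates until one fits. The sketch below shows that approach; it is an assumption about how such a check could work, not a copy of generate.py. The names count_sentences and generate_acceptable are hypothetical, and the sentence counter is deliberately naive (abbreviations such as "H.P." would be miscounted).

```python
import re

MIN_SENTENCES, MAX_SENTENCES = 25, 55  # the window quoted above


def count_sentences(text):
    """Rough count: each run of ., ! or ? followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?:\s|$)", text))


def generate_acceptable(make_text, max_attempts=100):
    """Call make_text() until it returns text inside the sentence window."""
    for _ in range(max_attempts):
        candidate = make_text()
        if MIN_SENTENCES <= count_sentences(candidate) <= MAX_SENTENCES:
            return candidate
    raise RuntimeError("no candidate of acceptable length")


# Stand-in for the real generator: yields texts of 3, 80, then 30 sentences.
lengths = iter([3, 80, 30])
story = generate_acceptable(lambda: "It was dark. " * next(lengths))
print(count_sentences(story))
```

The attempt cap is there so that a badly chosen window fails loudly instead of leaving a background job spinning forever.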
Some of the work of communicating with Tumblr is done by my social media–handling module, but that module itself relies on Michael Helmick's python-tumblpy module. It also relies on a set of authentication constants that I store in yet another importable Python module, which I don't make public; they're my private authentication constants, after all, and sharing them with the world would allow anyone to post to the account.
In more detail, the script:
Calls markov_sentence_generator/sentence_generator.py to generate a title of appropriate length. Lovecraft's own titles in the corpus that I've analyzed are between 2 and 55 characters, but in the past I've found that allowing short titles tends to result in the same very short titles popping up over and over, so I lengthened the requirement to between 40 and 90 characters, which seemed to be about right. Since 11 June 2014, the script has checked each title against a list of all titles that have been used before, so repetition is no longer a problem, and from that date the script has generated titles of 10 to 70 characters.
Calls sentence_generator.py to generate the story.
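The title step described above (a length window plus a check against previously used titles) might look something like the following. This is a sketch under assumptions: pick_title is a hypothetical name, the set of used titles is kept in memory instead of in whatever file the real script uses, and the example titles are made up.

```python
def pick_title(make_title, used_titles, min_len=10, max_len=70, max_attempts=1000):
    """Return the first generated title that fits the window and hasn't been used."""
    for _ in range(max_attempts):
        title = make_title().strip()
        if min_len <= len(title) <= max_len and title not in used_titles:
            used_titles.add(title)  # remember it so it is never picked again
            return title
    raise RuntimeError("could not find an unused title of acceptable length")


# Stand-in for the Markov generator: a fixed sequence of invented candidates.
candidates = iter([
    "The Tomb",                    # too short (8 characters)
    "The Thing in the Cellar",     # already used
    "The Thing in the Cellar",     # still already used
    "A Whisper Beneath the Hill",  # acceptable
])
used = {"The Thing in the Cellar"}
title = pick_title(lambda: next(candidates), used)
print(title)
```

Tracking used titles is what makes the shorter 10-character minimum workable: short titles can still recur in the generator's output, but they are rejected on repeat.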
Note that running this script can actually take quite a while: a minute or more, on my laptop. This is substantially longer than running the generate.py script for the Irish Lit AutoTweets account. (You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under the GNU GPL, either version 3 or, at your option, any later version. You'll need to make it executable (chmod +x generate.py), of course; you'll also need to download several other modules I've written on which it depends; URLs for these modules are in source code comments. If you find the script useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements. I'm happy to answer questions, but can't provide extensive support or bring you up to speed if the explanation here is unintelligible to you.)
Once I'd written the script, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on twice a day). Here's the line from my cron file that I use as of the time of this writing:
0 9,21 * * * /lovecraft/generate.py
I also installed an anacron line in an attempt to make sure the script runs at least once a day if the computer's turned on, even if the computer's not turned on at the times when the cron job is supposed to run. The line in /etc/anacrontab looks like this:
1 25 cron.daily /lovecraft/generate.py
generate.sh updated to make sure that post titles aren't repeated.
generate.sh updated to adjust the allowable title length: now titles can be 10 to 70 characters. Allowable story length has also been shortened: stories are now 25 to 55 sentences.
generate.sh script, which was a bash script that posted to Tumblr via the tumblr-rb rubygem. An admin post was made.
Hopefully, all of this produces a small but perceptible increase in the quality of the output generated by the algorithm.
Arthur Jermyn week in an admin post.
generate.sh script that were causing problems: the first write to the temporary file wasn't being overwritten (needed bash redirection > on the first write to the script, which didn't happen with the new version); and needed to wait for data to be written before emailing it off (added ; sync to the last write to the temporary file before emailing).
sync, waits 30 seconds, and runs sync again. This is probably overkill, but it's really a small amount of extra time to have a script that runs in the background wait, and hopefully it eliminates the problem.
2014-12-05: I think I've finally figured out the actual reason why I've been getting blank posts: the script was insufficiently specific about file location. I've tweaked it a bit to actually fix the problem. (I think. But then, I've said this before.) Here's the story: (what is now) line 58 of the script is the line that actually posts the generated story to Tumblr; it currently reads cat /lovecraft/post.txt | mail -s "$TITLE" [your post-to-Tumblr-by-email email address] (yes, I've redacted my own email address here). What's been tweaked in this script? The full pathname to post.txt has been added (by adding /lovecraft/). The problem causing the blank posts, I think, was that the script, when executed as a cron job, was not being run with /lovecraft as the current working directory ... as it typically has been when I test the script in a terminal.
Facepalm. Hopefully, I've actually fixed the problem now. Let's see. If I don't get any blank posts over the next few days, I'll remove some of the other solution attempts gradually and make sure I don't reintroduce the problem.
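The same pitfall applies to any script run from cron, whatever language it's written in. As an illustration only (the fix described above was made in the bash script by hard-coding /lovecraft/), here is a Python sketch of one common alternative: anchoring file names to the directory that holds the script. The helper name path_beside is hypothetical, and the temporary directory stands in for the project folder.

```python
import os
import tempfile
from pathlib import Path


def path_beside(script_file, name):
    """Return an absolute path to `name` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / name


# A bare relative name follows the current working directory around;
# an anchored path does not.
with tempfile.TemporaryDirectory() as project_dir:
    script = Path(project_dir) / "generate.py"
    script.touch()
    original_cwd = os.getcwd()
    try:
        os.chdir(project_dir)  # like testing by hand inside the project folder
        relative_in_project = Path("post.txt").resolve()
        os.chdir(original_cwd)  # like cron starting the job somewhere else
        relative_elsewhere = Path("post.txt").resolve()
    finally:
        os.chdir(original_cwd)
    anchored = path_beside(script, "post.txt")

print(relative_in_project == anchored)
print(relative_elsewhere == anchored)
```

Inside a real script, the call would be path_beside(__file__, "post.txt"), which gives the same answer no matter where the job is launched from.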
2015-01-07: Discovered that Tumblr tags have a maximum length ... at least, Tumblr posts that are created by sending an email to your automatic post-to-Tumblr address do. The length requirement seems to not apply to posts created directly in Tumblr's web interface. This limitation is, of course, as one would expect from Tumblr, (a) totally sensible, and (b) extremely well documented. Incidentally, that maximum length is 32 characters, not counting any opening hashtag/pound sign.
Anyway, this is why all posts since the announcement of Terrible Old Man week have ended with a trailing st: they were tagged #The Strange High House in the Mist, which is 34 characters, not including the opening hashtag/pound sign, so Tumblr just silently chopped off the last two characters from the hashtag, converting it to #The Strange High House in the Mi, and adding the left-over st to the end of the story itself. Which is, once again, totally sensible, what with Tumblr being Tumblr. So I altered the script so that future stories written during this period would be tagged #Strange High House in the Mist instead of #The Strange High House in the Mi, and manually altered all of the automatically generated stories written so far (1; 2; 3; 4; 5; 6; 7; 8) and the initial announcement so that they're consistent. Which was totally not a problem at all.
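A check like the following would catch an over-long tag before it's mailed off. It's a sketch only: the 32-character limit is the one observed above for post-by-email tags, not something taken from Tumblr's documentation, and the function names split_tag and check_tag are hypothetical.

```python
MAX_TAG_LENGTH = 32  # limit observed above, not counting the leading '#'


def split_tag(tag, limit=MAX_TAG_LENGTH):
    """Mimic the truncation described above: return (kept part, left-over part)."""
    name = tag.lstrip("#")
    return name[:limit], name[limit:]


def check_tag(tag, limit=MAX_TAG_LENGTH):
    """Return the tag name, or raise if it would be truncated."""
    name = tag.lstrip("#")
    if len(name) > limit:
        raise ValueError(f"tag is {len(name)} characters; limit is {limit}: {name!r}")
    return name


kept, leftover = split_tag("#The Strange High House in the Mist")
print(repr(kept), repr(leftover))

safe = check_tag("#Strange High House in the Mist")
print(len(safe))
```

Raising an error is a design choice here: a failed run is easier to notice than a stray st at the end of eight stories.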
week last five months, but, you know, dissertation.
Simply because this is such a big change, I'm also retiring the technical write-up I was maintaining at the time, marking it "of historical interest" (it describes a viable way to maintain the AutoLovecraft blog, but not the one I'm currently using) and creating a new version. In any case, the current version of this document is out of date, so this is a good time to make it current.
At the Mountains of Madness month having been so much fun, I let it stretch on; now I'm announcing a week (or whatever) of stories based on Lovecraft's Cool Air.
Code on this page is licensed under the GPL, either v3 or, at your option, any later version. All non-code work is copyright © 2014–16 by Patrick Mooney. Short link to this page: http://is.gd/AutoLovecraftTechnical. This HTML file was last updated 15 October 2016.