How My Automated Irish Literature Social Media Accounts Are Generated

If you're looking for non-technical information about Irish Lit AutoTweets, you should go here, instead. The document you're reading right now contains technical information about how the tweets are generated. If you'd like a good overview on how Markov chains work, Jeff Atwood has written a highly accessible introduction.

The Short Version

I exported all of the course-related email that I wrote during the quarter into a folder on my local hard drive, saved the relevant web pages as text in the same folder, copied and pasted the course Twitter stream into a text file in that folder, and saved a text-only version of student paper comments (yes, into the same folder). Then I massaged the data in various ways to make it more amenable to text processing and to remove sensitive information (primarily student names and grades).

The tweets themselves (and the longer Tumblr discourses) are generated using a program called DadaDodo, which generates text based on an analysis of existing text that transforms sentences into Markov chains. This involves two steps: first, I generated compiled probability data from the corpus using DadaDodo and saved the compiled data; then, I opened a new Twitter account, wrote a script to generate sentences based on that compiled data and upload individual selections to that Twitter account, installed it as a cron job on my laptop that runs six times a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

After this account had been running for about two years, I went through one of my periodic cycles of looking comparatively closely at what the account was doing as it ran along on its merry way, and it occurred to me that it's generating an awful lot of text that simply disappears into the void, mostly because the text-generation process often generates text that doesn't fit into Twitter's 140-character limit. So I arranged for the wasted text to be captured by another script, which periodically posts longer discourses based on the same textual corpus to a new Tumblr account that I opened.

You can download and use either script, if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me. And you should tweet at me (or find another way to reach me) if you do anything interesting with it or have input or ideas.

The (Much) Longer Version

This script is run under Linux Mint 17.1 (Rebecca), though it should be possible to make this methodology for generating Twitter content work, with necessary adaptations, on any Unix-like operating system, and should work more or less as described under any Debian-derived Linux system (Linux Mint is one; so is Ubuntu).

Assembling the Base Corpus

DadaDodo needs something to analyze and use as the basis of the text it generates; I refer to this as the base corpus. Text used as the base corpus for this project consisted of all of the writing I did that I had captured during the quarter, and it came from several sources:

Email: Every quarter, I save all of my course-related emails, sent and received, in a specific folder designated for that purpose in my email client, Mozilla Thunderbird. I sorted the folder by sender, selected the 460 messages that I wrote myself during the quarter, and exported them using the ImportExportTools extension for Thunderbird.
Section website: I took all of the documents that I myself wrote during the quarter and converted them to plain text; there were few enough that doing this one at a time in Firefox was quicker than reading manpages for HTML-to-text conversion tools, none of which I know well enough to use without consulting documentation. Note that, during that particular quarter, much of the content for the course website was generated by the students themselves; none of this was incorporated into the base corpus.
Section Twitter stream: Copied and pasted into a new plain text document.
Comments on student papers: These were originally typed in OpenOffice. They were exported as plain text files in order to roll them into the base corpus ... after an awful lot of picking over with a fine-toothed comb to be absolutely sure I wouldn't leak any confidential information.

All of these documents were aggregated into a single folder, then concatenated with the standard POSIX cat command to produce a the first draft of the base corpus. After that, there was a long and boring period of searching and replacing to eliminate undesirable text that I didn't want used as the basis of the text that the script generates (URLs, student names, smart quotes, grade-related information, carriage returns, double spaces, etc. etc. etc.) Searching, replacing, and other editing were largely done with Bluefish, which is a pretty good text editor that supports searching for regular expressions and understands POSIX backslash sequences. I installed DadaDodo (sudo apt-get install dadadodo under Linux Mint). I called the single-file base corpus 150.txt because the course for which I did the teaching was, of course, English 150.

I created a folder in my file system at an appropriate place in the filesystem hierarchy, several folders beneath my Documents folder, then created a symbolic link with a short name to that folder at the root level of my file system (it's at /150) to save myself some typing. Then I moved the base corpus so that it was /150/150.txt. Once that was done, I compiled the textual corpus into a set of Markov chains with DadaDodo: dadadodo -w 10000 -o chains.dat -c 1 150.txt outputs chains.dat, which is then the precompiled Markov chains that the script uses to generate chunks of text during its regular runs several times a day. (This isn't, strictly speaking, necessary; it's possible to just have DadaDodo re-analyze the entire base corpus from scratch each time it runs. But, since the base corpus is almost 650K of text, it saves some time on each script run to prepare the set of Markov chains once, in advance, instead of redoing the same work multiple times per day.)

Automating Text Generation

Once the textual corpus was assembled, I signed up for a new Twitter account for the project, logged in, and edited the profile; the website listed on the profile is a short link to this page, the one that you're reading now. Then I wrote and tested a script that calls DadaDodo repeatedly to generate chunks of text until the generated chunk is acceptable, then sends that acceptable chunk to Twitter. The first version of the script was a Unix-style shell script that executed under bash; you can see the description and write-up for this particular script here, if you'd like. It worked quite well for all of the nearly two years some version of it was in use and was, at least at the time that it was retired, still a viable solution.

Nevertheless, on 15 September 2015, I replaced the shell script with a Python 2.X script (more specifically, it was written under 2.7.6) that does the same thing. Part of my motive was simply to give myself some practice using Python; part of my motive was a desire to develop the script a bit further (and Bash shell scripts are a real pain to write; a complete re-write in Python is a small investment compared to the aggregate time of maintaining and further developing a Bash script over a longer period of time. And part of my motive was to remove the dependency on TTYtter, which I was previously using to send the tweets to Twitter. TTYtter is a command-line Twitter client that can be scripted; it's a great program, but there were two problems with using it. One is that it's overkill; it does much more than I actually need to do with it. Depending on an 8000-line Perl script seemed unnecessary when all I want to do can be accomplished with fewer than a dozen lines of Python. Too, an early-2014 email from TTYtter developer Cameron Kaiser in response to a question I'd asked him said that he no longer used Twitter and was unlikely to continue to do substantial work on TTYtter. Depending on a project that may be gradually drifting toward orphanhood seemed a bad move to me; in the long term, I needed either to find a replacement that could be used from a Bash script or learn to interact with Twitter in Python using OAuth. Since I'm learning Python anyway, and thought a project would be useful, that's the way I went. Interacting with Twitter in Python is easy using any of several libraries; I decided to go with tweepy. It's not installed by default, but pip install tweepy takes care of that easily enough.

Once all the pieces were in place, I rewrote the previous Bash script (again, details are here) in Python (I'm running 2.7.6 for this script; I haven't yet tried moving it to Python 3, although that will probably happen eventually), as generate.py, and tested it. It runs through a few steps every time it's run:

Parses the command line that's passed to it. It understands several command-line options; try running it as ./generate.py --help to see an explanation.
Calls dadadodo, stripping the leading and trailing spaces that the program tends to include.
Checks to see if the resulting text is within acceptable length parameters. Of course, Twitter's famous 140-character limit is a hard upper limit that winds up being kind of annoying sometimes, but I also enforce a minimal limit (currently 46 characters) because the very short tweets that DadaDodo generates from the base corpus tend to be rather dull, in my opinion. (Early tests of the script produced the tweet History. about one out of every five times.) If the automatically generated tweet isn't an acceptable length, it just keeps trying until it generates one that is. (Since v1.2, if the script is run with the -x or --extra-material flags, it saves rejected tweets in a separate file.)
Checks to see if the tweet has been posted before. DadaDodo repeats itself from time to time, so this script keeps a list of tweets that have been posted before (I refer to this as the tweet archive). If the tweet's not new, the script starts over and tries again. (Repeated tweets that are generated are not saved to the extra material archive, even if long and short tweets are being saved.)
Sends the resulting new tweet of acceptable length to Twitter.
Saves the new tweet to the tweet archive. As far as I can tell, there's no good reason to sort the tweet archive — it doesn't, for instance, seem to speed up the search that determines whether the tweet has been posted before — but sorting alphabetically is possible by running ./generate.py --sort-archive, if you're the kind of person who likes things to be sorted whether there's a benefit to it or not.

You can see the script on GitHub, and previous versions are available as branches; I (in)tend to tweak it from time to time. You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a the GNU GPL v3 or, at your option, any later version. You'll need to make it executable (chmod +x generate.py), of course, or else to invoke the script name as an argument to Python on the command line, perhaps as python /150/generate.py. If you find this script useful, I'd be grateful if you'd Flattr me, though this is (of course!) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements.

Automating the Script's Runs

Once the script was set up and worked effectively when I ran it manually, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on five times a day, and later bumped that up to six). Here's the line from my cron file that I use as of the time of this writing:

45 0,4,8,12,16,20 * * * /home/patrick/.Enthought/User/bin/python /150/generate.py -v -v -v -a /150/tweets.txt -x /150/extras.txt

Which is to say that the script runs on my laptop at 12:45, 4:45, and 8:45 a.m., and 12:45, 4:45, and 8:45 p.m., posting a new tweet each time, saving that tweet to the tweet archive at /150/tweets.txt, and saving rejected material to /150/extras.txt. Provided that my laptop is on and connected to the Internet, of course. Once in a while, the extra material that's being collected is posted to the Automated Irish Lit Discourses Tumblr account ... about which I will say more in a bit.

I also installed it as an anacron job to make sure that it runs at least once a day if my laptop is turned on. Here is the relevant line from /etc/anacrontab:

1 20 IrishLitTweets.daily /150/generate.py

Some Notes on the Twitter Script

I often specify full paths in the script because cron jobs may not have a properly set-up environment that guarantees environment variables, such as $PATH, are properly set. I find it easier to just specify full paths than to keep a crontab PATH declaration in sync with one that's maintained in a .bashrc.

Here's what the switches for the the dadadodo invocation mean:

-c 1: Just generate one sentence.
-l /150/chains.dat: Don't use the original corpus; use the manually compiled statistical data instead. This is faster, though the script still runs fairly quickly without it. Still, I'd rather avoid wasting processor time unnecessarily; I'm often doing other things on my laptop when this script runs.
-w 10000: Use a really wide text wrap amount to make sure that DadaDodo doesn't wrap text at all.

Longer Discourses on Tumblr

Periodically, the extra material that builds up in the extra material archive gets posted to Automated Irish Literature Discourses on Tumblr. This is accomplished by another Python script, discourse-generate.py, which is also available on GitHub (and which was also written under Python 2.7.6).

discourse-generate.py lives in /IrishLitDiscourses on my hard drive; it's a simpler, less-robust, less-developed script that just looks at /150/extras.txt twice a day (4 a.m. and 4 p.m., Pacific time) and makes a decision about whether to post it. The script doesn't take command-line arguments or do enough error-checking; it just rolls the dice and, sometimes, posts. My initial observation was that it took too long to build up material for posting to the Tumblr account, so what I did was tweak the Twitter script so that (as of v1.3), instead of asking for a single sentence, it generates anywhere between one and six sentences when the -x or --extra-material-archive switches are specified. This means that tweets of two or more sentences now sometimes appear in the Twitter stream; but it also has the side effect of dumping much more material into the extra material archive, since collections of two to six sentences are more likely to be too long for Twitter than single sentences are.

More specifically, the Tumblr script decides whether to post by comparing a random number to a probability calculated based on the current length of the extra material archive. The script uses an exponential decay curve to affect how likely it is that the script posts: the more material is built up, the more likely it is that the material will be posted when the script rolls the dice to determine whether or not the material gets shipped out. If there are fewer than three thousand characters waiting to go out, the chances of them doing so on any iteration of the script are zero, but the probability doesn't actually reach certainty until there are a bit over 1.5 million characters waiting to go out. (Practically speaking, of course, running the script, and therefore rolling the dice, twice a day with gradually increasing probability of posting means that the extra material archive should never get anywhere near that large.)

When the script decides to post, it decides on a title (currently Discourse of [Date]), a slug (the end of the Tumblr post's URL), the text of a Twitter annoucement, and a set of tags for the post. Then it sends them off to Tumblr using the pytumblr library.

Overall, then, the process looks like this:

The the Twitter script, /150/generate.py, runs six times a day, posting tweets to @IrishLitTweets. Extra material that's rejected for being the wrong length accumulates on my hard drive in /150/extras.txt.
Twice a day, assuming my laptop is on and connected to the Internet at the right times, the Tumblr script runs and decides whether a random number between zero and one is or is not less than 1 − e ^{(len − 3000) × ^{−¹⁄₄₀₀₀}}, where len is the current length of the extra material archive, in bytes. If the random number is less than the number from the probability calculation, then:
1. The accumulated text gets shipped off to the Tumblr account with an appropriate title, slug, and set of tags.
2. A tweet is posted to @IrishLitTweets that points to the new long discourse on Tumblr.
3. The extra material archive gets emptied out.
4. The extra material archive gradually gets filled up again as the Twitter script runs. The process repeats.

Here's the crontab line that runs the script twice a day:

0 4,16 * * * /home/patrick/.Enthought/User/bin/python /IrishLitDiscourses/discourse-generate.py

And here's the line from /etc/anacrontab line that makes sure the script runs at least once on any day that my laptop is on for at least half an hour:

1 30 IrishLitDiscourses.daily /home/patrick/.Enthought/User/bin/python /IrishLitDiscourses/discourse-generate.py

Guest Lecturers

I'm planning on posting other DadaDodo-generated discourses from time to time based on statistical analyses of other texts about Irish literature. It will be a bit before this happens, because there are other tweaks that I want to make first, but there's a lot of material out there that could be run through DadaDodo and used as the basis for these guest lectures by historical figures. It's likely, for instance, that an early set of guest lectures will be based on Matthew Arnold's The Study of Celtic Literature. Suggestions are welcome, especially if you can point to source texts for analysis; let me know on Twitter if you've got a good idea.

Reservations About the Current Setup

DadaDodo's understanding of English grammar is really only OK, from my perspective. This is on its own amazing, arguably, because (assuming I understand the documentation correctly) it doesn't really understand grammar at all; it just pays attention to which words follow which other words, and models the relationships it finds; that is to say, what it actually does it traverse Markov chains that express probability relationships between words. Still, despite how well (all things considered) it models English grammar, there are a number of semantically and structurally meaningful elements that it ignores in this process; this is suboptimal if your goal is to generate text that seems meaningful:
- The first thing I noticed about the text was that DadaDodo doesn't pay attention to the distinction between capital and lowercase letters — at least, not in a way that necessarily results in the production of natural-sounding text. Take a look at this tweet for an example:
  He talked after Getting a pair; going to say in.
  
  — Irish Lit AutoTweets (@IrishLitTweets) December 31, 2013
  
  What's actually going on here? Though I don't read C and can't evaluate the source code here, my best guess, based on DadaDodo's documentation, is that the code has stripped out punctuation and noticed that the word Getting, with that capitalization, after punctuation has been stripped, follows the word after at least once in the original corpus, and so, while traversing the chains generated from the base corpus and landing on after, it rolls the dice and selects Getting, with that particular capitalization. What I'd actually prefer, as a first-order approximation, is for the program to notice that some words are always capitalized and treat them as proper nouns, and only to capitalize other words when they appear at the beginnings of sentences. Of course, this is hardly an ideal solution for a lot of reasons, so I'd also be happy to have the program treat various types of capitalization as completely different words and pay more attention to when various words occur in the original corpus — noticing, for instance, that some words only ever occur at the beginning of sentences. One way or another, though, the current system would benefit from finding a way to avoid producing inappropriately capitalized words in the middle of sentences.
- There are a number of other ways in which I wish DadaDodo were more cognizant of English grammar:
  - I wish punctuation were handled more intelligently. For instance, take a look at this tweet:
    Not taking his consciousness and you'll want me on Tuesday, morning!
    
    — Irish Lit AutoTweets (@IrishLitTweets) December 30, 2013
    
    Nowhere in the base corpus are the consecutive words Tuesday and morning separated by a comma; DadaDodo is either treating commas (and perhaps other punctuation?) as semantic elements on their own (i.e., is treating them as spaces on which the generator can land and from which it can depart), or is using some heuristic to evaluate when to throw punctuation into the mix. But the rules for comma usage in English, though elaborate, are actually rather straightforward, and there is no reason to throw a comma in here; from the point of view of a strictly grammatical reading, doing so makes the clause itself into meaningless gibberish. Again, fixing this would be an incredibly complex issue, and would require that DadaDodo come closer to understanding what the sentences it generates mean, or at least that it have some sense of grammar while generating the tweets, and it is not immediately obvious to me how this could be easily accomplished while simply traversing a Markov chain. (In fact, I doubt that it does; see the note below on memorylessness for more.)
  - I wish that DadaDodo had a better sense of the fact that English has different parts of speech, and would select an appropriate part of speech based on what the sentence is currently doing. Of course, I also understand that to wish for this is to wish that Markov chains were not stateless, which is a basic violation of the presuppositions of the current approach.
  - All of the last few bullet points boil down to this: though the content of the tweets generated through the current toolchain is amusing and sometimes insightful, there are fundamental problems with the current approach that limit its overall utility: it's never going to get all that much better at generating text that's taken to be naturalistic while the current approach is the approach that's used, because getting much better would require that the algorithm have a better sense of grammar. But this is precisely what's incompatible with the current approach, because having a better sense of grammar would require that, as the algorithm generates text at various scales (phrase, sentence, paragraph ...), it has some sense of what has and hasn't yet happened in the current piece of text being generated. But this is precisely what Markov chains give up through their simplicity: they are by definition memoryless, so what's already happened in the current unit of language isn't an influence on what happens when any particular decision is made by the algorithm.
    This means that the current approach will really never get substantially better at writing natural text based on the current base corpus. Part of the reason for this is that the base corpus is rather heterogenous in terms of the kinds of linguistic structures that it contains: for instance, it includes both the tweets I made about the class on the section Twitter stream during that quarter, which tend to have a simple and direct syntactic structure mandated by common Twitter community practice and by the length limitations of individual tweets; but it also includes a substantial amount of more syntactically complex—and sometimes much more syntactically complex — kinds of language. Notably, I wrote much more email during fall quarter of 2013 than I usually do, in part because I was experimenting with a structure for my discussion sections that was new to me. But, in contrast to the simple declarative syntax I tended to use in tweets, the syntax I use in emails tends to be much more complex: promises, conditionals, the subjunctive mood, and complex subordination are all much more complex structures that appear much more frequently in my emails. Sentences on the Twitter stream never exceed 140 characters. Sentences in my emails frequently do. Mixing both types of constructions in the base corpus drags down its usability for producing naturalistic blocks of text because it conflates different types of discourse. Adding in discourse in other registers complicates the problem. Arguably, it would be best to pick one document type and use it exclusively as the base corpus; this is one of the reasons why my related Tumblr account, The Worst of Bad Lovecraft, produces (what I take to be) more natural-sounding text: the underlying textual corpus is more homogeneous.
- All of the above is to say: if anyone has a suggestion for another method of generating text more or less grammatically, but randomly, and based on an existing corpus of text, I'd love to hear about it. (I may experiment with rolling my own Markov chain-based text generator in the future.)
Twitter's 140-character limit really constricts what can be done here. Playing around with DadaDodo and the existing corpus shows (well, in my own opinion, anyway) that longer sections of text feel more like real paragraphs. For examples, take a look at some of the longer discourses that get posted to the Automated Irish Literature Discourses Tumblr account: they're generated in the exact same way, and are only rejected for not falling within the length parameters that the Twitter script expects. Looking at them, I suspect you'll notice that the same grammatical problems occur in the longer paragraphs as in the shorter tweets, they're less noticeable, and don't impair the perceived meaning of the paragraphs to the same degree. I suspect that the reason for this is that the reader's models of understanding kick in in a different way in longer pieces of text, because longer constructions demand that the reader evaluate the connections between individual statements and thereby activate models of understanding that more strongly involve logical progression, narrative structure, and other more complex connections between words. Which is to say that part of our basic literacy training for those of us who are experienced readers teaches us to ask questions about relationships between statements and that, because we're experienced readers, we ask these questions pretty much automatically. But the fact of the matter is that, even when statements are automatically generated, the reader's mind will distinguish some logical pattern at least some of the time, and that perceiving such a pattern in a paragraph — even to the extent of having a sense of what the paragraph has said up to this point — helps to resolve interpretive questions and assists in the construction of an overall sense of meaning for the paragraph. This overall sense of meaning also tends to diminish the degree to which grammatical problems are perceived as causing hermeneutic difficulty. A single-sentence tweet, though — especially one restricted to 140 characters — doesn't (necessarily) activate the same hermeneutic requirement to evaluate the relationship between statements to the same degree, so there's nothing to buffer it if it doesn't stand on its own.
- Which is to say that doing this for Twitter using the current toolset is always going to be suboptimal (even though the end result is OK a fair percentake of the time). There are probably two basically different ways to respond to this problem: either replace DadaDodo with another text generator that does a better job of generating grammatically correct short blocks of text (I'm still pondering this), or else find a venue that supports publishing longer chunks of automatically generated text (this is precisely what the Automated Irish Literature Discourses Tumblr account is). Input on either approach would be welcome.

All of which is to say, again, that I'd love feedback if people have thoughts or ideas about how this could be done better.

Change History

2013-12-30: The first tweet posted by the script is posted. I documented that version of the script here.
Before 2014-12-20: Various other things happened. Lots of parameters were tweaked. I wasn't yet tracking changes; everything was too fluid.
2014-12-20: Modified the (bash) script that generates this account so that it will no longer generate the same tweet more than once. The strategy used is simply to compare the new tweet against a list of all tweets and retry if it's been generated before. So long, this tweet. It's been fun.
2015-09-15: The content of the Twitter account is now generated by the Python script described on this page that you're currently reading right now, rather than with the old (bash) script, which is described in the old write-up. The current version of the script is v1.
2015-09-22: Testing having been successful on a new version of the script, v1.1 was merged into the master branch. Changes are described in HISTORY.md.
2015-09-29: Testing began on a new version of the script, v1.2. Changes are described in HISTORY.md. Notably, text that's rejected after being generated is now being saved; I have plans for it.
2015-09-29: Extra material that's generated and rejected by the Twitter script, which has been saved since today's earlier changes to the script, is now intermittently posted to the Automated Irish Literature Discourses Tumblr account. The script generating the content for this Tumblr account is available on GitHub; it's currently much less developed than the Twitter script. The first post on the new Tumblr account is here.
2015-10-06: Testing of the new Twitter script having been judged successful, v1.2 was merged into the master branch as the latest stable version.
2015-10-06: This web page was updated substantially, mostly to describe the Python script that generates content for the Automated Irish Literature Discourses Tumblr account.
2015-10-06: This change history list was updated to also use the hAtom microformat. Details about various versions of the Twitter and Tumblr scrips has also been added retroactively to this list (I'd gotten behind on that).
2015-10-07: v1.21 of the Twitter script has a minor change: the logging function is abstracted to this library, since I'm winding up using it in multiple scripts. This also means that I need to add a $PYTHONPATH declaration to my crontab.
2015-10-07: v1.3 of the Twitter script was released; it has a change in logic so that when -x or its long-form equivalent --extra-material-archive is in effect, the script asks DadaDodo for a random number of sentences (up to six) instead of just asking for a single sentence. This means that, for the first time, multiple-sentence tweets may appear; it also implies that I can't process the archive automagically with the assumption that the end of a sentence is also the end of a tweet. (This aspect of the change was announced in a tweet.) It also has the effect of generating a lot more material for the extra material archive, and therefore increasing how quickly text is generated for the Automated Irish Literature Discourses account. Which is OK with me.
2015-10-07: v1.1 of the Tumblr script is put into testing; it abstracts the logging logic to the same external module as described in the earlier change to the Twitter stream's operation.
2015-10-10: v1.3 of the Twitter script has been slightly updated so that it requests 1–4 (instead of 1–6) sentences. This should help to keep quite so much content from accumulating in the Automated Irish Literature Discourses blog. Other minor changes, thanks to checking the script with the pylint utility. The new v1.3 script (no version number bump this time) is put into testing but not yet merged into the master branch. Similarly, v1.1 of the Tumblr script was updated slightly for PEP8 compliance thanks to pylint; it too has not yet had a version number bump; it is in testing without having yet been merged into the master branch.
2015-11-15: v1.3 of the Twitter script has been merged into the master branch; it's been in production use for over a month and I haven't noticed any problems yet.
2015-11-27: the Tumblr script was refactored: the part that actually interacts with Tumblr was separated into another unit, so that guest lecturer scripts can import that module (and I don't have to continually redact my login creds from each new guest lecturer script as I share it).
2016-02-08: the Twitter script was redesigned in an important way: the back-end text generation code is now my Markov chain–based text generator instead of DadaDodo. Which means this particular page is now out of date and requires revision. Sigh. Also, a change that happened somewhere in the interim: the refactoring scheme described in the update of 27 November has given way to a more general overhaul of my social media–handling code for several projects.
2016-08-18: Much of the source text was cleaned up to remove some garbage that the generator has been too eager to seize on. Mostly, the source text involved grading information that had been written down, and though it had already been scrubbed of names and was being remixed, effectively preventing leakage of confidential information, this random garbage was still leaking into the generated tweets, letting out tweets that have a number of meaningless numbers. It's not a leak of confidential information, but it is ugly, so I went through and cleaned it out, at least somewhat.
2017-04-13: Retiring this version of this document, as it has been outdated for over a year now. The new version will continue to be updated.