How the Irish Lit AutoTweets Twitter Account Is Generated

If you're looking for non-technical information about Irish Lit AutoTweets, you should go here, instead. The document you're reading right now contains technical information about how the tweets are generated.

The Short Version

I exported all of the course-related email that I wrote during the quarter into a folder on my local hard drive, saved the relevant web pages as text in the same folder, copied and pasted the course Twitter stream into a text file in that folder, and saved a text-only version of student paper comments (yes, into the same folder). Then I massaged the data in various ways to make it more amenable to text processing. I generated compiled probability data from the corpus using DadaDodo, saved the compiled data, opened a new Twitter account, wrote a script to generate sentences based on that compiled data and upload individual selections to that Twitter account, installed it as a cron job on my laptop that runs five times a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.

You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me.

The (Much) Longer Version

All of this happens under Bodhi Linux. I tend to use the MATE desktop environment and Tilda terminal emulator, though that's not really relevant. Really, with some minimal modification, this should work on more or less any POSIX operating system.

I save all of my course-related emails, sent and received, for a quarter in a specific folder in Mozilla Thunderbird, my email client. I sorted the folder by sender and exported the 460 messages that I wrote using the ImportExportTools extension for Thunderbird. I saved all of the website documents as plain text with Firefox in the same folder, copied and pasted the Twitter stream into a text file that got saved in the same folder, and exported the document in which I wrote summary comments as plain text (yes, into the same folder). Then I concatenated all of them using the standard POSIX cat command to produce the first draft of the textual corpus.
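
The concatenation step might look something like the following. This is only a sketch: the function name build_corpus and the folder names in the usage note are hypothetical, and it assumes every source document was saved with a .txt extension.

```shell
#!/bin/bash
# Hypothetical helper: concatenate every .txt file in a source folder
# into a single corpus file.
build_corpus() {
    local source_dir="$1"
    local corpus_file="$2"
    # The glob expands in sorted order. Keep the corpus file outside
    # source_dir so that it is never read back in as one of its own inputs.
    cat "$source_dir"/*.txt > "$corpus_file"
}
```

Something like build_corpus ~/150-sources ~/150/150.txt would then produce the first draft of the corpus (both paths are placeholders).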

Then there was a long and boring period of searching and replacing to eliminate URLs, student names, smart quotes, carriage returns, double spaces, etc. etc. etc. Really, it's a pretty tedious process. Searching, replacing, and other editing were largely done with Bluefish, which is a pretty good text editor that supports searching for regular expressions and understands POSIX backslash sequences. I installed DadaDodo (sudo apt-get install dadadodo under Bodhi Linux, because it's an Ubuntu derivative).
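
Some of that cleanup can also be scripted rather than done by hand in an editor. The filter below is a minimal sketch, not a record of the searches that were actually run: the function name clean_corpus is hypothetical, the URL pattern is deliberately crude, and removing student names would still need its own list of names.

```shell
#!/bin/bash
# Hypothetical cleanup filter: reads raw text on standard input and
# writes cleaned text to standard output.
clean_corpus() {
    # Drop carriage returns first, then let sed do the substitutions:
    #   1. delete anything that looks like an http:// or https:// URL
    #   2. turn curly double quotes into straight double quotes
    #   3. turn curly single quotes into straight single quotes
    #   4. collapse runs of spaces into a single space
    tr -d '\r' |
    sed -E \
        -e 's#https?://[^[:space:]]+##g' \
        -e 's/“|”/"/g' \
        -e "s/‘|’/'/g" \
        -e 's/ +/ /g'
}
```

It would be used as clean_corpus < raw.txt > 150.txt, where raw.txt is a placeholder name for the uncleaned first draft.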

I moved the corpus (150.txt, because the course for which I did the work was English 150) into a folder, created a symbolic link with a short name (/150) on the root level of my drive to save myself some typing, and moved to /150 in my terminal. Once that was done, I compiled the textual corpus into a useful statistical set with:

dadadodo -w 10000 -o chains.dat -c 1 150.txt

After that, it was a matter of writing and testing a script that generates chunks of text that are within acceptable length parameters and sends them to Twitter. I signed up for a new Twitter account for the purpose and edited the profile to point back here. It used to be easily possible to post to Twitter from a terminal (emulator) using programs like cURL (here's how you used to be able to do it), but Twitter stopped allowing basic authentication and started requiring OAuth with API version 1.1, so it's now a pain in the ass to update Twitter from a script: it's necessary to use a more full-featured terminal-based client. Doing a bit of research led me to choose TTYtter, more because of its ease of use than for its amusing name.

This involved saving the TTYtter script to a folder that's in my $PATH, then making it executable:

chmod +x ~/.scripts/ttytter

Running TTYtter for the first time causes it to go through its authentication procedure and create a token in my home directory called .ttytterkey. I moved this to /150 and call TTYtter from the script with -keyf=/150/.ttytterkey so that I can also use TTYtter without the -keyf switch to manage my personal Twitter account. (Yes, there are other ways to handle this issue, but this method is quick and it works.)

Once all the pieces were in place, all I needed to do was to put the finishing touches on the script, which is just a quick Bash script that accomplishes a few things:

Calls dadadodo, using sed to strip the leading whitespace that dadadodo tends to put at the beginning of the generated text.
Checks to see if the resulting text is within acceptable length parameters. Of course, Twitter's famous 140-character limit is a hard upper limit that winds up being kind of annoying here, but I also enforce a minimal limit (currently 46 characters) because the very short tweets that DadaDodo generates tend to be rather dull, in my opinion. (Early tests of the script produced the tweet History. about one out of every five times.) If the automatically generated tweet isn't an acceptable length, it just keeps trying until it generates one that is.
Sends the resulting tweet of acceptable length to Twitter using TTYtter.
Outputs the resulting tweet and its length to standard output. I primarily found this helpful while writing the script, but decided to leave it in because I may want to run the script manually from the terminal emulator from time to time.
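
The steps above can be sketched roughly as follows. This is an illustration written for this page, not the downloadable script itself. It assumes that dadadodo lives at /usr/bin/dadadodo, that TTYtter was saved as ~/.scripts/ttytter, and that TTYtter's -status= option is used to post a single tweet; the function names and the GENERATOR variable are hypothetical.

```shell
#!/bin/bash
# Sketch only: not the actual generate.sh.
MIN_LENGTH=46     # shorter tweets tend to be dull
MAX_LENGTH=140    # hard upper limit imposed by Twitter

# Command that produces one candidate sentence. It can be overridden so
# that the loop can be tried out without DadaDodo installed.
GENERATOR="${GENERATOR:-/usr/bin/dadadodo -c 1 -l /150/chains.dat -w 10000}"

generate_tweet() {
    local tweet
    while true; do
        # Strip the leading whitespace that dadadodo tends to emit.
        tweet=$($GENERATOR 2>/dev/null | sed 's/^[[:space:]]*//')
        # Keep trying until the length falls inside the acceptable range.
        if [ "${#tweet}" -ge "$MIN_LENGTH" ] && [ "${#tweet}" -le "$MAX_LENGTH" ]; then
            printf '%s\n' "$tweet"
            return 0
        fi
    done
}

post_tweet() {
    # Assumption: -status= posts one tweet non-interactively and exits.
    "$HOME/.scripts/ttytter" -hold -keyf=/150/.ttytterkey -status="$1"
}

# Typical use (commented out so that nothing is posted by accident):
#   tweet=$(generate_tweet) && post_tweet "$tweet"
#   echo "${#tweet} characters: $tweet"
```

Keeping generation and posting in separate functions makes it possible to test the length check without touching Twitter at all.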

You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. You'll need to make it executable (chmod +x generate.sh), of course. If you find it useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements. I'm happy to answer questions, but can't provide extensive support or bring you up to speed if the explanation here is unintelligible to you.

Once I'd done that, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on five times a day). Here's the line from my cron file that I use as of the time of this writing:

45 4,9,12,17,22 * * * /150/generate.sh

Which is to say that the script runs on my laptop at 4:45 and 9:45 a.m. and 12:45, 5:45, and 10:45 p.m. Provided that my laptop is on and connected to the Internet, of course.

I also installed it as an anacron job so that it runs at least once a day if my laptop is turned on. Here is the relevant line from /etc/anacrontab:

1 20 cron.daily /150/generate.sh

Some Notes on the Script

I often specify full paths in the script because cron jobs may not have a properly set-up environment that guarantees environment variables, such as $PATH, are properly set.

Switches for the dadadodo invocation:

-c 1: Just generate one sentence.
-l /150/chains.dat: Don't use the original corpus; use the manually compiled statistical data instead. This is faster, though the script still runs fairly quickly without it.
-w 10000: Use a really wide text wrap amount to keep DadaDodo from wrapping text.

Switches for the ttytter invocation:

-hold: If the Twitter rate limit is exceeded, keep trying until we can post. This would probably never happen with a five-time-a-day cron job, but is handy if I wind up posting manually from a terminal.
-keyf=/150/.ttytterkey: Use the TTYtter key specified in the absolute path rather than the default TTYtter key in my home directory. This allows me to use TTYtter to update my personal Twitter account when I'm not running this script, as well.

Reservations About the Current Setup

DadaDodo really doesn't have an understanding of English grammar that's up to snuff from my perspective. What it actually does is traverse Markov chains, but there are a number of things that it ignores in that process that are suboptimal:
- Perhaps most immediately obvious is the fact that DadaDodo doesn't pay attention to the distinction between capital and lowercase letters. Take a look at this tweet for an example:
  He talked after Getting a pair; going to say in.
  
  — Irish Lit AutoTweets (@IrishLitTweets) December 31, 2013
  
  What's actually going on here? This is worth examining in some detail, I think. Though I am not a coder and cannot speak authoritatively about what the program is actually doing, my best guess is that the code has stripped out punctuation and noticed that the word Getting, with that capitalization, after punctuation has been stripped, follows the word after at least once in the original corpus, and so, when "rolling the dice" while traversing the chains and landing on after, it selects Getting, with that particular capitalization. What I'd actually prefer, as a first-order approximation, is for the program to notice that some words are always capitalized and treat them as proper nouns, and only to capitalize other words when they appear at the beginnings of sentences. Of course, this is hardly an ideal solution for a lot of reasons, so I'd also be happy to have the program treat various types of capitalization as completely different words and pay more attention to when various words occur in the original corpus — noticing, for instance, that some words only ever occur at the beginning of sentences. The current system, though, would benefit from not producing randomly capitalized words in the middle of sentences, one way or another.
- There are a number of other ways in which I wish DadaDodo were more cognizant of English grammar:
  - I wish punctuation were handled more intelligently. For instance, take a look at this tweet:
    Not taking his consciousness and you'll want me on Tuesday, morning!
    
    — Irish Lit AutoTweets (@IrishLitTweets) December 30, 2013
    
    Nowhere in the original corpus are the consecutive words Tuesday and morning separated by a comma; DadaDodo is using some heuristic to evaluate when to throw punctuation into the mix. But the rules for comma usage in English, though elaborate, are actually rather straightforward, and there is no reason to throw a comma in here; it makes the clause itself into meaningless gibberish. Again, fixing this would be an incredibly complex issue, and would require that DadaDodo come closer to "understanding" what the sentences it generates mean, or at least that it have some sense of grammar while generating the tweets, and it is not immediately obvious to me how this could be easily accomplished while simply traversing a Markov chain. Still. If anyone has a suggestion for a better automatic text generator, I'd love to hear about it.
  - I wish that DadaDodo had a better sense of the fact that English has different parts of speech, and would select an appropriate part of speech based on what the sentence "is currently doing." Again, this is an incredibly complex task, but (as a non-coder) I think that it should be possible for code to incorporate a better sense of grammar into its generated text, and that progress is a worthy goal even if perfection remains elusive.
- All of the above is to say: if anyone has a suggestion for a Linux program that can be run from a bash script and does a better job of doing what DadaDodo does (i.e., generating text more or less grammatically, but randomly, and based on an existing corpus of text), I'd love to hear about it.
- Twitter's 140-character limit really constricts what can be done here. Playing around with DadaDodo and the existing corpus shows (well, in my own opinion, anyway) that longer sections of text "feel more like real paragraphs." For an example, take a look at some longer paragraphs generated from the same corpus but not restricted to 140 characters:
  
  That's fine and Godot from more analytically than of a good job this happens, you to be change be. Excellent sense, of or exam is needed to do you. Perfect and haven't yet a good opportunity to get back through to specific instances to Dexter one other people haven't yet and brief Notes.
  
  This matter and doesn't always, a particular texts, me, know if you're using as for the course edition of these would help and other pair's textual grammatical problems. The citation related to gesture toward an overview of expectations do you need one, is going on the midterm after thinking about how many small lucid and I'm sorry that your plans even more. I'll see a keystroke without engaging in pretty solid job here, though you discover that the grade. All students reading of suboptimal or points of anyone have if you should be leaving early stakes of course a lot of a narrow, rural, frequently describes the final, please let me with father, and Futurism, it's on Raglan Road too that they will re work to the West beginning of Stephen, and the opportunity to how to make discussions about the day as berries part of phrasing and that you might be refined a look at the class, isn't for the way that more effectively, not something genuinely Private.
  
  One way in some a is a third: of influence in the only not lift people to get to go over and though, you've been here although note, that a very welcome to create the experience of his grade B writing. Remember When grading rubric some amazing. Thank you realized that you can also choose a specific point, nothing Patrick Kavanagh on to think that time to survive and I text and only looking for someone this length before, I see whether you for your claims as from Ulysses; please let you and to think that you responding to do it is due to construct a little bit flat in a shame!
  
  Thank you would give him to try to text day for several of your writing is to structural issues, if many good student this is a strong essay, even stronger. Earlier period and not because I think that matters will change as a specific complex, set next two and nationalism and you won't have conceivably have more productive move from what you don't be one like, subject from your grade for you have her grad students attended and my all this a piece of his grade it's there are neither does Vladimir's line without me as the way that there: on specific points into this; information, allows them based on time period and I think too is for all in my comments in luck with grace and this leaves you too and was mentioned in terms to recite OK to get into Ulysses with a case that understanding of a up, for cues that you'll be interesting and insightful focused.
  
  While you'll note that the same grammatical problems occur in the longer paragraphs as in the shorter tweets, they're less noticeable, and don't impair the perceived meaning of the paragraphs to the same degree. I suspect that the reason for this is that the reader's models of understanding kick in in a different way in longer pieces of text, because multi-sentence constructions demand that the reader evaluate the connections between individual statements and thereby activate models of understanding that involve such things as logical progression, narrative structure, and so forth. Which is to say that part of our basic literacy training for those of us who are experienced readers teaches us to ask questions about relationships between statements and that, because we're experienced readers, we ask these questions pretty much automatically. But the fact of the matter is that, even when statements are automatically generated, the reader's mind will distinguish some logical pattern at least some of the time, and that perceiving such a pattern in a paragraph — even to the extent of having a sense of what the paragraph has said up to this point — helps to resolve interpretive questions and assists in the construction of an overall sense of meaning for the paragraph. This overall sense of meaning also tends to diminish the degree to which grammatical problems are perceived as causing hermeneutic difficulty. A single-sentence tweet — especially one restricted to 140 characters — though, doesn't (necessarily) activate the same hermeneutic requirement to evaluate the relationship between statements, so there's nothing to buffer it if it doesn't stand on its own.
  - Which is to say that doing this for Twitter using the current toolset is always going to be suboptimal (even though it works out OK a fair percentage of the time). There are probably two basically different ways to respond to this problem: either replace DadaDodo with another text generator that does a better job of generating grammatically correct short blocks of text, or else find a venue that supports publishing longer chunks of automatically generated text. I'm pondering both and would love input if anyone has any.
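
As a rough illustration of the "first-order approximation" for capitalization described above, a filter could post-process DadaDodo's output instead of changing DadaDodo itself. The sketch below is hypothetical and is not part of the current setup. It treats a word as a proper noun if the corpus never contains it in lowercase, leaves the first word of each sentence alone, and lowercases any other capitalized word. It only understands unaccented ASCII letters, and it collapses runs of whitespace as a side effect.

```shell
#!/bin/bash
# Hypothetical filter: $1 is the corpus file; the generated text is read
# from standard input and the adjusted text is written to standard output.
fix_capitalization() {
    awk '
        # First file (the corpus): remember every word that occurs in lowercase form.
        NR == FNR {
            for (i = 1; i <= NF; i++) {
                word = $i
                gsub(/[^A-Za-z]/, "", word)
                if (word != "" && word == tolower(word)) lowercase_seen[word] = 1
            }
            next
        }
        # Second input (stdin): lowercase mid-sentence words that the corpus
        # also contains in lowercase; words never seen in lowercase are kept.
        {
            sentence_start = 1
            for (i = 1; i <= NF; i++) {
                word = $i
                gsub(/[^A-Za-z]/, "", word)
                if (!sentence_start && word != tolower(word) && (tolower(word) in lowercase_seen))
                    $i = tolower($i)
                # The next word starts a sentence if this one ends with . ! or ?
                sentence_start = ($i ~ /[.!?]$/)
            }
            print
        }
    ' "$1" -
}
```

With a corpus that contains the word getting in lowercase somewhere, piping the first example tweet above through fix_capitalization /150/150.txt would turn Getting into getting while leaving He at the start of the sentence alone.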

All of which is to say, again, that I'd love feedback if people have thoughts or ideas about how this could be done better.

Change History

Before 2014-12-20: Some stuff happened. I wasn't yet tracking changes; everything was too fluid.
2014-12-20: Modified the script that generates this account so that it will no longer generate the same tweet more than once. The strategy used is simply to compare the new tweet against a list of all tweets and retry if it's been generated before. So long, this tweet. It's been fun.
2015-09-15: This document was superseded; the Bash script was replaced with a Python script. The new version of this document can be found here.
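
The duplicate check described in the 2014-12-20 entry can be sketched with grep. This is an illustration rather than the code that was actually added: the archive path, the ARCHIVE variable, and both function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical archive file holding one previously posted tweet per line.
ARCHIVE="${ARCHIVE:-/150/tweets.txt}"

already_tweeted() {
    # -F: treat the tweet as a fixed string, not a regular expression
    # -x: only match whole lines
    # -q: print nothing; report the result through the exit status
    [ -f "$ARCHIVE" ] && grep -Fxq -- "$1" "$ARCHIVE"
}

record_tweet() {
    # Append the tweet so that later runs can detect it.
    printf '%s\n' "$1" >> "$ARCHIVE"
}
```

A generation loop would call already_tweeted on each candidate and retry whenever it succeeds, then call record_tweet once a tweet has actually been posted.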

Code on this page by Patrick Mooney is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at http://is.gd/IrishLitAutoTweetsTechnical.