How the Irish Lit AutoTweets Twitter Account Is Generated
If you're looking for non-technical information about Irish Lit AutoTweets, you should go here, instead. The document you're reading right now contains technical information about how the tweets are generated.
I exported all of the course-related email that I wrote during the quarter into a folder on my local hard drive, saved the relevant web pages as text in the same folder, copied and pasted the course Twitter stream into a text file in that folder, and saved a text-only version of student paper comments (yes, into the same folder). Then I massaged the data in various ways to make it more amenable to text processing. I generated compiled probability data from the corpus using DadaDodo, saved the compiled data, opened a new Twitter account, wrote a script to generate sentences based on that compiled data and upload individual selections to that Twitter account, installed it as a cron job on my laptop that runs five times a day, and wrote a couple of web pages to describe what happens. You're reading one of those web pages now.
You can download and use the script if you'd like, subject to certain restrictions; read on for details. If you find it useful, I'd be grateful if you'd Flattr me.
All of this happens under Bodhi Linux. I tend to use the MATE desktop environment and Tilda terminal emulator, though that's not really relevant. Really, with some minimal modification, this should work on more or less any POSIX operating system.
I save all of my course-related emails, sent and received, for a quarter in a specific folder in Mozilla Thunderbird, my email client. I sorted the folder by sender and exported the 460 messages that I wrote using the ImportExportTools extension for Thunderbird. I saved all of the website documents as plain text with Firefox in the same folder, copied and pasted the Twitter stream into a text file that got saved in that folder, and exported the document in which I wrote summary comments as plain text (yes, into the same folder). Then I concatenated all of them using the standard POSIX cat command to produce the first draft of the textual corpus.
Then there was a long and boring period of searching and replacing to eliminate URLs, student names, smart quotes, carriage returns, double spaces, etc. etc. etc. Really, it's a pretty tedious process. Searching, replacing, and other editing were largely done with Bluefish, which is a pretty good text editor that supports searching for regular expressions and understands POSIX backslash sequences. I installed DadaDodo (sudo apt-get install dadadodo under Bodhi Linux, because it's an Ubuntu derivative).
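Some of the mechanical parts of that cleanup could also be done from the command line. What follows is only a sketch under assumptions, not what I actually did (I did the work by hand in Bluefish): the function name clean_corpus is made up for illustration, and it handles only carriage returns, URLs, smart quotes, and repeated spaces. Student names would still need to be removed separately.

```shell
# clean_corpus: hypothetical filter; reads raw text on stdin and
# writes cleaned text on stdout.
clean_corpus() {
    # Drop carriage returns first, then let sed do the line-by-line edits.
    tr -d '\r' | sed -E \
        -e 's#https?://[^[:space:]]+##g' \
        -e "s/‘/'/g" -e "s/’/'/g" \
        -e 's/“/"/g' -e 's/”/"/g' \
        -e 's/ {2,}/ /g'
}
```

It could be used along the lines of clean_corpus < raw.txt > 150.txt, where raw.txt is a placeholder name for the concatenated first draft.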
I moved the corpus (150.txt, because the course for which I did the work was English 150) into a folder, created a symbolic link with a short name (/150) on the root level of my drive to save myself some typing, and moved to /150 in my terminal. Once that was done, I compiled the textual corpus into a useful statistical set with:
dadadodo -w 10000 -o chains.dat -c 1 150.txt
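As a quick sanity check after compiling, something like the following could load the compiled data and print a few sample sentences. This is a sketch rather than part of the actual script: the function name sample_sentences and the CHAINS variable are made up for illustration, and the switches are the same ones listed further down this page.

```shell
# CHAINS: hypothetical variable naming the compiled data file.
CHAINS="${CHAINS:-/150/chains.dat}"

# sample_sentences: hypothetical helper that prints N generated
# sentences (default 3) from the compiled data.
sample_sentences() {
    local count="${1:-3}"
    if ! command -v dadadodo >/dev/null 2>&1; then
        echo "dadadodo is not installed" >&2
        return 127
    fi
    if [ ! -r "$CHAINS" ]; then
        echo "cannot read $CHAINS" >&2
        return 1
    fi
    # -l loads compiled data, -c sets the sentence count,
    # -w sets the wrap width (large, so each sentence stays on one line).
    dadadodo -c "$count" -l "$CHAINS" -w 10000
}
```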
After that, it was a matter of writing and testing a script that generates chunks of text that are within acceptable length parameters and sends them to Twitter. I signed up for a new Twitter account for the purpose and edited the profile to point back here. It used to be easily possible to post to Twitter from a terminal (emulator) using programs like cURL (here's how you used to be able to do it), but Twitter stopped allowing basic authentication and began requiring OAuth with API version 1.1, so it's now necessary to use a more full-featured terminal-based client. Doing a bit of research led me to choose TTYtter, more because of its ease of use than for its amusing name.
This involved saving the TTYtter script to a folder that's in my $PATH, then making it executable:
chmod +x ~/.scripts/ttytter
Running TTYtter for the first time causes it to go through its authentication procedure and create a token in my home directory called .ttytterkey. I moved this to /150 and call TTYtter from the script with -keyf=/150/.ttytterkey so that I can also use TTYtter without the -keyf switch to manage my personal Twitter account. (Yes, there are other ways to handle this issue, but this method is quick and it works.)
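A posting call that uses the relocated key file might be wrapped roughly like this. It is a sketch under assumptions: post_tweet, TTYTTER, KEYFILE, and DRY_RUN are names invented for illustration, and -status= is, as far as I know, TTYtter's switch for posting a single update and exiting, so check it against the TTYtter documentation before relying on it.

```shell
# Hypothetical settings; adjust the paths to match your own setup.
TTYTTER="${TTYTTER:-$HOME/.scripts/ttytter}"
KEYFILE="${KEYFILE:-/150/.ttytterkey}"

# post_tweet: hypothetical helper that posts its first argument.
# With DRY_RUN set, it prints the command's words, one per line,
# instead of contacting Twitter.
post_tweet() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$TTYTTER" "-keyf=$KEYFILE" "-status=$1"
    else
        "$TTYTTER" "-keyf=$KEYFILE" "-status=$1"
    fi
}
```

Running DRY_RUN=1 post_tweet "some text" is a harmless way to see what would be executed.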
Once all the pieces were in place, all I needed to do was to put the finishing touches on the script, which is just a quick Bash script that accomplishes a few things:
It generates text with dadadodo, using sed to strip the leading whitespace that dadadodo tends to put at the beginning of the generated text. History.about one out of every five times.) If the automatically generated tweet isn't an acceptable length, it just keeps trying until it generates one that is.
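The generate, strip, and retry logic described above can be sketched roughly as follows. This is not the downloadable script itself: the function names, the MIN_LEN lower bound, and the MAX_TRIES cap are all assumptions added for illustration (the description above says only that the script keeps trying until the length is acceptable).

```shell
MAX_LEN=140     # Twitter's limit as described on this page
MIN_LEN=20      # hypothetical lower bound
MAX_TRIES=100   # hypothetical safety cap on retries
CHAINS="${CHAINS:-/150/chains.dat}"

# Produce one sentence from the compiled data.
generate_candidate() {
    dadadodo -c 1 -l "$CHAINS" -w 10000
}

# Remove the whitespace that appears at the start of each line.
strip_leading_ws() {
    sed 's/^[[:space:]]*//'
}

# Succeed only if the argument is within the length bounds.
acceptable_length() {
    local len=${#1}
    [ "$len" -ge "$MIN_LEN" ] && [ "$len" -le "$MAX_LEN" ]
}

# Keep generating until a candidate fits, then print it.
next_tweet() {
    local tries=0 candidate
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        tries=$((tries + 1))
        candidate=$(generate_candidate | strip_leading_ws)
        if acceptable_length "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no acceptable candidate after $MAX_TRIES tries" >&2
    return 1
}
```

Note that the shell's length expansion counts characters or bytes depending on the locale, so the length check is only approximate for non-ASCII text.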
You're welcome to download and adapt and use the script yourself for your own purposes if you'd like, subject to certain conditions; it's licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. You'll need to make it executable (chmod +x generate.sh), of course. If you find it useful, I'd be grateful if you'd Flattr me, though this is (of course) not necessary. I'd love to hear about it if you do anything interesting with it, or if you have suggestions for improvements. I'm happy to answer questions, but can't provide extensive support or bring you up to speed if the explanation here is unintelligible to you.
Once I'd done that, I installed it as a cron job (crontab -e) so that it runs periodically (I decided on five times a day). Here's the line from my cron file that I use as of the time of this writing:
45 4,9,12,17,22 * * * /150/generate.sh
Which is to say that the script runs on my laptop at 4:45 and 9:45 a.m. and 12:45, 5:45, and 10:45 p.m. Provided that my laptop is on and connected to the Internet, of course.
I also installed it as an anacron job so that it runs at least once a day if my laptop is turned on. Here is the relevant line from /etc/anacrontab:
1 20 cron.daily /150/generate.sh
I often specify full paths in the script because cron jobs may not run in an environment where variables such as $PATH are set the way they are in my login shell.
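An alternative to spelling out every path is to give the script a known $PATH at the top and check for the tools it needs before doing anything else. The following is a sketch of that idea, not what the script actually does; require_tool is a made-up helper name, and the directory list is an assumption based on where things live on my machine.

```shell
# Put the directories the script depends on at the front of PATH,
# keeping whatever cron supplied after them.
PATH="$HOME/.scripts:/usr/local/bin:/usr/bin:/bin${PATH:+:$PATH}"
export PATH

# require_tool: hypothetical helper; fails with a message if the
# named program cannot be found on PATH.
require_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing required tool: $1" >&2
        return 1
    }
}
```

A script could then begin with lines such as require_tool dadadodo || exit 1, so that a misconfigured environment fails loudly instead of silently posting nothing.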
Switches for the dadadodo invocation:
-c 1
-l /150/chains.dat
-w 10000
Switches for the ttytter invocation:
-hold
cron job, but is handy if I wind up posting manually from a terminal.
-keyf=/150/.ttytterkey
He talked after Getting a pair; going to say in.
— Irish Lit AutoTweets (@IrishLitTweets) December 31, 2013
What's actually going on here? This is worth examining in some detail, I think. Though I am not a coder and cannot speak authoritatively about what the program is actually doing, my best guess is that the code has stripped out punctuation and noticed that the word "Getting", with that capitalization, after punctuation has been stripped, follows the word "after" at least once in the original corpus, and so, when "rolling the dice" while traversing the chains and landing on "after", it selects "Getting", with that particular capitalization. What I'd actually prefer, as a first-order approximation, is for the program to notice that some words are always capitalized and treat them as proper nouns, and only to capitalize other words when they appear at the beginnings of sentences. Of course, this is hardly an ideal solution for a lot of reasons, so I'd also be happy to have the program treat various types of capitalization as completely different words and pay more attention to when various words occur in the original corpus, noticing, for instance, that some words only ever occur at the beginning of sentences. The current system, though, would benefit from not producing randomly capitalized words in the middle of sentences, one way or another.
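One crude way to approximate the first-order fix described above, without touching DadaDodo itself, would be to post-process each generated sentence. The sketch below is purely illustrative and is not part of the current script: fix_midsentence_caps and PROPER_NOUNS are invented names, the word list is a tiny hand-picked sample, and a real list would have to be derived from the corpus.

```shell
# PROPER_NOUNS: hypothetical space-separated list of words that
# should keep their capital letter wherever they appear.
PROPER_NOUNS="${PROPER_NOUNS:-I Ulysses Godot Stephen Tuesday}"

# fix_midsentence_caps: hypothetical filter. Lowercases the first
# letter of any word that is not at the start of a sentence and is
# not in PROPER_NOUNS. Runs of whitespace collapse to single spaces.
fix_midsentence_caps() {
    awk -v keep="$PROPER_NOUNS" '
    BEGIN {
        n = split(keep, k, " ")
        for (i = 1; i <= n; i++) proper[k[i]] = 1
    }
    {
        for (i = 2; i <= NF; i++) {
            # Skip words that follow sentence-ending punctuation.
            if ($(i - 1) ~ /[.!?]$/) continue
            # Compare only the leading run of letters to the list.
            if (!match($i, /^[A-Za-z]+/)) continue
            bare = substr($i, RSTART, RLENGTH)
            if (bare in proper) continue
            $i = tolower(substr($i, 1, 1)) substr($i, 2)
        }
        print
    }'
}
```

Piping the first example tweet above through this filter would turn "Getting" into "getting" while leaving the sentence-initial "He" alone. It does nothing about the deeper problem of how the chains treat capitalization.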
Not taking his consciousness and you'll want me on Tuesday, morning!
— Irish Lit AutoTweets (@IrishLitTweets) December 30, 2013
Nowhere in the original corpus are the consecutive words "Tuesday" and "morning" separated by a comma; DadaDodo is using some heuristic to evaluate when to throw punctuation into the mix. But the rules for comma usage in English, though elaborate, are actually rather straightforward, and there is no reason to throw a comma in here; it makes the clause itself into meaningless gibberish. Again, fixing this would be an incredibly complex issue, and would require that DadaDodo come closer to "understanding" what the sentences it generates mean, or at least that it have some sense of grammar while generating the tweets, and it is not immediately obvious to me how this could be easily accomplished while simply traversing a Markov chain. Still. If anyone has a suggestion for a better automatic text generator, I'd love to hear about it.
Twitter's 140-character limit really constricts what can be done here. Playing around with DadaDodo and the existing corpus shows (well, in my own opinion, anyway) that longer sections of text "feel more like real paragraphs." For an example, take a look at some longer paragraphs generated from the same corpus but not restricted to 140 characters:
That's fine and Godot from more analytically than of a good job this happens, you to be change be. Excellent sense, of or exam is needed to do you. Perfect and haven't yet a good opportunity to get back through to specific instances to Dexter one other people haven't yet and brief Notes.
This matter and doesn't always, a particular texts, me, know if you're using as for the course edition of these would help and other pair's textual grammatical problems. The citation related to gesture toward an overview of expectations do you need one, is going on the midterm after thinking about how many small lucid and I'm sorry that your plans even more. I'll see a keystroke without engaging in pretty solid job here, though you discover that the grade. All students reading of suboptimal or points of anyone have if you should be leaving early stakes of course a lot of a narrow, rural, frequently describes the final, please let me with father, and Futurism, it's on Raglan Road too that they will re work to the West beginning of Stephen, and the opportunity to how to make discussions about the day as berries part of phrasing and that you might be refined a look at the class, isn't for the way that more effectively, not something genuinely Private.
One way in some a is a third: of influence in the only not lift people to get to go over and though, you've been here although note, that a very welcome to create the experience of his grade B writing. Remember When grading rubric some amazing. Thank you realized that you can also choose a specific point, nothing Patrick Kavanagh on to think that time to survive and I text and only looking for someone this length before, I see whether you for your claims as from Ulysses; please let you and to think that you responding to do it is due to construct a little bit flat in a shame!
Thank you would give him to try to text day for several of your writing is to structural issues, if many good student this is a strong essay, even stronger. Earlier period and not because I think that matters will change as a specific complex, set next two and nationalism and you won't have conceivably have more productive move from what you don't be one like, subject from your grade for you have her grad students attended and my all this a piece of his grade it's there are neither does Vladimir's line without me as the way that there: on specific points into this; information, allows them based on time period and I think too is for all in my comments in luck with grace and this leaves you too and was mentioned in terms to recite OK to get into Ulysses with a case that understanding of a up, for cues that you'll be interesting and insightful focused.
While you'll note that the same grammatical problems occur in the longer paragraphs as in the shorter tweets, they're less noticeable, and don't impair the perceived meaning of the paragraphs to the same degree. I suspect that the reason for this is that the reader's models of understanding kick in in a different way in longer pieces of text, because multi-sentence constructions demand that the reader evaluate the connections between individual statements and thereby activate models of understanding that involve such things as logical progression, narrative structure, and so forth. Which is to say that part of our basic literacy training for those of us who are experienced readers teaches us to ask questions about relationships between statements and that, because we're experienced readers, we ask these questions pretty much automatically. But the fact of the matter is that, even when statements are automatically generated, the reader's mind will distinguish some logical pattern at least some of the time, and that perceiving such a pattern in a paragraph — even to the extent of having a sense of what the paragraph has said up to this point — helps to resolve interpretive questions and assists in the construction of an overall sense of meaning for the paragraph. This overall sense of meaning also tends to diminish the degree to which grammatical problems are perceived as causing hermeneutic difficulty. A single-sentence tweet — especially one restricted to 140 characters — though, doesn't (necessarily) activate the same hermeneutic requirement to evaluate the relationship between statements, so there's nothing to buffer it if it doesn't stand on its own.
All of which is to say, again, that I'd love feedback if people have thoughts or ideas about how this could be done better.
Code on this page by Patrick Mooney is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at http://is.gd/IrishLitAutoTweetsTechnical.