Suspect that the single letters appearing as words come from periods used in acronyms (and other non-sentence-terminating uses of periods). I think that the best solution is probably to replace the offending periods with another character, probably one that looks like the period character (such as U+2024, “․”, ONE DOT LEADER), then replace any instances of that character in the output text with a real period before shipping to Tumblr.
This has not yet happened on ch. 1-6 of Ulysses, which have already been shipped out for the current set (#002), but I can start with Aeolus (which, off the top of my head, certainly contains the dotted acronyms “K.M.A.” and “K.M.R.I.A.”), move forward with other chapters, and hopefully remember to fix 1-6 on or before the next run’s start.
On reflection, an ideal search approach might use two patterns:

\.\S — a period followed by non-whitespace; selectively replace these periods with ․, which is U+2024.

[A-Z]\. — a capital letter followed by a period; check whether any of those periods should also be replaced with U+2024.

“Other processing” involved dedenting lines of song and formatting them as real sentences; correcting miscellaneous errors here and there; and correcting a few errors in paragraph breaks.
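The replace-then-restore idea above can be sketched in Python. This is a minimal sketch: the real workflow applies the replacement selectively, with manual review of each match, rather than blindly as this helper does.

```python
import re

ONE_DOT_LEADER = "\u2024"  # "․", visually similar to a period

def protect_abbreviation_periods(text):
    """Replace periods that don't end sentences (e.g. in dotted acronyms
    like "K.M.R.I.A.") with ONE DOT LEADER so downstream sentence logic
    ignores them.  Uses the heuristic described above: a period followed
    immediately by a non-whitespace character."""
    return re.sub(r"\.(?=\S)", ONE_DOT_LEADER, text)

def restore_periods(text):
    """Swap the placeholder back to a real period before publishing."""
    return text.replace(ONE_DOT_LEADER, ".")
```

Round-tripping a dotted acronym through both functions returns the original text, which is what makes the placeholder safe to ship through the generator.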
Kept this modified 07.txt in /UlyssesRedux/corpora/joyce/ulysses/ and created a directory /UlyssesRedux/corpora/joyce/ulysses/07, which now contains two files, headlines.txt and non-headlines.txt. I am about to modify the script 7.py to use them to construct Aeolus-like chapters for the Aeolus portion of the run.
OK, it seems that was basically successful. An “Aeolus” has been posted.
Tweaked the Markov generator code to treat U+2024 like other punctuation, as above. Seems to be working. My initial thought was to swap these out with real periods on the way out, but that didn’t happen with this chapter.
Tweaked the capitalization logic a bit, too.
Maybe the em dash should be treated as a token, too; that would allow for it to appear in the output, for one thing. It’s a noticeable feature of Ulysses. Perhaps one way to integrate other texts would be to pre-process them by replacing quote marks with em dashes … though probably not automatically, because they’re not perfectly interchangeable. Sigh.
Anyway, Lestrygonians has been preprocessed. About to let that script run.
Adding the em dash to the list of punctuation tokens; it’ll make “Scylla and Charybdis” work better.
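Treating the em dash (and, per the earlier note, U+2024) as standalone punctuation tokens might look something like this. The names here are mine, not the project’s actual tokenizer:

```python
import re

# Punctuation treated as standalone tokens; the em dash is included so it
# can appear in generated output, and U+2024 (ONE DOT LEADER) is treated
# like other punctuation, as described above.
PUNCT_TOKENS = ".\u2024,!?;:\u2014"

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    punct = re.escape(PUNCT_TOKENS)
    return re.findall("[%s]|[^\\s%s]+" % (punct, punct), text)
```

Because each punctuation mark becomes its own token, the Markov model can learn where em dashes occur and emit them in generated text.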
Should preprocess the texts by replacing multiple dashes with the em dash, probably. Adding that to the list from 5 Dec. results in:
Posted “Scylla and Charybdis.” Updated the description web page a bit. It’s still not linked to the rest of my web site; I’m consciously holding off on that until the blog is self-running and there are multiple editions of Ulysses Redux posted.
Posted “Wandering Rocks.” The algorithm generates Markov chains for each of the nineteen sections by training the model based on the equivalent section in Joyce’s “Wandering Rocks,” plus the sections before and after it (wrapping around if necessary). First run suggests that this just doesn’t give enough variety for the algorithm to avoid repeating itself – maybe add one additional section on each side (so target section plus/minus 2), perhaps weighting more heavily for the target section. Of course, mixing in another text will help to ameliorate this, as well.
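The section-windowing with wraparound described above, including the proposed widening to ±2, reduces to modular arithmetic. A sketch (names are mine):

```python
def training_sections(target, num_sections=19, radius=1):
    """Pick the (0-based) section indices used to train the model for
    `target`: the target section plus `radius` neighbors on each side,
    wrapping around at the ends of the chapter's nineteen sections."""
    return [(target + offset) % num_sections
            for offset in range(-radius, radius + 1)]
```

Weighting the target section more heavily could then be done by simply repeating its index in the returned list.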
The “Wandering Rocks” script froze twice out of four attempts to run it: once when run directly and once when run by daily-script.py. I haven’t yet figured out why, but I’ll keep my eye on it.
Posted a “Sirens.” That was a chapter that didn’t need much preprocessing. The script is currently the same as most of the others, except for the text-description constants.
Posting a “Cyclops.” Lots of periods replaced with one-dot leaders; Joyce here loves his initials for names and titles.
Just realized, too, that I’ve been underestimating the length needed for each chapter: I’ve been misreading the stats files I’m generating with the generic script (currently all chapters through 12 except 7 and 10), mistaking the number of paragraphs for the number of sentences. Sigh. However, I guess I’ll let it stand here on the theory that Ulysses Redux #002 is still a trial run, and the scripts can be corrected later; the fix will of course mean that they take longer to run.
Starting to suspect that a Markov length of three is too long to avoid coughing out entire sentences from Ulysses; Joyce’s text does an unusually strong job of avoiding repetition of grammatical structures that support generation of random text via Markov chains (i.e., with a chain length of three, there aren’t enough nodes that have more than one path out of them). We’ll see if this continues to be the case over the next few chapters.
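The worry about chain length can be checked directly by counting how many states in the Markov mapping have more than one possible continuation. A sketch (function name is mine):

```python
from collections import defaultdict

def branching_stats(tokens, chain_length=3):
    """Build a Markov mapping at the given chain length and report what
    fraction of states have more than one possible next token.  A low
    fraction means the generator will mostly replay the source verbatim."""
    followers = defaultdict(set)
    for i in range(len(tokens) - chain_length):
        state = tuple(tokens[i:i + chain_length])
        followers[state].add(tokens[i + chain_length])
    if not followers:
        return 0.0
    branching = sum(1 for s in followers.values() if len(s) > 1)
    return branching / len(followers)
```

Comparing this fraction at chain lengths 2 and 3 on a given chapter would show concretely how much freedom the generator actually has.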
Edited the “Nausicaa” source text (practically no prep needed). Posted a “Nausicaa” to the website.
I’m dreading “Oxen of the Sun” (because of its shifting vocabulary as the chapter progresses, and because of the anticipated difficulty of finding texts to pair with it) and “Circe” (because of its length and structure).
Decided to just go ahead with treating “Oxen” with the whole-chapter approach to see how it goes; not much pre-processing was needed.
Similarly, decided just to go ahead with processing “Circe” without regard to its dramatic structure, treating it as a textual blob, for this first pass. Probably a bad idea, but running the 15.py script set up like this already takes nearly eleven and a half minutes. So posted: indeed, it’s a ridiculous (though not in an amusing way) mishmash, as expected. A particular problem in this chapter: parentheses are getting dropped, which I suspect is why capital letters are showing up in the middle of sentences here.
A “Eumaeus” has been posted. Irruptions of capital letters continue, in part because there are abbreviations I’m not catching with my regexes (e.g., on line 181 of the current version of “Eumaeus,” there is a reference to the “s. s. Lady Cairns of Swansea”, which doesn’t have capital letters but does have spaces after the periods – there’s something to follow up on). In any case, this is probably going to be a matter of tracking things down individually. Eventually.
I should really assemble a list of words to be search-and-replaced automatically for some of these problems. Next pass, maybe.
Posted an “Ithaca.” It turned up a bug I thought I’d fixed: the Markov sentence generator only regarded periods as ending sentences. Added ! and ? to that list, which should ease processing time in “Aeolus” as well (though it was the question half of the question-and-answer structure of “Ithaca” that really threw this into relief). I was certain I’d made this change more than a week ago, but apparently not.
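The fix amounts to checking each generated token against a set of sentence-enders rather than against the period alone. A sketch with invented names; the real generator’s interface differs:

```python
import random

SENTENCE_ENDERS = {".", "!", "?"}  # "?" and "!" now end sentences, not just "."

def generate_sentence(mapping, start_state, max_tokens=50):
    """Walk a Markov mapping (state tuple -> list of possible next tokens)
    until a sentence-ending token is produced, or the mapping dead-ends,
    or max_tokens is reached."""
    state = start_state
    out = list(state)
    for _ in range(max_tokens):
        choices = mapping.get(state)
        if not choices:
            break
        token = random.choice(choices)
        out.append(token)
        if token in SENTENCE_ENDERS:
            break
        state = state[1:] + (token,)
    return " ".join(out)
```

In a question-heavy chapter like “Ithaca,” recognizing “?” as an ender means generation stops much sooner, which is where the processing-time savings come from.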
Giving “Penelope” a shot. … Interesting. OK, that’s Ulysses Redux #002.
First big chunk of a script to (more or less) automatically wrap things up when an entire novel has been written (which will eventually be every three weeks); that script is utility-scripts/postprocess-set.py (because it … post-processes a set). Basically, it transforms the HTML fragment at /UlyssesRedux/current-run/index.html into a real HTML5 file with an appropriate name in the directory at /~patrick/projects/UlyssesRedux/contents/ so it can be posted to the web site. It draws data from /UlyssesRedux/current-run/data.csv, prompting for anything it needs that’s missing; walks through syncing and committing a change, then pushing it to the appropriate branch on GitHub; constructs the HTML file, then runs it through HTML Tidy (currently v5.1.32); and offers to sync the website’s data (all of it, currently, including the big chunks that have nothing to do with Ulysses Redux).

It doesn’t currently link the newly generated table of contents to the meta-table of contents at /~patrick/projects/UlyssesRedux/contents/index.html; delete the file at /UlyssesRedux/current-run/index.html; set up the new tags file at /UlyssesRedux/current-run/temporary-tags; or otherwise set up for the next run, although it arguably should (and maybe will in the future). Still, at the moment, it’s a useful way to postprocess.
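The fragment-to-HTML5 step could be sketched like this. This is my reconstruction, not the actual postprocess-set.py code, and it assumes the tidy binary is on $PATH:

```python
import subprocess

HTML5_SHELL = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>{title}</title></head>
<body>
{fragment}
</body>
</html>"""

def fragment_to_html5(fragment, title):
    """Wrap an HTML fragment in a minimal HTML5 shell and clean it up by
    piping it through HTML Tidy."""
    page = HTML5_SHELL.format(title=title, fragment=fragment)
    result = subprocess.run(["tidy", "-quiet", "-utf8", "-indent"],
                            input=page, capture_output=True, text=True)
    # tidy exits 1 for warnings, 2 for errors; warnings are acceptable here
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Keeping the shell template separate from the tidy call makes it easy to test the wrapping without requiring tidy to be installed.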
The daily script (daily-script.py) was modified to read temporary tags from /UlyssesRedux/current-run/temporary-tags instead of having them hard-coded into the script.
I’m thinking about making this particular file (yes, the one I’m writing right now) available as a coding journal; it could be converted to HTML with pandoc and output somewhere in /~patrick/projects/UlyssesRedux/.
Starting a run on Ulysses Redux #003. At this point, I don’t anticipate any major changes to chapter scripts other than for “Circe,” but we’ll see. This run will again just be a remix of Joyce’s Ulysses, though hopefully a slightly better executed one, with small tweaks.
This involves preprocessing the first six chapters again, in the same order as mentioned on 8 Dec., but with additional steps:

\.\S — a period followed by non-whitespace; selectively replace these periods with ․, which is U+2024.

[A-Z]\. — a capital letter followed by a period; check whether any of those periods should also be replaced with U+2024.

The new run has been set up (I think), and I’ll preprocess “Telemachus” and let ’er rip. Maybe I’ll get through “Nestor” tonight, too.
“Telemachus” is posted. Let’s preprocess “Nestor,” then. – Done, and posted. OK, “Proteus,” then. Processed and posted. Ditto “Calypso.” Also “Lotus Eaters.” Then, because I was distracted, “Hades,” but without preprocessing; though I’ve now processed the input text for “Hades” after the fact (I kept it since I like the first paragraph, you know). And I’m letting “Aeolus” run on my way to bed … and timing it.
“Aeolus” took just over 12 minutes, which is an improvement of maybe 11% due to treating ? and ! as sentence-enders. Posted a “Lestrygonians” (1m57.926s), “Scylla and Charybdis” (2m17.954s), “Wandering Rocks” (0m37.543s). (“Wandering Rocks” included its debugging output. Sigh. Edited. Thought I’d turned that off, but there was some Git fuckery that may have restored the old version. Turned it off again, for good this time, I hope.) Posted a “Sirens” (1m39.132s), “Cyclops” (2m20.456s), “Nausicaa” (0m28.154s), “Oxen of the Sun” (0m17.988s).
Some observations: the chapter scripts that run quickly have at least one of two characteristics: they have comparatively long paragraph lengths (and therefore are disproportionately affected by the currently easy-to-fix-but-not-yet-fixed “number of paragraphs is treated as number of sentences” bug); or else they are generated by scripts that look at smaller chunks of text at once than the whole chapter (and therefore generate much smaller Markov dictionaries).
Making a preliminary analysis of “Circe” (code in /UlyssesRedux/code/utility-scripts/analysis-1-ch-15.py). Alongside this process, did some additional preprocessing on “Circe” that had not been done before, including reducing songs/poetry to prose form (as there is currently no algorithmic method in the scripts for dealing with poetry), and similar reductions of other multi-paragraph (i.e., multi-line) speaking bits to single paragraphs. (My intent has been to move stage directions that are actually part of a character’s lines into that paragraph, whereas stage directions that are not part of a character’s lines get their own paragraph. I’m not certain this was done perfectly, however.)
Developed a script that splits ‘Circe’ into multiple corpus files. There are more than three dozen, but they fall into three categories:
There’s also a structure file automatically generated by the same script that’s used to post the “Circe” chapters. Finally, there’s a new 15.py, the Circe script, that (at first blush) generates a pretty good text. Posted a Circe (it takes not quite 13 minutes for the script to run), then posted the last three chapters, concluding Ulysses Redux #003.
Wrote a script to set up a new run. Modified the postprocessing script to offer to call this setup script. As a test, running through an entire novel to create Ulysses Redux #004, with the same parameters as the previous novel; found the odd bug in tagging that I’m fixing along the way. Total took almost exactly 46 minutes (but, again, that’s with the paragraphs-for-sentences erroneous substitution). I’m about to correct that problem in the chapter scripts that have it, then re-run to produce Ulysses Redux #005. I’m also reducing the Markov chain length to 2 to see how that turns out. Parameters are otherwise identical; we’ll see if this has gotten various bugs out of the setup/postprocessing scripts. Produced a new script that produces 18 chapters in one go; I’ll produce #005 with this and see how long it takes with the new parameters. (Answer: just a hair over 48 minutes.) Produced Ulysses Redux #006 as well to see how well it worked (answer: apparently, quite well).
Have been letting daily-script.py run as an anacron script to test for several days; it seems to be running well. Pushed the rest of the novel through to test the slightly new version of /UlyssesRedux/code/utility-scripts/postprocess-set.py, which (hopefully) adds the new TOC to the meta-TOC.
Running Ulysses Redux #008 straight through to test the new system. “Wandering Rocks” seems to be the fastest script to run — probably because of the much-reduced corpus that it has relative to the length of the chapter as a whole. Once again, the chapters with the more involved script take the longest to run. (But, then, this really only matters when we’re producing an entire novel at once.) Time taken by the whole run was just under 50 minutes.
Refactored code so that all of the file-location constants are in directory_structure.py. Did some other code cleanup. Ran Ulysses Redux #009 and Ulysses Redux #010 to test the refactored code.
In “Scylla and Charybdis,” replaced the em dashes indicating song with quotation dashes (the only phrase this occurred in was on line 196 of the source, and it produced such output as “— ri— a in ex— cel— sis De— o”). This is another “punctuation marks with multiple meanings in English and the source text aren’t fully understood by the generator” problem, as with the ONE DOT LEADER workaround.
Milestone: the first version of Ulysses Redux written entirely by cron scripts went off well. Posting table of contents, etc. to the website.
Revised the architecture of the script system: it’s now properly a Python module package with __init__.py files and everything. This implies a number of changes:

- Chapter scripts are now invoked via the subprocess module.
- Their output is collected from stdout.
- Paths in directory_structure.py were updated, both to reflect new directory/module names and to avoid ending pathnames with a slash.
- Aside from directory_structure.py, changes happened primarily in daily_script.py, though there were changes in a few of the chapter scripts, too.

Also, the tagline at the end of “Penelope” now includes the current year.
Ran the entire rest of the current novel, which is Ulysses Redux #012.
Then I altered chapter scripts 1–6 and re-ran an entire book for testing purposes. That made Ulysses Redux #013, which seemed to go well.
Revised daily_script.py so that the actual posting is done by a new unit in my personal library, social_media.py. This is part of a general move toward code reuse and functional abstraction. Several tests suggest that it’s working just fine. On the back end, this means I don’t have to maintain two repositories (one functioning, one for public use) in order to avoid giving up my authentication constants any more, which makes my life a lot simpler.

utility_scripts/sync-code.sh has been removed, as there’s no need for it. utility_scripts/postprocess_set.py has also been revised so that it doesn’t try to sync. Ulysses Redux #016 is done running. Holding off on resetting it so that I can actually set up the ability to remix Ulysses with other texts, as the blog is actually intended to do. This will probably take a few days.
Wrote a pair of scripts, compare_texts.py and reverse_compare_texts.py, which are intended to help me evaluate how to match up mix-in texts with chapters of Ulysses. After all, what will make a good combination, especially one that can run more or less unsupervised? (Letting the blog run unsupervised during individual runs, for eighteen days at a time, is a medium-term goal: I’d like to just drop mix-in texts in appropriate folders and not have to think about it for two and a half weeks at a time.) The current criterion, which is clearly suboptimal, is vocabulary overlap: what percentage of the words that occur in, say, “Telemachus” occur in each of the texts under consideration? (This is what compare_texts.py evaluates, for each chapter of Ulysses.) Similarly, for any given text under consideration, which chapters of Ulysses share the most vocabulary with that text? (That is reverse_compare_texts.py.) There is also a suggest_pairings.py, but all it does at this time is run both of the previous scripts.
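The vocabulary-overlap criterion can be sketched as a set computation. This is a reconstruction of the idea; compare_texts.py’s actual code may differ:

```python
import re

def vocabulary(text):
    """The lowercased word set of a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def vocab_overlap(chapter, candidate):
    """Fraction of the chapter's vocabulary that also occurs in the
    candidate mix-in text -- the (admittedly crude) matching criterion
    described above."""
    chapter_words = vocabulary(chapter)
    if not chapter_words:
        return 0.0
    return len(chapter_words & vocabulary(candidate)) / len(chapter_words)
```

Swapping the two arguments gives the reverse comparison, which is all that separates the two scripts conceptually.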
My best guess is that a better strategy than just looking at vocabulary overlap would involve looking at MARKOV STATE overlap at various chain lengths; perhaps the comparison could be done both ways, the scores of OVERLAP(a,b) and OVERLAP(b,a) could be multiplied together, and an ideal chain length could be found by trying to peg … what, average? root-mean-square? … scores for this matrix to some experimentally determined ideal value. It’s not merely a matter of maximizing overlap. Switching should be a comparative rarity, I suspect.
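The proposed state-overlap metric could start from something like this (a sketch; names are mine):

```python
def state_set(tokens, chain_length):
    """All Markov states (token n-grams) of a text at a given chain length."""
    return {tuple(tokens[i:i + chain_length])
            for i in range(len(tokens) - chain_length + 1)}

def symmetric_state_overlap(tokens_a, tokens_b, chain_length):
    """OVERLAP(a, b) * OVERLAP(b, a), where OVERLAP(x, y) is the fraction
    of x's states that also occur in y.  Scanning this over several chain
    lengths would be the first step toward finding the ideal length."""
    states_a = state_set(tokens_a, chain_length)
    states_b = state_set(tokens_b, chain_length)
    if not states_a or not states_b:
        return 0.0
    shared = len(states_a & states_b)
    return (shared / len(states_a)) * (shared / len(states_b))
```

Multiplying the two directional overlaps penalizes lopsided pairings, where one text swallows the other.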
Setting up for #016; the current mix-in texts come from the King James bible.
Finally got to rewriting scripts to accommodate remixing. Ran a “Telemachus”; noticed that it didn’t have any tags. This probably extends back to much of Ulysses Redux #016, because the root of the problem is passing a list of tags to the social_media unit instead of a comma-separated string, and this probably happened when I rewrote daily_script to go through social_media. Sigh. I need to go back through and manually tag (at least some of? most of? all of?) #016, but probably won’t get to it tonight.
Turns out that the last 17 chapters of #016 needed to be manually tagged. Done.
Finished altering scripts to run with mixin classes. Current problem: mixing in texts of different sizes than the current chapter of Ulysses might wind up washing out Joyce’s voice, which should be the anchor point of each chapter, even if glossolalia occasionally means that other voices come to the forefront. The current solution, which is probably not optimal, involves repeating the chapter of Joyce currently being considered enough times to make it longer (currently by approximately 1.4×; let’s see how that works) than the sum total of the mix-in texts. This will have the effect of greatly strengthening the Joyce-based chains and patterns, I think.
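The repetition scheme can be sketched as follows. The function name and the rounding choice are mine; the journal doesn’t specify how the fractional repetition count is rounded:

```python
import math

def build_mixed_source(joyce_text, mixin_texts, joyce_ratio=1.4):
    """Build a combined training text in which the Joyce chapter is
    repeated until it is roughly `joyce_ratio` times as long as all the
    mix-in texts together."""
    mixin_total = sum(len(t) for t in mixin_texts)
    repeats = max(1, math.ceil(mixin_total * joyce_ratio / len(joyce_text)))
    return "\n".join([joyce_text] * repeats + list(mixin_texts))
```

Rounding up (rather than truncating) guarantees at least one copy of the Joyce text whenever the mix-ins are non-empty.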
Many scripts are clearly running much slower than they used to: well, they’re dealing with much more text than before. So far, this is acceptable for a script that runs once a day, early in the morning, but let’s keep an eye on it.
Tried to run a whole rest-of-the-novel; immediately found two errors: everything after “Nestor” was based on “Nestor” instead of the appropriate chapter (aargh), and the fixing-the-tagging bit accidentally broke how tags were being recorded to current-run/index.html (every letter in every tag was recorded as a separate tag, alas, which produced comparatively huge files). Both should now be fixed: by deleting back to “Nestor,” and by manually editing index.html.
Letting a rest-of-the-novel run continue as I go to bed. Will re-run tomorrow with identical parameters, I think, and time the whole execution to see what the total execution time is.
Letting it run overnight found a minor script error that made the engine choke. Re-running it in the morning found several more. Finished out the run; setting up for the next one. This time, I hope to let it run overnight and see just how long everything takes.
Here are the timing stats reported on my (admittedly rather low-powered) laptop:
05:22:07 patrick@linawesome code$ time for num in $(seq 1 18); do time ./daily_script.py; done
INFO: About to run script chapter_scripts.ch01.py.
real 4m36.721s
user 4m31.925s
sys 0m2.320s
INFO: About to run script chapter_scripts.ch02.py.
real 2m24.674s
user 2m20.252s
sys 0m1.197s
INFO: About to run script chapter_scripts.ch03.py.
real 3m28.969s
user 3m25.342s
sys 0m1.465s
INFO: About to run script chapter_scripts.ch04.py.
real 4m46.893s
user 4m42.613s
sys 0m1.807s
INFO: About to run script chapter_scripts.ch05.py.
real 5m9.340s
user 5m4.519s
sys 0m2.047s
INFO: About to run script chapter_scripts.ch06.py.
real 10m18.253s
user 10m8.179s
sys 0m5.467s
INFO: About to run script chapter_scripts.ch07.py.
real 53m40.071s
user 53m28.927s
sys 0m1.753s
INFO: About to run script chapter_scripts.ch08.py.
real 18m50.095s
user 18m22.984s
sys 0m17.560s
INFO: About to run script chapter_scripts.ch09.py.
real 11m42.685s
user 11m29.556s
sys 0m6.688s
INFO: About to run script chapter_scripts.ch10.py.
real 7m21.747s
user 7m15.959s
sys 0m2.235s
INFO: About to run script chapter_scripts.ch11.py.
real 15m48.669s
user 15m35.874s
sys 0m8.179s
INFO: About to run script chapter_scripts.ch12.py.
real 25m48.812s
user 25m27.617s
sys 0m15.676s
INFO: About to run script chapter_scripts.ch13.py.
real 8m18.395s
user 8m9.342s
sys 0m5.063s
INFO: About to run script chapter_scripts.ch14.py.
real 8m19.324s
user 8m11.105s
sys 0m5.384s
INFO: About to run script chapter_scripts.ch15.py.
real 77m37.221s
user 77m19.195s
sys 0m4.189s
INFO: About to run script chapter_scripts.ch16.py.
real 8m25.231s
user 8m18.126s
sys 0m3.206s
INFO: About to run script chapter_scripts.ch17.py.
real 9m6.635s
user 8m59.839s
sys 0m2.519s
INFO: About to run script chapter_scripts.ch18.py.
real 0m10.410s
user 0m8.740s
sys 0m0.156s
real 275m54.152s
user 273m0.095s
sys 1m26.917s
09:58:18 patrick@linawesome code$
Postprocessing; uploading the table of contents for this run; etc.
Some changes made to scripts here, of course. Primarily, this involves prepping for the possibility of per-run, by-chapter tags and descriptions, which I’ll store in current-run/data.csv. daily-script.py has been updated with partial support for this, and I’ll try to finish this off before going to bed tonight. There’s a new unit, code/utility_scripts/gen_utils.py, that’ll have the code to support this; various things are going to be moved over there.
Modified utility_scripts/suggest_pairings.py so that it actually suggests pairings. Running out the current set (Ulysses Redux #019) so that I can test another KJV-based run with pairings that were auto-assigned by this script.
Have been running out Ulysses Redux #020 to test new scripts. Played with the “Penelope” script; trying out a model that looks at the length of the whole text instead of the number of sentences. (Of course, the problem is that Joyce himself used only eight sentences, but they were very long; this basically worked as long as the only text on which the generator was trained for “Penelope” was “Penelope” itself. But mixing in other texts means that the algorithm can terminate sentences much more easily, so the last few “Penelopes” have been quite short.) The new formula is better than nothing but needs to be rethought. Some other day.
There had been a brief interruption as I reinstalled my operating system; forcing the end of the current novel today to compensate for this. Minor changes that are reflected only in Git commit messages accompany this. Running the first chapter of a new iteration immediately.
I’m on the road, traveling. In order to facilitate continued postings for UlyssesRedux (and my other automatic text projects) while my laptop moves around with me and is, most of the time, offline, I’ve moved them to a virtual machine that draws off of a programming/ folder in my Dropbox folder, so that the virtual machine, running on another Internet-connected laptop that I leave at home, continues to post, but I can still make any changes that might be necessary. So far, the “continues to post” part has worked; I’m about to run the postprocess-set.py script to set up another run. Let’s see how that goes.
Well, setting up a new run while in a different state from the computer doing the posting worked well; the new virtual machine appropriately read from the synced Dropbox folder and continued to operate normally. However, over the last few days, it turns out the virtual machine had crashed and I didn’t notice. I’ll keep a closer eye on that in the future.
Ran the chapters that should have run over the last four days, then completed the novel. I’m starting a new run using Samuel Butler’s translation of The Odyssey as a mix-in text.
Manually forcing the end of the novel by running two scripts manually so I can set up the next run before leaving town for the weekend. Homer really predominates in the chapters from this run that I looked at; where is Joyce’s voice? I’m going to re-run the Butler Odyssey as the mixin text for the next run, so let’s see how that goes.
The previous run was over for several days but had not been finalized; I was trying to figure out why Joyce’s own text was apparently dropping out entirely from some chapters before letting the blog continue. Turns out that the affected chapters all called generic_chapter.py to do their work, though not all chapters that did so were affected. The problem: in trying to control the ratio of Joyce to non-Joyce, the algorithm calculates an integer indicating how many times the Joycean elements should be repeated when constructing a mixed-source training text. However, it was possible for the non-Joycean texts to be so much longer, and/or for the joyce_ratio to be small enough, that this integer rounded down to zero, causing the Joyce texts to drop out completely. Fixed that by pinning the joyce_scale_factor to a minimum of one, which will hopefully eliminate the problem. Added a bunch of debugging code to trace what’s going on in buildMapping_withMixins(); leaving it in (it requires the module-wide variable debugging to be set to True to activate).
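The fix reduces to a one-line floor. This is a sketch of the shape of the bug and its repair; the actual joyce_scale_factor computation in generic_chapter.py has more context around it:

```python
def joyce_scale_factor(joyce_len, mixin_len, ratio=1.4):
    """The integer repetition count for the Joyce text.  Truncating
    division could yield 0 when the mix-ins are short relative to the
    chapter, dropping Joyce entirely; pinning to a minimum of 1 is the
    fix described above."""
    return max(1, int(mixin_len * ratio // joyce_len))
```

With the floor in place, even a very long Joyce chapter paired with a tiny mix-in keeps at least one full copy of Joyce in the training text.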
Wrapping up Ulysses Redux #027 and posting its TOC; setting up a new run; I’ll let it post the first six chapters when I go to bed to make up for how long the blog didn’t get updated.
Lots of problems on this run, largely exacerbated by the fact that I wasn’t paying attention. The virtual machine had various failures (now hopefully fixed), and I was out of town and not watching what it was doing. Posting the TOC for Ulysses Redux #028 and setting up another run still based on the Samuel Butler translation of The Odyssey.
Letting a whole novel run as atonement for not paying more attention.
No, really, I think the problems with the virtual machine quitting are probably over now. Running the rest of a novel to compensate.
Tired of trying to deal with Arch breaking my boot process when I update software. Rebuilding the text-generation virtual machine using Ubuntu Server instead. Will probably produce a few chapters along the way. Starting a new run, BloomAndOdysseus VII.
The new virtual machine is set up and will soon be moved to one of my other laptops, which never leaves the house. Setting up a new set of source texts: Shakespeare’s 37 plays, excluding The Two Noble Kinsmen. Running Ulysses Redux #032 to the end so that I can start Ulysses Redux #033 with this set of texts.
Let UlyssesRedux #033 run out to test the new texts. Turns out (no surprise, really) that the 37 Shakespeare plays needed more pre-processing. Most notably (though not exclusively), not stripping line endings and not de-capitalizing the first words of (most) lines resulted in a noticeable number of randomly capitalized words in the middle of sentences. Reworked my utility script poetry-to-prose.py to use NLTK to guess whether a line-beginning word is a proper noun. The script is still a quick hack, but it’s better than nothing, and saved some work. Then there was about two hours’ worth of text cleaning … more could be done.
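The NLTK-based guess could look something like this. A sketch of the poetry-to-prose.py heuristic, not its actual code; the tagger function is injectable so the default NLTK path (which requires the averaged_perceptron_tagger data) can be swapped out:

```python
def decapitalize_line_start(line, pos_tag=None):
    """Lower-case the first word of a verse line unless NLTK tags it as a
    proper noun (NNP/NNPS), in which case the capital is kept.  `pos_tag`
    maps a word list to (word, tag) pairs; defaults to NLTK's tagger."""
    if pos_tag is None:
        import nltk  # requires the averaged_perceptron_tagger data
        pos_tag = nltk.pos_tag
    stripped = line.lstrip()
    words = stripped.split()
    if not words:
        return line
    if pos_tag(words)[0][1] in ("NNP", "NNPS"):  # proper noun: keep caps
        return line
    prefix = line[:len(line) - len(stripped)]
    return prefix + words[0].lower() + stripped[len(words[0]):]
```

Tagging the whole line (rather than the first word alone) gives the tagger some context, though, as noted below, NLTK still struggles with sentence fragments.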
Still, I’m starting a new run with the modified Shakespeare sources, hoping it produces cleaner text this time.
Ulysses Redux #034 has been over for a week. I’ve been meaning to smooth out the capitalization of the Shakespeare input text, but haven’t had a chance to write and test scripts to do so, so I’m going to just go ahead and let Ulysses Redux #035 start without correcting capitalization.
Short description of the problem: NLTK wasn’t as good at working with sentence fragments as I’d hoped. My plan now is to run through the plays sentence by sentence, looking for capitalized things that it doesn’t think are proper nouns and asking if they should be corrected. But I’m going to let a new run start as-is for now.
Sigh.
Still haven’t had time to check capitalization in the Shakespeare plays. Starting a run of Ulysses Redux #035 anyway.
Finally postprocessing Ulysses Redux #036, because I’ve finally written a script (in my personal library, not the utility scripts for Ulysses Redux) that checks capitalization for input texts. It’s far from ideal, but it catches a fair number of problems, and it’s better than nothing. Starting Ulysses Redux #037 with edited Shakespeare texts: I’ve used the script to go over the first three days’ worth and will try to keep ahead of the daily script.
Have processed all 38 Shakespeare plays with the current version of the script, and will begin reworking the script to eliminate some of the false negatives. Next change anticipated: stop using the first character of a string as a heuristic for whether the string is capitalized. Next major change anticipated: try various truecasing solutions. Should be fun.
Postprocessing Ulysses Redux #037 and setting up the next run. Many (not all) of the Shakespeare plays have had their capitalization checked with the current version of check_capitalization.py, which catches a good number of errors but is clearly still missing a lot of problems; trying drop-in truecasers and looking at what they do differently is probably the next step. More specifically, in this run, the Shakespeare plays that had been validated with the script’s help were those associated with chapters 3 through 14.
Starting the next run; plays for each chapter should be validated with the current version of the script by the time each chapter is generated.
Finished the run for Ulysses Redux #038; setting up new run with same texts.
Finished up the run for Ulysses Redux #039 last night; running through all of Ulysses Redux #040 now, (a) because I haven’t yet timed a complete run on this machine, and (b) because I’m eager to move on to a new set of source texts. Which I’ll announce later.
Turns out it takes 74 minutes and 41 seconds to produce an entire novel on this laptop. Or it did for Ulysses Redux #040, anyway.
Postprocessing Ulysses Redux #045; finally corrected the oversight in postprocess-set.py that prevented this coding journal from being formatted properly.
I’ve made major under-the-hood updates to how the text generator operates internally; this also changes its calling interface. Ironically, since the text generator was originally intended for Ulysses Redux, all of my other text-generation projects were updated first; but they seem to be working, so I’m starting to use it for Ulysses Redux instead. I’m running through the rest of this set to test out the effects.
Caught a problem in the run: the back-end changes produced an extra newline, which resulted in lines ending in carriage return-close parenthesis in this run’s Circe chapter. Tweaked the Circe script to avoid this in the future, I hope. In any case, a test run is promising.
Running a complete novel, but doing so on the (slower) computer that normally generates the posts rather than the (faster) computer I normally use for development, and timing each individual chapter with `time for i in $(seq 1 18); do time ./daily_script.py ; done` to get info about how long a run actually takes on this old machine.
Resulting output:
INFO: About to run script chapter_scripts.ch01.py.
INFO: Attempting to post the content
real 1m6.744s
user 1m4.336s
sys 0m0.916s
INFO: About to run script chapter_scripts.ch02.py.
INFO: Attempting to post the content
real 0m43.563s
user 0m41.656s
sys 0m0.508s
INFO: About to run script chapter_scripts.ch03.py.
INFO: Attempting to post the content
real 0m50.018s
user 0m47.716s
sys 0m1.372s
INFO: About to run script chapter_scripts.ch04.py.
INFO: Attempting to post the content
real 1m1.584s
user 0m59.696s
sys 0m0.804s
INFO: About to run script chapter_scripts.ch05.py.
INFO: Attempting to post the content
real 1m6.221s
user 1m5.268s
sys 0m0.052s
INFO: About to run script chapter_scripts.ch06.py.
INFO: Attempting to post the content
real 2m15.550s
user 2m12.220s
sys 0m1.916s
INFO: About to run script chapter_scripts.ch07.py.
real 8m22.455s
user 8m16.388s
sys 0m5.020s
INFO: About to run script chapter_scripts.ch08.py.
INFO: Attempting to post the content
real 3m23.357s
user 3m18.944s
sys 0m2.772s
INFO: About to run script chapter_scripts.ch09.py.
INFO: Attempting to post the content
real 2m29.160s
user 2m25.508s
sys 0m2.024s
INFO: About to run script chapter_scripts.ch10.py.
real 1m20.417s
user 1m15.808s
sys 0m1.380s
INFO: About to run script chapter_scripts.ch11.py.
real 3m22.271s
user 3m17.628s
sys 0m2.964s
INFO: About to run script chapter_scripts.ch12.py.
real 4m27.946s
user 4m21.776s
sys 0m3.996s
INFO: About to run script chapter_scripts.ch13.py.
Traceback (most recent call last):
File "./daily_script.py", line 90, in <module>
the_status = social_media.tumblr_text_post(ulysses_client, the_tags, the_title, the_content)
File "/home/patrick/Documents/programming/python-library/social_media.py", line 24, in tumblr_text_post
tumblog_url = the_client.post('user/info')
File "/home/patrick/.local/lib/python3.4/site-packages/tumblpy/api.py", line 183, in post
extra_endpoints=extra_endpoints, params=params)
File "/home/patrick/.local/lib/python3.4/site-packages/tumblpy/api.py", line 164, in request
if content and (content.get('errors') or content.get('error')):
AttributeError: 'str' object has no attribute 'get'
real 1m57.605s
user 1m55.152s
sys 0m1.668s
INFO: About to run script chapter_scripts.ch13.py.
real 1m57.558s
user 1m54.360s
sys 0m1.724s
INFO: About to run script chapter_scripts.ch14.py.
real 1m23.690s
user 1m21.124s
sys 0m1.220s
INFO: About to run script chapter_scripts.ch15.py.
real 9m35.124s
user 9m26.976s
sys 0m5.680s
INFO: About to run script chapter_scripts.ch16.py.
real 1m49.092s
user 1m46.152s
sys 0m1.456s
INFO: About to run script chapter_scripts.ch17.py.
real 1m57.898s
user 1m55.108s
sys 0m1.080s
real 49m10.298s
user 48m5.816s
sys 0m36.552s
Whoops: ran this on the faster development laptop after all, instead of in a remote shell on the slower machine. Updated the tweet stores that the scripts use as mix-in texts, and am trying again, this time from the proper computer.
Results:
chapter | clock time | user-process time | system time |
---|---|---|---|
ch 01 | 01:27 | 00:42 | 00:02 |
ch 02 | 01:16 | 00:27 | 00:01 |
ch 03 | 01:04 | 00:30 | 00:01 |
ch 04 | 01:25 | 00:39 | 00:02 |
ch 05 | 00:59 | 00:41 | 00:01 |
ch 06 | 01:37 | 01:23 | 00:02 |
ch 07 | 12:31 | 06:00 | 00:11 |
ch 08 | 06:22 | 02:12 | 00:06 |
ch 09 | 03:14 | 01:36 | 00:05 |
ch 10 | 01:41 | 00:50 | 00:02 |
ch 11 | 04:56 | 02:06 | 00:04 |
ch 12 | 08:38 | 02:49 | 00:08 |
ch 13 | 01:27 | 01:14 | 00:02 |
ch 14 | 01:05 | 00:53 | 00:01 |
ch 15 | 15:21 | 06:27 | 00:20 |
ch 16 | 01:43 | 01:07 | 00:02 |
ch 17 | 01:33 | 01:15 | 00:02 |
ch 18 | 01:06 | 00:37 | 00:01 |
totals | 19:33 | 07:37 | 01:20 |
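Turning the raw `time` output above into a table by hand is tedious; here is my own quick sketch of a parser for that log format (assuming the INFO and real/user/sys lines always appear in the order shown):

```python
import re

def parse_timings(log):
    """Pair each 'About to run script ...chNN...' line with the
    real/user/sys figures that follow it, converting to seconds."""
    rows, current = [], None
    for line in log.splitlines():
        m = re.search(r'chapter_scripts\.(ch\d+)\.py', line)
        if m:
            current = {'chapter': m.group(1)}
            rows.append(current)
            continue
        m = re.match(r'(real|user|sys)\s+(\d+)m([\d.]+)s', line.strip())
        if m and current is not None:
            current[m.group(1)] = int(m.group(2)) * 60 + float(m.group(3))
    return rows

sample = """INFO: About to run script chapter_scripts.ch01.py.
real 1m6.744s
user 1m4.336s
sys 0m0.916s"""
print(parse_timings(sample))
# [{'chapter': 'ch01', 'real': 66.744, 'user': 64.336, 'sys': 0.916}]
```

From there, formatting the rows as a pipe table is a one-liner per row.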
There was a connection time-out during the attempt to post the chapter:
INFO: Attempting to post the content
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/util/connection.py", line 83, in create_connection
raise err
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 844, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connection.py", line 284, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0xb5b6ca4c>: Failed to establish a new connection: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.tumblr.com', port=443): Max retries exceeded with url: /v2/user/info (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0xb5b6ca4c>: Failed to establish a new connection: [Errno 110] Connection timed out',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/tumblpy/api.py", line 137, in request
response = func(url, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/requests/sessions.py", line 535, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.4/dist-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/requests/adapters.py", line 487, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.tumblr.com', port=443): Max retries exceeded with url: /v2/user/info (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0xb5b6ca4c>: Failed to establish a new connection: [Errno 110] Connection timed out',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/UlyssesRedux/scripts/daily_script.py", line 90, in <module>
the_status = social_media.tumblr_text_post(ulysses_client, the_tags, the_title, the_content)
File "/home/patrick/Dropbox/programming/python-library/social_media.py", line 24, in tumblr_text_post
tumblog_url = the_client.post('user/info')
File "/usr/local/lib/python3.4/dist-packages/tumblpy/api.py", line 183, in post
extra_endpoints=extra_endpoints, params=params)
File "/usr/local/lib/python3.4/dist-packages/tumblpy/api.py", line 139, in request
raise TumblpyError('An unknown error occurred.')
tumblpy.exceptions.TumblpyError: An unknown error occurred.
real 65m40.893s
user 53m46.612s
sys 1m25.600s
I’m going to go ahead and guess that the root cause of this problem is that my useless, apparently incompetent ISP provides really shitty Internet service, probably because they have a monopoly in my area.
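Whatever the root cause, a single timeout shouldn’t be able to kill a whole run; a sketch of a retry wrapper with backoff around the posting call (my own illustration — the retry counts, delays, and the idea of passing tumblr_text_post through generically are all assumptions, not the current script’s behavior):

```python
import time

def post_with_retries(post_fn, *args, attempts=5, base_delay=30):
    """Call post_fn(*args), retrying with a growing delay on failure.

    Intended to wrap something like social_media.tumblr_text_post so
    that a transient connection timeout doesn't abort the chapter run.
    """
    for attempt in range(1, attempts + 1):
        try:
            return post_fn(*args)
        except Exception as e:            # ConnectionError, TumblpyError, ...
            if attempt == attempts:
                raise                     # out of retries: let it propagate
            delay = base_delay * attempt  # simple linear backoff
            print("INFO: post failed (%s); retrying in %ds" % (e, delay))
            time.sleep(delay)
```

Usage would look something like `the_status = post_with_retries(social_media.tumblr_text_post, ulysses_client, the_tags, the_title, the_content)`.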
More interesting is the fact that the wall-clock time in these runs is often twice the total of user-process and system time: something else is sucking up a lot of time on this computer. My best guess is that the culprit is Dropbox, which has recently started taking up more system resources on this machine.
To test this theory, I’m going to re-run the set of eighteen chapters, timing them again, with Dropbox disabled. We’ll see if there are any substantial differences.
Finally finished off this run and its timing stats. (I wound up running it in small batches, mostly three chapters at a time.) As expected, keeping Dropbox from running during the script’s run resulted in much faster run times:
chapter | clock time | user-process time | system time |
---|---|---|---|
ch 01 | 00:47:17 | 00:41:50 | 00:01:07 |
ch 02 | 00:30:02 | 00:27:20 | 00:00:44 |
ch 03 | 00:36:16 | 00:30:46 | 00:00:52 |
ch 04 | 00:43:08 | 00:39:20 | 00:01:02 |
ch 05 | 00:47:14 | 00:43:14 | 00:01:06 |
ch 06 | 01:33:52 | 01:26:18 | 00:02:14 |
ch 07 | 03:04:14 | 02:48:52 | 00:04:22 |
ch 08 | 02:19:54 | 02:06:58 | 00:03:24 |
ch 09 | 01:45:52 | 01:34:42 | 00:02:32 |
ch 10 | 00:55:38 | 00:50:01 | 00:01:22 |
ch 11 | 02:19:48 | 02:06:10 | 00:03:17 |
ch 12 | 03:01:32 | 02:44:06 | 00:04:41 |
ch 13 | 01:21:34 | 01:14:28 | 00:01:58 |
ch 14 | 01:16:34 | 00:55:36 | 00:02:19 |
ch 15 | 01:08:39 | 01:06:33 | 00:01:41 |
ch 16 | 00:12:08 | 00:11:46 | 00:00:18 |
ch 17 | 00:13:55 | 00:13:26 | 00:00:17 |
ch 18 | 00:19:03 | 00:18:25 | 00:00:24 |
totals | 22:56:42 | 20:39:52 | 00:33:38 |
Of course, since the script includes random elements, not all of the difference is definitively attributable to Dropbox’s system-hogging; but, over the course of all 18 chapters, the overall trend of much lower differences between clock time and user-process time is highly suggestive. (Dropbox, are you listening? Stop hogging my system resources, OK? They’re mine, not yours. Be respectful of my computer when I let you onto it.)
Since so much of the text since the inauguration has left such a foul taste in my mouth, I’m moving on to something much more pleasant: a crossover mashup involving one of my other automated text projects, The Worst of Bad Lovecraft. This mashup won’t pull in all of Lovecraft’s writings, though I may return to that idea in a few years; instead, we’re here remixing only the 23 stories and novellas that constitute Lovecraft’s Dream Cycle. Some of these have not yet even been used as training texts for the Lovecraft blog!
I’m looking forward to this next set of Ulysses Redux novels. Enjoy.
Left on vacation while what seemed to be a productive run was going, and of course it immediately broke. As a bonus, Dropbox stopped running on the computer posting the novel, so even though I saw the problem, I was unable to fix it by correcting the error in the script. This is why there are ten or twelve Circe chapters in this edition. Oh well: if Bloom’s going to get lost in the Dreamlands, it should be in Nighttown, I suppose. In any case, I finished off the novel, and I’m setting up the next run: another iteration with the same source texts.
Forcing the current run (Ulysses Redux #057) to end early, because I want to experiment with dropping the Joyce ratio in the scripts. (There hasn’t been as much Lovecraft coming through as I’d like.) In some cases, these small adjustments will produce no perceptible changes at all (because the ratios are only roughly adhered to: each text is used an integral number of times). Nevertheless, here are the specific changes made:
* ch07.py: joyce_ratio dropped from 1.4 to 1.2.
* generic_chapter.py: joyce_ratio dropped from 1.2 to 1.1.

Finally noticed that there had been no post in five days. This turns out to be because a typo in the metadata for chapter 9 was causing the script to choke. Ran enough chapters to make up for the missed days.
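The rounding behavior mentioned above — the ratios are only roughly adhered to, because each text is used an integral number of times — can be sketched like this (the function and the copy-count logic are my own illustration, not the actual script’s):

```python
def joyce_copies(joyce_ratio, mixin_copies=1):
    """Round the Joyce-to-mixin ratio to a whole number of copies
    of the Joyce text per copy of the mix-in text."""
    return max(1, round(joyce_ratio * mixin_copies))

# With a single copy of the mix-in text, 1.4, 1.2, and 1.1 all
# round to one copy of the Joyce text, so the tweak is invisible;
# only with more mix-in copies does the difference show up.
for ratio in (1.4, 1.2, 1.1):
    print(ratio, joyce_copies(ratio))       # all print 1
print(joyce_copies(1.4, 5), joyce_copies(1.2, 5))  # 7 6
```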