Site Geographical Surveys

As a gesture of honesty and good faith, and as a way of experimenting with what can semantically be done on a website that I entirely control in order to document how the website has grown and evolved, I periodically produce what I think of as geographical surveys of this site. (By geographical survey, I mean a quick compilation of links to all files on the site, plus a link to the Internet Archive's top-level page for the site at the time the survey is produced, which (hopefully) links to a complete Internet Archive copy of the site at that time.) My intent is that a complete list of documents on the site, together with a neutral third-party snapshot of the site, will constitute an archive of what I was willing to publicly declare at that time; and that a record of the site at various points in time, one that I can't effectively disclaim or efface or otherwise manipulate, will help motivate me to stay honest and to stand behind things that I write, say, and create in public.

More specifically, each site survey consists of a collection of links to existing content publicly accessible on my web site, wrapped up as a bzip2-compressed survey file, which is an Atom-encoded feed; there is also a detached GPG signature for that compressed file, in order to assert with my GPG key that I have produced the survey and that I endorse it. (You can verify my signature with my GPG key or with keybase, e.g. by doing something like keybase verify 20151006.xml.bz2.asc 20151006.xml.bz2 or gpg --verify 20151006.xml.bz2.asc 20151006.xml.bz2 in a terminal.)
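For what it's worth, here is a minimal sketch of how that verification and unpacking might be scripted in Python, assuming that gpg is installed and that my public key has already been imported; the filenames are just those of the example survey mentioned above.

    # Minimal sketch: verify a survey's detached signature, then list the
    # URLs that the compressed Atom feed records. Assumes gpg is on the
    # PATH and that the signing key has already been imported.
    import bz2
    import subprocess
    import xml.etree.ElementTree as ET

    SIG, SURVEY = '20151006.xml.bz2.asc', '20151006.xml.bz2'

    # gpg exits with a nonzero status if the signature doesn't check out,
    # which makes subprocess.run() raise an exception here.
    subprocess.run(['gpg', '--verify', SIG, SURVEY], check=True)

    with bz2.open(SURVEY, 'rt', encoding='utf-8') as feed_file:
        feed = ET.parse(feed_file)

    links = [link.get('href')
             for link in feed.iter('{http://www.w3.org/2005/Atom}link')]
    print(len(links), 'URLs recorded in the survey')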

Neither half of the process that I use (producing a signed survey, then archiving the site with the Internet Archive) is perfect. Part of the problem is that I (of course) rely on the Internet Archive to host the archived backup, and my backup requests are sent to them automatically and I assume that they succeed; if I'm distracted, I may miss the fact that a few files weren't archived successfully. Too, the survey file (currently) contains erroneous, nonexistent links to a few images (and several other file types) that are not hosted on my website itself, but elsewhere. This is because the process is based on the local copy of the website stored on my hard drive; it naïvely assumes that the remote copy on the Internet is a perfect mirror. That assumption is close to correct, but not quite: the link-dumping process doesn't verify that remote copies exist, because the process itself is a pretty quick hack that uses a heuristic to produce the survey. All maps are (in any case) abstractions, simplifications, and approximations, after all. Still, the process generates more files than actually exist on my server, not fewer; there are false positives, not missing files. Since the false positives are incorrect locations for files that are dependencies of other files that are on the server, requesting that the Archive back up the actually existing files should also archive the missing media from their actual, externally hosted locations. (In theory. Most of the time. I hope.)

In any case, the survey methodology is not perfect, and there is room for improvement (which I hope will happen with future surveys). It is intended to be a good-faith gesture of honesty, not a contractual guarantee of accuracy. Nevertheless, I would be interested in hearing about any problems that anyone may discover so that I can fix them, just as I would appreciate hearing any suggestions for how this could be done better. (In the future, I may do this with WARC, although preserving HTTP headers is perhaps less important for what happens to be on this particular site than it is for an arbitrary site on the Internet. I'm also thinking about Git-versioning this website, though I'm not yet doing so.)

In addition to the above, I also hereby declare ...

Existing surveys

How the survey is currently produced

I use this Python script, which automates the same process that I used to perform manually, as described below.

How the first survey was produced

From the root directory of my local copy of my website, I run find . -type f > ~/files.list; find . -type d >> ~/files.list in a terminal, which produces a list of all files and directories in that folder and its subfolders and places it in my home directory as files.list. I then move that list into an empty temporary folder and use wget to get the Internet Archive to save all of the files listed. (I also want to list directories, because they count as URLs too: my server serves the index.html file for a directory if the URL ends with a slash.)
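For illustration only, a rough Python equivalent of those two find commands might look something like this (the starting folder and the output filename simply mirror the description above):

    # Sketch: list every file, then every directory, under the current
    # folder, and write the combined list to ~/files.list, roughly as the
    # two find commands above do.
    import os

    files, dirs = [], []
    for dirpath, dirnames, filenames in os.walk('.'):
        dirs.append(dirpath)
        files.extend(os.path.join(dirpath, name) for name in filenames)

    with open(os.path.expanduser('~/files.list'), 'w') as out:
        out.write('\n'.join(files + dirs) + '\n')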

To do this, I perform a search and replace on the list to prepend every line with http://web.archive.org/save/, plus the actual web prefix of each file, which doesn't appear in the local copy; that is, I add an extra blank line to the beginning of the file, then search for \n./ (that's the Unix newline character followed by a dot-slash, the Unix prefix meaning "in this folder," where (in this case) "this folder" means the current working directory from which I run the search) and replace it with \nhttp://web.archive.org/save/http://patrickbrianmooney.nfshost.com/. Then I run wget over the whole list: wget -i files.list --no-use-server-timestamps --limit-rate=50k -w 3 -S -O /dev/null, giving it three seconds between files and limiting the transfer rate to 50k to avoid being a burden to the Internet Archive's servers. I throw out the downloaded files by routing them to /dev/null, since I don't actually want to keep them; I just want to load each URL, because that is what gets the Internet Archive to save it.
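A sketch of that same prepend-and-request step in Python, rather than search-and-replace plus wget, might look roughly like the following; the two prefixes and the three-second pause simply mirror the description above, and the responses themselves are thrown away, just as they are with wget.

    # Sketch: turn each local path in files.list into an Internet Archive
    # "save" URL and request it, pausing between requests.
    import time
    import urllib.request

    SAVE_PREFIX = 'http://web.archive.org/save/'
    SITE_PREFIX = 'http://patrickbrianmooney.nfshost.com/'

    with open('files.list') as f:
        local_paths = [line.strip() for line in f if line.strip()]

    for path in local_paths:
        # './some/file.html' on disk maps to the public URL, which in turn
        # maps to the URL that asks the Archive to save it; the bare '.'
        # entry maps to the site's top-level URL.
        relative = path[2:] if path.startswith('./') else ''
        url = SAVE_PREFIX + SITE_PREFIX + relative
        try:
            urllib.request.urlopen(url, timeout=60).read()
        except OSError as err:
            print('problem archiving', url, ':', err)
        time.sleep(3)    # don't be a burden to the Archive's servers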

Then I produce the Atom feed for the geographical survey itself; this involves using an existing Atom feed as a template to create a single-entry feed for the survey. Each file and directory appears in it again as a <link> element (this list is again created by searching and replacing through the generated list of files); a description is written up, with the list piped through wc -l in a terminal to count the files and directories. The resulting feed is compressed with bzip2, which reduces its size by about 95%; there's a lot of repeated data in there (most notably, the URL for the top level of my web site appears as a component in every single URL in the list). Then I list the results on this page and upload everything.
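For the curious, a stripped-down sketch of that feed-building step in Python might look like the following; the feed and entry titles and the output filename here are placeholders rather than metadata from any real survey, and the actual feeds carry more information than this.

    # Sketch: build a minimal single-entry Atom feed that carries one
    # <link> element per surveyed URL, then compress it with bzip2.
    import bz2
    import io
    import xml.etree.ElementTree as ET

    ATOM = 'http://www.w3.org/2005/Atom'
    ET.register_namespace('', ATOM)

    feed = ET.Element('{%s}feed' % ATOM)
    ET.SubElement(feed, '{%s}title' % ATOM).text = 'Site geographical survey'
    entry = ET.SubElement(feed, '{%s}entry' % ATOM)
    ET.SubElement(entry, '{%s}title' % ATOM).text = 'Survey (placeholder title)'

    # Rebuild the public URLs from the local file list, as above.
    with open('files.list') as f:
        urls = ['http://patrickbrianmooney.nfshost.com/' + line.strip()[2:]
                for line in f if line.strip().startswith('./')]
    for url in urls:
        ET.SubElement(entry, '{%s}link' % ATOM, href=url)

    buffer = io.BytesIO()
    ET.ElementTree(feed).write(buffer, encoding='utf-8', xml_declaration=True)
    with bz2.open('survey.xml.bz2', 'wb') as out:    # placeholder filename
        out.write(buffer.getvalue())

    print(len(urls), 'links written')    # roughly what wc -l reports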