Fort Lee paper

Status

This is the Fort Lee paper. It sets out an initial specification for a dark archive for RePEc full-text and templates.
The first draft was of 2012‒11‒07. This is the version of 2014‒03‒12.

Main part

goal

The goal is to provide a framework for the storage of persistent copies of the RePEc digital library. For the purpose of this document, the RePEc digital library consists of template metadata for documents, and of full-text copies of the documents referenced in metadata records, if such full text can be downloaded from a non-privileged computer. Thus full text can be assumed to be publicly available.

definitions

object of interest (“obi”)
This is anything that has a URL or a RePEc handle, where that URL or RePEc handle is found in RePEc data or in ArchEc data.
RePEc data
This is the set of all RePEc template metadata known now.
ArchEc data
ArchEc data is RePEc template data stored by ArchEc at a previous time, or objects of interest stored by a run of ArchEc to gather file objects of interest.
template object of interest (“tobi”)
a template found in the RePEc data, as a string. Its handle is the RePEc handle.
file object of interest (“fobi”)
something that is pointed to by a file-* construct in RePEc's ReDIF data. Its handle is the URI, subject to normalization of URIs. (FIXME: say what this is.)
obi store
a way to store the obis. The obi is stored on disk. This is described below. For the moment, let’s just assume that the obi store exists, and that it creates a permanent disk location for the obi.
tist
the Unix timestamp at a given time
software tist
The tist at which we start to make a gathering run of ArchEc.
tobi tist
the tist of a file in the obi store in which a tobi is found. When tobi store software is run, either an
tist found
The tist of the time we find an obi for the first time. This is either the tobi tist or the software tist of a run where we first get a copy of the fobi.
tist noted
the software tist at which we find an object for the first time, or at which we notice that its contents have changed since the last time we observed it.
tist lost
the software tist at which we find that an obi that we have observed is no longer there. Such an obi is called lost.
refipo
this is a pointer to file information we find in the ReDIF data
incidence
either appearance, change, or deletion of an obi.
The object store depends on time. Therefore time handling is quite important.
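As an illustration of how the tists and incidences defined above relate, here is a minimal Python sketch for a fobi. The ObiRecord class and the observe function are hypothetical names, not part of ArchEc. The sketch assumes that contents are compared by MD5 sum, and that an obi which reappears after being lost counts as a new appearance; this document does not settle either point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObiRecord:
    """Hypothetical per-obi bookkeeping; field names follow the definitions."""
    tist_found: Optional[int] = None  # software tist of the first sighting
    tist_noted: Optional[int] = None  # last tist with new or changed contents
    tist_lost: Optional[int] = None   # tist at which the obi was found gone
    last_md5: Optional[str] = None    # checksum of the last observed contents


def observe(record: ObiRecord, software_tist: int,
            md5: Optional[str]) -> Optional[str]:
    """Update the record for one gathering run; return the incidence, if any.

    md5 is None when the obi could not be retrieved in this run.
    """
    if md5 is None:
        # Only an obi that we currently hold can become lost.
        if record.last_md5 is not None:
            record.tist_lost = software_tist
            record.last_md5 = None
            return "deletion"
        return None
    if record.last_md5 is None:
        # First sighting, or (assumption) reappearance after being lost.
        if record.tist_found is None:
            record.tist_found = software_tist
        record.tist_noted = software_tist
        record.tist_lost = None  # assumption: a reappearing obi is not lost
        record.last_md5 = md5
        return "appearance"
    if md5 != record.last_md5:
        record.tist_noted = software_tist
        record.last_md5 = md5
        return "change"
    return None  # unchanged contents are not an incidence
```

A gathering run would call observe once per obi, passing the software tist of that run.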

RePEc data

RePEc data is externally furnished to ArchEc and periodically updated. Its location is given by the ReDIF_directory parameter in the configuration file.
For the purpose of building a gathering application, static tarballs of old RePEc data are also available. The tarball location may be supplied with the -f argument at start time. In that case the ReDIF_directory parameter in the configuration file is ignored.
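The precedence between the -f argument and the ReDIF_directory parameter could be handled roughly as follows. This is a sketch only: the function name resolve_redif_source is hypothetical, and the configuration file is assumed to have been parsed into a dictionary already, since its format is not described here.

```python
import argparse
from typing import Dict, List, Optional


def resolve_redif_source(config: Dict[str, str],
                         argv: Optional[List[str]] = None) -> Dict[str, str]:
    """Decide where the RePEc data for this run comes from."""
    parser = argparse.ArgumentParser(description="ArchEc gathering run (sketch)")
    parser.add_argument("-f", dest="tarball", default=None,
                        help="location of a static tarball of old RePEc data")
    args = parser.parse_args(argv)
    if args.tarball is not None:
        # A tarball given at start time overrides the configuration file.
        return {"kind": "tarball", "location": args.tarball}
    return {"kind": "directory", "location": config["ReDIF_directory"]}
```

Calling resolve_redif_source(config) without a second argument reads the arguments from the command line.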

Configuration parameters

The obi store location is specified with the object_store parameter in the configuration file.

directory structure

The object store is in "archive/"; sources are in "data/". The "data/" directory has "tarballs/", storing past tarballs by date, and "current/", linking to the current dataset.

object store

We use the tist noted. Reverse its characters, then set a / after the third character and after the sixth character for the first occurrences of the tist.
Example. Store the homepage of Thomas Krichel http://openlib.org/home/krichel at some observed tist.
krichel@trabbi:~$ date +%s
1352481230
Invert the string to get 0321842531 and split for directories 032/184/2531
Take the URL and apply uri_escape_utf8 (in Perl) to get
http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel
The file in the store is 032/184/2531/http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel
but since there is a redirect to http://openlib.org/home/krichel/, it stores at
032/184/2531_http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel%2F
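The path construction in this example can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the escaped URL is joined to the directories with a /, as in the first path of the example, and that urllib.parse.quote with an empty safe set is a close enough stand-in for Perl's uri_escape_utf8; older versions of URI::Escape also leave the characters !*'() unescaped.

```python
from urllib.parse import quote


def tist_to_dirs(tist: int) -> str:
    """Reverse the digits of the tist and split after the 3rd and 6th."""
    reversed_digits = str(tist)[::-1]
    return f"{reversed_digits[:3]}/{reversed_digits[3:6]}/{reversed_digits[6:]}"


def escape_handle(url: str) -> str:
    """Percent-escape everything except letters, digits and the marks _ . - ~"""
    return quote(url, safe="", encoding="utf-8")


def store_path(tist: int, url: str) -> str:
    """Relative location of an obi inside the object store (assumed layout)."""
    return f"{tist_to_dirs(tist)}/{escape_handle(url)}"


if __name__ == "__main__":
    print(store_path(1352481230, "http://openlib.org/home/krichel"))
```

Reversing the tist puts the fastest-changing digits first, so observations made close together in time spread across different top-level directories.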
If there is a redirect from URL foo to URL bar, make foo a symlink to bar. Exception: if foo and bar differ only by a trailing slash, omit the symlink.
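The redirect rule can be expressed as a small decision function. This is a sketch with hypothetical names; it only decides whether a symlink is wanted and under which escaped names, and leaves the actual call to os.symlink to the caller.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def differs_only_by_trailing_slash(foo: str, bar: str) -> bool:
    """True if one URL equals the other with a single / appended."""
    return foo + "/" == bar or bar + "/" == foo


def plan_redirect_link(foo: str, bar: str) -> Optional[Tuple[str, str]]:
    """Return (link name, link target) for a redirect from foo to bar.

    Returns None when no symlink should be made.
    """
    if foo == bar or differs_only_by_trailing_slash(foo, bar):
        return None
    # Both names are escaped the same way as file names in the store
    # (assumption: quote() stands in for Perl's uri_escape_utf8).
    return (quote(foo, safe=""), quote(bar, safe=""))
```

For the homepage example, the redirect only adds a trailing slash, so the function returns None and no symlink is made.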

Caching database

For the object store, we need a caching database. We use MongoDB for this purpose. Note that the caching database is not just for caching. It is also used to hold any data that could not be derived again.
We have four collections.
          tobi  fobi
data      mi    mo
checksum  vi    vo
Of these four, the two in the second column are not implemented.
The data collections, “m”, link the id of an obi to a hash. The hash maps tists to MD5 sums. The file location can be trivially constructed from the tist and the obi id.
The checksum collections, “v”, link an MD5 checksum of the text in the file (rather than of the file itself) to a location in the obi store where the checksummed contents are held.
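To make the two kinds of collection more concrete, here is a sketch that uses plain Python dictionaries in place of MongoDB collections. All names are hypothetical. It assumes that the MD5 sums in a data collection are taken over the stored file, that tists are kept as string keys, and that the first location seen for a given text checksum is the one kept. How the text is extracted from a file is left to the caller.

```python
import hashlib
from typing import Dict

# Stand-in for a data collection ("m"): obi id -> {tist: md5 of the file}
data_collection: Dict[str, Dict[str, str]] = {}
# Stand-in for a checksum collection ("v"): md5 of the text -> store location
checksum_collection: Dict[str, str] = {}


def md5_hex(payload: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(payload).hexdigest()


def record_observation(obi_id: str, tist: int, file_bytes: bytes,
                       text: str, location: str) -> None:
    """Register one stored copy of an obi in both collections."""
    data_collection.setdefault(obi_id, {})[str(tist)] = md5_hex(file_bytes)
    # Keep the first location that holds this text (assumption).
    checksum_collection.setdefault(md5_hex(text.encode("utf-8")), location)
```

Two files with different markup but the same text then share one entry in the checksum collection, which is the point of checksumming the text rather than the file.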
