Fort Lee paper
Status
This is the Fort Lee paper. It sets out an initial specification
for a dark archive for RePEc full-text and templates.
The first draft was of 2012‒11‒07. This is the version of 2014‒03‒12.
Main part
goal
The goal is to provide a framework for the storage of persistent
copies of the RePEc digital library. For the purpose of this
document, the RePEc digital library consists of template metadata for
documents, and of full-text copies of these documents referenced in
metadata records, if such full text can be downloaded from a non-privileged
computer. Thus full text can be assumed to be publicly available.
definitions
- object of interest (“obi”)
- This is anything that has a URL or a RePEc handle, and its URL or RePEc handle is found in RePEc data or in ArchEc data.
- RePEc data
- This is the set of all RePEc template metadata known now.
- ArchEc data
- ArchEc data is RePEc template data stored by ArchEc at a previous
time, or objects of interest stored by a run of ArchEc to gather file
objects of interest.
- template object of interest (“tobi”)
- a template found in the RePEc data, as a string. Its handle is the RePEc handle.
- file object of interest (“fobi”)
- something that is pointed to by a file-* construct in RePEc's ReDIF
data. Its handle is the URI, subject to normalization of URIs. (FIXME: say
what this is.)
- obi store
- a way to store the obis. The obi is
stored on disk. This is described below. For the moment, let’s just
assume that the obi store exists, and it creates a permanent disk
location of the obi.
- tist
- a Unix time stamp, that is, the number of seconds since the Unix epoch at a given moment
- software tist
- The tist at which we start to make a gathering run of ArchEc.
- tobi tist
- the tist of the file in the obi store in which a tobi is found.
When tobi store software is run, either an
- tist found
- The tist at which we find an obi for the first time. This is
either the tobi tist or the software tist of the run where we first
get a copy of the fobi.
- tist noted
- the software tist at which we find an object for the
first time, or at which we notice that its contents have changed since
the last time we observed it.
- tist lost
- the software tist at which we find that an obi that we
have previously observed is no longer there. Such an obi is called lost.
- refipo
- this is a pointer to file information we find in
the ReDIF data
- incidence
- either appearance, change, or deletion of an obi.
The object store depends on time. Therefore time handling is quite important.
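As a rough illustration of the three kinds of incidence, the sketch below compares what was known about an obi at the previous run with what is seen at the current run. The function name classify_incidence and the use of None for "no copy present" are assumptions made for this example only; they are not part of ArchEc.

```python
from typing import Optional


def classify_incidence(previous_md5: Optional[str],
                       current_md5: Optional[str]) -> Optional[str]:
    """Classify what happened to an obi between two gathering runs.

    None stands for "no copy of the obi" at that run (an assumption
    of this sketch). Returns None when nothing happened.
    """
    if previous_md5 is None and current_md5 is not None:
        return "appearance"   # would set tist found and tist noted
    if previous_md5 is not None and current_md5 is None:
        return "deletion"     # would set tist lost
    if previous_md5 != current_md5:
        return "change"       # would set a new tist noted
    return None               # unchanged, or never seen at all
```

Under these assumptions, an appearance or a change is what gives rise to a tist noted, and a deletion to a tist lost.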
RePEc data
RePEc data is externally furnished to ArchEc and periodically updated. Its
location is given by the ReDIF_directory parameter in the configuration file.
For the purpose of building a gathering application, static tarballs of old
RePEc data are also available. The tarball location may be supplied
with the -f argument at start time. In that case the ReDIF_directory
parameter in the configuration file is ignored.
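The precedence between the -f argument and ReDIF_directory can be sketched as follows. The configuration file format is not specified here, so the sketch simply assumes the configuration has already been read into a dictionary; the function name resolve_redif_source and the returned tuple are hypothetical.

```python
import argparse


def resolve_redif_source(argv, config):
    """Decide where the RePEc data comes from.

    argv   -- command-line arguments, without the program name
    config -- configuration as a dict (assumed representation)
    Returns a (kind, location) pair.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", dest="tarball", default=None,
                        help="location of a tarball of old RePEc data")
    args = parser.parse_args(argv)
    if args.tarball is not None:
        # -f was given: ReDIF_directory is ignored.
        return ("tarball", args.tarball)
    return ("directory", config["ReDIF_directory"])
```

For example, resolve_redif_source(["-f", "old.tar.gz"], config) would select the tarball whatever config contains.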
Configuration parameters
The obi store location is specified with the
object_store parameter in the configuration file.
directory structure
The object store is in "archive/", sources are in "data/". The latter
directory has "tarballs/" storing past tarballs, by date, and
"current/" linking to the current dataset.
object store
We use the tist noted. Reverse its characters, and set a / after the
third char and the sixth char for the first occurrences of the tist.
Example. Store the homepage of Thomas Krichel http://openlib.org/home/krichel
at some observed tist.
krichel@trabbi:~$ date +%s
1352481230
Reverse the string to get 0321842531 and split it into directories: 032/184/2531
Take the URL and apply uri_escape_utf8 (in Perl)
http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel
The file in the store is
032/184/2531/http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel
but since there is a redirect to http://openlib.org/home/krichel/,
it is stored at
032/184/2531/http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel%2F
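The path construction in this example can be sketched in a few lines. The sketch uses Python rather than Perl; urllib.parse.quote with an empty safe set is used here as a stand-in for uri_escape_utf8, on the assumption that both leave only letters, digits and "-", ".", "_", "~" unescaped. The function name obi_path is hypothetical.

```python
from urllib.parse import quote


def obi_path(tist: int, url: str) -> str:
    """Build the relative store path for a URL observed at a tist.

    The tist is reversed character by character, then split after
    the third and the sixth character to form nested directories.
    """
    reversed_tist = str(tist)[::-1]
    directories = "/".join(
        [reversed_tist[:3], reversed_tist[3:6], reversed_tist[6:]]
    )
    # safe="" makes quote() escape ":" and "/" as well.
    return directories + "/" + quote(url, safe="")
```

Calling obi_path(1352481230, "http://openlib.org/home/krichel/") gives the location of the redirect target in the example.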
If there is a redirect from URL foo to URL bar, make foo a symlink
to bar. Exception: if foo and bar differ only by a trailing slash, omit
the symlink.
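The redirect rule might be implemented along the following lines. This is a sketch under two assumptions that the text does not state: that foo and bar are stored in the same tist directory, and that the symlink target is the escaped file name of bar relative to that directory. The function names are hypothetical.

```python
import os
from urllib.parse import quote


def differ_only_by_trailing_slash(foo: str, bar: str) -> bool:
    """True if one URL is the other plus a single trailing slash."""
    return foo + "/" == bar or bar + "/" == foo


def record_redirect(directory: str, foo: str, bar: str):
    """Make foo a symlink to bar inside directory.

    Returns the path of the symlink, or None when the exception
    for trailing slashes applies and no symlink is made.
    """
    if differ_only_by_trailing_slash(foo, bar):
        return None
    link = os.path.join(directory, quote(foo, safe=""))
    # Relative target: bar is assumed to live in the same directory.
    os.symlink(quote(bar, safe=""), link)
    return link
```

With this rule, the redirect from http://openlib.org/home/krichel to http://openlib.org/home/krichel/ produces no symlink.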
Caching database
For the object store, we need a caching database. We use
MongoDB for this purpose. Note that the caching database is not just
for caching. It is used to hold any data that could not
be derived again.
We have four collections.
         | tobi | fobi |
data     | mi   | mo   |
checksum | vi   | vo   |
Of these four, the two in the second column are not implemented.
The data collections, “m”, link an id of an obi to a hash. The hash
maps tists to md5 sums. The file name can be trivially constructed from
tist and obi id.
The checksum collections, “v”, link an md5 checksum of the text in the
file (rather than of the file itself) to a location in the obi store
where the checksummed contents are held.