Fort Lee paper
Status
  This is the Fort Lee paper. It sets out an initial specification
  for a dark archive for RePEc full-text and templates. 
  The first draft was of 2012‒11‒07.  This is the version of 2014‒03‒12.
Main part
goal
  The goal is to provide a framework for the storage of persistent
  copies of the RePEc digital library. For the purpose of this
  document, the RePEc digital library consists of template metadata for
  documents, and of full-text copies of these documents referenced in
  metadata records, if such full text can be downloaded from a non-privileged
  computer. Thus full text can be assumed to be publicly available. 
definitions
- object of interest (“obi”)
 
- Anything that has a URL or a RePEc handle, where that URL or RePEc handle is found in RePEc data or in ArchEc data.
 
- 
  RePEc data
 
- 
  This is the set of all RePEc template metadata known now. 
 
- ArchEc data
 
- 
  ArchEc data is RePEc template data stored by ArchEc at a previous
  time, or objects of interest stored by a run of ArchEc to gather file
  objects of interest.
 
- template object of interest (“tobi”)
 
- a template found in the RePEc data, as a string. Its handle is the RePEc handle.
 
- 
  file object of interest (“fobi”)
 
- 
  something that is pointed to by a file-* construct in RePEc's ReDIF
  data. Its handle is the URI, subject to normalization of URIs. (FIXME: say
  what this is.)
 
- obi store
 -  a way to store the obis. The obi is
stored on disk. This is described below. For the moment, let’s just
assume that the obi store exists, and it creates a permanent disk
location of the obi.
 
- tist
 
- A Unix timestamp, i.e. the number of seconds since the Unix epoch at a given time.
 
- software tist
 
- The tist at which we start to make a gathering run of ArchEc.
 
- tobi tist
 
- 
  The tist of a file in the obi store in which a tobi is found.
  When tobi store software is run, either an 
 
- tist found
 
-  
  The tist of the time we find an obi for the first time. This is
  either the tobi tist or the software tist of a run where we first
  get a copy of the fobi.
 
- tist noted
 -  the software tist at which we find an obi for the
first time, or at which we notice that its contents have changed since
the last time we observed it.
 
- tist lost
 -  the software tist at which we find that an obi that we
have observed is no longer there. Such an obi is called lost.
 
- refipo
 -  this is a pointer to file information we find in
the ReDIF data
 
- incidence
 -  either appearance, change, or deletion of an obi.
 
The object store depends on time. Therefore time handling is quite important.
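The three kinds of incidence can be pictured as a comparison between two successive observations of the same obi. The following Python sketch is only an illustration. It assumes that an observation is represented by a checksum of the obi's contents, or by None when the obi is not there. The function name and this representation are hypothetical and not part of ArchEc.

```python
from typing import Optional


def classify_incidence(previous: Optional[str], current: Optional[str]) -> Optional[str]:
    """Compare two successive observations of one obi.

    Each argument is a checksum of the obi's contents, or None if the
    obi was absent at that observation (an assumption of this sketch).
    """
    if previous is None and current is not None:
        return "appearance"  # the obi is there now, it was not before
    if previous is not None and current is None:
        return "deletion"    # the obi has become lost
    if previous is not None and previous != current:
        return "change"      # the contents differ from the last observation
    return None              # no incidence between the two observations
```

Under these assumptions, an appearance or a change would give rise to a tist noted, and a deletion to a tist lost.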
RePEc data
  RePEc data is externally furnished to ArchEc and periodically updated. Its
  location is given by the ReDIF_directory parameter in the configuration file. 
  For the purposes of building a gathering application, static tarballs of old
  RePEc data are also available. The tarball location may be supplied
  with the -f argument at start time. In that case the ReDIF_directory
  parameter in the configuration file is ignored.
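The precedence between the -f argument and the ReDIF_directory parameter could be sketched as follows in Python. This is a minimal illustration, not the ArchEc code. It assumes that the configuration file has already been parsed into a dictionary, and the function name and the returned tuple are hypothetical.

```python
import argparse


def resolve_redif_source(config, argv):
    """Decide where the RePEc data comes from.

    config: dict of configuration parameters (assumed already parsed).
    argv:   list of command-line arguments, without the program name.
    """
    parser = argparse.ArgumentParser()
    # -f gives the location of a static tarball of old RePEc data
    parser.add_argument("-f", dest="tarball", default=None)
    args = parser.parse_args(argv)
    if args.tarball is not None:
        # a tarball was supplied, so ReDIF_directory is ignored
        return ("tarball", args.tarball)
    return ("directory", config["ReDIF_directory"])
```

The paths used when calling this function, for example a tarball below "data/tarballs/", are up to the installation.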
Configuration parameters
  The obi store location is specified with the 
  object_store parameter in the configuration file.
  directory structure
  The object store is in "archive/", sources are in "data/". The "data/"
  directory has "tarballs/" storing past tarballs, by date, and
  "current/" linking to the current dataset.
object store
  We use the tist noted. Reverse it character by character, and set a / after
  the third char and the sixth char for the first occurrences of the tist.
  Example. Store the homepage of Thomas Krichel http://openlib.org/home/krichel
  at some observed tist. 
  krichel@trabbi:~$ date +%s
  1352481230
  Reverse the string to get 0321842531 and split for directories 032/184/2531
  Take the URL and apply uri_escape_utf8 (in Perl) to get
  http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel
  The file in the store is 
  032/184/2531/http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel
  but since there is a redirect to http://openlib.org/home/krichel/,
  it stores at
  032/184/2531_http%3A%2F%2Fopenlib.org%2Fhome%2Fkrichel%2F
  If there is a redirect from URL foo to URL bar, make foo a symlink
  to bar. Exception: if foo and bar differ by a trailing slash, omit
  the symlink. 
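The path construction and the symlink rule above can be sketched in Python. This is an illustration under assumptions, not the ArchEc implementation, which uses Perl. Here urllib.parse.quote with safe="" stands in for uri_escape_utf8. The two agree on the example URL, but may differ on a few punctuation characters depending on the version of URI::Escape. The function names are hypothetical.

```python
from urllib.parse import quote


def store_path(tist, url):
    """Build the relative path in the obi store for a URL noted at tist."""
    rev = str(tist)[::-1]  # reverse the tist character by character
    # a / after the third and after the sixth character
    directory = f"{rev[:3]}/{rev[3:6]}/{rev[6:]}"
    # escape every reserved character, including ":" and "/"
    return f"{directory}/{quote(url, safe='')}"


def needs_symlink(foo, bar):
    """True if a redirect from foo to bar should be stored as a symlink."""
    differ_by_slash = foo + "/" == bar or bar + "/" == foo
    return foo != bar and not differ_by_slash


print(store_path(1352481230, "http://openlib.org/home/krichel"))
print(needs_symlink("http://openlib.org/home/krichel",
                    "http://openlib.org/home/krichel/"))
```

Reversing the tist puts its fastest-changing digits first, so files noted at nearby times are spread over many top-level directories.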
 Caching database
  For the object store, we need a caching database. We use 
  MongoDB for this purpose. Note that the caching database is not just
  for caching. It is used to hold any data that could not
  be derived again.
 We have four collections. 
  
             | tobi | fobi |
    data     | mi   | mo   |
    checksum | vi   | vo   |
  
  Of these four, the two in the second column are not implemented.
  The data collections, “m”, link an id of an obi to a hash. The hash
  maps tists to md5 sums. The file location can be trivially constructed from
  tist and obi id. 
  The checksum collections, “v”, link an md5 checksum of the text in the
  file, rather than of the file, to a location in the obi store
  where the checksummed contents are held.
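The shape of the two kinds of collection can be pictured with plain Python dictionaries standing in for MongoDB. This is a sketch under assumptions. The names m, v and record are hypothetical, the checksum is taken over the bytes passed in, and tists are stored as strings because document keys in MongoDB are strings.

```python
import hashlib

m = {}  # data collection: obi id -> {tist as string: md5 sum}
v = {}  # checksum collection: md5 sum -> location in the obi store


def record(obi_id, tist, text, location):
    """Record that the obi had the given text (bytes) at the given tist."""
    md5 = hashlib.md5(text).hexdigest()
    m.setdefault(obi_id, {})[str(tist)] = md5
    # keep the first location at which these contents were stored
    v.setdefault(md5, location)
    return md5


# the same contents seen at two tists share one entry in v
record("http://openlib.org/home/krichel/", 1352481230, b"hello", "loc-a")
record("http://openlib.org/home/krichel/", 1352567630, b"hello", "loc-b")
print(m)
print(v)
```

With this layout, unchanged contents observed at a later tist add an entry to the hash in “m” but no new entry to “v”.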