Berlin requirements document

José Manuel Barrueco Cruz, Christopher F. Baum, Thomas Krichel, Sune Karlsson, Ivan V. Kurmanov, Robert P. Parks and Christian Zimmermann

2002–05–26

0: Introduction

0. This document is the Berlin document. Its initial version was written on 2002–03–23 by Thomas Krichel. It was then completely rewritten by Ivan V. Kurmanov, and that version served as the basis for discussions at the RePEc meeting in Syracuse. This version is available in PDF for A4 paper and US Letter paper. It has benefited from comments on the RePEc-tech lists.

1. HoPEc is a key RePEc user service. Recent years have shown the success of the service, but have also revealed a number of problems and limitations. Based on the experience of running the service and on common sense, a number of possible extensions to it seem useful.

2. There are three extensions that the service can receive:

The addition to HoPEc of an interface that will allow HoPEc registrants to add papers to RePEc. This will be done through EconWPA. There would be the option to add full text files through URLs or the option to add papers directly to be stored on a server. These options are mutually exclusive.
The integration with CitEc data. In the future, authors should be able to associate themselves not only with papers that they have written, but also with papers that cite them.
A WhoisWho in Economics, from HoPEc + LogEc + CitEc data. A comprehensive statistical evaluation, building on citations data, log data, and the impact of the channel of publication.

The enhancements to the basic HoPEc service will be dealt with in this order. In Section 2, we discuss the requirements for these enhancements.

3. A number of deeply hidden bugs in the current HoPEc software still occasionally cause difficulties for users and increase the administrative burden. The current C++ version is costly to manage and difficult to extend. In addition, there are a number of technical deficiencies in the software:

Users don't have the option to delete their registration from the database.
There is still no secure method to allow a user to change their email address from one to another. When there is a confirmation email following a change of the email address, the confirmation should be sent to both addresses. We need special verbiage to explain that there is only one confirmation mail required.
HoPEc still treats ReDIF handles in a case-sensitive way. This needs fixing.

Therefore, the HoPEc software needs to be completely rewritten.

4. The new HoPEc software will have the application logic (including database logic) completely separated from presentation. The user interface will be governed by a set of special template files, which will contain presentation only. These template files must be documented and must be simple enough for a non-developer to read and modify. At the same time, the main application code and the database structure must also be well documented.
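
As a minimal sketch of this separation, assuming Python's standard string.Template as a stand-in for whatever template system RAS eventually adopts: the application code only prepares values, and the template holds only presentation. The template text, the field names and the render_profile function are hypothetical.

```python
from string import Template

# Presentation only: in a real deployment this text would live in a
# separate, documented template file that a non-developer can edit.
PROFILE_TEMPLATE = Template(
    "<h1>$full_name</h1>\n"
    "<p>Claimed documents: $document_count</p>\n"
)


def render_profile(person: dict) -> str:
    """Application logic: prepare plain values, then hand them to the template."""
    values = {
        "full_name": person["name"],
        "document_count": len(person["documents"]),
    }
    return PROFILE_TEMPLATE.substitute(values)


page = render_profile({"name": "Jane Doe", "documents": ["d1", "d2"]})
```

Changing the page layout then means editing the template text only; render_profile stays untouched.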

1: Accepting users' document submissions

5. From the discussion on the RePEc-tech mailing list, we conclude that what we need is a carefully designed forms-based interface for managing users' submitted metadata and documents.

6. We still think that there is a possibility that some users will find it more comfortable to provide metadata in data files. But implementation of such a system will be delayed until we have an agreed XML metadata format.

7. On the "research profile" page users will be encouraged to search for appropriate documents in the main RePEc database. In addition, they will be given a link to "add your research now", which will lead to a fresh form for submitting a research document. This will lead to a page at EconWPA.

8. "submit documents" will lead to a choice between "submit new working paper" and "submit working paper version of document that is already in RePEc", the latter internally called a "dair". The second choice will lead to a list of claimed items (paper/article/chapter), from which the user picks one. Once the user has picked an item, (s)he will be led to the same screen as for the "submit new working paper" choice. This will be the main screen for submission of a document, but all the fields for the new document will have pre-set values from the existing template that is already in RePEc. This is the only difference between submitting a new document and submitting a working paper version of a document that is already in RePEc. The dair will be a separate document, but it will have a "Handle-Paper:" reference to the existing document. User-provided template data will not suppress data that is already provided in RePEc.

9. On the upload screen, users will be required to either

upload the full-text document files to the service or
supply one or more URLs at which the full-text can be accessed.

There will be no mix and match between methods. That is, users cannot upload some full-text files and specify links to other components of the resource.
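
The either/or rule above can be sketched as a small check at form submission time. This is only an illustration; the function name and its arguments are hypothetical.

```python
def check_fulltext_source(uploaded_files: list, urls: list) -> str:
    """Return the method used by a submission, or raise ValueError."""
    if uploaded_files and urls:
        # No mix and match: uploads and URLs are mutually exclusive.
        raise ValueError("choose either file upload or URLs, not both")
    if not uploaded_files and not urls:
        raise ValueError("at least one full-text file or URL is required")
    return "upload" if uploaded_files else "url"
```

For example, check_fulltext_source(["paper.pdf"], []) returns "upload", while passing both a file and a URL raises ValueError.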

10. Users should be encouraged to submit one full-text file/URL per document, but may prefer to provide multiple formats of the text or supplemental materials.

11. At the upload/URL interface, users will be asked what to do with each component: add or remove it, or change its function, such as "full text" or "appendix".

12. For each full-text file, whether at a URL or uploaded, we need some sanity checks.

The first file should be the main text and will be analyzed to check whether it resembles a full-text file in terms of length and contents.
There will be an advertised limit for the size of uploads.
If authors submit a URL that ends in .pdf, .ps, .pdf.zip, .ps.gz, etc., the interface will assume that it is full text and will check that it is downloadable.
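
Two of these checks can be sketched as follows. The size limit, the exact suffix list and the case-insensitive comparison are assumptions made for illustration; the document fixes none of them. The download test and the analysis of the main text are left out of this sketch.

```python
from urllib.parse import urlparse

# Hypothetical values: the document does not fix the limit, and its
# suffix list ends in "etc", so this tuple is only a starting point.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024
FULLTEXT_SUFFIXES = (".pdf", ".ps", ".pdf.zip", ".ps.gz")


def looks_like_fulltext_url(url: str) -> bool:
    """True if the URL path ends in one of the recognised suffixes."""
    path = urlparse(url).path.lower()
    return path.endswith(FULLTEXT_SUFFIXES)


def within_upload_limit(size_in_bytes: int) -> bool:
    """True if a non-empty uploaded file respects the advertised limit."""
    return 0 < size_in_bytes <= MAX_UPLOAD_BYTES
```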

13. Registrants will be told to submit working papers or working paper versions of articles, chapters and other third-party copyrighted material. Users will be warned not to upload files that come from a publisher's web site, or the galley proof version of an article.

14. JEL codes supplied for the document will be entered in a simple textual input field. Upon submission of the form, the system will validate them and present the respective descriptions of the JEL codes to the user for review before (s)he gets a chance to confirm the submission. JEL codes will be encouraged, but not mandatory. The caption of JEL codes will follow the one implemented in S-WoPEc.
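
A minimal sketch of this validate-then-review step, assuming that codes consist of one letter followed by up to two digits and that the text field separates codes by commas, semicolons or spaces. The two-entry description table is a tiny illustrative excerpt, and the function name is hypothetical.

```python
import re

# One uppercase letter followed by up to two digits, e.g. "E", "E5", "E52".
JEL_PATTERN = re.compile(r"^[A-Z][0-9]{0,2}$")

# Illustrative excerpt only; a real system would load the full classification.
JEL_DESCRIPTIONS = {
    "C13": "Estimation: General",
    "E52": "Monetary Policy",
}


def review_jel_codes(raw_input: str) -> list:
    """Split the text field and pair each code with a description or an error."""
    results = []
    for token in re.split(r"[,;\s]+", raw_input.strip()):
        if not token:
            continue
        code = token.upper()
        if not JEL_PATTERN.match(code):
            results.append((code, "invalid format"))
        elif code not in JEL_DESCRIPTIONS:
            results.append((code, "unknown code"))
        else:
            results.append((code, JEL_DESCRIPTIONS[code]))
    return results
```

The returned pairs are what the review page would show to the user before the submission is confirmed.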

15. The creation year will be a mandatory metadata field. Other mandatory fields are title, author data, and upload data.
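
The mandatory-field rule could be checked as sketched below. The dictionary keys are hypothetical names for the four mandatory pieces of data, not ReDIF attribute names.

```python
# Hypothetical internal keys for the four mandatory pieces of data.
MANDATORY_FIELDS = ("creation_year", "title", "authors", "fulltext")


def missing_mandatory_fields(record: dict) -> list:
    """Return the mandatory fields that are absent or empty in a submission."""
    return [field for field in MANDATORY_FIELDS if not record.get(field)]
```

An empty result means the submission can proceed to the review step; otherwise the form would be shown again with the missing fields highlighted.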

16. There will be no support for users to change the data of other users.

2: Other changes

17. HoPEc will be renamed the "RePEc authors service", RAS.

18. Users will log into the system with their email address and a password. Once logged in, users will have an option to change their password and their email address. A password reminder will be sent on request. Currently registered HoPEc users will have their passwords randomly generated by the system and mailed to them upon the system launch.
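
The random generation of initial passwords for existing users might look like the sketch below, which uses Python's standard secrets module. The password length and alphabet are assumptions, and the mailing step is omitted.

```python
import secrets
import string

# Assumed policy: letters and digits, ten characters.
PASSWORD_ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 10) -> str:
    """Draw each character independently from a cryptographic random source."""
    return "".join(secrets.choice(PASSWORD_ALPHABET) for _ in range(length))


def initial_passwords(emails: list) -> dict:
    """Map each existing user's email address to a freshly generated password."""
    return {email: generate_password() for email in emails}
```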

19. Once a user has registered, the system will set several HTTP cookies for the domain ".repec.org" containing:

the user's email address. It will be used when the user next comes to RAS, so that the user will not have to type it again.
the user's person template handle. That will allow other ".repec.org"-based services to use it for user-identification.
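
A sketch of how the two cookies could be built with Python's standard http.cookies module. The cookie names ras_email and ras_handle, the path, and the example handle are hypothetical.

```python
from http.cookies import SimpleCookie


def build_ras_cookies(email: str, person_handle: str) -> SimpleCookie:
    """Create the two cookies, both scoped to every host under repec.org."""
    cookies = SimpleCookie()
    cookies["ras_email"] = email            # hypothetical cookie name
    cookies["ras_handle"] = person_handle   # hypothetical cookie name
    for morsel in cookies.values():
        morsel["domain"] = ".repec.org"
        morsel["path"] = "/"
    return cookies


cookies = build_ras_cookies("jane@example.org", "RePEc:per:example")
# One "Set-Cookie: ..." header line per cookie.
header_lines = cookies.output().splitlines()
```

Because the domain is ".repec.org", other services under that domain receive the handle cookie and can use it for identification, as described above.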

20. Each user of the service will automatically be assigned a unique, persistent, short (no longer than 7 characters, normally 3-4 characters) alphanumeric ID, written id here, and the user's profile will be available on authors.repec.org at the URL http://authors.repec.org/id. The URL will be advertised for the user to circulate. All new person templates will contain that ID. All existing templates will have to be rewritten for that.
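
One possible way to obtain such IDs, shown here only as a sketch, is to encode a database sequence number in base 36. The document does not say how the IDs are to be generated; the encoding and the function names are assumptions.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols
MAX_ID_LENGTH = 7


def short_id(sequence_number: int) -> str:
    """Encode a non-negative sequence number as a short base-36 ID."""
    if sequence_number < 0:
        raise ValueError("sequence number must not be negative")
    if sequence_number == 0:
        return ALPHABET[0]
    digits = []
    remaining = sequence_number
    while remaining:
        remaining, remainder = divmod(remaining, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    encoded = "".join(reversed(digits))
    if len(encoded) > MAX_ID_LENGTH:
        raise OverflowError("ID would be longer than 7 characters")
    return encoded


def profile_url(sequence_number: int) -> str:
    """Build the advertised profile URL for a user."""
    return "http://authors.repec.org/" + short_id(sequence_number)
```

With this encoding, three characters cover 46,656 users and four characters cover about 1.7 million, which fits the "normally 3-4 characters" expectation.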

21. The system will collect, from different sources, some news and questions for the user. When the user next logs into the system, (s)he will be asked those questions and presented with that news. There are several possible applications of this, of which only one will be implemented for now: users will be presented with a list of new RePEc items that have an author whose name matches theirs. The user will be told to check the list carefully and will be able to claim the appropriate items immediately.
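
The name matching could be sketched as below. The normalization rules (ignoring case, accents and punctuation) and the item structure with "handle" and "authors" keys are assumptions; real matching would probably also need to handle initials and name order.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    cleaned = "".join(c if c.isalnum() else " " for c in without_accents)
    return " ".join(cleaned.lower().split())


def suggest_items(name_variants: list, new_items: list) -> list:
    """Return handles of new items with an author matching any name variant."""
    wanted = {normalize_name(variant) for variant in name_variants}
    return [
        item["handle"]
        for item in new_items
        if any(normalize_name(author) in wanted for author in item["authors"])
    ]
```

The suggestions are only candidates: as stated above, the user still has to check each one before claiming it.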

22. RAS will use a lowercase name component in handles.
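
Together with the requirement to stop treating ReDIF handles in a case-sensitive way, this suggests comparing and storing handles through a case-folded key. The sketch below folds the whole handle, which is a stronger assumption than lowercasing the name component alone; the function names and example handles are hypothetical.

```python
def handle_key(handle: str) -> str:
    """Key used to store and compare handles without regard to case."""
    return handle.strip().casefold()


def same_handle(first: str, second: str) -> bool:
    """True if two handles differ at most in letter case or outer whitespace."""
    return handle_key(first) == handle_key(second)
```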

23. Upon confirming a registration, registrants will be sent an HTTP cookie which, stored in their browsers, may help to identify them when they return to the site. The cookie will probably contain the person handle in an encoded form and the full name of the registrant.

24. Currently, the person template files are stored in the per/pers directory of the RePEc mirror. There are now more than 4,000 template files there. The suggestion is to create alphabetically named sub-directories, from per/pers/AA/ to per/pers/ZZ/, to split the files by the first two letters of the person's last name, and to move the files there accordingly.
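
A sketch of the directory choice, assuming that only the letters A to Z count and that other characters (apostrophes, spaces, accented letters) are skipped. The document does not say how such names should be handled, so that rule is hypothetical.

```python
from pathlib import PurePosixPath


def person_template_dir(last_name: str) -> PurePosixPath:
    """Return the per/pers sub-directory for a person's last name."""
    letters = [c for c in last_name.upper() if "A" <= c <= "Z"]
    if len(letters) < 2:
        raise ValueError("need at least two letters A-Z in the last name")
    return PurePosixPath("per/pers") / "".join(letters[:2])
```

For instance, a template for the last name "Krichel" would go to per/pers/KR.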

25. We will allow registrants to confirm the changes they have made by simply replying to the confirmation email.

26. There will be an option to suppress the appearance of email in the user services.

27. Extend the person record in the database and in the ReDIF templates with more data fields. RAS will store the last modification date in an additional field, to be defined in ReDIF. A "died" field (of date type) will be added for personal records of those who left this world for a better one.

28. As a low priority, make a special administrator interface, which would support RAS administration activities and would give direct access to the database.

29. There should be support for person handles in paper and article templates. In that way the document template can point to the person template, in addition to the person templates pointing to the document templates. There is currently no precise way to invert from the latter relationship to the former because of multiple authors.

30. To simplify, we will no longer expose a search function for public use. The search function will be more hidden: at the same address where it is now, but not linked. index.html should take people to registration; search.html will be the search, powered by RAS. We hope that this will reduce the bozo coefficient and make it clearer that RAS is a data input channel, or "data sucker" as Christopher F. Baum says. There should be a rebranding of the existing HoPEc search page for use as WoPEc's view into this data (FiRe author search).

31. A password system will be introduced. When users register, they will have to give a password, and they will use that password in subsequent interaction. A password reminder service via email will be available. This is common practice and well understood now. The name plus date device will still be used to construct the handle, but neither name nor date will be part of the login procedure. As a transition measure, existing users will be asked to create an account and password when they log in as an existing user. There will be no confirmation email.

32. URLs used by the RAS system, that lead to the descriptions of people and papers, should be short.

33. Changes of name will require a special request. Name changes are sufficiently rare to deal with them manually. No change in the person handle will occur when a name change occurs. Persons who change name have to log in with their old name and old date, but can choose to be displayed under a modified name.

34. There should be an additional screen for RAS that allows people to specify fields of interest and register for NEP. Ultimately changes of email address in RAS should generate changes in NEP registrations.

35. Two additional features are required for the RAS reminder mailings, currently run by Christian Zimmermann. There should be a Perl script that generates an indication of the number of unclaimed items that the registrants may want to add. There should be a CGI script that displays the unclaimed items, without display of existing items. This should be similar to the bottom part of the current update screen.

36. Email confirmation is required when registering the first time or if the user changes the e-mail since this serves to verify that it is a functioning address.

37. There will be an "on-request" RAS database update service, similar to the one recently launched for HoPEc, operating through HTTP. It will have two distinct features:

It will run permanently, so that update requests will be performed immediately. Currently the requests are checked periodically, every 15 minutes.
It will be integrated with mirroring, so that for any (remote) archive, the RAS update software will first perform mirroring (if necessary), and then will update the archive's data in the database.
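
The mirror-then-update order can be sketched as a simple request loop. The callables is_remote, mirror and update_database are hypothetical placeholders for the real RAS components. A permanently running service would block while waiting for the next request; this sketch drains the queue and returns so that it stays easy to test.

```python
import queue


def process_requests(requests, is_remote, mirror, update_database) -> list:
    """Handle each requested archive as soon as it is taken from the queue."""
    handled = []
    while True:
        try:
            archive = requests.get_nowait()
        except queue.Empty:
            break
        if is_remote(archive):
            # Remote archives are mirrored first, so the update sees fresh data.
            mirror(archive)
        update_database(archive)
        handled.append(archive)
    return handled
```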

38. The developers will provide a general Perl- and MySQL-based toolkit, which will allow remote querying of the RAS database for personal data. The toolkit will be used by EconWPA to suggest data on RAS-registered persons to document submitters and to find out the person handles.

It will also be available to other services, but access to the RAS database will be reasonably restricted.

39. Currently, the RePEc author data has some very bad problems. Some archives supply data in which a single "author-name" attribute (and thus a single author-person cluster) contains several author names.

The current HoPEc software tries hard to resolve such cases and the results are good. It also performs some "sanity" checks on names, finding many bogus entries (with numbers, with HTML codes, with institution names and so on). This software writes a log of all certain and suspected problems in personal names data. That log can be used to actually correct the personal names where appropriate (automatically, by a script) or at least to bring the exact templates that need correction to the attention of the data maintainers. This software will be maintained in the RAS project.
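
The kinds of checks described above can be illustrated with a few regular expressions. This is not the existing HoPEc code; the patterns, the keyword list for institutions and the problem labels are all assumptions.

```python
import re

HTML_PATTERN = re.compile(r"<[^>]+>|&#?\w+;")
INSTITUTION_PATTERN = re.compile(
    r"\b(university|institute|department|bank)\b", re.IGNORECASE
)
SEVERAL_AUTHORS_PATTERN = re.compile(r"\s(and|&)\s|;", re.IGNORECASE)


def name_problems(author_name: str) -> list:
    """Return labels for the suspected problems in one author-name value."""
    problems = []
    if re.search(r"\d", author_name):
        problems.append("contains digits")
    if HTML_PATTERN.search(author_name):
        problems.append("contains HTML")
    # Remove HTML first so that the ";" of an entity is not taken
    # for a separator between several authors.
    without_html = HTML_PATTERN.sub(" ", author_name)
    if SEVERAL_AUTHORS_PATTERN.search(without_html):
        problems.append("possibly several authors")
    if INSTITUTION_PATTERN.search(author_name):
        problems.append("looks like an institution")
    return problems
```

Each non-empty result would become one entry in the problem log, together with the handle of the template in which the name was found.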

3: RAS and EconWPA

40. The maintenance of the infrastructure to run HoPEc will be transferred from Manchester to Washington University in St. Louis.

41. RAS will be placed in the RePEc domain, but every single page will say that it is hosted at Washington University in St. Louis. The redesigned Authors Service home page will be hosted at authors.repec.org to allow for bookmarking.

42. Thomas Krichel will sysadmin the machine; Bob Parks will look at the machine if physical problems occur.

43. The hardware for EconWPA and RAS will be separate. While it is easier to operate on a single machine, the design on two machines will allow for better separation of the two functions if need be in the future.

44. RAS and EconWPA will live on a single machine or on neighbouring machines. There will be "integration" on the filesystem level (using NFS if there are two machines). That would remove the need for web-based mirroring of EconWPA's archive. A document submitted to EconWPA will be available to RAS in the form of a ReDIF file immediately; EconWPA then requests an update through the "on-request" update service and the document ends up in RAS.

45. There will be no special integration between EconWPA and RAS. In the longer run, RAS will have a standard and documented way to accept data from any RePEc archiving service. RAS will provide a set of tools for archiving services that wish to use personal data from it.