Collecting digital documents


If you are reading a printed copy of this page, you are reading an incomplete version. Please print the version for US letter paper or for A4 sized paper.

Course Description

More and more archives and libraries are trying to build collections of digital documents to support their parent institutions' mission of disseminating information. Many of these documents have long-run value, and many of them come as complex objects. Therefore these documents require curatorial efforts that go beyond simple storage on a web site. At this time, such documents are typically stored in formal repositories. Repository building and maintenance is a crucial avenue for libraries to increase their relevance in the digital age.

There are three basic challenges in collecting digital documents. First, there is the issue of maintaining a server to house and maintain the collection. Second, there is the issue of understanding the complicated structure of digital documents. For example, a digitized book may come as a set of scanned images, some text that has been extracted from the pages, some cover art and some metadata on the origin of the book, i.e. the particular edition and print. The book therefore is a composite object made of distinct files, in different formats, that have to be kept together. Third, there is the issue of managing the collection itself, and of implementing that management through the software. The collection policy can be extremely simple. But it has to be precise, because the collection management software has to be set up to enforce it.
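The second and third challenges can be illustrated with a small sketch. It is not Fedora Commons code, and it is not the way the course implements anything. The class name DigitizedBook, the function policy_violations, the file names and the policy itself are all made up for illustration. The sketch only shows how a book can be held as one composite object of several files, and how a simple but precise collection policy can be checked by software.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitizedBook:
    """Hypothetical composite object: distinct files that must stay together."""
    identifier: str
    page_images: List[str] = field(default_factory=list)     # scanned page files
    extracted_text: List[str] = field(default_factory=list)  # text extracted from pages
    cover_art: List[str] = field(default_factory=list)       # cover image files
    metadata: Dict[str, str] = field(default_factory=dict)   # e.g. edition, print

    def all_files(self) -> List[str]:
        """Every file that belongs to this one book."""
        return self.page_images + self.extracted_text + self.cover_art


def policy_violations(book: DigitizedBook) -> List[str]:
    """Check a book against a made-up, deliberately simple collection policy:
    it must have scanned pages, and its edition and print must be recorded."""
    problems = []
    if not book.page_images:
        problems.append("no scanned page images")
    for key in ("edition", "print"):
        if not book.metadata.get(key):
            problems.append(f"missing metadata: {key}")
    return problems


# Example data, invented for this sketch.
book = DigitizedBook(
    identifier="book-0001",
    page_images=["page-001.tiff", "page-002.tiff"],
    extracted_text=["page-001.txt", "page-002.txt"],
    cover_art=["cover.jpg"],
    metadata={"edition": "2nd"},
)
print(policy_violations(book))  # prints ['missing metadata: print']
```

A real repository would store such objects and enforce such rules through its own configuration. The point here is only that the policy must be stated precisely enough for a program to test it.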

This is a hands-on course that develops repositories for document collection from scratch. See the prerequisite section for technical details.

All software used is open source. Broadly, there are four parts to the course. First, the students set up an open source operating system. Second, the students set up an open source repository system. Third, the students configure the repository to implement a collection scenario that each student defines. Fourth, the students feed some documents into it.

The operating system is the testing version of Debian GNU/Linux. The repository software is Fedora Commons, version 3. The repository software could certainly be installed on a different operating system.

Course objectives

After taking this course students


There are no educational prerequisites for this course. However, there are some technical prerequisites. Students who are unsure about these should ask in plenty of time before the course starts, by email to the instructor, with subject "LIS901-6".

The server hardware can come from three sources. The easiest, but possibly most expensive, option is to rent a server. Server rental costs start at about $30 a month. Students who want to rent should arrange for the rental shortly before class starts. Students who rent a server should make sure it runs the Debian (preferred) or Ubuntu (should work too) operating system. Students need to have ssh access data for the server. This would be a password for the user "root", and a domain name or IP address for the server. Students should bring their own laptops to class. Before renting a server, students may wish to check with the instructor whether the particular server type is suitable. If a server recommended by the instructor turns out to be unsuitable, the instructor will reimburse the student for the cost incurred.

First, students who have a recent machine that runs a Microsoft operating system can use VMware to install a second operating system on that computer. If that computer is a laptop, students have the benefit of being able to carry their server with them easily. The drawback is that, without a dedicated machine, the server may not always be available. With a recent machine, this is the easiest option.

If a student has an older PC that is surplus to requirements, it is quite easy to make this a dedicated machine to run the server. The computer should be an Intel or AMD processor based PC. Installing on another kind of computer is possible, but would add difficulty. An Intel-based computer needs to have at least a Pentium processor. It should have 500 megabytes of RAM and 4 gigabytes of disk space. Most old computers will do a lot better than that. The computer needs to be dedicated to the course for the duration of the course, but can be put to other uses once the course is over. The instructor will try to collect old computers for those who have difficulty finding a computer.
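Once a Linux system is running on the machine, the memory and disk figures above can be checked with a few lines of Python. This is only a sketch under stated assumptions: it reads the megabytes and gigabytes as decimal units, it measures the disk that holds the root directory, and the function names meets_minimum and measure_machine are made up for this example. It is not part of the course software.

```python
import os
import shutil

# Minimum figures from the course description, read as decimal units (assumption).
MIN_RAM_BYTES = 500 * 10**6   # 500 megabytes of RAM
MIN_DISK_BYTES = 4 * 10**9    # 4 gigabytes of disk space


def meets_minimum(ram_bytes: int, disk_bytes: int) -> bool:
    """True if both the RAM and the disk size reach the course minimum."""
    return ram_bytes >= MIN_RAM_BYTES and disk_bytes >= MIN_DISK_BYTES


def measure_machine(path: str = "/"):
    """Return (RAM in bytes, total disk size in bytes) on a Linux system."""
    ram = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    disk = shutil.disk_usage(path).total
    return ram, disk


if __name__ == "__main__":
    ram, disk = measure_machine()
    print(f"RAM:  {ram / 10**6:.0f} MB")
    print(f"Disk: {disk / 10**9:.1f} GB")
    print("meets course minimum" if meets_minimum(ram, disk) else "below course minimum")
```

The check is split into a measuring function and a comparing function so that the comparison can be tried with figures read off a machine's label before any system is installed on it.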

If a student has no such machine, a spare machine from the stock of the university may be available.

If the student has none of this, it is possible to rent a server. A server can be rented for as little as $30 a month. Here is a link to a list of server companies.

If a student wishes to run a server beyond the lifetime of the course, for example to run it as a web server, it is cheap enough to buy a barebones machine, comprising case, motherboard, memory and disk, for about $200.

Students should have a network connection at home. It is best if the network goes via a cable or DSL modem connection that leads to an Ethernet connection. The server can be connected to a router of the kind that is common in home networking scenarios. In that case the student needs to have administrator rights to the router. If a student does not have such a connection at home, it is possible to host the computer in the instructor's office, but public services on such computers would, bar an extensive and very complicated effort, only be visible on campus. In addition, the server would be at risk from networking policy changes at the CW Post campus.


Thomas Krichel
Palmer School of Library and Information Science
C.W. Post Campus of Long Island University
720 Northern Boulevard
Brookville, NY 11548–1300
work phone: +1–(516)299–2843
skype: thomaskrichel

Private contact details may be obtained from the online CV.

Class structure

Classes are held on the CW Post campus between 10:30 and 16:30. There will be some talks by the instructor, but most of the work is done hands-on by the students.

Outside of class time, support via skype is available pretty much around the clock, i.e. unless the instructor is riding his bicycle or is asleep.

The class schedule is:

2009–08–24 10:30 to 16:30 Important concepts of open source software, computer hardware and computer networks.
2009–08–25 10:30 to 16:30 Theoretical background to the Fedora Commons software. Installing the operating system. Trial transfer of servers home.
2009–08–26 10:30 to 16:30 Installing and configuring required software. Security and backup. Project plans and service models.
2009–08–27 10:30 to 16:30 Setup of repository software.
2009–08–28 10:30 to 16:30 Configuring and collecting.
2009–04–25 10:30 to 16:30 Presentation of finished work. Essentially a spare session.

Contents of individual sessions may differ from this schedule because the course is experimental.

To print the slides in Microsoft PowerPoint, press control-p to print, then under "Print what" choose "Handouts", and under "Color/grayscale" choose "Pure Black and White". You can also use OpenOffice to print the slides. The slides posted here are drafts until the time that the class is held.


Mailing list

There will be a mailing list for the course. All students are encouraged to subscribe. As a rule, answers to email sent to the instructor are copied to the list. There are exceptions to this rule.


There is no text for this course. The expertise acquired in the course is very difficult to find in existing literature because it spans a wide area of subjects. Some Internet sites include

Formally authored documents include

Other resources are on the class resource site.



Since this course is essentially experimental, there will be no quizzes. Any student who gets a server running that implements a collection strategy, which the student can summarize in a one-page printed statement, will get an A. Students who cannot complete this task by the end of the week following the end of classes may ask for an incomplete until such time as they have been able to complete it. Students may also wish to settle for a lower grade if they do not want to set up the full server. But this should be a rare occasion.
