From open source to open libraries.


0. About this paper

This is a draft for a contribution to the ASIS&T bulletin.
It is online at http://openlib.org/home/krichel/papers/kuyus.txt. Some
of this paper was written while I enjoyed the hospitality of Siberian
Federal University during its Sibir i Mir festival. I am grateful
to ??? for comments that have helped to improve the paper. 


1. Introduction

Much of what we have been reading in this issue is concerned with
open source software (OSS) in libraries. The basic angle there is to
look at what is being done with OSS in libraries, or what can be
done. This invited contribution takes a broader look. I am interested
in the speculative implications of OSS for libraries. These
implications go beyond what OSS can bring to libraries in their
current mission. They look at what can be learned from the OSS
movement to understand the changing nature of libraries as we move
from print to electronic storage, and from slow physical transport to
fast transport via computer networks.

2. Some theory

I will start by drawing a parallel between software and
libraries. It may appear far-fetched, but I hope it is
nevertheless interesting.

Let us start with the software. Conceptually, a piece of software can
be thought of as three things. First, there is the interface that a
user sees. I can open a document in, say, OpenOffice, and I see a set
of interface elements, such as a prompt for keyboard input and buttons
I can press with the mouse, that allow me to manipulate the document. In
principle, I can imagine another interface to the software. And often
enough, the same piece of software supports slightly different
interfaces depending on what computer system it runs on. Second, there
is the code that makes the software work. This is usually some textual
data called the source code. OSS is software where the customer who
acquires the software also gets the source code for it. Finally, there
are the objects that the software manipulates. Some may object that
the objects manipulated by the software are separate from the software
itself. But surely, every piece of software is tailored to the objects
it manipulates. For example, picture editing software needs to know the
structure of the picture as its underlying object. If that structure
changed, the software would be next to useless. Of course, OSS is
not about making manipulated objects freely available. But generally,
if many of the objects that a piece of software manipulates are freely
available, that is good for the software itself, because it is cheaper
to get hold of existing objects to manipulate.

Let us turn to the library. Just as with software, there are three
elements to a library. First, there is a way that the collection can
be accessed. Whether the library is a building or a digital collection
accessible via a web site is not important here. Surely each of these
types has a different way to access it, but the interface can be
thought of as a separate component of the library. Second, there is
the description of the collection. Like the middle component of the
software, this is the central part of the library. It contains
descriptions of the objects held, as well as links between the objects
and the users. Finally, there are the objects that the library
holds. In a digital library these are usually referred to as full-text
files. In a physical library they are books and periodicals. Again, as
in the case of the software, the objects that are manipulated by the
library do not necessarily have to be freely available. But it will
help the library if they are, because more library users can access
and use them. But surely a pack of digital objects on a disk does not
constitute a library, just as a pile of other digital objects does not
make up a piece of software, if we look at the software from a service
perspective.

Thus we can think of the source code as the core of the software, and
the description of objects as the heart of the library. Then we
can draw a parallel between open source software and open
libraries: we can think of an open library as a freely available
collection of descriptions of digital objects.

Historically, I have been cultivating the term open library for a long
time. ??? and ??? are examples of my writings on the topic. I
registered the domain openlib.org in ????. I did not register
openlibrary.org at the time since I initially thought it was too
long. Later I could not buy it because I did not want to spend the
money the owners asked for. Nowadays the name is used for a project
sponsored by the Internet Archive. My social approach to open
libraries is quite different. It is much more in line with what OSS is
about. OSS is really not a project that one organization runs. Rather,
it is a very large set of small-scale projects, many of which achieve
great things because they are compatible with others.

3. A look at OSS from a social perspective

A bit of history helps. Since the 1980s, Richard M. Stallman has
called for GNU. GNU stands for "GNU's Not Unix". It is a free
replacement for UNIX. I guess that in the early days, most people were
skeptical that this could be undertaken. And it is still not quite
complete. But the spirit of GNU lives on, and it lives on more
strongly than most people ever expected. Free operating systems for
computers are a reality. They may not be called GNU systems, but they
are free in the sense that Stallman envisioned. Is there a lesson here
for information professionals? Can information professionals create
and maintain open libraries, just like computer professionals create
and maintain freely available software? Clearly, if they want to, some
insight into how OSS comes about will be useful for understanding how
open libraries can come about.

Computer software uses highly structured information in its source
code. On the level of an entire computer, this large amount of
highly structured information reaches an extraordinary level of
complication. Let me illustrate this with a simple example. My
operating system of choice is Debian GNU/Linux. The system consists of
a set of packages. When I looked in May 2008, there were 22456
packages available in my (typical) installation. Each package provides
a certain functionality. When I want to add a functionality to my
computer, I add a package, say package A. But packages are not
independent. More often than not, when I add a new package, I am told
that there is a bunch of other packages I have to install as well,
because without these my desired package A will not run. And there is
also a bunch of other packages that are suggested by package A. I am
told that when I run package A, I may also want to install packages B
and C, which are just friends of package A. Actually, on a technical
level things are even more complicated. Each package comes with a
version number. Package A version 1 may require package B version 2 or
higher. It may be incompatible with version 2.1 of package B. And so
on. You get the idea.
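
To make the version example above concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration: the
catalogue, the version numbers, and the helper names are hypothetical,
and a real package manager such as Debian's applies far richer rules.

```python
# Hypothetical catalogue: each (name, version) maps to its requirements.
# A requirement is (package name, minimum version, set of excluded versions).
CATALOGUE = {
    ("A", (1,)): [("B", (2,), {(2, 1)})],
    ("B", (2,)): [],
    ("B", (2, 1)): [],
    ("B", (2, 2)): [],
}


def satisfies(version, minimum, excluded):
    """True if version is at least minimum and not explicitly excluded."""
    return version >= minimum and version not in excluded


def missing_requirements(package, installed):
    """List the requirements of package that the installed set does not meet.

    installed maps package names to the installed version tuple.
    """
    problems = []
    for name, minimum, excluded in CATALOGUE[package]:
        version = installed.get(name)
        if version is None or not satisfies(version, minimum, excluded):
            problems.append(name)
    return problems


# Package A version 1 needs B at version 2 or higher, but not B 2.1.
print(missing_requirements(("A", (1,)), {}))             # B is absent
print(missing_requirements(("A", (1,)), {"B": (2, 1)}))  # B 2.1 is excluded
print(missing_requirements(("A", (1,)), {"B": (2, 2)}))  # B 2.2 is acceptable
```

Versions are written as tuples so that Python's ordinary tuple
comparison orders 2, 2.1 and 2.2 in the expected way.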

Before I bore you with more technical details, let's move away from
technology and look at people. Let's look at the people who package
the software. Let us call them packagers. Most packagers are not the
authors of the software they package. They know the software well and
they know the operating system well. They work at the interface
between the software and the operating system. They take the software
as an input and contribute to the operating system. In doing that,
they may change some aspects of the software. They can do this because
the software can be changed. Thus it can be adapted to the
requirements of the operating system. These requirements stem from the
fact that the operating system requires packages to work together, and
that users may change aspects of the software but still want to have
it update gracefully when the latest and greatest version of the
software comes out.

How come all of this can be made available at no cost? The key
is reuse. All pieces of software rely on libraries. Yes, geeks use
libraries too. But their libraries are actually computer files. These
special files contain compiled pieces of code that have already been
written by somebody else. When I write software for my digital library
systems, I use a language called Perl. The code that I write in Perl
is a simple text file. But I am not writing all my software from
scratch using the commands that Perl provides. Instead I use a system
of Perl code called modules. They contain Perl code that has already
been written by somebody else to get common tasks to run. These
modules form an open library of code. So this is one way we have
reuse. In a similar way, Perl itself is written in a language called
C, immortalized in "Write in C", a parody of the Beatles song "Let It
Be". C code itself relies on C libraries to achieve common tasks such
as showing a character on the screen. Maintainers of the Perl software
reuse these libraries. In the same way, when you use a web site, the
web server, most likely Apache, will answer your requests. The web
server software is also written in C, and uses the very same libraries
to achieve common tasks, for example reading a file.

Reuse is a technical feature, but it is closely related to the social
organization of OSS. OSS writers usually maintain just a few,
sometimes just one, piece of the extraordinarily complex edifice that
is the software that runs on a computer. But the important thing is
that they work in a way that is compatible with others.

Now, some readers may object that digital libraries contain objects
that are produced with a profit motive in mind. Such readers should
remind themselves that I am writing about digital libraries as
descriptions of objects. If these objects are freely available for
use, all the better. Recall that OSS does not prevent its users from
creating objects with a profit motive. You can use OSS to write a
piece of poetry and charge people to download a copy of the
poetry. Similarly, the open library as envisioned here works at the
level of the description of the objects, which I call the central
aspect of a library. And remind yourself that not all underlying
objects in a library have been written with the purpose of an
immediate profit from sale. For example, I have not been paid to write
this paper.


4. Examples of open digital libraries

Of course, I would not be boring you with all this theory if there
were not already some examples of open libraries out there. I
humbly think that I have created probably one of the very finest, the
RePEc open library, see http://repec.org. I formally co-founded RePEc
in 1997, but it really goes back to efforts I made in 1993 to collect
information about scientific papers in economics and make it freely
available. Today, RePEc is based on close to 900 RePEc archives. These
furnish bibliographic information that can be harvested. User services
are separate from these archives. They provide aggregates of the
bibliographic data for end users. Thus the provision of archives and
the provision of interfaces to the data in the archives are
separated. Archive maintainers face a situation where they maintain
data that is used in a variety of interfaces. The more widely the
data is seen, the stronger their incentive to keep it up to
date. User services have a wide set of data to show. This attracts
wide usage. But as a whole, the most important thing about RePEc is
that it is sustainable without external subsidy. RePEc has no budget,
no officers, no meetings, and no formal decision making process. And
it no longer relies on a single person to keep things together.
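
The separation between archives and user services can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: the archive names, the record fields, and the two toy
services are hypothetical and do not reproduce the actual RePEc
formats or software.

```python
# Hypothetical archives, each maintained independently of the others.
ARCHIVES = {
    "archive_a": [{"id": "a:1", "title": "Paper one", "year": 1995}],
    "archive_b": [
        {"id": "b:1", "title": "Paper two", "year": 1997},
        {"id": "b:2", "title": "Paper three", "year": 1993},
    ],
}


def harvest(archives):
    """Collect the records of all archives into one aggregate list."""
    aggregate = []
    for name, records in archives.items():
        for record in records:
            # Remember which archive each record came from.
            aggregate.append({**record, "archive": name})
    return aggregate


def title_listing(aggregate):
    """One user service: an alphabetical list of titles."""
    return sorted(record["title"] for record in aggregate)


def newest_first(aggregate):
    """Another user service: record identifiers, most recent year first."""
    ordered = sorted(aggregate, key=lambda record: record["year"], reverse=True)
    return [record["id"] for record in ordered]


records = harvest(ARCHIVES)
print(title_listing(records))
print(newest_first(records))
```

Both services read the same harvested aggregate, so a correction made
once in an archive shows up in every interface built on top of it.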

How did RePEc come about? If you start with empty archives, you have
empty user interfaces. Who will start the first archive, and who will
build the first user interface? Well, I did a lot of that in the first
year, 1993. In 1994 Jose Manuel Barrueco Cruz joined me. By the time
we got funding from the Joint Information Systems Committee of the
United Kingdom's higher education funding bodies, in 1995, we had
already collected, I think, 5000 descriptions of online
papers. Although I can't prove this scientifically, just having a few
people compile bibliographic data is not how things have really scaled
up. In our case of an academic open library, the thing that really
worked to bring in community involvement (yes, that is what is needed)
was to build an author registration system. In RePEc's case, there is
now a special interface that authors use to register the works that
they have written and that are catalogued in the RePEc data. In 1998 I
supervised a student by the name of Markus Johannes Richard Klink. We
set this up in 1999. The system is now called the RePEc Author
Service.

Author registration is a piece of glue that provides me with
a link between the things I have done in the past (RePEc and
other projects that I am not talking about here) and my plans
for the future, and some of the current work that I do.

What's the big deal about author registration? Well, on its own,
nothing much. Author registration has to be seen within the wider
context of an academic digital library.

Author registration allows us to put the papers of a person on a
common page. This starts to look like part of a CV. But there are two
crucial differences. First, instead of the researcher having to
compile and format the bibliographical information on her own, here
she can just say what papers she has written. Second, the publication
list is reusable across further services. The key is reusability. The
CV is not reusable, the author registration record is.

Readers may object that a CV is not just a list of document descriptions
that a person has written. I 

Author registration allows for a battery of author rankings calculated
from various measures of usage. Typically such measures include the
number of abstract views and the number of full-text
downloads. Interestingly, although such usage is distributed across
RePEc services, the services collaborate to build a cross-service
usage database. This dataset is called LogEc. It was built by Sune
Karlsson. He also maintains it. Other measures of usage include
citations. There is a citations database that is part of RePEc. It
is the CitEc system built and maintained by Jose Manuel Barrueco Cruz.
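
As a rough illustration of how registration records and usage counts
can combine into a ranking, consider the following Python sketch. The
identifiers, the counts, and the simple summing rule are all
hypothetical; they are assumptions made for the example and not a
description of how LogEc or the RePEc rankings are actually computed.

```python
from collections import defaultdict

# Hypothetical usage counts per paper: (abstract views, full-text downloads).
USAGE = {
    "paper:1": (120, 40),
    "paper:2": (80, 10),
    "paper:3": (200, 90),
}

# Hypothetical author registration records: author -> papers claimed.
CLAIMS = {
    "author:ann": ["paper:1", "paper:2"],
    "author:bob": ["paper:3"],
    "author:cho": ["paper:2"],
}


def rank_authors(claims, usage, measure):
    """Rank authors by the sum of one usage measure over their claimed papers.

    measure is 0 for abstract views and 1 for full-text downloads.
    """
    totals = defaultdict(int)
    for author, papers in claims.items():
        for paper in papers:
            totals[author] += usage.get(paper, (0, 0))[measure]
    # Highest total first; ties are broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


by_views = rank_authors(CLAIMS, USAGE, measure=0)
by_downloads = rank_authors(CLAIMS, USAGE, measure=1)
print(by_views)
print(by_downloads)
```

The point of the sketch is that the ranking needs nothing beyond the
registration records and the usage data; the same records could feed
any number of other measures.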

The gathering of usage records, and the linking of the usage records
with 


5. Reasons to be cheerful

In recent years a lot of progress has been made in the technologies
that enable the reuse of textual data. First, XML has established
itself as a lingua franca of textual data. There is a large set of
technologies that surround XML. One important example is XML
Namespaces. Using namespaces, XML allows for the mixing of elements
from different namespaces. In simple terms, it allows you to use
different vocabularies of terms in a single document. This is useful
when uniting sets of metadata. For example, Peter and Paul can each
describe a bunch of objects using a different set of
properties. Here is an example from my own work. Most of my data
collection uses a format called AMF. OK, much of the design of it I
did myself in 2000. It has been very useful. It allows me to
describe objects of my interest, in the scholarly communications
domain, in the way I want to describe them. But sometimes I have some
special needs that I can't describe in AMF without extending it in
ways that are only useful to solve a particular issue that I have
with a specific system. Well, AMF is built as an open format. It
accepts foreign vocabularies from other namespaces pretty much
anywhere.
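
The mixing of vocabularies can be illustrated with a short Python
sketch using the standard xml.etree.ElementTree module. The two
namespace URIs and all element names below are hypothetical
placeholders chosen for the example; they are not the actual AMF
namespace or AMF elements.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs and all element names are hypothetical placeholders.
CORE_NS = "http://example.org/core"
LOCAL_NS = "http://example.org/local"

DOCUMENT = f"""
<core:text xmlns:core="{CORE_NS}" xmlns:local="{LOCAL_NS}">
  <core:title>From open source to open libraries</core:title>
  <local:reviewstatus>draft</local:reviewstatus>
</core:text>
"""


def split_by_namespace(xml_text):
    """Group the child elements of the root by the namespace they belong to."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for child in root:
        # ElementTree writes qualified names as "{namespace}localname".
        namespace, _, local_name = child.tag[1:].partition("}")
        grouped.setdefault(namespace, {})[local_name] = child.text
    return grouped


fields = split_by_namespace(DOCUMENT)
print(fields[CORE_NS])   # properties from the core vocabulary
print(fields[LOCAL_NS])  # properties from the extension vocabulary
```

A service that only knows the core vocabulary can read its own
elements and ignore the rest, while a specialised system can pick up
the extension elements from the same record.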


Another piece of good news is that the primary publishing business,
the provision of contents, has also made great strides in the
provision of open-access information in a loosely structured form. The
best example is Wikipedia. Making Wikipedia a lot more structured is
difficult because of the general level that it operates at. But the
example illustrates well that free information can be of a very high
quality indeed.


6. The obstacles

There are powerful obstacles to achieving free digital libraries.
I am not sure which ones are the most important here, so I give
a somewhat random list. They are technical incompetence and two
more theoretical issues: the myth of industry and the myth of the
full text.

Technical incompetence is a major limit. Unicode, XML and its related
technologies such as XML Schema and XSLT, XHTML, CSS, SQL, OAI-PMH and
OAI-ORE, operating system skills, basic knowledge of networking, and
above all, knowledge of a scripting language such as Perl or PHP: it
all adds up to a large body of knowledge. While it is not required
that every digital library builder has a deep knowledge of each of
these areas, a deep understanding of at least a few of them, as well
as programming skills, is required. Without this prequalification, you
are not operational. Usually none of this is taught in library
school. For years, I have been battling to introduce at least a small
part of this to the curriculum, but with limited success. As a result,
the average library school graduate has virtually no chance to get
involved in digital library building. One may argue that this work can
be left to technical staff and that the library staff only need to
design the system. However, this is a fallacy. Without having studied
the technical underpinnings, library staff lack the analytical
reasoning skills that are required to even get started with the design
of new systems. As a result, innovation in libraries is stifled. There
is a tendency to contract out. As I wrote in a mailing list recently,
"libraries are outsourcing to their death".

Another problem, more difficult to understand, is something
I have called "the myth of industry". It is the tendency
of people involved in digital library work to keep the work
for themselves and limit interoperability. Their idea is that
they have built the data, and therefore they want to keep
tight control over its usage. Therefore you have to ask them
to get a copy of the data, and more often than not, the answer
is no. For example, it has not been possible for me to get a
copy of the Astrophysics Data System data to use in my AuthorClaim
registration service. What industry mythers do not understand is that
by giving the structured data away freely, they encourage reuse
of the data. It means that their own contributors have better
incentives to contribute data since it is more widely used. That
is, by giving away their structured data, open access publishers
and digital library builders add value to their collections, rather
than reducing value. But this point is not widely understood.

Finally, there is the worship of the full text. It is an inability to
reason beyond end users reaching the full text. Many people place the
full text at the center of their collection development. If the full
text is really a textual object, and if it is freely available, as it
should be, it can quite easily be indexed by a full-text engine, say
Google. Thus textual metadata attached (in some form) to the full text
is not really important, since the same textual string can also be
found in the full text. So people put documents on a web site, have
them indexed by Google, and say: that's it, I am done. That is fine if
the document is an announcement of your next birthday party. But it
appears insufficient
when we deal with important documents such as scientific papers, legal
documentation, technical documentation or works of art. In these
fields we are typically interested not only in getting access to the
full text. In fact we are also interested in the links between these
objects. In patents, we are interested in citation links between
patents, who applied, whether the patent was approved or not, etc. In
academic work, we need to know who the author was. In preservation we
need to know what general class a full-text object belongs to, so that
we can reach a decision about whether and how to preserve it. All
these concerns require registries. And these registries have to be
compiled at least partly by hand. This is the job of digital
librarians. Registries make a digital library more than just a
collection of computer files available via HTTP. But work on
registries cannot progress unless more people realize their
importance. In digital libraries, the full text should not be
considered to be of central importance. Rather, it should be an
attribute, hopefully within the structure of an OAI-ORE record that
describes a digital object worthy of interest.


I am not going to give you examples of how these obstacles impact
current projects. But I invite you to think about, for example, the
National Science Digital Library within the context of these obstacles.

7. Conclusions

The most important trend libraries are facing is the increase of free
access information resources. Nowhere is this more visible than on
the web. More and more serious information is being made available for
free on web sites. Project Gutenberg was an early starter. Many
newspapers, for example, have been building web sites and offer much
of their contents for free on these sites.  Many institutions offer
important information about themselves on web sites. Encyclopedic
knowledge is more widely available than ever thanks to Wikipedia.

The fact that we now have freely available computer operating systems
is obviously a part of the attention economy trend.