From open source to open libraries.

0. About this paper

This is a draft for a contribution to the ASIS&T bulletin. It is online at http://openlib.org/home/krichel/papers/kuyus.txt. Some of this paper was written while I enjoyed the hospitality of Siberian Federal University during its Sibir i Mir festival. I am grateful to ??? for comments that have helped to improve the paper.

1. Introduction

Much of what we have been reading in this issue is concerned with open source software (OSS) in libraries. The basic angle there is to look at what is being done with OSS in libraries, or what can be done. This invited contribution takes a broader look. I am interested in the speculative implications of OSS for libraries. These implications go beyond what OSS can bring to libraries in their current mission. They concern what can be learned from the OSS movement to understand the changing nature of libraries as we move from print to electronic storage, and from slow physical transport to fast transport via computer networks.

2. Some theory

I will start by drawing a parallel between software and libraries. It may appear far-fetched, but I hope it is nevertheless interesting.

Let us start with the software. Conceptually, a piece of software can be thought of as three things. First, there is something that a user can use. I can open a document in, say, OpenOffice, and I see a set of interface elements, such as a prompt for keyboard input and buttons I can press with the mouse, that allow me to manipulate the document. In principle, I can imagine another interface to the software. And often enough, the same piece of software supports slightly different interfaces depending on what computer system it runs on. Second, there is the code that makes the software work. This is usually some textual data called the source code. OSS is software where the customer who acquires the software also gets the source code for it.
Finally, there are the objects that the software manipulates. Some may object that the objects manipulated by the software are separate from the software itself. But surely, every piece of software is tailored to the objects it manipulates. For example, picture editing software needs to know the structure of the picture as its underlying object. If that structure were to change, the software would be next to useless. Of course, OSS is not about making manipulated objects freely available. But generally, if many of the objects that a piece of software manipulates are freely available, that is good for the software itself, because it is cheaper to get hold of existing objects to manipulate.

Let us turn to the library. In a similar way to what I have explained for software, there are three elements to a library. First, there is a way that the collection can be accessed. Whether the library is a building or a digital collection accessible via a web site is not important here. Surely each of these types has a different way to access it, but the interface can be thought of as a separate component of the library. Second, there is the description of the collection. Like the middle component of the software, this is the central part of the library. It contains descriptions of the objects held, as well as links between the objects and the users. Finally, there are the objects that the library holds. In a digital library these are usually referred to as full-text files. In a physical library they are books and periodicals. Again, as in the case of the software, the objects that are manipulated by the library do not necessarily have to be freely available. But it will help the library if they are, because more library users can access and use them. But surely a pack of digital objects on a disk does not constitute a library, just as a pile of other digital objects does not make up a piece of software, if we look at the software from a service perspective.
Thus we can think of the source code as the core of the software, and the description of objects as the heart of the library. Then we can draw a parallel between open source software and open libraries: we can think of an open library as a freely available collection of descriptions of digital objects.

I have been cultivating the term open library for a long time. ??? and ??? are examples of my writings on the topic. I registered the domain openlib.org in ????. I did not register openlibrary.org, since I initially thought it was too long, and registered openlib.org instead. Later I could not buy it because I did not want to spend the money the owners asked for. Nowadays the term is used for a project sponsored by the Internet Archive. My social approach to open libraries is quite different. It is much more in line with what OSS is about. OSS is really not a project that one organization runs. Rather, it is a very large set of small-scale projects, many of which achieve great things because they are compatible with others.

3. A look at OSS from a social perspective

A bit of history helps. Since the 1980s, Richard M. Stallman has called for GNU. GNU stands for "GNU's Not Unix". It is a free replacement for UNIX. I guess that in the early days, most people were skeptical that this could be undertaken. And it is still not quite complete. But the spirit of GNU lives on, and it lives stronger than most people ever expected. Free operating systems for computers are a reality. They may not be called GNU systems, but they are free in the sense that Stallman envisioned.

Is there a lesson here for information professionals? Can information professionals create and maintain open libraries, just like computer professionals create and maintain freely available software? Clearly, if they want to, some insight into how OSS comes about will be useful for understanding how open libraries can come about.
Computer software uses highly structured information in its source code. On the level of an entire computer, this large amount of highly structured information reaches an extraordinary level of complexity. Let me illustrate this with a simple example. My operating system of choice is Debian GNU/Linux. The system consists of a set of packages. When I looked at it in May 2008, there were 22456 packages available in my (typical) installation. Each package provides a certain functionality. When I want to add a functionality to my computer, I add a package, say package A. But packages are not independent. More often than not, when I add a new package, I am told that there is a bunch of other packages I have to install as well, because without these my desired package A will not run. And there is also a bunch of other packages that are suggested by package A. I am told that when I run package A, I may also want to install packages B and C, which are just friends of package A. Actually, on a technical level things are even more complicated. Each package comes with a version number. Package A version 1 may require package B version 2 or higher. It may be incompatible with version 2.1 of package B. And so on. You get the idea.

Before I bore you with more technical details, let's move away from technology and look at people. Let's look at the people who package the software. Let us call them packagers. Most packagers are not the authors of the software they package. They know the software well and they know the operating system well. They work at the interface between the software and the operating system. They take the software as an input and contribute to the operating system. In doing that, they may change some aspects of the software. They can do this because the software can be changed. Thus it can be adapted to the requirements of the operating system.
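To make the version constraints between packages more concrete, here is a minimal sketch in Python. It is not how Debian's own tools work. The class Package, the function resolve, and all package names and version numbers are made up for illustration. The sketch only shows the idea that installing one package pulls in what it depends on, checks minimum versions and incompatible versions, and leaves merely suggested packages alone.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    name: str
    version: tuple                                  # e.g. (2, 1) for version 2.1
    depends: dict = field(default_factory=dict)     # name -> minimum version
    conflicts: dict = field(default_factory=dict)   # name -> set of excluded versions
    suggests: list = field(default_factory=list)    # optional "friends", never forced


def resolve(name, available, chosen=None):
    """Return a dict of all packages that must be installed to run `name`."""
    if chosen is None:
        chosen = {}
    if name in chosen:  # already handled; this also guards against cycles
        return chosen
    if name not in available:
        raise LookupError(f"package {name} is not available")
    package = available[name]
    chosen[name] = package
    for dep_name, minimum in package.depends.items():
        if dep_name not in available:
            raise LookupError(f"{name} needs {dep_name}, which is not available")
        dep = available[dep_name]
        if dep.version < minimum:
            raise ValueError(f"{name} needs {dep_name} >= {minimum}")
        if dep.version in package.conflicts.get(dep_name, set()):
            raise ValueError(f"{name} is incompatible with {dep_name} {dep.version}")
        resolve(dep_name, available, chosen)  # dependencies have dependencies too
    return chosen


# Hypothetical data: A 1.0 requires B 2.0 or higher, cannot work with B 2.1,
# and merely suggests C. B in turn requires D.
available = {
    "A": Package("A", (1, 0), depends={"B": (2, 0)},
                 conflicts={"B": {(2, 1)}}, suggests=["C"]),
    "B": Package("B", (2, 2), depends={"D": (1, 0)}),
    "C": Package("C", (1, 0)),
    "D": Package("D", (1, 3)),
}

to_install = sorted(resolve("A", available))
```

With this made-up data, asking for A also brings in B and D, while the suggested package C is left out. If the available version of B were 2.1, resolve would refuse to install A.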
The operating system's requirements stem from the fact that packages have to work together, and that users may change aspects of the software but still want it to update gracefully when the latest and greatest version comes out.

How can all of this be made available at no cost? The key is reuse. All pieces of software rely on libraries. Yes, geeks use libraries too. But their libraries are actually computer files. These special files contain compiled pieces of code that have already been written by somebody else. When I write software for my digital library systems, I use a language called Perl. The code that I write in Perl is a simple text file. But I am not writing all my software from scratch using the commands that Perl provides. Instead I use a system of Perl code called modules. They contain Perl code that has already been written by somebody else to perform common tasks. These modules form an open library of code. So this is one way we have reuse. In a similar way, Perl itself is written in a language called C, immortalized in the Beatles song "Write in C". C code itself relies on C libraries to achieve common tasks such as showing a character on the screen. Maintainers of the Perl software reuse these libraries. In the same way, when you use a web site, the web server, most likely Apache, will answer your requests. The web server software is also written in C, and uses the very same libraries to achieve common tasks, for example reading a file.

Reuse is a technical feature, but it is closely related to the social organization of OSS. OSS writers usually maintain just a few pieces, sometimes just one piece, of the extraordinarily complex edifice that is the software that runs on a computer. But the important thing is that they work in a way that is compatible with others.

Now, some readers may object that digital libraries contain objects that are produced with a profit motive in mind.
Such readers should remind themselves that I am writing about digital libraries as descriptions of objects. If these objects are freely available for use, all the better. Recall that OSS does not prevent its users from creating objects with a profit motive. You can use OSS to create a piece of poetry and charge people to download a copy of the poetry. Similarly, the open library as envisioned here works at the level of the description of the objects, which I call the central aspect of a library. And remind yourself that not all underlying objects in a library have been written with the purpose of making an immediate profit from sale. For example, I have not been paid to write this paper.

4. Examples of open digital libraries

Of course, I would not be boring you with all this theory if there were not already some examples of open libraries out there. I humbly think that I have created probably one of the very finest, the RePEc open library, see http://repec.org. I formally co-founded RePEc in 1997, but it really goes back to efforts I made in 1993 to collect information about scientific papers in economics and make it freely available. Today, RePEc is based on close to 900 RePEc archives. These furnish bibliographic information that can be harvested. User services are separate from these archives. They provide aggregates of the bibliographic data for end users. Thus the provision of archives and the provision of interfaces to the data in the archives are separated. Archive maintainers face a situation where they maintain data that is used in a variety of interfaces. The more widely the data is seen, the more interest they have in keeping it up to date. User services have a wide set of data to show. This attracts wide usage. But as a whole, the most important thing about RePEc is that it is sustainable without external subsidy. RePEc has no budget, no officers, no meetings, and no formal decision making process. And it no longer relies on a single person to keep things together.
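The separation between archives and user services described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not RePEc's actual software or data format. The archives are plain in-memory lists rather than remote sites, and the archive codes, record identifiers, titles and function names are all made up.

```python
def harvest(archives):
    """Collect the records of every archive into one list, noting where each came from."""
    records = []
    for archive_id, archive_records in archives.items():
        for record in archive_records:
            records.append({**record, "archive": archive_id})
    return records


def search_service(records, word):
    """One user service: find the records whose title contains a word."""
    word = word.lower()
    return [r["id"] for r in records if word in r["title"].lower()]


def archive_size_service(records):
    """Another user service, built on the same data: count the records per archive."""
    sizes = {}
    for r in records:
        sizes[r["archive"]] = sizes.get(r["archive"], 0) + 1
    return sizes


# Hypothetical archives, each maintained independently of any user service.
archives = {
    "aaa": [
        {"id": "aaa:0001", "title": "Open libraries in economics"},
        {"id": "aaa:0002", "title": "A note on working papers"},
    ],
    "bbb": [
        {"id": "bbb:0001", "title": "Measuring open access"},
    ],
}

records = harvest(archives)
hits = search_service(records, "open")
sizes = archive_size_service(records)
```

The point of the design is that neither service knows anything about how an archive is maintained, and no archive maintainer has to know which services exist. Both sides only agree on the shape of a record.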
How did RePEc come about? If you start with empty archives, you have empty user interfaces. Who will start the first archive, and who will build the first user interface? Well, I did a lot of that in the first year, 1993. In 1994 Jose Manuel Barrueco Cruz joined me. By the time we got funding from the Joint Information Systems Committee of the United Kingdom's higher education funding bodies, in 1995, we had already, I think, collected 5000 descriptions of online papers.

Although I can't prove this scientifically, just having a few people compile bibliographic data is not how things have really scaled up. In our case of an academic open library, the thing that really worked to bring in community involvement (yes, that is what is needed) was to build an author registration system. In RePEc's case, there is now a special interface that authors use to register the works that they have written and that are catalogued in the RePEc data. In 1998 I supervised a student by the name of Markus Johannes Richard Klink. We set this up in 1999. The system is now called the RePEc Author Service. Author registration is a piece of glue that provides me with a link between the things I have done in the past (RePEc and other projects that I am not talking about here), my plans for the future, and some of the current work that I do.

What's the big deal about author registration? Well, on its own, nothing much. Author registration has to be seen within the wider context of an academic digital library. Author registration allows us to put the papers of a person on a common page. This starts to look like part of a CV. But there are two crucial differences. First, instead of the researcher having to compile and format the bibliographical information on her own, here she can just say what papers she has written. Second, the publication list is reusable across further services. The key is reusability. The CV is not reusable, the author registration record is.
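Here is a small Python sketch of why a registration record is reusable where a CV is not. It does not show the actual format used by the RePEc Author Service. The identifiers, names, titles and function names are hypothetical. The record holds nothing but a list of identifiers of documents that the author has claimed, and the bibliographic data stays where it is.

```python
# Document descriptions, as they would come from the bibliographic data (made up).
documents = {
    "doc:1": {"title": "Open libraries", "year": 2008},
    "doc:2": {"title": "Author registration", "year": 2000},
    "doc:3": {"title": "Something else", "year": 2005},
}

# Registration records: the author only says which documents are hers.
authors = {
    "person:jane": {"name": "Jane Example", "claimed": ["doc:1", "doc:2"]},
    "person:john": {"name": "John Sample", "claimed": ["doc:2"]},
}


def publication_list(author_id):
    """First use: a CV-like page, newest documents first."""
    claimed = authors[author_id]["claimed"]
    docs = sorted((documents[d] for d in claimed),
                  key=lambda d: d["year"], reverse=True)
    return [f'{d["title"]} ({d["year"]})' for d in docs]


def registered_authors(document_id):
    """Second use of the same records: who wrote a given document?"""
    return sorted(a["name"] for a in authors.values()
                  if document_id in a["claimed"])


jane_page = publication_list("person:jane")
doc2_authors = registered_authors("doc:2")
```

Both functions read the same records. A formatted CV would support the first use only, and somebody would have to retype it for the second.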
Readers may object that a CV is not just a list of document descriptions that a person has written. Author registration allows for a battery of author rankings calculated from various measures of usage. Typically such measures involve the number of abstract views and the number of full-text downloads. Interestingly, although such usage is distributed across RePEc services, the services collaborate to build a cross-service usage database. This dataset is called LogEc. It was built by Sune Karlsson. He also maintains it. Other measures of usage include citations. There is a citations database that is part of RePEc. It is the CitEc system built and maintained by Jose Manuel Barrueco Cruz. The gathering of usage records, and the linking of the usage records with

5. Reasons to be cheerful

In recent years a lot of progress has been made in the technologies that enable the reuse of textual data. First, XML has established itself as a lingua franca of textual data. There is a large set of technologies that surround XML. One important example is XML Namespaces. Using namespaces, XML allows for the mixing of elements from different namespaces. In simple terms, it allows you to use different vocabularies of terms in a single document. This is useful when uniting sets of metadata. For example, Peter and Paul can describe a bunch of objects, each using a different set of properties.

Here is an example from my own work. Most of my data collection uses a format called AMF. Admittedly, I did much of its design myself in 2000. It has been very useful. It allows me to describe objects of my interest, in the scholarly communications domain, in the way I want to describe them. But sometimes I have some special needs that I can't describe in AMF without extending it in ways that are only useful to solve a particular issue that I have with a specific system. Well, AMF is built as an open format. It accepts foreign vocabularies from other namespaces pretty much anywhere.
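The mixing of vocabularies can be shown with a short Python sketch using the standard xml.etree.ElementTree module. The two namespace URIs and the element names are invented for the Peter and Paul example. They are not taken from AMF or from any real metadata format.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, one per vocabulary.
PETER = "http://example.org/peter"
PAUL = "http://example.org/paul"

# One record that mixes both vocabularies. Each has its own "title" element,
# and the namespaces keep the two apart.
document = f"""<record xmlns:peter="{PETER}" xmlns:paul="{PAUL}">
  <peter:title>An example paper</peter:title>
  <paul:title>Example paper, An</paul:title>
  <paul:shelf>QA76</paul:shelf>
</record>"""

root = ET.fromstring(document)

# ElementTree addresses an element as "{namespace URI}local name".
peter_title = root.find(f"{{{PETER}}}title").text
paul_title = root.find(f"{{{PAUL}}}title").text
tags = [child.tag for child in root]
```

A program that only knows Peter's vocabulary can pick out Peter's elements and ignore the rest. That is what makes it safe to let foreign vocabularies appear in a record.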
Another piece of good news is that the primary publishing business, the provision of contents, has also made great strides in the provision of open-access information in a loosely structured form. The best example is Wikipedia. Making Wikipedia a lot more structured is difficult because of the general level that it operates at. But the example illustrates well that free information can be of a very high quality indeed.

6. The obstacles

There are powerful obstacles to achieving free digital libraries. I am not sure which ones are the most important here, so I give a somewhat random list. They are technical incompetence, and two more theoretical issues, the myth of industry and the myth of the full text.

Technical incompetence is a major limit. Unicode, XML and its related technologies such as XML Schema and XSLT, XHTML, CSS, SQL, OAI-PMH and OAI-ORE, operating system skills, basic knowledge of networking, and above all, knowledge of a scripting language such as Perl or PHP: it all adds up to a large body of knowledge. While it is not required that every digital library builder has a deep knowledge of each of these areas, a deep understanding of at least a few of them, as well as programming skills, is required. Without this prequalification, you are not operational. Usually none of this is taught in library school. For years, I have been battling to introduce at least a small part of this to the curriculum, but with limited success. As a result, the average library school graduate has virtually no chance to get involved in digital library building. One may argue that this work can be left to technical staff and that the library staff only need to design the system. However, this is a fallacy. Without having studied the technical underpinnings, library staff lack the analytical reasoning skills that are required to even get started with the design of new systems. As a result, innovation in libraries is stifled. There is a tendency to contract out.
As I wrote in a mailing list recently, "libraries are outsourcing to their death".

Another problem is more difficult to understand. It is something I have called "the myth of industry". It is the tendency of people involved in digital library work to keep the work for themselves and limit interoperability. Their idea is that they have built the data, and therefore they want to keep tight control over its usage. Therefore you have to ask them to get a copy of the data, and more often than not, the answer is no. For example, it has not been possible for me to get a copy of the data of the Astrophysical Data Service to use in my AuthorClaim registration service. What industry mythers do not understand is that by giving the structured data away freely, they encourage reuse of the data. It means that their own contributors have better incentives to contribute data, since it is more widely used. That is, by giving away their structured data, open access publishers and digital library builders add value to their collections, rather than reducing value. But this point is not widely understood.

Finally, there is the worship of the full text. It is an inability to reason beyond end users reaching the full text. Many people place the full text at the center of their collection development. If the full text is really a textual object, and if it is freely available as it should be, it can quite easily be indexed by a full-text engine, say Google. Thus textual metadata attached (in some form) to the full text is not really important, since the same textual string can also be found in the full text. So people put documents on a web site, have them indexed by Google, and say that's it, I am done. It's ok if this is an announcement of your next birthday party. But it appears insufficient when we deal with important documents such as scientific papers, legal documentation, technical documentation or works of art. In these fields we are typically not only interested in getting access to the full text.
In fact we are also interested in the links between these objects. In patents, we are interested in citation links between patents, who applied, whether the patent was approved, etc. In academic work, we need to know who the author was. In preservation, we need to know what general class a full-text object belongs to, so that we can reach a decision about whether and how to preserve it. All these concerns require registries. And these registries have to be compiled, partly at least, by hand. This is the job of digital librarians. Registries make a digital library more than just a collection of computer files available via HTTP. But work on registries cannot progress unless more people realize their importance. In digital libraries, the full text should not be considered to be of central importance. Rather, it should be an attribute, hopefully in the structure of an OAI-ORE record that describes a digital object worthy of interest.

I am not going to give you examples of how these obstacles impact current projects. But I invite you to think about, for example, the National Science Digital Library within the context of these obstacles.

7. Conclusions

The most important trend libraries are facing is the increase of free access information resources. Nowhere is this more visible than on the web. More and more serious information is being made available for free on web sites. Project Gutenberg was an early starter. Many newspapers, for example, have been building web sites and offer much of their contents for free on these sites. Many institutions offer important information about themselves on web sites. Encyclopedic knowledge is more widely available than ever thanks to Wikipedia. The fact that we now have freely available computer operating systems is obviously a part of the attention economy trend.