Networks in digital libraries, a personal view

Thomas Krichel
http://openlib.org/home/krichel
krichel@openlib.org
Palmer School of Library and Information Science
College of Information and Computer Science
CW Post Campus of Long Island University
720 Northern Boulevard
Brookville 11548-1300
USA
Division of Information Systems
Faculty of Information Technology
Novosibirsk State University
2, Pirogov Street
630090 Novosibirsk
Russia
Network research has been around for some time now. I was introduced to this from my math books at school. There I saw a little schematic map of the town of Königsberg, now named Kaliningrad. The town is located in East Prussia and currently forms an enclave of Russia. The town was built on the river Pregel. There were seven bridges going to and from a couple of islands. The task set was to devise a walk that would start and end at the same place but would visit every bridge just once. No solution was given. I vividly remember my frustration in trying to find the answer to this problem but there was no Internet library to aid my studies. However, 30 years later, with the emergence of the Internet, I revisited this problem. It turns out that there is no mathematical solution to this question. Ever since, I have been cursing the writer of this textbook for wasting my time. But there are some mitigating circumstances. First, the person who showed that there was no solution, in 1738, was Leonard Euler, not exactly a mathematician to sneeze at. Second, the problem is considered to be the birthplace of graph theory. You have pieces of territories connected through bridges. This is just like having nodes connected by edges.
I became interested in the idea of graphs as a tool to produce value-added services in digital libraries. Surely a distinguishing feature of digital libraries, as opposed just a pile of files on a disk somewhere, is that each element of the digital library has been uniquely identified, and that there are some relationships between them. Thus, every document is a node. The relationship that it has to another node implies that there is an edge between nodes. The classic example comes if you think of your library as containing scientific papers, and the relationships as citations. One paper cites the other. This is an asymmetric relationship, because in most cases if paper “A” cites paper “B”, then paper “B” does not cite paper “A”. Citation graphs can be used to find related papers, since related papers will tend to cite the same paper. This is particularly powerful because it cuts across language barriers. For example, two papers, one in Spanish and another one in English, can be found to be related even though they may not share the same terms at all. Free citation data is scarce at this time. But anybody who is interested can have a look at Jose Manuel Barrueco Cruz’s CitEc dataset at http://citec.repec.org, which does have a good free citation dataset of the RePEc digital library.
One feature that I have worked on is the structure of co-authorship in the RePEc digital library. The library has an author registration system. Authors contact that system to state which papers, as catalogued in the library, they have written. If two authors claim to have written the same paper then they are co-authors. Co-authorship then sets up a graph between people. This is something that we can call a network. The graph is symmetric, because if I am somebody’s co-author then they are my co-author. Nowadays co-authorship is very common. Thus, as more and more authors register, we can build a network of co-authorship across all authors in the discipline. Authors are linked through co-authorship links. These links can be short. I have a co-author “Joe”, so the path between me and Joe is of length one. Joe has a co-author “Jane”. I have neither had the pleasure nor the ability to write a paper with Jane, so my path to Jane is of length 2. All authors who can be reached by paths between them form a co-authorship community. Typically, the largest such community covers more than 50 percent of all authors, often 80 or 90 percent. For each author in the community, we can then observe how they are linked to other. As the simplest measure, we can calculate the average length between the author and all others. If that average length is small, then we can say that the author is central to the community. If that average length is large, we can say that the author is marginal. This gives us a way to rank authors in the community. Why do this? If an author wants to improve their ranking, then they can write more papers with more co-authors. This is not something that can be accomplished immediately as it takes time and effort to write papers. A cheaper and easier alternative is for the author to make sure that all of their co-authors are registered and that their list of papers is up to date. Thus the centrality ranking can be a tool to get others involved. If we want to set up such a system in practice, we find that it is not trivial to do so once the number of authors reaches a few thousand. For each author we need to find the shortest number of paths between themselves and their remaining co-authors. So, if there are 10000 authors we need to calculate 49995000 shortest paths. We cannot simply do this only once because new authors may register at any time and will appear in the community of connected authors, necessitating path recalculation. I have designed software to be able to accomplish this task, but I do not have a dedicated machine for running this service. So what is holding back network usage in digital libraries, be it networks with documents or networks with people? Well first it is a fundamental matter. We really need a stable identification of items in the digital library. This is something that is still quite hard to achieve, because the evaluation of sameness cannot be left to a computer and it has to be lead by humans. We need an organized collection description, hopefully something that is freely available. These are issues that I have written about in other papers, most recently in “From open source to open libraries”, available at http://openlib.org/home/krichel/papers/kuyus.html. Second, as I have pointed out here, we need better computational resources. While computers are improving, it is still hard labor to go through the items in a digital library. Since RePEc now has about 600000 items, we are talking about a huge amount of calculations that would be required to produce centrality calculations when full citation information for all is available. Finally, we have the problem of visualization. Just a centrality table is not something that will get us a lot of attention. We really need to have pictures that communicate stories, show clusters of nodes that belong together and show peaks and troughs. We are still quite far from intensive usage on networks in digital libraries. A lot remains to be done.

Valid XHTML 1.0!