Metz paper

by Thomas Krichel

version of 2007-03-19

Introduction

This is the Metz paper. Its first draft was written by Thomas Krichel on 2006-09-04.

The Metz paper describes a software system to be known as icanis. icanis stands for instantaneous co-authorship network information system.

Icanis is a software system that can maintain and visualize social network data while the underlying network data is changing.

The application area of icanis is co-authorship data. Thomas Krichel thinks that there is no precedent for a service that can visualize a large (1000+ nodes) co-authorship network that changes over time. Studies of such networks are limited either to one static instance or to the comparison of the state of the network at a few points in time.

The focus of icanis, at this stage, is not the examination of a network's historic development. Instead, it looks at the network at the current moment in time.

Icanis does not aim to give an accurate portrait of the network at any time. Instead, a best effort is made to give a reasonably accurate picture of the network at all times. The effort depends on the computational resources that are available. The effort is regulated by crontab entries. The more often jobs are run, the more accurate the network description will be. The setting of scheduled jobs is out of scope for icanis itself.
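
As an illustration only (the exact entries are out of scope, and the script names and frequencies below are assumptions based on the feeder scripts described later in this paper), a crontab for the icanis user could look like:

```
# hypothetical crontab entries; paths and frequencies are assumptions
*/15 * * * * /home/icanis/icanis/perl/path_table_feed ras binary
5    * * * * /home/icanis/icanis/perl/nodes_table_feed ras binary
```

More frequent runs give a more accurate network description, at the cost of computational resources.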

This design is made with the possibility of splitting the production of the results and the presentation of the results in mind.

Generalities and limitations

The UTF-8 character encoding is used throughout.

Icanis software is written in Perl and as XSLT files. The whole system is made available under the terms of the GNU General Public License.

Icanis uses the MySQL database system and the libxslt XSLT processor.

Icanis is limited to symmetric network types. This implies that the paths from any node a to a node b are the same as the paths from node b to node a. But the network types do not need to be binary.

Any network type has a type identifier henceforth referred to as nettype. "binary" is an example of a value that the nettype can take. In the implementation, it is the only one that is used.

Any network data source has an identifier henceforth referred to as source. "ras", standing for the RePEc Author Service, is an example of a value that the source can take. The implementation uses this value for this particular source.

A single icanis installation can handle a range of networks types and data sources. But icanis is not designed to deal with a lot of them. There is no systematic management of sources and nettypes.

Any node has an identifier henceforth referred to as its "handle".

Time stamps are represented by the Unix epoch, i.e. the number of seconds since 1970-01-01. This is henceforth referred to as the "tist".

Some scripts may have to check, at startup time, whether another instance of themselves is running, because no two instances should run at the same time. Such scripts are called non-concurrent. If a non-concurrent script is started while another instance of the script is running, the starting script exits with an error message to STDOUT. [FIXME: Dmitri to check if there is a standard way to do this in Perl.] If there is no other instance running, the script opens a log file in the logging directory.
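
A common Perl idiom for such a check, offered here only as one possible answer to the FIXME above, is to take a non-blocking flock on a lock file; the lock file location is an assumption:

```perl
#!/usr/bin/perl
# sketch of a non-concurrency check via flock; the lock file path is an assumption
use strict;
use warnings;
use Fcntl qw(:flock);

my $lock_file = '/tmp/icanis_script.lock';   # hypothetical location
open(my $lock, '>', $lock_file) or die "cannot open $lock_file: $!";
unless (flock($lock, LOCK_EX | LOCK_NB)) {
    print "another instance is already running, exiting\n";
    exit 1;
}
# ... the script proper runs here; the lock is released when the script exits
```

The lock is released automatically when the process terminates, so a crashed script does not block later runs.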

Each script issues a message on start, saying that it started at a certain time. It ends with a message on when it ends. The time is reported with the normal output of the "date" command. It is therefore locale-dependent. Something like
print "started: ".`date`; and
print "ended: ".`date`;
will do it. The messages are printed to the script's log file.

Directory structure

Icanis assumes that there is a user icanis on a machine running the software. icanis itself runs in the subdirectory icanis of that user. This directory is henceforth referred to as home. home typically takes the value /home/icanis/icanis.

Script logs are stored in home/var/log. Each script logs its start time, end time and anything else it judges valuable in a file script_name_tist.log, where script_name is the name of the script and tist is the tist of the start time of the script. Scripts to clear the log directory when it is considered full are outside the scope of icanis.

Configuration files are stored in home/etc.

Perl scripts are stored in home/perl.

XSLT files are stored in home/xsl.

Input

Input data are stored as Perl structures in home/input/. There are two types of input data. They are called node inputs and adjacency inputs.

Node inputs are identified only by the network source. Thus they are stored in files named source_nodes_tist.dump.gz. As the file names suggest, they are gzipped Perl structures. Node inputs are hashes that have handles as keys, and hashes with keys "name" and "url" as values. Data for all nodes that belong to the giant component of all network types are stored. Other nodes may also be stored.

Adjacency input files are named source_nettype_tist.dump.gz. Thus, unlike node inputs, which are identified only by the source, adjacency inputs are identified by the network type that they belong to, as well as by the source. An implicit assumption is that the same set of nodes can be looked at in several networks. Adjacency input files contain hashes of hashes that document half the adjacency matrix. Keys are handles. Only hash values from a lower node handle to a higher node handle, in standard string comparison order, are given, to conserve disk space. Data for all nodes that belong to the giant component are stored. Other nodes must not be stored.

Component analysis has to be carried out prior to placing data in the input directories of an icanis installation. How this is done is outside the scope of this specification.

As an example, let there be a simple network source banja that has three nodes at time 1173213901. Then a file banja_nodes_1173213901.dump.gz contains, after decompression, a variable, say $n

$n->{di}->{name}='Dmitry Ishkov';
$n->{di}->{url}='http://www.banja.net/di';
$n->{tk}->{name}='Thomas Krichel';
$n->{tk}->{url}='http://www.banja.net/tk';
$n->{vs}->{name}='Vladimir Selivanov';
$n->{vs}->{url}='http://www.banja.net/vs';

Now assume that the network links node di to node tk, and node tk to node vs, at a hypothetical tist 1173219390. The network type is binary. Then banja_binary_1173219390.dump.gz, after decompression, contains a variable, say $a, with

$a->{di}->{tk}=1;
$a->{tk}->{vs}=1;

This completes the example.
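
The .dump.gz files can be produced and read back with Data::Dumper and gzip. This is only a sketch of one way to do it; the exact serialization mechanism is not fixed by this specification:

```perl
#!/usr/bin/perl
# sketch: write and re-read a node input file such as banja_nodes_1173213901.dump.gz
use strict;
use warnings;
use Data::Dumper;

my $n = { di => { name => 'Dmitry Ishkov',  url => 'http://www.banja.net/di' },
          tk => { name => 'Thomas Krichel', url => 'http://www.banja.net/tk' } };

# write the gzipped Perl structure through a pipe to gzip
open(my $out, '|-', 'gzip > banja_nodes_1173213901.dump.gz') or die $!;
print $out Dumper($n);     # Dumper emits code of the form: $VAR1 = { ... };
close $out or die $!;

# read it back by eval-ing the dumped code
my $code = `gzip -dc banja_nodes_1173213901.dump.gz`;
my $VAR1;
eval $code;
print $VAR1->{tk}->{name}, "\n";   # prints: Thomas Krichel
```

Storable would be a faster alternative, but Data::Dumper output has the advantage of being human-readable for debugging.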

Shortest path output form

The system aims to compute all shortest paths between any couple of nodes. We call the shortpath set from node A to node B the set of all shortest paths from A to B. Symmetry implies that the shortpath set from A to B is the same as the shortpath set from B to A.

Each shortpath set has a handle that is composed out of the handle of the starting node and the handle of the ending node, separated by the tab character.

Within a shortpath set, the path strings are composed of the sequence of intermediary node handles, separated by tab characters.

Within each shortpath set, path strings are numbered by standard string comparison order, starting with 1. The number of the path in that order will be referred to as the inset path number.
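
As a sketch of these conventions, consider a hypothetical shortpath set between nodes "aa" and "zz" with two shortest paths, one through "mm" and one through "kk" and "pp":

```perl
#!/usr/bin/perl
# sketch: shortpath set handle and inset path numbers, as defined above
use strict;
use warnings;

my ($start, $end) = ('aa', 'zz');
my $set_handle = join("\t", $start, $end);   # handle of the shortpath set

# path strings: intermediary node handles joined by tab characters
my @paths = ( join("\t", 'kk', 'pp'), join("\t", 'mm') );

# inset path numbers follow standard string comparison order, starting at 1
my @sorted = sort @paths;
for my $i (0 .. $#sorted) {
    printf "%d: %s\n", $i + 1, $sorted[$i];
}
```

Here the path through "kk" and "pp" gets inset path number 1, since "kk" sorts before "mm".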

For the storage of shortpaths in a database, the following column structure would seem appropriate

  1. shortpath set handle
  2. inset path number
  3. the number of intermediate nodes
  4. the path string

This table is the path table of a network. It is identified by source_nettypepaths. It is part of the icanis database.

A second table stores node data in the database. This table is only used for node searches. The structure is

  1. handle
  2. name
  3. url

This table is the nodes table of a network. It is identified by source_nettypenodes. It is part of the icanis database.
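
In MySQL, the two tables could look like the following sketch for source "ras" and nettype "binary"; the column types and key choices are assumptions, only the column order is fixed above:

```
-- sketch only; types and keys are assumptions
CREATE TABLE ras_binarypaths (
  set_handle   VARCHAR(255) NOT NULL,  -- shortpath set handle
  inset_number INT          NOT NULL,  -- inset path number
  inter_count  INT          NOT NULL,  -- number of intermediate nodes
  path_string  TEXT         NOT NULL,  -- the path string
  PRIMARY KEY (set_handle, inset_number)
);

CREATE TABLE ras_binarynodes (
  handle VARCHAR(255) NOT NULL PRIMARY KEY,
  name   VARCHAR(255) NOT NULL,        -- searched by the search script
  url    VARCHAR(255)
);
```

An index on the name column of the nodes table would help the search script, since that table is only used for node searches.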

Maintenance of the path table is ensured by the path table feeder script path_table_feed. This script lives in home/perl. It takes the source and nettype as parameters. When the script is fired up, it examines the most recent version of the input data. It compares its tist to the tist of the last version for which an update has been made. That version is saved in a file home/var/update/paths_source_nettype.tist. This file contains just the tist and the newline char, that's all. The tist is the one of the input data used, not the one of the generation of the summary. This script is a non-concurrent script.

Another script for the maintenance is path_table_feed_node. This script lives in home/perl. It takes the source and nettype as parameters, as well as the node to be updated and the parameter "deep" or "shallow". If the node parameter is not given, the script feeds paths for a random node. This script may run concurrently.

Maintenance of the nodes table is ensured by the nodes table feeder script nodes_table_feed. This script lives in home/perl. It takes the source and nettype as parameters. When the script is fired up, it examines the most recent version of the input data. It compares its tist to the tist of the last version for which an update has been made. That version is saved in a file home/var/update/nodes_source_nettype.tist. This file contains just the tist and the newline char, that's all. The tist is the one of the input data used, not the one of the generation of the summary. This script is a non-concurrent script.

The user name for the database is icanis, and the password is stored as a string in /home/icanis/etc/mysql_passwd. That file is readable and writable by the user icanis only. Note that it lives outside the normal icanis directory tree. It may be required that it be read by the user who runs the web server.

Shortest path update calculations through path_table_feed

When the adjacency matrix is updated, path_table_feed calculates the difference between the matrix it holds at the previous tist and the matrix at the current tist. It then carries out a minimal set of operations for the update.

There are 6 types of incidents that the edge matrix can undergo.

  1. "a+" A new node appears.
  2. "a-" An existing node disappears.
  3. "e+" An node has a new edge to another.
  4. "e-" An node looses an edge to another.
  5. "l+" The edge of an node to another becomes shorter.
  6. "l-" The edge of an node to another becomes longer.

Quite naturally, one would like to update the minimum amount in order to arrive at a speedy update of the path table.

If a node has only one edge to any other node (its corresponding node, say), it is a marginal node. Since there are a large number of marginal nodes, roughly 1/3 in the RAS network, such nodes require special treatment.

Work for a marginal node is quite simple. In event a+, new paths are added to the pathbase that correspond to a replication of the paths of the corresponding node, augmented by the initial node. In event a-, these paths are removed from the pathbase. In events "l+" and "l-", only the lengths of these paths are recalculated by the difference in lengths.
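
The a+ replication for a marginal node can be sketched as follows, assuming paths are kept per end node as tab-separated intermediary strings; the in-memory layout of the pathbase is an assumption:

```perl
#!/usr/bin/perl
# sketch: a+ for a marginal node xx whose corresponding node is tk;
# the layout of the pathbase is an assumption
use strict;
use warnings;

# paths from the corresponding node tk to each end node, as path strings
# of tab-separated intermediaries ('' means a direct edge)
my %paths_of_corr = ( vs => [ '' ], di => [ '' ], aa => [ "di" ] );

my ($new, $corr) = ('xx', 'tk');

# replicate the corresponding node's paths, augmented by that node
my %paths_of_new;
for my $end (keys %paths_of_corr) {
    for my $p (@{ $paths_of_corr{$end} }) {
        push @{ $paths_of_new{$end} },
            $p eq '' ? $corr : join("\t", $corr, $p);
    }
}
# plus the direct edge from the new node to its corresponding node
$paths_of_new{$corr} = [ '' ];
```

The a- case is the exact reverse: the same replicated paths are deleted from the pathbase.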

In event "e-" for a non-marginal node, if that node passes to marginal state, all paths that go through the author have to be deleted. Therefore it appears useful to keep a set of paths that go through a node, for each node. How this is to be implemented remains to be seen.

In the algorithms that Thomas Krichel is aware of, paths are calculated from a certain start node to all end points that are reachable. Within the context of icanis, which only studies a single component, that means all nodes. [FIXME: is this correct?]

We assume in the following that we cannot calculate, in an isolated fashion, individual paths between an arbitrary start node and an arbitrary end node. Thus the difference calculations should lead to a minimum set of nodes for which the entire set of paths will be updated. [FIXME: is this correct?]

It is useful to define a full update of a node. Updating a node means calculating all paths between the node and all other nodes, as well as updating the paths for this node, both shallow and deep.

A shallow update of the paths of a node (the start node) involves checking the paths from the start node to all other nodes, the end nodes. If the common length of the shortpaths in the updated calculation is shorter, the shortpaths in the pathbase are replaced with the ones in the calculated set.

In a deep update, icanis updates not only the paths between two nodes but also all sub-shortpaths between other nodes that lie on the paths between the two nodes. Deep updates only make sense for paths where shallow updating found a change in length. Thus an iterative algorithm can find all the paths that need a shallow update.

In incident a+, if the node is not marginal, in incident e+, since the node is then no longer marginal, as well as in incident "l+", icanis makes a full update for that node.

In incident a-, if the node is not marginal, we have to find all the paths where this node is an intermediary. This is expensive since, in principle, it requires checking all paths to see where the deleted node is an intermediary. A path update for all adjacent nodes will not do us any good. At the time of writing, there is no good strategy to deal with this quickly. But one would expect this event to occur only rarely.

Shortpath Summaries

We call a shortpath summary any kind of calculation that can be performed on the shortpath database and that gives a ranking of nodes.

A summary is characterized by a criterion name crit_name. The Perl script that implements it lives in home/perl/summarize_crit_name. The script takes the source as a required parameter. The script always uses the current version of the shortpath set.

If more than one node achieves a common value of the criterion, these nodes are called equi-placed. Their common rank is equal to the sum of the ranks they would achieve if they had been placed in arbitrary order, divided by the number of equi-placed nodes.
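
This equi-placed rule is the usual mid-rank convention. A sketch in Perl, where %v maps handles to criterion values and higher values rank first (an assumption; closeness and betweenness both fit this):

```perl
#!/usr/bin/perl
# sketch: mid-ranks for equi-placed nodes; higher value ranks first (an assumption)
use strict;
use warnings;

my %v = ( a => 3, b => 5, c => 3, d => 1 );
my @order = sort { $v{$b} <=> $v{$a} or $a cmp $b } keys %v;

my %r;
my $i = 0;
while ($i < @order) {
    my $j = $i;
    $j++ while $j < $#order && $v{$order[$j + 1]} == $v{$order[$i]};
    # positions $i+1 .. $j+1 are shared; their average is the common rank
    my $rank = (($i + 1) + ($j + 1)) / 2;
    $r{$order[$_]} = $rank for $i .. $j;
    $i = $j + 1;
}
print "$_ $r{$_}\n" for @order;   # b 1, a 2.5, c 2.5, d 4
```

Here nodes a and c share the value 3, so both receive rank (2+3)/2 = 2.5.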

The summary script outputs a Perl structure of the form $_->{handle}->{v} and $_->{handle}->{r}, where the "v" hash holds the value and the "r" hash holds the rank.

The summary structures live in home/summary/source_crit_name.dump. We note that there is no tist involved. The calculation is conducted on the pathbase at the time when the calculation is performed. There is no requirement to keep historic data of summaries.

There are two summary scripts that are part of an icanis installation. One has the crit_name "betweenness". The other has the crit_name "closeness". They rank nodes according to closeness and betweenness, as commonly understood in networks. Other path summary scripts may be installed.

Web Interface

All HTML files are written as XHTML 1.0 strict. They use CSS level 2.1 to add style, but within reason.

The web service files live in two directories, home/html and home/xsl.

Let base_url be the base URL of the service, essentially the domain name at which the service lives. This value is not used in icanis. All URLs are relative to a source and a nettype. They live in URLs starting with base_url/source/nettype/. It is this value that we will refer to as url in what follows.

As an example, if the RAS source is used for a binary network, then the url would be something like http://collec.openlib.org/ras/binary. How several top levels for several sources and nettypes are combined is out of scope for icanis.

All HTML pages refer to a CSS file at the href "/source/nettype/icanis.css".

With each page, there is a source and a nettype associated, by virtue of its url. In addition, each page has a generation type and a page type.

There are three generation types in icanis: static, updated, and on-the-fly.

In order to keep the system configurable to a maximum, the generation of updated and on-the-fly pages involves the creation, by icanis, of an XML data structure that we call here a nucleus. The nucleus is then transformed with an XSLT file. This specification addresses the structure of the nucleus, as well as the location of the XSL file. The XSL file itself is not part of icanis. Each icanis installation will have to maintain its own XSL files, for each source and nettype maintained.

The node page type

Node pages are updated pages. The page type "node" lives at url/node/s_handle.html, where s_handle is a URI-encoded safe version of the handle that properly encodes special characters that the handles may contain. This is done by storing
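
One plausible way to form s_handle, assuming standard percent-encoding of everything outside a safe character set (the exact escaping scheme is not fixed by this specification):

```perl
#!/usr/bin/perl
# sketch: form a URL- and filesystem-safe handle by percent-encoding;
# the safe character set chosen here is an assumption
use strict;
use warnings;

sub s_handle {
    my ($handle) = @_;
    $handle =~ s/([^A-Za-z0-9_.-])/sprintf("%%%02X", ord($1))/ge;
    return $handle;
}

print s_handle('pkr/1'), "\n";   # prints: pkr%2F1
```

The CPAN module URI::Escape provides the same functionality if an external dependency is acceptable.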

The node pages are generated by the script make_node_pages. It requires the parameters source and nettype. It takes the optional parameter handle. If that parameter is set, only the node with the handle handle is updated. The script lives in home/perl. The page for a node with handle handle is stored in home/html/node/s_handle.html. The pages

For each handle, the following nucleus XML is created by the script

<node>
 <own_data>
  <name>name</name>
  <homepage>url</homepage>
  <handle>handle</handle>
 </own_data>
 <criteria>
  <!-- for each criterion -->
  <criterion name="crit_name">
   <position>number</position>
   <value>value</value>
  </criterion>
 </criteria>
 <neighbors>
 <!-- for each neighbor, sorted by ascending edge length  -->
  <neighbor>
   <name>name</name>
   <homepage>url</homepage>
   <handle>handle</handle>
   <edge_length>value</edge_length>
   </neighbor>
 </neighbors>
</node>

The nucleus for each node is transformed with an XSLT stylesheet. The XSLT stylesheet lives at home/xsl/source/nettype/node_pages.xsl.
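
A minimal node_pages.xsl could start along the lines of the following sketch. It is not part of icanis; each installation supplies its own, and everything here beyond the nucleus element names is an assumption:

```
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml">
 <xsl:output method="xml" encoding="UTF-8"/>
 <xsl:template match="/node">
  <html>
   <head>
    <link rel="stylesheet" type="text/css" href="/source/nettype/icanis.css"/>
    <title><xsl:value-of select="own_data/name"/></title>
   </head>
   <body>
    <h1><xsl:value-of select="own_data/name"/></h1>
    <ul>
     <!-- neighbors arrive sorted by ascending edge length -->
     <xsl:for-each select="neighbors/neighbor">
      <li><a href="{homepage}"><xsl:value-of select="name"/></a></li>
     </xsl:for-each>
    </ul>
   </body>
  </html>
 </xsl:template>
</xsl:stylesheet>
```

A real stylesheet would also emit the XHTML 1.0 strict doctype and render the criteria element.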

The criterion pages

Criterion pages illustrate the position of nodes with respect to a criterion crit_name. They are generated with the make_criterion_pages script. The first and second parameters are source and nettype. The third parameter is the crit_name, which can take the values betweenness and closeness. The fourth parameter is the page size, in terms of the number of nodes to be shown on a page. The resulting URLs live at url/rank/crit_name_start_end.html, where start is the start number, and end is the end number of the nodes shown. Icanis adds leading 0s to start and end to get file names of the same length. In addition, the script creates two symbolic links. One is start.html, pointing to the first criterion page. The other is end.html, pointing to the last criterion page.

The criterion pages live in home/html/source/nettype/rankcrit_name/.

The script lives in home/perl. It builds a nucleus

<criterion name="crit_name">
 <nodes start="handle" end="handle">
  <!-- for each node from start to end -->
  <node>
   <name>name</name>
   <homepage>homepage</homepage>
   <handle>handle</handle>
   <rank>number</rank>
   <value>value</value>
  </node>
 </nodes>
 <!-- if there is a next page -->
 <next start="first node number" end="last node number"/>
 <!-- if there is a previous page -->
 <previous start="first node number" end="last node number"/>
</criterion>

The transforming XSLT lives at home/xsl/source_nettype_crit_name.xsl.

The criterion doc page

Each criterion has a documentation page. This is a static page. It lives at url/doc/crit_name.html. These pages explain the criteria.

The search page

There is one single search script. It runs as plain old CGI, to avoid complicating matters at the system administration level. For a later public service, we may want to run it differently. The page lives at url/bin/search.

If search is called without an argument, it builds a nucleus

<page/>

which is transformed with home/xsl/source_nettype_search.xsl. This builds a full search page, with two search boxes for the start and the end node.

If search is called with arguments h1=handle_1 and h2=handle_2 the nucleus formed is

<page>
 <paths>
  <!-- for each path in the shortpath set -->
  <path>
   <!-- for each node in the shortpath -->
   <node>
    <name>name</name>
    <homepage>url</homepage>
    <handle>handle</handle>
   </node>
  </path>
 </paths>
</page>

which is transformed with home/xsl/source_nettype_search_result.xsl. This builds a complete result page showing all paths from node handle_1 to node handle_2. If handle_1 is higher than handle_2 in standard string comparison order, the paths in the shortpath set have to be reversed.

If search is called with arguments q1=exp_1 and q2=exp_2, queries for node names or ids matching exp_1 and exp_2, respectively, are made.

If two unique answers are found, the nucleus is identical to the search results case. The search result is shown.

If no results are found for either exp_1 or exp_2, the nucleus is

<page>
 <result q1="query string"/>
 <result q2="query string"/>
</page>

which is transformed with home/xsl/source_nettype_search_page.xsl.

If no results for exp_2 but one unique result for exp_1 are found, the script forms the nucleus

<page>
 <result q1="query string">
  <node>
   <name>name</name>
   <homepage>url</homepage>
   <handle>handle</handle>
  </node>
 </result>
 <result q2="query string"/>
</page>

It is transformed with home/xsl/source_nettype_search_half_result.xsl.

If no results for exp_1 but multiple results for exp_2 are found, the script forms the nucleus

<page>
 <result q1="query string"/>
 <result q2="query string">
  <nodes>
   <!-- for each found result -->
   <node>
    <name>name</name>
    <homepage>url</homepage>
    <handle>handle</handle>
   </node>
  </nodes>
 </result>
</page>

It is transformed with home/xsl/source_nettype_search_half_result.xsl.

The search script may be called with an argument h1 replacing q1, or h2 replacing q2, respectively. The first case is more common. It occurs when search is called from a node page, where the user is looking for paths from the node described in the node page to a destination node. In the nucleus, h1 replaces q1. For example

<page>
 <result h1="handle">
  <node>
   <name>name</name>
   <homepage>url</homepage>
   <handle>handle</handle>
  </node>
 </result>
 <result q2="query string">
  <nodes>
   <!-- for each found result -->
   <node>
    <name>name</name>
    <homepage>url</homepage>
    <handle>handle</handle>
   </node>
  </nodes>
 </result>
</page>

It is transformed with home/xsl/source_nettype_search_half_result.xsl.

The service home page

This is a static page. It lives at url/index.html. It links to the start of criteria pages, and the search page.

Apache configuration

This section is for information only. It is not part of the specs. Let domain be the domain of the service. The apache configuration then is something like

NameVirtualHost domain
<VirtualHost domain>
 ServerAdmin webmaster@localhost
 DocumentRoot home/html
 <Directory />
  Options Indexes FollowSymLinks
  AllowOverride None
 </Directory>
 <Directory home/html>
  Options Indexes FollowSymLinks
  AllowOverride None
  Order allow,deny
  allow from all
  # This directive is useful when we have only one source and one nettype
  RedirectMatch ^/$ source/nettype/
 </Directory>
 Alias /icons home/html/icons
 <Directory home/html/icons>
  Options Indexes FollowSymLinks
  AllowOverride None
  Order allow,deny
  allow from all
 </Directory>
 #  for every nettype and source
 ScriptAlias source/nettype/bin home/perl
 <Directory "home/perl">
  AllowOverride None
  Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
  Order allow,deny
  Allow from all
 </Directory>
 LogLevel warn
 ErrorLog /var/log/apache2/domain_error.log
 CustomLog /var/log/apache2/domain_access.log combined
 ServerSignature On
</VirtualHost>

Implementation

A running implementation with data from the RePEc Author Service is part of the project. This implementation is code-named "CollEc", but this name may change later. This source is labelled "ras".

The nettype in the implementation is labeled "binary". It deals with binary relationships between nodes.

The web site will live at http://collec.openlib.org. This redirects to http://collec.openlib.org/ras/binary because this is what we will implement.

At this time, Thomas Krichel is not aware of an algorithm that can calculate all shortest paths for a non-binary network. We are only aware of the Dijkstra algorithm, but that calculates only one shortest path per pair of nodes. Therefore, at this time, no weighted algorithm is implemented.

Configuration time option

At configuration time, the following parameters need to be set.

Temporary: Interface testing

To test the interface, a prototype should be built. This uses a few static files. They should already be in the right places. Once this is done, Thomas can have an expert look at the system. A later interface test with a questionnaire should be prepared.

Missing stuff

There should be some way to do error handling when users hit non-existing paths.
