This is the Metz paper. It's first draft by Thomas Krichel on was written 2006-09-04. The following versions are available
The Metz paper describes a software system to be known as icanis. icanis stands for instantaneous co-authorship network information system.
Icanis is a software system that can maintain and visualize social network data while the underlying network data is changing.
The application area of icanis is co-authorship data. Thomas Krichel thinks that there is no precedence for a service that can visualize a large (1000+ nodes) co-authorship system that changes over time. Studies of such networks are limited either to one static instance or to the comparison of the state of the network at a few points in time.
The focus of icanis, at this stage, is not the examination of a network's historic development. Instead, it looks at the network at the current moment in time.
Icanis does not aim to give an accurate portrait of the network at any time. Instead, a best effort is made to give a reasonably accurate picture of the network at all times. The effort depends on the computational resources that are available. The effort is regulated by crontab entries. The more often jobs are run, the more accurate the network description will be. The setting of scheduled jobs is out of the scope for icanis itself.
This design is made with the possibility of splitting the production of the results and the presentation of the results in mind.
Icanis software is written in Perl and as XSLT files. The whole system is made available under the term of the GNU public licence.
The UTF-8 character encoding is used throughout.
Icanis uses the mySQL database system and the LibXSLT XSLT processor. Care must be taken that the tables are set up to work with UTF-8.
Icanis uses the LibXSLT XSLT processor.
Icanis is limited to symmetric network types. This implies that paths from any node a to a node b are the same as the paths from node b to a node a. But the network types do not need to be binary.
Any network type has a type identifier henceforth referred to as nettype. “binary” is an example of a value that the nettype can take. In the implementation, it is the only one that is used.
Any network data source has an identifier henceforth referred to as source. “ras”, standing for the RePEc Author Service, is an example of a value that the source can take. The implementation uses this value for this particular source.
A single icanis installation can handle a range of networks types and data sources. But icanis is not designed to deal with a lot of them. There is no systematic management of sources and nettypes.
Any node has an identifier henceforth referred to as its “handle”.
Time stamps are represented by the Unix epoch, i.e. the number of seconds since 1970–01–01. This is henceforth referred to as the “tist”.
Some scripts may have to check, at startup time, if there is another version of it running, because no two instance should be run at one time. Such scripts are called non-concurrent. If a non-concurrent script is started while another instance of the same script is running, the starting script exits with an error message to STDOUT.
 Some common functions, as well as settings are stored in
 home/perl/Icanis/Common.pm
handle_to_file(node,source,nettype)handle_to_file returns its argument. file_to_handle(node,source,nettype)handle_to_file(file_to_handle()) must be the
identity function. handle_to_url(node,source,nettype)handle_to_nodepage(node,source,nettype)/, i.e. it includes the
source and the nettype.
 For each network type nettype, a Perl module
 home/perl/Icanis/nettype.pm
 that is responsible for the calculations of shortest paths.  Thus
 icanis can be extended to a range of network types and computation
 algorithms.
 Icanis assumes that there is a user icanis on a machine running the
 software. icanis itself runs in the subdirectory icanis
 of that user. This directory is henceforth referred to as
 home.  home typically takes the value
 /home/icanis/icanis.
 Perl scripts and modules are stored in
 home/perl.
 Perl modules are stored in home/perl/Icanis.
 XSLT file are stored in home/xsl.
 File that don't need to be backed up are stored in
 home/opt, or in
 homehtml/opt.
 The directory
 home/opt/paths/source/nettype/handle_file/,
 where handle_file is the result of applying
 file_to_handle() to the handle of node node
 is called the path directory for the node node
 The directory
 home/html/source/nettype/
 is the web directory for the source source and the nettype
 nettype.
 Node data contains are stored as Perl structures in
 home/opt/input/. This is the input directory.
The log time is the output of the Unix command ‘date +%F_%T‘.
 By default, script logs are stored in
 home/var/log. However, some scripts log to
 STDOUT.
 Scripts that do not log to STDOUT log to file
 script_name_logtime.log,
 where script_name is the name of the script, as based on
 the value of $0, and logtime is the log time
 of the start time of the script.  Scripts to clear the log directory
 when it is considered full are outside the scope of icanis.
Unless specified otherwise, each script issues a message on start that it started at certain time. It ends with a message on when it ends. The time is reported in the log time format. The message is printed to the script's log file or to STDOUT, if the script logs to STDOUT.
There are two types of input data. They are called node inputs and adjacency inputs.
 Node inputs are only identified by the network source. Thus they are
 stored files named
 source_nodes_tist.dump.gz.
 As the file names suggest, they are gzipped Perl structures.  Node
 inputs are hashes, that have handles as keys, and hashes
 with keys "name", and "homepage" as values.  The "homepage" is
 optional.  Data for all nodes that belong to the giant component of
 all network types are stored. The giant component is independent of
 the network type.  Information on nodes that do not belong to the
 giant component must not be stored.
 Adjacency inputs file are named
 source_nettype_tist.dump.gz. Thus,
 unlike node input, that are only identified by the source, adjacency
 inputs are identified by the network type that they belong to, as
 well as by the source. An implicit assumption is that the same set of
 nodes can be looked at in several networks.  Adjacency input files
 contain hashes of hashes that document half the adjacency
 matrix. Keys are handles. Only hash values from a lower
 node handle to a higher node handle, in standard string comparison
 order, are given to conserve disk space.  Data for all nodes that
 belong to the giant component are stored. Other nodes must not be
 stored.
Component analysis has to be carried out prior to placing data in an icanis input directories installation. How this is done is outside the scope of this specification.
 As an example, let there be a simple network source banja that has
 three notes at a time 1173213901. Then a file
 banja_nodes_1173213901.dump.gz, contains, after
 decompression, a variable, say $n, with
$n->{di}->{name}="Dmitry Ishkov";
$n->{tk}->{name}="Thomas Krichel";
$n->{tk}->{homepage}="http://openlib.org/home/krichel";
$n->{vs}->{name}="Vladimir Selivanov";
 Now assume that the network links node di to node
 tk and node tk to node vs, on a
 hypothetical tist 1173219390.  The network type is binary.  Then,
 banja_binary_1173219390.dump.gz, after decompression,
 contains a variable, say $a, with
$a->{id}->{tk}=1;
$a->{tk}->{vs}=1;
This completes the example.
 The user name for the database is icanis.  Access to the database is
 enabled through the file /home/icanis/.my.cnf. That file is readable
 and writable by the user icanis only. Note that it lives outside the
 normal icanis directory tree.  
The name of the database is icanis.
 For each source and each nettype, a database table
 stores node data.  This table is the nodes table of a network.  It is
 identified by
 source_nettype_nodes.
 The structure is
handlenamehomepagenode_tistpath_tistclosenessbetweenness
 A script update_node takes the source,
 nettype, and node as arguments.  It proceeds to
 update the node into the nodes table. If the node is not in the
 table, it inserts it into the table.  If the node is not in the
 table, and not in the node's input data, the script logs an error.
 If the node is in the nodes data, add_node populates the
 handle, name and homepage
 (optional) columns values found in the most recent version of the
 node's input data. It updates the node_tist with the
 current tist.  The script logs to STDOUT.
 A script delete_node takes the source,
 nettype, and node as arguments.  It proceeds to
 delete the node in the nodes table. The script logs to STDOUT.
 A script rollover_node_update takes the
 source, nettype, and number as
 arguments.  It proceeds to call update_node in the
 number nodes with the lowest node_tist. If
 the number is larger than the number in the the table, all
 nodes are updated.  If a node is in the nodes table, but not in the
 nodes data, it is deleted from the node
 table. rollover_node_update is a non-concurrent script.
In the algorithms that Thomas Krichel is aware of, paths are calculated from a certain start node to all end points that are reachable. Within the context of icanis, that only studies a single component, that means all nodes.
 A script update_path takes the source
 nettype, start node node and gz as
 arguments. It proceeds to calculate to all shortest path between the
 node node and all other nodes. The script logs to STDOUT.
The results are stored in the path directory of the start node node. If gz is set, the files created in the path directory are compressed.
 The path directory for a node contains two files. Any of these files
 may be compressed. In that case, the customary .gz is
 appended to their names.
 The first file is called paths. It contains all the
 paths from the node start node to any other node. There is one line
 per path. The line starts with the length of the path, followed by
 the tabulation character. Then follows a list of all intermediary
 nodes to the destination node, including the destination node.  This
 list contains the handles of these node, separated by the tabulation
 character. If the network is binary, the length of the path is equal
 to the number of nodes mentioned in the line. Paths in the file may
 appear in any order.
 The second file is called inter. It contains information
 on how much a particular node appears in between two other nodes in
 all the path between start node and a destination. Thus, if the
 paths file for a node “e” looks like, say
2 a b 3 a c d 2 a c 1 a
 the corresponding inter file will be
3 a 1 c
 The inter files does not mention any node that has not
 been an intermediary node between two others. inter
 files sort by the importance of the node, i.e. by the number in the
 first column.  
 At the start of its run,
 update_path changes the path_tist column in
 the record of the start node in the nodes table.  If that record does
 not exist, update_path calls update_node
 for this node. If the node is not in the nodes input,
 update_path notes this as an error.  
 At the end
 of its run, update_path changes the
 closeness nodes table column of the start node record in
 the nodes table. If that record does not exist,
 update_path notes this fact in its log as an error.
 A script rollover_path_update takes the
 source nettype, node number number
 and gz as arguments.
If number is greater than the number of records in the nodes table, number is set to be equal to the number of records in the nodes table.
 On startup the script selects the number nodes with the
 lowest path_tist.  It the proceeds to fire up
 update_path for each of those nodes. If gz is
 set, the resulting files are compressed.
 rollover_path_update is an non-concurrent script.
 A script incidence_path_update takes the
 source, nettype, and gz as
 arguments. It is said to run for two tists. These tists are the start
 running tist and the end running tist.
 At its start, incidence_path_update finds its most
 recent log. If no log file can be found, the current start running
 tist is the earliest tist for which edge data is available. If a last
 log is found, incidence_path_update parses its last log
 for the end running tist of the last run. If the log can not be
 parsed for such a tist, incidence_path_update exits
 immediately without writing another log. If the log file can be found
 and parsed for the last end running tist, the current start running
 tist is the last end running tist.
The end running tist is the tist of the most recent adjacency input found.
If the adjancy data for the start tist can not be found, the start tist is changed to be the oldest start tist for which adjacency data is available.
Both the start running tist and the end running tist are logged.
 The script incidence_path_update computes the difference
 between the adjacency matrix at the start and the end tist. From
 there, it finds a group of nodes that are the affected nodes and
 their immediate neighbors. 
There are 6 types of incidents that the matrix of adjacency can undergo.
Incidents “l+” and “l–” are not relevant in a binary network. They are not considered in the current implementation.
 In incidence “a+”, update_node and
 update_path are run for the new node. For the neighbors
 of the new node, update_path is run.
In incidence “a–”, if the node is not marginal, we have to find all the incidents where this author is intermediary. This is expensive since in principle, it requires to check all paths to see where deleted author is an intermediary.
One way to deal with this problem is to do basically nothing. If a node is deleted, references to it will not disappear immediately. Thus in the event “a–”, only the neighbors of the deleted node are updated.
The same updating procedures also hold for “e+” and ”e–”. Only the node and its neighbors are updated on impact.
 incidence_path_update is a non-concurrent script.
 When a new implemetation of icanis is built, it is useful
 to use a concurrent script to build paths and 
 inter files for all nodes.
 A script paths_build takes the
 source, nettype, and gz as
 arguments.
 At the start, paths_build looks at the
 nodes in the node table in random order. 
 For each node, it checks if path data is available. If
 there is no paths date, it runs update_path
 for the node.
Whereas closeness is updated when the paths from a node are calculated, betweenness values are only estimated.
 A script betweenness_update takes the source,
 nettype, and number as arguments.  It proceeds
 to examine the number nodes from the nodes table that have
 the highest path_tist field values. If no path data for a node is
 available it skips it.  Otherwise, it reads the number of paths from
 the paths and finds betweenness values in the
 inter file. When the calculations of betweenness are
 finished, it updates the betweenness columns in all records in the
 nodes table. If a record for a node to be updated does not exist,
 betweenness_update ignores the issue.
 betweenness_update is a non-concurrent script.
 Assume that a between node a and z needs to be
 found. Let handle_file_a be the be the result of
 handle_to_file(a).  Let
 handle_file_z be the be the result of
 handle_to_file(s).  Icanis
 scripts first compares
 home/opt/paths/source/nettype/handle_file_a/paths,
 home/opt/paths/source/nettype/handle_file_a/paths.gz,
 home/opt/paths/source/nettype/handle_file_z/paths,
 and
 home/opt/paths/source/nettype/handle_file_z/paths.gz
 to find the one that is most the recent, as by its file time stamp.
 It then zgreps the file for the path. 
All HTML files are written as XHTML 1.0 strict. They use CSS level 2.1 to add style, but within reason.
 The web service files live in in two directories,
 home/opt/html and
 home/xsl.
 Let base_url be the base url of the service, essentially
 the domain name at which the service lives.  This value is not used
 in icanis.  All URLs are relative to a source and an
 nettype. They live in URLs starting with
 base_url/source/nettype/.
 It is this value that will be referred to as service_url
 in what follows.
 As an example, if the RAS source is used for a binary network, then
 the service_url would be something like
 http://collec.openlib.org/ras/binary.  How several top
 levels for several sources and nettypes are combined is out of scope
 for icanis.
 All HTML pages refer to a CSS file at the href
 /source/nettype/icanis.css.
Each page, by virtue of its location, has a source and a nettype associated with it. In addition, each page has a “generation type” and a “page type”.
There are three generation types in icanis.
In order to keep the system configurable to a maximum, the generation of updated and on-the-fly pages involves the creation, by icanis, of an XML data structure that we call here a nucleus. The nucleus is then transformed with an XSLT file. This specification addresses the structure of the nucleus, as well as the location of the XSLT file. The XSLT file itself is not part of icanis. Each icanis installation will have to produce its own XSLT files, for each source and nettype maintained.
 Node pages are updated pages.  The page type “node” lives
 at
 service_url/node/node_file.html,
 where node_file is the application of
 node_to_file for this node.
 The node pages are generated by the script
 make_node_pages. It requires the parameters
 source and nettype. It takes the optional
 parameter handle. If that parameter is set, only the node
 with the handle handle is updated.  The corresponding
 files live in a directory opt/node of the web directory.
 To enable this setup without cumbersome apache configuration changes,
 a symbolic link node -> opt/node is be placed in the web
 directory.
 Let node_page stand for
 handle_to_nodepage(node,source,nettype)
 for a certain node node. Then, for each handle, the
 following nucleus XML is created by the script
<node> <own_data> <name>name</name> <url>url</url> <homepage>homepage</homepage><!-- optional --> <handle>handle</handle> </own_data> <criteria> <!-- for each criterion --> <criterion name="crit_name"> <position>number</position> <value>value</value> </criterion> </criteria> <neighbors> <!-- for each neighbor, sorted by ascending edge length --> <neighbor> <name>name</name> <url>url</url> <nodepage>node_page</nodepage> <homepage>homepage</homepage> <handle>handle</handle> <edge_length>value</edge_length> <neighbor> </neighbors> </node>
 The nucleus for each node is transformed with an XSLT styles sheet.
 The XSLT styles sheet lives at
 home/xsl/source/nettype/node_pages.xsl.
 Criteria pages illustrate the position of nodes with respect to a
 criterion crit_name.  They are generated with the
 make_criterion_pages script. The first and second are
 source and nettype.  The third parameter is the
 crit_name which can take the values
 betweenness and closeness.  The fourth
 parameter is the page size, in terms of number of nodes to be shown
 on a page.  The resulting URLs live
 url/rank/crit_name_start_end.html.
 where start is the start number, and end is the
 end number of the node shown.  Icanis adds leading 0s are appended to
 start and end to get file names of the same
 length.  In addition, the script creates two symbolic links. One is
 start.html pointing to the first criterion page. The
 other is end.html, pointing to the last criterion page.
 The criterion pages live at URLs starting with
 service_url/crit_name/. The
 pages are stored in
 home/html/source/nettypeopt/crit_name/. To
 enable this setup, symbolic links crit_name ->
 opt/crit_name are used.
The script builds a nucleus
<criterion name="crit_name"> <nodes start="handle" end="handle"> <!-- for each node from start to end --> <node> <name>name</name> <url>url</url> <nodepage>node_page</nodepage> <homepage>homepage</homepage> <handle>handle</handle> <rank>number</rank> <value>value</value> </node> <nodes> <!-- if there is a next page --> <next start="first node number" end="last node number"/> <!-- if there is a previous page --> <previous start="first node number" end="last node number"/> </criterion>
 The transforming XSLT lives at
 home/xsl/source_nettype_crit_name.xsl.
 Each criterion has a documentation page. This is a static page.  It
 lives at
 url/doc/crit_name.html
 Theses pages explain the criteria.
 There is one single search script. It runs plain FCGI.  The page
 lives at service_url/bin/search.
If search called is without an argument, it builds a nucleus
<page/>
 which is transformed with
 home/xsl/source_nettype_search.xsl.
 This builds a full search page, with two search boxes for the start
 and the end node.
If search is called with arguments h1=handle_1 and h2=handle_2 the nucleus formed is
<page>
 <paths>
  <!-- for each path in the shortpath set -->
  <path>
   <!-- for each node in the shortpath -->
   <node>
    <name>name</name>
    <url>url</url>
    <nodepage>node_page</nodepage>
    <homepage>homepage</homepage>
    <handle>handle</handle>
   </node>
  </path>
 <paths>
<page>
 which is transformed with
 home/xsl/source_nettype_search_result.xsl.
 This builds a complete result pages showing all paths from node
 handle_1 to handle_2.
If search is called with arguments q1=exp_1 and q2=exp_2 queries for node name or id matching exp_1 and exp_2, respectively, are made.
If two unique answers are found, one for each query, the nucleus is identical to the search results case. The search result is shown.
If no results for neither exp_1 nor exp_1 are found, the nucleus is
<page> <result q1="query string"/> <result q2="query string"/> </page>
 which is transformed with
 home/xsl/source_nettype_search_page.xsl.
If no results for exp_2 but one unique result for exp_1 are found, the script forms the nucleus
<page> <result q1="query string"> <node> <name>name</name> <url>url</url> <nodepage>node_page</nodepage> <homepage>homepage</homepage> <handle>handle</handle> </node> </result> <result q2="query string"/> </page>
 It is transformed with
 home/xsl/source_nettype_search_half_result.xsl.
If no results for exp_1 but multiple results for exp_2 are found, the script forms the nucleus
<page>
 <result q1="query string"/>
 <result q2="query string">
  <nodes>
   <!-- for each found result -->
   <node>
    <name>name</name>
    <url>url</url>
    <nodepage>node_page</nodepage>
    <homepage>homepage</homepage>
    <handle>handle</handle>
   </node>
  </nodes>
 </result>
</page>
 It is transformed with
 home/xsl/source_nettype_search_half_result.xsl.
The search script may be called with an argument h1 replacing q1 and h2 replacing q2, respectively. The first case is more common. It occurs when search is called from a node page, where the user is looking for paths to a destination node from the node described in the node page. In the nucleus, h1 replaces q1. For example
<page>
 <result h1="handle">
  <node>
   <name>name</name>
   <url>url</url>
   <nodepage>node_page</nodepage>
   <homepage>homepage</homepage>
   <handle>handle</handle>
  </node>
 </result>
 <result q2="query string">
  <nodes>
   <!-- for each found result -->
   <node>
    <name>name</name>
    <url>url</url>
    <nodepage>node_page</nodepage>
    <homepage>homepage</homepage>
    <handle>handle</handle>
   </node>
  </nodes>
 </result>
</page>
 It is transformed with
 home/xsl/source_nettype_search_half_result.xsl.
 This is a static page.  It lives at
 service_url/index.html. It links to the start
 of criteria pages, and the search page.
This section is for information only. It is not part of the specs. Let domain be the domain of the service. The apache configuration then is something like
 NameVirtualHost domain
 <VirtualHost domain>
  ServerAdmin webmaster@localhost
  DocumentRoot
 home/html
  <Directory />
   Options
 Indexes FollowSymLinks
   AllowOverride
 None
  </Directory>
  Directory
 home/html>
   Options Indexes FollowSymLinks
   AllowOverride None
   Order allow,deny
   allow from all
   # This directive is useful when we have
 only one source and one nettype
   RedirectMatch
 ^/$
 source/nettype/
  </Directory>
  Alias /icons
 home/html/icons
  <Directory
 home/html/icons>
   Options Indexes FollowSymLinks
   AllowOverride None
   Order allow,deny
   allow from all
  </Directory>
  # 
 for every nettype and source
  ScriptAlias
 source/nettype/bin
 home/perl
  <Directory
 "home/perl">
   AllowOverride None
   Options ExecCGI -MultiViews
 +SymLinksIfOwnerMatch
   Order
 allow,deny
   Allow from all
  </Directory>
  LogLevel
 warn
  ErrorLog
 /var/log/apache2/domain_error.log
  CustomLog
 /var/log/apache2/domain_access.log
 combined
  ServerSignature On
 </VirtualHost>
A running implementation with data from the RePEc Author Service is part of the project. This implementation is code-named “CollEc”, but this name may be changing later. This source is labelled as “ras”.
The nettype in the implementation is labeled “binary”. It deals with binary relationships between nodes.
The web site will live at http://collec.openlib.org. This redirects to http://collec.openlib.org/ras/binary because this is what we will implement.