This is the Metz paper. Its first draft was written by Thomas Krichel on 2006‒09‒04.
The Metz paper describes a software system called icanis. icanis stands for instantaneous co-authorship network information system.
Icanis is a software system that can maintain and visualize social network data while the underlying network data is changing.
The application area of icanis is co-authorship data. Thomas Krichel thinks that there is no precedent for a service that can visualize a large (1000+ nodes) co-authorship system that changes over time. Studies of such networks are limited either to one static instance or to the comparison of the state of the network at a few points in time.
Icanis looks at the network at the current moment in time. The historic development of a network is out of scope.
Icanis does not aim to give a fully accurate portrait of the network at every moment. Instead, a best effort is made to give a reasonably accurate picture of the network at all times. The effort depends on the computational resources that are available. The effort is regulated by crontab entries. The more often jobs are run, the more accurate the network description will be. The setting of scheduled jobs is out of scope for icanis itself.
Icanis software is written in Perl and as XSLT files. Icanis uses the MySQL database system.
The UTF-8 character encoding is used throughout.
Any network data source has an identifier henceforth referred to as source. “ras”, standing for the RePEc Author Service, is an example of a value that the source can take. The implementation uses this value for this particular source.
The same set of nodes, coming from a single source, can be looked at in several networks. There are two limitations on the network types as they are implemented in Icanis.
Any network type has a type identifier henceforth referred to as nettype. “binary” is an example of a value that the nettype can take. In the implementation, it is the only one that is used.
A single icanis installation can handle a range of network types and data sources. But icanis is not designed to deal with a lot of them. A systematic management of sources and nettypes is out of scope.
Any node has an identifier henceforth referred to as its “handle”.
Any node has a home page called the nodepage. This is not the external description of the node. Rather, it is a page that describes the node with respect to the network. Thus it shows neighbours and rankings. It may have a search box etc.
Time stamps are represented by the Unix epoch, i.e. the number of seconds since 1970‒01‒01. This is henceforth referred to as the “tist”. Its database representation is an INT.
Some scripts may have to check, at startup time, whether another instance of the same script is running, because no two instances should run at the same time. Such scripts are called non-concurrent. If a non-concurrent script is started while another instance of the same script is running, the starting script exits with an error message to STDOUT. Icanis uses Sys::RunAlone for this task.
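The non-concurrent behaviour can be illustrated with a small sketch. It is written in Python rather than Perl and does not use Sys::RunAlone. Instead it assumes a Unix system and takes an exclusive, non-blocking lock on a lock file. The function names and the lock file path are hypothetical.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open, locked file object, or None if another instance holds the lock."""
    lock_file = open(lock_path, "a")
    try:
        # LOCK_NB makes flock fail at once instead of waiting for the other instance.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    return lock_file


def main(lock_path):
    lock = acquire_single_instance_lock(lock_path)
    if lock is None:
        # The specification asks for the error message to go to STDOUT.
        print("another instance is already running, exiting")
        return 1
    # ... the real work of the script would go here ...
    return 0
```

The lock is released when the file object is closed or the process ends, so a crashed script does not leave a stale lock behind.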
Icanis assumes that there is a user icanis on the machine running the software. icanis itself runs in the subdirectory icanis of that user's home directory. This directory is henceforth referred to as home. home typically takes the value /home/icanis/icanis.
Perl scripts and modules are stored in home/perl.
Perl modules are stored in home/perl/Icanis.
XSLT files are stored in home/xsl.
HTML files are in home/html. Files that are generated by scripts are placed in subdirectories of this directory. This makes it easier to avoid backing them up.
The configuration is handled with a module Conf.pm. This defines a function handle_to_file() that finds the file corresponding to a node handle. A large number of nodes may require the result not to be equal to the node handle, but to prefix it with a subdirectory that makes sense. The result of this function is a string expression.
The directory home/paths/source/nettype is called the paths directory.
The directory home/html/source/nettype/ is the web directory for the source source and the nettype nettype.
Node data are stored as Perl structures in home/opt/input/. This is the input directory.
Some common functions, as well as settings, are stored in home/perl/Icanis/Conf.pm.
handle_to_file(node, source, nettype)
file_to_handle(node, source, nettype)
handle_to_file(file_to_handle()) must be the identity function.
handle_to_url(node, source, nettype)
handle_to_nodepage(node, source, nettype)
/ , i.e. it includes the source and the nettype.
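The round-trip requirement between handle_to_file() and file_to_handle() can be sketched as follows. This is a Python illustration, not the Conf.pm code. The choice of the first two characters of the handle as the subdirectory is an assumption made only for this example, as is the rule that handles contain no slash.

```python
def handle_to_file(handle, source, nettype, prefix_length=2):
    """Map a node handle to a relative file name, e.g. 'pkr1' -> 'pk/pkr1'.

    source and nettype are accepted to mirror the documented signature;
    this hypothetical mapping does not need them.
    """
    return handle[:prefix_length] + "/" + handle


def file_to_handle(file_name, source, nettype):
    """Invert handle_to_file by dropping the subdirectory prefix."""
    return file_name.rsplit("/", 1)[-1]
```

With this mapping, handle_to_file(file_to_handle(f)) returns f for every file name that handle_to_file can produce, which is the identity property required above.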
For each network type nettype, there is a Perl module home/perl/Icanis/nettype.pm that is responsible for the calculation of shortest paths. Thus icanis can be extended to a range of network types and computation algorithms.
The log time is the output of the Unix command “date +%F_%T”.
By default, script logs are stored in home/var/log. However, some scripts log to STDOUT. Such scripts are mentioned here explicitly.
Scripts that do not log to STDOUT log to the file script_name_source_nettype_logtime.log, where script_name is the name of the script, taken from the base file name of $0, and logtime is the log time at which the script started. Scripts to clear the log directory when it is considered full are outside the scope of icanis.
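As an illustration of the naming rule, the following Python sketch builds a log file name from a script path, a source and a nettype. The function names are hypothetical. The format string is the strftime equivalent of “date +%F_%T” in local time.

```python
import os
import time


def log_time(moment=None):
    """Format a Unix time like `date +%F_%T`: YYYY-MM-DD_HH:MM:SS, local time.

    With moment=None the current time is used.
    """
    return time.strftime("%Y-%m-%d_%H:%M:%S", time.localtime(moment))


def log_file_name(script_path, source, nettype, start_time=None):
    """Build script_name_source_nettype_logtime.log from the script's base name."""
    script_name = os.path.basename(script_path)
    return f"{script_name}_{source}_{nettype}_{log_time(start_time)}.log"
```

The resulting name would then be placed in home/var/log.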
Unless specified otherwise, each script issues a message on start stating the time at which it started. It ends with a message stating the time at which it ended. The time is reported in the log time format. The message is printed to the script's log file or to STDOUT, if the script logs to STDOUT.
There are three types of input data. They are called node inputs, edge inputs, and text inputs, respectively. Text inputs are optional.
Node inputs are identified by the network source and nettype. Thus they are stored in files named source_nettype_nodes_tist.xml. As the file names suggest, they contain XML data. An example will help.
<nodes>
<node ref="pba1" name="Jose Manuel Barrueco" homepage="http://www.uv.e/=barrueco"/>
<node ref="pde1" name="Antonella De Robbio" homepage="http://www.derobbio.it"/>
<node ref="pkr1" name="Thomas Krichel" homepage="http://openlib.org/home/krichel"/>
<node ref="pzi1" name="Christian Zimmermann" homepage="http://ideas.repec.org/zimm"/>
</nodes>
The homepage attribute is optional. Data for all nodes that belong to the giant component of all network types are stored. Information on nodes that do not belong to the giant component must not be stored. It is convenient, but not required, that nodes are sorted by the uniquely valued ref attribute.
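A reader for this format could look like the following Python sketch. It is not the icanis code. It only shows that the homepage attribute may be absent and that ref values are expected to be unique. The function name and the shortened sample are made up for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE_NODES = """<nodes>
<node ref="pkr1" name="Thomas Krichel" homepage="http://openlib.org/home/krichel"/>
<node ref="pzi1" name="Christian Zimmermann"/>
</nodes>"""


def read_nodes(xml_text):
    """Return {ref: {"name": ..., "homepage": ...}}; homepage is None when absent."""
    nodes = {}
    for element in ET.fromstring(xml_text).iter("node"):
        ref = element.get("ref")
        if ref in nodes:
            raise ValueError(f"duplicate ref: {ref}")
        nodes[ref] = {"name": element.get("name"), "homepage": element.get("homepage")}
    return nodes
```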
Edge input files are named source_nettype_edges_tist.xml. Thus, unlike node inputs, which are only identified by the source, edge inputs are identified by the network type that they belong to, as well as by the source. Edge input files contain edge data for half the adjacency matrix. An example will help.
<edges>
<edge from="pba1" to="pkr1" length="0.5"/>
<edge from="pde1" to="pkr1" length="1"/>
<edge from="pkr1" to="pzi1" length="1"/>
</edges>
Only hash values from a lower node handle to a higher node handle, in standard string comparison order, are given to conserve disk space. Data for all nodes that belong to the giant component are stored. Data for other nodes must not be stored.
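Because only half of the adjacency matrix is stored, a reader has to mirror each edge. The Python sketch below does this and checks the ordering rule. The function name is hypothetical, and Python's string comparison is assumed to be an acceptable stand-in for the standard string comparison order.

```python
import xml.etree.ElementTree as ET


def read_edges(xml_text):
    """Return a symmetric adjacency mapping: handle -> {neighbour: length}."""
    adjacency = {}
    for element in ET.fromstring(xml_text).iter("edge"):
        low, high = element.get("from"), element.get("to")
        if not low < high:
            raise ValueError(f"edge {low} -> {high} is not in lower-to-higher order")
        length = float(element.get("length"))
        # Store the edge in both directions to rebuild the full matrix.
        adjacency.setdefault(low, {})[high] = length
        adjacency.setdefault(high, {})[low] = length
    return adjacency
```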
Text input files are named source_nettype_texts_tist.xml. They are for information only. They contain text identifiers as well as a sorted list of authors. An example will help.
<texts>
<text ref="info:lib:dblp:conf/nddl/CruzK02" authors="pba1 pkr1"/>
<text ref="info:lib:dblp:conf/ercimdl/CruzKK00" authors="pba1 pkr1"/>
<text ref="info:lib:elis:4497" authors="pde1 pkr1"/>
<text ref="info:lib:elis:3117" authors="pkr1 pzi1"/>
</texts>
Data for all nodes that belong to the giant component are stored. Data for other nodes must not be stored.
Component analysis has to be carried out prior to placing data in the input directory of an icanis installation. How this is done is outside the scope of Icanis.
The user name for the database is icanis. Access to the database is enabled through the file /home/icanis/.my.cnf. That file is readable and writable by the user icanis only. Note that it lives outside the normal icanis directory tree.
The web user name for the database is web_icanis. The web user only has SELECT access to the tables. The password is stored in the file /home/icanis/etc/web_password. That file is readable and writable by all users on the system. Note that it lives outside the normal icanis directory tree.
The name of the database is icanis.
For each source and each nettype, a database table stores node data. This table is the nodes table of a network. It is identified by source_nettype_nodes.
The structure is
handle
name
homepage
node_tist
path_tist
closeness
closeness_rank
betweenness
betweenness_rank
A script update_nodes takes the source and nettype as arguments. It first proceeds to update the nodes from the last available node dataset into the nodes table.
For each node in the input, it adds the node to the nodes table, or replaces the values if the node is already there. It populates the handle, name and homepage (optional) columns with values found in the most recent version of the node's input data. It updates the node_tist with the current tist.
For each node in the nodes table, if the node is not contained in the edges data, it is deleted. Contained in the edges data means that the node is a key of the edges data or appears in the values. If the node is not in the table, it inserts it into the table.
It then proceeds to look at each node in the nodes table. If the node is no longer in the XML data, it deletes the node from the nodes table.
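The synchronisation performed by update_nodes can be sketched with plain Python dictionaries standing in for the nodes table and the parsed node input. This is an illustration only. It assumes that the other columns of an existing row are kept when a node is refreshed, and it leaves out the check against the edges data.

```python
import time


def update_nodes(table, input_nodes, now=None):
    """Synchronise table (handle -> row dict) with input_nodes (handle -> attributes)."""
    tist = int(time.time()) if now is None else now
    for handle, attributes in input_nodes.items():
        # Add the node, or refresh its descriptive columns if it is already there.
        row = table.setdefault(handle, {"handle": handle})
        row["name"] = attributes.get("name")
        row["homepage"] = attributes.get("homepage")  # optional in the input
        row["node_tist"] = tist
    # Delete rows whose node no longer appears in the input data.
    for handle in list(table):
        if handle not in input_nodes:
            del table[handle]
    return table
```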
In the algorithms that Thomas Krichel is aware of, paths are calculated from a certain start node to all end points that are reachable. Within the context of icanis, which only studies a single component, that means all other nodes.
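One algorithm of this kind is Dijkstra's. The Python sketch below is an illustration and not necessarily what the nettype modules implement. It computes, from one start node, a shortest path to every reachable node. As a simplification it keeps only one path per destination, even when several paths of equal length exist. The adjacency argument is assumed to be a symmetric mapping of handle to neighbour lengths.

```python
import heapq


def shortest_paths(adjacency, start):
    """Return {destination: (length, [nodes after start, ending with destination])}."""
    best = {start: (0.0, [])}
    queue = [(0.0, start, [])]
    while queue:
        length, node, path = heapq.heappop(queue)
        if length > best[node][0]:
            continue  # stale entry: a shorter path to this node was found later
        for neighbour, edge_length in adjacency[node].items():
            candidate = length + edge_length
            if neighbour not in best or candidate < best[neighbour][0]:
                new_path = path + [neighbour]
                best[neighbour] = (candidate, new_path)
                heapq.heappush(queue, (candidate, neighbour, new_path))
    del best[start]  # the start node is not a destination
    return best
```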
The first file is called paths. It contains all the paths from the start node to any other node. There is one line per path. The line starts with the length of the path, followed by the tabulation character. Then follows a list of all intermediary nodes to the destination node, including the destination node. This list contains the handles of these nodes, separated by the tabulation character. If the network is binary, the length of the path is equal to the number of nodes mentioned in the line. Paths in the file may appear in any order.
The second file is called inter. It contains information on how often a particular node appears in between two other nodes in all the paths between the start node and a destination. Thus, if the paths file for a node “e” looks like, say
2 a b
3 a c d
2 a c
1 a
the corresponding inter file will be
3 a
1 c
The inter file does not mention any node that has not been an intermediary node between two others. inter files are sorted by the importance of the node, i.e. by the number in the first column.
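The relation between the two files can be made concrete with a short Python sketch that derives the inter lines from the contents of a paths file. The function name is hypothetical. Sorting is by descending count, and ties are broken by handle, which is an assumption.

```python
from collections import Counter

SAMPLE_PATHS = "2\ta\tb\n3\ta\tc\td\n2\ta\tc\n1\ta\n"


def inter_lines(paths_text):
    """Count how often each node is an intermediary in a tab-separated paths file."""
    counts = Counter()
    for line in paths_text.splitlines():
        nodes = line.split("\t")[1:]   # drop the leading path length
        counts.update(nodes[:-1])      # every node before the destination is intermediary
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count}\t{node}" for node, count in ranked]
```

For the sample above, node a lies on the way to b, c and d, and node c lies on the way to d, which gives the two inter lines of the example.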
A script update_paths takes the source and nettype as its arguments. It chooses the node with the oldest path tist, and updates its path tist. It then renews its paths and inter files, as well as the closeness column in the nodes table. When it has done that, it again updates the path tist for the node. It proceeds to the next node. It stops when the path tist of all nodes is more recent than the tist of the edges data file.
The double updating of the path tist prevents concurrent instances from dealing with the same node.
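The control flow of update_paths can be sketched as a loop over a mapping from handle to path tist. This Python fragment is an illustration under stated assumptions: the renew callback stands for the recomputation of the paths and inter files and the closeness value, clock stands for reading the current tist, and nodes that were never processed are assumed to carry a path tist of 0.

```python
def update_paths(path_tists, edges_tist, renew, clock):
    """Refresh nodes, oldest path tist first, until all are newer than the edges data."""
    processed = []
    while path_tists:
        handle = min(path_tists, key=path_tists.get)  # node with the oldest path tist
        if path_tists[handle] > edges_tist:
            break                                     # every node is up to date
        path_tists[handle] = clock()  # first update: another instance now skips this node
        renew(handle)
        path_tists[handle] = clock()  # second update: records when the work finished
        processed.append(handle)
    return processed
```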
Betweenness for all nodes is calculated by the update_betweenness script. It takes the source, nettype, and number as arguments. This is a non-concurrent script.
Assume that a path between nodes a and z needs to be found. Let handle_file_a be the result of handle_to_file(a). Let handle_file_z be the result of handle_to_file(z). Icanis scripts first compare
home/opt/paths/source/nettype/handle_file_a/paths,
home/opt/paths/source/nettype/handle_file_a/paths,
home/opt/paths/source/nettype/handle_file_z/paths,
and
home/opt/paths/source/nettype/handle_file_z/paths
to find the one that is the most recent, as judged by its file time stamp. It then zgreps the file for the path.
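The lookup can be sketched in Python as two helpers: one picks the most recently modified candidate file, the other scans a paths file for the line that ends at a given destination. Both names are hypothetical. Reading files with a .gz suffix through gzip is an assumption suggested by the use of zgrep, not something the specification states.

```python
import gzip
import os


def newest_file(candidates):
    """Return the existing candidate with the latest modification time, or None."""
    existing = [path for path in candidates if os.path.exists(path)]
    return max(existing, key=os.path.getmtime) if existing else None


def find_path(paths_file, destination):
    """Return the fields of the line whose last node is destination, or None."""
    opener = gzip.open if paths_file.endswith(".gz") else open
    with opener(paths_file, "rt", encoding="utf-8") as lines:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if fields[-1] == destination:
                return fields
    return None
```

If the newest file belongs to z rather than a, the search would look for a as the destination, and the path found reads from z towards a.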
All HTML files are written as XHTML 1.0 strict. They use CSS level 2.1 to add style, but within reason.
The web service files live in two directories, home/html and home/xsl.
Let base_url be the base URL of the service, essentially the domain name at which the service lives. This value is not used in icanis. All URLs are relative to a source and a nettype. They live in URLs starting with base_url/source/nettype/. It is this value that will be referred to as service_url in what follows.
As an example, if the RAS source is used for a binary network, then the service_url would be something like http://collec.openlib.org/ras/binary. How several top levels for several sources and nettypes are combined is out of scope for icanis.
All HTML pages refer to a CSS file at the href /source/nettype/icanis.css.
Each page, by virtue of its location, has a source and a nettype associated with it. In addition, each page has a “generation type” and a “page type”.
There are three generation types in icanis.
In order to keep the system configurable to a maximum, the generation of updated and on-the-fly pages involves the creation, by icanis, of an XML data structure that we call here a nucleus. The nucleus is then transformed with an XSLT file. This specification addresses the structure of the nucleus, as well as the location of the XSLT file. The XSLT file itself is not part of icanis. Each icanis installation will have to produce its own XSLT files, for each source and nettype maintained.
The own_node nucleus is a construct that looks at a row in the nodes table. The element name may be different depending on the usage. If the value of a column is defined and true in the sense of Perl, the own_node element obtains a child with the name of the column and the value of the column for the node. The own_node element may carry attributes as specified.
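A Python sketch of this construction follows. It is an illustration, not the icanis code. The helper perl_true approximates Perl truthiness, so that undefined values, the empty string, 0 and the string “0” produce no child element. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def perl_true(value):
    """Approximate Perl truthiness: undef, '', 0 and '0' count as false."""
    return value is not None and value not in ("", "0", 0)


def own_node(row, element_name="own_node", attributes=None):
    """Build an element with one child per column whose value is defined and true."""
    element = ET.Element(element_name, attributes or {})
    for column, value in row.items():
        if perl_true(value):
            ET.SubElement(element, column).text = str(value)
    return element
```

A row whose homepage is undefined therefore yields an element without a homepage child, which lets the XSLT test for the presence of the child rather than for an empty value.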
Node pages are updated pages. The page type “node” lives at service_url/node/node_file.html, where node_file is the application of handle_to_file() for the handle of the node.
The node pages are generated by the script make_node_pages. It requires the parameters source and nettype. The corresponding files live in a directory nodes of the web directory.
Let node_page stand for handle_to_nodepage(node, source, nettype) for a certain node node. Then, for each handle, the following nucleus XML is created by the script
<node>
  <own_data>
    <own_node_nucleus of node>
  </own_data>
  <neighbours>
    <!-- for each neighbor, sorted by ascending edge length -->
    <neighbour>
      <own_node_nucleus of neighbour>
      <edge_length>value</edge_length>
    </neighbour>
  </neighbours>
</node>
The nucleus for each node is transformed with an XSLT style sheet. The XSLT style sheet lives at home/xsl/source/nettype/node_pages.xsl.
Criteria pages illustrate the position of nodes with respect to a criterion crit_name. They are generated with the make_criterion_pages script. The first and second parameters are source and nettype. The third parameter is the crit_name, which can take the values betweenness and closeness. The fourth parameter is the page size, in terms of the number of nodes to be shown on a page. The resulting URLs live at url/rank/crit_name_number.html, where number is an integer starting at 1 that is incremented when a new file is started. start is the start number, and end is the end number of the nodes shown.
In addition, the script creates two symbolic links. One is first.html, pointing to the first criterion page. The other is last.html, pointing to the last criterion page.
The criterion pages live at URLs starting with service_url/rank/crit_name/. The pages are stored in home/html/source/nettype/rank/crit_name/.
The script builds a nucleus
<criterion name="crit_name">
  <nodes previous="previous_file_number" next="next_file_number">
    <!-- for each node from start to end -->
    <node> own_node data </node>
  </nodes>
  <!-- if there is a next page -->
</criterion>
The previous attribute is left out for the first file. The next attribute is left out for the last file.
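The splitting of a ranked node list into criterion pages can be sketched as follows. This Python fragment only illustrates the numbering and the previous and next rules. The function name and the dictionary layout are hypothetical.

```python
def criterion_pages(ranked_handles, crit_name, page_size):
    """Split a ranked list of handles into numbered pages with previous/next links."""
    total = (len(ranked_handles) + page_size - 1) // page_size  # ceiling division
    pages = []
    for number in range(1, total + 1):
        start = (number - 1) * page_size
        pages.append({
            "file": f"{crit_name}_{number}.html",
            "nodes": ranked_handles[start:start + page_size],
            "previous": number - 1 if number > 1 else None,  # left out on the first page
            "next": number + 1 if number < total else None,  # left out on the last page
        })
    return pages
```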
The transforming XSLT lives at home/xsl/source_nettype_crit_name.xsl.
Each criterion has a documentation page. This is a static page. It lives at url/crit_name.html. These pages explain the criteria.
There are basically two types of searches. First, there are node searches. A user comes to discover a node. Second, there are path searches. A user knows a start node, may or may not know an end node, and performs a search for the paths leading to that node.
Search results embed themselves in the pages that call them.
There is one single search script. It runs plain FCGI. The page lives at service_url/bin/search.fcgi.
If search is called without an argument, it builds a nucleus
<page/>
which is transformed with home/xsl/source_nettype_search.xsl. This builds a full search page, with two search boxes for the start and the end node.
If search is called with arguments h1=handle_1 and h2=handle_2 the nucleus formed is
<page>
  <paths>
    <!-- for each path in the shortpath set -->
    <path>
      <!-- for each node in the shortpath -->
      <node>
        <name>name</name>
        <url>url</url>
        <nodepage>node_page</nodepage>
        <homepage>homepage</homepage>
        <handle>handle</handle>
      </node>
    </path>
  </paths>
</page>
which is transformed with home/xsl/source_nettype_search_result.xsl. This builds a complete result page showing all paths from node handle_1 to handle_2.
If search is called with arguments q1=exp_1 and q2=exp_2 queries for node name or id matching exp_1 and exp_2, respectively, are made.
If two unique answers are found, one for each query, the nucleus is identical to the search results case. The search result is shown.
If no results are found for either exp_1 or exp_2, the nucleus is
<page>
  <result q1="query string"/>
  <result q2="query string"/>
</page>
which is transformed with home/xsl/source_nettype_search_page.xsl.
If no results for exp_2 but one unique result for exp_1 are found, the script forms the nucleus
<page>
  <result q1="query string">
    <node>
      <name>name</name>
      <url>url</url>
      <nodepage>node_page</nodepage>
      <homepage>homepage</homepage>
      <handle>handle</handle>
    </node>
  </result>
  <result q2="query string"/>
</page>
It is transformed with home/xsl/source_nettype_search_half_result.xsl.
If no results for exp_1 but multiple results for exp_2 are found, the script forms the nucleus
<page>
  <result q1="query string"/>
  <result q2="query string">
    <nodes>
      <!-- for each found result -->
      <node>
        <name>name</name>
        <url>url</url>
        <nodepage>node_page</nodepage>
        <homepage>homepage</homepage>
        <handle>handle</handle>
      </node>
    </nodes>
  </result>
</page>
It is transformed with home/xsl/source_nettype_search_half_result.xsl.
The search script may be called with an argument h1 replacing q1 and h2 replacing q2, respectively. The first case is more common. It occurs when search is called from a node page, where the user is looking for paths to a destination node from the node described in the node page. In the nucleus, h1 replaces q1. For example:
<page>
  <result h1="handle">
    <node>
      <name>name</name>
      <url>url</url>
      <nodepage>node_page</nodepage>
      <homepage>homepage</homepage>
      <handle>handle</handle>
    </node>
  </result>
  <result q2="query string">
    <nodes>
      <!-- for each found result -->
      <node>
        <name>name</name>
        <url>url</url>
        <nodepage>node_page</nodepage>
        <homepage>homepage</homepage>
        <handle>handle</handle>
      </node>
    </nodes>
  </result>
</page>
It is transformed with home/xsl/source_nettype_search_half_result.xsl.
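The choice of style sheet across the search cases can be summarised in a small Python sketch. It is an illustration only. The arguments hits_1 and hits_2 are hypothetical and stand for the number of nodes matched for each end, with None meaning that no argument was given and a handle argument counting as one hit. Sending every mixed case to the half result style sheet is an assumption that generalises the cases listed above.

```python
def choose_stylesheet(source, nettype, hits_1=None, hits_2=None):
    """Pick the XSLT file name for a search request from the two hit counts."""
    prefix = f"{source}_{nettype}_"
    if hits_1 is None and hits_2 is None:
        return prefix + "search.xsl"              # called without arguments
    if hits_1 == 1 and hits_2 == 1:
        return prefix + "search_result.xsl"       # both ends are unique: show paths
    if not hits_1 and not hits_2:
        return prefix + "search_page.xsl"         # nothing found for either end
    return prefix + "search_half_result.xsl"      # assumed catch-all for mixed cases
```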
This is a static page. It lives at service_url/index.html. It links to the start of criteria pages, and the search page.
This section is for information only. It is not part of the specs. Let domain be the domain of the service. The apache configuration then is something like
NameVirtualHost domain
<VirtualHost domain>
  ServerAdmin webmaster@localhost
  DocumentRoot home/html
  <Directory />
    Options Indexes FollowSymLinks
    AllowOverride None
  </Directory>
  <Directory home/html>
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    allow from all
    # This directive is useful when we have only one source and one nettype
    RedirectMatch ^/$ source/nettype/
  </Directory>
  Alias /icons home/html/icons
  <Directory home/html/icons>
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    allow from all
  </Directory>
  # for every nettype and source
  ScriptAlias source/nettype/bin home/perl
  <Directory "home/perl">
    AllowOverride None
    Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all
  </Directory>
  LogLevel warn
  ErrorLog /var/log/apache2/domain_error.log
  CustomLog /var/log/apache2/domain_access.log combined
  ServerSignature On
</VirtualHost>
A running implementation with data from the RePEc Author Service is part of the project. This implementation is code-named “CollEc”, but this name may be changing later. This source is labelled as “ras”.
The nettype in the implementation is labeled “binary”. It deals with binary relationships between nodes.
The web site will live at http://collec.openlib.org. This redirects to http://collec.openlib.org/ras/binary because this is what we will implement.
To create the database, log in to mysql as root. Select a password password, then do
create database icanis;
use icanis;
grant all on icanis.* to 'icanis'@localhost identified by 'password' ;
Next, we have to set up the access for the user. In the home directory of icanis, create the file .my.cnf, with contents
[client]
user = icanis
password = password
use icanis;
CREATE TABLE `source_nettype_nodes` (
`handle` VARCHAR( 10 ) CHARACTER SET ascii COLLATE ascii_bin NOT NULL ,
`name` VARCHAR( 100 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL ,
`homepage` VARCHAR( 1024 ) CHARACTER SET ascii COLLATE ascii_bin NULL ,
`node_tist` INT UNSIGNED NULL ,
`path_tist` INT UNSIGNED NULL ,
`closeness` FLOAT UNSIGNED NULL ,
`closeness_rank` FLOAT UNSIGNED NULL ,
`path_summary_unique` VARCHAR (1000) NULL,
`betweenness` FLOAT UNSIGNED NULL ,
`betweenness_rank` INT UNSIGNED NULL ,
PRIMARY KEY ( `handle` )
);
To set up the web user, do
GRANT SELECT on icanis.* to 'web_icanis'@localhost IDENTIFIED BY '
web_password