Metz paper

version of 2006-09-04

Introduction

This is the Metz paper. It's first draft by Thomas Krichel on was written 2006-09-04.

Metz is a project to build a software system that can maintain and visualize social network data while the underlying network data is changing. There is no precedence for such a service.

A running implementation with data from the RePEc Author Service is part of the project. This implementation is codenamed "CollEc", but this name may be changing later. The web site will live at http://collec.openlib.org.

General limitations

Metz will be limited to symmetric networks. But they may not need to be binary.

Any network type has a type identifier henceforth refered to a n_id.

At this time, I am not aware, this time of a software that can calculate all paths for non-binary network

Metz will be limited to symmetric networks. But they may not need to be binary. This implies that paths between from any node a to a node b are the same as the paths from node b to a node a.

All software is written in Perl.

Input formats

The network input format describes a symmetric matrix. For row, say, there is an identifier. Idenitfiers are not supposed to be strings that contain ascii characters only. They may not contain neither whitespace (i.e. blank, tab, carriage return and line feed). Neither may they contain control characters 0x00 to 0x1F nor Ox7F (in hex).

Since the network input matrix is symmetric, only the data for half the matrix has be be maintained in the input data. The rest can be induced.

Network input comes as lines. Each line is of the form
idTAB[numberTABidTAB]+"\n"
where id is an id of a node, as outlined previously, and numberis a number, and TAB is the tabulation character. There must be an odd number of ids and an even number of numbers in the line. The line starts and ends with an id. All ids in a line the follow the first id must have a higher lexographical position than the first id. If they don't they are rejected and an error is signalled. "\n" is the newline or carriage return and newline.

Example

ab 34 be 32 bi 9 me

is a correct line.

ab 34 bi 3.2 be 9 me

is also correct,

ab 34 bi 3.2 be

is correct but

bi 34 be 32 ab 9 me

because one of the following identifiers has a lower lexigraphical position than the first one.

All Metz data is contained in a Metz directory, in files that end with ".metz". There is one file per network. The name of the file is the identifier of the network.

All network are assumed to contain only one single component.

A Metz installation ought to be able to have several networks, each having its own Metz network input directory. However in the implementation software at this time, only one network and its binary form will be considered.

The binary form of a network will not be separately described since it can be deducted from the non-binary network. It is assumed, in the following that the network is not binary and therefore a binary equivalent exists.

Shortest path output form

The system aims to compute all shortest paths between any couple of nodes. We call the shortpath set from node A to node B the set of all shortest paths from A to B. Symmetry implies that the shortpath set from A to B is the same as the shortpath set from B to A.

Shortest paths can be stored in a file that has the paths starting for an A to every B that is lexographically higher. This can be a flat file. For its stoage in a datebase, the following column structure would seem appropriate

path id
the start_id
end_id, lexically higher than the end_id
the number of intermediate nodes
the total distance between the nodes
a string that has tab separated intermediate ids in the order that they are on the path between start_id and end_id

Note that there may be many paths with the same start_id and end_id

Shortest path update calculations

In the algorithms that I am aware of, paths are calculated from a certain start to all end points that are reasable.

I assume in the following that we can not calculate in an isolated fashion, indivdual paths between an arbtrary start_id and an arbitrary end_id.

When the paths for an individual author are updated, there are a an number of intermediary paths being update. In fact, all paths between two authors that appear in a path are calculated again. This is best seen by an example. Let there be two shortest paths between two node a1 and a2
a1 a5 a4 a3 a2
a1 a5 a6 a3 a2
Here, there are intermediate paths between a3 and a5 are also contained. Thus a full update for an author will also update the sub paths, which are paths that do not belong to the author.

There are 6 types of incidents that the matrix of edge can undergo.

"a+" A new author appears.
"a-" A existing author disappears.
"e+" An author has a new edge to another.
"e-" An author looses an edge to another.
"w+" The edge of an author to another becomes shorter.
"w-" The edge of an atuhor to another becomes longer.

The software written must recognize there events when comparing two version of the edges matrix.

For "w-", and "w+", the binary version does not change. The weighted version needs recalculation.