Similarity thesauri

This code was used to produce the examples for my talk "Similarity thesauri and cross-language retrieval". More information about this talk is on my studies page; a handout is also available.

The example code was written in Python; it requires Python version 2.2 or higher.

IRstruct.py

This file contains the following classes:

Token
This class provides the link between items and features.
Tokenizable
This is the superclass of all objects that can be decomposed into tokens.
TokString
The simplest kind of document, consisting of just a string.
Properties
This class is derived from UserDict and implements the subsumption order on feature structures as operators <= and >=.
Document
Documents can have additional properties, for example their language. Furthermore, they can be composed of other documents.
IndexComp
This is the common superclass of Item and Feature. The constructor takes a Properties object as its argument and either returns a previously constructed object with the same properties or constructs a new one.
Item
This class was derived from IndexComp without any changes.
Feature
This class was derived from IndexComp without any changes.
IRstruct
This class provides the basic functions for IR systems, for example weighting methods and storage of items and features.
SimThes
This class implements the construction of a similarity thesaurus as described in the handout.
SimThes_CL
A class implementing a cross-language similarity thesaurus. This changes only the output functions.
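
The behaviour described for Properties and IndexComp can be illustrated with a minimal sketch. It is not the code from IRstruct.py: it targets Python 3 (collections.UserDict) rather than Python 2.2, treats feature structures as flat dictionaries with hashable values, and assumes that a <= b means "a is at least as general as b". The caching mechanism (a class-level dictionary consulted in __new__) and the helper method key() are likewise assumptions made for illustration.

```python
from collections import UserDict


class Properties(UserDict):
    """Flat feature structure; <= and >= give a subsumption order.

    Assumption: a <= b holds when every key/value pair of a also
    appears in b, i.e. a is the more general structure.
    """

    def __le__(self, other):
        return all(k in other and other[k] == v for k, v in self.items())

    def __ge__(self, other):
        return all(k in self and self[k] == v for k, v in other.items())

    def key(self):
        # Hypothetical helper: a hashable fingerprint of the contents.
        return frozenset(self.items())


class IndexComp:
    """Returns the previously built object for equal properties, if any."""

    _instances = {}

    def __new__(cls, props):
        cache_key = (cls, props.key())
        if cache_key not in IndexComp._instances:
            obj = super().__new__(cls)
            obj.props = props
            IndexComp._instances[cache_key] = obj
        return IndexComp._instances[cache_key]


class Item(IndexComp):
    pass


class Feature(IndexComp):
    pass


general = Properties({"lang": "en"})
specific = Properties({"lang": "en", "type": "word"})

print(general <= specific)   # True: general subsumes specific
print(specific <= general)   # False
print(Item(Properties({"lang": "en"})) is Item(general))  # True: same object reused
print(Feature(general) is Item(general))                  # False: separate classes
```

Keying the cache on the class as well as the properties keeps an Item and a Feature with identical properties distinct, which matches the description of the two classes as separate subclasses of IndexComp.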

Most classes contain methods .asTeX and .asMP that produce TeX and MetaPost snippets describing the object.

docs.py

This file contains the documents used to construct the examples in the handout.

irtest.py

This file constructs two similarity thesauri from the documents in docs.py and writes the corresponding TeX and MetaPost snippets to files.
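
For readers without the handout, the general idea of a similarity thesaurus can be sketched as follows. This is not the code from irtest.py or IRstruct.py and does not reproduce the weighting scheme of the handout. It assumes the common formulation in which each term is represented by a vector over the documents it occurs in, and two terms are similar when those vectors have a high cosine. The function names, the raw-count weighting, and the sample documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_vectors(docs):
    """Map each term to a Counter {doc_id: count}; documents act as features."""
    vectors = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            vectors[term][doc_id] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[d] * v[d] for d in u if d in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def similar_terms(vectors, term, n=3):
    """Return up to n (score, term) pairs, most similar first."""
    scores = [
        (cosine(vectors[term], vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, reverse=True)[:n]


# Hypothetical sample collection, not the documents from docs.py.
docs = {"d1": "cat dog", "d2": "cat dog bird", "d3": "bird fish"}
vectors = term_vectors(docs)
for score, term in similar_terms(vectors, "cat"):
    print(f"{term}: {score:.2f}")
```

In this toy collection "dog" occurs in exactly the same documents as "cat" and therefore ranks first, while "fish" shares no document with "cat" and scores zero.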


Copyright © 1999–2004 Sebastian Marius Kirsch webmaster@sebastian-kirsch.org, all rights reserved.
Id: index.wml,v 1.3 2004/05/26 10:05:29 skirsch Exp