Publications‎ > ‎

Digital library of multiversion documents

Digital library of multiversion documents

Kodeks Company

Processing versions of text documents is a critical task for many applications, such as document workflow or legal information systems. In this article we analyze two problems of document versioning: versions storage and fulltext indexing.

Many different storing schemas for multiversion data are proposed. However, they are rarely used in text storage systems. Our experiments show that these complex techniques make storage and retrieval of big documents too slow. We have achieved better results using brute force method of storing versions of a document as separate objects compressed by conventional LZSS algorithm.

The main problem of indexing of multiversion documents is that small changes in a text can cause dramatics changes in fulltext index structures. Imagine that a word is inserted in the beginning of a long document that contains many words. In that case, all lists of positions of words from the document must be changed. Thus, insertion of one word, cause changes in many pages of the index structure. We use a special method of indexing of word’s positions in a text relatively to special points in text or "landmarks". In our example, the word insertion affects only the positions that are coded relatively to the first landmark. The selection of the landmarks must ensure minimal changes of a word’s positions index caused by typical text changes. We propose a new algorithm of the landmark generation that uses a special hash function from words of sliding text window. Our experiments show that the proposed method reduces the index changes significantly.

The disadvantage of the proposed method is an additional structure – landmark directory that maps landmarks to positions of words in a version. Access to the directory reduces productivity of index searches. To improve the situation we propose two methods that make possible to process typical queries without usage of a landmark directory. First is a special word position numbering scheme, second is additional flags of the last version in post list. These methods only slightly increase the index size and update time, but noticeably decrease a query processing time.

We plan to use the proposed algorithms in commercial multiversion document storage system