
Automatic hypertext linking of textual documents

Kodeks Company

max@kodeks.ru

Many modern text information systems support hypertext links in documents. Marking up hyperlinks manually is a very laborious process. This article describes the development, tuning, and evaluation of an automatic hyperlink markup system. The system is designed to find cross-references in the texts of legal and business documents.

The system uses a relatively simple document-processing algorithm. First, the text of a document is scanned for patterns that are likely to be cross-references. The patterns are created by a semi-automatic process and stored in a system initialization file. The system has different pattern sets for different document collections; for example, there are about 150 patterns for Russian legislation documents. When a pattern is found in the text, the attributes of the referenced document are extracted and the system searches for the document using these attributes. If the search returns several documents, or if several patterns overlap in the text, special heuristic rules are used to resolve the ambiguity.
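The processing steps described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the two regular expressions, the `Document` fields, the sample index, and both heuristics (keep the longest of overlapping matches, prefer the most recent of several found documents) are assumptions chosen only for illustration, since the abstract does not specify the real patterns or rules.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Hypothetical pattern set. The real system loads its patterns
# (about 150 for Russian legislation) from an initialization file.
PATTERNS = [
    re.compile(r"(?P<kind>Federal Law|Decree) No\. (?P<number>[\w-]+)"
               r" of (?P<date>\d{2}\.\d{2}\.\d{4})"),
    re.compile(r"(?P<kind>Federal Law|Decree) No\. (?P<number>[\w-]+)"),
]


@dataclass(frozen=True)
class Document:
    """Illustrative record for a document in the searchable collection."""
    doc_id: str
    kind: str
    number: str
    date: str  # dd.mm.yyyy


def find_candidates(text):
    """Scan the text with every pattern; return (start, end, attrs) tuples."""
    return [(m.start(), m.end(), m.groupdict())
            for pattern in PATTERNS
            for m in pattern.finditer(text)]


def resolve_overlaps(candidates):
    """Assumed heuristic: among overlapping matches keep the longest one."""
    kept = []
    for cand in sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0])):
        if all(cand[1] <= k[0] or cand[0] >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c[0])


def search(index, attrs):
    """Return documents whose fields equal all extracted attributes."""
    return [d for d in index
            if all(getattr(d, key) == value for key, value in attrs.items())]


def choose(docs):
    """Assumed heuristic: if several documents match, prefer the most recent."""
    return max(docs, key=lambda d: datetime.strptime(d.date, "%d.%m.%Y"))


def mark_links(text, index):
    """Return (start, end, doc_id) for every cross-reference that resolves."""
    links = []
    for start, end, attrs in resolve_overlaps(find_candidates(text)):
        docs = search(index, attrs)
        if docs:
            links.append((start, end, choose(docs).doc_id))
    return links


# Made-up sample data, used only to exercise the sketch.
INDEX = [
    Document("d1", "Federal Law", "123-FZ", "01.02.2003"),
    Document("d2", "Decree", "45", "10.05.1999"),
    Document("d3", "Decree", "45", "12.06.2001"),
]
TEXT = "See Federal Law No. 123-FZ of 01.02.2003 and Decree No. 45."

links = mark_links(TEXT, INDEX)
print(links)
```

In this sample the first reference matches both patterns, so the overlap rule keeps the longer match, and the second reference finds two decrees with the same number, so the document-selection rule picks one of them. A production system would likely need collection-specific rules in both places.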

The quality of the markup produced by the system was evaluated manually. About 200 documents containing more than 7000 hyperlinks were tested. The system marked about 90% of the cross-references in the documents.

The article also proposes directions for the future development of the system.