Information Retrieval in a Heterogeneous Document Collection

Modern digital libraries usually contain documents from different sources. For example, public Internet search engines search through sites, conferences, blogs, etc., while local search tools look for files, e-mail messages, etc. These documents differ in formatting, size and other attributes. Yet modern studies in the area usually evaluate the quality of information retrieval over standard uniform collections.

We studied two methods of search in heterogeneous collections: "search through", in which the search engine unites all documents into one set, and "differential search", in which the search engine performs a separate search in every set and then merges the results. The article discusses the development of the algorithm for merging the results.
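The difference between the two methods can be illustrated with a minimal sketch. It is not the article's algorithm: the abstract does not specify the scoring function or the merge procedure, so the toy term-frequency score and the min-max score normalization used for merging are assumptions chosen only for illustration, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Collection = Dict[str, str]      # document id -> document text
Hits = List[Tuple[str, float]]   # ranked (document id, score) pairs


def score(query: str, text: str) -> float:
    """Toy relevance score (an assumption): query-term count divided by document length."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[term] for term in query.lower().split()) / len(words)


def search(query: str, docs: Collection) -> Hits:
    """Rank one set of documents by descending score, dropping zero-score documents."""
    hits = [(doc_id, score(query, text)) for doc_id, text in docs.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def search_through(query: str, collections: List[Collection]) -> Hits:
    """'Search through': unite all documents into one set and search it once."""
    united: Collection = {}
    for docs in collections:
        united.update(docs)
    return search(query, united)


def normalize(hits: Hits) -> Hits:
    """Min-max scale scores to [0, 1] so that different sets become comparable."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (s - lo) / (hi - lo)) for doc_id, s in hits]


def differential_search(query: str, collections: List[Collection]) -> Hits:
    """'Differential search': search every set separately, then merge the results.

    The merge step here (normalize, then sort) is an assumed stand-in,
    not the merge algorithm developed in the article.
    """
    merged: Hits = []
    for docs in collections:
        merged.extend(normalize(search(query, docs)))
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

With a length-normalized score like this one, short documents in one set can crowd out longer documents from another set under "search through", whereas the per-set normalization in "differential search" lets the best hit of each set compete on equal terms.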

We performed a number of experiments using two heterogeneous collections:

  1. legal documents and web documents;
  2. legal documents from two different sources: federal legislation and local Saint Petersburg legislation.

The experiments show that the second method, differential search, slightly increases search quality.