Search and Mining in Web Archives

Klaus Berberich

The World Wide Web evolves constantly: every day, content is added and removed. Some of the new content is published first and exclusively on the Web and reflects current events. In recent years, there has been a growing awareness that such born-digital content is part of our cultural heritage and therefore worth preserving. National libraries and organizations such as the Internet Archive [http://www.archive.org] have taken on this task. Other content was published a long time ago and is now, thanks to improved digitization techniques, available to a wide public via the Web for the first time. Consider, as one concrete example, the archive of the British newspaper The Times, which contains articles published as early as 1785.

Our current research focuses on scalable search and mining techniques for such web archives. Improved search techniques, on the one hand, make it easier for users to access web archives. Mining techniques, on the other hand, help to gain insights into the evolution of language or of popular topics. In the following, we describe three aspects of our current work.

Time travel in web archives

Existing search techniques ignore the time dimension inherent to web archives. For instance, it is not possible to restrict a search so that it retrieves only documents that existed at a specified time in the past.

In our work, we consider time-travel keyword queries that combine a keyword query (e.g., “bundestag election projection”) with a temporal context such as September 2009. For this specific query, only relevant documents that discuss election projections and that existed in September 2009 should be retrieved as results.

Our approach builds on an inverted index that keeps a list of occurrences for every word. For every document in which a word occurs, the inverted index stores, depending on the type of query that has to be supported, the document identifier, how often the word occurs, or the exact positions at which the word can be found in the document. We extend this information with a valid-time interval that records when the word was contained in the document, which enables time-travel queries.
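
The extended index entries can be illustrated with a minimal sketch. It is not the implementation described here, only an illustration under stated assumptions: the names Posting, TimeTravelIndex, add_version and query are hypothetical, time points are encoded as integers of the form YYYYMM, valid-time intervals are half-open, and a query is assumed to require all keywords to be present.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: str
    tf: int      # how often the word occurs in this document version
    begin: int   # first time point at which the version existed (inclusive)
    end: int     # time point at which the version was replaced (exclusive)


class TimeTravelIndex:
    """Hypothetical inverted index whose postings carry a valid-time interval."""

    def __init__(self):
        self.lists = defaultdict(list)  # word -> list of postings

    def add_version(self, doc_id, text, begin, end):
        # Count the words of this version and add one posting per distinct word.
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, tf in counts.items():
            self.lists[word].append(Posting(doc_id, tf, begin, end))

    def query(self, keywords, t):
        # Assumption: conjunctive semantics, i.e. all keywords must occur
        # in a version of the document that was valid at time t.
        result = None
        for word in keywords:
            docs = {p.doc_id for p in self.lists.get(word, [])
                    if p.begin <= t < p.end}
            result = docs if result is None else result & docs
        return result or set()


index = TimeTravelIndex()
index.add_version("news.html", "bundestag election projection", 200909, 200910)
index.add_version("news.html", "bundestag coalition talks", 200910, 200912)
index.add_version("blog.html", "election projection archive", 200911, 201001)

print(index.query(["election", "projection"], 200909))  # {'news.html'}
print(index.query(["election", "projection"], 200911))  # {'blog.html'}
```

The sketch scans the complete list of a word for every query. It ignores ranking and index compression entirely.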

Consecutive versions of the same document tend to differ only slightly. We exploit this observation to reduce index size drastically. To process time-travel queries more efficiently, we keep multiple lists for every word in the inverted index, each of which is responsible for a specific time interval. This introduces redundancy into the index and increases its size, and thus leads to a trade-off between index size and query-processing performance. Our approach casts this trade-off as optimization problems that can be solved efficiently and whose solutions determine the lists to be kept for every word in the inverted index.
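
Both ideas can be illustrated with a small sketch under simplifying assumptions. The function names coalesce and partition are hypothetical, postings are plain tuples (doc_id, tf, begin, end) with half-open intervals, only postings with exactly equal frequencies are merged, and the partition boundaries are given by hand instead of being chosen by solving an optimization problem.

```python
def coalesce(postings):
    """Merge postings of the same document that have the same frequency
    and directly adjacent valid-time intervals."""
    merged = []
    for doc_id, tf, begin, end in sorted(postings, key=lambda p: (p[0], p[2])):
        last = merged[-1] if merged else None
        if last and last[0] == doc_id and last[1] == tf and last[3] == begin:
            merged[-1] = (doc_id, tf, last[2], end)  # extend the previous interval
        else:
            merged.append((doc_id, tf, begin, end))
    return merged


def partition(postings, boundaries):
    """Build one sublist per interval [boundaries[i], boundaries[i + 1]).
    A posting is copied into every sublist whose interval it overlaps."""
    sublists = {}
    for lo, hi in zip(boundaries, boundaries[1:]):
        sublists[(lo, hi)] = [p for p in postings if p[2] < hi and p[3] > lo]
    return sublists


# Postings of one word; the first two versions of d1 did not change the frequency.
postings = [("d1", 2, 0, 5), ("d1", 2, 5, 9), ("d1", 3, 9, 12), ("d2", 1, 3, 4)]

compact = coalesce(postings)
print(compact)  # [('d1', 2, 0, 9), ('d1', 3, 9, 12), ('d2', 1, 3, 4)]

sublists = partition(compact, [0, 6, 12])
print(sum(len(s) for s in sublists.values()))  # 4
```

In this example, coalescing shrinks the list from four postings to three. Partitioning then stores four postings in total, because the posting with interval [0, 9) spans both partitions and is replicated. A query for a single time point only has to read the one sublist that covers it, which is the trade-off between index size and query-processing performance described above.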

Temporal information needs

Information needs often have a temporal dimension, as expressed by a temporal phrase contained in the user’s query, and are best satisfied by documents that refer to a particular time. Existing retrieval models fail for such temporal information needs. For the query “german painters 15th century”, a document with detailed information about the life and work of Albrecht Dürer (e.g., mentioning 1471 as his year of birth) would not necessarily be considered relevant. This is because existing methods are unaware of the semantics inherent to temporal expressions and thus do not know that 1471 refers to a year in the 15th century.

To capture their semantics, we formally represent temporal expressions as time intervals. We then integrate these intervals into a retrieval approach based on statistical language models; this approach has been shown to improve result quality for temporal information needs.
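
The interval representation can be sketched as follows. The sketch covers only the mapping from expressions to intervals, not the language-model integration. The function names to_interval and overlaps are hypothetical, only three simple patterns are recognized, intervals are closed ranges of years, and the n-th century is assumed to span the years 100(n-1)+1 to 100n.

```python
import re


def to_interval(expression):
    """Map a simple temporal expression to a closed interval of years.
    Returns None for expressions this sketch does not recognize."""
    expression = expression.strip().lower()

    # Centuries such as "15th century" (assumed to span 1401 to 1500).
    match = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) century", expression)
    if match:
        century = int(match.group(1))
        return ((century - 1) * 100 + 1, century * 100)

    # Decades such as "1980s".
    match = re.fullmatch(r"(\d{3})0s", expression)
    if match:
        start = int(match.group(1)) * 10
        return (start, start + 9)

    # Single years such as "1471".
    if re.fullmatch(r"\d{4}", expression):
        year = int(expression)
        return (year, year)

    return None


def overlaps(a, b):
    """True if the closed intervals a and b share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


query_time = to_interval("15th century")
print(query_time)                                 # (1401, 1500)
print(overlaps(to_interval("1471"), query_time))  # True
print(overlaps(to_interval("1528"), query_time))  # False
```

With this representation, a document that mentions 1471 can be recognized as referring to the time expressed in the query “german painters 15th century”, even though the two expressions share no words.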

Mining of characteristic phrases

Mining of web archives is another aspect of our current work. More precisely, we are interested in insights about ad-hoc subsets of the web archive, for instance, all documents that deal with the Olympics. Given such an ad-hoc subset, we can identify persons, locations, or, in general, phrases that are characteristic of documents published in a particular year. In our Olympics example, these could include Michael Phelps, Beijing, and “bird’s nest” for documents published in 2008.

To identify such characteristic phrases efficiently, one needs frequency statistics for so-called n-grams (i.e., sequences of one or more words). We develop efficient and scalable techniques to compute these n-gram statistics in a distributed environment. One design objective here is to allow for easy scale-out in order to keep up with the future growth of web archives.
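
The general pattern behind such a computation can be sketched in a single process. This is only an illustration, not the techniques developed here: the function names ngrams, map_partition and reduce_counts are hypothetical, tokenization is plain whitespace splitting, and the partitions stand in for the shares of the archive that separate machines would process.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield all n-grams of length 1 to max_n as tuples of words."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def map_partition(documents, max_n):
    """Count n-grams locally for one partition of the document collection."""
    counts = Counter()
    for text in documents:
        counts.update(ngrams(text.lower().split(), max_n))
    return counts


def reduce_counts(partials):
    """Merge the partial counts of all partitions into global statistics."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each inner list stands for the documents held by one (hypothetical) worker.
partitions = [
    ["Beijing bird's nest stadium"],
    ["bird's nest Beijing"],
]

total = reduce_counts(map_partition(docs, max_n=2) for docs in partitions)
print(total[("bird's", "nest")])   # 2
print(total[("beijing",)])         # 2
print(total[("nest", "stadium")])  # 1
```

Because every partition is counted independently and the partial counts are merged by simple addition, more partitions can be added as the archive grows, which is the scale-out property mentioned above.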

Klaus Berberich

DEPT. 5 Databases and Information Systems
Phone +49 681 9325-5005
Email kberberi@mpi-inf.mpg.de