When we wrote the first edition of this book in 1998, the Web was relatively new, and information retrieval was an old
field that lacked popular appeal. Today the word Google has joined the popular lexicon, and Google indexes more than four
billion Web pages. In 1998, only a few schools taught graduate courses in information retrieval; today, the subject is
commonly offered at the undergraduate level. Our experience with teaching information retrieval at the undergraduate level,
together with a detailed analysis of the topics covered and the effectiveness of the class, is described in [Goharian et al.,
2004].
The term Information Retrieval refers to a search that may cover any form of information: structured data, text, video,
image, sound, musical scores, DNA sequences, etc. The reality is that for many years, database systems existed to search
structured data, and information retrieval meant the search of documents. The authors come originally from the world of
structured search, but for much of the last ten years, we have worked in the area of document retrieval. To us, the world
should be data type agnostic. There is no need for a special delineation between structured and unstructured data. In 1998,
we included a chapter on data integration, and reviews suggested that the only reason it was there was that it covered some of
our recent research. Today, such an allegation makes no sense, since information mediators have been developed that operate
with both structured and unstructured data. Furthermore, the eXtensible Markup Language (XML) has become prevalent in both the
database and information retrieval domains.
We focus on the ad hoc information retrieval problem. Simply put, ad hoc information retrieval allows users to search for
documents that are relevant to user-provided queries. It may appear that systems such as Google have solved this problem, but
effectiveness measures for Google have not been published. Typical systems still have an effectiveness (accuracy) of, at
best, forty percent [TREC, 2003]. This leaves ample room for improvement, with the prerequisite of a firm understanding of
existing approaches.
Information retrieval textbooks on the market are relatively unfocused, and we were uncomfortable using them in our
classes. They tend to leave out details of a variety of key retrieval models. Few books detail inference networks, yet an
inference network is a core model used by a variety of systems. Additionally, many books lack much detail on efficiency,
namely, the execution speed of a query. Efficiency is potentially of limited interest to those who focus only on
effectiveness, but for the practitioner, efficiency concerns can override all others.
Additionally, for each strategy, we provide a detailed running example. When presenting strategies, it is easy to gloss
over the details, but examples keep us honest. We find that students benefit from a single example that runs through the
whole book. Furthermore, every section of this book that describes a core retrieval strategy was reviewed by either the
inventor of the strategy (and we thank them profusely; more thanks are in the acknowledgments!) or someone intimately
familiar with it. Hence, to our knowledge, this book contains some of the gory details of some strategies that cannot be
found anywhere else in print.
Our goal is to provide a book that is sharply focused on ad hoc information retrieval. To do this, we developed a
taxonomy of the field based on a model in which a strategy compares a document to a query, and a utility can be plugged into
any strategy to improve its performance. We cover all of the basic strategies, not just a couple of them,
and a variety of utilities. We provide sufficient detail so that a student or practitioner who reads our book can implement
any particular strategy or utility. The book Managing Gigabytes [Witten et al., 1999] does an excellent job of describing a
variety of detailed inverted index compression strategies. We include the most recently developed and the most efficient of
these, but we certainly recommend Managing Gigabytes as an excellent side reference.
So what is new in this second edition? Most of the core retrieval strategies remain unchanged. Since 1998, numerous papers
have been written about the use of language models for information retrieval, and we have added a new section on language
models. Furthermore, cross-lingual information retrieval, that is, the posing of a query in one language to find documents
in another language, was just in its infancy at the time of the first edition. We have added an entire chapter on the topic
that incorporates information from over 100 recent references.
Naturally, we have included some discussion on current topics such as XML, peer-to-peer information retrieval, duplicate
document detection, parallel document clustering, fusion of disparate retrieval strategies, and information mediators.
Finally, we fixed a number of bugs found by our alert undergraduate and graduate students. We thank them all for their
efforts.
This book is intended primarily as a textbook for an undergraduate or graduate level course in Information Retrieval. It
has been used in a graduate course, and we incorporated student feedback when we developed a set of overhead transparencies
that can be used when teaching with our text. The presentation is available at www.ir.iit.edu.
Additionally, practitioners who build information retrieval systems or applications that use information retrieval
systems will find this book useful when selecting retrieval strategies and utilities to deploy for production use. We
have heard from several practitioners that the first edition was helpful, and we incorporated their comments and suggested
additions into this edition.
We emphasize that the focus of the book is on algorithms, not on commercial products, but, to our knowledge, the basic
strategies used by the majority of commercial products are described in the book. We believe practitioners may find that a
commercial product is using a given strategy and can then use this book as a reference to learn what is known about the
techniques used by the product.
Finally, we note that the information retrieval field changes daily. For the most up-to-date coverage of the field, the
best sources include journals like the ACM Transactions on Information Systems, the Journal of the American Society for
Information Science and Technology, Information Processing and Management, and Information Retrieval. Other relevant papers
are found in the various information retrieval conferences such as ACM SIGIR (www.sigir.org), NIST TREC (trec.nist.gov), and
ACM CIKM (www.cikm.org).