Tutoriais - SBBD

Title: Similarity Search and Indexing for High-Dimensional Data

Eduardo Valle e Matthieu Cord
(mail@eduardovalle.com e matthieu.cord@lip6.fr)

Abstract: Searching by similarity is a critical operation on many systems, and thus has attracted the attention of many disciplines in Computer Sciences, including Computational Geometry, Machine Learning, Multimedia Retrieval and, of course, Databases. To perform efficiently, similarity search requires the support of indexing, which suffers from the infamous "curse of the dimensionality". In this tutorial we will introduce the challenges of indexing and searching high-dimensional data, and present the most recent tools available to "tame the curse". The tutorial takes 3 hours and is divided in three main sections: an introduction, where we define the problem, its applications and its challenges; a presentation of the state of the art, where we present the most important/interesting solutions to the problem; and a study of case of the application of high-dimensional indexing to problems in Content Based Information Retrieval.

The second section (study of the state of the art) is the core of the tutorial. We will show how the (exceedingly prolific) literature in high-dimensional indexing can be tackled by dividing it in classes of solutions: a. Solutions based on space-partitioning: those algorithms are based on the idea of partitioning the space into mutually-exclusive hyper-rectangles; b. Solutions based on data partitioning and clustering: those algorithms partition the data into mutually exclusive sets, which may overlap in the space; c. Solutions based on data approximation and quantization: those techniques quantize the data, reducing the size of the files which have to be sequentially analyzed. They may also reorganize the data to reduce the amount of sequential read necessary; d. Solutions based on metric spaces: those techniques use the triangle inequality to select the portions of the data to analyze.

The try to minimize the number of distances computed, and are useful when the distance function is expensive or when the data do not belong to a vector space; e. Solutions based on decomposition into subspaces: those techniques use multiple parallel indexes of smaller dimensionality than the original space, and then aggregate the results. For each class, we show the archetypical, easy to understand method, and use it to explain the motivation behind that class of solutions and the challenges they face. We then proceed to explore the state of the art, showing how different methods address those challenges. We end this section by considering some common trends among the recently proposed solutions. At the end, the audience will have a good grasp of the current state of the art, the most promising research trends and the challenges still faced by the technology.

Authores: Eduardo Valle is currently a post-doctoral researcher at the Computing Institute of the State University of Campinas (UNICAMP). He has obtained his Ph.D. on Computer Sciences at the University of Cergy-Pontoise, France in 2008, having previously obtained his M.Sc. on Computer Sciences at the Universidade Federal de Minas Gerais (UFMG), Brazil in 2003. He is interested on high-dimensional indexing, multimedia databases, content-based information retrieval (CBIR), classification and learning. He has a soft spot for the applications of Information Technology to the domain of Cultural Heritage. Matthieu Cord is a full professor in the Machine Learning and and Information Retrieval team of the LIP6 labs of the University of Paris 6. He has obtained his Ph.D. degree in Image Processing in 1998 at the University of Cergy-Pontoise. In 1999, he joined the Luc Van Gool's Computer Vision team of ESAT lab, Belgium as a postdoctoral resercher and then joined the ETIS labs in France to create the Image Indexing research group. He obtained a full professor position at the LIP6 labs in 2006.


XXIV SBBD XXIII SBES - 05 a 09 de Outubro de 2009