Back to department main page
COURSE PRESENTATION FORM - SIMILARITY SEARCH - 2009/2010


COURSE NAME: Similarity Search

COURSE CODE:  70227 (BSc / BSc Old / MSc 509)

LECTURER: Nikolaus Augsten

TEACHING ASSISTANT: None

TEACHING LANGUAGE: English

CREDIT POINTS: 4

LECTURE HOURS: 24

EXERCISE HOURS: 12

TIMESPAN: 28.09.2009 - 23.01.2010

TIMETABLE: see Timetable Page

OFFICE HOURS LECTURER: Wednesday 8:45–10:15 (Oct 7 – Dec 23), office A4.17

OFFICE HOURS TEACHING ASSISTANT:  --


PREREQUISITES
Students should be familiar with basic concepts in databases (including relational databases, SQL, and relational algebra) and algorithms, as well as having good programming skills in Java. Further basic knowledge about XML is required. This material is taught in the following courses:
  • Introduction to Programming
  • Introduction to Databases
  • Data Structures and Algorithms
  • (XML and Semistructured Databases)

OBJECTIVES
This course will discuss similarity search techniques for flat strings and hierarchical data (for example, XML).
Selected methods will be presented, their effectiveness and efficiency will be discussed.
Filtering techniques to improve the efficiency will be introduced.
The students will implement similarity joins in a relational database management system.

SYLLABUS
  • Introduction: Approximate Matching
  • Similarity Joins and Stable Matching Algorithms
  • String Edit Distance: Algorithm and Complexity
  • q-Gram Distance for Strings: Algorithm and Complexity
  • Implementation of q-Grams in a Relational Database
  • q-Grams as a Filter for the Edit Distance
  • Hierarchical Data and XML in Relational Database Systems
  • Tree Edit Distance: Algorithms and Complexity
  • Tree Edit Distance Lower Bound: The Binary Branch Distance
  • Similarity Joins with Reference Sets
  • pq-Gram Distance for Ordered Trees: Algorithm and Complexity
  • Similarity of Unordered Trees (Data-Centric XML)

TEACHING FORMAT
Frontal lectures
Mini-project done in groups of 2-3 students (workload: 40 hours)

ASSESSMENT
Project and oral exam.

Project. The group defends the project by
  • handing in the resulting code;
  • handing in the project report;
  • giving a presentation of the project. The presentation is followed by a discussion.

Oral exam. There will be no midterms. The oral exam at the end will cover all topics treated in the course. Each student will be examined individually for 15 minutes.

Final grade. Both parts (project and oral exam) are graded with up to 30 points. Each part has to be passed (at least 18 points). The student has to pass the project before taking the oral exam.  The final grade weights the project with 40% and the oral exam with 60% (final_grade = grade_project * 0.4 + grade_oral_exam * 0.6).

The positive grade for the project counts for all 3 regular exam sessions. After that, the student has to defend a new project.

READING LIST
There is no single textbook that covers the entire course. The course material is collected from the following book chapters and research papers:
  • Alberto Apostolico and Zvi Galill, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997.
  • Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc.
  • Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, and Johann Gamper. Approximate Joins for Data-Centric XML. In Proceedings of the International Conference on Data Engineering (ICDE), Cancun, Mexico, April 2008.
  • Minos Garofalakis and Amit Kumar. XML stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279–332, 2005.
  • Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc.
  • Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press.
  • Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.
  • Esko Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211, January 1992.
  • Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press.
  • Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.

SOFTWARE USED
Java, MySQL

LEARNING OUTCOME
Students completing this course should be able to:
  • understand the need for similarity search techniques in real-world applications;
  • use the similarity search techniques in a relational database management system;
  • explain the algorithms with an example;
  • compare different approaches in terms of efficiency and effectiveness;
  • understand the application of filters for improving the efficiency.


COURSE PAGE
http://www.inf.unibz.it/dis/teaching/SS/




© UniBz