Accuracy of Approximate String Joins Using Grams

O. Hassanzadeh, Mohammad Sadoghi, and R. Miller.

In Proceedings of the 5th International Workshop on Quality in Databases (QDB'07) - VLDB Workshop, 2007.

Abstract

Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends heavily on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) extracted from the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams. We then thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare the measures based on the highest accuracy they can achieve on different datasets.
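
As a rough illustration of the idea described above, the following Python sketch splits strings into q-grams and joins two lists on a similarity threshold. It is not taken from the paper: the function names, the padding characters `#` and `$`, and the choice of Jaccard overlap as the measure are all assumptions made for this example, and the nested-loop join is written for clarity rather than efficiency.

```python
def qgrams(s, q=2):
    """Return the set of q-grams (substrings of length q) of s.

    The string is padded with q-1 '#' characters at the start and q-1 '$'
    characters at the end so that its first and last characters appear in
    as many q-grams as the others. The padding characters are an assumption
    of this sketch.
    """
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard overlap of the q-gram sets of a and b (a value in [0, 1])."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty q-gram sets are treated as identical
    return len(ga & gb) / len(ga | gb)


def approximate_join(left, right, threshold, q=2):
    """Return all pairs (l, r) whose q-gram similarity is >= threshold.

    Naive nested-loop join, for illustration only.
    """
    return [(l, r) for l in left for r in right
            if jaccard(l, r, q) >= threshold]


if __name__ == "__main__":
    left = ["abc"]
    right = ["abc", "abd", "xyz"]
    # A stricter threshold keeps fewer pairs than a looser one.
    print(approximate_join(left, right, threshold=0.5))
    print(approximate_join(left, right, threshold=0.3))
```

Lowering the threshold admits more candidate pairs, which is the trade-off between threshold value and join accuracy that the abstract refers to.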

Readers who enjoyed the above work may also like the following:


  • Optimizing Key-Value Stores for Hybrid Storage Architectures.
    Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, and Hans-Arno Jacobsen.
    In Proceedings of CASCON, 2014.
    Tags: key-value stores, leveldb
  • Adaptive Parallel Compressed Event Matching.
    Mohammad Sadoghi and Hans-Arno Jacobsen.
    In 30th IEEE International Conference on Data Engineering, 2014.
    Tags: content-based matching, publish/subscribe, event processing
  • CaSSanDra: An SSD Boosted Key-Value Store.
    Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, and Hans-Arno Jacobsen.
    In 30th IEEE International Conference on Data Engineering, pages 1162-1167, 2014.
    Tags: cassandra, big data, key-value store, nosql