String Similarity Algorithms and Record Linkage

String Similarity Algorithms and Record Linkage

I've been diving into data cleaning, and exploring the connection between clean textual data in DataFrames and the effective use of string-based algorithms for this task.

There are many string similarity algorithms in existence today, such as Hamming, Levenshtein, Jaro, Cosine, etc. However, I am limiting my scope to the Demarau-Levenshtein Algorithm and the Cosine Algorithm.

At a high level, the D-L algorithm measures the distance between two strings by considering the minimum number of insertions, deletions, transpositions, and substitutions required to convert one string into another.
Take for example, two strings bellow and below . Using the D-L algorithm, the distance would be 1 because it would require the deletion of the first or second l in bellow to transform it into below . One drawback of this algorithm is that it relies on ordered strings and lacks semantic understanding. Consequently, Ted Lasso and Lasso Ted might be deemed dissimilar by the algorithm.

The Cosine algorithm works with string vectors, measuring similarity by comparing their directions. If two string vectors point in the same direction, it returns 1; otherwise, it returns a value less than 1. Unlike the D-L algorithm, this approach is beneficial as it understands semantics. Consequently, Ted Lasso and Lasso Ted would likely be deemed similar by this algorithm.

What is Record Linkage?

If you are familiar with the good old joins, you would recognise the unique key constraint between two different datasets that you would like to join. When joining datasets, it is essential that they have similar columns on which to join.

However, this constraint is what differentiates record linkage from joins. Compared to joins, record linkage is more flexible because it doesn't require that the datasets have similar columns before merging. Instead, it utilises the power of string similarity algorithms to identify related rows in datasets and ensures that the datasets are merged without duplicating similar rows.

How does Record Linkage work?

I will be using sample datasets called datab and datac and the recordlinkage python package to explain key concepts in record linkage. It is worth mentioning that different record linkage packages exist and I discovered one built by the data team in the Gov.UK team called Splink.

Record linkage is carried out in three steps: Indexing, Comparison, Filtering, and Linking.

  1. Indexing

    At a high level, data indexing organises table rows to enhance search efficiency. In computer science, unsorted data leads to inefficient searches, requiring scanning the entire dataset, resulting in slower performance as data size increases. Sorted data allows the application of efficient search algorithms with logarithmic running time, making searches faster—a delight for computer scientists.

    The datasets, datab and datac, share similar data, but merging them requires a unique identifier column to avoid duplication. Lacking such a column could treat Ted Lasso and Lasso Ted as distinct entities, despite referring to the same individual.

    To address this, we enhance efficiency by using indexing. This involves creating a 'block' column, like Date of Birth, to uniquely identify rows in both datasets. Comparing rows within the same block streamlines the process, ensuring that only related rows are compared, reducing redundancy, and speeding up the operation.

    The coloured lines attempt to show rows in the same block that should be compared with each other.

    Using the recordlinkage package, indexing would be carried out like this:

     import recordlinkage
     import pandas as pd
     datab = {'Full Name': ['Ted Lasso', 'Sam Obisanya', 'Roy Kent', 'Rebecca Welton', 'Keeley Jones'], 'Date of Birth': ['09-10-1984', '09-10-1996', '09-10-1992', '09-10-1984', '09-10-1992']}
     datac = {'Full Name': ['Lasso Ted', 'Obisanya Sam', 'Kent Roy', 'Welton Rebecca', 'Jones Keeley'], 'Date of Birth': ['09-10-1984', '09-10-1996', '09-10-1992', '09-10-1984', '09-10-1992']}
     datab = pd.DataFrame(datab)
     datac = pd.DataFrame(datac)
     # create column block for efficient search
     indexer = recordlinkage.Index()
     index_block = indexer.block('Date of Birth')
     # create pairs for comparison based on the block
     index_pairs = indexer.index(datab, datac)

    The pairs object returned contains pairs of row indices that are in the same block for comparison. The output looks like this:

     MultiIndex([(0, 0),
                 (0, 3),
                 (3, 0),
                 (3, 3),
                 (1, 1),
                 (2, 2),
                 (2, 4),
                 (4, 2),
                 (4, 4)],
  2. Comparison

    As the name suggests, this process involves using comparison functions to compare each individual pair in the same block to find potential matches.

    Using the recordlinkage package, this process would look like:

     # create a compare object
      compare = recordlinkage.Compare()
      # carry out comparison based on your usecase
      # in this case, find exact matches for date of birth
      # the method takes in the name of the date of birth columns in datab and datac and the label of the new dataframe
      compare.exact('Date of Birth', 'Date of Birth', label='Date of Birth')
      # find similar matches for full name column based on string similarity
      # any string similarity algorithm can be used, in this case I will use cosine
      compare.string('Full Name', 'Full Name', method='cosine', label='Full Name')
      # find matches
      matches = compare.compute(index_pairs, datab, datac)

    The output is a MultiIndex object where the first index corresponds to rows in datab, and the second index corresponds to rows in datac that are in the same block as the rows in datab. The columns, generated from comparison functions, contain values reflecting the similarity of the string pairs being compared. The output appears as follows:

     Date of Birth  Full Name
     0 0              1    1.00000
       3              1    0.00000
     3 0              1    0.00000
       3              1    1.00000
     1 1              1    1.00000
     2 2              1    1.00000
       4              1    0.27735
     4 2              1    0.27735
       4              1    1.00000
  3. Filtering
    The MultiIndex object generated post-comparison is filtered based on the use case. In this scenario, both the Date of Birth and Full Name columns should exhibit similarity. Therefore, summing these columns across rows should yield a value of 2. Any row not summing up to 2 is then filtered out.

     matches = matches[matches.sum(axis=1) == 2]
  4. Linking
    After identifying similar rows in datab and datac, they can be removed from either dataset before the final merge.

     # get indices from datac only
     dup_rows = matches.index.get_level_values(1)
     # remove duplicate rows from datab through subsetting
     datac = datac[~datac.index.isin(dup_rows)]
     # append with datab
     final_data = datab.concat(datac)

    The output appears as follows:

     Full Name Date of Birth
     0       Ted Lasso    09-10-1984
     1    Sam Obisanya    09-10-1996
     2        Roy Kent    09-10-1992
     3  Rebecca Welton    09-10-1984
     4    Keeley Jones    09-10-1992

    This overview clarifies how string similarity algorithms play a crucial role in improving data cleanliness and analysis through effective record linkage. By linking related information, these algorithms enhance accuracy and provide a more comprehensive view of the data. Embracing these techniques refines datasets, empowering more robust analyses and deepening our understanding of underlying data patterns and relationships.🚀

Did you find this article valuable?

Support Rita Okonkwo by becoming a sponsor. Any amount is appreciated!