I've been diving into data cleaning, and exploring the connection between clean textual data in DataFrames and the effective use of string-based algorithms for this task.
There are many string similarity algorithms in existence today, such as Hamming, Levenshtein, Jaro, Cosine, etc. However, I am limiting my scope to the Damerau-Levenshtein algorithm and the Cosine algorithm.
At a high level, the D-L algorithm measures the distance between two strings by considering the minimum number of insertions, deletions, transpositions, and substitutions required to convert one string into another.
Take, for example, the two strings `bellow` and `below`. Using the D-L algorithm, the distance would be 1, because it would require the deletion of the first or second `l` in `bellow` to transform it into `below`. One drawback of this algorithm is that it relies on character order and lacks semantic understanding. Consequently, `Ted Lasso` and `Lasso Ted` would be deemed dissimilar by the algorithm.
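As a sketch, the restricted variant of the Damerau-Levenshtein distance (also called optimal string alignment) can be implemented with a small dynamic programming table:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # dp[i][j] = minimum edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[len(a)][len(b)]

print(damerau_levenshtein("bellow", "below"))        # 1: delete one 'l'
print(damerau_levenshtein("Ted Lasso", "Lasso Ted")) # much larger: word order matters here
```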
The Cosine algorithm works on vector representations of strings (for example, word or n-gram counts), measuring similarity by comparing their directions. If two string vectors point in the same direction, it returns 1; otherwise, it returns a value less than 1. Unlike the D-L algorithm, this approach is insensitive to word order, because the vectors only record which tokens occur and how often. Consequently, `Ted Lasso` and `Lasso Ted` would be deemed similar by this algorithm.
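A minimal sketch of this, using whole-word count vectors (the recordlinkage package's cosine comparison works on character q-grams instead, so this is an illustration rather than the library's implementation):

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    # dot product over the words the two vectors share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("Ted Lasso", "Lasso Ted"))  # 1.0: same words, different order
print(cosine_similarity("Ted Lasso", "Roy Kent"))   # 0.0: no words in common
```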
What is Record Linkage?
If you are familiar with good old joins, you will recognise the unique-key requirement between the two datasets you would like to join: when joining datasets, it is essential that they share a column on which to join.
However, this constraint is what differentiates record linkage from joins. Compared to joins, record linkage is more flexible because it doesn't require that the datasets have similar columns before merging. Instead, it utilises the power of string similarity algorithms to identify related rows in datasets and ensures that the datasets are merged without duplicating similar rows.
How does Record Linkage work?
I will be using sample datasets called `datab` and `datac` and the `recordlinkage` Python package to explain the key concepts in record linkage. It is worth mentioning that other record linkage packages exist; I discovered one called Splink, built by a data team in the UK government (Gov.UK).
Record linkage is carried out in four steps: indexing, comparison, filtering, and linking.
Indexing
At a high level, data indexing organises table rows to enhance search efficiency. In computer science, unsorted data leads to inefficient searches, requiring scanning the entire dataset, resulting in slower performance as data size increases. Sorted data allows the application of efficient search algorithms with logarithmic running time, making searches faster—a delight for computer scientists.
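As a sketch of that speed-up, Python's standard-library `bisect` module performs a binary search on a sorted list in logarithmic time, instead of scanning every element:

```python
import bisect

dates = sorted(['09-10-1984', '09-10-1996', '09-10-1992',
                '09-10-1984', '09-10-1992'])

def contains(sorted_list, value):
    """Binary search: O(log n) membership test on a sorted list."""
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

print(contains(dates, '09-10-1992'))  # True
print(contains(dates, '01-01-2000'))  # False
```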
The datasets `datab` and `datac` share similar data, but merging them requires a unique identifier column to avoid duplication. Lacking such a column could mean `Ted Lasso` and `Lasso Ted` are treated as distinct entities, despite referring to the same individual. To address this, we enhance efficiency by using indexing. This involves choosing a 'block' column, like `Date of Birth`, that groups related rows in both datasets. Comparing rows only within the same block streamlines the process, ensuring that only related rows are compared, reducing redundancy and speeding up the operation. The coloured lines attempt to show rows in the same block that should be compared with each other.
Using the recordlinkage package, indexing would be carried out like this:
```python
import recordlinkage
import pandas as pd

datab = {'Full Name': ['Ted Lasso', 'Sam Obisanya', 'Roy Kent',
                       'Rebecca Welton', 'Keeley Jones'],
         'Date of Birth': ['09-10-1984', '09-10-1996', '09-10-1992',
                           '09-10-1984', '09-10-1992']}
datac = {'Full Name': ['Lasso Ted', 'Obisanya Sam', 'Kent Roy',
                       'Welton Rebecca', 'Jones Keeley'],
         'Date of Birth': ['09-10-1984', '09-10-1996', '09-10-1992',
                           '09-10-1984', '09-10-1992']}
datab = pd.DataFrame(datab)
datac = pd.DataFrame(datac)

# block on Date of Birth for an efficient search
indexer = recordlinkage.Index()
indexer.block('Date of Birth')

# create candidate pairs for comparison based on the block
index_pairs = indexer.index(datab, datac)
print(index_pairs)
```
The pairs object returned contains pairs of row indices that are in the same block for comparison. The output looks like this:
```
MultiIndex([(0, 0), (0, 3), (3, 0), (3, 3), (1, 1),
            (2, 2), (2, 4), (4, 2), (4, 4)])
```
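To see why blocking helps, we can count the candidate pairs for the toy data above by hand: comparing every row of `datab` with every row of `datac` would take 5 × 5 = 25 comparisons, while blocking on `Date of Birth` keeps only the pairs that share a birth date (a plain-Python sketch):

```python
from itertools import product

datab_dob = ['09-10-1984', '09-10-1996', '09-10-1992', '09-10-1984', '09-10-1992']
datac_dob = ['09-10-1984', '09-10-1996', '09-10-1992', '09-10-1984', '09-10-1992']

# every possible pairing of a datab row with a datac row
all_pairs = list(product(range(len(datab_dob)), range(len(datac_dob))))

# keep only pairs whose rows fall in the same block (same date of birth)
blocked = [(i, j) for i, j in all_pairs if datab_dob[i] == datac_dob[j]]

print(len(all_pairs))  # 25 comparisons without blocking
print(len(blocked))    # 9, the same pairs as the MultiIndex above
```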
Comparison
As the name suggests, this process involves using comparison functions to compare each individual pair in the same block to find potential matches.
Using the recordlinkage package, this process would look like:
```python
# create a compare object
compare = recordlinkage.Compare()

# carry out comparisons based on your use case:
# here, find exact matches for date of birth.
# the method takes the date-of-birth column names in datab and datac
# and a label for the resulting column
compare.exact('Date of Birth', 'Date of Birth', label='Date of Birth')

# find similar matches for the full name column based on string similarity;
# any string similarity algorithm can be used, in this case cosine
compare.string('Full Name', 'Full Name', method='cosine', label='Full Name')

# compute the comparisons for each candidate pair
matches = compare.compute(index_pairs, datab, datac)
print(matches)
```
The output is a DataFrame indexed by a MultiIndex, where the first level corresponds to rows in `datab` and the second level corresponds to rows in `datac` that are in the same block as the rows in `datab`. The columns, generated from the comparison functions, contain values reflecting the similarity of the string pairs being compared. The output appears as follows:

```
     Date of Birth  Full Name
0 0              1    1.00000
  3              1    0.00000
3 0              1    0.00000
  3              1    1.00000
1 1              1    1.00000
2 2              1    1.00000
  4              1    0.27735
4 2              1    0.27735
  4              1    1.00000
```
Filtering
The DataFrame generated post-comparison is filtered based on the use case. In this scenario, both the `Date of Birth` and `Full Name` columns should exhibit similarity, so summing these columns across each row should yield a value of 2. Any row not summing to 2 is then filtered out.

```python
# keep only pairs that match on both columns
matches = matches[matches.sum(axis=1) == 2]
```
Linking
After identifying similar rows in `datab` and `datac`, the duplicates can be removed from either dataset before the final merge.

```python
# get the datac indices of the matched pairs
dup_rows = matches.index.get_level_values(1)

# remove the duplicate rows from datac through subsetting
datac = datac[~datac.index.isin(dup_rows)]

# append the remaining rows to datab
final_data = pd.concat([datab, datac])
print(final_data)
```
The output appears as follows:
```
        Full Name Date of Birth
0       Ted Lasso    09-10-1984
1    Sam Obisanya    09-10-1996
2        Roy Kent    09-10-1992
3  Rebecca Welton    09-10-1984
4    Keeley Jones    09-10-1992
```
This overview clarifies how string similarity algorithms play a crucial role in improving data cleanliness and analysis through effective record linkage. By linking related information, these algorithms enhance accuracy and provide a more comprehensive view of the data. Embracing these techniques refines datasets, empowering more robust analyses and deepening our understanding of underlying data patterns and relationships.🚀