UTL_Match and Soundex for ‘Fuzzy’ Matching of Data in an Oracle Database

For other A-Team articles by Richard, click here

Introduction

In today’s data-centric world, the BI professional is often faced with the problem of joining data sets from different sources.

In a normalized relational database, you can typically just join column A to column B, and be on your way.

Unfortunately that’s always not the case.  Take the following example.  In one database we have an employee called Peter Parker. In the other database, we see 4 different rows of data for Peter Parker, Pete Parker, P.Parker, Peter Parker JR. This may well be the same person, but a traditional join will only match to 1 of the records.

This is where ‘fuzzy’ matching comes in. Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database.

Oracle has tools that can help – Enterprise Data Quality, for instance.  But it’s not always practical to bring in another tool.

Fortunately the Oracle Database comes with inbuilt functionality that may offer a solution.

This article will cover the UTL_Match command, that can help join two columns where the data is close, but doesn’t match exactly. It will also touch on the Soundex command, which can also be used in some circumstances, to join data based on a phonetic match.

 

fozzie_bear

Fuzzy – not Fozzie !!

The Commands

UTL_Match

UTL_Match provides two methods to calculate how ‘close’ different strings are to each other.  This difference – or ‘match certainty’ – can then be used as part of a join condition.  The 2 approaches:

1. “Edit Distance” is a measure of Similarity between two strings, s1 and s2. The distance is the number of insertions, deletions or substitutions required to transform s1 to s2.  For instance, the Edit Distance between strings “Shackleford” and “Shackelford” is 2.
2. The “Jaro-Winkler algorithm” is another way of calculating distance between two strings. This method, developed at the U.S. Census, is a String Comparison measure that gives values of partial agreement between two strings. The string comparison accounts for length of strings and partially accounts for typical human errors made in alphanumeric strings.

Both approaches will return a numeric value indicating the match confidence when comparing 2 strings.

More details on UTL_Match can be found here in the Oracle documentation,

Soundex

Soundex compares data items with their audiable or spoken values. It is a phonetic algorithm for indexing names by sound, as pronounced in English. For example, REIN, REIGN, and RAIN are all spelled differently but sound the same when spoken aloud.  Soundex does not return a numeric value based on matching level, instead will either return a match (or many matches), or none.

More details of the Soundex function can be found here in the Oracle documentation.

Examples of how to use both UTL_Match and Soundex will be used in the example problem below.

 

The Problem

To demonstrate the differences in the functions, a real world example will be used.  This should be similar enough to issues relevant to your workplace, so that the SQL examples, and findings, can be easily tweaked to be applicable.

Recently I was asked to look into viewing metrics from this website – A-Team Chronicles.

What subjects and articles are most popular ?

The problem – there is not a single place to get this information.  We need to use 2 different sets of data from 2 different companies.

  • Data from the hosting site, that provides the names of the articles and the authors – but very little in the way of viewing metrics
  • Data from a web-metrics site, that captures viewing metrics about the articles, but only limited information about the post itself.

The only element of data that appears in both data sets, is ‘Post Name’, but one set of data truncates this field to 60 characters and often has non-standard characters added to the name to replace punctuation.

As an example – in source 1, the post name field contained this entry:

WCF Interoperability with Java Kerberos – Tricky Problem â

And in source 2,

WCF Interoperability with Java Kerberos – Tricky Problem – Simple Solution

Doing a join, truncating source 2 to be 60 characters, still wouldn’t work in this case due to the non-standard characters.

As part of the analysis, I looked at both UTL_match algorithms and Soundex.

 

The Approach

I took the two data sets and then compared the 2 UTL_Match algorithms and Soundex, to see which offered the best results for ‘fuzzy’ joins

1. UTL_Match with ‘Edit Distance’

The syntax for this command is as follows:

utl_match.edit_distance_similarity (field1,field2)

This will return an integer value between 0 (no match) and 100 (perfect match).

To create a join condition, in the example below, we make the join where the match confidence is > 75

select
table1.page_title,
table2.page_title,
UTL_MATCH.EDIT_DISTANCE_SIMILARITY (table1.page_title,table2.page_title) as match_confidence
from table1
join table2 on
UTL_MATCH.EDIT_DISTANCE_SIMILARITY (table1.page_title,table2.page_title) > 75

 

The first few results, based on lowest Match Certainty over 75 are as shown in the table below.

You can see from the first row, that despite page title from source 1 having multiple non-standard chaacters, that the algorithm was able to match to the correct record from source 2.

You can edit the match certainty to best fit your data.  Keep in mind that the lower the match certainty value, the more chance you will have of matching to incorrect data, and also to matching to multiple rows of data.

 

Source 1 Page Title Source 2 Page Title Match Certainty
“Lift and Shift� On-Premise RPD to BI Cloud Service (BIC Lift and Shift On-Premise RPD to BI Cloud Service (BICS) 76
Using Oracle BI Publisher to Extract Data From Oracle Sales Using Oracle BI Publisher to Extract Data From Oracle Sales and ERP Clouds 80
How to Recover the RPD Repository Password in Fusion Applica How to Recover the RPD Repository Password in Fusion Applications or OBIEE 82

 

2. UTL_Match with Jaro Winkler

The syntax for this command is as follows:

utl_match.jaro_winkler (field1,field2)

This will return a decimal value between 0 (no match) and 1 (perfect match).

To create a join condition, in the example below, we make the join where the match confidence is > 0.85

select
table1.page_title,
table2.page_title,
UTL_MATCH.JARO_WINKLER (table1.page_title,table2.page_title) as match_confidence
from table1
join table2
on UTL_MATCH.JARO_WINKLER (table1.page_title,table2.page_title) > 0.85

The first few results, based on lowest Match Certainty over 0.85 are shown in the table below.

You can see the first row was again matched, despite the non-standard characters.

The next 2 rows show that different page titles from source 1, were matched to the same page title from source 2.  Both titles are similar.  The correct match had the higher ‘match certainty’.

 

Source 1 Page Title Source 2 Page Title Match Certainty
“Lift and Shift� On-Premise RPD to BI Cloud Service (BIC Lift and Shift On-Premise RPD to BI Cloud Service (BICS) 0.8847672724484319
BICS Data Sync – Running Post Load Procedures Against DBCS BICS Data Sync – Running Post Load Procedures Against DBCS and Oracle RDBMS 0.9273846153846155
BICS Data Sync – Running Post Load Procedures against a Sc BICS Data Sync – Running Post Load Procedures Against DBCS and Oracle RDBMS 0.906331983805668

 

To improve the matching, you could either increase the ‘match certainty’ threshold, which will reduce the likelihood of duplicate hits, but would also reduce the number of hits overall.  In this example, if we changed the ‘match certainty’ to > 0.91, for instance, that would eliminate the false match on the duplicate row, but would also eliminate the correct match for the first record.

A different approach is to group the results by the source 1 page title, and then take only the highest match certainty from source 2.

In my testing, this approached worked well.

 

3. Soundex

This command differs in that there is no numeric ‘match confidence’.  The Soundex command returns a short alpha numeric string calculated based on the string that is being passed (more details on the calculation and output can be found here)

The syntax for this command is as follows:

soundex(text_field)

 

The join condition is therefore different, as shown below for this use case.

select
table1.page_title,
table2.page_title,
from table1
join table2 on soundex(table1.page_title) = soundex(table2.page_title)

Because of the length of the page titles in this example, the soundex command created many duplicate matches.  Since there is no way to calculate a match certainty, there is no easy way to select only the ‘best match’.  For this use case, this was the least effective approach.  For shorter matching strings –  names of people as an example – this approach will likely be more effective.

 

Results

On A-Team Chronicles when I did this analysis, there were 1,674 unique articles.

Attempting to join source 1 to source 2 on a ‘traditional’ join, where source 2 page name was truncated to 60 characters to match source 1, this resulted in 1,032 matching rows – so a 61% success rate.

Using Soundex, 337 rows joined correctly – so 20%

Using Utl_Match with the distance calculation, 1,060 matched successfully, so 63%, and a little better than using the straight join option.

Using Utl_Match with jaro winkler, 1,489 rows matched successfully, so 89%.

In this use case, clearly Utl_match with jaro winkler was the most accurate.  This will vary with each use case and the characteristics of the data you are trying to match.

Note – using fuzzy matching can be CPU / memory intensive on the database.  Much more so than a regular join.  As such, the recommendation is to use this only where traditional join methods don’t work.  In this use case, the best approach would be to first try to join with the traditional join, truncating the second set of data to 60 characters.  That would remove 61% of the data.  With the remaining smaller set of un-matched data, that would then be run through the UTL_match join.  Doing it this way will reduce database workload, and provide quicker results.

 

Summary

This article walked through 3 approaches for ‘fuzzy joins’ to help match similar datasets.  These are built into the Oracle database and can offer an easy way to join data sets that are similar, but do not exactly match.

For other A-Team articles by Richard, click here

Add Your Comment