Levenshtein Distance and the concept of Fuzzy matching in Python

Sreemanto Kesh
Analytics Vidhya
Published in
4 min readApr 27, 2020

--

Ever wondered how spellchecks and auto-corrections in your mobile phones bails you out by automatically suggesting the words that you were just about to type?

I’ve come across so many amazing functionalities in Python but the fuzzywuzzy package in particular caught my interest. So i decided to write a blog about this topic and share it to wider network.

Down below are the sub topics of this blog :

Levenshtein Distance

Fuzzywuzzy Package

Some Examples

End Notes

Levenshtein Distance

The concept of Levenshtein Distance sometimes also called as Minimum Edit distance is a popular metric used to measure the distance between two strings. It is calculated by counting number of edits required to transform one string into another. The edits could be one of the following:

  • Addition of a new letter.
  • Removal of a letter.
  • Substitution.

Mathematical formula behind the calculation goes by something like this.

The formula above contains too many notations and subscripts too wrap one’s head around. I will quickly explain this by simple demonstration. Consider below the two strings ‘Shallow’ and ‘Follow’.

Following are the 3 simple edits required to change the 1st string to the other.

Edit 1 : Remove the 1st letter ‘S’.

Edit 2 : Substitute ‘h’ with ‘F’.

Edit 3 : Substitute ‘a’ with ‘o’.

And there we have it the Levenshtein distance between the two strings ‘Shallow’ and ‘Hallow’ is 3.

Now, there could be multiple ways of transitioning from one word to another but Levenshtein distance chooses the smallest possible path.

I have written a code in order to calculate the Levenshtein Distance for any two given strings. I highly recommend you to go through it. You can find the links to my notebook down below. Feel free to try out some other combinations and do let know if you find this helpful in the comment section.

Levenshtein Distance Calculation in Python : Link

Also, to understand the calculations in more details and to make sense of the bizarre mathematical formula shown above do watch this video. Very well explained, very intuitive and I highly recommend you to go through it once.

Fuzzywuzzy Package

The concept of fuzzy matching is to calculate similarity between any two given strings. And this is achieved by making use of the Levenshtein Distance between the two strings.

fuzzywuzzy is an inbuilt package you find inside python which has certain functions in it which does all this calculation for us. I’m going to discuss four of them which are as follows:

fuzz.ratio()

fuzz.partial_ratio()

fuzz.token_sort_ratio()

fuzz.token_set_ration()

Note : To use this you have to first install this package. Here’s how you do it.

pip install fuzzywuzzy

Examples

You may need to restart your kernel post installation.

ratio()

from fuzzywuzzy import fuzz
fuzz.ratio('My name is Sreemanta','My name is Sreemanta Kesh')

Output : 89

The output is indicative of the fact that both the strings are 89% similar. Let us try some more examples.

print(fuzz.ratio('My name is Sreemanta','My name is Sreemanta '))
print(fuzz.ratio('My name is Sreemanta','My name is Sreemanta'))

Output : 98

Output : 100

fuzz.ratio() returns 100% only when it finds exact match. The 2nd example stated is an exact match where as the 1st one differs by a space.

fuzz.ratio('My name is Sreemanta','Sreemanta name is My')

Output : 45

This tells us that order of the string also matters while comapring.

partial_ratio()

Lets perform the same example using partial_ratio() method.

print(fuzz.partial_ratio('My name is Sreemanta','My name is Sreemanta Kesh'))
print(fuzz.partial_ratio('My name is Sreemanta','My name is Sreemanta '))

Output : 100

Output : 100

Notice that the given strings are not same in either of the two cases still it gives a 100% match. This is because partial_ratio() is just checking if either of the string is a sub string of the other. Down below is another example to confirm the same

fuzz.partial_ratio('New York City','New York')

Output : 100

token_sort_ratio()

fuzz.token_sort_ratio('My name is Sreemanta','Sreemanta name is My   ')

Output : 100

This result is sharply different from the one observed when same command was run with ratio() method which yielded 45% match.

fuzz.token_sort_ratio('My name is Sreemanta','sreemanta name is My **  ')

Output : 100

From the above two example we can conclude that

  1. Order of the words does not matter.
  2. It also ignores punctuation.

It follows the concept of tokenization where the strings are converted in to tokens and then they are sorted in alphabetical order and thereafter the comparison happens.

token_set_ratio()

print(fuzz.token_sort_ratio('My name is Sreemanta','Sreemanta name is My  Kesh '))
print(fuzz.token_set_ratio('My name is Sreemanta','Sreemanta name is My Kesh '))

Output : 89

Output : 100

token_set_ratio() takes a more flexible approach than token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio(). Extra or same repeated words do not matter.

fuzz.token_set_ratio('Jingle Bells',' Bells Jingle Bells')

Output : 100

End Notes

When I performed fuzzy matching exercise first time I pondered over the following two questions

  1. How Levenshtein Distance (LD) is being used to calculate the ratio?
  2. What is this ratio?

After doing a small research I was able to unearth the formula.

Ratio = (len(str1)+len(str2) - LD) / (len(str1)+len(str2))

Exercise for you : Try to embed this formula into my Levenshtein Notebook

There are some more functionalities present inside the fuzzywuzzy package such as WRatio(), URation etc. I urge you to explore more.

Personally I have used fuzzy matching in my current organisation where the aim was to map ‘Vendor Names’ with it’s ‘Invoice Reference’ and check for duplicity by setting the fuzz.ratio() threshold to 96%. This methodology has a various other important application across the industries like spell checks as listed in the beginning, CRM applications, Bioinformatics etc.

Hope you had a great time reading my blog and learned something new.

Thank you readers.

Notebook : Link

Documentation : Link

--

--