The Person
It's always easier to understand something if we have an example. Let's say we have a customer named Shehadeh Rafiq Deha.
Next, we'll want to make sure he isn't on our sanctions list. To do that, you'll screen his name against the sanctions database.
1. Make an API call
Your first step will be to make an API call to screen for the name Shehadeh Rafiq Deha.
You might find an exact match immediately, but it's much more likely that your customer's name, even if he is on the sanctions list, won't match exactly. That's especially true if your customer's name comes from a language that doesn't use the English alphabet. Or, if Shehadeh is actually a shady guy, he might have been smart and changed the spelling of his name so you can't match him so easily.
Which is why we'll need to search for similar names.
2. Find similar names
There are lists and lists with rows and rows of names. If we compared every single entry to Shehadeh Rafiq Deha, it would take far too long, because these databases have many thousands of rows. Not so efficient, right? So we won't do that.
Instead, what we do is first identify names that are really similar using something called Elasticsearch. We won't get into the technicalities here, but essentially Elasticsearch lets us define the parameters we want to narrow down the names we'll compare Shehadeh Rafiq Deha against.
For example, Elasticsearch would skip the name "Princess Sarah" (because there's almost no similarity with Shehadeh Rafiq Deha) but would pick out SHEHADEH, Rafik because both of its name parts are similar to 2 of the 3 parts of our customer's name.
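To make the idea of a shortlist concrete, here is a minimal sketch in Python. It is not the Elasticsearch query itself (this article doesn't describe that query). It is a stand-in that keeps a listed name only if enough of the customer's name parts look like one of its parts. The helper names, the use of `difflib.SequenceMatcher`, and the 0.75 and 2-part cut-offs are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def name_parts(name):
    """Lowercase a name and split it into parts, dropping punctuation.

    Simplification: only ASCII letters are kept.
    """
    return re.findall(r"[a-z]+", name.lower())


def similar_part_count(customer, candidate, min_ratio=0.75):
    """Count customer name parts that closely resemble some candidate part."""
    candidate_parts = name_parts(candidate)
    count = 0
    for part in name_parts(customer):
        if any(SequenceMatcher(None, part, other).ratio() >= min_ratio
               for other in candidate_parts):
            count += 1
    return count


def shortlist(customer, listed_names, min_similar_parts=2):
    """Keep only listed names that share enough similar parts (hypothetical rule)."""
    return [name for name in listed_names
            if similar_part_count(customer, name) >= min_similar_parts]


listed = ["Princess Sarah", "SHEHADEH, Rafik"]
print(shortlist("Shehadeh Rafiq Deha", listed))  # ['SHEHADEH, Rafik']
```

Only the names that survive a cheap filter like this one go on to the more expensive scoring.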
Next, to find out how close SHEHADEH, Rafik and Shehadeh Rafiq Deha actually are, we'll need to calculate a fuzziness score.
3. Calculate a fuzziness score
There are actually 2 steps you need to complete in order to calculate the fuzziness score of SHEHADEH, Rafik. For all of this, we'll use a fuzziness measure called the Jaro-Winkler algorithm (which gives us a score between 0 and 1 stating how similar two names are).
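For reference, this is the standard textbook definition of the Jaro-Winkler score. The article doesn't say which variant or parameters are used, so the commonly used prefix weight p = 0.1 and the 4-character prefix cap are assumptions.

```latex
\[
\mathrm{sim}_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell \, p \, (1 - \mathrm{sim}_j)
\]
```

Here m is the number of matching characters (same character, at most \(\lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1\) positions apart), t is half the number of matched characters that appear in a different order, and \(\ell\) is the length of the common prefix, capped at 4. If m = 0, the score is 0. Two identical strings score 1.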
3.1 Calculate the full name score
For the full name score, we'll measure how close Shehadeh Rafiq Deha is to SHEHADEH, Rafik as full names. But we'll need to do some prep work.
Try our customer name Shehadeh Rafiq Deha in every combination of name order.
shehadeh rafiq deha
shehadeh deha rafiq
rafiq deha shehadeh
rafiq shehadeh deha
deha rafiq shehadeh
deha shehadeh rafiq
Then take out the spaces between all the combinations we came up with for Shehadeh Rafiq Deha.
shehadehrafiqdeha
shehadehdeharafiq
rafiqdehashehadeh
rafiqshehadehdeha
deharafiqshehadeh
dehashehadehrafiq
After we've put together our name in different combinations without spaces, we'll use the Jaro-Winkler algorithm to generate a closeness score for each combination against the listed name, written the same way (shehadehrafik). The closer the number is to 1, the closer the match.
"shehadehrafiqdeha" and "shehadehrafik" (score: 0.93)
"shehadehdeharafiq" and "shehadehrafik" (score: 0.91)
"rafiqdehashehadeh" and "shehadehrafik" (score: 0.53)
etc.
After we try all of the combinations we prepared, we'd take the highest score and call that our full name score. In this example, our full name score would be 0.93.
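The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the production code: it assumes the textbook Jaro-Winkler definition (prefix weight 0.1, prefix capped at 4 characters), and the function names are made up for illustration. With those assumptions it reproduces the scores quoted above.

```python
from itertools import permutations


def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix (assumed standard parameters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def full_name_score(customer_parts, listed_parts):
    """Best score over every ordering of the customer's name parts."""
    target = "".join(listed_parts)
    return max(jaro_winkler("".join(order), target)
               for order in permutations(customer_parts))


score = full_name_score(["shehadeh", "rafiq", "deha"], ["shehadeh", "rafik"])
print(round(score, 2))  # 0.93
```

Only the customer's name is reordered here. The listed name stays as it is, which mirrors the example above.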
3.2 Calculate the composite score
Now we need to calculate the composite score, which is almost the opposite of the full name score. Instead of putting all the names into one big string of letters, we look at each name separately (as long as the name has more than 2 letters) and give each of those names a score. The composite score for a given name order is then the average of those scores.
[shehadeh rafiq deha] and [shehadeh rafik]
shehadeh and shehadeh match completely. So the score is: 1.0
rafiq and rafik are almost a complete match. This is scored with a separate function which depends on the Jaro-Winkler and Soundex algorithms. But here we'd get the score: 0.9232
deha isn't scored against anything. So the score would be: 0.0
If we average our above scores of 1, 0.9232, and 0, that gives us a composite score of 0.641 when the names are in this order: shehadeh rafiq deha. Now we'll try another name order and calculate the composite score for that.
[shehadeh deha rafiq] and [shehadeh rafik]
shehadeh and shehadeh match completely. So the score is again: 1.0
deha and rafik don't match. So the score is: 0.1152
rafiq has no other names to score against. So the score is: 0.0
So the score for this ordering is 0.372 (the average of 1, 0.1152, 0).
We would continue making composite scores for every possible order of the names.
At the end, we see which combination gave us the highest score and use that for our composite score. So, in the end, our composite score would be 0.641.
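Here is a minimal Python sketch of the averaging for the two orderings walked through above. The real part-level scorer (the one built on Jaro-Winkler and Soundex) isn't described in this article, so the sketch uses a small lookup of the scores quoted above as a stand-in. The function names, and the choice to apply the "more than 2 letters" rule to both names, are assumptions.

```python
from statistics import mean

# Part-level scores quoted in the article. This lookup is only a stand-in for
# the real scorer, so any pair not listed here raises a KeyError.
QUOTED_SCORES = {
    frozenset(["rafiq", "rafik"]): 0.9232,
    frozenset(["deha", "rafik"]): 0.1152,
}


def part_score(a, b):
    """Score two name parts: identical parts score 1.0, others are looked up."""
    if a == b:
        return 1.0
    return QUOTED_SCORES[frozenset([a, b])]


def composite_for_ordering(customer_parts, listed_parts, min_letters=3):
    """Average the position-by-position scores for one ordering."""
    # Assumption: parts with 2 letters or fewer are dropped from both names.
    customer = [p for p in customer_parts if len(p) >= min_letters]
    listed = [p for p in listed_parts if len(p) >= min_letters]
    scores = []
    for i, part in enumerate(customer):
        if i < len(listed):
            scores.append(part_score(part, listed[i]))
        else:
            scores.append(0.0)  # nothing left to score this part against
    return mean(scores)


first = composite_for_ordering(["shehadeh", "rafiq", "deha"], ["shehadeh", "rafik"])
second = composite_for_ordering(["shehadeh", "deha", "rafiq"], ["shehadeh", "rafik"])
print(round(first, 3), round(second, 3))  # 0.641 0.372
print(round(max(first, second), 3))       # 0.641
```

A full version would loop over every ordering (for example with `itertools.permutations`) and keep the maximum, which needs a part scorer that can handle any pair of names.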
3.3 Calculate the final fuzzy matching score
Once you've done the full name score and the composite score, this part is easy. You just pick whichever of those numbers is highest, and that's your final score. Using our example here:
Full name score: 0.93
Composite score: 0.641
Our full name score is highest, so, voila, that's our final score: 0.93.
Note: the above example works with the assumption that the fuzzy matching threshold is 92 (that is, 0.92 on the 0 to 1 scale used here). So, Shehadeh Rafiq Deha would therefore be considered a match.
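In code, this last step is tiny. The sketch below assumes the threshold is expressed as a percentage and that a score equal to the threshold counts as a match. Neither detail is spelled out in this article, and the function names are hypothetical.

```python
def final_fuzzy_score(full_name_score, composite_score):
    """The final score is simply the higher of the two scores."""
    return max(full_name_score, composite_score)


def is_match(score, threshold_percent=92):
    """Assumption: a score at or above the threshold counts as a match."""
    return score * 100 >= threshold_percent


score = final_fuzzy_score(0.93, 0.641)
print(score, is_match(score))  # 0.93 True
```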
4. The API sends you info back
In step 1, you sent the name via API. Then all of the fuzzy scoring logic happened in steps 2-3. Now the final step is that the API will send you back the calculated final score. Voila!
A few final notes
The comparison isn't commutative. That's a fancy way of saying that if we reversed the previous example, comparing shehadeh rafik to shehadeh rafiq deha, we wouldn't necessarily get the same final score. The full name score would still stay the same in this example, but the composite score would be a lot higher because there are simply fewer names to compare, so both of those 2 names would have a counter-match.
Composite score in this new case would include the following orderings and scores.
[shehadeh rafik] and [shehadeh rafiq deha]
shehadeh and shehadeh match completely, the score: 1.0
rafik and rafiq are almost a complete match, score: 0.9232
Score for this ordering is now 0.9616 or the average of 1, 0.9232.
[shehadeh rafik] and [shehadeh deha rafiq]
shehadeh and shehadeh match completely, the score: 1.0
rafik and deha don't match, the score: 0.1152
Score for this ordering is now 0.5576 or the average of 1, 0.1152.
etc. until all orderings of the longer name are scored.
You get the picture. In this case, the final score will now be 0.9616.
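The difference comes down to what the average is divided by. This short Python sketch reuses the part scores quoted in this article to show both directions side by side. The variable names are just for illustration.

```python
from statistics import mean

# Best ordering in each direction, using the part scores quoted in this article.
long_name_screened = [1.0, 0.9232, 0.0]  # shehadeh, rafiq, deha (deha has no partner)
short_name_screened = [1.0, 0.9232]      # shehadeh, rafik (both have a partner)

composite_long = mean(long_name_screened)
composite_short = mean(short_name_screened)
print(round(composite_long, 4))   # 0.6411
print(round(composite_short, 4))  # 0.9616

# The full name score (0.93) is the same both ways, so only the composite moves.
full_name_score = 0.93
print(round(max(full_name_score, composite_long), 4))   # 0.93
print(round(max(full_name_score, composite_short), 4))  # 0.9616
```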
Checks have a 100% match if the sanctioned entity name fully contains the customer's name you're screening. If the name Abdulla is screened and a sanctioned person has a name which contains Abdulla, then this is a full match, as per the algorithm OFAC itself uses.
Some names are so long that it's not possible to look at all the orderings in under one second. Names that have more than 7 name parts are handled differently. This is rare, but the cases include some legal entity names and, sometimes, person names. One example of such a name is Nesrine Bent Zine El Abidine Ben Haj Hamda BEN ALI.
Our system is optimised to do the checks faster, but this document gives a rough outline of what is happening when calculating the fuzziness score.
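To see why long names need different handling, it helps to count the orderings. A name with n parts can be arranged in n! ways, and that number grows very quickly. The Python snippet below just does the arithmetic. How the system actually handles these names isn't covered in this document.

```python
from math import factorial

name = "Nesrine Bent Zine El Abidine Ben Haj Hamda BEN ALI"
parts = name.split()
print(len(parts))  # 10

# Number of possible orderings for names with 3, 7, 8 and 10 parts.
for n in (3, 7, 8, len(parts)):
    print(n, factorial(n))
# 3 6
# 7 5040
# 8 40320
# 10 3628800
```

Our 3-part example needed only 6 orderings, while the 10-part name above would need more than 3.6 million.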