Fuzzy Name Searching

List overview All Threads
Download

newer

older

[NF] What's in your event log?

Re: Fuzzy Name Searching

Ken Dibble

12 Apr 2017 12 Apr '17

2:55 p.m.

Hi folks,

I've been thinking of how I can improve the ability of my users to find people's names in a system that has over 30,000 people in it.

I've looked at soundex, and I've considered munging names to remove spaces, apostrophes, hyphens, etc. The thing about those approaches is that in order to be efficient, they require pre-processing all of the names in the system and storing the results, which can then be queried to find matches.

Unfortunately, that would require modifications to the database, which I try to avoid due to the downtime they require.

I'm looking for suggestions on how to produce results that include close matches on last names that doesn't require pre-processing.

I've played with various schemes to assign "weights" to matches based on the number of matching letters, but they all end up being very slooooow and also producing too many false positives.

I suppose there are no easy answers, but if anyone has an algorithm for this kind of thing that they would be willing to share, I'd be grateful.

Thanks.

Ken Dibble www.stic-cil.org

Show replies by date

Alan Bourke

12 Apr 12 Apr

3:33 p.m.

FoxWeb has a full text search engine, free, that might help.

http://www.foxweb.com/fwFullText/

-- Alan Bourke alanpbourke (at) fastmail (dot) fm

Peter Cushing

4:22 p.m.

On 12/04/2017 15:55, Ken Dibble wrote:

...

<snip> I'm looking for suggestions on how to produce results that include close matches on last names that doesn't require pre-processing.

About 20 years ago I did some work on marketing databases and one of the big tasks is de-duping data. We did things like substituting vowels with a * in the search value, then you get a more general match that avoids some spelling mistakes. Unfortunately approaches like this would require you to store the processed data.

Do you have the surname is a separate field with an index? If so you could show the results of matches in a box below the search which changes as they type each letter. The user can then see the list going smaller as they type and will hopefully see what they are looking for. If they have typed the whole name in and can't see the match they can remove letters to look for a partial match. I would only match from say the 3rd character or more as it may be too slow to display all the matches.

Peter

This communication is intended for the person or organisation to whom it is addressed. The contents are confidential and may be protected in law. Unauthorised use, copying or disclosure of any of it may be unlawful. If you have received this message in error, please notify us immediately by telephone or email.

www.whisperingsmith.com

Whispering Smith Ltd Head Office:61 Great Ducie Street, Manchester M3 1RR. Tel:0161 831 3700 Fax:0161 831 3715

London Office:17-19 Foley Street, London W1W 6DW Tel:0207 299 7960

Stephen Russell

4:49 p.m.

I remember this joy of searching names in a system that had 2+ million customers and names were all varchar() instead of a key to a secondary table. My indexes sure took a beating when I got another "Williams", the number one last name in the system, and it had to tear a page to make a new page in this area.

I found that making a table called NAMES fixed the search time I was experiencing. Two text boxes had input for whatever they keyed. I added the % for wildcard after any text in each box and one of the keypress events was the trigger to run it.

Select <field_list> from customer where lNameID in ( select nameID from names where Name like @Lname) and fNameID in ( select nameID from names na where na.Name like @Fname)

That has been 10-13 years ago.

On Wed, Apr 12, 2017 at 9:55 AM, Ken Dibble krdibble@stny.rr.com wrote:

...

Hi folks,

I've been thinking of how I can improve the ability of my users to find people's names in a system that has over 30,000 people in it.

I've looked at soundex, and I've considered munging names to remove spaces, apostrophes, hyphens, etc. The thing about those approaches is that in order to be efficient, they require pre-processing all of the names in the system and storing the results, which can then be queried to find matches.

Unfortunately, that would require modifications to the database, which I try to avoid due to the downtime they require.

I'm looking for suggestions on how to produce results that include close matches on last names that doesn't require pre-processing.

I've played with various schemes to assign "weights" to matches based on the number of matching letters, but they all end up being very slooooow and also producing too many false positives.

I suppose there are no easy answers, but if anyone has an algorithm for this kind of thing that they would be willing to share, I'd be grateful.

Thanks.

Ken Dibble www.stic-cil.org

[excessive quoting removed by server]

Garrett Fitzgerald

7:31 p.m.

I wrote a FLL to do Levenshtein distances for fuzzy name matching, but everything was posted to my blog, which is no longer online. It wasn't amazingly hard to figure out, though, so it might be worth finding the algorithm in C and recreating my steps. It ran much faster than equivalent Fox code did.

On Wed, Apr 12, 2017 at 12:49 PM, Stephen Russell srussell705@gmail.com wrote:

...

I remember this joy of searching names in a system that had 2+ million customers and names were all varchar() instead of a key to a secondary table. My indexes sure took a beating when I got another "Williams", the number one last name in the system, and it had to tear a page to make a new page in this area.

I found that making a table called NAMES fixed the search time I was experiencing. Two text boxes had input for whatever they keyed. I added the % for wildcard after any text in each box and one of the keypress events was the trigger to run it.

Select <field_list> from customer where lNameID in ( select nameID from names where Name like @Lname) and fNameID in ( select nameID from names na where na.Name like @Fname)

That has been 10-13 years ago.

On Wed, Apr 12, 2017 at 9:55 AM, Ken Dibble krdibble@stny.rr.com wrote:

...
Hi folks,

I've been thinking of how I can improve the ability of my users to find people's names in a system that has over 30,000 people in it.

I've looked at soundex, and I've considered munging names to remove spaces, apostrophes, hyphens, etc. The thing about those approaches is

that

...
in order to be efficient, they require pre-processing all of the names in the system and storing the results, which can then be queried to find matches.

Unfortunately, that would require modifications to the database, which I try to avoid due to the downtime they require.

I'm looking for suggestions on how to produce results that include close matches on last names that doesn't require pre-processing.

I've played with various schemes to assign "weights" to matches based on the number of matching letters, but they all end up being very slooooow

and

...
also producing too many false positives.

I suppose there are no easy answers, but if anyone has an algorithm for this kind of thing that they would be willing to share, I'd be grateful.

Thanks.

Ken Dibble www.stic-cil.org

[excessive quoting removed by server]

Ted Roche

8:18 p.m.

Ah! The algorithm rang a bell!

Garrett: have you tried searching Archive.org? A LOT of your stuff appears archived: https://web-beta.archive.org/web/*/garrett%20fitzgerald%20

The equivalent FoxPro code, by the way is in the leafe downloads at https://leafe.com/dls/vfp. Bob Calco wrote it up.

Also on Fox Wikis at: http://fox.wikis.com/wc.dll?Wiki~LevenshteinAlgorithm

Craig Boyd's blog about Spell Checking at http://www.sweetpotatosoftware.com/spsblog/CommentView.aspx?guid=8800bdb9-a9...

On Wed, Apr 12, 2017 at 3:31 PM, Garrett Fitzgerald sarekofvulcan@gmail.com wrote:

...

I wrote a FLL to do Levenshtein distances for fuzzy name matching, but everything was posted to my blog, which is no longer online. It wasn't amazingly hard to figure out, though, so it might be worth finding the algorithm in C and recreating my steps. It ran much faster than equivalent Fox code did.

On Wed, Apr 12, 2017 at 12:49 PM, Stephen Russell srussell705@gmail.com wrote:

...
I remember this joy of searching names in a system that had 2+ million customers and names were all varchar() instead of a key to a secondary table. My indexes sure took a beating when I got another "Williams", the number one last name in the system, and it had to tear a page to make a new page in this area.

I found that making a table called NAMES fixed the search time I was experiencing. Two text boxes had input for whatever they keyed. I added the % for wildcard after any text in each box and one of the keypress events was the trigger to run it.

Select <field_list> from customer where lNameID in ( select nameID from names where Name like @Lname) and fNameID in ( select nameID from names na where na.Name like @Fname)

That has been 10-13 years ago.

On Wed, Apr 12, 2017 at 9:55 AM, Ken Dibble krdibble@stny.rr.com wrote:

...
Hi folks,

I've been thinking of how I can improve the ability of my users to find people's names in a system that has over 30,000 people in it.

I've looked at soundex, and I've considered munging names to remove spaces, apostrophes, hyphens, etc. The thing about those approaches is

that

...
in order to be efficient, they require pre-processing all of the names in the system and storing the results, which can then be queried to find matches.

Unfortunately, that would require modifications to the database, which I try to avoid due to the downtime they require.

I'm looking for suggestions on how to produce results that include close matches on last names that doesn't require pre-processing.

I've played with various schemes to assign "weights" to matches based on the number of matching letters, but they all end up being very slooooow

and

...
also producing too many false positives.

I suppose there are no easy answers, but if anyone has an algorithm for this kind of thing that they would be willing to share, I'd be grateful.

Thanks.

Ken Dibble www.stic-cil.org

[excessive quoting removed by server]

Garrett Fitzgerald

8:41 p.m.

Aha in turn! https://web-beta.archive.org/web/20101216062156/http://blog.donnael.com:80/2...

On Apr 12, 2017 4:19 PM, "Ted Roche" tedroche@gmail.com wrote:

...

Ah! The algorithm rang a bell!

Garrett: have you tried searching Archive.org? A LOT of your stuff appears archived: https://web-beta.archive.org/web/*/garrett%20fitzgerald%20

The equivalent FoxPro code, by the way is in the leafe downloads at https://leafe.com/dls/vfp. Bob Calco wrote it up.

Also on Fox Wikis at: http://fox.wikis.com/wc.dll? Wiki~LevenshteinAlgorithm

Craig Boyd's blog about Spell Checking at http://www.sweetpotatosoftware.com/spsblog/CommentView.aspx?guid= 8800bdb9-a9c2-484f-942f-6a08947d903a

On Wed, Apr 12, 2017 at 3:31 PM, Garrett Fitzgerald sarekofvulcan@gmail.com wrote:

...
I wrote a FLL to do Levenshtein distances for fuzzy name matching, but everything was posted to my blog, which is no longer online. It wasn't amazingly hard to figure out, though, so it might be worth finding the algorithm in C and recreating my steps. It ran much faster than

equivalent

...
Fox code did.

On Wed, Apr 12, 2017 at 12:49 PM, Stephen Russell <srussell705@gmail.com

wrote:

...
I remember this joy of searching names in a system that had 2+ million customers and names were all varchar() instead of a key to a secondary table. My indexes sure took a beating when I got another "Williams",

the

...
...
number one last name in the system, and it had to tear a page to make a

new

...
...
page in this area.

I found that making a table called NAMES fixed the search time I was experiencing. Two text boxes had input for whatever they keyed. I

added

...
...
the % for wildcard after any text in each box and one of the keypress events was the trigger to run it.

Select <field_list> from customer where lNameID in ( select nameID from names where Name like @Lname) and fNameID in ( select nameID from names na where na.Name like @Fname)

That has been 10-13 years ago.

On Wed, Apr 12, 2017 at 9:55 AM, Ken Dibble krdibble@stny.rr.com

wrote:

...
...
...
Hi folks,

I've been thinking of how I can improve the ability of my users to

find

...
...
...
people's names in a system that has over 30,000 people in it.

I've looked at soundex, and I've considered munging names to remove spaces, apostrophes, hyphens, etc. The thing about those approaches is

that

...
in order to be efficient, they require pre-processing all of the

names in

...
...
...
the system and storing the results, which can then be queried to find matches.

Unfortunately, that would require modifications to the database,

which I

...
...
...
try to avoid due to the downtime they require.

I'm looking for suggestions on how to produce results that include

close

...
...
...
matches on last names that doesn't require pre-processing.

I've played with various schemes to assign "weights" to matches based

on

...
...
...
the number of matching letters, but they all end up being very

slooooow

...
...
and

...
also producing too many false positives.

I suppose there are no easy answers, but if anyone has an algorithm

for

...
...
...
this kind of thing that they would be willing to share, I'd be

grateful.

...
...
...
Thanks.

Ken Dibble www.stic-cil.org

[excessive quoting removed by server]

Mike

9:22 p.m.

I found a Levenshtein function somewhere last year and have been using it with MariaDB as a function on the MariaDB server, called from my VFP 9 application. It's exceptionally fast and works pretty well.

My application needs to get "as close as" matches to a random string (for manufacturer product SKUs, which can be any length, any alphanumeric mishmash...example: RGB745WEHWW).

Sometimes the user enters everything right except one character, and this function returns a weighted list of "as close as I can find" known SKUs from a table of 45,000+.

I can send the function to anyone who is interested.

Here's an example of how I call it (the Levenshtien function is named klose, pcSearch is the text string to search for) select sku, klose(sku,?pcSearch) as score from (select sku from skus where soundex(sku) like soundex(?pcSearch)) as hits order by score desc limit 10

Mike Copeland

Garrett Fitzgerald wrote:

...

I wrote a FLL to do Levenshtein distances for fuzzy name matching, but everything was posted to my blog, which is no longer online. It wasn't amazingly hard to figure out, though, so it might be worth finding the algorithm in C and recreating my steps. It ran much faster than equivalent Fox code did.

On Wed, Apr 12, 2017 at 12:49 PM, Stephen Russell srussell705@gmail.com wrote:

...
I remember this joy of searching names in a system that had 2+ million customers and names were all varchar() instead of a key to a secondary table. My indexes sure took a beating when I got another "Williams", the number one last name in the system, and it had to tear a page to make a new page in this area.

I found that making a table called NAMES fixed the search time I was experiencing. Two text boxes had input for whatever they keyed. I added the % for wildcard after any text in each box and one of the keypress events was the trigger to run it.

Select <field_list> from customer where lNameID in ( select nameID from names where Name like @Lname) and fNameID in ( select nameID from names na where na.Name like @Fname)

That has been 10-13 years ago.

On Wed, Apr 12, 2017 at 9:55 AM, Ken Dibble krdibble@stny.rr.com wrote:

...
Hi folks,

I've been thinking of how I can improve the ability of my users to find people's names in a system that has over 30,000 people in it.

I've looked at soundex, and I've considered munging names to remove spaces, apostrophes, hyphens, etc. The thing about those approaches is

that

...
in order to be efficient, they require pre-processing all of the names in the system and storing the results, which can then be queried to find matches.

Unfortunately, that would require modifications to the database, which I try to avoid due to the downtime they require.

I'm looking for suggestions on how to produce results that include close matches on last names that doesn't require pre-processing.

I've played with various schemes to assign "weights" to matches based on the number of matching letters, but they all end up being very slooooow

and

...
also producing too many false positives.

I suppose there are no easy answers, but if anyone has an algorithm for this kind of thing that they would be willing to share, I'd be grateful.

Thanks.

Ken Dibble www.stic-cil.org

mbsoftwaresolutions＠mbsoftwaresolutions.com

14 Apr 14 Apr

3:23 p.m.

On 2017-04-12 17:22, Mike wrote:

...

I found a Levenshtein function somewhere last year and have been using it with MariaDB as a function on the MariaDB server, called from my VFP 9 application. It's exceptionally fast and works pretty well.

My application needs to get "as close as" matches to a random string (for manufacturer product SKUs, which can be any length, any alphanumeric mishmash...example: RGB745WEHWW).

Sometimes the user enters everything right except one character, and this function returns a weighted list of "as close as I can find" known SKUs from a table of 45,000+.

I can send the function to anyone who is interested.

Here's an example of how I call it (the Levenshtien function is named klose, pcSearch is the text string to search for) select sku, klose(sku,?pcSearch) as score from (select sku from skus where soundex(sku) like soundex(?pcSearch)) as hits order by score desc limit 10

Mike Copeland

Hi Mike,

I'd love to see it. How about posting it on the ProFox downloads page so that it's easy for anybody to get?

Thanks! --Mike

3232

Age (days ago)

3234

Last active (days ago)

profox@leafe.com

8 comments

8 participants

tags (0)

participants (8)

Alan Bourke
Garrett Fitzgerald
Ken Dibble
mbsoftwaresolutions＠mbsoftwaresolutions.com
Mike
Peter Cushing
Stephen Russell
Ted Roche