String Similarity and Extension Methods

This is a post about using extension methods to build string similarity tools.

I recently saw this post on Stack Overflow about word similarity algorithms.  There were several answers that included various techniques linked through Wikipedia and other sources.  I looked at this and saw an opportunity to unify them all into C#.  I also took it as an opportunity to play with extension methods.

Extension methods are a C# 3.0 feature that allows anyone to syntactically add a new method to an existing class.  I say ‘syntactically’ because an extension method doesn’t really add to the class.  It’s a convention for creating new utility functions.  The actual implementation in C# is nothing more than sugar.

You could always have written the equivalent of an extension method.  An extension method is nothing more than a static method that takes the object type you want to extend as the first argument.  This is actually a lie, but it is very close to the truth.  Bear with me.

public static double Power(double d, double e)
{
   return Math.Pow(d, e);
}

Now this is a trivial case since the Math class already has this method in it, but if we transform it into an extension method, we get this:

public static double ToThe(this double d, double e)
{
    return Math.Pow(d, e);
}

Now you can see the lie – extension methods contain “this” on the first argument to mark it as such.  In addition, the entire containing class needs to be static.

The name is funny though.  ToThe?  You’re thinking “What the…?”

The reason is that in usage it becomes more natural to read:

public static double ExponentialGrowth(double k, double t)
{
    return Math.E.ToThe(k * t);
} 

This gives us a method we can tack onto any double to raise it ToThe power of some exponent in a way that, while not quite an exponent operator, reads a lot closer to one.
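
To make the "nothing more than sugar" point concrete, here is a minimal, self-contained sketch.  The class names DoubleExtensions and SugarDemo are placeholders of mine, not names from the zipped source.  It calls the same method both ways and shows that the results are identical.

```csharp
using System;

// Hypothetical container; an extension method has to live in a static class.
public static class DoubleExtensions
{
    public static double ToThe(this double d, double e)
    {
        return Math.Pow(d, e);
    }
}

public static class SugarDemo
{
    public static void Main()
    {
        double two = 2.0;

        // Extension-method syntax...
        double viaExtension = two.ToThe(10);
        // ...and the plain static call it is shorthand for.
        double viaStatic = DoubleExtensions.ToThe(two, 10);

        Console.WriteLine(viaExtension == viaStatic); // prints True
        Console.WriteLine(viaExtension);              // prints 1024
    }
}
```

Both calls bind to the same static method, so nothing is added to double itself.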

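To this end, I’ve ported, repackaged or otherwise refactored implementations of several string similarity algorithms.  The ones exercised in the usage below are SimilarText, Soundex, Double Metaphone and Levenshtein distance.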

The usage is simple:

Console.WriteLine("Similarity of steve to joe " + "steve".SimilarText("joe"));
Console.WriteLine("Similarity of steve to steve " + "steve".SimilarText("steve"));
Console.WriteLine("Soundex of steve " + "steve".SoundEx());
Console.WriteLine("Soundex of stove " + "stove".SoundEx());
Console.WriteLine("Soundex difference of steve and stove " + "steve".SoundEx("stove"));
Console.WriteLine("Soundex difference of steve and stinky " + "steve".SoundEx("stinky"));
KeyValuePair<string, string> dm1 = "steve".DoubleMetaphone();
KeyValuePair<string, string> dm2 = "store".DoubleMetaphone();
Console.WriteLine("Double metaphone of steve " + dm1.Key + " " + dm1.Value);
Console.WriteLine("Double metaphone of store " + dm2.Key + " " + dm2.Value);
Console.WriteLine("Levenshtein Distance of steve and sleeve " + "steve".LevenshteinDistance("sleeve"));
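
The implementations themselves live in the zipped source rather than in this post.  For a sense of what one of them looks like as an extension method, here is a rough sketch of Levenshtein distance using the textbook two-row dynamic programming approach.  This is an illustration under my own assumptions, not the code from the download; the class names LevenshteinSketch and LevenshteinDemo are hypothetical.

```csharp
using System;

public static class LevenshteinSketch
{
    // Edit distance: the minimum number of single-character insertions,
    // deletions and substitutions needed to turn source into target.
    public static int LevenshteinDistance(this string source, string target)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (target == null) throw new ArgumentNullException("target");

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Turning an empty prefix of source into the first j characters
        // of target takes j insertions.
        for (int j = 0; j <= target.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // i deletions to reach the empty string
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }

            // The row just computed becomes the previous row for the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[target.Length];
    }
}

public static class LevenshteinDemo
{
    public static void Main()
    {
        Console.WriteLine("steve".LevenshteinDistance("sleeve")); // prints 2
    }
}
```

Keeping only two rows holds memory proportional to the length of the second string instead of the full table.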

In addition to these extension methods, I also added a few utility extension methods:

public static bool StartsWith(this string s, params string[] candidate)
{
    string match = candidate.FirstOrDefault(t => s.StartsWith(t));
    return match != default(string);
}

This StartsWith returns true if a given string starts with any of the candidate strings.
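
One detail worth knowing when an extension method shares a name with an instance method: the compiler prefers the instance method whenever it is applicable.  The sketch below repeats the method above inside a hypothetical StartsWithSketch class (the class and demo names are mine) and shows which calls land where.

```csharp
using System;
using System.Linq;

public static class StartsWithSketch
{
    public static bool StartsWith(this string s, params string[] candidate)
    {
        // s.StartsWith(t) takes a single string, so it binds to the built-in
        // instance method and does not recurse into this extension.
        string match = candidate.FirstOrDefault(t => s.StartsWith(t));
        return match != default(string);
    }
}

public static class StartsWithDemo
{
    public static void Main()
    {
        // Two string arguments: no instance overload fits, so the extension is used.
        Console.WriteLine("steve".StartsWith("jo", "st")); // prints True
        Console.WriteLine("steve".StartsWith("jo", "sl")); // prints False

        // One string argument: the built-in String.StartsWith(string) wins.
        Console.WriteLine("steve".StartsWith("st"));       // prints True
    }
}
```

In practice this means the extension only comes into play when you pass two or more candidates (or an explicit string array).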

public static bool SubstringIs(this string s, int start, int length, params string[] candidate)
{
     if (start < 0)
          return false;
     string sub = s.Substring(start, length);
     string match = candidate.FirstOrDefault(t => t == sub);
     return match != default(string);
}

SubstringIs returns true if any one of the candidate strings equals a given substring.
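
One thing to watch: String.Substring throws ArgumentOutOfRangeException when start plus length runs past the end of the string, and the check above only guards against a negative start.  A caller probing near the end of a word may want a variant that treats an out-of-range window as "no match".  Here is a sketch of one; SubstringIsSafe and its containing class are hypothetical names of mine and are not part of the zipped source.

```csharp
using System;
using System.Linq;

public static class SubstringSketch
{
    public static bool SubstringIsSafe(this string s, int start, int length, params string[] candidate)
    {
        // A window that does not fit inside the string cannot match anything,
        // so return false instead of letting Substring throw.
        if (start < 0 || length < 0 || start > s.Length - length)
            return false;

        string sub = s.Substring(start, length);
        return candidate.Any(t => t == sub);
    }
}

public static class SubstringDemo
{
    public static void Main()
    {
        Console.WriteLine("crash".SubstringIsSafe(3, 2, "ch", "sh")); // prints True
        Console.WriteLine("crash".SubstringIsSafe(4, 2, "sh"));       // prints False (window runs off the end)
    }
}
```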

Note that in each of these I’m using lambda expressions to LINQify the code, making these close to one-liners.  If you compare my source to the other source, you’ll see a number of changes.  Since extension methods are static, I had to undo some of the OOPness of the Double Metaphone implementation, and I generally tried to clean things up as I passed them by refactoring methods, simplifying, removing redundant code and method calls, etc.

I have to say that I’m very happy with the way this turned out.  I am by no means an expert in this type of work and I very well may have introduced a bug or two in the process of packaging this code.  Please contact me if you find a bug or want to add more extensions.

Here’s the source, with a VS 2008 project, zipped up.

Published Monday, January 26, 2009 11:39 AM by Steve Hawley

Comments

Tuesday, May 05, 2009 3:07 PM by Denny Dot Net

# Levenshtein formula and string similarity

I'm working on a system that allows users to edit titles on certain pieces of information. One of the

Saturday, July 25, 2009 9:11 AM by SmartK8

# Nice

Thanks a lot, you saved me a lot of time. I registered just to thank you! Good work!

Wednesday, August 12, 2009 12:45 AM by mleachpdx

# Double Metaphone Bug?

Either this is an ironic bug, or I'm not understanding the appropriate usage of Double Metaphone.

The double metaphone for "crash" throws exception System.ArgumentOutOfRangeException.

Example:

KeyValuePair<string, string> dm1 = "crash".DoubleMetaphone();
