Multilingual OCR
I frequently look back at the code I've designed with an eye toward improvement. Is it all that it can be? Does it do what it says with few surprises? Is it appropriately extensible?
Today, I'll look at the Atalasoft OcrEngine object. This was my first big project at Atalasoft. I designed and built it to interface as flexibly as possible with any arbitrary OCR engine, and to that end it has worked well. We have gotten prototype or production integrations running with seven different OCR engines. That's seven completely different engines with different needs, all running under one uniform interface. I'm proud of that - it was no mean feat.
The hardest part, however, is starting the actual engines. Each has different needs in terms of loading and locating DLLs or COM objects, or finding resources. That part is pretty painful, and sadly that pain gets transferred to our customers. Worse still, I can't really make it much better, since those needs are out of my control. When OEM engine manufacturers ask for API feedback, I usually want changes in loading/unloading or licensing. None have made those changes (and why should they?).
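To make the shape of the problem concrete, here is a minimal sketch of that kind of engine abstraction. The interface and its members are hypothetical - this is not the actual Atalasoft API - but they show how a uniform surface pushes all the engine-specific mess into one startup call:

```csharp
using System;
using System.Globalization;

// Hypothetical adapter over an arbitrary OCR engine - not the real
// Atalasoft API, just an illustration of the design.
public interface IOcrEngineAdapter : IDisposable
{
    // The painful, engine-specific part: locate DLLs or COM objects,
    // find resource files, check out a license, and so on.
    void Startup();
    void Shutdown();

    // The parts that map cleanly across engines.
    CultureInfo[] SupportedLanguages { get; }
    string Recognize(string imagePath, CultureInfo language);
}
```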
Another problem has to do with how the engines express what languages they can recognize. Here's the issue - for any language, there are at least three different dimensions that determine how to recognize it: the alphabet, the lexical structure, and the syntactic structure. It looks like you should be able to bind that up into one neat package (e.g., 'English' or 'Portuguese'), but does that cover the differences between Brazilian Portuguese and the Portuguese spoken in Portugal? What about languages, like Serbian, that are written in both Cyrillic and Latin alphabets? What about creoles and pidgins?
In the name of expedience, and to keep the .NET flavor as much as possible, I chose the class CultureInfo to represent the language(s) that an OCR engine can recognize. It covers most of the necessary ground in that it can distinguish between locale-based differences in cultures that share the same language: Egyptian Arabic is different from Jordanian Arabic. The problem is that many OCR engines have feature creep in their published languages. For example, I know three engines that claim to recognize Wolof. I won't diminish the language - 7 million people speak it - but I haven't found a way to construct a CultureInfo object that represents Wolof. Further, I'm pretty sure that none of these engines has a dictionary for Wolof; they just take advantage of the fact that Wolof is printed with a Latin alphabet.
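Here's a minimal sketch of what I mean, assuming the .NET Framework of this vintage (newer versions of Windows and .NET register more cultures, so the Wolof case may behave differently there):

```csharp
using System;
using System.Globalization;

class CultureInfoProbe
{
    static void Main()
    {
        // Locale-specific cultures work: Arabic as used in Egypt is
        // a distinct culture from Arabic as used in Jordan.
        CultureInfo egypt = new CultureInfo("ar-EG");
        CultureInfo jordan = new CultureInfo("ar-JO");
        Console.WriteLine(egypt.DisplayName);   // Arabic (Egypt)
        Console.WriteLine(jordan.DisplayName);  // Arabic (Jordan)

        // Wolof has an ISO 639-1 code ("wo"), but if no such culture
        // is registered on the machine, construction simply throws.
        try
        {
            CultureInfo wolof = new CultureInfo("wo");
            Console.WriteLine(wolof.DisplayName);
        }
        catch (ArgumentException)
        {
            Console.WriteLine("No CultureInfo available for Wolof.");
        }
    }
}
```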
What should really happen is that there should be another object that represents what you actually want: alphabet, dictionary (or dictionaries), and possibly grammar rules. Under that scheme, the correct request for any Wolof-like language would be { Latin, null, null }. The catch is that, in theory, one or more of those underlying engines could magically provide a dictionary for Wolof, and I have no good way to expose that now. No engine I have will tell me whether a dictionary exists for a particular language - just the open-ended white lie that it can recognize Moldovan or Nahuatl. In fact, I have one engine that doesn't use dictionaries at all. So here I am, stuck: I know how to represent the request, but I have no engine that can help me fully honor it, so I slink back to CultureInfo and hope it's good enough.
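For what it's worth, here's roughly what that object might look like. This is a hypothetical sketch, not anything shipping in the toolkit; the type and member names are mine:

```csharp
// Hypothetical recognition request: an alphabet plus optional
// dictionary and grammar resources. Not part of the shipping API.
public enum OcrAlphabet { Latin, Cyrillic, Greek, Arabic, Hebrew, Han }

public class RecognitionLanguage
{
    public RecognitionLanguage(OcrAlphabet alphabet, string dictionary,
        string grammarRules)
    {
        Alphabet = alphabet;
        Dictionary = dictionary;       // null means "no dictionary"
        GrammarRules = grammarRules;   // null means "no grammar rules"
    }

    public OcrAlphabet Alphabet { get; private set; }
    public string Dictionary { get; private set; }
    public string GrammarRules { get; private set; }

    // The Wolof-like request: shapes only - no lexicon, no syntax.
    public static RecognitionLanguage LatinAlphabetOnly
    {
        get { return new RecognitionLanguage(OcrAlphabet.Latin, null, null); }
    }
}
```

The nice property of this shape is that it degrades gracefully: an engine with no Wolof dictionary honors { Latin, null, null } exactly as requested, and an engine that someday grows a Wolof dictionary could fill in the Dictionary slot without breaking the interface.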