<!-- received="Thu Nov 13 04:04:51 1997 PST" -->
<!-- sent="Thu, 13 Nov 1997 12:55:15 +0100 (MET)" -->
<!-- name="becka@rz.uni-duesseldorf.de" -->
<!-- email="becka@rz.uni-duesseldorf.de" -->
<!-- subject="Re: OCR Software..?!" -->
<!-- id="m0xVxrf-000BW0C@charon.beck-sw.de" -->
<!-- inreplyto="199711130017.TAA14813@lemur.magnet.com" -->
<title>sane-devel: Re: OCR Software..?!</title>
<h1>Re: OCR Software..?!</h1>
<a href="mailto:becka@rz.uni-duesseldorf.de"><i>becka@rz.uni-duesseldorf.de</i></a><br>
<i>Thu, 13 Nov 1997 12:55:15 +0100 (MET)</i>
<p>
<ul>
<li> <b>Messages sorted by:</b> <a href="date.html#107">[ date ]</a><a href="index.html#107">[ thread ]</a><a href="subject.html#107">[ subject ]</a><a href="author.html#107">[ author ]</a>
<!-- next="start" -->
<li> <b>Next message:</b> <a href="0108.html">Joey Nelson: "Re: Problem Scanning with UMAX S6E."</a>
<li> <b>Previous message:</b> <a href="0106.html">Michael Burghart: "Re: OCR Software..?!"</a>
<li> <b>In reply to:</b> <a href="0100.html">Andrew Kuchling: "Re: OCR Software..?!"</a>
<!-- nextthread="start" -->
<li> <b>Next in thread:</b> <a href="0105.html">Jonathan Buzzard: "Re: OCR Software..?!"</a>
<!-- reply="end" -->
</ul>
<!-- body="start" -->
<i>&gt; my idea was that scanned data would wind up in a Tk</i><br>
<i>&gt; text editing box, with possible errors (where the confidence value of</i><br>
<i>&gt; the recognition is low) highlighted in red.</i><br>
<p>
You might eventually need a "segmentation preview" that (optionally) lets you<br>
manually adjust the separation of text and graphics and the<br>
order in which the text boxes are processed.<br>
<p>
It would also be nice if every manual step could be turned on or off.<br>
Then you could run a "quick-and-dirty" mass conversion and correct errors<br>
the next morning, once the stack of sheets has been fed through the<br>
scanner, or work interactively when you prefer.<br>
<p>
<i>&gt; Recognition is the complicated part, of course. First you need to</i><br>
<i>&gt; scan the image, then it's usually converted from grey-scale to 2-level</i><br>
<i>&gt; black-and-white. Documents are often not perfectly aligned when</i><br>
<i>&gt; they're scanned, so the angle at which they're tilted (called the</i><br>
<i>&gt; "skew angle") has to be measured and compensated for.</i><br>
<p>
Yeah. If you want to compensate on the image side, do so before converting<br>
to b/w. Rotating the grayscale image loses less quality.<br>
<p>
Moreover a "de-noise" filter would be appropriate to remove speckles.<br>
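<p>
One common way to remove isolated speckles is a small median filter. The<br>
following is only a minimal sketch, not anything taken from an existing OCR<br>
package: the function name median_filter_3x3 and the list-of-rows image<br>
layout are assumptions made for illustration.<br>

```python
def median_filter_3x3(image):
    """Return a copy of a grayscale image (a list of rows of ints) in which
    every interior pixel is replaced by the median of its 3x3 neighbourhood.

    Hypothetical helper for illustration; border pixels are left unchanged.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = sorted(
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            result[y][x] = window[4]  # middle of the 9 sorted values
    return result
```

A single dark pixel on a white background disappears, while a straight<br>
black/white edge is left where it was, which is why a median is usually<br>
preferred over a plain average for this job.<br>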
<p>
At small text sizes it might be worth keeping a grayscale image<br>
(though this complicates the algorithms considerably). At the very least you<br>
should use a suitable combined sharpening/smoothing filter (one that preserves<br>
edges but smooths areas) to get a good image of the letters.<br>
<p>
<i>&gt; Then the image has to be segmented into words, and words into letters; </i><br>
Or ligatures (letter pairs printed as a single glyph). Many printed typefaces<br>
use them. An example is the combination "fi": in print, the dot of the i is<br>
often merged into the upper end of the f. Typeset a word containing this<br>
combination with TeX to see what I mean.<br>
<p>
<i>&gt; each letter is then recognized, and usually a confidence value is </i><br>
<i>&gt; attached to each letter.</i><br>
Yep. The same should happen at word level.<br>
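<p>
How a word-level value is derived is left open here. One possible choice,<br>
shown purely as an assumption, is the geometric mean of the letter<br>
confidences, so that one very doubtful letter pulls the whole word down.<br>
The function name word_confidence is hypothetical.<br>

```python
import math


def word_confidence(letter_confidences):
    """Geometric mean of per-letter confidences, each expected in 0..1.

    Assumed aggregation rule for illustration; other rules (minimum,
    product, a separate word classifier) are equally plausible.
    """
    if not letter_confidences:
        return 0.0
    if min(letter_confidences) <= 0.0:
        return 0.0
    log_sum = sum(math.log(c) for c in letter_confidences)
    return math.exp(log_sum / len(letter_confidences))
```

A word whose value falls below some threshold could then be highlighted<br>
as a whole, in addition to the individual low-confidence letters.<br>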
<p>
<i>&gt; Often there's a post-processing step which uses a language dictionary </i><br>
<i>&gt; to correct errors; for example, if you're scanning English text, 'rn' </i><br>
<i>&gt; might be a scanning error for "m".</i><br>
<p>
Yes. The matching algorithm for the dictionary search needs to be<br>
chosen in a way that takes typical scanning/matching errors into account.<br>
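<p>
A very simple way to take such errors into account is to undo known<br>
confusions one at a time and look each variant up in the dictionary. This is<br>
only a sketch: the CONFUSIONS table and the function names are made up for<br>
illustration, and a real matcher would more likely use a weighted edit<br>
distance.<br>

```python
# Hypothetical confusion table: recognised fragment -> fragments that may
# really have been printed. The entries are illustrative only.
CONFUSIONS = {
    "rn": ["m"],
    "cl": ["d"],
    "1": ["l", "i"],
    "0": ["o"],
}


def candidates(word):
    """Yield the word itself, then every variant with one confusion undone."""
    yield word
    for seen, printed_options in CONFUSIONS.items():
        start = word.find(seen)
        while start != -1:
            for printed in printed_options:
                yield word[:start] + printed + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct(word, dictionary):
    """Return the first candidate found in the dictionary, else the word."""
    for candidate in candidates(word):
        if candidate in dictionary:
            return candidate
    return word
```

With a dictionary containing "modern", correct("rnodern", dictionary) gives<br>
back "modern", while a word that is already in the dictionary is returned<br>
unchanged.<br>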
<p>
At letter level you could use language-specific hidden Markov models to<br>
predict the probability of each possible next letter, which helps when<br>
deciding between several candidates. E.g. if the last recognized<br>
character was "q", the probability of the next one being "u" is orders of<br>
magnitude higher than of it being "n".<br>
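<p>
The "q" example can be sketched with a plain first-order Markov (bigram)<br>
model, which is a deliberate simplification of the idea above. All names<br>
here, and the add-one smoothing, are assumptions for illustration; the<br>
shape_scores argument stands in for whatever the recognizer outputs.<br>

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each letter follows each other letter in the text."""
    counts = defaultdict(Counter)
    for previous, current in zip(text, text[1:]):
        if previous.isalpha() and current.isalpha():
            counts[previous][current] += 1
    return counts


def transition_probability(counts, previous, current, alphabet_size=26):
    """P(current | previous) with add-one smoothing so unseen pairs stay > 0."""
    following = counts[previous]
    total = sum(following.values())
    return (following[current] + 1) / (total + alphabet_size)


def pick_letter(counts, previous, shape_scores):
    """Weight each recogniser score by the language model and pick the best."""
    return max(
        shape_scores,
        key=lambda letter: shape_scores[letter]
        * transition_probability(counts, previous, letter),
    )
```

Trained on a reasonable amount of English text, the model lets "u" win<br>
after "q" even when the recognizer scores "n" slightly higher on shape alone.<br>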
<p>
<i>&gt; The two major techniques for recognizing letters seems to be either</i><br>
<i>&gt; neural networks, or making a vector from easily measured</i><br>
<i>&gt; characteristics of the bitmap containing a letter; for example, xocr</i><br>
<i>&gt; takes a histogram of the letter at 128 different angles. This</i><br>
<i>&gt; technique dates back at least to the 1970s, but neural networks seem</i><br>
<i>&gt; to be what all modern systems use.</i><br>
<p>
The XOCR technique is not good. If it hasn't changed since I last looked,<br>
it _counts_pixels_ (!) along these angles. This doesn't even distinguish<br>
an O from a dot. Using the number of black/white transitions is a better<br>
measure.<br>
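<p>
A toy illustration of the difference, using a single horizontal scan line<br>
through the middle of each glyph rather than 128 angles. The two rows and<br>
the function names are invented for this example and are not taken from xocr.<br>

```python
def parse(line):
    """Turn a string such as '..##..' into a list of 0/1 pixels (# is black)."""
    return [1 if ch == "#" else 0 for ch in line]


def black_pixel_count(row):
    """Number of black pixels along the scan line."""
    return sum(row)


def transition_count(row):
    """Number of black/white changes between neighbouring pixels."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)


# Scan line through the centre of a ring ("O") and of a filled dot.
O_ROW = parse("..##....##..")
DOT_ROW = parse("....####....")

print(black_pixel_count(O_ROW), black_pixel_count(DOT_ROW))  # 4 4
print(transition_count(O_ROW), transition_count(DOT_ROW))    # 4 2
```

Both rows contain four black pixels, so the pixel count cannot separate<br>
them, but the ring crosses an edge four times and the dot only twice.<br>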
<p>
But do not make the standard OCR mistake of simply feeding the character<br>
matrix to a neural net and then trying to train it like mad.<br>
<p>
Feature recognition is still the most important part of a good OCR<br>
program. Whether you then classify the features with a neural net or with<br>
something simpler, such as weighted vector matching, isn't too important.<br>
If your feature recognition is not good, neither of them will work well.<br>
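<p>
For reference, weighted vector matching can be as small as the sketch<br>
below: each letter has a prototype feature vector, and the unknown glyph is<br>
assigned to the prototype with the smallest weighted squared distance. The<br>
feature layout, prototypes and weights in the usage note are hypothetical.<br>

```python
def weighted_distance(features, prototype, weights):
    """Weighted squared Euclidean distance between two feature vectors."""
    return sum(
        w * (f - p) ** 2 for f, p, w in zip(features, prototype, weights)
    )


def classify(features, prototypes, weights):
    """Return (label, distance) of the prototype closest to the features."""
    label = min(
        prototypes,
        key=lambda name: weighted_distance(features, prototypes[name], weights),
    )
    return label, weighted_distance(features, prototypes[label], weights)
```

For example, with made-up features (horizontal transitions, vertical<br>
transitions, number of holes), prototypes {"O": [4, 4, 1], "l": [2, 2, 0],<br>
"B": [4, 6, 2]} and weights [1.0, 1.0, 3.0], the vector [4, 5, 1] is<br>
assigned to "O". The distance can double as a rough confidence value.<br>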
<p>
Neural nets can compensate a bit better for a bad recognizer, but<br>
at the price of additional training time and possibly less predictable<br>
behaviour.<br>
<p>
<i>&gt; We should approach him, and get a freeware-OCR mailing list set up.</i><br>
Definitely a good idea. OCR is one of the few things still missing in freeware.<br>
<p>
CU, Andy<br>
<p>
<pre>
--
Andreas Beck | Email : &lt;<a href="mailto:becka@sunserver1.rz.uni-duesseldorf.de">becka@sunserver1.rz.uni-duesseldorf.de</a>&gt;
</pre>
<p>
<pre>
--
Source code, list archive, and docs: <a href="http://www.mostang.com/sane/">http://www.mostang.com/sane/</a>
To unsubscribe: echo unsubscribe sane-devel | mail <a href="mailto:majordomo@mostang.com">majordomo@mostang.com</a>
</pre>
<!-- body="end" -->