|
Welcome. Clara OCR is a free OCR, written for systems supporting the C library and the X Windows System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.
This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Developer's Guide". It's currently unfinished. First-time users are invited to read "The Clara OCR Tutorial". There is also an advanced manual known as "The Clara OCR Advanced User's Manual".
| CONTENTS |
| 1. Introducing the source code |
This Guide is a collection of entry points to the Clara OCR source code. Some notes explain punctual details about how this or that feature was implemented. Others are higher-level descriptions about how one entire subsystem works.
| 1.1 Language and environment |
Clara OCR is written in ANSI C (with some GNU extensions) and requires the services of the C library and the Xlib. The development is using 32-bit Intel GNU/Linux (various different distributions), GCC, Gnu Make, Bash, XFree86 and Perl 5 (required for producing the documentation).
| 1.2 Modularization |
Clara source code started, of course, as being one only file named clara.c. At some point we divided it into smaller pieces. Currently there are 18 files:
book.c .. Documentation only build.c .. The function build clara.c .. Startup and OCR run control cml.c .. ClaraML dumper and recover common.h .. Common declarations consist.c .. Consistency tests event.c .. GUI initialization and event handler gui.h .. Declarations that depend on X11 html.c .. HTML generation and parse pattern.c .. Book font stuff pbm2cl.c .. Import PBM pgmblock.c .. grayscale loading and blockfinding preproc.c .. internal preprocessor redraw.c .. The function redraw revision.c .. Revision procedures skel.c .. Skeleton computation symbol.c .. Symbol stuff welcome.c .. Welcome stuff |
Note that there are only two headers: common.h and gui.h. It's complex to maintain one header for each module. Most functions are not prototyped, but we intend to prototype all them in the near future.
| 1.3 The memory allocator |
Clara OCR relies on the memory allocator both for allocation or resizing of some large blocks used by the main data structures, and for allocation of a large number of small arrays. Currently Clara OCR does not include or use an special memory allocator, but implements an interface to realloc. The alloca function is also used sometimes along the code, generally to allocate buffers for sorting arrays.
The interface is the function c_realloc. The function c_free must be used to free the blocks allocated or resized by c_realloc. In the near future, c_realloc will build a list of the currently allocated blocks, their sizes and some bits more in order to help to trace flaws.
| 1.4 Security notes |
Concerning security, the following criteria is being used:
1. string operations are generally performed using services that accept a size parameter, like snprint or strncpy, except when the code itself is simple and guarantees that a overflow won't occur.
2. The CGI clara.pl invokes write privileges through sclara, a program specially written to perform only a small set of simple operations required for the operation of the Clara OCR web interface.
The following should be done:
1. Memory blocks should be cleared before calling free().
| 1.5 Runtime index checking |
A naive support for runtime index checking is provided through the macro checkidx. This checking is performed only if the code is compiled with the macro MEMCHECK defined and the command-line switch '-X 1' is used.
In fact, only those points on the source code where the macro checkidx is explicitly used will perform index checking. We've added calls to checkidx on some critical functions due to its complexity, or because segfaults were already were detected there.
| 1.6 Background operation |
Clara OCR decides at runtime if the GUI will be used or not. So even when using Clara OCR in batch mode (-b command-line switch), linking with the X libraries is required.
When the -b command-line switch is used, Clara OCR just won't make calls to X services. The source code tests the flag "batch_mode" before calling X services. So it won't create the application window on the X display, and automatically starts a full OCR operation on all pages found (as if the "OCR" button was pressed with the "work on all pages" option selected). Upon completion, Clara OCR will exit.
| 1.7 Global variables |
Clara OCR uses a lot of global variables. Large data structures, flags, paths, etc, use stored on global variables. In some cases we use a naming strategy to make the code more readable. The important cases are:
a. The main data structures of Clara OCR are global arrays that grow as required. The following a convention was created for the names associated with these arrays:
structure type array top size
--------------------------------------------
act adesc act topa actsz
closure cldesc cl topcl clsz
symbol sdesc mc tops ssz
pattern pdesc pattern topp psz
link ldesc lk toplk lksz
ptype ptdesc pt toppt ptsz
|
b. Menus are referred by their registration indexes. These indexes are stored on variables named CM_X. The menu items registration indexes are stored on variables named CM_X_SOMETHING (all capital). If the item has an associated flag, the flag is named cm_x_something (all small).
| 1.8 Path variables |
Most path variables are computed from the path given through the -f command line option. The variable "pagename" is the filename of the PBM image of the page being processed, not including the path eventually specified through the -f switch. For instance, if the OCR is started with
clara -f mydocs/test.pbm
|
Clara stores on the variable pagelist the null-separated list of all names of pbm files found on this directory. Even in this case, the variable pagename will store the filename of the page being processed (at any moment Clara will be processing one and only one page).
The directory that contains the pbm files that Clara will process is stored on the variable pagesdir. In the example above, the value of the variable pagesdir is "mydocs/".
The variable workdir stores the path of the directory where Clara will create the files *.html, *.session, "patterns" and "acts". This path is assumed to be equal to pagesdir, unless another path is given through the -w switch. The variable doubtsdir will be the concatenation of workdir with the string "doubts/" (doubtsdir is ignored if -W is not used).
| 1.9 Bitmaps |
Clara stores bitmaps in a linear array of bytes, following closely the pbm raw format. The first line of a bitmap with width w is stored on the first (w/8)+((w%8)!=0) bytes of the array. The remaining bits (if any) are left blank, and so on. The leftmost bit on each byte is the most significative one (black, or "on", is 1, and white, or "off" is 0). An example follows:
76543210765432
+--------------+
| | 00000000 00000000 = 0 0
| XX XXXX | 00011011 11000000 = 27 192
| XX XX | 00001100 01100000 = 12 96
| XX XX | 00001100 01100000 = 12 96
| XX XX | 00001100 01100000 = 12 96
| XX XX | 00001100 01100000 = 12 96
| | 00000000 00000000 = 0 0
+--------------+
stored as: 0 0 27 192 12 96 12 96 12 96 12 96 0 0
|
| 1.10 Execution model |
In order to allow the GUI to refresh the application window while one OCR run is in course, Clara does not use multiple threads. The main function alternates calls to xevents() to receive input and to continue_ocr() to perform OCR. As the OCR operations may take long to complete, a very simple model was implemented to allow the OCR services to execute only partially.
Such services (for instance load_page()) accept a "reset" parameter to allow resetting all static data, and they're expected to return 0 when finished, or nonzero otherwise. Therefore, a call to such services must loop until completion. The continue_ocr() calls the OCR steps using this model, and some OCR steps call other services (like load_page()) that implement this model too.
| 1.11 Return codes |
When Clara OCR exits, the exit code will diagnose the finalization status:
0 clean 1 data inconsistency 2 buffer overflow 3 invalid field 4 internal error 5 memory exhausted 6 X error 7 I/O error 8 bad input |
| 2. Internal representation of pages |
Even for non-developers, a knowledge of the internal data structures used by Clara OCR is required for fine tuning and to make simple diagnostics.
The basic elements stored are the "closures". Sets of one or more closures are called "symbols". Symbols are arranged in lists forming "words". The words are arranged in lists forming "lines".
| 2.1 Closures |
Closures of black pixels by contiguity are a first attempt to identify the atomic symbols of the document. The name "closure" is of course due to the consideration of the contiguity as a relation (in the mathematical sense of the word). Starting (for instance) from (i,j), we compute the set of black pixels ("X" and "*" in the figure). The limits (l,r,t,b) define the bounding box of the closure.
l i r
+---+-+----+---+
| |
t + XX XXXX |
| XX XX |
j + X* XX |
| XX XX |
b + XX XX |
| |
+--------------+
|
| 2.2 Symbols |
As one character of the document may be composed by two or more closures (for instance when it's broken), it's convenient to work not with closures, but with sets of closures. So we define the concept of "symbol" as being a set of one or more closures. Initially, the OCR generates one unitary symbol for each closure. Subsequent steps may define new symbols composed by two or more closures.
For instance, let's present three closures that do not correspond to atomic symbols: "a" and "i" linked (one closure) and a broken "u" (two closures). As a principle, Clara OCR do not try to break closures into smaller closures. Instead of that, the classification heuristic try to compose various patterns to resolve symbols like the "ai" in the figure. Concerning the "u", the classification heuristic is expected to merge the two closures into one symbol and apply a "u" pattern to resolve it.
l r l r l r
+-----+------------+-----+-+-+----+--+
| |
t + XX |
| XX |
| |
| XXXXX XXX XXX XXX + t
| X XX XX XX XX |
| XX XX XX XX |
| XXXXXXX XX XX XX |
| X XX XX XX XX |
| X XX XX XX XX |
b + XXXXX XXXXXXX XX XXXX + b
| |
+------------------------------------+
|
l r
+---+--+-------+
| |
t + XX |
| XX |
| |
| XXX |
| XX |
| XX |
| XX |
| XX |
b + XXXX |
| |
+--------------+
|
| 2.3 The sdesc structure and the mc array |
Each symbol is stored in a sdesc structure. Those structures form the mc array. Once created, a symbol is never deleted. So it's index on the mc array identifies it (this is important for the web-based revision procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never changes. So the symbol bounding box and the total number of black pixels also won't change either.
So two different entries of the mc array never have the same set of closures. The entries of the mc array are created by the new_mc service. When some procedure tries to create a new symbol informing a list of closures for which already exists a symbol, the service new_mc detects it and returns to the caller not the index of a newly created symbol, but the index of that already created one.
| 2.4 The preferred symbols |
One same closure may belong to more than one symbol. This is important in order to allow various heuristic trials. For instance, the left closure of the "u" on the preceding section could be identified as the body of an "i". In this case however we would not find its dot. So the heuristic could try by chance another solution, for instance to join it with the nearest closure (in that case, the right closure of the "u") and try to match it with some pattern of the font.
So the OCR will need to choose, from all symbols that contain a given closure, the one to be preferred. In fact, Clara OCR maintains dynamically a partition of the set of closures on "preferred" symbols. This is the ps array. Some manual operations, like fragment merging and symbol disassembling (activated by the context menu on the page tab), change that partition dinamically, as well as some automatic procedures, like the merge step on the OCR run.
| 2.5 Font size |
The font size is important for classifying all book symbols on pattern "types". For instance, books generally use smaller letters for footnotes. This classification is performed automatically by Clara OCR and presented by the "PATTERN (types)" window.
Clara OCR generally uses millimeters for presenting sizes, but we'll soon express sizes in "points". Let's see an example. One inch corresponds to 72.27 printer's point (pt) (The METAFONTBook pg 21, note). So when using 600 dpi, each pt will correspond to 600/72.27 = 8.3 pixels. For 10 point roman characters, Knuth defines the height of lowercase letters as being 155/36 pt, so 35.7 pixels for us. Therefore, to compute the font size (f) from the height in pixels (h) of one lowercase letter, the formula is f = 10*h/35.7.
| 2.6 Symbol alignment |
The vertical alignment of symbols is important for various heuristics. For instance, the vertical line from a broken "p" matches an "l", but using alignement tests we're able to refuse this match.
The current Clara OCR alignment support was developed for the Latin alphabet, and is being adapted for other alphabets. Four vertical alignemnt positions are considered. These positions are referred as usual (ascent, baseline and descent). We use the Knuth's identifier "x_height" to refer the height of lowercase letters without ascenders.
A XXX XXXXXXXXX
XX XX X
XX XX XX
XX XX XX
X XX XXXXX XX XXXXX XX X XXXX
XXX X XXX X XXXXXXXX XX XX
XX XX XX XX XX X XX XX
XX XX XX XX XX XX XXXXXXXXXX
XX XX XX XX XX XX XX
XX XX XX XX XX XX XX
XXX X XXX X XX X XX XX XX
B XX XXXXX XX XXXXX XXXXXXXXX XXXX XXX
XX X
XX X
XX X
D XXXX
A (0) .. ascent (Knuth asc_height)
X (1) .. x_height
B (2) .. baseline
D (3) .. descent (Knuth desc_depth)
|
| 2.7 Words and lines |
Clara OCR applies The concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (as "house" or "peace"), are handled by Clara OCR as sequences of symbols.
It's very important to compute the words of the page. They provide a context both to the OCR and to the reviewer. For instance, if the known symbols of some word were identified as bold, then Clara will automatically make the bold button on when someone tries to review the unknown symbols of that word. The same applies to prefer the recognition of one symbol as the digit "1" instead of the character "l" if the known symbols of the "word" are digits. Words are also the basis for revision based on spelling. Each words is stored on a wdesc structure on the "word" array.
When building the OCR output, Clara will combine words in lines. Each line is a sequence of words (that is, wdesc structures). The array "line" is the sequence of the heads of the detected lines. Each entry of this array is a lndesc structure. The left and right limits of words must be carefully computed and compared in order to the OCR partitionate then in columns, when dealing with multi-column pages.
| 2.8 Acts and transliterations |
The "acts" or "revision acts" are the human interventions for training a symbol, merging a fragment to one symbol, etc.
As the human interventions are the more precious source of information, Clara logs all revision acts, in order to be able to reuse them.
The transliterations are obtained from the revision acts, so each transliteration refers one (or more) revision acts, and also inherits some properties from that act (or those acts).
The acts are on the book scope, and not on the page scope. The acts are stored on the file "acts" on the work directory.
Each act stores some data about the reviewer and also the submission date. As we plan to reuse revision data, each act also stores some data about the "original reviewer" and the "original submission date". These fields are meaningful only for reused acts.
| 2.9 Symbol transliterations |
Clara OCR maintains a list of 0 or more proposed or deduced transliterations for each symbol. Along the OCR process, each transliteration receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling (SPELLING votes).
So the choice of the "best" transliteration is performed through election. Votes are stored on structures of type vdesc, and transliterations are stored on structures of type trdesc. Each symbol stores a pointer for a (possibly empty) list of transliterations and each transliteration stores a pointer for a (possibly empty) list of votes.
So, for instance, when one classifier deduces that one symbol is "a", it generates a "vote" for the transliteration of that symbol to be "a". At the same time, another heuristic could generate another vote for the transliteration to be, say, "o". The diagram illustrates this situation:
sdesc ---> trdesc ("a") ---> trdesc ("o")
| |
+-vdesc + vdesc
|
+-vdesc
|
As the total stored information about one symbol may be large, Clara maintains for each symbol its "transliteration class", used by the heuristics to categorize each symbol and also to test the current transliteration status (is it known? is it dubious?), frequently used along the source code.
| 2.10 Transliteration preference |
The election process used to choose the "best" transliteration for one symbol (from those obtained through human revision or heuristics based on shape similarity or spelling) consists in computing the "preference" of each transliteration and choosing the one with maximum preference.
The transliteration preference is the integer
UTSEAN
|
U is 1 if the transliteration was confirmed by the arbiter, or 0 otherwise.
T is 0 if this transliteration was confirmed by no trusted source, 1 if it was confirmed by some trusted source.
S is 0 if this transliteration was not shape-deduced from trusted input, or 1 if it was shape-deduced from trusted input.
E is 1 if this transliteration was deduced from spelling, or 0 otherwise.
A is 0 if this transliteration was confirmed by no anon source, 1 if it was confirmed by some anon source.
N is 0 if this transliteration was not shape-deduced from anon input, or 1 if it was shape-deduced from anon input.
| 2.11 Transliteration class computing |
Once we have computed the "best" transliteration, we can compute its transliteration class, important for various heuristics. From the transliteration class it's possible test things like "do we know the transliteration of this symbol?" or "is it an alphanumeric character?" or "concerning dimension and vertical alignment could it be an alphanumeric character?", and others.
There are two moments where the transliteration class is computed. The first is when a transliteration is added to the symbol, and the second is when the CHAR class is propagated.
The first uses the following criteria to compute the transliteration class:
1. If the symbol has no transliteration at all, its class is UNDEF.
2. On all other cases, the transliteration with largest preference will be classified as DOT, COMMA, NOISE, ACCENT and others. This search is implemented by the classify_tr function in a straightforward way.
Just before the distribution of all symbols on words we propagate CHARs. All CHAR symbols are searched, and for each one we look its neighbours that seem to compose with it one same word. Such neighbours, if untransliterated, will be classified as SCHARs.
| 2.12 The zones |
Clara OCR allows to create "zones". Zones are usually used to identify one text block in the page. For instance, a page containing two text columns should use one zone to limit each column. The zone limits are given by the "limits" array. The top left is (limits[0],limits[1]) as presented by the figure:
+---------------------------+
| (0,1) (6,7) |
| +-----------+ |
| |this is a | |
| |text block | |
| |identifyed | |
| |by a | |
| |rectangular| |
| |zone. | |
| +-----------+ |
| (2,3) (4,5) |
| |
+---------------------------+
|
| 3. Heuristics |
| 3.1 Skeleton pixels |
The first method implemented by Clara OCR for symbol classification was skeleton fitting. Two symbols are considered similar when each one contains the skeleton of the other.
Clara OCR implements five heuristics to compute skeletons. The heuristic to be used is informed through the command-line option -k as the SA parameter. The value of SA may be 0, 1, 2, 3 or 4.
Heuristics 0, 1 and 2 consider a pixel as being a skeleton pixel if it is the center of a circle inscribed within the closure, and tangent to the pattern boundary in more than one point.
The discrete implementation of this idea is as follows: for each pixel p of the closure, compute the minimum distance d from p to some boundary pixel. Now try to find two pixels on the closure boundary such that the distance from each of them to p does not differ too much from d (must be less than or equal to RR). These pixels are called "BPs".
To make the algorithm faster, the maximum distance from p to the boundary pixels considered is RX. In fact, if there exists a square of size 2*BT+1 centered at p, then p is considered a skeleton pixel.
As this criteria alone produces fat skeletons and isolated skeleton pixels along the closure boundary, two other conditions are imposed: the angular distance between the radiuses from p to each of those two pixels must be "sufficiently large" (larger than MA), and a small path joining these two boundary pixels (built only with boundary pixels) must not exist (the "joined" function computes heuristically the smallest boundary path between the two pixels, and that distance is then compared to MP).
The heuristics 1 and 2 are variants of heuristic 0:
1. (SA = 1) The minimum linear distance between the two BPs is specified as a factor (ML) of the square of the radius. This will avoid the conversion from rectangular to polar coordinates and may save some CPU time, but the results will be slightly different.
2. (SA = 2) No minimum distance checks are performed, but a minimum of MB BPs is required to exist in order to consider the pixel p a skeleton pixel.
The heuristic 3 is very simple. It computes the skeleton removing BT times the boundary.
The heuristic 4 uses "growing lines". For angles varying in steps of approximately 22 degrees, a line of lenght RX pixels is drawn from each pixel. The heuristic check if the line can or cannot be entirely drawn using black pixels. Depending on the results, it decides pixel.uropean languages.
l r
+---+--+-------+
| |
t + XX |
| XX |
| |
| XXX |
| XX |
| XX |
| XX |
| XX |
b + XXXX |
| |
+--------------+
|
| 2.3 The sdesc structure and the mc array |
Each symbol is stored in a sdesc structure. Those structures form the mc array. Once created, a symbol is never deleted. So it's index on the mc array identifies it (this is important for the web-based revision procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never changes. So the symbol bounding box and the total number of black pixels also won't change either.
So two different entries of the mc array never have the same set of closures. The entries of the mc array are created by the new_mc service. When some procedure tries to create a new symbol informing a list of closures for which already exists a symbol, the service new_mc detects it and returns to the caller not the index of a newly created symbol, but the index of that already created one.
| 2.4 The preferred symbols |
One same closure may belong to more than one symbol. This is important in order to allow various heuristic trials. For instance, the left closure of the "u" on the preceding section could be identified as the body of an "i". In this case however we would not find its dot. So the heuristic could try by chance another solution, for instance to join it with the nearest closure (in that case, the right closure of the "u") and try to match it with some pattern of the font.
So the OCR will need to choose, from all symbols that contain a given closure, the one to be preferred. In fact, Clara OCR maintains dynamically a partition of the set of closures on "preferred" symbols. This is the ps array. Some manual operations, like fragment merging and symbol disassembling (activated by the context menu on the page tab), change that partition dinamically, as well as some automatic procedures, like the merge step on the OCR run.
| 2.5 Font size |
The font size is important for classifying all book symbols on pattern "types". For instance, books generally use smaller letters for footnotes. This classification is performed automatically by Clara OCR and presented by the "PATTERN (types)" window.
Clara OCR generally uses millimeters for presenting sizes, but we'll soon express sizes in "points". Let's see an example. One inch corresponds to 72.27 printer's point (pt) (The METAFONTBook pg 21, note). So when using 600 dpi, each pt will correspond to 600/72.27 = 8.3 pixels. For 10 point roman characters, Knuth defines the height of lowercase letters as being 155/36 pt, so 35.7 pixels for us. Therefore, to compute the font size (f) from the height in pixels (h) of one lowercase letter, the formula is f = 10*h/35.7.
| 2.6 Symbol alignment |
The vertical alignment of symbols is important for various heuristics. For instance, the vertical line from a broken "p" matches an "l", but using alignement tests we're able to refuse this match.
The current Clara OCR alignment support was developed for the Latin alphabet, and is being adapted for other alphabets. Four vertical alignemnt positions are considered. These positions are referred as usual (ascent, baseline and descent). We use the Knuth's identifier "x_height" to refer the height of lowercase letters without ascenders.
A XXX XXXXXXXXX
XX XX X
XX XX XX
XX XX XX
X XX XXXXX XX XXXXX XX X XXXX
XXX X XXX X XXXXXXXX XX XX
XX XX XX XX XX X XX XX
XX XX XX XX XX XX XXXXXXXXXX
XX XX XX XX XX XX XX
XX XX XX XX XX XX XX
XXX X XXX X XX X XX XX XX
B XX XXXXX XX XXXXX XXXXXXXXX XXXX XXX
XX X
XX X
XX X
D XXXX
A (0) .. ascent (Knuth asc_height)
X (1) .. x_height
B (2) .. baseline
D (3) .. descent (Knuth desc_depth)
|
| 2.7 Words and lines |
Clara OCR applies The concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (as "house" or "peace"), are handled by Clara OCR as sequences of symbols.
It's very important to compute the words of the page. They provide a context both to the OCR and to the reviewer. For instance, if the known symbols of some word were identified as bold, then Clara will automatically make the bold button on when someone tries to review the unknown symbols of that word. The same applies to prefer the recognition of one symbol as the digit "1" instead of the character "l" if the known symbols of the "word" are digits. Words are also the basis for revision based on spelling. Each words is stored on a wdesc structure on the "word" array.
When building the OCR output, Clara will combine words in lines. Each line is a sequence of words (that is, wdesc structures). The array "line" is the sequence of the heads of the detected lines. Each entry of this array is a lndesc structure. The left and right limits of words must be carefully computed and compared in order to the OCR partitionate then in columns, when dealing with multi-column pages.
| 2.8 Acts and transliterations |
The "acts" or "revision acts" are the human interventions for training a symbol, merging a fragment to one symbol, etc.
As the human interventions are the more precious source of information, Clara logs all revision acts, in order to be able to reuse them.
The transliterations are obtained from the revision acts, so each transliteration refers one (or more) revision acts, and also inherits some properties from that act (or those acts).
The acts are on the book scope, and not on the page scope. The acts are stored on the file "acts" on the work directory.
Each act stores some data about the reviewer and also the submission date. As we plan to reuse revision data, each act also stores some data about the "original reviewer" and the "original submission date". These fields are meaningful only for reused acts.
| 2.9 Symbol transliterations |
Clara OCR maintains a list of 0 or more proposed or deduced transliterations for each symbol. Along the OCR process, each transliteration receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling (SPELLING votes).
So the choice of the "best" transliteration is performed through election. Votes are stored on structures of type vdesc, and transliterations are stored on structures of type trdesc. Each symbol stores a pointer for a (possibly empty) list of transliterations and each transliteration stores a pointer for a (possibly empty) list of votes.
So, for instance, when one classifier deduces that one symbol is "a", it generates a "vote" for the transliteration of that symbol to be "a". At the same time, another heuristic could generate another vote for the transliteration to be, say, "o". The diagram illustrates this situation:
sdesc ---> trdesc ("a") ---> trdesc ("o")
| |
+-vdesc + vdesc
|
+-vdesc
|
As the total stored information about one symbol may be large, Clara maintains for each symbol its "transliteration class", used by the heuristics to categorize each symbol and also to test the current transliteration status (is it known? is it dubious?), frequently used along the source code.
| 2.10 Transliteration preference |
The election process used to choose the "best" transliteration for one symbol (from those obtained through human revision or heuristics based on shape similarity or spelling) consists in computing the "preference" of each transliteration and choosing the one with maximum preference.
The transliteration preference is the integer
UTSEAN
|
U is 1 if the transliteration was confirmed by the arbiter, or 0 otherwise.
T is 0 if this transliteration was confirmed by no trusted source, 1 if it was confirmed by some trusted source.
S is 0 if this transliteration was not shape-deduced from trusted input, or 1 if it was shape-deduced from trusted input.
E is 1 if this transliteration was deduced from spelling, or 0 otherwise.
A is 0 if this transliteration was confirmed by no anon source, 1 if it was confirmed by some anon source.
N is 0 if this transliteration was not shape-deduced from anon input, or 1 if it was shape-deduced from anon input.
| 2.11 Transliteration class computing |
Once we have computed the "best" transliteration, we can compute its transliteration class, important for various heuristics. From the transliteration class it's possible test things like "do we know the transliteration of this symbol?" or "is it an alphanumeric character?" or "concerning dimension and vertical alignment could it be an alphanumeric character?", and others.
There are two moments where the transliteration class is computed. The first is when a transliteration is added to the symbol, and the second is when the CHAR class is propagated.
The first uses the following criteria to compute the transliteration class:
1. If the symbol has no transliteration at all, its class is UNDEF.
2. On all other cases, the transliteration with largest preference will be classified as DOT, COMMA, NOISE, ACCENT and others. This search is implemented by the classify_tr function in a straightforward way.
Just before the distribution of all symbols on words we propagate CHARs. All CHAR symbols are searched, and for each one we look its neighbours that seem to compose with it one same word. Such neighbours, if untransliterated, will be classified as SCHARs.
| 2.12 The zones |
Clara OCR allows to create "zones". Zones are usually used to identify one text block in the page. For instance, a page containing two text columns should use one zone to limit each column. The zone limits are given by the "limits" array. The top left is (limits[0],limits[1]) as presented by the figure:
+---------------------------+
| (0,1) (6,7) |
| +-----------+ |
| |this is a | |
| |text block | |
| |identifyed | |
| |by a | |
| |rectangular| |
| |zone. | |
| +-----------+ |
| (2,3) (4,5) |
| |
+---------------------------+
|
| 3. Heuristics |
| 3.1 Skeleton pixels |
The first method implemented by Clara OCR for symbol classification was skeleton fitting. Two symbols are considered similar when each one contains the skeleton of the other.
Clara OCR implements five heuristics to compute skeletons. The heuristic to be used is informed through the command-line option -k as the SA parameter. The value of SA may be 0, 1, 2, 3 or 4.
Heuristics 0, 1 and 2 consider a pixel as being a skeleton pixel if it is the center of a circle inscribed within the closure, and tangent to the pattern boundary in more than one point.
The discrete implementation of this idea is as follows: for each pixel p of the closure, compute the minimum distance d from p to some boundary pixel. Now try to find two pixels on the closure boundary such that the distance from each of them to p does not differ too much from d (must be less than or equal to RR). These pixels are called "BPs".
To make the algorithm faster, the maximum distance from p to the boundary pixels considered is RX. In fact, if there exists a square of size 2*BT+1 centered at p, then p is considered a skeleton pixel.
As this criteria alone produces fat skeletons and isolated skeleton pixels along the closure boundary, two other conditions are imposed: the angular distance between the radiuses from p to each of those two pixels must be "sufficiently large" (larger than MA), and a small path joining these two boundary pixels (built only with boundary pixels) must not exist (the "joined" function computes heuristically the smallest boundary path between the two pixels, and that distance is then compared to MP).
The heuristics 1 and 2 are variants of heuristic 0:
1. (SA = 1) The minimum linear distance between the two BPs is specified as a factor (ML) of the square of the radius. This will avoid the conversion from rectangular to polar coordinates and may save some CPU time, but the results will be slightly different.
2. (SA = 2) No minimum distance checks are performed, but a minimum of MB BPs is required to exist in order to consider the pixel p a skeleton pixel.
The heuristic 3 is very simple. It computes the skeleton removing BT times the boundary.
The heuristic 4 uses "growing lines". For angles varying in steps of approximately 22 degrees, a line of lenght RX pixels is drawn from each pixel. The heuristic check if the line can or cannot be entirely drawn using black pixels. Depending on the results, it decides pixel.uropean languages.
l r
+---+--+-------+
| |
t + XX |
| XX |
| |
| XXX |
| XX |
| XX |
| XX |
| XX |
b + XXXX |
| |
+--------------+
|
| 2.3 The sdesc structure and the mc array |
Each symbol is stored in a sdesc structure. Those structures form the mc array. Once created, a symbol is never deleted. So it's index on the mc array identifies it (this is important for the web-based revision procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never changes. So the symbol bounding box and the total number of black pixels also won't change either.
So two different entries of the mc array never have the same set of closures. The entries of the mc array are created by the new_mc service. When some procedure tries to create a new symbol informing a list of closures for which already exists a symbol, the service new_mc detects it and returns to the caller not the index of a newly created symbol, but the index of that already created one.
| 2.4 The preferred symbols |
One same closure may belong to more than one symbol. This is important in order to allow various heuristic trials. For instance, the left closure of the "u" on the preceding section could be identified as the body of an "i". In this case however we would not find its dot. So the heuristic could try by chance another solution, for instance to join it with the nearest closure (in that case, the right closure of the "u") and try to match it with some pattern of the font.
So the OCR will need to choose, from all symbols that contain a given closure, the one to be preferred. In fact, Clara OCR maintains dynamically a partition of the set of closures on "preferred" symbols. This is the ps array. Some manual operations, like fragment merging and symbol disassembling (activated by the context menu on the page tab), change that partition dinamically, as well as some automatic procedures, like the merge step on the OCR run.
| 2.5 Font size |
The font size is important for classifying all book symbols on pattern "types". For instance, books generally use smaller letters for footnotes. This classification is performed automatically by Clara OCR and presented by the "PATTERN (types)" window.
Clara OCR generally uses millimeters for presenting sizes, but we'll soon express sizes in "points". Let's see an example. One inch corresponds to 72.27 printer's point (pt) (The METAFONTBook pg 21, note). So when using 600 dpi, each pt will correspond to 600/72.27 = 8.3 pixels. For 10 point roman characters, Knuth defines the height of lowercase letters as being 155/36 pt, so 35.7 pixels for us. Therefore, to compute the font size (f) from the height in pixels (h) of one lowercase letter, the formula is f = 10*h/35.7.
| 2.6 Symbol alignment |
The vertical alignment of symbols is important for various heuristics. For instance, the vertical line from a broken "p" matches an "l", but using alignement tests we're able to refuse this match.
The current Clara OCR alignment support was developed for the Latin alphabet, and is being adapted for other alphabets. Four vertical alignemnt positions are considered. These positions are referred as usual (ascent, baseline and descent). We use the Knuth's identifier "x_height" to refer the height of lowercase letters without ascenders.
A XXX XXXXXXXXX
XX XX X
XX XX XX
XX XX XX
X XX XXXXX XX XXXXX XX X XXXX
XXX X XXX X XXXXXXXX XX XX
XX XX XX XX XX X XX XX
XX XX XX XX XX XX XXXXXXXXXX
XX XX XX XX XX XX XX
XX XX XX XX XX XX XX
XXX X XXX X XX X XX XX XX
B XX XXXXX XX XXXXX XXXXXXXXX XXXX XXX
XX X
XX X
XX X
D XXXX
A (0) .. ascent (Knuth asc_height)
X (1) .. x_height
B (2) .. baseline
D (3) .. descent (Knuth desc_depth)
|
| 2.7 Words and lines |
Clara OCR applies The concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (as "house" or "peace"), are handled by Clara OCR as sequences of symbols.
It's very important to compute the words of the page. They provide a context both to the OCR and to the reviewer. For instance, if the known symbols of some word were identified as bold, then Clara will automatically make the bold button on when someone tries to review the unknown symbols of that word. The same applies to prefer the recognition of one symbol as the digit "1" instead of the character "l" if the known symbols of the "word" are digits. Words are also the basis for revision based on spelling. Each words is stored on a wdesc structure on the "word" array.
When building the OCR output, Clara will combine words in lines. Each line is a sequence of words (that is, wdesc structures). The array "line" is the sequence of the heads of the detected lines. Each entry of this array is a lndesc structure. The left and right limits of words must be carefully computed and compared in order to the OCR partitionate then in columns, when dealing with multi-column pages.
| 2.8 Acts and transliterations |
The "acts" or "revision acts" are the human interventions for training a symbol, merging a fragment to one symbol, etc.
As the human interventions are the more precious source of information, Clara logs all revision acts, in order to be able to reuse them.
The transliterations are obtained from the revision acts, so each transliteration refers one (or more) revision acts, and also inherits some properties from that act (or those acts).
The acts are on the book scope, and not on the page scope. The acts are stored on the file "acts" on the work directory.
Each act stores some data about the reviewer and also the submission date. As we plan to reuse revision data, each act also stores some data about the "original reviewer" and the "original submission date". These fields are meaningful only for reused acts.
| 2.9 Symbol transliterations |
Clara OCR maintains a list of 0 or more proposed or deduced transliterations for each symbol. Along the OCR process, each transliteration receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling (SPELLING votes).
So the choice of the "best" transliteration is performed through election. Votes are stored on structures of type vdesc, and transliterations are stored on structures of type trdesc. Each symbol stores a pointer for a (possibly empty) list of transliterations and each transliteration stores a pointer for a (possibly empty) list of votes.
So, for instance, when one classifier deduces that one symbol is "a", it generates a "vote" for the transliteration of that symbol to be "a". At the same time, another heuristic could generate another vote for the transliteration to be, say, "o". The diagram illustrates this situation:
sdesc ---> trdesc ("a") ---> trdesc ("o")
| |
+-vdesc + vdesc
|
+-vdesc
|
As the total stored information about one symbol may be large, Clara maintains for each symbol its "transliteration class", used by the heuristics to categorize each symbol and also to test the current transliteration status (is it known? is it dubious?), frequently used along the source code.
| 2.10 Transliteration preference |
The election process used to choose the "best" transliteration for one symbol (from those obtained through human revision or heuristics based on shape similarity or spelling) consists in computing the "preference" of each transliteration and choosing the one with maximum preference.
The transliteration preference is the integer
UTSEAN
|
U is 1 if the transliteration was confirmed by the arbiter, or 0 otherwise.
T is 0 if this transliteration was confirmed by no trusted source, 1 if it was confirmed by some trusted source.
S is 0 if this transliteration was not shape-deduced from trusted input, or 1 if it was shape-deduced from trusted input.
E is 1 if this transliteration was deduced from spelling, or 0 otherwise.
A is 0 if this transliteration was confirmed by no anon source, 1 if it was confirmed by some anon source.
N is 0 if this transliteration was not shape-deduced from anon input, or 1 if it was shape-deduced from anon input.
| 2.11 Transliteration class computing |
Once we have computed the "best" transliteration, we can compute its transliteration class, important for various heuristics. From the transliteration class it's possible test things like "do we know the transliteration of this symbol?" or "is it an alphanumeric character?" or "concerning dimension and vertical alignment could it be an alphanumeric character?", and others.
There are two moments where the transliteration class is computed. The first is when a transliteration is added to the symbol, and the second is when the CHAR class is propagated.
The first uses the following criteria to compute the transliteration class:
1. If the symbol has no transliteration at all, its class is UNDEF.
2. On all other cases, the transliteration with largest preference will be classified as DOT, COMMA, NOISE, ACCENT and others. This search is implemented by the classify_tr function in a straightforward way.
Just before the distribution of all symbols on words we propagate CHARs. All CHAR symbols are searched, and for each one we look its neighbours that seem to compose with it one same word. Such neighbours, if untransliterated, will be classified as SCHARs.
| 2.12 The zones |
Clara OCR allows to create "zones". Zones are usually used to identify one text block in the page. For instance, a page containing two text columns should use one zone to limit each column. The zone limits are given by the "limits" array. The top left is (limits[0],limits[1]) as presented by the figure:
+---------------------------+
| (0,1) (6,7) |
| +-----------+ |
| |this is a | |
| |text block | |
| |identifyed | |
| |by a | |
| |rectangular| |
| |zone. | |
| +-----------+ |
| (2,3) (4,5) |
| |
+---------------------------+
|
| 3. Heuristics |
| 3.1 Skeleton pixels |
The first method implemented by Clara OCR for symbol classification was skeleton fitting. Two symbols are considered similar when each one contains the skeleton of the other.
Clara OCR implements five heuristics to compute skeletons. The heuristic to be used is informed through the command-line option -k as the SA parameter. The value of SA may be 0, 1, 2, 3 or 4.
Heuristics 0, 1 and 2 consider a pixel as being a skeleton pixel if it is the center of a circle inscribed within the closure, and tangent to the pattern boundary in more than one point.
The discrete implementation of this idea is as follows: for each pixel p of the closure, compute the minimum distance d from p to some boundary pixel. Now try to find two pixels on the closure boundary such that the distance from each of them to p does not differ too much from d (must be less than or equal to RR). These pixels are called "BPs".
To make the algorithm faster, the maximum distance from p to the boundary pixels considered is RX. In fact, if there exists a square of size 2*BT+1 centered at p, then p is considered a skeleton pixel.
As this criteria alone produces fat skeletons and isolated skeleton pixels along the closure boundary, two other conditions are imposed: the angular distance between the radiuses from p to each of those two pixels must be "sufficiently large" (larger than MA), and a small path joining these two boundary pixels (built only with boundary pixels) must not exist (the "joined" function computes heuristically the smallest boundary path between the two pixels, and that distance is then compared to MP).
The heuristics 1 and 2 are variants of heuristic 0:
1. (SA = 1) The minimum linear distance between the two BPs is specified as a factor (ML) of the square of the radius. This will avoid the conversion from rectangular to polar coordinates and may save some CPU time, but the results will be slightly different.
2. (SA = 2) No minimum distance checks are performed, but a minimum of MB BPs is required to exist in order to consider the pixel p a skeleton pixel.
The heuristic 3 is very simple. It computes the skeleton removing BT times the boundary.
The heuristic 4 uses "growing lines". For angles varying in steps of approximately 22 degrees, a line of lenght RX pixels is drawn from each pixel. The heuristic check if the line can or cannot be entirely drawn using black pixels. Depending on the results, it decides pixel.uropean languages.
l r
+---+--+-------+
| |
t + XX |
| XX |
| |
| XXX |
| XX |
| XX |
| XX |
| XX |
b + XXXX |
| |
+--------------+
|
| 2.3 The sdesc structure and the mc array |
Each symbol is stored in a sdesc structure. Those structures form the mc array. Once created, a symbol is never deleted. So it's index on the mc array identifies it (this is important for the web-based revision procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never changes. So the symbol bounding box and the total number of black pixels also won't change either.
So two different entries of the mc array never have the same set of closures. The entries of the mc array are created by the new_mc service. When some procedure tries to create a new symbol informing a list of closures for which already exists a symbol, the service new_mc detects it and returns to the caller not the index of a newly created symbol, but the index of that already created one.
| 2.4 The preferred symbols |
One same closure may belong to more than one symbol. This is important in order to allow various heuristic trials. For instance, the left closure of the "u" on the preceding section could be identified as the body of an "i". In this case however we would not find its dot. So the heuristic could try by chance another solution, for instance to join it with the nearest closure (in that case, the right closure of the "u") and try to match it with some pattern of the font.
So the OCR will need to choose, from all symbols that contain a given closure, the one to be preferred. In fact, Clara OCR maintains dynamically a partition of the set of closures on "preferred" symbols. This is the ps array. Some manual operations, like fragment merging and symbol disassembling (activated by the context menu on the page tab), change that partition dinamically, as well as some automatic procedures, like the merge step on the OCR run.
| 2.5 Font size |
The font size is important for classifying all book symbols on pattern "types". For instance, books generally use smaller letters for footnotes. This classification is performed automatically by Clara OCR and presented by the "PATTERN (types)" window.
Clara OCR generally uses millimeters for presenting sizes, but we'll soon express sizes in "points". Let's see an example. One inch corresponds to 72.27 printer's point (pt) (The METAFONTBook pg 21, note). So when using 600 dpi, each pt will correspond to 600/72.27 = 8.3 pixels. For 10 point roman characters, Knuth defines the height of lowercase letters as being 155/36 pt, so 35.7 pixels for us. Therefore, to compute the font size (f) from the height in pixels (h) of one lowercase letter, the formula is f = 10*h/35.7.
| 2.6 Symbol alignment |
The vertical alignment of symbols is important for various heuristics. For instance, the vertical line from a broken "p" matches an "l", but using alignement tests we're able to refuse this match.
The current Clara OCR alignment support was developed for the Latin alphabet, and is being adapted for other alphabets. Four vertical alignemnt positions are considered. These positions are referred as usual (ascent, baseline and descent). We use the Knuth's identifier "x_height" to refer the height of lowercase letters without ascenders.
A XXX XXXXXXXXX
XX XX X
XX XX XX
XX XX XX
X XX XXXXX XX XXXXX XX X XXXX
XXX X XXX X XXXXXXXX XX XX
XX XX XX XX XX X XX XX
XX XX XX XX XX XX XXXXXXXXXX
XX XX XX XX XX XX XX
XX XX XX XX XX XX XX
XXX X XXX X XX X XX XX XX
B XX XXXXX XX XXXXX XXXXXXXXX XXXX XXX
XX X
XX X
XX X
D XXXX
A (0) .. ascent (Knuth asc_height)
X (1) .. x_height
B (2) .. baseline
D (3) .. descent (Knuth desc_depth)
|
| 2.7 Words and lines |
Clara OCR applies The concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (as "house" or "peace"), are handled by Clara OCR as sequences of symbols.
It's very important to compute the words of the page. They provide a context both to the OCR and to the reviewer. For instance, if the known symbols of some word were identified as bold, then Clara will automatically make the bold button on when someone tries to review the unknown symbols of that word. The same applies to prefer the recognition of one symbol as the digit "1" instead of the character "l" if the known symbols of the "word" are digits. Words are also the basis for revision based on spelling. Each words is stored on a wdesc structure on the "word" array.
When building the OCR output, Clara will combine words in lines. Each line is a sequence of words (that is, wdesc structures). The array "line" is the sequence of the heads of the detected lines. Each entry of this array is a lndesc structure. The left and right limits of words must be carefully computed and compared in order to the OCR partitionate then in columns, when dealing with multi-column pages.
| 2.8 Acts and transliterations |
The "acts" or "revision acts" are the human interventions for training a symbol, merging a fragment to one symbol, etc.
As the human interventions are the more precious source of information, Clara logs all revision acts, in order to be able to reuse them.
The transliterations are obtained from the revision acts, so each transliteration refers one (or more) revision acts, and also inherits some properties from that act (or those acts).
The acts are on the book scope, and not on the page scope. The acts are stored on the file "acts" on the work directory.
Each act stores some data about the reviewer and also the submission date. As we plan to reuse revision data, each act also stores some data about the "original reviewer" and the "original submission date". These fields are meaningful only for reused acts.
| 2.9 Symbol transliterations |
Clara OCR maintains a list of 0 or more proposed or deduced transliterations for each symbol. Along the OCR process, each transliteration receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling (SPELLING votes).
So the choice of the "best" transliteration is performed through election. Votes are stored on structures of type vdesc, and transliterations are stored on structures of type trdesc. Each symbol stores a pointer for a (possibly empty) list of transliterations and each transliteration stores a pointer for a (possibly empty) list of votes.
So, for instance, when one classifier deduces that one symbol is "a", it generates a "vote" for the transliteration of that symbol to be "a". At the same time, another heuristic could generate another vote for the transliteration to be, say, "o". The diagram illustrates this situation:
sdesc ---> trdesc ("a") ---> trdesc ("o")
| |
+-vdesc + vdesc
|
+-vdesc
|
As the total stored information about one symbol may be large, Clara maintains for each symbol its "transliteration class", used by the heuristics to categorize each symbol and also to test the current transliteration status (is it known? is it dubious?), frequently used along the source code.
| 2.10 Transliteration preference |
The election process used to choose the "best" transliteration for one symbol (from those obtained through human revision or heuristics based on shape similarity or spelling) consists in computing the "preference" of each transliteration and choosing the one with maximum preference.
The transliteration preference is the integer
UTSEAN
|
U is 1 if the transliteration was confirmed by the arbiter, or 0 otherwise.
T is 0 if this transliteration was confirmed by no trusted source, 1 if it was confirmed by some trusted source.
S is 0 if this transliteration was not shape-deduced from trusted input, or 1 if it was shape-deduced from trusted input.
E is 1 if this transliteration was deduced from spelling, or 0 otherwise.
A is 0 if this transliteration was confirmed by no anon source, 1 if it was confirmed by some anon source.
N is 0 if this transliteration was not shape-deduced from anon input, or 1 if it was shape-deduced from anon input.
| 2.11 Transliteration class computing |
Once we have computed the "best" transliteration, we can compute its transliteration class, important for various heuristics. From the transliteration class it's possible test things like "do we know the transliteration of this symbol?" or "is it an alphanumeric character?" or "concerning dimension and vertical alignment could it be an alphanumeric character?", and others.
There are two moments where the transliteration class is computed. The first is when a transliteration is added to the symbol, and the second is when the CHAR class is propagated.
The first uses the following criteria to compute the transliteration class:
1. If the symbol has no transliteration at all, its class is UNDEF.
2. On all other cases, the transliteration with largest preference will be classified as DOT, COMMA, NOISE, ACCENT and others. This search is implemented by the classify_tr function in a straightforward way.
Just before the distribution of all symbols on w