Meeting interesting characters
(Document analysis at DFKI - Thomas Pötter)
Summary
The presentation deals with different methods for optical character
recognition (OCR). The most important methods are:
- Shape-oriented recognition: The inner and outer boundaries
of characters are detected and saved as vectors or splines. This
detection method requires several passes with relaxations in
order to eliminate accidental roughness due to errors in the scanning
process or due to letters of bad quality in the original document.
On the other hand, the relaxation must not go so far that important
features of a character disappear and are lost to the rest of the recognition
process. This method is particularly useful for recognizing characters
at different sizes and for new characters because it captures
all of a character's characteristics. Its drawbacks are that it is
slow and hard to implement.
- Crossing counts ("Winkelschnittanalyse"): The character
is intersected with several parallel lines. For each line the
number of intersections is recorded. This is repeated with sets of
parallel lines at different angles (usually 8 different
angles). This method is faster and can likewise recognize characters at
any size. Italics and obliques are problematic.
- Neural network based on pixmap (perceptron): The pixel matrices
of the original characters are learned by the neural network.
Characters to be recognized are presented to this neural network as
pixmaps. Noisy or blurred characters are easily recognized if most
of a character's pixels remain in place. Characters of a different size,
or stretched in just one direction (along the x- or y-axis), have to be
scaled back to the original size of the pixmap. This method is easy to
implement but works well only for printed characters at a uniform size.
- Hybrid methods: Features such as the position of stems,
round corners and line ends, or data determined by the crossing counts
method, are taken as input for a neural network.
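The crossing counts method described above can be illustrated with a minimal sketch. It assumes a binary pixel matrix and, for brevity, uses only horizontal and vertical scan lines instead of the usual 8 angles; the names count_crossings, crossing_counts and LETTER_O are hypothetical and not taken from the presentation.

```python
def count_crossings(line):
    """Count how often a scan line enters ink (0 -> 1 transitions)."""
    crossings = 0
    previous = 0  # everything outside the bitmap counts as background
    for pixel in line:
        if pixel == 1 and previous == 0:
            crossings += 1
        previous = pixel
    return crossings


def crossing_counts(bitmap):
    """Return crossing counts for all horizontal and all vertical scan lines.

    bitmap is a list of equally long rows of 0 (background) and 1 (ink).
    """
    rows = [count_crossings(row) for row in bitmap]
    columns = [count_crossings(column) for column in zip(*bitmap)]
    return rows, columns


# Hypothetical 5x5 glyph resembling the letter "O".
LETTER_O = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]

rows, columns = crossing_counts(LETTER_O)
print(rows)     # [1, 2, 2, 2, 1]
print(columns)  # [1, 2, 2, 2, 1]
```

The resulting count vectors could serve as the feature vector for a classifier, for example as the input of a neural network in a hybrid method.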
Hand-written text: Deskewing is required to cope with slanted
handwriting, and it is difficult to break up words into characters.
The variation in how characters are written is the hardest problem. Notepad
computers record the whole process of writing and often require that the
user writes each character into a separate bounding box provided
by the system to facilitate OCR.
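One common way to correct slanted handwriting is a horizontal shear. The sketch below is an illustration under that assumption, not necessarily the deskewing method shown in the presentation. The function shear_rows and its slant parameter are hypothetical, and estimating the slant from the image is not shown.

```python
def shear_rows(bitmap, slant):
    """Undo a rightward slant by shifting each row to the left.

    bitmap is a list of equally long rows of 0 (background) and 1 (ink).
    slant is the horizontal drift in pixels per row of height
    (hypothetical parameter, assumed to be known already).
    """
    height = len(bitmap)
    width = len(bitmap[0])
    result = []
    for y, row in enumerate(bitmap):
        # Rows further above the bottom row lean further to the right,
        # so they are shifted further back to the left.
        shift = round(slant * (height - 1 - y))
        new_row = [0] * width
        for x, pixel in enumerate(row):
            target = x - shift
            if pixel and 0 <= target < width:
                new_row[target] = 1
        result.append(new_row)
    return result


# Hypothetical 4x4 stroke leaning to the right by one pixel per row.
SLANTED_STROKE = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]

upright = shear_rows(SLANTED_STROKE, slant=1.0)
for row in upright:
    print(row)  # each row prints [1, 0, 0, 0]
```

After the shear the stroke is vertical, which makes it easier to separate neighbouring characters with vertical cuts.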
Stages in text processing (at DFKI):
- Segmentation: pre-classifying objects as graphic or text (lines,
words, characters).
- Recognition/Classification: Characters are recognized. For
each character, several alternatives are provided together with
certainty factors, which can be viewed as probabilities for each
alternative.
- Post-classification: Taking advantage of the graphical and
textual environment to cut down the number of alternatives and
to calculate better certainty factors. The graphical information
used is the position of a character relative to the halfline, capline and
baseline. That way, look-alike characters such as
, ' c C o O x X s S v V w W p P can be classified properly.
Data from the textual environment is used to determine the syllables
to which a character might belong, to determine the correct punctuation
marks and the spacing between words, and to perform
a dictionary lookup to make sure that all the characters of a
word are recognized properly. Interesting concepts in this context
are confusion matrices and the calculation of the smallest edit
distance between the recognized word (possibly misspelled) and
the words in the dictionary.
- Text analysis phase: E.g. classification of a business letter
as invoice, order, inquiry, etc. Data such as sender, subject, date,
ordered products and related letters are determined. This information
is stored in a database. Additional functionality is to "understand"
the text, to answer queries automatically and to take the further
steps necessary to process orders and invoices.
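The dictionary lookup from the post-classification stage can be sketched with the standard dynamic-programming edit distance. This is a minimal illustration, not the DFKI implementation. The CONFUSABLE table stands in for a confusion matrix; its entries and the cost of 0.5 are made-up assumptions, and all function names are hypothetical.

```python
def unit_cost(x, y):
    """Default substitution cost: 0 for equal characters, 1 otherwise."""
    return 0 if x == y else 1


def edit_distance(a, b, substitution_cost=unit_cost):
    """Smallest edit distance between the strings a and b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,      # delete char_a
                current[j - 1] + 1,   # insert char_b
                previous[j - 1] + substitution_cost(char_a, char_b),
            ))
        previous = current
    return previous[-1]


def closest_word(recognized, dictionary, substitution_cost=unit_cost):
    """Return the dictionary word with the smallest edit distance."""
    return min(
        dictionary,
        key=lambda word: edit_distance(recognized, word, substitution_cost),
    )


# Hypothetical confusion data: (recognized, intended) pairs that an OCR
# engine is assumed to mix up often, so substituting them is cheaper.
CONFUSABLE = {("1", "l"), ("0", "o"), ("5", "s")}


def confusion_cost(x, y):
    if x == y:
        return 0
    return 0.5 if (x, y) in CONFUSABLE else 1


words = ["invoice", "order", "inquiry"]
print(edit_distance("kitten", "sitting"))                    # 3
print(closest_word("inv0ice", words))                        # invoice
print(edit_distance("inv0ice", "invoice", confusion_cost))   # 0.5
```

Lowering the substitution cost for frequently confused pairs lets a likely OCR error such as "0" for "o" count as a smaller deviation than an arbitrary misspelling.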
Applications:
- Archiving business letters, faxes, e-mails, etc. consistently
- Office automation: answering inquiries automatically and doing
automatic processing depending on the nature of the letter.
- Processing filled-in forms automatically (for use in the
tax office or other public bodies)
- Classifying receipts to automate the creation of balance sheets
in business.
- As a substitute for typing written information into the computer.