Meeting interesting characters

(Document analysis at DFKI - Thomas Pötter)

Summary

The presentation deals with different methods for optical character recognition (OCR). The most important methods are:

Hand-written text: Deskewing is required to cope with slanted handwriting, and it is difficult to break words up into characters. The hardest problem is the variation in character shapes. Notepad computers therefore record the whole process of writing and often require the user to write each character into a separate bounding box provided by the system, which facilitates OCR.

Stages in text processing (at DFKI):

  1. Segmentation: pre-classifying objects as graphic or text (lines, words, characters).
  2. Recognition/Classification: Characters are recognized. For each character, several alternatives are provided together with certainty factors, which can be viewed as probabilities for each alternative.
  3. Post-classification: Taking advantage of the graphical and textual environment to cut down the number of alternatives and to calculate better certainty factors. The graphical information used is the position of a character relative to the halfline, capline and baseline. That way, characters such as , ' c C o O x X s S v V w W p P, whose shapes are alike and differ mainly in size or vertical position, can be classified properly. Data from the textual environment is used to determine the syllables to which a character might belong, to determine the correct punctuation marks and the spacing between words, and to perform a dictionary lookup that checks whether all the characters of a word are recognized properly. Interesting concepts in this context are confusion matrices and the calculation of the smallest edit distance between the recognized (possibly misspelled) word and the words in the dictionary.
  4. Text analysis phase: e.g. classification of a business letter as an invoice, order, inquiry, etc. Data such as sender, subject, date, ordered products and related letters are determined and stored in a database. Additional functionality is to "understand" the text, to answer queries automatically, and to take the further steps necessary to process orders and invoices.
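
The dictionary lookup in the post-classification stage can be illustrated with a small sketch. It is not the DFKI implementation: it assumes the plain Levenshtein distance with unit costs for insertion, deletion and substitution, and the names edit_distance and closest_words as well as the sample word list are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b.
    Unit costs are an assumption of this sketch."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_words(recognized: str, dictionary: list[str]) -> list[str]:
    """Return every dictionary word at the smallest edit distance
    from the recognized (possibly misspelled) word."""
    if not dictionary:
        return []
    distances = {word: edit_distance(recognized, word) for word in dictionary}
    best = min(distances.values())
    return [word for word in dictionary if distances[word] == best]


# Hypothetical sample data: the recognizer confused the final "e" with "c".
words = ["invoice", "inquiry", "order"]
print(closest_words("invoicc", words))
```

In a fuller system the unit substitution cost could be replaced by a cost derived from a confusion matrix, so that frequently confused character pairs count as cheaper substitutions than unlikely ones.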

Applications:

  1. Archiving business letters, faxes, e-mails, etc. consistently
  2. Office automation: answering inquiries automatically and doing automatic processing depending on the nature of the letter.
  3. Processing filled-in forms automatically (for use in the tax office or other public bodies)
  4. Classifying receipts to automate the creation of balance sheets in business.
  5. As a substitute for typing written information into the computer.


