Information Extraction
Our system for information extraction and information acquisition is among the most flexible available: processing operates at the level of content, that is, the meaning of texts. Differences in the wording or formatting of texts can therefore be abstracted away easily. As a result, extraction rules are more generally valid and rarely need to be adapted, even when the data from which information is extracted changes.
Our software tools can extract the following types of information from most kinds of textual documents - Text, RTF, HTML, SGML, XML, PDF, PostScript:
- Addresses (contact persons, job candidates, persons responsible for web servers)
- Product information (e.g. product name, product description, list of properties / features, price, availability). We use this within our competitor surveillance system FirmWatch.com.
- Automatic gathering of information about an individual's linguistic preferences and writing style, and automatic learning of linguistic information to extend lexica.
- Automatic learning from parallel texts by comparing existing translations of the same texts in several languages.
- Recognition, classification and intelligent distribution of relevant messages (news, e-mail, mailing list contributions, internet news)
- Information extraction and meta search engines for all kinds of Internet information, especially value-added services built on existing search engines, online shops, and information services.
We work with modern, highly flexible declarative systems that combine rules with a probabilistic model. This combination allows data to be assigned optimally to the information slots that need to be filled. The extraction of addresses, for example, poses the following major difficulties for classical information extraction approaches:
- The German names "Rudolf", "Dieter" and "Thomas" may each be either a first name or a surname. Each name is a first name with probability x and a surname with probability y. Comparing these probabilities for both names determines whether "Rudolf Dieter" or "Dieter Rudolf" is the better choice.
- Addresses may be split across several lines or columns of a table.
- Street and town names often cannot be distinguished reliably from delivery instructions.
- Web sites often contain only fragments of an address. One web site may show only the general mailing address, whilst others include the names of the contact persons. Occasionally, when a company has several departments or affiliated firms, the street and house number of the contact person are displayed as well.
- Often the entire address or parts of it are present only as graphics (e.g. company logos bearing the company name). Because of the graphical layout and aesthetic modification of the names (e.g. DFB (German football association) or VW (Volkswagen)), even the most advanced text recognition systems often fail to identify the underlying characters. This problem can be mitigated by extracting the information elements that are available in textual form and assigning them robustly to the appropriate information slots. Information that is present only in graphical form is inferred from other sources, such as other web pages or web domain names.
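The first-name / surname ambiguity described above can be illustrated with a minimal sketch. It is not the actual implementation: the probability table `NAME_PROBS`, its values, and the function `assign_roles` are hypothetical and chosen only for illustration. Each name carries a probability x of being a first name and a probability y of being a surname, and the reading with the higher combined score wins.

```python
# Hypothetical (x, y) pairs: x = probability of being a first name,
# y = probability of being a surname. The values are invented for illustration.
NAME_PROBS = {
    "Rudolf": (0.90, 0.10),
    "Dieter": (0.80, 0.20),
    "Thomas": (0.60, 0.40),
}

# Assumption: names missing from the table are treated as equally likely either way.
DEFAULT_PROBS = (0.50, 0.50)


def assign_roles(name_a: str, name_b: str) -> dict:
    """Decide which of two names is the first name and which is the surname."""
    first_a, sur_a = NAME_PROBS.get(name_a, DEFAULT_PROBS)
    first_b, sur_b = NAME_PROBS.get(name_b, DEFAULT_PROBS)

    # Score of the reading "name_a is the first name, name_b is the surname".
    score_ab = first_a * sur_b
    # Score of the opposite reading.
    score_ba = first_b * sur_a

    if score_ab >= score_ba:
        return {"first_name": name_a, "surname": name_b}
    return {"first_name": name_b, "surname": name_a}


print(assign_roles("Dieter", "Rudolf"))
```

With these invented values, "Rudolf" is chosen as the first name regardless of the order in which the two names appear in the text. A real system would combine such scores with rules about position, titles and surrounding context.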
Our novel approaches offer solutions to the problems listed above.
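As a further illustration, the problem of addresses that appear only in fragments on different web pages can be sketched as slot filling with confidence values. The sketch below is an assumption about how such a merge could look, not a description of the product: the slot names, the confidence numbers, the example company and the function `merge_fragments` are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

# A fragment maps a slot name to a (value, confidence) pair.
Fragment = Dict[str, Tuple[str, float]]


def merge_fragments(fragments: Iterable[Fragment]) -> Dict[str, str]:
    """Fill each slot with the candidate value that has the highest confidence."""
    best: Dict[str, Tuple[str, float]] = {}
    for fragment in fragments:
        for slot, (value, confidence) in fragment.items():
            if slot not in best or confidence > best[slot][1]:
                best[slot] = (value, confidence)
    return {slot: value for slot, (value, _) in best.items()}


# Hypothetical evidence gathered from three different sources.
main_page = {"company": ("Example GmbH", 0.9), "town": ("Ludwigshafen", 0.8)}
contact_page = {"contact_person": ("Rudolf Dieter", 0.7), "street": ("Hauptstr. 1", 0.6)}
# A company name guessed from the domain name only, hence the low confidence.
domain_guess = {"company": ("example", 0.3)}

address = merge_fragments([main_page, contact_page, domain_guess])
print(address)
```

Keeping a confidence value per candidate means that a weak guess, for instance one derived from a domain name, fills a slot only when no better textual evidence is available.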
Our particular strengths in content extraction are:
- Flexible, synergetic combination of several basic approaches to information extraction
- More general and abstract extraction approaches that tolerate differences, inconsistencies or changes in the underlying data, or require only minor, cost-efficient adaptations
- Application of language understanding technology to content extraction for increased robustness, accuracy and a higher level of abstraction. Synonyms, the relations between sub- and superordinate concepts, and the linguistic connections between words are taken into account automatically. We use this technology in our automatic intelligent question answering software FactMind.
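How synonyms and sub- and superordinate concepts can make extraction rules more general is shown in the following minimal sketch. It assumes a tiny hand-written lexicon; the dictionaries `SYNONYMS` and `HYPERNYMS` and the functions `canonical` and `is_a` are hypothetical and serve only as an illustration. A rule written for the concept "computer" then also matches a text that mentions a "notebook".

```python
from typing import Optional

# Hypothetical mini lexicon: each word maps to a canonical concept.
SYNONYMS = {
    "cost": "price",
    "costs": "price",
    "notebook": "laptop",
}

# Hypothetical concept hierarchy: subordinate concept -> superordinate concept.
HYPERNYMS = {
    "laptop": "computer",
    "computer": "product",
}


def canonical(word: str) -> str:
    """Map a word to its canonical concept, ignoring case."""
    lowered = word.lower()
    return SYNONYMS.get(lowered, lowered)


def is_a(word: str, concept: str) -> bool:
    """Return True if `word` denotes `concept` or one of its subordinate concepts."""
    target = canonical(concept)
    current: Optional[str] = canonical(word)
    seen = set()  # guards against accidental cycles in the hierarchy
    while current is not None and current not in seen:
        if current == target:
            return True
        seen.add(current)
        current = HYPERNYMS.get(current)
    return False


print(is_a("Notebook", "computer"))
print(is_a("computer", "laptop"))
```

The relation is deliberately one-directional: a notebook counts as a computer, but a rule that asks for a laptop does not match the more general word "computer".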
Address
Compris Intelligence GmbH
Rheingönheimer Str. 79
67065 Ludwigshafen am Rhein
Germany
phone: (+49) 0700-COMPRISTel (0700-26677478)
fax: (+49) 0700-COMPRISFax (0700-26677473)
Internet: www.FactMiner.com
E-Mail: products@compris.com
Information & Questions: products@compris.com