Language Independence
The core concept matching of Autonomy's technology does not depend on an intimate knowledge of the grammatical structure of English or of any other particular language. It treats words as abstract symbols of meaning, deriving its understanding from the context in which they occur rather than from a rigid definition of grammar. Slang and other variations in language therefore do not confuse the software. The IDOL Server already has a statistical understanding of the patterns of 'typical' English, but the engine can be trained on the patterns of any language (German, Spanish, Portuguese, Arabic, Italian, French, Japanese, Norwegian, etc.).
Issues
As companies realize the value of presenting their sites to global markets, the dominance of the English language is set to decline. Companies increasingly understand that they need to approach customers in their native language, and large multinationals can broaden the knowledge available across the enterprise in all languages.
Whether you are implementing a portal site, an E-Commerce service or a corporate Knowledge Management system, you will be faced with the issue of internationalization.
This White Paper describes the language-independent design of Autonomy's technology, which allows Autonomy to fully support a wide variety of languages worldwide, providing benefits such as:
- An international exchange of expertise
- Access to your global information assets
- Growth by reaching new market opportunities
Definitions
Introduction
Language can be defined as 'the use by human beings of spoken and written symbols in organized combinations and patterns in order to express and communicate thoughts and feelings.'
As this definition implies, language uses set patterns to express the abstract notions of knowledge and information. To fully realize the potential of stored knowledge bases worldwide, the knowledge itself needs to be shared among its global user base, regardless of the language in which it was originally presented.
Most enterprises now need to manage content in more than one language. Solutions that are independent of language constructs are therefore of the utmost importance and value. Companies that open branch offices in new markets and countries can no longer afford the extra cost of using or providing new or existing information in another language. Autonomy's technology and architecture are well positioned for this: they are designed to be completely modular, so that content exploitation can scale to any language quickly, both today and in the future.
Key Factors
Internationalization is the process of developing a program core whose features and code design make no assumptions based on the characteristics of a single language.
When developing software that depends on information, the key questions are:
- Does the core algorithm make any assumptions about language constructs, digital representation of symbols, etc.?
- Does the core algorithm depend on the constructs of a particular language? That is, is any major system redesign and development required in order to support a new language?
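As a small illustration of the first question, the Python sketch below contrasts a tokenizer that silently assumes unaccented Latin letters with one that relies on Unicode character classes. It is a generic example, not Autonomy code, and both function names are hypothetical.

```python
import re

def tokenize_ascii_only(text: str) -> list[str]:
    # Hidden assumption: words consist only of the letters A-Z and a-z.
    return re.findall(r"[A-Za-z]+", text)

def tokenize_unicode(text: str) -> list[str]:
    # For str patterns in Python 3, \w matches letters and digits of any script.
    return re.findall(r"\w+", text)

sample = "na\u00efve caf\u00e9"  # "naive cafe" written with accented letters
print(tokenize_ascii_only(sample))  # accented letters split or truncate the words
print(tokenize_unicode(sample))     # words stay intact
```

The first function needs to be redesigned for every new alphabet, which is the kind of dependency the questions above are meant to expose.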
Autonomy's Approach to Language
The Dynamic Reasoning Engine™ is based on advanced pattern-matching technology (non-linear adaptive digital signal processing) that exploits high-performance probabilistic modeling techniques to extract a document's digital essence and determine the characteristics that give the text meaning. Because this technology is based on probabilistic modeling, it does not use any form of language-dependent parsing or dictionaries. The IDOL Server treats words as abstract symbols of meaning, deriving its understanding from the context of their occurrence rather than from a rigid definition of the language's grammar.
The IDOL Server develops a statistical understanding of the patterns that occur in the content that it sees over time. The more information the IDOL Server has about a particular type of information (e.g. legal terms, pharmaceutical developments, technology, etc.) the more understanding it will have of those topics. A new language can be thought of as simply another 'type' of information that the IDOL Server needs enough material to learn from. Therefore, it is possible to mix more than one language in an IDOL Server as long as the amounts for each language are sufficient to build its understanding.
The choice of language does not compromise the accuracy of the concepts extracted by the IDOL Server. The underlying algorithm is the same regardless of the language used.
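To make the idea of treating words as abstract symbols more concrete, the sketch below compares documents purely by the statistics of their tokens, with no grammar or dictionary. It uses textbook TF-IDF weighting and cosine similarity as a stand-in; this is not the Dynamic Reasoning Engine's actual algorithm, and all names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    # Words are opaque symbols: lower-cased runs of word characters.
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    counts = [Counter(tokens(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each symbol occurs.
    df = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency, so rare symbols weigh more.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "the court ruled on the patent dispute",
    "el tribunal falló sobre la disputa de patentes",
    "the judge ruled the patent invalid",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]))  # the two English documents share symbols
print(cosine(vecs[0], vecs[1]))  # no shared symbols across the two languages
```

Nothing in the code is specific to English: the same functions run unchanged on the Spanish sentence, which is the sense in which a purely statistical method can be language independent.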
Autonomy's Use of Language Dependent Features
Although Autonomy's technology is totally language independent, it is often beneficial to use language-dependent features to increase performance. Autonomy offers the following optional features for optimizing your system:
- Stop lists: every language has 'empty words' that do not carry any significant meaning. In grammatical terms these are normally prepositions, conjunctions, auxiliary verbs, etc.; in English, for example, words such as 'the', 'a', 'and' and 'to'. These words can be safely ignored when processing content.
- Stemming: in most languages certain variations of a word can be stripped to obtain the main stem of the word. In English, for example, the words 'run', 'runner' and 'running' can all be stripped down to the stem 'run' without much loss of meaning. Stemming rules can be safely used when processing text in order to obtain a list of unique words.
Autonomy does not require the use of stop lists or stemming rules, as the statistical analysis would normally determine the importance and relationship of those words. However, an initial configuration of a stop list and stemming rules allows the IDOL Server to ignore empty words and treat a set of words as one so that storage resources and processing time can be reduced.
Autonomy provides as standard a set of stop lists and stemming algorithms for most commonly used languages.
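The effect of a stop list and stemming rules on the set of stored terms can be sketched as follows. The stop list and the suffix rules here are deliberately tiny and naive, chosen only to reproduce the 'run' example; they are assumptions for illustration and not the lists or algorithms that Autonomy ships.

```python
import re

# Tiny illustrative stop list; real lists are per-language and much longer.
ENGLISH_STOP_WORDS = {"the", "a", "and", "to", "of", "is"}

def stem(word: str) -> str:
    # Naive suffix stripping, for illustration only.
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            base = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if base[-1] == base[-2] and base[-1] not in "aeiou":
                base = base[:-1]
            return base
    return word

def index_terms(text: str) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    kept = (stem(w) for w in words if w not in ENGLISH_STOP_WORDS)
    # Unique stems only: fewer terms to store and process.
    return sorted(set(kept))

print(index_terms("The runner is running to the run"))
```

Seven input words collapse to a single stored term, which is the storage and processing saving described above. A crude rule set like this one will also mangle some words, which is one reason production systems use carefully built stemming algorithms for each language.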
- Transliteration schemes: transliteration is the representation of letters or words in the corresponding characters of another alphabet. Some languages make use of transliteration schemes so that people can write text without a special keyboard that supports the original alphabet. Autonomy supports most transliteration schemes used in languages such as Greek, Russian, etc.
- Upper/lower case character matching: a word means the same thing whether it is written in upper or lower case. Casing is used only to ease readability and to mark the type of information (e.g. proper names, the start of a paragraph, etc.). It is important to consider the different possible cases of characters so that words can be matched regardless of their case. This feature is called case insensitivity. The upper/lower case characters vary depending on the language, and some languages have no notion of case at all.
- Canonicalization of characters: some languages may have more than one way of representing a character. For example, in Japanese the katakana script can be written in full-width or half-width characters. Regardless of its width, the character itself carries the same meaning. The same applies to digits and letters of the Roman alphabet, which in double-byte languages can also be represented as double-byte (full-width) characters. Autonomy products make sure that all forms are canonicalized into one single form so that they are treated equally.
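Case-insensitive matching and width canonicalization can be approximated with the Unicode facilities in Python's standard library, as in the sketch below. It uses Unicode NFKC normalization and case folding as a stand-in; the paper does not say which canonical form Autonomy products use, and the function name is hypothetical.

```python
import unicodedata

def canonicalize(text: str) -> str:
    # NFKC folds compatibility variants (half-width katakana, full-width
    # Latin letters and digits) into one form.
    normalized = unicodedata.normalize("NFKC", text)
    # casefold() is a stronger lower(); scripts without case are left unchanged.
    return normalized.casefold()

half_width = "\uff76\uff80\uff76\uff85"  # "katakana" in half-width katakana
full_width = "\u30ab\u30bf\u30ab\u30ca"  # the same word in full-width katakana
wide_latin = "\uff29\uff24\uff2f\uff2c\uff11\uff12\uff13"  # full-width "IDOL123"

print(canonicalize(half_width) == canonicalize(full_width))
print(canonicalize(wide_latin) == canonicalize("idol123"))
```

Applying the same function to both indexed content and queries means that variants differing only in case or width compare as equal.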
Benefits of Autonomy's Technology
Global Implementation
More and more content is becoming available in a growing number of languages, and an increasing number of users speak a language other than English. This is particularly true as more companies around the world put their information online, use it there, and start to conduct E-Business. To succeed with a global internationalization strategy, enterprises need to impose rigorous requirements on the underlying technology and demand industry-leading functionality and performance.
Autonomy's modular technology and architecture are designed to meet these requirements, scaling content exploitation to any language both today and in the future.
Expansion to Cross-Lingual Systems
Autonomy's core technology can be used to set up cross-lingual systems. For example, a user looking at a document in English can be offered similar information in both English and Spanish.
The Dynamic Reasoning Engine™ can be used to establish a correlation between two or more languages. To achieve this, a training dataset is needed in which each document is expressed in all of the required languages. Ideally each extract will be a direct translation of the others. Electronic dictionaries can also be used successfully for this purpose. The cross-lingual content can be indexed into an IDOL Server database, which gives the engine a general understanding of the concepts involved in the languages. Once this cross-lingual database is set up, the IDOL Server can correlate terms across languages, making it possible to retrieve content in more than one language at a time.
Single Language Case
When the IDOL™ Server has aggregated data in one language it has a conceptual understanding of that content in that particular language.
When the IDOL™ Server looks for query results or suggestions of related documents, it uses the concept of the query or document to look for best matching answers. The results will therefore be in the same language.
In the following diagram the IDOL™ Server has N databases, all of which are in the same language.
Figure 1: Operation on a single language system
When you query the system in English on an English Database, the IDOL™ Server simply takes the English concept in the query and matches it to any relevant concepts found in the English content. The results are therefore in English. In this case, if a Spanish query were made, you would not get any results, because the terms in the IDOL™ Server are in English and there are no Spanish concepts to match.
Multi-Language Case
In order to give the IDOL™ Server the capability to automatically understand concepts in more than one language (e.g., English and Spanish), we pre-index a special multi-language database containing general data (e.g., encyclopaedia data and general world wide news) in both languages. This database is simply for training purposes, and it does not need to contain the documents that you will eventually be querying on.
Each one of the "training documents" contains plain text in both languages, each being a direct translation of the other.
This multi-language database gives the engine a general understanding of a wide range of concepts in both languages. The engine can then use this special multi-language database internally when dealing with queries and suggestions.
In the following diagram the IDOL™ Server has N databases, each of which is in one of the languages contained in the multi-language database.
Figure 2: Operation on a multi-language system
The IDOL™ Server first looks for concepts that closely match the query in the multi-language database, giving the engine the concepts in both languages. It can then use those concepts to carry on with the original query. In this way, a query in one language can automatically yield results in both languages.
For example, you can use an English sentence to query a database with data in Spanish, and vice versa.
Please note that this method does not use direct translation of keywords; it uses translation of general concepts.
For example, if you are querying a Spanish Database:
Figure 3: Example with English/Spanish
The IDOL™ Server takes the English Query and looks for matching concepts in the English-Spanish Database. The IDOL™ Server will match the concepts in English, but because each document in the English-Spanish Database is in both languages, the concepts obtained are in both English and Spanish.
If you then query the Spanish Database, the Spanish concepts obtained above will find relevant documents in the Spanish Database.
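The two-step flow described above can be sketched with a toy parallel training set. This is only an illustration under strong simplifying assumptions: concepts are reduced to sets of word tokens, matching is plain token overlap, and the training pairs, database contents and function names are all hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

# Hypothetical training documents: each holds the same text in both languages.
TRAINING = [
    {"en": "the central bank raised interest rates",
     "es": "el banco central subió los tipos de interés"},
    {"en": "the football team won the cup final",
     "es": "el equipo de fútbol ganó la final de la copa"},
]

# Hypothetical Spanish-only database that the user actually wants to search.
SPANISH_DB = [
    "los tipos de interés afectan a las hipotecas",
    "la copa del mundo empieza en junio",
]

def expand_query(query: str, source: str = "en", target: str = "es") -> set[str]:
    """Step 1: match the query against the training set, return target-language terms."""
    q = tokens(query)
    best = max(TRAINING, key=lambda pair: len(q & tokens(pair[source])))
    if not q & tokens(best[source]):
        return set()  # nothing in the training set resembles the query
    return tokens(best[target])

def search(terms: set[str], database: list[str]) -> list[str]:
    """Step 2: rank database documents by overlap with the given terms."""
    scored = [(len(terms & tokens(doc)), doc) for doc in database]
    return [doc for score, doc in sorted(scored, key=lambda s: -s[0]) if score > 0]

print(search(tokens("interest rates"), SPANISH_DB))        # English terms alone
print(search(expand_query("interest rates"), SPANISH_DB))  # via the training set
```

The English terms alone match nothing in the Spanish database, while the terms obtained through the training pair retrieve the document about interest rates. No keyword is translated one-to-one: the whole Spanish side of the matching training document is carried over.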
Languages Supported
Single Byte/Double Byte Languages (SBDB)
In computers, the characters of a language are typically encoded in one of the following ways:
- Single-byte: one byte is used to represent each character
- Double-byte: two bytes are used to represent each character
- Multi-byte: a combination of single-byte and double-byte characters
- Unicode: two bytes per character in UCS-2, or a variable number in other encodings such as UTF-8, which uses 1 to 4 bytes per character
European languages are typically encoded in single-byte character sets, whereas some Asian languages use multi-byte encodings such as the Japanese Shift-JIS character set. The Autonomy IDOL™ Server can deal with both single-byte and double-byte character sets.
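The difference between these encodings can be checked directly with the codecs in Python's standard library. The sketch below reports how many bytes one character occupies in several encodings; the helper name and the choice of sample characters are assumptions made for illustration.

```python
def byte_lengths(char, encodings=("latin-1", "shift_jis", "utf-8", "utf-16-le")):
    """Map each encoding to the bytes used for one character (None if unsupported)."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(char.encode(name))
        except UnicodeEncodeError:
            # The character does not exist in this character set.
            sizes[name] = None
    return sizes

# A Latin letter, an accented letter, a kanji and an emoji.
for ch in ("a", "\u00e9", "\u65e5", "\U0001F600"):
    print(repr(ch), byte_lengths(ch))
```

Shift-JIS shows the multi-byte case: the Latin letter takes one byte and the kanji takes two within the same encoding.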
Word Boundaries
Information is represented with words that together express a concept. In most languages each word can be identified easily because, in written text, words are normally separated by spaces.
Certain languages such as Thai, Japanese, Chinese, Korean, etc. are written without the use of spaces to delimit words. A sentence is normally a continuous flow of characters with some punctuation used for readability. The individual words are normally discerned by the context of the text. In order to support this type of language Autonomy uses well known third party APIs to perform sentence segmentation.
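The paper does not name the third-party APIs involved, so the sketch below only illustrates the general problem with one classic technique, forward maximum matching against a word list. The lexicon is a hypothetical toy; real segmenters use far larger dictionaries or statistical models.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Forward maximum matching: at each position take the longest lexicon entry."""
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if candidate in lexicon or size == 1:
                words.append(candidate)
                i += size
                break
    return words

LEXICON = {"東京", "東京都", "に", "住む"}  # hypothetical toy lexicon
print(segment("東京都に住む", LEXICON))
```

Once the text has been split into words in this way, the rest of the processing can treat it like any space-delimited language.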
Autonomy's core technology, the IDOL™ Server, does not make any assumptions about the language of the content, and it does not depend on the symbols used to represent a particular language.
Architecture
Each one of the tuning steps outlined below is performed internally by the IDOL™ Server if necessary:
Figure 4: Architecture
Supported Platforms
Autonomy provides support for:
- Microsoft Windows NT
- Microsoft Windows 2000
- SUN Solaris
- LINUX
- HP-UX
- Any other POSIX-compliant variant of UNIX on request
Application Examples
Autonomy software has already been widely deployed to solve a broad range of business problems. Some examples are:
| Customer | Languages | Description |
| BBC Online News Site | Chinese, Arabic | Innovative, progressive and pioneering - the British Broadcasting Corporation has proven a powerful force in the 20th century - providing entertainment, education and information, and captivating millions of viewers and listeners at home and abroad. Autonomy was selected to power the Chinese and Arabic sections of its news site. |
Figure 5: BBC Online - Chinese News
| Customer | Languages | Description |
| TOM.COM | Chinese | Asian telecommunications giant Hutchison Whampoa has set up the first series of portals designed specifically for the Chinese community. The portals, which will represent Chinese interests throughout Asia Pacific, will automatically personalize content to users' interests and needs. The portals will make extensive use of Autonomy's technology infrastructure to turn the users' interaction with the site into a productive and interest-focused experience. |
Figure 6: Tom.com - Chinese Internet Portal
| Customer | Languages | Description |
| Yatack E-Commerce Site | Scandinavian Languages | Yatack is a Scandinavian E-Commerce site. Autonomy's technology allows Online Club to deliver the most personalized online shopping experience, by guiding users through the buying process based on an automatically derived understanding of their interests. |
Figure 7: Yatack - Scandinavian E-Commerce
Other Examples
Figure 8: French Portal - http://www.eurosport.fr/
Figure 9: Italian Shopping Site - http://www.kataweb.it/
Figure 10: German News Portal - http://www.tomorrowbusiness.de/












