Technical Benefits
 

Language Independence

The core concept matching of Autonomy's technology does not depend on an intimate knowledge of English grammatical structure, or of the grammar of any other particular language. It treats words as abstract symbols of meaning, deriving its understanding from the context in which they occur rather than from a rigid definition of grammar. Slang and other variations in language will not confuse the software. The IDOL Server already has a statistical understanding of the patterns of 'typical' English, but the engine can be trained on the patterns of any language (German, Spanish, Portuguese, Arabic, Italian, French, Japanese, Norwegian, etc.).

Issues

As companies realize the value of presenting their sites to global markets, the dominance of the English language is set to decline. Companies increasingly understand that they need to approach customers in their native language, and large multinationals can broaden the knowledge available in the enterprise across all languages.

Whether you are implementing a portal site, an E-Commerce service or a corporate Knowledge Management system, you will be faced with the issue of internationalization.

This White Paper describes the language independent feature of Autonomy's technology that allows Autonomy to fully support a variety of worldwide languages, providing benefits such as:

Definitions

Introduction

Language can be defined as 'the use by human beings of spoken and written symbols in organized combinations and patterns in order to express and communicate thoughts and feelings.'

As its definition implies, language is used in set patterns to express the abstract notion of knowledge and information. To fully realize the potential of stored knowledge bases worldwide, the knowledge itself needs to be shared amongst its global user base, regardless of the language in which it was originally presented.

Most enterprises now need to manage content in more than one language, so solutions that are independent of language constructs are of great importance and value. Companies that open branch offices in new markets and countries can no longer afford the extra cost of making new or existing information available in another language. Autonomy's technology and architecture are well suited to this need: they are designed to be completely modular and to scale quickly for content exploitation in any language, both today and in the future.

Key Factors

Internationalization is the process of developing a program core whose feature and code design does not make any assumptions based on a single language's characteristics.

When developing software that depends on information, the key questions are:

Autonomy's Approach to Language

The Dynamic Reasoning Engine™ is based on advanced pattern-matching technology (non-linear adaptive digital signal processing) that exploits high-performance probabilistic modeling techniques to extract a document's digital essence and determine the characteristics that give the text meaning. Because this technology is based on probabilistic modeling, it does not use any form of language-dependent parsing or dictionaries. The IDOL Server treats words as abstract symbols of meaning, deriving its understanding from the context in which they occur rather than from a rigid definition of the language's grammar.

The IDOL Server develops a statistical understanding of the patterns that occur in the content that it sees over time. The more information the IDOL Server has about a particular type of information (e.g. legal terms, pharmaceutical developments, technology, etc.) the more understanding it will have of those topics. A new language can be thought of as simply another 'type' of information that the IDOL Server needs enough material to learn from. Therefore, it is possible to mix more than one language in an IDOL Server as long as the amounts for each language are sufficient to build its understanding.

The choice of language does not compromise the accuracy of the concepts extracted by the IDOL Server. The underlying algorithm is the same regardless of the language used.
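The idea of judging a word's importance purely from its statistics, with no grammar or dictionary, can be illustrated with a small sketch. The code below is not Autonomy's algorithm: it substitutes plain TF-IDF weighting as a stand-in, and the function name `term_weights` and the sample Spanish sentences are assumptions made for illustration. It shows only that a frequent function word such as "el" receives a low weight without the code knowing anything about Spanish.

```python
import math
from collections import Counter


def term_weights(documents):
    """Weight every whitespace-delimited token by TF-IDF, per document.

    No dictionary, parser or language setting is involved: tokens are
    treated as opaque symbols and only their counts matter.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            tok: (count / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


# Illustrative sample content (hypothetical, not from the white paper).
docs = [
    "el contrato de venta",
    "el precio de la venta",
    "el abogado revisa el contrato",
]
weights = term_weights(docs)
```

In the third document, "abogado" (seen in one document) outweighs "contrato" (seen in two), and "el" (seen in all three) drops to zero. The same code runs unchanged on text in any language that separates words with spaces.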

Autonomy's Use of Language Dependent Features

Although Autonomy's technology is totally language independent, it is often beneficial to use language-dependent features in order to increase performance. Autonomy offers a choice of the following features to optimize your system:

Autonomy does not require the use of stop lists or stemming rules, as the statistical analysis would normally determine the importance and relationship of those words. However, an initial configuration of a stop list and stemming rules allows the IDOL Server to ignore empty words and treat a set of words as one so that storage resources and processing time can be reduced.

Autonomy provides as standard a set of stop lists and stemming algorithms for most commonly used languages.
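As a rough illustration of why a stop list and stemming rules save storage and processing time, the sketch below filters and normalizes a sentence before indexing. The stop list, the suffix rules and the function names are hypothetical and deliberately naive; they are not the lists or algorithms that ship with the IDOL Server.

```python
# Hypothetical English stop list and suffix rules, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is"}
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Drop stop words, then reduce the remaining words to their stems."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]


text = "the engine indexed the documents and is indexing a document"
terms = index_terms(text)
# terms == ["engine", "index", "document", "index", "document"]
```

The ten input words contain nine distinct forms; after filtering and stemming only three distinct terms remain to be stored and compared.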

Benefits of Autonomy's Technology

Global Implementation

More content is becoming available in a growing number of languages, and an increasing number of users speak a language other than English. This is particularly true as more companies around the world put their information online, use it there and start to conduct E-Business. To succeed with a global internationalization strategy, enterprises need to impose rigorous requirements on the underlying technology and demand industry-leading functionality and performance.

Autonomy's technology and architecture are well suited to this need: they are designed to be completely modular and to scale quickly for content exploitation in any language, both today and in the future.

Expansion to Cross-Lingual Systems

Autonomy's core technology can be used to set up cross-lingual systems. For example, a user looking at a document in English can be offered similar information in both English and Spanish.

The Dynamic Reasoning Engine™ can be used to establish a correlation between two or more languages. To achieve this, a training dataset is needed in which each document is expressed in all of the required languages. Ideally each extract will be a direct translation of the other. Electronic dictionaries can also be used successfully for this purpose. The cross-lingual content can be indexed into an IDOL Server database, which gives the engine a general understanding of the concepts involved in both languages. Once this cross-lingual database is set up, the IDOL Server can correlate terms across languages, making it possible to retrieve content in more than one language at a time.

Single Language Case

When the IDOL™ Server has aggregated data in one language it has a conceptual understanding of that content in that particular language.

When the IDOL™ Server looks for query results or suggestions of related documents, it uses the concept of the query or document to look for best matching answers. The results will therefore be in the same language.

In the following diagram the IDOL™ Server has N databases, all of which are in the same language.

Figure 1: Operation on a single language system

When you query the system in English on an English database, the IDOL™ Server simply takes the English concept in the query and matches it to any relevant concepts found in the English content. The results are therefore in English. In this case, if a Spanish query were made, you would not get any results, because the terms in the IDOL™ Server are in English and there are no Spanish concepts to match.

Multi-Language Case

In order to give the IDOL™ Server the capability to automatically understand concepts in more than one language (e.g., English and Spanish), we pre-index a special multi-language database containing general data (e.g., encyclopaedia data and general world wide news) in both languages. This database is simply for training purposes, and it does not need to contain the documents that you will eventually be querying on.

Each one of the "training documents" contains plain text in both languages, each being a direct translation of the other.

This multi-language database gives the engine a general understanding of a wide range of concepts in both languages. The engine can then use this special multi-language database internally when dealing with queries and suggestions.

In the following diagram the IDOL™ Server has N databases, each of which is in one of the languages contained in the multi-language database.

Figure 2: Operation on a multi-language system

The IDOL™ Server first looks in the multi-language database for concepts that closely match the query, which gives the engine the concepts in both languages. It can then use those concepts to carry on with the original query. In this way, a query in one language can automatically yield results in both languages.

For example, you can use an English sentence to query a database with data in Spanish, and vice versa.

Please note that this method does not use direct translation of keywords; it uses translation of general concepts.

For example, if you are querying a Spanish Database:

Figure 3: Example with English/Spanish

The IDOL™ Server takes the English Query and looks for matching concepts in the English-Spanish Database. The IDOL™ Server will match the concepts in English, but because each document in the English-Spanish Database is in both languages, the concepts obtained are in both English and Spanish.

If you then query the Spanish Database, the Spanish concepts obtained above will find relevant documents in the Spanish Database.
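The two-step lookup described above can be sketched in a few lines. This is a toy model under stated assumptions, not the IDOL Server's implementation: concepts are approximated by sets of words, similarity by Jaccard overlap, and the training pairs, database contents and function names are all hypothetical.

```python
def tokens(text):
    """Split text into a set of lower-case, whitespace-delimited tokens."""
    return set(text.lower().split())


def overlap(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical training documents: the same text in English and Spanish.
TRAINING = [
    ("the bank raised interest rates", "el banco subió los tipos de interés"),
    ("the team won the football match", "el equipo ganó el partido de fútbol"),
]


def cross_lingual_terms(query):
    """Step 1: find the best-matching training document and return its
    terms in both languages."""
    q = tokens(query)
    best = max(TRAINING, key=lambda p: overlap(q, tokens(p[0]) | tokens(p[1])))
    return tokens(best[0]) | tokens(best[1])


def search(query, documents):
    """Step 2: rank the target documents against the bilingual terms."""
    concept = cross_lingual_terms(query)
    return sorted(documents, key=lambda d: overlap(concept, tokens(d)), reverse=True)


spanish_db = [
    "el partido de fútbol terminó sin goles",
    "el banco central mantiene los tipos de interés",
]
results = search("interest rates", spanish_db)
# results[0] is the document about the central bank and interest rates.
```

The English query shares no words with either Spanish document, so a direct match would return nothing. Routing it through the bilingual training pair supplies Spanish terms such as "banco" and "interés", which is what brings the relevant document to the top.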

Languages Supported

Single Byte/Double Byte Languages (SBDB)

In computers characters in a language are typically:

European languages are typically single byte, whereas some Asian languages use multi-byte encodings such as the Japanese Shift-JIS character set. The Autonomy IDOL™ Server can deal with both single-byte and double-byte character sets.
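The difference between single-byte and double-byte character sets is easy to see by encoding a few characters and counting the bytes. The sketch below uses Python's standard `latin-1` and `shift_jis` codecs; the helper name `byte_widths` and the sample strings are illustrative assumptions.

```python
def byte_widths(text, encoding):
    """Return the number of bytes each character occupies in an encoding."""
    return [len(ch.encode(encoding)) for ch in text]


latin = byte_widths("café", "latin-1")        # [1, 1, 1, 1]
japanese = byte_widths("日本語", "shift_jis")   # [2, 2, 2]
mixed = byte_widths("Aあ", "shift_jis")        # [1, 2]
```

The mixed example shows why software that handles Shift-JIS cannot assume a fixed width: ASCII characters still take one byte while kanji and kana take two.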

Word Boundaries

Information is expressed in words that together represent a concept. In most languages each word is easy to identify because written words are normally separated by spaces.

Certain languages, such as Thai, Japanese, Chinese and Korean, are written without the use of spaces to delimit words. A sentence is normally a continuous flow of characters with some punctuation used for readability. The individual words are normally discerned from the context of the text. In order to support this type of language, Autonomy uses well-known third-party APIs to perform sentence segmentation.
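To give a feel for what segmentation involves, the sketch below splits an unspaced Chinese sentence using greedy forward maximum matching against a word list. It is a textbook technique shown for illustration only; it is not one of the third-party APIs mentioned above, and the tiny lexicon and the function name `segment` are hypothetical.

```python
# Hypothetical miniature lexicon; a real segmenter would use a large one.
LEXICON = {"我", "喜欢", "北京", "北京大学", "大学", "学生"}


def segment(text, lexicon=LEXICON):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches; if
    nothing matches, emit the single character and move on.
    """
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


words = segment("我喜欢北京大学")
# words == ["我", "喜欢", "北京大学"]
```

The longest-match rule is what keeps "北京大学" together as one word instead of splitting it into "北京" and "大学". Once the text is segmented, the resulting tokens can be handled in the same way as space-separated words.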

Autonomy's core technology, the IDOL™ Server, does not make any assumptions about the language of the content, and it does not depend on the symbols used to represent a particular language.

Architecture

Each one of the tuning steps outlined below is performed internally by the IDOL™ Server if necessary:

Figure 4: Architecture

Supported Platforms

Autonomy provides support for:

Application Examples

Autonomy software has already been deployed to solve a wide variety of business problems. Some examples are:

Customer: BBC Online News Site
Languages: Chinese, Arabic
Description: Innovative, progressive and pioneering - the British Broadcasting Corporation has proven a powerful force in the 20th century - providing entertainment, education and information, and captivating millions of viewers and listeners at home and abroad. Autonomy was selected to power the Chinese and Arabic sections of its news site.

Figure 5: BBC Online - Chinese News

Customer: TOM.COM
Languages: Chinese
Description: Asian telecommunications giant Hutchison Whampoa has set up the first series of portals designed specifically for the Chinese community. The portals, which will represent Chinese interests throughout Asia Pacific, will automatically personalize content to users' interests and needs. The portals will make extensive use of Autonomy's technology infrastructure to turn the users' interaction with the site into a productive and interest-focused experience.

Figure 6: TOM.COM - Chinese Internet Portal

Customer: Yatack E-Commerce Site
Languages: Scandinavian Languages
Description: Yatack is a Scandinavian E-Commerce site. Autonomy's technology allows Online Club to deliver the most personalized online shopping experience, by guiding users through the buying process based on an automatically derived understanding of their interests.

Figure 7: Yatack - Scandinavian E-Commerce

Other Examples

Figure 8: French Portal - http://www.eurosport.fr/

Figure 9: Italian Shopping Site - http://www.kataweb.it/

Figure 10: German News Portal - http://www.tomorrowbusiness.de/