| Technology |
|
Explicit Thesauri | | | The Vector Method | | | The "OneBox" Model |
|
The Vector Method
The Vector Method is concerned with partitioning data, that is, with categorization. Documents are treated as points in a multidimensional space, and that space is then divided into categories. Categories must be taught to the system, and the more training it receives, the more accurate the categorization can be. Many of today's search engines use a combination of Vector and Boolean methods.
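The idea of documents as points that are assigned to trained categories can be sketched in a few lines. This is only an illustrative toy, not any vendor's implementation: it assumes one dimension per word, word counts as coordinates, a summed "centroid" vector per category, and cosine similarity as the closeness measure. The function names and the training data are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Map a document to a sparse term-count vector (one dimension per word)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labelled_docs):
    """Sum the vectors of each category's training documents into a centroid."""
    centroids = {}
    for label, text in labelled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def categorize(text, centroids):
    """Assign the document to the single category whose centroid is closest."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical training set: the categories exist only because they were taught.
training = [
    ("health", "cattle disease outbreak reported on farm"),
    ("health", "hospital reports new disease cases"),
    ("finance", "stock market shares fall sharply"),
    ("finance", "bank raises interest rates on loans"),
]
centroids = train(training)
result = categorize("new disease found in cattle", centroids)
print(result)  # health
```

Adding more labelled documents to `training` moves the centroids, which is the sense in which more training can improve the partition.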
Language Dependent
The system needs to be trained in its target language, and will only recognize words it has been taught. There is no inherent understanding of synonyms or related words. For example, it would be unable to deduce that "Creutzfeldt-Jakob" and "mad cow" are related terms.
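The synonym problem follows directly from the geometry. In the small sketch below, which assumes plain word-count vectors and cosine similarity (an illustrative choice, with made-up example sentences), two documents about the same subject share no words and therefore have no dimension in common.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents on the same topic with disjoint vocabularies.
doc_a = Counter("creutzfeldt-jakob cases confirmed".split())
doc_b = Counter("mad cow outbreak spreads".split())

similarity = cosine(doc_a, doc_b)
print(similarity)  # 0.0
```

With no overlapping terms the two vectors are orthogonal, so nothing in the representation itself links the two phrases.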
Inaccurate
The Vector Method is inaccurate because it cannot divide categories cleanly, and it has particular trouble with documents that fit into more than one category. It will classify such a document under one category or another, but not both. There is also no notion of threshold or relevance: once a document has been put into a category, nothing indicates how relevant it is within that category. Does it mention the topic only a couple of times, or is it entirely focused on it? The Vector Method is unable to tell.
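The loss of information described above can be illustrated with made-up numbers. The sketch assumes each category already has a similarity score for a document; the scores, category names, and both function names are hypothetical. `hard_assign` models the one-category behaviour the text describes, while `threshold_assign` is shown only for contrast, to make clear what a relevance threshold would add.

```python
# Hypothetical similarity scores for one document against three categories.
scores = {"health": 0.62, "agriculture": 0.60, "finance": 0.05}

def hard_assign(scores):
    """Keep only the best-scoring category; the scores themselves are discarded."""
    return max(scores, key=scores.get)

def threshold_assign(scores, threshold):
    """For contrast: keep every category whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)

single = hard_assign(scores)
multiple = threshold_assign(scores, 0.5)
print(single)    # health
print(multiple)  # ['agriculture', 'health']
```

Under hard assignment the near-tie with "agriculture" disappears, and the bare label "health" says nothing about whether the score was 0.62 or 0.99.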
Manual
All categories must be defined manually by administrators, and the system requires constant monitoring and maintenance to keep it functioning. Whenever the categorization changes, the whole training process must begin again from scratch, because there is no way to update just one area of the system.
Ranking Discrimination
The system does not understand the importance or relevance of one word compared to another. To compensate, common words can be ignored and the focus placed on rare words, on the assumption that they give more insight into the theme of a document. However, this assumption does not always hold, and weight can end up on inappropriate words, causing categorization errors.
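One common way to favour rare words is inverse document frequency (IDF), used here as an assumed stand-in since the text does not name a specific weighting scheme. The sketch below computes log(N / df) for each term over a tiny hypothetical corpus; `idf_weights` is an illustrative helper, not a library function.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Weight each term by log(N / df): the rarer the term, the higher the weight."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}

# Hypothetical corpus.
docs = [
    "the cattle disease spread",
    "the market fell",
    "the bank raised rates",
    "the farm reported disease yesterday",
]
weights = idf_weights(docs)

for term in ("the", "disease", "yesterday"):
    print(term, round(weights[term], 3))
```

Here "the" appears in every document and receives a weight of zero, as intended. But "yesterday", which appears once and says nothing about the topic, outweighs "disease", which is the kind of misplaced emphasis the paragraph above warns about.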
Autonomy's Approach
Autonomy's technology can understand the content of a document probabilistically, without depending on an understanding of a particular language, and create categories accordingly. Where necessary, a document can be classified in more than one category. Autonomy's automatic categorization functionality ensures that taxonomies are created and maintained with as much or as little human interference as desired.
















