Limitations of Other Approaches
Many companies claim to have solved the challenge of managing unstructured information, or have promised technologies that deliver personalised information services. However, most of these systems and approaches have severe limitations, particularly where scalability and cost are concerned. For example:
-
Keyword Searching or Boolean Query
The most common approach to information management is traditional keyword search. This simple method asks a user to enter some terms into a text field. The system then searches a collection of documents and returns a list of those containing the search terms.
-
Collaborative Filtering or Social Agents
Collaborative Filtering is an attempt to allow computers to make personal recommendations to users based on their similarity to other users. The basic principle is quite simple: by getting a large number of users to give information about their preferences (usually by filling out forms and checking boxes) the system endeavours to make recommendations.
An example serves to clarify the basic principle. Imagine three users: Mick, Bud, and Brad have been asked to give their three favourite musicians.
Mick's favourite musicians:
- Elvis
- Buddy Holly
- Little Richard
Bud's favourite musicians:
- Jimi Hendrix
- James Brown
- Aretha Franklin
Brad's favourite musicians:
- Elvis
- Jerry Lee Lewis
- Little Richard
In collaborative filtering the computer compares the results, finds that Mick and Brad are similar, and so offers each of them the other's remaining choice as a suggestion: "Mick, you may like Jerry Lee Lewis"; "Brad, you may like Buddy Holly".
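The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming similarity is measured by the number of shared favourites; the function name `recommend` is hypothetical and does not come from any particular product.

```python
# Favourite musicians exactly as listed in the example above.
favourites = {
    "Mick": {"Elvis", "Buddy Holly", "Little Richard"},
    "Bud": {"Jimi Hendrix", "James Brown", "Aretha Franklin"},
    "Brad": {"Elvis", "Jerry Lee Lewis", "Little Richard"},
}


def recommend(user, favourites):
    """Suggest the items liked by the most similar other user.

    Similarity is assumed to be the count of shared favourites.
    """
    mine = favourites[user]
    # Pair every other user with the number of favourites they share with us.
    others = [
        (len(mine & theirs), name)
        for name, theirs in favourites.items()
        if name != user
    ]
    best_overlap, best_match = max(others)
    if best_overlap == 0:
        return set()  # nobody is similar, so there is nothing to suggest
    # Recommend what the similar user likes and we have not already listed.
    return favourites[best_match] - mine


for user in favourites:
    print(user, "->", sorted(recommend(user, favourites)))
```

Bud shares no favourites with anyone, so the sketch has nothing to offer him. The approach depends entirely on users supplying enough overlapping preferences.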
-
Parsing and Natural Language Analysis
For the last twenty years much effort has been put into an obvious approach to dealing with unstructured information, called parsing (also semantic or lexical analysis). Rules of grammar and lexicons are applied in an attempt to understand textual information explicitly.
Example:
The cat sat patiently on the mat = (The cat = subject) (sat = verb) (patiently = adverb) (on = preposition) (the mat = object).
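A toy version of the lexicon lookup behind such an analysis might look like the following. It is a deliberately simplified sketch: the `LEXICON` table and the `tag` function are hypothetical, and a real parser would also apply grammar rules to group words into subject, verb and object.

```python
# Hand-built lexicon covering only the words of the example sentence.
LEXICON = {
    "the": "determiner",
    "cat": "noun",
    "sat": "verb",
    "patiently": "adverb",
    "on": "preposition",
    "mat": "noun",
}


def tag(sentence):
    """Label each word using the lexicon; anything not listed is 'unknown'."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in sentence.split()]


print(tag("The cat sat patiently on the mat"))
print(tag("The dog sat patiently on the sofa"))
```

Every word outside the lexicon comes back as "unknown", so coverage depends on someone extending the lexicon and rules by hand.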
-
Manual Tagging
With the upswing in enterprise portals, creating taxonomies that address various information types (including documents, structured data, HTML, XML and multimedia) is imperative. Manual tagging schemes are becoming an increasingly popular method of labelling digital material. However, there is a significant barrier to ensuring that tagging increases the efficiency of managing information: the cost!
Keyword Searching or Boolean Query
Limitations
-
No context
The most common attempts to manage unstructured data employ keyword search. Search methods often exacerbate information overload. Although they can identify documents in which a search term appears, they cannot tell how relevant the document is to the subject being researched. They simply look for the occurrence of keywords and are unable to decipher whether the concept represented by a search term is related to the main idea of a document.
In addition, keyword-based approaches sometimes make the mistake of assuming that the more often a term is mentioned in a document, the more relevant the document is to the search. This is not always the case. Consider the following phrase: "I was walking down the street the other night. It was a long street, a dark street...and at the end of the street I was attacked by a mugger." Although the word "street" is mentioned several times, the phrase is really about a crime.
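The effect is easy to reproduce. The short Python sketch below simply counts word occurrences in the phrase quoted above; the helper name `term_counts` is illustrative only, and real engines use more elaborate weighting than a raw count.

```python
import re
from collections import Counter

phrase = (
    "I was walking down the street the other night. It was a long street, "
    "a dark street...and at the end of the street I was attacked by a mugger."
)


def term_counts(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


counts = term_counts(phrase)
# "street" dominates the counts, while "crime" never appears at all.
print(counts["street"], counts["mugger"], counts["crime"])  # prints: 4 1 0
```

A ranking based on these counts would favour a query for "street" and would never return the phrase for a query on "crime".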
-
Inaccurate
Reliance on inefficient computational linguistics, keyword definitions and tags produces inaccurate results and is normally costly to implement and maintain. It is neither scalable nor deployable as a solution to the challenges that unstructured information presents in every market sector.
-
Manual
Keyword engines do nothing more complex than look for a few words, yet they are very labour-intensive at the back end, requiring humans to continually manage and update keyword associations or "topics".
-
Intensive user participation
Keyword methodologies rely heavily on the sophistication of the end user, who must be able to author queries in fairly complex and specific language (also known as Boolean form), e.g. "CD AND NOT (financial OR money OR invest*) AND music".
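To show how literal such a query is, the sketch below hand-translates one balanced reading of the example, CD AND NOT (financial OR money OR invest*) AND music, into Python. The helper names `tokens`, `has` and `matches` and the sample documents are invented for illustration, and the trailing `*` is assumed to mean a prefix match.

```python
import re


def tokens(text):
    """Split text into a set of lower-cased words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def has(words, term):
    """True if the term occurs; a trailing * matches any word with that prefix."""
    if term.endswith("*"):
        return any(word.startswith(term[:-1]) for word in words)
    return term in words


def matches(text):
    """Evaluate: CD AND NOT (financial OR money OR invest*) AND music."""
    words = tokens(text)
    excluded = has(words, "financial") or has(words, "money") or has(words, "invest*")
    return has(words, "cd") and not excluded and has(words, "music")


docs = [
    "New music CD released this week",
    "Invest in a CD for steady returns, not in music royalties",
    "Certificate of deposit rates for savers",
]
print([matches(doc) for doc in docs])  # prints: [True, False, False]
```

The user has to anticipate every unwanted sense of "CD" in advance and spell out each exclusion by hand.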
-
Do not learn
Keyword search engines cannot "learn" through use. For example, they cannot be exposed to queries on the word "dog" and learn, without user intervention, that when "dog" is entered it is the four-legged, furry kind (such as a sheepdog) about which information is being sought.
It is also very difficult for keyword search systems to find things by being shown an example. Typically a "more like this..." function will simply increase the number of keywords in the query, based on which terms appear most frequently in the example document. This often returns more documents rather than fewer, when fewer is what the user really wants.
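The broadening effect of a naive "more like this..." function can be sketched as follows. This assumes the expansion takes the most frequent non-trivial terms of the example document and matches documents containing any of them; the stop-word list, function names and sample documents are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real systems use much longer ones.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "on", "are", "with"}


def words(text):
    """Lower-cased words of the text with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def search(query_terms, docs):
    """Indices of documents containing any of the query terms."""
    return [i for i, doc in enumerate(docs) if set(words(doc)) & set(query_terms)]


def more_like_this(example, n_terms=3):
    """Naive expansion: the example's most frequent terms become the query."""
    return [term for term, _ in Counter(words(example)).most_common(n_terms)]


docs = [
    "Sheepdog training for farm dogs",
    "Hot dog stand opens on the high street",
    "Farm subsidies and training grants",
]
example = "Training a sheepdog: farm training tips for sheepdog owners on the farm"

original = search(["sheepdog"], docs)
expanded = search(more_like_this(example), docs)
print(original, expanded)  # prints: [0] [0, 2]
```

The expanded query also pulls in the document about farm subsidies, which has nothing to do with sheepdogs, so the result set grows instead of narrowing.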
Autonomy's Approach
Autonomy's concept matching technology avoids these problems by matching concepts instead of simple keywords, although it does have the ability to perform standard Boolean text queries as well.
Autonomy takes into account the context in which terms appear. This eliminates many false hits while also catching documents that may not contain the specific term, but do include the concept.
Manual Tagging
Limitations
-
Descriptive inconsistency
One example of the effect of human behaviour and the inherent limitations of manually describing information - albeit from existing descriptions - is illustrated by the results of a US Department of Defense edict mandating that internal users responsible for authoring documents also create an appropriate description of the content of each document. At first glance, this was a seemingly sensible and pragmatic decision. However, after many months of activity, it was discovered that the vast majority of documents had been loosely described and tagged as "general". Whilst tagging schemes - and particularly XML - attempt to break away from such generalist terms, they remain subject to the same shortcomings of human behaviour that manifest themselves as "inconsistency". An individual's ability to describe information is dependent upon their personal experience, knowledge and opinions. Such "intangibles" vary from person to person and are also dependent upon circumstance, dramatically reducing the effectiveness of the results.
Further complications arise when subjects incorporate multiple themes. Should an article about "technology development in Russia within the context of changing foreign policy" be classified as (i) Russian technology, (ii) Russian foreign policy, or (iii) Russian economics?
The decision process is both complex and time-consuming and introduces yet more inconsistency, particularly when the sheer number of options available to a user is considered. For example, with over 800 tags for general newspaper subjects, choosing even a basic subject description within a reasonable time-scale becomes a still more challenging process.
-
Idea Distancing
Tags also fail to highlight the relationships between subjects, a problem termed "idea distancing". There are often vital relationships between seemingly separate tagged subjects such as wing design/low drag and aerofoil/efficiency. The first category may contain information about the way wings are designed to achieve low air resistance. The second discusses ways in which efficient aerofoils are made. Obviously, there will be a degree of overlap between these categories and, because of this, a user may be interested in the contents of both. However, to a system that does not understand the meanings of the category names, there is no clear correlation between the two.
-
Not scalable
In order to be very specific in the retrieval and processing of tag-based documents, the number of tags needs to be very high. For example, tag numbers in a company such as Reuters run into the tens of thousands. However, as the number of tags increases, so do both the effort required and the likelihood of misclassification.
-
High labour costs
Taxonomy creation and tagging are still predominantly manual efforts requiring input from librarians, users, and IT staff. This means large labour costs are involved in making sense of information.
Autonomy's Approach
Autonomy addresses the inefficiencies introduced by many of the manual issues associated with creating tags by adding a layer of intelligence to the management of XML: understanding the content and purpose of either the tag itself, or related information, or both.












