From the time of the very first computers, their inability to process human-friendly, "unstructured" information has posed a considerable challenge. The modern IT industry was founded on the principle that, for example, if the number in "Column 3" goes to zero, the computer will automatically order more stock for the warehouse - in other words, the position of a piece of information tells the computer what to do with it - and a tremendous amount of effort has been poured into sorting and distilling unstructured information into tidy rows and columns.
Increasingly, structuring information in this way is not a viable solution, not only because of the enormous manual effort required but also because the richness and subtlety of the information are lost in the process. Consequently, attention has turned to finding alternative, more intelligent solutions to the problem of unstructured information, and the journey towards integrated Meaning Based Computing (MBC) began...
Because computers were unable to understand the meaning of information, the seemingly obvious alternative was to simply search it in order to locate any keywords relevant to the desired subject. The problem with this approach is that the computer has no way of identifying what a given keyword means, and therefore cannot process the information afterwards. For example, if a user types in the letters "D-O-G" the computer has no concept of what that word means; it will simply identify all of the documents which contain that combination of letters, which might produce a list of results thousands of pages long.
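The behaviour described above can be illustrated with a minimal sketch. The three-document corpus and the function name `keyword_search` are hypothetical and chosen purely for illustration; the point is only that a letter-by-letter match has no notion of what a "D-O-G" is.

```python
def keyword_search(documents, keyword):
    """Return the titles of all documents whose text contains the keyword.

    Matching is a plain, case-insensitive substring test: the code has no
    notion of what the keyword means.
    """
    needle = keyword.lower()
    return [title for title, text in documents.items() if needle in text.lower()]


# Hypothetical corpus, for illustration only.
documents = {
    "pets": "Our dog enjoys long walks in the park.",
    "theology": "The council debated the dogma for weeks.",
    "hounds": "A hound can follow a scent for miles.",
}

matches = keyword_search(documents, "dog")
print(matches)  # ['pets', 'theology']
```

The document about dogma is returned because it happens to contain the same three letters, while the document about a hound is missed entirely.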
In order to improve the results from straightforward keyword searches, the technique was enhanced by adding a series of arbitrary rules so that the most relevant results would appear at the top of the list. For example, if the search term appears in the title of a document, five points are added to that result, and if it appears three times within the document, one point. This works to a certain extent but the important issue is that there is still no understanding of what a "D-O-G" is, or does. In addition, the rules have to be modified manually every time a subject develops, which makes them very costly to maintain.
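The two example rules above could be sketched roughly as follows. The point values come from the text; reading "three times" as "at least three times", along with the function names and sample documents, are assumptions made for illustration.

```python
def rule_score(title, body, term):
    """Score one document using the two hand-written rules from the text."""
    term = term.lower()
    score = 0
    if term in title.lower():
        score += 5  # rule 1: the term appears in the title
    if body.lower().count(term) >= 3:
        score += 1  # rule 2: the term appears three times (assumed: or more)
    return score


def rank(documents, term):
    """Return (title, score) pairs, highest score first."""
    scored = [(title, rule_score(title, body, term)) for title, body in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents, for illustration only.
documents = [
    ("Park regulations", "Every dog must be kept on a lead."),
    ("Kennel report", "The dog barked. The dog slept. The dog ate."),
    ("Dog training basics", "A dog learns quickly. Reward the dog. Walk the dog daily."),
]

ranking = rank(documents, "dog")
print(ranking)
# [('Dog training basics', 6), ('Kennel report', 1), ('Park regulations', 0)]
```

Every constant in `rule_score` is a manual decision. Nothing in the code knows what a dog is, and each new rule or revised weight has to be written and maintained by hand.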
On the Internet there is a simple trick to get around this problem because, in many cases, the most popular information is also the most relevant. The importance or popularity of a Web page is approximated by counting the number of other pages that link to it, and by how frequently those pages are viewed by other users. This works quite well on the Internet but in the enterprise it is doomed to failure. Firstly, there are no native links between information in the enterprise. Secondly, if a user happens to be an expert, perhaps in the field of gallium arsenide laser diodes, there may be no one else interested in the subject, but it is still imperative that they find relevant information.
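As a rough sketch of the link-counting idea, the snippet below counts how many other pages link to each page. It deliberately leaves out the page-view signal, since the text does not say how the two are combined. The link graph and the function name are hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count, for each page, how many *other* pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # ignore repeated links from the same page
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["products", "news"],
    "news": ["products"],
    "blog": ["products", "news"],
    "laser-diode-notes": [],
}

counts = inbound_link_counts(links)
ranking = [page for page, _ in counts.most_common()]
print(ranking)                      # ['products', 'news']
print(counts["laser-diode-notes"])  # 0
```

The specialist page receives a popularity of zero because nothing links to it, which is precisely the situation of most enterprise documents.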
As a result of new regulatory drivers such as the US Federal Rules of Civil Procedure (FRCP), enterprises need to be able to guarantee that a search has covered absolutely every piece of relevant information across potentially hundreds of different repositories throughout the enterprise. Most search engines are not actually capable of doing this, so they ask the original repositories to perform the search, a process known as federated search.
Federated search is often advertised as an asset. However, it creates significant problems because it generates vast increases in network traffic. Every time the user enters a query, each and every repository has to do a search, so a repository that previously ran a search perhaps 0.01 times per user per day starts to glow white-hot. More importantly, each repository searches with its own algorithm, which means that the relevance rankings they return are different and incompatible when compiling a single results list. In addition, most of the underlying search algorithms used in the repositories are not compliant with the new FRCP. Consequently, federated search is not compatible with a pan-enterprise platform.
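The ranking problem can be made concrete with a small sketch. The two repositories, their score scales and all names below are hypothetical; the assumption is simply that each repository reports relevance on its own private scale.

```python
def federated_search(repositories, query):
    """Send the query to every repository and merge the hits by raw score."""
    merged = []
    for name, search in repositories.items():
        for doc, score in search(query):  # one search per repository per query
            merged.append((score, name, doc))
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged


# Hypothetical repositories with incompatible relevance scales.
def email_archive(query):
    return [("msg-17", 0.92), ("msg-04", 0.40)]      # scores between 0 and 1


def file_share(query):
    return [("report.doc", 55.0), ("notes.txt", 12.0)]  # scores between 0 and 100


repositories = {"email": email_archive, "files": file_share}
merged = federated_search(repositories, "indemnity clause")
order = [doc for _, _, doc in merged]
print(order)  # ['report.doc', 'notes.txt', 'msg-17', 'msg-04']
```

The strongest e-mail hit (0.92 out of 1) ends up below a weak file-share hit (12 out of 100), because the raw scores were never comparable in the first place. Note also that every query triggers one search in every repository.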
All of the approaches described up to this point fit squarely into the mid-enterprise search market. A technology which is limited to these capabilities is not suitable for a true pan-enterprise deployment, for reasons that will now become clear.
A critical leap forward came with the ability to actually "understand" the idea behind a given phrase, and retrieve information which is conceptually related, even when a particular keyword is not used. So for example, if the user types in the letters "D-O-G", a conceptual search engine will retrieve all the information conceptually related to but not confined to the word "D-O-G", perhaps information about a "hound" as well as "walks" and different breeds of dog, because it understands the idea represented by the word. This is incredibly powerful, since critical information is often missed when users do not all use the same search terms.
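The observable effect, finding documents that never mention the query word, can be imitated with a toy example. The sketch below is plain query expansion against a small hand-written table of related terms. It is not how a conceptual engine forms its understanding, and the table, corpus and names are all hypothetical.

```python
# Hypothetical, hand-written table of related terms. A conceptual engine
# would not rely on a manually maintained list like this.
RELATED_TERMS = {
    "dog": {"dog", "hound", "walks", "labrador"},
}


def expanded_search(documents, query):
    """Return titles of documents containing the query or any related term."""
    query = query.lower()
    terms = RELATED_TERMS.get(query, {query})
    hits = []
    for title, text in documents.items():
        words = set(text.lower().replace(".", " ").split())
        if words & terms:  # at least one related term appears in the document
            hits.append(title)
    return hits


# Hypothetical corpus in which no document contains the word "dog".
documents = {
    "kennel": "The hound slept by the fire.",
    "breeds": "The labrador is a popular breed.",
    "finance": "Quarterly revenue rose.",
}

hits = expanded_search(documents, "dog")
print(hits)  # ['kennel', 'breeds']
```

Neither returned document contains the letters "D-O-G", yet both are about dogs. A plain keyword search over the same corpus would have returned nothing.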
Security is absolutely paramount to the enterprise and the challenge this poses is staggeringly complex, from protecting the enterprise's intellectual property from unauthorized access, to ensuring internal compliance with an ever-growing list of regulatory requirements. Most users are not permitted to view most documents or even be aware that they exist. Typically, around one in every 1,000 documents should be available to each user, and access privileges must be specific to each of the many underlying repositories in the enterprise. Achieving air-tight security without significant performance degradation is a considerable challenge.
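One way to picture the requirement is a results filter that consults a per-repository access list and denies by default. This is a minimal sketch under assumed data structures. The access list, repository names and function name are hypothetical, and the sketch says nothing about how to do this at scale without degrading performance.

```python
def visible_results(hits, user, acl):
    """Keep only the hits the user is explicitly allowed to see.

    `hits` is a list of (repository, document) pairs. `acl` maps each such
    pair to the set of users permitted to view it. Anything missing from
    the access list is hidden (deny by default).
    """
    return [hit for hit in hits if user in acl.get(hit, set())]


# Hypothetical access list, keyed by (repository, document).
acl = {
    ("hr", "salaries.xls"): {"alice"},
    ("legal", "contract.doc"): {"alice", "bob"},
    ("wiki", "canteen-menu"): {"alice", "bob", "carol"},
}

hits = [
    ("hr", "salaries.xls"),
    ("legal", "contract.doc"),
    ("wiki", "canteen-menu"),
    ("legal", "unlisted-draft.doc"),  # no access entry at all
]

bob_sees = visible_results(hits, "bob", acl)
print(bob_sees)  # [('legal', 'contract.doc'), ('wiki', 'canteen-menu')]
```

Filtered documents are dropped from the list altogether rather than shown as locked, so the user is not made aware that they exist.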
In order to scale without impeding performance, some technologies fail to search each document in its entirety. This prevents users from retrieving valuable information and it exposes the enterprise to significant compliance risk. Such technologies begin to calculate the relevance of each document at indexing time; however, if at the beginning of the calculation a particular result appears to be irrelevant, the engine will stop calculating, effectively assuming the result is not relevant without reading all the way through. Consequently, a relevant snippet of information on the last page of a hundred-page report could be overlooked and the legal consequences could be absolutely catastrophic. In fact, the company CEO could go to jail because the search failed to retrieve all of the information required by the court.
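The shortcut described above can be sketched as a relevance calculation that gives up after a fixed number of pages if nothing has matched so far. The cut-off parameter `give_up_after`, the term-counting score and the sample report are assumptions made for illustration.

```python
def relevance(pages, term, give_up_after=None):
    """Count occurrences of a term across the pages of a document.

    If `give_up_after` is set and no occurrence has been found by that
    page, the document is assumed irrelevant and the rest is never read.
    """
    term = term.lower()
    score = 0
    for page_number, page in enumerate(pages, start=1):
        score += page.lower().count(term)
        if give_up_after is not None and page_number >= give_up_after and score == 0:
            return 0  # early exit: remaining pages are skipped
    return score


# Hypothetical hundred-page report with the key passage on the last page.
report = ["Routine operational summary."] * 99
report.append("Appendix: the indemnity clause was waived.")

full_scan = relevance(report, "indemnity")
shortcut = relevance(report, "indemnity", give_up_after=10)
print(full_scan, shortcut)  # 1 0
```

The full scan finds the passage on page 100. The shortcut scores the same report as irrelevant, so it would never appear in the results.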
The full potential of multimedia content is often not utilized because processing it has traditionally required considerable manual involvement. Consequently, intelligence lies dormant in resources such as recorded meetings, training videos and broadcast content. True Pan-Enterprise Search technology automatically captures, encodes and indexes television, video and audio content from any source and provides users with the ability to search this with pinpoint accuracy and treat rich media content in the same way as more traditional forms of information.
When computers "understand" information, they can start to automatically process it and begin to bring information to the user rather than the other way round. For example, through forming an understanding, computers can automatically create taxonomies, alert users to new and relevant information in real-time, or automatically profile an individual's interests based on what they read and write, offering them interesting information without the need to search and connecting them with similar people.
Clustering, Scene Detection, Speaker Identification and Sentiment Analysis
Understanding information allows computers to cluster information, identifying inherent themes or clusters of conceptually similar information. In addition, using this approach it is possible to detect irregularities in everyday scenes for security purposes, identify well-known speakers in broadcast media and analyze conversations to detect positive or negative sentiment.
In examining the different approaches to the challenge of unstructured information, it becomes clear that the solution does not boil down to plain search. It is only through understanding the meaning of ALL information that computers are able to automatically process it and provide users with the ability to handle and maximize the value of this rich resource. MBC addresses the full range of information challenges and consequently forms the central requirement of major enterprise deployments all over the world.