Solutions | Understanding "Human Information" | Enterprise Search

Human Information

Why is it different?

When we consider human information and its dominance in today's enterprise, it is natural to first wonder how we can effectively search and find the information we need. Typically, people search or analyze data using an attribute such as the date a video was taken, who is in a photo, or whether a blog gives a positive view of a product. Since computers have historically used databases to increase search efficiency, finding human information raises a number of new questions regarding how we can organize, process, and search it.

Human Information comes in two categories:

  1. Unstructured Text Data. Unstructured text data includes content posted to blogs, news feeds, documents, and social media interactions such as those occurring via Twitter and Facebook.
  2. Unstructured Rich Media. Unstructured rich media includes photos, videos, sound files, and other forms of information that by default do not have any text information on their subject beyond simple metadata.

Using simple search methods, such as keywords, computers can return every instance of a particular word or combination of words. Because these methods do not understand the meaning of the word "DOG," you get every result that contains the word and still have to sift through them to find what you want. More recent approaches add rules, popularity ranking, federation, and other basic functions to improve the search process, but these methods still have limitations.
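The limitation of keyword matching can be illustrated with a minimal sketch. The function name `keyword_search` and the three sample documents are assumptions made up for this illustration; they are not part of any Autonomy product.

```python
import re


def keyword_search(documents, keyword):
    """Return every document that contains the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


# Hypothetical sample documents, invented for this sketch.
documents = [
    "My dog loves long walks in the park.",
    "The hot dog stand opens at noon.",
    "Our Labrador fetched the ball all afternoon.",
]

hits = keyword_search(documents, "dog")
for doc in hits:
    print(doc)
```

In this sketch the search returns the sentence about the hot dog stand, which is not about the animal, and misses the sentence about the Labrador, which is. The matcher compares characters only; it has no notion of what the word means.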

To recognize the importance of understanding the concepts contained in information, it is important to first understand the unique challenges posed by human information.

Information is diverse. Human information is not limited to one file type or source. It represents all types of information and does not fit neatly into a structured database. It includes text in the form of emails, documents, IMs, social media, and SMS messages; audio in the form of speech and sounds; and video, XML, and images.
Ideas do not match; they have a distance. No two ideas are exactly the same, but they have degrees of similarity based on how conceptually close they are to each other. Consider the description "low-drag wing design expert" versus "high-efficiency aerofoil designer." These words do not match, but the ideas are conceptually "close." In turn, a very different idea, such as safari animals, would be conceptually "distant."
Context is important. Distances between ideas change with the context around them. When the story "Clinton Arrives by Car to Meet the Chinese Premier, Drives Up in Black Lincoln" appears, the main point changes based on who reads it. For most people, the news is that Clinton has met with the Chinese Premier. For the subscribers to Limousine, Charter & Tour Magazine, the real news is that Clinton arrived in a black Lincoln. When analyzing human information, the context must be understood to grasp the meaning of the information.
Information does not match exactly. There is a definitional problem when dealing with human information. When a user poses a query, information never matches exactly the way structured information would. The question "Is Snoopy a dog?" does not have a simple answer, as there are many ways to define Snoopy. You must take into account why he would or would not be considered a dog. The answer to a question is dependent on other pieces of information. For instance, if the answer is "No, he is a cartoon character," then Snoopy is not a dog. This demonstrates the relative nature of information.
Meaning is dynamic. In the age of social media, new slang terms are continually emerging. Even within the same phrase, a single word can have multiple meanings based on its intent. Take the tweet "Saw Red Riding Hood, the wicked wolf got boiled, it was really wicked." The word "wicked" means both bad and good in the same post, depending on where it appears. The ever-changing nature of a word's meaning makes it especially difficult to process human information without the ability to understand context.
Meaning is multi-layered. Within the same set of phrases or words, there can be multiple levels or layers of meaning. We can see this principle best in poetry, where complex metaphors can run through a set of text, building on each other and adding depth.
Meaning is relative. What something means is closely related to your own perspective, such as your social or cultural viewpoint. Two opposing cultural groups will view a set of results very differently. Meaning also changes over time and is subject to historical perspective.
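The idea that two descriptions can share no words and still be conceptually close can be sketched with a toy distance measure. The example below uses Jaccard distance, first on raw words and then on concepts. The `CONCEPTS` table is a hypothetical, hand-written mapping invented for this illustration; a real conceptual engine would derive such relationships from content rather than from a fixed lookup, and nothing here is claimed to be Autonomy's method.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|: 0.0 means identical sets, 1.0 means nothing shared."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def words(text):
    """Lower-case the text and split it into words, treating hyphens as spaces."""
    return text.lower().replace("-", " ").split()


# Hypothetical word-to-concept table, written by hand for this sketch only.
CONCEPTS = {
    "drag": "aerodynamics",
    "efficiency": "aerodynamics",
    "wing": "wing",
    "aerofoil": "wing",
    "design": "design",
    "designer": "design",
    "safari": "wildlife",
    "animals": "wildlife",
}


def concepts(text):
    """Map each word to its concept; unknown words stand for themselves."""
    return {CONCEPTS.get(word, word) for word in words(text)}


a = "low-drag wing design expert"
b = "high-efficiency aerofoil designer"
c = "safari animals"

print("word distance a-b:   ", jaccard_distance(words(a), words(b)))
print("concept distance a-b:", jaccard_distance(concepts(a), concepts(b)))
print("concept distance a-c:", jaccard_distance(concepts(a), concepts(c)))
```

At the word level the two job descriptions are as far apart as possible because they share no terms. Once words are mapped to concepts, the two descriptions move closer together, while "safari animals" stays at the maximum distance.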

The ability to understand concepts

The leap forward in the ability to understand human information comes with conceptual search. When a computer can understand that the letters "D-O-G" mean a dog, man's best friend, a Labrador, or an animal that likes to go for walks, the process becomes more human and the computer can do more of the work for us.

Yet the lack of structure in human information still makes the search process challenging, for the simple reason that people search or analyze data using an attribute of the data, such as the date a recording was captured, the people pictured in a photo, or whether a website gives a positive review of a product. This requires some form of metadata (data about data) to be tagged to the item or generated on the fly as the item is saved. If no such metadata exists, you will have difficulty finding the item, or may not be able to find it at all.

This issue is not an easy one to solve without human involvement. For example, how can software tell if a picture is of a yellow rose, a yellow Labrador named Rose, or a girl named Rose in a yellow dress? Compounding these challenges, human information is often more difficult to manage than structured or semi-structured information in terms of size, organization, and availability.

Revealing strong concepts in weak information

Autonomy takes a unique approach to leveraging the power of weak information. We use a theory that says the less frequently a unit of communication occurs, the more information it conveys. By using a larger amount of conceptually related weak information to drive a search, you can obtain more relevant results than with a smaller number of seemingly strong keywords. For example, when you search for the word "penguin," there is about an 85 percent chance of bringing back a document about the flightless bird. But your search may also return information on the Batman villain, the publishing house, and the hockey team, along with the bird. On the other hand, a group of weak terms like "a black and white flash jumped into the sea and appeared with a fish in its beak" paints a much more accurate picture. In this case, the probability that the document is about the flightless bird is about 98 percent. Although each word is much weaker on its own, and the phrase does not even include the word "penguin," together the words offer much clearer information. But to understand what these words are describing, you have to understand their context.
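The two ideas in this passage can be sketched in a few lines. The first function is the standard information-theoretic measure of self-information, where a rarer event carries more bits. The second combines several weak clues with Bayes' rule under a naive independence assumption. Both are textbook techniques used here as an illustration, not a description of Autonomy's actual algorithm, and every probability in `PRIORS` and `LIKELIHOODS` is invented for the sketch; the numbers are not meant to reproduce the 85 and 98 percent figures quoted above.

```python
import math


def self_information(probability):
    """Shannon self-information in bits: rarer events carry more information."""
    return -math.log2(probability)


def posterior(priors, likelihoods, terms):
    """Naive Bayes: P(topic | terms) is proportional to P(topic) * product of P(term | topic)."""
    log_scores = {}
    for topic, prior in priors.items():
        log_score = math.log(prior)
        for term in terms:
            log_score += math.log(likelihoods[topic][term])
        log_scores[topic] = log_score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    weights = {topic: math.exp(s - top) for topic, s in log_scores.items()}
    total = sum(weights.values())
    return {topic: w / total for topic, w in weights.items()}


# Hypothetical probabilities, invented for this sketch.
PRIORS = {"bird": 0.5, "hockey": 0.5}
LIKELIHOODS = {
    "bird": {"black": 0.30, "sea": 0.20, "fish": 0.20, "beak": 0.10},
    "hockey": {"black": 0.20, "sea": 0.05, "fish": 0.05, "beak": 0.01},
}

one_clue = posterior(PRIORS, LIKELIHOODS, ["black"])
all_clues = posterior(PRIORS, LIKELIHOODS, ["black", "sea", "fish", "beak"])

print("bits for a common term (p=0.5): ", self_information(0.5))
print("bits for a rare term (p=0.01):  ", self_information(0.01))
print("P(bird | one weak clue):  ", one_clue["bird"])
print("P(bird | four weak clues):", all_clues["bird"])
```

With these made-up numbers, a single weak clue barely shifts the balance between the two topics, but four weak clues that all point the same way push the estimate strongly toward the bird.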

"A black and white flash jumped into the sea and appeared with a fish in its beak" Is it a penguin?

This approach is similar to understanding a conversation in a noisy room, where you can grasp the context of the discussion even when some of the words cannot be heard—or grasping the essence of a news article simply by skimming over the text. Autonomy creates a framework for extracting the concepts from content to determine the meaning of information.

Meaning-Based Computing

Deriving meaning, spotting patterns, "connecting the dots," and automating business processes are now possible using the technology developed by Autonomy and the power of Meaning-Based Computing (MBC). Autonomy, an HP company and a pioneer in the area of MBC, provides technology that allows you to derive insight, sentiment, and concepts from structured and unstructured human information to drive better enterprise decisions.

Autonomy's core technology, the Intelligent Data Operating Layer (IDOL 10), understands any type of unstructured information, including text, voice, audio, and video, as well as structured application data. This gives you the power to perform automatic operations such as hyperlinking, agents, summarization, taxonomy generation, clustering, eduction, profiling, alerting, and retrieval. For instance, it allows text to be searched and processed from databases, audio, video, text files, or click streams. Autonomy IDOL, combined with the latest advances in hardware and software, enables massive amounts of constantly created and updated unstructured, structured, and rich media information to be analyzed in real time.
