Unstructured Data Challenges

Posted May 17, 2016 | Technology

In practice, content and information management systems today haven’t fulfilled their promise. They don’t understand unstructured data, and they can’t directly act upon it. They work well only when people follow defined information governance processes, including:

  • Baselining and versioning documents, instead of copying and pasting them
  • Updating embedded metadata (e.g., author name) in documents as necessary
  • Attaching accurate, appropriate, and shared external metadata to documents uploaded to enterprise content management (ECM) repositories

Organizations that follow these processes minimize the amount of duplicated unstructured data they produce and create structured information that content management systems can index and search. In reality, however, such manual information governance is unsustainable. Consider for the moment the challenge of attaching descriptive metadata to a document. Which tags are most appropriate for the document at hand? This is important! Fail to attach an appropriate tag, or attach the wrong tag, and the document won’t be found in a subsequent search when it’s needed. If the metadata is hierarchical, attach too broad a tag and the document may be returned when it’s not relevant.

Now consider cost. Suppose an assistant classified documents for you. Assuming one million documents in an enterprise content repository and a work year of roughly 2,000 hours, at an impossibly fast one minute per document, it would take one person about eight years to finish the task. At an almost reasonable 10 minutes per document, a team of 20 employees would finish in about four years. Assuming US $40 per hour, this is a cost of roughly $6.4 million. Even ignoring the inconsistencies that would result across a large team of people, the cost would likely be viewed as prohibitive.
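The arithmetic behind these estimates can be checked with a short sketch. The 2,000-hour work year and the function name are assumptions made here for illustration, not figures taken from the article. Without rounding, the 20-person scenario comes out at about 4.2 years and about $6.7 million; the figures in the text correspond to rounding down to four years first.

```python
HOURS_PER_YEAR = 2000  # assumption: about 50 weeks of 40 hours per person


def classification_effort(documents, minutes_per_document, staff, hourly_rate):
    """Return (total person-hours, elapsed years, labor cost in dollars)."""
    total_hours = documents * minutes_per_document / 60
    elapsed_years = total_hours / (staff * HOURS_PER_YEAR)
    cost = total_hours * hourly_rate
    return total_hours, elapsed_years, cost


# One person at one minute per document.
solo = classification_effort(1_000_000, 1, staff=1, hourly_rate=40)
# Twenty people at ten minutes per document.
team = classification_effort(1_000_000, 10, staff=20, hourly_rate=40)

for label, (hours, years, cost) in (("solo", solo), ("team", team)):
    print(f"{label}: {hours:,.0f} hours, {years:.1f} years, ${cost:,.0f}")
```

Either way, the order of magnitude is what matters: manual tagging of a large repository is a multi-year, multi-million-dollar effort.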

Clearly, automation, or automated assistance, is the only practical solution today. As Cutter Senior Consultant Curt Hall wrote in his recent Update about intelligent enterprise and BI search:

We are now seeing advanced enterprise search tools employ text mining and analysis engines that utilize sophisticated algorithms based on natural language processing (NLP) and other linguistic analysis techniques derived from AI and advanced statistics — including machine learning (ML) and dynamic data visualization — undergo increasing use in the enterprise.

Cognitively, NLP has become very successful at parsing text at the sentence level. It’s fast and reliable, and there are readily available implementations and word lexicons. NLP is a fundamental technique used for automatically classifying content and assigning descriptive metadata. However, to truly understand a document, sentence-level understanding is insufficient due to the following:

  • Idioms and metaphors abound. Sports metaphors, for example, are common in business writing; misclassifying a document or news article as being about football isn’t helpful.
  • Recognizing and properly tagging named entities is very useful for indexing, especially for internal documents in a corporate repository, where specific named entities might be referenced frequently. Text search isn’t good enough; it’s easy to miss implied, contextually abbreviated references to entities, as well as references to more than a single entity in a sentence. Although the best named-entity recognition (NER) systems have become quite good, there are some industries for which accurate NER requires considerable domain knowledge. (For example, in oil and gas exploration, it’s sometimes difficult to distinguish between the names of oil fields and the names of people.)
  • Context is important. Is a document describing the past, present, or future? Is a news article referring to the status quo or a critical situation requiring immediate attention? Getting this right is crucial for systems that promise to serve up “what matters today.”
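The first point, about idioms and metaphors, can be made concrete with a deliberately naive sketch. The keyword lists, the function name, and the sample memo are all hypothetical. The code shows how word-level matching without context tags a business memo as being about football.

```python
import re

# Hypothetical keyword lists; a real classifier would use far richer knowledge.
TOPIC_KEYWORDS = {
    "football": {"touchdown", "quarterback", "punt", "goalposts"},
    "finance": {"revenue", "earnings", "forecast", "margin"},
}


def keyword_topics(text):
    """Tag a text with every topic whose keyword list it touches."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(topic for topic, keywords in TOPIC_KEYWORDS.items()
                  if words & keywords)


memo = ("Management moved the goalposts again, so we will punt "
        "the launch until revenue recovers.")
topics = keyword_topics(memo)
print(topics)  # the memo picks up a spurious "football" tag
```

Avoiding the false tag means recognizing "moved the goalposts" and "punt the launch" as figures of speech, which takes knowledge beyond individual words.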

If manual information governance is unsustainable, so are manual processes to capture the knowledge required by automated content classifiers. A content classifier powerful enough to handle most documents with acceptable precision and recall might require well in excess of 100,000 elements of basic language knowledge. (Consider as a rule of thumb that a content classifier needs a lexicon to cover the language in use, in multiple word combinations (“n-grams”) to disambiguate meaning. Regardless of how the knowledge is represented, for a lexicon of 35,000 words, used perhaps three times each in combination, and sometimes still having ambiguous meaning, it’s easy to see how such a knowledge base can grow to over 100,000 entries.) Such a knowledge base would be difficult to assemble and maintain by hand.
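To see how word combinations inflate a knowledge base, the following minimal sketch collects every one-, two-, and three-word sequence from a pair of sample sentences. The sentences and function names are illustrative assumptions. The last line simply restates the rule of thumb above (35,000 words used about three times each in combination).

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_lexicon(sentences, max_n=3):
    """Count every n-gram from length 1 to max_n across the sentences."""
    lexicon = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            lexicon.update(ngrams(tokens, n))
    return lexicon


sentences = [
    "the oil field is productive",
    "the field engineer is productive",
]
lexicon = build_lexicon(sentences)
sizes = Counter(len(entry) for entry in lexicon)  # distinct entries per n
print(len(lexicon), dict(sizes))

# Rule-of-thumb estimate from the text: words times combinations per word.
estimated_entries = 35_000 * 3
```

Even with only six distinct words, the lexicon holds 19 entries once two- and three-word combinations are included. The combinations are what separate "oil field" from "field engineer".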

Machine learning, enabled by recent advances in high-performance computing and AI, can automate knowledge acquisition for document understanding. (“Latent semantic analysis” is a common technique for identifying concepts from terms in documents.) ML technologies can be designed to run autonomously or be guided by subject matter experts (e.g., when the content demands good choices for representative training material, when concepts learned automatically must be validated before use, or when the system must not only classify content, but also support its classifications with explanations meaningful to people).
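As a rough illustration of latent semantic analysis, the sketch below builds a term-document matrix for a toy corpus and projects each document onto its two strongest "concepts." It is a sketch under stated assumptions: the five documents, the function names, and the use of plain power iteration in place of a library SVD routine are illustrative choices, not the author's method or any product's implementation.

```python
import math

DOCS = [  # hypothetical corpus: three oilfield notes, two finance notes
    "oil drilling rig",
    "drilling rig offshore",
    "offshore platform",
    "invoice payment contract",
    "payment contract terms",
]


def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw term counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w in doc.lower().split():
            matrix[index[w]][j] += 1.0
    return vocab, matrix


def gram(matrix):
    """Document-by-document matrix of dot products (A transposed times A)."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


def top_eigenpairs(sym, k, iterations=1000):
    """Top-k eigenpairs of a symmetric matrix: power iteration plus deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # remove the component just found
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def document_concepts(docs, k=2):
    """Coordinates of each document in a k-dimensional concept space."""
    _, matrix = term_document_matrix(docs)
    pairs = top_eigenpairs(gram(matrix), k)
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(len(docs))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


_, matrix = term_document_matrix(DOCS)
raw_first = [row[0] for row in matrix]
raw_third = [row[2] for row in matrix]
coords = document_concepts(DOCS, k=2)
print("shared-word similarity:", cosine(raw_first, raw_third))
print("concept-space similarity:", cosine(coords[0], coords[2]))
```

The first and third documents share no words, so their raw term vectors are orthogonal. In the reduced space they land close together, because the second document links their vocabularies. This is the sense in which the technique identifies concepts rather than terms.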

[For more from the author on unstructured data, see "Extracting Value from Unstructured Data."]

About The Author
Eric Schoen
Eric Schoen is the Director of Engineering at i2k Connect, where he is responsible for delivering the i2k Connect Platform. He plays a major role in ensuring that the architecture, AI science, and processing algorithms are robust, reliable, and scalable from cloud-based to on-premise installations. Before joining i2k Connect, Mr. Schoen spent over 30 years at Schlumberger, in both research and engineering functions, most recently as its Chief…