
Concept Mining

It can be argued that humans don't think in words; they think in concepts.

It's true, the voice in your head, that by-product of consciousness, uses words; but behind the words is the flow of concepts. Whenever you think rapidly and intuitively the words get left behind; at least, that is the experience of most of us, and we think in a string of concepts.

Of course, these mental concepts are not accessible to others; they are our own private language of thought. But because we share the same world and common languages, these concepts are heavily influenced by the outside world. They have order: they have structure.

Anyone who has used a thesaurus has seen part of this structure. A thesaurus has two elements: a dictionary that acts as a lookup table, and a massive tree of meanings.

One uses a thesaurus by looking up a word in the dictionary section, finding the possible meanings as tree locations, and then finding other words in the same location of the tree that have the same meaning. In concept mining, we consider concepts to be identical to these thesaurus meanings.
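The lookup described above can be sketched in a few lines of Python. This is only a toy illustration: the words, the tree locations and the function names are all invented for the example and do not come from any real thesaurus or from ConceptMine.

```python
# Hypothetical dictionary section: word -> tree locations (concepts) it can mean.
# Every entry here is invented for illustration.
INDEX = {
    "bank": ["finance/institution", "geography/river_edge"],
    "shore": ["geography/river_edge"],
    "lender": ["finance/institution"],
}

def concepts_for(word):
    """Step 1: look the word up and return its possible tree locations."""
    return INDEX.get(word.lower(), [])

def synonyms(word):
    """Step 2: for each location, list the other words filed at that location."""
    word = word.lower()
    result = {}
    for concept in concepts_for(word):
        result[concept] = sorted(
            other for other, locations in INDEX.items()
            if concept in locations and other != word
        )
    return result

print(synonyms("bank"))
```

Note that "bank" returns two locations. That ambiguity is exactly what a word-to-concept conversion has to carry forward.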

So, we're trying to say that our mental concepts are strongly related to these thesaurus concepts, and that the structure of meanings in a thesaurus maps onto the common mental structures we all use.

Of course, on the other hand, language has lots of ambiguities. If we convert what another person is saying into concepts inside our head as we listen, then there will be times that we will select the wrong concept and have to backtrack when a later part of the text makes it clear that we confused the sound of a word, or one of several meanings attached to it. This re-adjustment is one of the mechanisms of humor.

It's also in this area that the most rapid growth of languages occurs. New words are frequently invented, but we even more rapidly give new meanings to existing words, especially in technical jargon.

When we convert words to concepts with machines we are liable to make mistakes, or have to deal with many different interpretations of a given sentence simultaneously.

It is the computational complexity of this that has put off text mining researchers, who have concentrated on words, word frequencies and word co-occurrences as the raw material of their analyses.

But even though the ambiguity of language means that one sentence might be converted into many different strings of concepts, there are still benefits to processing in concepts rather than words. Concepts have strong structure, where words don't.

A thesaurus is organized using one kind of tree, built from "is a kind of" relationships (known as hypernymy), where concepts further down the tree are examples of the concepts further up. But there are other relationships that linguists use that add more structure, like "is a part of" relationships (meronymy) or "is the opposite of" relationships (antonymy).
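To make these relationships concrete, here is a minimal Python sketch that stores each one as a simple mapping and walks the "is a kind of" tree upward. The concept names and the helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical relationship tables; every concept name is invented.
IS_A = {"spaniel": "dog", "dog": "mammal", "mammal": "animal"}  # hypernymy
PART_OF = {"paw": "dog", "tail": "dog"}                         # meronymy
OPPOSITE = {"hot": "cold", "cold": "hot"}                       # antonymy

def hypernyms(concept):
    """Return the chain of ever more general concepts above `concept`."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def is_kind_of(concept, ancestor):
    """True if `ancestor` appears somewhere above `concept` in the tree."""
    return ancestor in hypernyms(concept)

print(hypernyms("spaniel"))
print(is_kind_of("spaniel", "animal"))
```

This is the kind of operation that has no counterpart when working with raw words: two different words can be recognized as related because their concepts share an ancestor.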

The key insight of concept mining is that the benefits we get from being able to manipulate these structures are far greater than the deficit we get from the ambiguity of text. In short, Concept Mining has the potential to be much more powerful than text mining, and a much more fruitful area of research.

At Scientio we've put a great deal of work into this area. We built a commercial system several years ago that used concept mining to find similar documents in large corpora in O(log n) time. To our knowledge, this is still the fastest algorithm for the task. We've built up a toolkit of concept mining tools in our product ConceptMine that enables the user to join us in researching new products based on this exciting new technology.

Concept Strings

Part of this work has been the invention of a new kind of data structure called a "concept string". This is a structure that contains one or more of the alternate readings of a piece of text, expressed as a string of parts of speech and related concepts.

These concept strings can be compared and stored in efficient structures so that matching strings of concepts can be found in better than linear time. This offers an efficient method to scan text at high speed for elements whose concept strings, and thus meanings, are similar to stored concept strings. This could be used for a brute-force approach to language processing, for instance for chatbot-style systems, and also for security purposes. The key benefit of a concept string is that it matches multiple variations of text that have the same meaning. Thus the task of detecting particular kinds of conversations in large volumes of data can be dramatically simplified.

Figure: Block diagram of concept string structure

In the diagram above, each path between the possible concepts, one of which is represented in red, represents one possible meaning for the string.
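The idea of paths through alternative concepts can be sketched in Python as follows. This is an assumption-laden toy, not Scientio's actual structure: it treats a concept string as a list of slots, one per word, and assumes any alternative in one slot may follow any alternative in the previous slot. The labels are invented.

```python
from itertools import product

# Hypothetical concept string for "the bank closed".
# Each slot lists the alternative (part of speech, concept) pairs for one word.
CONCEPT_STRING = [
    [("det", "definite")],
    [("noun", "finance/institution"), ("noun", "geography/river_edge")],
    [("verb", "shut"), ("verb", "cease_trading")],
]

def readings(concept_string):
    """Enumerate every path through the slots, taking one alternative per slot."""
    return list(product(*concept_string))

for path in readings(CONCEPT_STRING):
    print(" -> ".join(concept for _pos, concept in path))
```

With one, two and two alternatives in the three slots, the sketch yields four paths, each a possible meaning. The number of paths multiplies with every ambiguous word, which is why enumerating them explicitly does not scale.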

One very useful task for concept strings is to compare them to see if any of the possible meanings match. Although at first glance this is a computationally tortuous task, Scientio has developed algorithms that perform this process efficiently.
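Scientio's algorithms are not described here, so the following Python sketch only shows why the comparison need not enumerate every reading. It assumes a simplified model in which a concept string is a list of slots, each slot being a set of alternative concepts chosen independently of its neighbours. Under that assumption, two strings share a reading exactly when they have the same length and every pair of corresponding slots has a concept in common. All names and concepts are hypothetical.

```python
def shares_reading(a, b):
    """True if some reading of `a` equals some reading of `b`.

    Assumes independent slots, so a slot-by-slot set intersection suffices
    and the cost grows with the length of the strings, not the number of paths.
    """
    if len(a) != len(b):
        return False
    return all(slot_a & slot_b for slot_a, slot_b in zip(a, b))

# Invented examples: each slot is a set of alternative concepts.
A = [{"definite"}, {"finance/institution", "geography/river_edge"}, {"shut"}]
B = [{"definite"}, {"geography/river_edge"}, {"shut", "cease_trading"}]
C = [{"definite"}, {"animal/dog"}, {"shut"}]

print(shares_reading(A, B))  # the two strings overlap in every slot
print(shares_reading(A, C))  # the middle slots have nothing in common
```

A structure where only some transitions between alternatives are allowed would need a more careful walk than this slot-by-slot check.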

Figure: Concept string comparison. Diagram of the process of comparing two concept strings.

Being able to compare one concept string to another is only the start. In a practical application with thousands or millions of strings, one-to-one comparison would be computational suicide!

Scientio has created a storage structure for concept strings, based on the algorithms used in gene sequencing, that indexes and looks up matching strings in time proportional to the length of the looked-up string.
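The actual storage structure is not specified here. As a stand-in, the Python sketch below uses a plain trie (prefix tree), a much simpler relative of the tree indexes used on gene sequences, to show how lookup cost can depend on the length of the query rather than on how many strings are stored. The class name, labels and concepts are all hypothetical.

```python
class ConceptTrie:
    """Toy prefix tree keyed on concepts; each node is itself a ConceptTrie."""

    def __init__(self):
        self.children = {}   # concept -> child node
        self.labels = set()  # identifiers of stored strings that end here

    def insert(self, reading, label):
        """Store one reading (a sequence of concepts) under `label`."""
        node = self
        for concept in reading:
            node = node.children.setdefault(concept, ConceptTrie())
        node.labels.add(label)

    def lookup(self, reading):
        """Return the labels stored for exactly this reading, one step per concept."""
        node = self
        for concept in reading:
            node = node.children.get(concept)
            if node is None:
                return set()
        return set(node.labels)

trie = ConceptTrie()
trie.insert(("definite", "finance/institution", "shut"), "doc-1")
trie.insert(("definite", "finance/institution", "cease_trading"), "doc-2")
print(trie.lookup(("definite", "finance/institution", "shut")))
```

Each lookup performs one dictionary access per concept in the query, so adding more stored strings does not lengthen the walk. An ambiguous query could follow several children at each step instead of one.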

Figure: Simplified diagram of a concept string lookup tree

Uses and applications of Concept Strings

Applications for concept strings are manifold. They offer the ability for the user to create simple template text that can then be matched against other text in a filtering process. The matches are based on concepts, rather than words, and so we can expect to match phrases and sentences with the same or similar meaning. While the user needs to supply template text, which may just be previously detected examples of the kind of text that they wish to filter, they do not need to consider every combination of words that might convey a particular meaning; rather they need to consider far fewer template texts that explore variations in grammar. Thus new recognition schemes can be set up rapidly with minimal effort.
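A minimal Python sketch of this template-based filtering is shown below. It is a toy under stated assumptions: the lexicon, the concept names and the functions are invented, words map to concepts by simple lookup, and a template matches a message only if they have the same number of words and share a concept at every position.

```python
# Hypothetical lexicon: word -> set of concepts it may express.
LEXICON = {
    "send": {"transmit"}, "email": {"transmit", "message"}, "forward": {"transmit"},
    "report": {"record"}, "document": {"record"}, "file": {"record", "store"},
    "read": {"perceive_text"},
    "the": {"definite"},
}

def to_concept_string(text):
    """Convert text to a list of concept sets; unknown words stand for themselves."""
    return [LEXICON.get(word, {word}) for word in text.lower().split()]

def matches(template, text):
    """True if the template and the text share a concept at every position."""
    a, b = to_concept_string(template), to_concept_string(text)
    return len(a) == len(b) and all(x & y for x, y in zip(a, b))

def filter_messages(templates, messages):
    """Keep the messages that match at least one template."""
    return [m for m in messages if any(matches(t, m) for t in templates)]

messages = ["email the document", "read the document", "forward the file"]
print(filter_messages(["send the report"], messages))
```

A single template, "send the report", catches two differently worded messages here without either wording having been listed in advance. Variations in grammar or word order would still need their own templates, as the text above notes.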

The direct applications of this technique include homeland security; detecting unauthorized disclosures or comments in outgoing emails; detecting cyberbullying and abuse in emails, games and chat rooms; "smart" user interfaces such as chatbots; and many more.

Read the Concept Strings white paper for more information