Scientio

ConceptMine

Mine documents for concepts and compare them efficiently for similarity with ConceptMine.

ConceptMine is a .NET class library and web service that lets you efficiently compare and index documents for similarity, both in lexical terms and by the concepts they contain. It does this using new techniques in concept mining.
It uses a model of the English language (Spanish is also available) to calculate a short numeric 'signature' that categorizes the concepts found within a given document.
This signature enables simple and efficient lookup of documents in concept space, and thus efficient comparison, indexing and retrieval. The class library includes indexing code based on k-d trees that indexes documents and locates similar documents in O(log n) time.
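
To make the lookup idea concrete, here is a minimal Python sketch of a k-d tree over fixed-length signature vectors. It is an illustration only and does not show ConceptMine's .NET API: the names `Node`, `build` and `nearest`, the document ids, and the three-value signatures are all assumptions made for brevity (real signatures are described as 9 to 16 values). For a balanced tree, a nearest-neighbour query of this kind typically visits O(log n) nodes.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Signature = Tuple[float, ...]


@dataclass
class Node:
    point: Signature            # the stored document signature
    doc_id: str                 # hypothetical document identifier
    axis: int                   # dimension this node splits on
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(items: List[Tuple[Signature, str]], depth: int = 0) -> Optional[Node]:
    """Build a balanced k-d tree by splitting on the median of one axis per level."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    point, doc_id = items[mid]
    return Node(point, doc_id, axis,
                build(items[:mid], depth + 1),
                build(items[mid + 1:], depth + 1))


def nearest(node: Optional[Node], query: Signature,
            best: Optional[Tuple[float, str]] = None) -> Optional[Tuple[float, str]]:
    """Return (distance, doc_id) for the stored signature closest to query."""
    if node is None:
        return best
    d = math.dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.doc_id)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only descend into the far side if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


# Made-up signatures, three values each to keep the example short.
index = build([
    ((0.10, 0.90, 0.20), "doc-a"),
    ((0.80, 0.10, 0.50), "doc-b"),
    ((0.40, 0.40, 0.40), "doc-c"),
    ((0.15, 0.70, 0.30), "doc-d"),
])
distance, doc_id = nearest(index, (0.79, 0.12, 0.50))
print(doc_id)  # prints doc-b
```

Because every signature has the same small number of dimensions regardless of document length, the tree stays compact and each comparison is a cheap vector distance.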

Applications are:

  • Detection of duplicate or near-duplicate documents in large corpora
  • Anti-plagiarism source matching
  • Indexing of documents by concept
  • Concept-driven search engines
  • Document clustering
  • Document topic inference
  • Text and Concept Mining

Features are:

  • Easy extension to further languages for which a WordNet model exists (all major languages).
  • Efficient, low overhead document matching.
  • Can be configured to locate near-duplicate documents in a data source: those that differ only in formatting or contain typos or revisions.
  • Can be configured to analyze the concepts in a document using concept mining, and to generate a list of its key words, such as proper nouns.
  • Can be configured to locate documents that are similar in embedded concepts but lexically dissimilar.
  • Easy integration with common databases, ASP.NET-based websites, .NET applications, etc.
  • Thread-safe for simple multi-user support.
  • Java version in development.
  • Signatures consist of 9 to 16 floating-point values, depending on language.
  • Signature size is independent of document length.
  • Signatures are not influenced by formatting or white space.
  • Defeats common plagiarism tactics: for anti-plagiarism applications, documents that have been modified by thesaurus-based substitution of words, or by rearranging sentence or paragraph order, will still generate signatures identical to those of the originals.
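
The signature properties listed above (fixed size, insensitivity to formatting, and robustness to synonym substitution and reordering) can be illustrated with a toy Python sketch. This is not ConceptMine's algorithm, which is not described here: the `CONCEPTS` list and `WORD_TO_CONCEPT` table are hypothetical stand-ins for a WordNet-derived language model, and the `signature` function simply builds a normalized histogram of concept counts.

```python
import re
from collections import Counter
from typing import Tuple

# Hypothetical concept slots; a real model would derive these from WordNet.
CONCEPTS = ["vehicle", "motion", "size", "animal"]

# Hypothetical synonym table mapping words to a concept slot.
WORD_TO_CONCEPT = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "runs": "motion", "moves": "motion", "drives": "motion",
    "big": "size", "large": "size", "small": "size",
    "dog": "animal", "hound": "animal", "cat": "animal",
}


def signature(text: str) -> Tuple[float, ...]:
    """Return a fixed-length vector: the share of recognized words in each concept."""
    words = re.findall(r"[a-z]+", text.lower())  # discards whitespace and punctuation
    counts = Counter(WORD_TO_CONCEPT[w] for w in words if w in WORD_TO_CONCEPT)
    total = sum(counts.values()) or 1            # avoid division by zero on empty input
    return tuple(counts[c] / total for c in CONCEPTS)


original = "The big dog runs. The car moves."
# Synonyms swapped, sentences reordered, extra whitespace added.
reworded = "The automobile   drives.\n\nThe large hound moves."
unrelated = "The small cat."

print(signature(original) == signature(reworded))   # prints True
print(signature(original) == signature(unrelated))  # prints False
print(len(signature(original * 50)))                # prints 4
```

Counting concepts rather than words is what makes the synonym swap invisible in this sketch, and using a histogram (which ignores order) is what makes sentence rearrangement invisible. Normalizing by the total keeps the vector length and scale independent of document length.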