Scientio
XML Miner mining the Reuters-21578 data set.
This is an example of text mining. The Reuters set contains around 10,000 news stories in XML. Each story is annotated with a title and one or more topic headings drawn from a set of 10.
The topics have short names like "earn" and "oil". XML Miner must learn to associate the topic headings with the text of the story. It trains on 9,000 of the stories and measures its performance on the remaining 1,000. Finally, you can test the rule set on 10 example news stories.
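To make the shape of the input concrete, here is a minimal sketch of reading such stories with Python's standard library. The element names (`stories`, `story`, `title`, `topics`, `topic`, `body`) and the two sample stories are assumptions made up for illustration; they are not the actual schema or data used by the demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: these element names and stories are illustrative
# assumptions, not the real schema or content of the demo's data file.
SAMPLE = """
<stories>
  <story>
    <title>Example Corp reports higher first quarter profit</title>
    <topics><topic>earn</topic></topics>
    <body>Example Corp said net income rose in the first quarter.</body>
  </story>
  <story>
    <title>Oil company reports lower earnings</title>
    <topics><topic>oil</topic><topic>earn</topic></topics>
    <body>The company blamed weaker prices for the decline.</body>
  </story>
</stories>
"""


def load_stories(xml_text):
    """Return one dict per story with its title, body, and list of topics."""
    root = ET.fromstring(xml_text.strip())
    stories = []
    for node in root.iter("story"):
        stories.append({
            "title": node.findtext("title", default="").strip(),
            "body": node.findtext("body", default="").strip(),
            # A story may carry one or more topic headings.
            "topics": [t.text.strip() for t in node.iter("topic") if t.text],
        })
    return stories


stories = load_stories(SAMPLE)
for story in stories:
    print(story["title"], "->", ", ".join(story["topics"]))
```

Each story ends up as a title, a body, and a list of topics, which is the pairing of text and topic headings that the learner has to work from.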
Pressing the button below starts the training process. After approximately fifteen seconds, the results calculated on the training and test sets are displayed. The system is trained on 90% of the data and tested on the remainder. Each run of the demo draws a different random selection for the training and test sets, so the results may differ slightly from run to run. Once XML Miner is trained, the generated rule set is available from the link below. Note that this is a large file, around 750 KB. Finally, you can observe the runtime processor operating on 10 different data items drawn from the source data set. Pressing "next" cycles through the examples.
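The random 90/10 split and the scoring on the held-out stories can be sketched as follows. This is not XML Miner's algorithm, which is not described here. A simple word-vote classifier stands in for it, each story is given a single topic for brevity, and all function names and the synthetic titles are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def split_data(items, train_fraction=0.9, seed=None):
    """Shuffle a copy of items and split it into training and test lists."""
    rng = random.Random(seed)  # seed=None gives a different split each run
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(examples):
    """Count how often each title word occurs with each topic."""
    word_topics = defaultdict(Counter)
    for title, topic in examples:
        for word in title.lower().split():
            word_topics[word][topic] += 1
    return word_topics


def predict(word_topics, title, default="earn"):
    """Let every known word in the title vote for the topics it was seen with."""
    votes = Counter()
    for word in title.lower().split():
        votes.update(word_topics.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


def accuracy(word_topics, examples):
    """Fraction of examples whose predicted topic matches the actual topic."""
    if not examples:
        return 0.0
    correct = sum(predict(word_topics, title) == topic for title, topic in examples)
    return correct / len(examples)


# Synthetic stand-in data: (title, topic) pairs, not real Reuters stories.
DATA = [(f"company {i} reports quarterly profit", "earn") for i in range(50)]
DATA += [(f"crude oil prices move on day {i}", "oil") for i in range(50)]

train_set, test_set = split_data(DATA, train_fraction=0.9, seed=0)
model = train(train_set)
print("training accuracy:", accuracy(model, train_set))
print("test accuracy:", accuracy(model, test_set))
```

Passing `seed=None` mirrors the demo's behaviour of drawing a fresh split on every run. The synthetic titles are trivially separable, so the scores here say nothing about the performance to expect on the real, noisier data.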
Please note:
Training on both the title and the body takes more than one minute. This example trains on just the title, and its performance is approximately 5% below that of training on both the title and the body. The performance of 80-85% reflects the noisy nature of the source data set and the inaccuracies and inconsistencies in the original assignment of topics. Much better performance can be expected from a properly labeled data set.
Results:
Title: -
Story body: -
Actual topic: -
Predicted topic: -