HITEC is a software package for high-accuracy automatic text
categorization. The engine of HITEC is an implementation of UFEX
(Universal Feature EXtractor) for textual documents. UFEX is the
learning method behind HITEC's categorization performance: HITEC
outperforms its competitors on all investigated document collections.
(For further details, read the white paper.)
HITEC applies a supervised learning method; that is, it learns from
training data (learning phase) and can then classify new documents
into known categories (operational phase). The performance of
categorization strongly depends on the quality of the training data.
For efficient training, HITEC requires
- a fixed category system (usually organized as a hierarchy); during
the operational phase, new, ``unknown'' documents are classified into
that system;
- some relevant training documents for each category of the category
system.
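The two requirements above can be checked before training with a short
script. The following is only a minimal sketch and not part of HITEC:
the names CATEGORY_PARENT, TRAINING_DOCS, path_to_root, and
categories_without_documents are hypothetical, as is the way the
category hierarchy is stored (a map from each category to its parent).
It also assumes that a training document assigned to a category counts
for all ancestors of that category, which HITEC may or may not do.

```python
from collections import defaultdict

# Hypothetical category system: each category maps to its parent
# (None marks a top-level category).
CATEGORY_PARENT = {
    "science": None,
    "science/physics": "science",
    "science/biology": "science",
    "sports": None,
}

# Hypothetical training set: (document text, category) pairs.
TRAINING_DOCS = [
    ("Quantum states and measurement", "science/physics"),
    ("Cell division in plants", "science/biology"),
    ("League results of the weekend", "sports"),
]


def path_to_root(category, parent_of):
    """Return the category followed by all of its ancestors."""
    path = []
    while category is not None:
        path.append(category)
        category = parent_of[category]
    return path


def categories_without_documents(parent_of, docs):
    """List categories with no training document, directly or via a descendant."""
    counts = defaultdict(int)
    for _text, category in docs:
        if category not in parent_of:
            raise ValueError(f"unknown category: {category}")
        # Assumption: a document also counts for every ancestor category.
        for node in path_to_root(category, parent_of):
            counts[node] += 1
    return sorted(c for c in parent_of if counts[c] == 0)


missing = categories_without_documents(CATEGORY_PARENT, TRAINING_DOCS)
```

If missing is not empty, the listed categories need training documents
(or should be removed from the category system) before training starts.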
During operation, HITEC returns, for each unknown document, a list of
the most relevant categories ordered by confidence value. The greater
this value, the more relevant HITEC deems the corresponding category
to the document. The returned list of categories can be processed
further, depending on the nature of the classification problem. If
perfect accuracy is required, an expert can accept, revise, or reject
the categories proposed by HITEC. If the accuracy of around 90\%
observed in tests is sufficient, the proposed categories can be
accepted automatically based on their confidence values.
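The post-processing described above can be sketched as a simple
threshold on the confidence values. This is an illustration only: the
function split_by_confidence, the (category, confidence) pair format,
the assumption that confidence values lie between 0 and 1, and the
threshold of 0.6 are all hypothetical and do not describe HITEC's
actual output format.

```python
def split_by_confidence(ranked, threshold):
    """Split (category, confidence) pairs into accepted and to-review lists.

    Both returned lists are ordered by decreasing confidence.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    accepted = [(c, v) for c, v in ordered if v >= threshold]
    review = [(c, v) for c, v in ordered if v < threshold]
    return accepted, review


# Hypothetical categories proposed for one document.
ranked = [("sports", 0.12), ("science/physics", 0.91), ("science", 0.64)]
accepted, review = split_by_confidence(ranked, threshold=0.6)
```

Categories in accepted would be assigned automatically, while those in
review could be passed to an expert or discarded, depending on the
accuracy the application requires.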
HITEC is implemented efficiently, so its high accuracy comes with fast
operation even on very large document collections. Once HITEC has been
trained on a document collection, the operational phase runs in real
time (see also the test pages). It is able to process hundreds of
gigabytes in reasonable time (training phase) and to work with
thousands of categories on an average PC.
The software, techniques, and algorithms presented here are the
property of the developers and are protected by copyright law. Please
contact the developers if you intend to apply or use any product of
this project in any manner.