Text Topic Classifier:
Text Topic Classifier uses sets of relevant and irrelevant training files to train classifiers, which are then run to decide whether a text file is about a given topic.
For more information, please refer to the paper "Classifying domain-specific text documents containing ambiguous keywords".
Sample data files are included in the package. Each Python file includes sample command line arguments:
1) text_topic_classifier.py creates a number of classifiers using sklearn libraries. Positive (relevant) texts are assigned the class 1, while negative (irrelevant) texts are assigned the class 0.
2) single_classify.py uses one of the classifiers created by text_topic_classifier.py to assign a class (0 or 1) to text contained in a single file. The class is the returned value of the script.
3) batch_classify.py uses one of the classifiers created by text_topic_classifier.py to assign a class to each file in a set of text files. Positive and negative file names are saved in separate text files.
Optionally, positive and negative files can be copied to separate directories too.
4) LiteratureLoader.java is sample code to show how single_classify.py can be called from other programs as part of a pipeline.
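The calling pattern described in item 4 can also be sketched in Python. This is a minimal sketch, not the code in LiteratureLoader.java: it assumes that "the returned value of the script" means the process exit status, and the helper name classify_file and its parameters are hypothetical. The real command line arguments are the ones listed in single_classify.py itself, so they are passed through unchanged here.

```python
import subprocess
import sys


def classify_file(script_path, script_args):
    """Run a classification script and return its exit status as the class.

    script_args is passed through unchanged; the real arguments are the
    ones shown in the sample command line inside single_classify.py.
    """
    completed = subprocess.run([sys.executable, script_path, *script_args])
    if completed.returncode not in (0, 1):
        # Only 0 (irrelevant) and 1 (relevant) are valid classes.
        raise RuntimeError(f"unexpected exit status {completed.returncode}")
    return completed.returncode
```

Note that a Python script that stops on an uncaught exception also exits with status 1, so a pipeline built this way may want to inspect the script's error output before trusting a positive result.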
Example data are from NCBI's PubMed database and contain papers relevant to echinoderms as well as papers irrelevant to them.
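To illustrate how relevant and irrelevant texts map to the classes 1 and 0, the training step might look roughly like the sketch below. It is not the code in the package: the choice of TfidfVectorizer and LogisticRegression is an assumption made for illustration, the helper classify_text is hypothetical, and the short texts are made-up placeholders rather than the PubMed sample data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up placeholder texts, not taken from the package's sample data.
relevant_texts = [
    "sea urchin embryo gene expression",
    "starfish larva development regeneration",
    "sea cucumber echinoderm genome",
]
irrelevant_texts = [
    "stock market prices fell sharply",
    "football team wins league match",
    "smartphone battery camera review",
]

texts = relevant_texts + irrelevant_texts
# Relevant texts get class 1, irrelevant texts get class 0.
labels = [1] * len(relevant_texts) + [0] * len(irrelevant_texts)

# Vectorizer and model are illustrative choices (an assumption).
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)


def classify_text(text):
    """Return 1 if the text is predicted relevant, otherwise 0."""
    return int(classifier.predict([text])[0])
```

A real training set would contain many full documents per class. With only a handful of short texts, as here, the predictions mainly reflect word overlap with the training examples.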
Download
text-topic-classifier-1.0.zip