TextClustering example using a Corpus object

Below is a simple example that uses the framework's sample data to load and process a series of documents, store them in a Corpus, and then run the KMeans algorithm.

    package com.examples;

    import com.data.Corpus;
    import com.data.DataLoader;
    import com.data.Document.TextDocument;
    import com.data.processors.NLPProcessorFactory;
    import com.data.processors.OpenNLPProcessor;
    import com.data.processors.StopWordsFilter;
    import com.data.processors.TFIDFProcessor;
    import com.textclustering.kmeans.KMeans;
    import org.apache.commons.io.FileUtils;

    import java.io.File;
    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;

    /**
     * This is an example class that shows how to process a set of documents through the framework and
     * use kmeans clustering to try to learn document classification labels.
     * Created by george on 12/16/13.
     */
    public class ExampleKMeans {

        public static void main(String[] args) throws IOException {

            // load text data
            Map<String,List<File>> data = DataLoader.loadDataSet(new File("src/test/sample-data/dataLoader-test"));
            Corpus corpus = new Corpus();
            corpus.setFeatureProcessor(new TFIDFProcessor());

            // setup the NLP and stopwords processors
            OpenNLPProcessor processor = (OpenNLPProcessor) NLPProcessorFactory.initNLPProcessor(OpenNLPProcessor.class);
            processor.init();

            StopWordsFilter filter = new StopWordsFilter(new File("./src/resources/stopwords.txt"));
            List<TextDocument> documentList = new LinkedList<TextDocument>();

            for(String classLabel: data.keySet()){
                // grab each document from each class label and process them through the ingestion pipeline
                for(File document: data.get(classLabel)){
                    TextDocument tDoc = new TextDocument();

                    // process each text document through the processors
                    String rawText = FileUtils.readFileToString(document,"utf-8");
                    List<String[]> parsedDoc = processor.processDocument(rawText);
                    List<List<String>> filteredDoc = filter.filterStopWords(parsedDoc);

                    tDoc.setOriginalDocumentString(rawText);
                    tDoc.setProcessedDocument(filteredDoc);
                    documentList.add(tDoc);
                    // now add the document to the corpus with the given class label
                    corpus.addDocument(classLabel,tDoc);
                }
            }

            // calculate the feature weights.
            corpus.calculateFeatureWeights();

            // now call the KMeans object
            KMeans kmeans = new KMeans(2,corpus);
        }
    }

The above code is a fairly standard template for how to use the framework. Below, let's go over each section of the example so it is clear what is going on.

        // load text data
        Map<String,List<File>> data = DataLoader.loadDataSet(new File("src/test/sample-data/dataLoader-test"));
        Corpus corpus = new Corpus();
        corpus.setFeatureProcessor(new TFIDFProcessor());

The above code is fairly simple boilerplate that uses the DataLoader object to load a directory of text documents stored in their own sub-folders. The sub-folder names define the class labels for each subset of documents. The corpus is also given a TFIDFProcessor, which will be used later to calculate the feature weights.
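For reference, the loader expects each class label to be a sub-folder containing that label's documents. The layout below is purely illustrative; the actual folder and file names under the sample-data directory may differ:

    src/test/sample-data/dataLoader-test/
        sports/
            article-1.txt
            article-2.txt
        politics/
            article-3.txt
            article-4.txt

With a layout like this, the returned map would have the keys "sports" and "politics", each pointing to the list of File objects found in that sub-folder.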

        // setup the NLP and stopwords processors
        OpenNLPProcessor processor = (OpenNLPProcessor) NLPProcessorFactory.initNLPProcessor(OpenNLPProcessor.class);
        processor.init();

        StopWordsFilter filter = new StopWordsFilter(new File("./src/resources/stopwords.txt"));
        List<TextDocument> documentList = new LinkedList<TextDocument>();

Next we must set up the series of processors to run on the documents. You can use as many or as few as you wish, but generally you will want to include some sort of tokenization step. Here the OpenNLPProcessor handles tokenization and the StopWordsFilter removes common words listed in stopwords.txt.
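To make the pipeline's output a bit more concrete, the sketch below runs a single raw string through the same processor and filter created above. This is only an illustration: the sample sentence is made up, and the exact tokens you get back depend on the OpenNLP models loaded by init().

        // a minimal sketch: push one raw string through the same pipeline used in the loop below
        String sample = "Document clustering groups similar texts together.";

        // processDocument returns one String[] of tokens per detected sentence
        List<String[]> tokenized = processor.processDocument(sample);

        // filterStopWords drops the stop words listed in stopwords.txt
        List<List<String>> cleaned = filter.filterStopWords(tokenized);

        for (List<String> sentence : cleaned) {
            System.out.println(sentence);
        }

Each inner list holds the surviving tokens of one sentence after stop-word filtering.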

        for(String classLabel: data.keySet()){
            // grab each document from each class label and process them through the ingestion pipeline
            for(File document: data.get(classLabel)){
                TextDocument tDoc = new TextDocument();

                //process each text document through the processors
                String rawText = FileUtils.readFileToString(document,"utf-8");
                List<String[]> parsedDoc = processor.processDocument(rawText);
                List<List<String>> filteredDoc = filter.filterStopWords(parsedDoc);

                tDoc.setOriginalDocumentString(rawText);
                tDoc.setProcessedDocument(filteredDoc);
                documentList.add(tDoc);
                // now add the document to the corpus with the given class label
                corpus.addDocument(classLabel,tDoc);
            }
        }

        // calculate the feature weights.
        corpus.calculateFeatureWeights();

        // now call the KMeans object
        KMeans kmeans = new KMeans(2,corpus);

The final block of code is the meat of the example. For each class label key in the map, we take its subset of documents and create a new TextDocument object for each one. Each document is read in as a string, processed by the NLP processor, and then filtered for stop words. In addition, the stop-word filter removes words that are only one character in length. Once each document has been fully processed and filtered, we add it and its class label to the corpus, and after the loop we call the corpus's calculateFeatureWeights method.
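The feature weights themselves come from the TFIDFProcessor that was set on the corpus at the start of the example. This page does not show the processor's internals, but conceptually a TF-IDF weight for a single term can be computed roughly as in the sketch below; the helper method and parameter names are hypothetical and are not part of the framework's API.

    // Illustrative only: a plain TF-IDF weight, not the framework's internal implementation.
    // termFreq = number of times the term appears in the document
    // docFreq  = number of documents in the corpus containing the term
    // numDocs  = total number of documents in the corpus
    static double tfIdfWeight(int termFreq, int docFreq, int numDocs) {
        if (termFreq == 0 || docFreq == 0) {
            return 0.0;
        }
        return termFreq * Math.log((double) numDocs / docFreq);
    }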

Once the feature weights have been calculated, we can begin to learn the cluster groups using the KMeans algorithm, or any other algorithm that learns classification boundaries. Please note that every time you add a new document, or a new set of documents, to the corpus, you will need to relearn all of the term weights given the new data. This could be a costly operation if you have a large number of documents in your corpus.
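As a rough sketch of what that looks like, suppose another TextDocument has already been processed through the same pipeline as above; newDoc and the "new-label" class label here are placeholders, not part of the sample data.

        // adding a new processed document invalidates the previously calculated term weights
        corpus.addDocument("new-label", newDoc);

        // recalculate the TF-IDF weights over the enlarged corpus before clustering again
        corpus.calculateFeatureWeights();

        KMeans updatedKmeans = new KMeans(2, corpus);

Because this recalculation covers every document in the corpus, it is usually worth batching new documents together and recalculating the weights once, rather than after every single addition.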
