Decision Tree in Weka
1 Introduction
Waikato Environment for Knowledge Analysis (Weka), developed at the University of Waikato, New Zealand, is free software licensed under the GNU General Public License. It comes with a GUI that makes it easy to visualize datasets, train and compare classifiers, and much more, which makes it a handy tool for learning machine learning.
source: "https://en.wikipedia.org/wiki/Weka_(machine_learning)"
2 Downloading and installing Weka
Downloading and installing Weka is straightforward: it comes as a packaged application that runs on Mac, Windows, or Linux. There is also a Java API.
link to download: "https://waikato.github.io/weka-wiki/downloading_weka/"
2.1 Importing datasets
When Weka starts, several interfaces are available; we used the Explorer. A collection of sample datasets can be found on this page: "https://waikato.github.io/weka-wiki/datasets/".
2.2 Learning about Datasets
Weka prefers to load data in the ARFF format. ARFF is an acronym for Attribute-Relation File Format, an augmentation of the CSV file format in which a header provides metadata about the data types in the columns. After loading "heart_failure_clinical_records_dataset" locally in the Weka Explorer, we learned about the attributes, the instances, and the correlations between the attributes. By default, Weka takes the last attribute in the dataset as the class (label), and clicking on an attribute displays its histogram. Through the histograms, it can be seen how different attribute values correlate with the class that needs to be predicted.
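The same inspection can also be done programmatically with the Weka Java API. Below is a minimal sketch; the file name heart_failure_clinical_records_dataset.arff and the class name LoadDataset are assumptions for illustration, not part of the Weka distribution.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataset {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (the path below is just a placeholder)
        Instances data = DataSource.read("heart_failure_clinical_records_dataset.arff");

        // Weka does not assume a class attribute; here we take the last one,
        // mirroring the Explorer's default behaviour
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Relation:   " + data.relationName());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Class:      " + data.classAttribute().name());
    }
}
```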
2.3 Classifying the Datasets
In Weka, under the Classify tab, there is a whole range of built-in classifiers. Under the trees category, there are several decision tree algorithms, namely RandomForest, DecisionStump, J48, etc.
3 J48 Decision Tree
J48 is the Weka project team's implementation of the C4.5 algorithm, a successor to ID3. J48 is a type of decision tree that performs pruning. Just by selecting the tree and clicking Start, we could train the decision tree on the mentioned dataset in a fraction of a second. In Weka, we can flip back and forth between different classifiers and compare the results. There are many types of classifiers in Weka, from naive Bayes to basic neural networks.
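As a rough Java API equivalent of clicking Start in the Explorer, the sketch below builds a J48 tree on the loaded dataset and prints the pruned tree. The file name is a placeholder, and the options shown (-C 0.25 -M 2) are simply J48's defaults, spelled out for illustration.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_failure_clinical_records_dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        // -C sets the pruning confidence factor, -M the minimum instances per leaf
        tree.setOptions(weka.core.Utils.splitOptions("-C 0.25 -M 2"));
        tree.buildClassifier(data);

        // Print the pruned tree, similar to the Classifier output panel
        System.out.println(tree);
    }
}
```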
3.1 Analysing the Classifier Output
The classifier output screen shows the trained J48 tree. As always, the tree is read from the top down: it starts by placing the most predictive attribute in the dataset at the top. Scrolling down further, a summary of correctly classified instances, accuracy, the confusion matrix, etc. can be seen.
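The same summary and confusion matrix can be reproduced outside the GUI with Weka's Evaluation class. This is only a sketch of evaluating on the training data itself (as with the Explorer's "Use training set" option); the file and class names are again placeholders.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectOutput {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_failure_clinical_records_dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Evaluate on the training data itself, as the Explorer does when
        // "Use training set" is selected
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);

        System.out.println(eval.toSummaryString()); // correctly classified instances, accuracy, ...
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}
```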
3.2 Accuracy
Weka gives us several options for evaluating accuracy:
Accuracy on the training set: This isn't really useful in the real world. As always in machine learning, our goal is to generalize from the training data; ideally, we want a model that performs well on unseen data. To simulate this, we either test the model on a separate test set or split the existing file into training and test data. In Weka, we can see statistics such as precision and recall for any classifier we train, along with accuracy. Although accuracy is the first metric that comes into play when evaluating a classifier, it doesn't always give us the whole picture, especially on datasets where one class is rare. Hence, when we evaluate classifiers, we have to look at performance on both positive and negative cases, not just overall accuracy.
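As a sketch of the train/test split described above, the following code holds out roughly 30% of the data for testing and reports accuracy, precision, and recall for one class. The 70/30 split, the random seed, and the class index passed to precision()/recall() are arbitrary choices for illustration.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTestSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_failure_clinical_records_dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then hold out roughly 30% of the instances for testing
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);

        System.out.println("Accuracy:  " + eval.pctCorrect() + " %");
        // Per-class precision and recall; index 0 is the first class value
        System.out.println("Precision: " + eval.precision(0));
        System.out.println("Recall:    " + eval.recall(0));
    }
}
```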
3.3 Cross-validation
The cross-validation technique iteratively divides the dataset into two chunks: the larger chunk is used for training and the smaller one for testing. The model is trained and evaluated, the process is repeated a number of times, and the average of the results is taken. In Weka, this can be done by selecting the Cross-validation option under Test options and setting the number of folds.
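The Explorer's cross-validation option corresponds to Evaluation.crossValidateModel in the Java API. A minimal sketch with 10 folds (the Explorer's default) is shown below; the file name and random seed are assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_failure_clinical_records_dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation: the classifier is trained and tested ten
        // times on different folds and the results are averaged
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}
```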
3.4 Feature Selection
Weka provides a way to choose the attributes that are of the utmost importance. This is not a method for training the model but a way to explore the dataset. Under the Filter section, we can pick supervised > attribute > AttributeSelection. As an example, we can choose information gain as the evaluator and Ranker as the search method. After applying the filter, we can see how the attributes are sorted by how useful they are in predicting the label.
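Programmatically, the same ranking can be obtained with the weka.attributeSelection classes. The sketch below uses InfoGainAttributeEval with a Ranker search, mirroring the settings mentioned above; the file and class names are placeholders.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_failure_clinical_records_dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Rank attributes by information gain with respect to the class
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data); // note the capitalised method name in Weka's API

        // Prints the attributes sorted by how useful they are for prediction
        System.out.println(selector.toResultsString());
    }
}
```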