Decision Tree in R: Metaprotein (clumsyspeedboat/Decision-Tree-Neo4j GitHub Wiki)

R code

Python code


Decision Tree

We implemented the following decision tree algorithms to compare their accuracy:

  • CART
  • C4.5
  • C5.0

Metaprotein Dataset

In the raw dataset, metaproteins are the rows. The columns hold patient samples of three types, each tested for the presence of the metaproteins, along with the metaprotein demographic columns.

To suit our decision tree models, we removed the demographic columns from the dataset and transposed the data frame so that metaproteins become the columns (variables) and patients become the rows (instances).

We created a class label, "Patient type", which is a factor with three levels: C, UC, and CD.
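The reshaping described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the wiki's R code: the column names, the sample names such as `C_1`, and the rule that the patient type is the prefix before the underscore are all assumptions made for the example.

```python
# Hypothetical miniature of the raw layout: one row per metaprotein,
# an annotation ("demographic") column, then one column per patient sample.
header = ["metaprotein", "description", "C_1", "UC_1", "CD_1"]
rows = [
    ["MP1", "annotation a", 5.0, 0.0, 2.0],
    ["MP2", "annotation b", 0.0, 3.0, 1.0],
]
demographic_cols = {"description"}  # assumed names of the columns to drop


def transpose_to_patient_rows(header, rows, id_col, drop_cols):
    """Drop annotation columns and turn metaproteins into columns."""
    id_idx = header.index(id_col)
    sample_idx = [
        i for i, name in enumerate(header)
        if name != id_col and name not in drop_cols
    ]
    new_header = ["patient"] + [row[id_idx] for row in rows]
    patient_rows = [[header[i]] + [row[i] for row in rows] for i in sample_idx]
    return new_header, patient_rows


def patient_type(sample_name):
    """Assumed convention: the class label is the prefix before '_'."""
    return sample_name.split("_")[0]


new_header, patient_rows = transpose_to_patient_rows(
    header, rows, "metaprotein", demographic_cols
)
# Append the class label as the last column of every patient row.
new_header.append("Patient type")
for row in patient_rows:
    row.append(patient_type(row[0]))
```

After this step each row is one patient, each metaprotein is one variable, and the last column carries the class label.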

After the removal of demographic information and the addition of the class label, the metaprotein dataset has 2970 variables and 48 instances. The vast majority of these variables contain only zeros, which hurts the accuracy of the decision tree algorithms. We therefore selected the 50 most abundant variables in the dataset and built our models with them.
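The page does not say how "most abundant" was measured. One plausible reading, sketched here in plain Python as an assumption, ranks each variable by its column sum across all patients and keeps the top `k`. The function name `top_k_abundant` and the toy values are hypothetical.

```python
def top_k_abundant(header, rows, k):
    """Keep the k columns with the largest total abundance (column sum)."""
    totals = [sum(row[j] for row in rows) for j in range(len(header))]
    # Rank column indices by total, largest first; ties keep original order.
    ranked = sorted(range(len(header)), key=lambda j: -totals[j])
    keep = sorted(ranked[:k])  # restore the original column order
    kept_header = [header[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_header, kept_rows


# Toy data: three patients, four metaprotein variables (values invented).
header = ["MP1", "MP2", "MP3", "MP4"]
rows = [
    [5.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
    [2.0, 0.0, 1.0, 0.0],
]
top_header, top_rows = top_k_abundant(header, rows, k=2)
```

With `k=50` on the full table, the all-zero variables fall to the bottom of the ranking and are dropped.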

Metaprotein Dataset with 50 most abundant variables

We used half of the metaprotein dataset as the training dataset and the other half as the testing dataset.
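A half-and-half split of 48 instances gives 24 for training and 24 for testing. The sketch below shows one way to draw such a split reproducibly. It assumes a simple seeded random shuffle, since the page does not state whether the split was random, stratified by patient type, or fixed by row order.

```python
import random


def split_half(rows, seed=0):
    """Shuffle row indices with a fixed seed and split them in half."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    train = [rows[i] for i in indices[:half]]
    test = [rows[i] for i in indices[half:]]
    return train, test


# Stand-in for the 48 patient rows; integers are used as placeholders.
rows = list(range(48))
train_rows, test_rows = split_half(rows)
```

With only 24 test instances, each prediction moves the accuracy by roughly four percentage points, so small differences between models should be read with caution.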


CART - Classification & Regression Trees

  • depth = 2
  • leaf nodes = 2
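CART usually scores candidate splits by Gini impurity, and a tree with two leaf nodes consists of a single such split. The plain-Python sketch below shows that calculation on invented values. Whether the wiki's run used the Gini criterion or another setting is not stated, so treat this as an illustration of the standard criterion only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Weighted Gini impurity of the two children of `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Invented abundances of one metaprotein for six patients.
values = [0.0, 0.0, 1.5, 2.0, 3.0, 4.0]
labels = ["C", "C", "UC", "UC", "CD", "CD"]

parent_impurity = gini(labels)
child_impurity = split_gini(values, labels, threshold=0.0)
```

A split is preferred when it lowers the weighted impurity the most. Note that a tree with only two leaves can predict at most two of the three patient types.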

CART_Metaprotein

Confusion Matrix: Prediction on Test Dataset

Accuracy = 66.667 %
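Accuracy is the share of test instances on the diagonal of the confusion matrix. The sketch below computes it in plain Python. The matrix is hypothetical, with counts chosen only so that they total 24 test instances. It is not the matrix shown in the wiki image.

```python
def accuracy_percent(confusion):
    """Percentage of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Hypothetical counts (rows: actual C, UC, CD; columns: predicted C, UC, CD).
hypothetical_matrix = [
    [6, 1, 1],
    [2, 5, 1],
    [1, 2, 5],
]
accuracy = accuracy_percent(hypothetical_matrix)
```

If the test set holds 24 instances, an accuracy of 66.667 % corresponds to 16 correct predictions, and 70.8 % corresponds to 17.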


Decision Tree (C4.5)

  • depth = 3
  • leaf nodes = 3
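C4.5 differs from CART mainly in its split criterion: it ranks splits by gain ratio, which is the information gain divided by the split information. The sketch below computes it for a binary threshold split on invented values. It illustrates the criterion only and does not reproduce the wiki's R run.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(labels).values()
    )


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    children = weights[0] * entropy(left) + weights[1] * entropy(right)
    info_gain = entropy(labels) - children
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Invented abundances of one metaprotein for six patients.
values = [0.0, 0.0, 1.5, 2.0, 3.0, 4.0]
labels = ["C", "C", "UC", "UC", "CD", "CD"]
ratio = gain_ratio(values, labels, threshold=0.0)
```

Dividing by the split information penalizes splits that shatter the data into many small branches, which plain information gain tends to favor.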

C4 5_Metaprotein

Confusion Matrix: Prediction on Test Dataset

Accuracy = 70.8 %


Decision Tree (C5.0)

  • depth = 3
  • leaf nodes = 3

C4 5_Metaprotein

Confusion Matrix: Prediction on Test Dataset

Accuracy = 70.8 %