Dev classification guide - achakra/seck GitHub Wiki

Classification, also referred to as categorization, is the automated process of applying labels to data: documents/web pages, images, emails, etc. Classification techniques can be supervised or unsupervised. In supervised learning, the classification categories are predefined and the classifier learns from a collection of items that are already labeled, known as the training set.

Unsupervised learning is the process of grouping documents by their similarities without predetermined classes, and each document may belong to more than one group. It is also known as clustering, which is not in the scope of this class.

Classification is done by identifying a number of important features that help us label the documents. Given a predetermined set of classes, each document is labeled with the class it has the highest probability of belonging to.
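The highest-probability rule described above can be sketched as follows. This is only an illustration: the function name `pick_label` and the probability values are hypothetical, and how the per-class probabilities are computed depends on the algorithm chosen.

```python
def pick_label(class_probabilities):
    """Return (label, probability) for the most probable class.

    class_probabilities maps each predetermined class name to the
    estimated probability that the document belongs to that class.
    """
    if not class_probabilities:
        raise ValueError("at least one class is required")
    # max() with key=dict.get selects the key with the largest value.
    label = max(class_probabilities, key=class_probabilities.get)
    return label, class_probabilities[label]


# Hypothetical probabilities for a single email (binary scheme).
scores = {"spam": 0.83, "non-spam": 0.17}
label, probability = pick_label(scores)
print(label, probability)
```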

Classification has different labeling/categorizing schemes, also called ontologies:
  • Binary (e.g., spam or non-spam).
  • Multivalued (e.g., categorizing documents by language).
  • Hierarchical (e.g., each category is split into subcategories).
There are different classification algorithms:
(courtesy of http://www.comp.dit.ie/btierney/oracle11gdoc/datamine.111/b28129/classify.htm#CHDHBEGJ)
  • Decision Tree: automatically generates rules, which are conditional statements that reveal the logic used to build the tree.
  • Naïve Bayes: uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data.
  • Generalized Linear Models (GLM): a statistical technique for linear modeling.
  • Support Vector Machines (SVM): a powerful algorithm based on linear and nonlinear regression.
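To make the frequency-counting idea behind Naïve Bayes more concrete, here is a minimal sketch of a word-count (multinomial) variant with add-one smoothing. It is not the SECK implementation; the function names `train` and `classify` and the tiny training set are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count class and word frequencies from (words, label) pairs."""
    class_counts = Counter()            # documents seen per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, model):
    """Return the class with the highest (log) probability for the words."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Start from the log prior: log P(class).
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set for the binary spam / non-spam scheme.
training_set = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "agenda", "today"], "non-spam"),
    (["project", "meeting", "notes"], "non-spam"),
]
model = train(training_set)
print(classify(["money", "offer"], model))
print(classify(["meeting", "notes"], model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. A library implementation would normally be preferable to hand-rolled code like this.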

Which of the above algorithms are used in the scope of SECK will depend on which available libraries provide implementations of them.
