List of Data Mining algorithms - SoojungHong/TextMining GitHub Wiki

Data mining is an interdisciplinary subfield of computer science concerned with discovering patterns in large data sets. It is an essential process in which intelligent methods are applied to extract patterns from data.

Given below is a list of Top Data Mining Algorithms:

1. C4.5:

C4.5, developed by Ross Quinlan, is an algorithm used to generate a classifier in the form of a decision tree. To do so, C4.5 is given a set of training data representing things that have already been classified.

C4.5, often referred to as a statistical classifier, is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification. The authors of the Weka machine learning software have described C4.5 as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date."
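A minimal sketch of training a decision tree on already-classified data is shown below. Note the assumption: scikit-learn does not ship C4.5 itself (its trees are CART-based), so `DecisionTreeClassifier` with `criterion="entropy"` is used here only as an approximation of an entropy-based tree learner; the dataset is the standard Iris sample.

```python
# Approximation only: scikit-learn implements an optimized CART, not C4.5.
# criterion="entropy" mimics the information-based splits C4.5/ID3 rely on.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)                      # learn a tree from labeled examples
print("test accuracy:", tree.score(X_test, y_test))
```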

2. k-means:

k-means clustering is a method of vector quantization that is widely popular for cluster analysis in data mining; it is closely related to (but distinct from) the nearest centroid, or Rocchio, classifier used for classification.

k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It is a well-known cluster analysis technique for exploring a dataset.
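A minimal sketch using scikit-learn's `KMeans` follows; the toy points and the choice of k=2 are illustrative only.

```python
# Group a handful of 2-D points into k=2 clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)                  # cluster index for each point
print("cluster labels:", labels)
print("cluster centroids:\n", kmeans.cluster_centers_)
```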

3. Support vector machines:

In machine learning, support vector machines (also known as support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
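A minimal sketch of fitting a maximum-margin classifier with scikit-learn's `SVC` is below; the synthetic dataset and the linear kernel are assumptions for illustration.

```python
# Fit a linear SVM (single separating hyperplane) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="linear", C=1.0)               # C trades margin width vs. errors
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```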

4. Apriori:

Apriori is an algorithm used for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the individual items that are frequent in the database and then extending them to larger itemsets, as long as those itemsets appear often enough in the database. The frequent itemsets determined by Apriori can be used to derive association rules, which highlight general trends.
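A minimal pure-Python sketch of Apriori's level-wise frequent-itemset search is shown below. The toy transactions and the `min_support` threshold are illustrative assumptions, and the candidate-pruning optimizations of the full algorithm are omitted for brevity.

```python
# Level-wise frequent-itemset mining in the spirit of Apriori.
def apriori_frequent_itemsets(transactions, min_support):
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

    # Extend frequent k-itemsets to (k+1)-itemsets while support holds.
    while frequent[-1]:
        candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                      if len(a | b) == len(a) + 1}
        frequent.append({c for c in candidates if support(c) >= min_support})
    return [s for level in frequent for s in level]

transactions = [["bread", "milk"], ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]
for itemset in apriori_frequent_itemsets(transactions, min_support=0.5):
    print(set(itemset))
```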

5. EM(Expectation-Maximization):

In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables.
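A minimal sketch is given below: scikit-learn's `GaussianMixture` fits a Gaussian mixture model with EM, where the unobserved component assignments play the role of the latent variables. The two synthetic clusters are an assumption for illustration.

```python
# Fit a 2-component Gaussian mixture via EM and inspect the learned parameters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)                                      # alternates E-steps and M-steps
print("estimated means:\n", gmm.means_)
print("mixture weights:", gmm.weights_)
```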

6. PageRank(PR):

PageRank (PR), named after Google co-founder Larry Page, is an algorithm used by Google Search to rank websites in its search engine results. PageRank was the first algorithm used by the company; it is not the only algorithm Google uses to order search results, but it is the best-known way of measuring the importance of web pages.
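A minimal sketch of PageRank via power iteration on a tiny link graph follows. The four-page graph, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, not Google's actual configuration.

```python
# Power-iteration PageRank on a small directed link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                    # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```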

7. AdaBoost:

Adaptive Boosting, or AdaBoost, formulated by Yoav Freund and Robert Schapire, is a machine learning meta-algorithm that won its authors the 2003 Gödel Prize. It can be used in combination with many other types of learning algorithms to improve their performance. AdaBoost is, however, sensitive to noisy data and outliers.
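A minimal sketch with scikit-learn's `AdaBoostClassifier` is below; by default it boosts shallow decision trees ("stumps"), and the synthetic dataset and number of estimators are illustrative assumptions.

```python
# Boost an ensemble of weak learners with AdaBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)                       # reweights examples each round
print("test accuracy:", ada.score(X_test, y_test))
```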

8. kNN:

The k-nearest neighbors algorithm (k-NN) is a type of lazy, instance-based learning and a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space, and the output depends on whether the algorithm is being used for classification or regression. k-NN is among the simplest of all machine learning algorithms.
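A minimal classification sketch with scikit-learn's `KNeighborsClassifier` follows; the Iris dataset and k=5 are illustrative choices.

```python
# Classify each query point by a majority vote of its k nearest training examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                       # "lazy": just stores the training set
print("test accuracy:", knn.score(X_test, y_test))
```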

9. Naive Bayes:

In machine learning, Naive Bayes classifiers are a family of simple, highly scalable probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.
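A minimal text-classification sketch follows, since Naive Bayes is a common baseline for text mining: bag-of-words counts fed into scikit-learn's `MultinomialNB`. The toy documents and sentiment labels are illustrative only.

```python
# Naive Bayes over bag-of-words counts for a tiny sentiment task.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["great movie, loved it", "terrible film, waste of time",
        "loved the plot", "waste of money, terrible"]
labels = ["pos", "neg", "pos", "neg"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)              # word-count feature matrix

nb = MultinomialNB()
nb.fit(X, labels)
print(nb.predict(vectorizer.transform(["what a great plot"])))
```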

10. CART:

CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees, and, like C4.5, CART is a classifier.

Many of the reasons a user would choose C4.5 also apply to CART, since both are decision tree learning techniques; features like ease of interpretation and explanation apply to CART as well.
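A minimal sketch is shown below: scikit-learn's decision trees are CART-based, so `DecisionTreeClassifier` and `DecisionTreeRegressor` illustrate the two kinds of output trees. The tiny XOR-style dataset and the sine-curve regression target are illustrative assumptions.

```python
# CART-style trees in scikit-learn: one classification tree, one regression tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree on a tiny labeled dataset.
Xc = [[0, 0], [1, 1], [0, 1], [1, 0]]
yc = [0, 1, 1, 0]
clf = DecisionTreeClassifier(random_state=0).fit(Xc, yc)
print("class for [1, 1]:", clf.predict([[1, 1]]))

# Regression tree approximating a sine curve.
Xr = np.arange(0, 10, 0.5).reshape(-1, 1)
yr = np.sin(Xr).ravel()
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xr, yr)
print("prediction at x=2.0:", reg.predict([[2.0]]))
```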