#Decision Tree#

Decision trees are among the most commonly used data mining algorithms, and expert systems often rely on them.
##Information Gain##

First, two key terms:

- information gain
- entropy

How do we measure the disorder of a data set?
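Shannon entropy is the standard answer. For a data set whose instances fall into $n$ classes, with $p(x_i)$ the fraction of instances labeled $x_i$:

$$H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$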
Computing the Shannon entropy of a given data set:
```python
from math import log

def calcShannonEnt(dataSet):
    # count the total number of instances in the data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
```
Change the working directory and add it to the system path:

```python
>>> import os
>>> os.chdir("/home/juedaiyuer/mycode/researchNote/machinelearning/Ch03")
>>> import sys
>>> sys.path.append("/home/juedaiyuer/mycode/researchNote/machinelearning/Ch03")
```
Testing:

```python
>>> import trees
>>> myDat,labels=trees.createDataSet()
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> trees.calcShannonEnt(myDat)
0.9709505944546686
```
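As a quick sanity check: two of the five labels are 'yes' and three are 'no', so

$$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.9710$$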
The higher the entropy, the more mixed the data. Adding a third class to the data set raises the entropy:

```python
>>> myDat[0][-1]='maybe'
>>> myDat
[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> trees.calcShannonEnt(myDat)
1.3709505944546687
```
Another way to measure a set's disorder is the Gini impurity: the probability of misclassifying a randomly chosen item if it were labeled at random according to the class distribution. A sketch follows.
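The book does not implement Gini impurity here; a minimal sketch, assuming the same data layout as calcShannonEnt (class label in the last column). `calcGini` is a hypothetical name, not part of trees.py:

```python
def calcGini(dataSet):
    # Gini impurity: 1 - sum of squared class probabilities.
    # Hypothetical companion to calcShannonEnt; assumes the class
    # label is the last element of each instance.
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    gini = 1.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        gini -= prob * prob
    return gini
```

For the original two-class data set this gives $1 - (0.4^2 + 0.6^2) = 0.48$; like entropy, it is 0 for a pure set and grows as the classes mix.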
##Splitting the Data Set##

The ID3 algorithm splits the data set on the feature with the highest information gain. First, a helper that splits the data set on a given feature:
```python
def splitDataSet(dataSet, axis, value):
    '''
    dataSet -- the data set to split
    axis    -- the feature (column index) to split on
    value   -- the feature value to match
    '''
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
The difference between extend() and append():

```python
>>> a=[1,2,3]
>>> b=[4,5,6]
>>> a.append(b)
>>> a
[1, 2, 3, [4, 5, 6]]
>>> a=[1,2,3]
>>> a.extend(b)
>>> a
[1, 2, 3, 4, 5, 6]
```
Testing splitDataSet():

```python
>>> reload(trees)
<module 'trees' from 'trees.pyc'>
>>> myDat,labels=trees.createDataSet()
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> trees.splitDataSet(myDat,0,1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> trees.splitDataSet(myDat,0,0)
[[1, 'no'], [1, 'no']]
```
Choosing the best feature to split on. The function below expects the data to meet two requirements:

1. The data must be a list of lists, and every inner list must have the same length.
2. The last element of each instance is that instance's class label.
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:  # iterate over the unique values of the current feature
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain = reduction in entropy
        if infoGain > bestInfoGain:  # keep the best gain seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # returns the index of the best feature
```
Testing:

```python
>>> reload(trees)
>>> myDat,labels=trees.createDataSet()
>>> trees.chooseBestFeatureToSplit(myDat)
0
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
```
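To see why feature 0 wins, work out both gains by hand. Splitting on feature 0 yields subsets of sizes 3 (labels yes, yes, no; entropy 0.9183) and 2 (labels no, no; entropy 0):

$$G_0 = 0.9710 - \tfrac{3}{5}(0.9183) - \tfrac{2}{5}(0) \approx 0.4200$$

Splitting on feature 1 yields subsets of sizes 4 (yes, yes, no, no; entropy 1.0) and 1 (no; entropy 0):

$$G_1 = 0.9710 - \tfrac{4}{5}(1.0) - \tfrac{1}{5}(0) \approx 0.1710$$

Feature 0 gives the larger reduction in entropy, so chooseBestFeatureToSplit returns 0.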
##Recursively Building the Tree##
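This section is still a stub. Following the structure of Chapter 3 of Machine Learning in Action, the recursion combines splitDataSet and chooseBestFeatureToSplit, stopping when a branch is pure or when the features run out (falling back to a majority vote). A sketch along those lines; details may differ from the book's exact listing:

```python
import operator

def majorityCnt(classList):
    # return the most common class label; used when no features remain
    # but the remaining instances still carry mixed labels
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # all labels identical: leaf node
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    for value in set(featValues):
        subLabels = labels[:]  # copy so recursive calls don't mutate our list
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```

The tree comes back as nested dictionaries, keyed first by the chosen feature's label and then by that feature's values; this is the representation the book later uses for plotting and classification.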
##source##

- Machine Learning in Action (机器学习实战), Chapter 3