randomforestclassifier sklearn string - smart1004/doc GitHub Wiki
http://scikit-learn.org/stable/modules/preprocessing.html
```python
from sklearn import preprocessing

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
```
Note that there are missing categorical values for the 2nd and 3rd features:
```python
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
# OneHotEncoder(categorical_features=None, categories=[...],
#               dtype=<... 'numpy.float64'>, handle_unknown='error',
#               n_values=None, sparse=True)
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
# array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
```
Finally, the answer to your question lies in encoding the categorical feature as multiple binary features. For example, you might encode ['red', 'green', 'blue'] with 3 columns, one for each category, holding a 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, or one-of-K encoding.
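The idea is simple enough to sketch without any library; here is a minimal plain-Python version of the one-column-per-category scheme described above (the `one_hot` helper name is just for illustration):

```python
# One column per category; a 1 marks the matching category, 0 elsewhere.
categories = ['red', 'green', 'blue']

def one_hot(value, categories):
    """Return a binary list with a 1 in the matching category's column."""
    return [1 if value == c else 0 for c in categories]

print(one_hot('green', categories))  # → [0, 1, 0]
print(one_hot('blue', categories))   # → [0, 0, 1]
```

This is essentially what `OneHotEncoder` does per feature, except it also handles fitting the category list from data and returning sparse output.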
You can check the documentation linked above for encoding categorical features, and also for feature extraction via hashing and dicts (`FeatureHasher`, `DictVectorizer`). Keep in mind that one-hot encoding expands your space requirements, and sometimes it hurts model performance as well.
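When the category space is large, the hashing trick mentioned above bounds the number of columns by hashing each category string into a fixed number of buckets. A rough plain-Python sketch (bucket count and helper name are assumptions for illustration; sklearn's `FeatureHasher` does this properly, including sign handling for collisions):

```python
import hashlib

N_BUCKETS = 8  # assumed small bucket count for illustration

def hashed_one_hot(value, n_buckets=N_BUCKETS):
    """Map a category string to one of n_buckets columns via a stable hash."""
    idx = int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16) % n_buckets
    row = [0] * n_buckets
    row[idx] = 1
    return row

row = hashed_one_hot('uses Chrome')
print(len(row), sum(row))  # → 8 1
```

The trade-off: memory is fixed regardless of how many distinct categories appear, but different categories can collide into the same bucket.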