
randomforestclassifier sklearn string
http://scikit-learn.org/stable/modules/preprocessing.html

```python
from sklearn import preprocessing

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

# Passing the categories explicitly means the encoder does not have to
# infer them from the training data.
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
```

Note that there are missing categorical values for the 2nd and 3rd features in the training data below; the encoder still knows every category because they were passed explicitly via `categories`.

```python
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
# OneHotEncoder(categorical_features=None, categories=[...],
#        dtype=<... 'numpy.float64'>, handle_unknown='error',
#        n_values=None, sparse=True)

# 2 gender + 4 location + 4 browser columns = 10 binary columns
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
# array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
```
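Since this page is about RandomForestClassifier, here is a minimal sketch (not from the scikit-learn docs) of wiring the encoder and the forest into a single pipeline. The labels `y` and the hyperparameter values are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
y = [0, 1]  # hypothetical class labels

# handle_unknown='ignore' encodes categories unseen during fit as
# all-zeros instead of raising an error at predict time.
clf = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X, y)
print(clf.predict([['female', 'from Asia', 'uses Chrome']]))
```

RandomForestClassifier accepts the sparse matrix that OneHotEncoder produces, so no densifying step is needed in between.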


https://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest

Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example, you might code ['red', 'green', 'blue'] with 3 columns, one for each category, having 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, or one-of-k encoding.
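As a quick illustration of that paragraph, a sketch using scikit-learn's OneHotEncoder on the ['red', 'green', 'blue'] example (the sample rows are made up; note that the encoder orders its columns alphabetically: blue, green, red):

```python
from sklearn.preprocessing import OneHotEncoder

colors = [['red'], ['green'], ['blue'], ['green']]
enc = OneHotEncoder()

# One binary column per category; exactly one 1 per row.
print(enc.fit_transform(colors).toarray())
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```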

You can check the scikit-learn documentation linked above for encoding categorical features, and the feature-extraction docs for the hashing and dict-based alternatives. Obviously, one-hot encoding will expand your space requirements, and sometimes it hurts performance as well.
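The "hashing and dicts" remark maps to scikit-learn's DictVectorizer and FeatureHasher in sklearn.feature_extraction. Below is a hedged sketch; the example rows and n_features=8 are arbitrary choices for illustration.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

rows = [{'gender': 'male', 'location': 'from US'},
        {'gender': 'female', 'location': 'from Europe'}]

# DictVectorizer one-hot encodes string values as 'feature=value' columns.
vec = DictVectorizer(sparse=False)
print(vec.fit_transform(rows))
print(vec.feature_names_)  # e.g. ['gender=female', 'gender=male', ...]

# FeatureHasher bounds memory with a fixed number of hashed columns,
# at the cost of possible collisions and no inverse mapping.
hasher = FeatureHasher(n_features=8, input_type='dict')
print(hasher.transform(rows).toarray())
```

The hashing trick trades interpretability for a fixed-size feature space, which matters when the number of distinct categories is large.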