Friday, June 15, 2018

Feature optimization for machine learning - One-Hot Encoder

I recently began studying Google's MLCC (Machine Learning Crash Course).
Beyond the complex mathematics underneath all kinds of optimizers, over 80% of the work time is spent on collecting/processing/cleaning data and defining useful features to feed to the optimizer.
A linear regressor requires numeric features, so for data columns that contain characters/strings (categorical data), we can use the so-called "one-hot encoding" method to convert them. scikit-learn offers modules that make this easy:
LabelEncoder
OneHotEncoder


from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
input_data = ['happy','sad','cry','happy','blank','blank','sad','cry','happy','sad']
#need to convert to array structure
values = array(input_data)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

Output:
[2 3 1 2 0 0 3 1 2 3]
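
The integers are assigned in alphabetical order of the unique labels (blank=0, cry=1, happy=2, sad=3). If you want to confirm the mapping or get the strings back, the fitted encoder exposes classes_ and inverse_transform; a quick optional check, reusing the same label_encoder object from above, could look like this:

# classes_ lists the unique labels; each label's index is its assigned integer
print(label_encoder.classes_)   # ['blank' 'cry' 'happy' 'sad']
# inverse_transform maps the integers back to the original strings
print(label_encoder.inverse_transform(integer_encoded))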

The output is the integer-encoded version of the input list, but it is still not a one-hot list.
You still need to use OneHotEncoder to convert the integer list into one-hot format with the code below.


# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
# reshape the integer array from shape (n,) to (n, 1) because
# OneHotEncoder expects a 2D array of (samples, features)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)  
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

Output:
[[0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
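
Each row contains a single 1, and its column index is exactly the integer assigned by LabelEncoder (column 0 = blank, 1 = cry, 2 = happy, 3 = sad). This is also where the argmax import above comes in handy; a minimal sketch of decoding the one-hot matrix back to strings, reusing the objects defined above:

# argmax finds the column index of the 1 in each row,
# and the label encoder maps those integers back to the original labels
decoded = label_encoder.inverse_transform(argmax(onehot_encoded, axis=1))
print(decoded)  # should recover the original input_data labels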

This categorical-to-numeric feature transformation is used frequently in ML, so I am noting it here.
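
As a side note, if the data already lives in a pandas DataFrame, a similar one-hot table can be produced in a single step with pandas.get_dummies. A small sketch, assuming pandas is installed (not part of the original example):

import pandas as pd

# get_dummies creates one 0/1 column per unique label
onehot_df = pd.get_dummies(pd.Series(input_data))
print(onehot_df)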