
Sunday, July 29, 2018

Pandas indexing and selection

pandas is a very handy module for data scientists using Python. There are some tricks worth noting here for reference. Skipping the basic DataFrame and Series classes, I'll focus on data selection. The most commonly used methods for indexing are loc and iloc.

reviews.loc[[0,1,10,100],['country','province','region_1','region_2']]
The code above selects the columns 'country', 'province', 'region_1', and 'region_2' of rows 0, 1, 10, and 100 from the reviews DataFrame. loc is used for label-based selection with string column names or index names.

reviews.iloc[[1,2,3,5,8],:]
This line selects rows 1, 2, 3, 5, and 8 from the reviews DataFrame by numeric position, which is what iloc is for.

reviews.loc[[x for x in range(101)],['country','variety']]
A more complex usage: select rows 0 through 100 of the columns 'country' and 'variety'.
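Note that, unlike iloc and plain Python slicing, loc slicing includes both endpoints, so the same selection can be written more simply as:

reviews.loc[0:100, ['country','variety']]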

reviews.country == 'Italy'
This line of code produces a boolean Series, which can be used for conditional selection. For example:

reviews[reviews.country =='Italy']
This line selects the reviews whose country equals 'Italy'.

reviews.region_2.notnull()
notnull and isnull produce boolean Series that indicate, element by element, whether the column's values are not NaN or are NaN. These logical results can likewise be used for DataFrame selection.
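For example, the boolean Series above can be passed straight into the indexing operator to keep only the rows whose region_2 is present:

# keep only rows where region_2 is not NaN
reviews[reviews.region_2.notnull()]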

ds3 = reviews[reviews.country.isin(['Italy','France']) & (reviews.points >=90)].country
The isin method corresponds to Python's in operation; also notice that '&' stands in for Python's and. Why does pandas introduce separate operators for indexing? Because Python's and and or cannot be overloaded (they always collapse their operands to a single truth value), pandas overloads the bitwise operators &, | and ~ to work element-wise on boolean Series instead.
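As a quick sketch on the same reviews DataFrame, the element-wise operators compose like this; the parentheses around each comparison are required because & and | bind more tightly than ==:

# '&' = element-wise AND, '|' = element-wise OR, '~' = element-wise NOT
reviews[(reviews.country == 'Italy') | (reviews.country == 'France')]
reviews[~(reviews.country == 'Italy')]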

Friday, June 15, 2018

Feature optimization for machine learning: One-Hot Encoder

Recently I began studying MLCC (Google's Machine Learning Crash Course).
Beyond the complex mathematics underneath all kinds of optimizers, over 80% of working time is spent on collecting/processing/cleaning data and on defining useful features to feed to the optimizer.
A linear regressor requires numeric features, so for data columns that contain characters/strings (categorical data), we can use the so-called "one-hot encoding" method to convert them. scikit-learn offers modules that make this easy:
LabelEncoder
OneHotEncoder


from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
input_data = ['happy','sad','cry','happy','blank','blank','sad','cry','happy','sad']
#need to convert to array structure
values = array(input_data)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

Output:
[2 3 1 2 0 0 3 1 2 3]

The output is the integer-transformed version of the input list, but it is not yet one-hot encoded. You still need to use OneHotEncoder to turn the integer list into one-hot format with the code below.


# binary encode (note: scikit-learn >= 1.2 renamed this argument to sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
# reshape the integer array from shape (n,) to (n, 1), since
# OneHotEncoder expects a 2-D array of (samples, features)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)  
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

Output:
[[0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
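
To close the loop (and to put the argmax imported above to use), the encoding can also be inverted: argmax recovers the integer label from a one-hot row, and label_encoder.inverse_transform maps that integer back to the original string. A minimal sketch:

# decode the first one-hot row back to its original string label
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)

Output:
['happy']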

This discrete-to-numeric feature transformation is frequently used in ML. Noted here for reference.
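
As a side note, pandas also offers a one-step alternative, pd.get_dummies, which is convenient when you do not need to keep the fitted encoder objects around for transforming new data later:

import pandas as pd
# one-hot encode directly from the raw string list; columns are named after the labels
pd.get_dummies(pd.Series(input_data))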