Sunday, July 29, 2018

Pandas index and selection

pandas is a very handy module for data scientists using Python. There are some tricks worth noting here for reference. Skipping the basics of the DataFrame and Series classes, I'll focus on data selection. The most commonly used indexers are loc and iloc.

reviews.loc[[0,1,10,100],['country','province','region_1','region_2']]
The code above selects the columns 'country', 'province', 'region_1' and 'region_2' for rows 0, 1, 10 and 100 of the reviews DataFrame. loc selects by label, that is, by column name and index label (here the row labels happen to be integers).
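To make the loc call easy to try out, here is a minimal sketch on a tiny made-up DataFrame. The rows and values are hypothetical stand-ins, not the real reviews data; only the loc pattern matches the line above.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (values are made up).
reviews = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy"],
    "province": ["Sicily", "Alsace", "Oregon", "Tuscany"],
    "points": [87, 91, 88, 93],
})

# Rows with index labels 0 and 2, columns picked by name.
subset = reviews.loc[[0, 2], ["country", "province"]]
print(subset)
```

The result keeps the original row labels (0 and 2), so it can be lined up with the source DataFrame later.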

reviews.iloc[[1,2,3,5,8],:]
This line selects by integer position: rows 1, 2, 3, 5 and 8 (counting from 0) of the reviews DataFrame, with all columns.
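The difference between position and label is easier to see when the index is not the default 0, 1, 2, ... A small sketch, assuming a toy DataFrame with a string index (the name toy and its values are made up for illustration):

```python
import pandas as pd

# Toy DataFrame with string row labels instead of the default integers.
toy = pd.DataFrame({"points": [87, 91, 88]}, index=["a", "b", "c"])

by_position = toy.iloc[[0, 2], :]   # first and third rows, by position
by_label = toy.loc[["a", "c"], :]   # rows labelled 'a' and 'c'

print(by_position.equals(by_label))
```

Both selections return the same two rows here. With the default integer index the labels and positions coincide, which is why loc and iloc can look interchangeable at first.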

reviews.loc[list(range(100)),['country','variety']]
A more complex usage: select the first 100 rows (labels 0 to 99, assuming the default integer index) of the columns 'country' and 'variety'.
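A slice is a shorter way to pick a run of rows, with one catch worth remembering: a loc slice includes both endpoints, while an iloc slice excludes the end like normal Python slicing. So on a default index, loc[0:99] and iloc[0:100] both give 100 rows. A minimal sketch on a made-up five-row frame:

```python
import pandas as pd

# Hypothetical five-row frame with the default integer index.
toy = pd.DataFrame({"country": list("ABCDE"), "variety": list("vwxyz")})

with_loc = toy.loc[0:2, ["country", "variety"]]  # labels 0, 1, 2
with_iloc = toy.iloc[0:2, :]                      # positions 0, 1

print(len(with_loc), len(with_iloc))
```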

reviews.country == 'Italy'
This line produces a boolean Series that can be used for conditional selection. For example:

reviews[reviews.country =='Italy']
This line selects the reviews whose country equals 'Italy'.
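It can help to look at the boolean Series on its own before using it as a filter. A small sketch with made-up rows (the variable names mask and italian are just for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame.
reviews = pd.DataFrame({
    "country": ["Italy", "France", "Italy"],
    "points": [87, 91, 93],
})

mask = reviews.country == "Italy"   # one True/False per row
italian = reviews[mask]             # keeps only the rows where mask is True

print(mask.tolist())
print(italian)
```

The same mask also works inside loc, for example reviews.loc[mask, 'points'], when only some columns are needed.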

reviews.region_2.notnull()
notnull and isnull each produce a boolean Series marking whether each value in the column is not NaN or is NaN. Logical operations on such Series can also be used for DataFrame selection.
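As a quick illustration of the two methods, here is a sketch on a made-up frame with some missing region_2 values (the data is hypothetical; both NaN and None count as missing):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of the three region_2 values are missing.
reviews = pd.DataFrame({
    "country": ["US", "Italy", "US"],
    "region_2": ["Willamette Valley", np.nan, None],
})

has_region = reviews.region_2.notnull()
with_region = reviews[has_region]                     # rows that have a region_2
without_region = reviews[reviews.region_2.isnull()]   # rows that do not

print(has_region.tolist())
```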

ds3 = reviews[reviews.country.isin(['Italy','France']) & (reviews.points >=90)].country
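The isin method is the counterpart of Python's 'in' operator, applied to each element. Note that '&' plays the role of Python's 'and'. It may seem confusing that pandas needs a different operator for indexing, but 'and' works on the truth value of a whole object, which is ambiguous for a Series, while '&' is applied element by element. Each comparison needs its own parentheses because '&' binds more tightly than '>='.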
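A small experiment makes the '&' versus 'and' question concrete. This is a sketch on made-up rows; the intermediate names in_europe, high_score and and_failed are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame.
reviews = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy"],
    "points": [92, 88, 95, 85],
})

in_europe = reviews.country.isin(["Italy", "France"])
high_score = reviews.points >= 90

# '&' combines the two boolean Series row by row.
ds3 = reviews[in_europe & high_score].country

# 'and' asks for a single truth value of each Series, which pandas refuses.
try:
    in_europe and high_score
    and_failed = False
except ValueError:
    and_failed = True

print(ds3.tolist(), and_failed)
```

The same idea applies to '|' in place of 'or' and '~' in place of 'not'.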