• Ekta Aggarwal

Pandas Demystified!

Updated: Feb 14

Pandas is one of the most powerful Python libraries which make it a programmer's choice! It is built over Numpy and matplotlib and is used for data manipulations.


In this tutorial we will learn with plethora of examples to understand how data can be wrangled!


Key topics covered:


Reading a dataset:

In this tutorial we will make use of following CSV file:

Employee_info
.csv
Download CSV • 1KB




Let us read our file using pandas' read_csv function. Do specify the file path where your file is located:

import pandas as pd
mydata = pd.read_csv("C:\\Users\\Employee_info.csv")


Viewing only first or last N rows.


head( ) function

In Python head( ) function is used to view first N rows. By default head( ) returns first 5 rows.

mydata.head()

We can modify the number of rows in head( ) . In the following code, I have specified that I want to view first 2 rows only.

mydata.head(2)

tail( ) function

tail( ) function is used to view last N rows. By default tail( ) returns last 5 rows.

mydata.tail()

In the following code, we have specified that we want to view last 2 rows only.

mydata.tail(2)

What is the shape of my data?


Shape of a data defines the number of rows and columns.

Using shape function we can get the shape of our mydata/

mydata.shape

Output : (45, 5)

Shape function returns 2 values: where first value (value at index 0 ) is the number of rows, while second value (value at index 1) denotes the number of columns.


Following code returns the number of rows.

mydata.shape[0]

Output: 45

By specifying [1] with shape, we can get the number of columns.

mydata.shape[1]

Output: 5



Knowing our data's structure


We can get the column names using columns function as follows:

mydata.columns

Output:

Index(['Employee Name', 'Gender', 'Department', 'Annual Salary ', 'Performance Rating'], dtype='object')


Row names are referred as index in Python. Using index function one can get the row names of the data.

mydata.index

Output:

RangeIndex(start=0, stop=45, step=1)

*Range (0,45) means a counting from 0 to 44 (45 is excluded)


dtypes function

We can understand the type of our data using dtypes function

mydata.dtypes

In the following output 'object' means character;

int64 means integer

and float64 means float (decimal)values

are there in the corresponding column.


Output:

Employee Name object Gender int64 Department object Annual Salary int64 Performance Rating float64 dtype: object


astype function

We can change the type of our columns using astype function.

In the following code we are converting annual salary to a float.

 mydata['Annual Salary '] = mydata['Annual Salary '].astype(float)
mydata['Annual Salary '].dtypes

Output: dtype('float64')



Unique


unique function

To get the unique values in a column we can use unique function

Task: Get the unique department names in mydata.

mydata.Department.unique()

Output: array(['Consulting', 'Technology', 'Outsourcing'], dtype=object)


nunique function

To count the number of unique values we have nunique function

Task: Count the unique department names in mydata.

mydata.Department.nunique()

Output: 3


Getting a frequency distribution:

value_counts


Using value_counts one can get the frequency distribution of a variable.

For eg. In mydata for department columns: Consulting is appearing 17 times, Technology 17 times and outsourcing 11 times.

mydata.Department.value_counts()

Output:

Consulting 17 Technology 17 Outsourcing 11 Name: Department, dtype: int64


We can sort the data on the basis of frequency in ascending order by specifying ascending = True.

mydata.Department.value_counts(ascending=True)

Output:

Outsourcing 11 Technology 17 Consulting 17 Name: Department, dtype: int64


crosstab


Using crosstab one can get the frequency distribution of 2 variables.

pd.crosstab(mydata.Department,mydata.Gender)







Selecting only numeric or character columns


_get_numeric_data

To get only numeric columns in our data we can use _get_numeric_data( ) function

numeric_only = mydata._get_numeric_data()
numeric_only.head()

Alternatively

select_dtypes :

By specifying exclude = 'object' (i.e. excluding character columns) we can get the numeric columns.

numeric_only = mydata.select_dtypes(exclude='object')
numeric_only.head()

By specifying include = 'object' we can get the character columns.

character_only = mydata.select_dtypes(include='object')
character_only.head()

To get numeric columns we can select multiple dtypes by providing the list as follows:

import numpy as np
numeric_only = mydata.select_dtypes(include=['int','float',np.number])
numeric_only.head()

Dropping rows or columns using drop( )


Using drop function we can drop rows and columns from our data.

Task: Drop department columns from our data.


mydata.drop('Department',axis = 1).head()

Alternatively, you can specify axis = 'columns' to delete the column.

mydata.drop('Department',axis = "columns").head()

Dropping multiple columns:


To drop multiple columns we specify the list of columns to be deleted:

Task: Drop the columns department and gender from mydata

mydata.drop(['Department','Gender'],axis = "columns").head()







Dropping rows:

To drop rows from our data we specify axis = 0 or axis = "index"


Task: Drop first 10 rows from mydata

mydata.drop(range(10),axis = "index").head()

Alternatively,

mydata.drop(range(10),axis = 0).head()

Renaming columns in a dataframe

Let us create a copy of our dataset as mydata_copy.

mydata_copy= mydata.copy()

Renaming all the columns


We can fetch the column names as follows:

mydata_copy.columns

Output:

Index(['Emp Name', 'Sex', 'Dept', 'Salary', 'Rating'], dtype='object')


To rename all the columns we can assign a list of new column names as follows:

Note: The column names must be in exact order as the previous column names

mydata_copy.columns  = ['Name','Gender','Dept','Salary','Rating']
mydata_copy.head(2)






Renaming only some of the columns


To rename only some of the columns we use rename( ) function. In rename( ) we define columns = {our dictionary of key-values}

Keys are Old column names

Values are New column names


Let us rename columns Name and Gender to Emp Name and Sex respectively.

mydata_copy.rename(columns={"Name" : "Emp Name","Gender" : "Sex"})

By default inplace = False which means the changes won't be made in original dataset. Thus we need to save the new data.

To make changes in original data we set inplace = True.

mydata_copy.rename(columns={"Name" : "Emp Name","Gender" : "Sex"},inplace = True)
mydata_copy.columns

Output

Index(['Emp Name', 'Sex', 'Dept', 'Salary', 'Rating'], dtype='object')


Creating new columns in a dataframe


Let us create a copy of our dataset as mydata_copy and rename our columns.

mydata_copy= mydata.copy()
mydata_copy.columns  = ['Name','Gender','Dept','Salary','Rating']

Let us create new column Monthly_Salary = Salary/12


Using square brackets [ ]:


To create a new colujmn we need to define new column name in [ ] square brackets. and define our formula as follows:

mydata_copy['Monthly_Salary'] = mydata_copy['Salary']/12

Note: We cannot create new columns by defining data_name.new_column_name. We MUST use [ ] square brackets to create new columns.

Following code will lead to an error.

mydata_copy.Monthly_Salary_exp= mydata_copy['Salary']/12

Using eval function:


In the eval function we can specify our column names directly without mentioning the name of our data repeatedly.

In the LHS, we have used eval( ) function i.e. we will not need to write mydata_copy['column name'] .

mydata_copy['Monthly_Salary2']  = mydata_copy.eval('Salary/12')

Using assign function:


If we have to create multiple columns then we can use assign function as follows:

Syntax:

new_data= old_data.assign( new_col_1= calculation_for_new_col1, new_col_2 = calculation_for_new_col2)

mydata_copy2  = mydata_copy.assign(Monthly_Salary3  = mydata_copy.Salary/12)
mydata_copy2.head(3)

Filtering data


Filtering with loc


To filter the data with loc we need to provide the column or index names!!!!!

Basic sytax for filtering:

For filtering the data we need to provide 2 positions separated by a comma inside a comma i.e .[ a,b] where a denotes the row names or Boolean vector for rows, while b is the column names or Boolean vector for columns. To select all the rows or all the columns we denote it by a colon ( : )


Task: Filter the data for all the rows and 2 columns: Employee name and gender.

Since we are selecting all the rows thus we have written a colon ( :) instead of 'a'.

To select multiple columns we have providing our list ['Employee Name','Gender'] in place of 'b'

mydata.loc[:,['Employee Name','Gender']]

Task: Filter the data for first 5 rows and 2 columns: Employee name and gender.


Since we are selecting first 5 rows only thus we have defined our range 0:5 in place of 'a' which means start from row 0 till row 4 (5 is excluded)

mydata.loc[0:5,['Employee Name','Gender']]

Alternatively

range(5) means 0:5 (i.e. 5 is excluded)

mydata.loc[range(5),['Employee Name','Gender']]








Task: Filter the data for first 5 rows and all the columns

mydata.loc[range(5),:]

Filtering for consecutive columns:


To select consecutive columns we can write columnName1 : columnNameN by which all columns starting from columnName1 till columnNameN will be selected.


Task: Filter first 5 rows and columns from employee name till annual salary.

Here we have written "Employee Name" : "Annual Salary " to retrieve all the column between them (including both of them)

mydata.loc[range(5),"Employee Name" : "Annual Salary "]


Filtering using Boolean vectors.


Task: Filter the data for the rows where gender is 1.

In the below code mydata.Gender == 1 returns a Boolean vector which takes the value True when gender is 1, otherwise False.

Wherever our Boolean vector has True, only those rows will be selected.

mydata.loc[mydata.Gender == 1,:]

Filtering for multiple conditions


AND condition - when multiple condition need to be true

Task: Filter the data for the rows where gender is 1 and department is consulting.

In the below code (mydata.Gender == 1) & (mydata.Department == 'Consulting') returns a Boolean vector which takes the value True when both gender is 1 and Department is consulting, otherwise False.

mydata.loc[(mydata.Gender == 1) & (mydata.Department == 'Consulting'),:]

Task: In the above query select only employee name and salary

mydata.loc[(mydata.Gender == 1) & (mydata.Department == 'Consulting'),['Employee Name','Annual Salary ']]















OR condition - when at least one condition needs to be true

Task: Filter the data for the rows where either gender is 1 or Salary is more than 110000.


Since we need to apply an 'or' condition i.e. at least one of the conditions should be true thus in Python we use '|' to denote 'or'

mydata.loc[(mydata.Gender == 1) | (mydata['Annual Salary '] > 110000),['Employee Name','Annual Salary ']]

IN CONDITION

Task: Filter for rows where department is either consulting or outsourcing.

We can achieve this using or '|' condition as follows:

mydata.loc[(mydata.Department == 'Outsourcing') | (mydata.Department == 'Consulting'),:]

But suppose we have to choose for multiple values (say 10 departments) then writing OR condition would be too tedious.

To simplify it we have .isin( ) function in Python. Inside isin( ) we specify the list of values which need to be searched.


In the following code we have written mydata.Department.isin(['Outsourcing','Consulting']) i.e. this will return a Boolean vector where department name is either outsourcing or consulting.

mydata.loc[mydata.Department.isin(['Outsourcing','Consulting']),:]

Excluding the rows satisfying a particular condition


Task: Filter for rows where department is neither consulting nor outsourcing.

It means we do not want the department names to be consulting and outsourcing. To do this, we can firstly create a vector using .isin( ) which will return True where department is consulting or outsourcing, and then specify to exclude such rows i.e. provide negation.

To provide a negation we write a (~) symbol before the Boolean vector.

mydata.loc[~mydata.Department.isin(['Outsourcing','Consulting']),:]


pd.query


We can filter our rows using pd.query( ) function

Task: Filter for rows where Gender is 1.

mydata.query('Gender ==1')

Filtering with iloc


To filter the data with iloc we need to provide the column or index names!!!!!


Task: Filter for first 5 rows. (Since our index names are 0:44, thus iloc and loc lead to same output)

mydata.iloc[range(5)]

Task: Filter for first 5 rows and first 3 columns.

To filter for first 3 columns we have defined 0:3 after comma.

mydata.iloc[0:5,0:3]









Obtaining indices using np.where


Task: Filter for rows where department is consulting.

Numpy's where function returns the indices wherever a particular condition is found True. We can then use those indices in iloc to filter the rows.

import numpy as np
np.where(mydata['Department'] == 'Consulting')

Output:

(array([ 0, 2, 3, 7, 8, 10, 17, 19, 21, 25, 28, 31, 32, 33, 34, 36, 41], dtype=int64),)

mydata.iloc[np.where(mydata['Department'] == 'Consulting')]

Managing index in a dataframe.


Setting the index in an already existing dataset


Using set_index( ) we can set our index in already existing dataset.

data1  = mydata.set_index(["Employee Name","Gender"]).sort_index()
data1.head(10)

Resetting the index


We can also reset the index i.e. make them columns in our data using reset_index( ) function.

By default inplace = False, which means to save the changes we need to create a new dataset.

To make the changes in the original dataset we mention inplace = True

data1.reset_index(inplace = True)
data1.head()

To learn more about how to deal with index for filtering them refer to this tutorial:

Setting and filtering index in Python.



Sorting the data


We can sort the data on the basis of a single column using sort_values( ) function

Task: Sort mydata by Annual Salary

mydata.sort_values("Annual Salary ",ascending = False)

By default ascending = True and inplace = False. To sort the data in descending order we need to specify ascending = False. To make the changes in original data you need to set inplace = True, otherwise you need to save your data in a new dataset.


Sorting data by multiple variables

Task: Sort mydata by Department and Annual Salary

mydata.sort_values(['Department',"Annual Salary "],ascending = False)

Task: Sort a single column: Annual Salary

mydata['Annual Salary '].sort_values()

Ranking the data


Let us firstly create a copy of our dataset:


data3 = mydata.copy()

To provide ranks to a dataset we can use rank( ) function.

By default ascending = True which means lowest value will be provided rank 1.

data3['Rank_of_salary']  = data3['Annual Salary '].rank(ascending=True)

data3.head()

Ranking and group by

Task: Provide ranks on the basis of annual salary within each department.

In the following code we have grouped our annual salary by department and then have applied rank( ) function. Note: we have kept ascending = False.

data3['Rank_by_department'] = data3['Annual Salary '].groupby(data3.Department).rank(ascending =False)

Let us filter the above data for consulting department:

consulting = data3.loc[data3.Department == 'Consulting',:]

Using sort_values( ) function let us sort the data by rank and observe the output.

consulting.sort_values('Rank_by_department')

Getting descriptive statistics


describe( ) function

We can get basic summary statistics for our data using describe( ) function

mydata.describe()

By default describe returns summary statistics for numeric columns. To get summary statistics for character or categorical column we need to specify include ='object'






mydata.describe(include='object')








To get only one statistic(say mean) we can calculate it as follows:

mydata.mean()

Output:

Gender 1.511111 Annual Salary 55723.733333 Performance Rating 3.986667 dtype: float64


Task: Calculate median for all numeric columns in mydata

mydata.median()

Output:

Gender 2.0 Annual Salary 50400.0 Performance Rating 4.5 dtype: float64


Task: Calculate average annual salary using mydata.

mydata['Annual Salary '].mean()

Output: 55723.73333333333


Task: Calculate average annual salary and performance rating using mydata.

mydata[['Annual Salary ','Performance Rating']].mean()
mydata.loc[:,['Annual Salary ','Performance Rating']].mean()

Output:

Annual Salary 55723.733333 Performance Rating 3.986667 dtype: float64


agg function


We can calculate multiple summary statistics using agg( ) function.

Task: Calculate mean and median for all the columns.

mydata.agg(['mean','median'])




Task: Calculate mean and median for annual salary and performance rating,

mydata[['Annual Salary ','Performance Rating']].agg(['mean','median'])





Task: Calculate mean and median for annual salary and minimum and maximum performance rating.

For this we can define a dictionary where keys are the column names and values are the list of statistics for the corresponding keys.

Eg. We have the key as Annual Salary and its values are ['mean' , 'median'] ( a list)

mydata.agg({"Annual Salary " : ['mean','median'], "Performance Rating" : ['min','max']})







Group by


Task: Retrieve the minimum value for all the variables for each department.

Since we want to divide our data on the basis of each department and then get the minimum values, for this we have grouped our data by department and then applied the min( ) function to get minimum values for all the columns.

mydata.groupby('Department').min()








Grouping by multiple variables


Task: Retrieve the minimum value for all the variables for each department and gender.

Here have specified 2 variables Department and Gender in our group by

mydata.groupby(['Department','Gender']).min()









By default group by is returning minimum values for all the columns. To fetch minimum values for a single column (say Annual Salary ), after group_by we select the column name and then calculate the summary statistic.

mydata.groupby(['Department','Gender'])['Annual Salary '].min()







Calculating multiple summary statistics:


Task: Calculate mean and median values for each Department and Gender combination.

To calculate multiple summary statistics we can use agg( ) function. Inside agg we have defined the summary statistics which we want to retrieve.

mydata.groupby(['Gender','Department']).agg(['mean','median'])











To learn more about group by in detail you can refer:

Group by and aggregate functions in Python


Calculating Percentiles


Using quantile( ) function we can get percentiles or quantiles for our data.


Task: Calculate 1st quartile or 25th quantile for all the variables.

mydata.quantile(0.25)

Output:

Gender 1.0 Annual Salary 35880.0 Performance Rating 3.2 Name: 0.25, dtype: float64


Task: Calculate 25th,40th, 8th and 99 quantile for all the variables.

mydata.quantile([0.25,0.4,0.8,0.99])







Task: Calculate 25th,40th, 8th and 99 quantile for annual salary

mydata['Annual Salary '].quantile([0.25,0.4,0.8,0.99])

Output:

0.25 35880.0 0.40 46430.0 0.80 71600.0 0.99 127912.0 Name: Annual Salary , dtype: float64


Cumulative sum


Let us firstly sort our data by department and annual salary. (We have sorted salaries in ascending order)

dummydata = mydata[['Department','Annual Salary ']].sort_values(['Department','Annual Salary '], ascending = [True,False])

We can calculate cumulative sum using cumsum( ) function.


Task: Calculate cumulative sum for the above data.

dummydata['Annual Salary '].cumsum()

Group by and cumulative sum

Task: Calculate cumulative sum for each department.

For this we have firstly used groupby( ) command and then applied cumsum( ) function.

dummydata['Department_wise_cumsum'] = dummydata.groupby('Department').cumsum()
dummydata

Dealing with missing values


A missing value can be created using Numpy's .nan function. So let us firstly import Numpy as np

import numpy as np

Dataset:


Let us create a dataset where missing values are created by np.nan

data1 = {'Product_category': ['Makeup', 'Biscuits', 'Household_Cleaning', 'Cold_drinks'],
        'Sales': [10000, 2000, 17891, np.nan],
        'Returns' : [678, np.nan, 678,234]}

data1 = pd.DataFrame(data1)
data1







isnull ( ) function

isnull( ) function returns True when a missing value is encountered, otherwise returns False.

data1.isnull()

For index 1 and 3 we have Returns and Sales value as True respectively.








Applying isnull( ) on a column

data1.Sales.isnull()

In the data, in sales column we had a missing value at index =3 thus we have a True at that position.







Filtering for rows having missing values.


Task: Filter for rows where Sales values are missing.

We can filter the rows using .loc and apply our Boolean vector of True and False (which we have got the in the previous code)

data1.loc[data1.Sales.isnull(),]





notnull ( )

notnull is complementary of isnull( ) .When the data is missing then notnull( ) returns a False, while for non-missing data it returns True.


Filtering for rows having non-missing values.

Task: Filter for rows where Sales values are not missing.

We can filter the rows using .loc and apply our Boolean vector of True and False i.e. data1.Sales.notnull()

data1.loc[data1.Sales.notnull(),]







Getting total number of missing values.


Using sum( ) and isnull( ) function we can get the total number of missing values in a column

sum(data1.Sales.isnull())
data1.Sales.isnull().sum() #Alternatively

Output: 1


Getting total number of non-missing values.


Using sum( ) and notnull( ) function we can get the total number of non-missing values in a column

sum(data1.Sales.notnull())
data1.Sales.notnull().sum()

Output: 3


Filling missing values


We can fill missing value in a data using fillna( ) function.

data1.fillna(value="Missing")








Dropping missing values


Using dropna( ) function we can drop the rows having missing values.


By specifying how = 'all' we are telling Python to delete the rows where all of the columns have missing values (that is entire row is full of NaN)

data1.dropna(how='all')

Since we did not have any row where all columns have missing values thus no row got deleted.









By specifying how = 'any' we are telling Python to delete the rows where at least one missing is present.

data1.dropna(how='any')





In the above 2 codes we are considering all the columns for missing values. To consider only some of the columns we define subset = ['list of columns to be considered']


For the code chunk below, those rows will be deleted where there is missing value is Sales column.

data1.dropna(subset = ['Sales'],how = 'all')








In the following code, those rows will get deleted where there is missing value in either of the 2 columns (as how = "any")

data1.dropna(subset = ['Sales','Returns'],how = 'any')





inplace = True

By default Python does not make changes in the original dataset as inplace = False. To make changes in the original data we need to specify inplace = True

data1.dropna(subset = ['Sales','Returns'],how = 'all',inplace = True)
data1


Dealing with duplicate values


Let us create a dataset which we will use

data = pd.DataFrame({
"Product_Category" : ["Makeup","Makeup","Makeup","Cold_drinks","Sauces","Sauces"], 
"Sales" : [15000,15000,10000,23454,7800,9000]})
data