- Ekta Aggarwal

# Pandas Demystified!

Updated: Feb 14

Pandas is one of the most powerful Python libraries which make it a programmer's choice! It is built over Numpy and matplotlib and is used for data manipulations.

In this tutorial we will learn with plethora of examples to understand how data can be wrangled!

__Key topics covered:__

__Reading a dataset:__

__Reading a dataset:__

In this tutorial we will make use of following CSV file:

Let us read our file using** pandas' read_csv **function. Do specify the file path where your file is located:

```
import pandas as pd
mydata = pd.read_csv("C:\\Users\\Employee_info.csv")
```

**Viewing only first or last N rows.**

**head( ) function**

In Python head( ) function is used to view first N rows. By default head( ) returns first 5 rows.

`mydata.head()`

We can modify the number of rows in head( ) . In the following code, I have specified that I want to view first 2 rows only.

`mydata.head(2)`

**tail( ) function**

tail( ) function is used to view last N rows. By default tail( ) returns last 5 rows.

`mydata.tail()`

In the following code, we have specified that we want to view last 2 rows only.

`mydata.tail(2)`

**What is the shape of my data?**

Shape of a data defines the number of rows and columns.

Using **shape function **we can get the shape of our **mydata/**

`mydata.shape`

** Output :** (45, 5)

Shape function returns 2 values: where first value (value at index 0 ) is the number of rows, while second value (value at index 1) denotes the number of columns.

Following code returns the number of rows.

`mydata.shape[0]`

** Output:** 45

By specifying [1] with shape, we can get the number of columns.

`mydata.shape[1]`

** Output:** 5

**Knowing our data's structure**

We can get the column names using **columns **function as follows:

`mydata.columns`

__Output: __

Index(['Employee Name', 'Gender', 'Department', 'Annual Salary ', 'Performance Rating'], dtype='object')

Row names are referred as **index** in Python. Using **index** function one can get the row names of the data.

`mydata.index`

__Output: __

RangeIndex(start=0, stop=45, step=1)

*Range (0,45) means a counting from 0 to 44 (45 is excluded)

**dtypes function**

We can understand the type of our data using **dtypes **function

`mydata.dtypes`

In the following output 'object' means character;

int64 means integer

and float64 means float (decimal)values

are there in the corresponding column.

__Output:__

Employee Name object Gender int64 Department object Annual Salary int64 Performance Rating float64 dtype: object

**astype function**

We can change the type of our columns using **astype function.**

In the following code we are converting annual salary to a float.

` mydata['Annual Salary '] = mydata['Annual Salary '].astype(float)`

`mydata['Annual Salary '].dtypes`

** Output: **dtype('float64')

**Unique**

**unique function**

To get the unique values in a column we can use unique function

** Task:** Get the unique department names in

**mydata.**

`mydata.Department.unique()`

** Output**: array(['Consulting', 'Technology', 'Outsourcing'], dtype=object)

**nunique function**

To count the number of unique values we have nunique function

** Task:** Count the unique department names in

**mydata.**

`mydata.Department.nunique()`

** Output**: 3

**Getting a frequency distribution:**

**
value_counts**

Using **value_counts **one can get the frequency distribution of a variable.

For eg. In mydata for department columns: Consulting is appearing 17 times, Technology 17 times and outsourcing 11 times.

`mydata.Department.value_counts()`

__Output:__

Consulting 17 Technology 17 Outsourcing 11 Name: Department, dtype: int64

We can sort the data on the basis of frequency in ascending order by specifying ascending = True.

`mydata.Department.value_counts(ascending=True)`

__Output:__

Outsourcing 11 Technology 17 Consulting 17 Name: Department, dtype: int64

**crosstab**

Using **crosstab **one can get the frequency distribution of 2 variables.

`pd.crosstab(mydata.Department,mydata.Gender)`

**Selecting only numeric or character columns**

**_get_numeric_data**

To get only numeric columns in our data we can use **_get_numeric_data( )** function

```
numeric_only = mydata._get_numeric_data()
numeric_only.head()
```

Alternatively

**select_dtypes **:

By specifying exclude = 'object' (i.e. excluding character columns) we can get the numeric columns.

```
numeric_only = mydata.select_dtypes(exclude='object')
numeric_only.head()
```

By specifying include = 'object' we can get the character columns.

```
character_only = mydata.select_dtypes(include='object')
character_only.head()
```

To get numeric columns we can select multiple dtypes by providing the list as follows:

```
import numpy as np
numeric_only = mydata.select_dtypes(include=['int','float',np.number])
numeric_only.head()
```

**Dropping rows or columns using drop( )**

Using **drop **function we can drop rows and columns from our data.

** Task:** Drop department columns from our data.

`mydata.drop('Department',axis = 1).head()`

Alternatively, you can specify axis = 'columns' to delete the column.

`mydata.drop('Department',axis = "columns").head()`

__Dropping multiple columns:__

To drop multiple columns we specify the list of columns to be deleted:

** Task:** Drop the columns department and gender from mydata

`mydata.drop(['Department','Gender'],axis = "columns").head()`

__Dropping rows:__

To drop rows from our data we specify axis = 0 or axis = "index"

** Task: **Drop first 10 rows from mydata

`mydata.drop(range(10),axis = "index").head()`

Alternatively,

`mydata.drop(range(10),axis = 0).head()`

**Renaming columns in a dataframe**

Let us create a copy of our dataset as mydata_copy.

`mydata_copy= mydata.copy()`

**Renaming all the columns**

We can fetch the column names as follows:

`mydata_copy.columns`

**Output:**

Index(['Emp Name', 'Sex', 'Dept', 'Salary', 'Rating'], dtype='object')

To rename all the columns we can assign a list of new column names as follows:

Note: The column names must be in exact order as the previous column names

`mydata_copy.columns = ['Name','Gender','Dept','Salary','Rating']`

`mydata_copy.head(2)`

**Renaming only some of the columns**

To rename only some of the columns we use rename( ) function. In rename( ) we define columns = {our dictionary of key-values}

**Keys are Old column names**

**Values are New column names**

Let us rename columns Name and Gender to Emp Name and Sex respectively.

`mydata_copy.rename(columns={"Name" : "Emp Name","Gender" : "Sex"})`

By default** inplace = False** which means the changes won't be made in original dataset. Thus we need to save the new data.

To make changes in original data we set **inplace = True.**

`mydata_copy.rename(columns={"Name" : "Emp Name","Gender" : "Sex"},inplace = True)`

`mydata_copy.columns`

__Output__

`Index(['Emp Name', 'Sex', 'Dept', 'Salary', 'Rating'], dtype='object')`

**Creating new columns in a dataframe**

Let us create a copy of our dataset as mydata_copy and rename our columns.

```
mydata_copy= mydata.copy()
mydata_copy.columns = ['Name','Gender','Dept','Salary','Rating']
```

Let us create new column **Monthly_Salary ** = Salary/12

**Using square brackets [ ]:**

To create a new colujmn we need to define new column name in [ ] square brackets. and define our formula as follows:

`mydata_copy['Monthly_Salary'] = mydata_copy['Salary']/12`

**Note:** We cannot create new columns by defining data_name.new_column_name. **We MUST use [ ] square brackets to create new columns.**

Following code will lead to an error.

`mydata_copy.Monthly_Salary_exp= mydata_copy['Salary']/12`

**Using eval function:**

In the **eval **function we can specify our column names directly without mentioning the name of our data repeatedly.

In the LHS, we have used eval( ) function i.e. we will not need to write mydata_copy['column name'] .

`mydata_copy['Monthly_Salary2'] = mydata_copy.eval('Salary/12')`

**Using assign function: **

If we have to create multiple columns then we can use assign function as follows:

__Syntax:__

*new_data= old_data.assign(
new_col_1= calculation_for_new_col1, new_col_2 = calculation_for_new_col2)*

`mydata_copy2 = mydata_copy.assign(Monthly_Salary3 = mydata_copy.Salary/12)`

`mydata_copy2.head(3)`

**Filtering data **

__Filtering with loc__

**To filter the data with loc we need to provide the column or index names!!!!!**

__Basic sytax for filtering: __

For filtering the data we need to provide 2 positions separated by a comma inside a comma i.e .[ a,b] where a denotes the row names or Boolean vector for rows, while b is the column names or Boolean vector for columns. To select all the rows or all the columns we denote it by a colon ( : )

**Task:** Filter the data for all the rows and 2 columns: Employee name and gender.

Since we are selecting all the rows thus we have written a colon ( :) instead of 'a'.

To select multiple columns we have providing our list ['Employee Name','Gender'] in place of 'b'

`mydata.loc[:,['Employee Name','Gender']]`

**Task:** Filter the data for first 5 rows and 2 columns: Employee name and gender.

Since we are selecting first 5 rows only thus we have defined our range 0:5 in place of 'a' which means start from row 0 till row 4 (5 is excluded)

`mydata.loc[0:5,['Employee Name','Gender']]`

Alternatively

range(5) means 0:5 (i.e. 5 is excluded)

`mydata.loc[range(5),['Employee Name','Gender']]`

**Task:** Filter the data for first 5 rows and all the columns

`mydata.loc[range(5),:]`

__Filtering for consecutive columns:__

To select consecutive columns we can write columnName1 : columnNameN by which all columns starting from columnName1 till columnNameN will be selected.

**Task:** Filter first 5 rows and columns from employee name till annual salary.

Here we have written "Employee Name" : "Annual Salary " to retrieve all the column between them (including both of them)

`mydata.loc[range(5),"Employee Name" : "Annual Salary "]`

__Filtering using Boolean vectors.__

**Task:** Filter the data for the rows where gender is 1.

In the below code **mydata.Gender == 1** returns a Boolean vector which takes the value True when gender is 1, otherwise False.

Wherever our Boolean vector has True, only those rows will be selected.

`mydata.loc[mydata.Gender == 1,:]`

__Filtering for multiple conditions__

**AND condition - when multiple condition need to be true **

**Task:** Filter the data for the rows where gender is 1 and department is consulting.

In the below code **(mydata.Gender == 1) & (mydata.Department == 'Consulting') **returns a Boolean vector which takes the value True when both gender is 1 and Department is consulting, otherwise False.

`mydata.loc[(mydata.Gender == 1) & (mydata.Department == 'Consulting'),:]`

**Task:** In the above query select only employee name and salary

`mydata.loc[(mydata.Gender == 1) & (mydata.Department == 'Consulting'),['Employee Name','Annual Salary ']]`

**OR condition - when at least one condition needs to be true **

**Task:** Filter the data for the rows where either gender is 1 or Salary is more than 110000.

Since we need to apply an 'or' condition i.e. at least one of the conditions should be true thus in Python we use '|' to denote 'or'

`mydata.loc[(mydata.Gender == 1) | (mydata['Annual Salary '] > 110000),['Employee Name','Annual Salary ']]`

**IN CONDITION**

**Task: **Filter for rows where department is either consulting or outsourcing.

We can achieve this using or '|' condition as follows:

`mydata.loc[(mydata.Department == 'Outsourcing') | (mydata.Department == 'Consulting'),:]`

But suppose we have to choose for multiple values (say 10 departments) then writing OR condition would be too tedious.

To simplify it we have **.isin( ) **function in Python. Inside isin( ) we specify the list of values which need to be searched.

In the following code we have written **mydata.Department.isin(['Outsourcing','Consulting']) **i.e. this will return a Boolean vector where department name is either outsourcing or consulting.

`mydata.loc[mydata.Department.isin(['Outsourcing','Consulting']),:]`

**Excluding the rows satisfying a particular condition**

**Task: **Filter for rows where department is neither consulting nor outsourcing.

It means we do not want the department names to be consulting and outsourcing. To do this, we can firstly create a vector using .isin( ) which will return True where department is consulting or outsourcing, and then specify to exclude such rows i.e. provide negation.

To provide a negation we write a (~) symbol before the Boolean vector.

`mydata.loc[~mydata.Department.isin(['Outsourcing','Consulting']),:]`

**pd.query**

We can filter our rows using pd.query( ) function

**Task: **Filter for rows where Gender is 1.

`mydata.query('Gender ==1')`

__Filtering with iloc__

**To filter the data with iloc we need to provide the column or index names!!!!!**

**Task:** Filter for first 5 rows. (Since our index names are 0:44, thus iloc and loc lead to same output)

`mydata.iloc[range(5)]`

**Task:** Filter for first 5 rows and first 3 columns.

To filter for first 3 columns we have defined 0:3 after comma.

`mydata.iloc[0:5,0:3]`

**Obtaining indices using np.where**

**Task: **Filter for rows where department is consulting.

Numpy's where function returns the indices wherever a particular condition is found True. We can then use those indices in iloc to filter the rows.

`import numpy as np`

`np.where(mydata['Department'] == 'Consulting')`

__Output:__

(array([ 0, 2, 3, 7, 8, 10, 17, 19, 21, 25, 28, 31, 32, 33, 34, 36, 41], dtype=int64),)

`mydata.iloc[np.where(mydata['Department'] == 'Consulting')]`

**Managing index in a dataframe.**

**Setting the index in an already existing dataset**

Using **set_index**( ) we can set our index in already existing dataset.

`data1 = mydata.set_index(["Employee Name","Gender"]).sort_index()`

`data1.head(10)`

**Resetting the index**

We can also reset the index i.e. make them columns in our data using **reset_index( ) function.**

By default inplace = False, which means to save the changes we need to create a new dataset.

To make the changes in the original dataset we mention** inplace = True**

```
data1.reset_index(inplace = True)
data1.head()
```

To learn more about how to deal with index for filtering them refer to this tutorial:

__Setting and filtering index in Python.__

**Sorting the data**

We can sort the data on the basis of a single column using **sort_values**( ) function

**Task: **Sort **mydata **by Annual Salary

`mydata.sort_values("Annual Salary ",ascending = False)`

By default ascending = True and inplace = False. To sort the data in descending order we need to specify ascending = False. To make the changes in original data you need to set inplace = True, otherwise you need to save your data in a new dataset.

**Sorting data by multiple variables**

**Task: **Sort **mydata **by Department and Annual Salary

`mydata.sort_values(['Department',"Annual Salary "],ascending = False)`

**Task: **Sort a single column: Annual Salary

`mydata['Annual Salary '].sort_values()`

**Ranking the data**

Let us firstly create a copy of our dataset:

`data3 = mydata.copy()`

To provide ranks to a dataset we can use ** rank( ) function.**

By default ascending = True which means lowest value will be provided rank 1.

```
data3['Rank_of_salary'] = data3['Annual Salary '].rank(ascending=True)
data3.head()
```

__Ranking and group by__

**Task:** Provide ranks on the basis of annual salary within each department.

In the following code we have grouped our annual salary by department and then have applied **rank( ) function.** Note: we have kept ascending = False.

`data3['Rank_by_department'] = data3['Annual Salary '].groupby(data3.Department).rank(ascending =False)`

Let us filter the above data for consulting department:

`consulting = data3.loc[data3.Department == 'Consulting',:]`

Using **sort_values( ) function** let us sort the data by rank and observe the output.

`consulting.sort_values('Rank_by_department')`

**Getting descriptive statistics**

**describe( ) function**

We can get basic summary statistics for our data using describe( ) function

`mydata.describe()`

By default describe returns summary statistics for numeric columns. To get summary statistics for character or categorical column we need to specify** include ='object'**

`mydata.describe(include='object')`

To get only one statistic(say mean) we can calculate it as follows:

`mydata.mean()`

** Output**:

Gender 1.511111 Annual Salary 55723.733333 Performance Rating 3.986667 dtype: float64

** Task: **Calculate median for all numeric columns in

**mydata**

`mydata.median()`

__Output:__

Gender 2.0 Annual Salary 50400.0 Performance Rating 4.5 dtype: float64

** Task**: Calculate average annual salary using mydata.

`mydata['Annual Salary '].mean()`

** Output:** 55723.73333333333

** Task**: Calculate average annual salary and performance rating using mydata.

```
mydata[['Annual Salary ','Performance Rating']].mean()
mydata.loc[:,['Annual Salary ','Performance Rating']].mean()
```

__Output:__

Annual Salary 55723.733333 Performance Rating 3.986667 dtype: float64

**agg function**

We can calculate multiple summary statistics using agg( ) function.

**Task: **Calculate mean and median for all the columns.

`mydata.agg(['mean','median'])`

**Task: **Calculate mean and median for annual salary and performance rating,

`mydata[['Annual Salary ','Performance Rating']].agg(['mean','median'])`

**Task: **Calculate mean and median for annual salary and minimum and maximum performance rating.

For this we can define a dictionary where keys are the column names and values are the list of statistics for the corresponding keys.

Eg. We have the key as Annual Salary and its values are ['mean' , 'median'] ( a list)

`mydata.agg({"Annual Salary " : ['mean','median'], "Performance Rating" : ['min','max']})`

**Group by **

**Task: **Retrieve the minimum value for all the variables for each department.

Since we want to divide our data on the basis of each department and then get the minimum values, for this we have **grouped our data by **department and then applied the min( ) function to get minimum values for all the columns.

`mydata.groupby('Department').min()`

**Grouping by multiple variables**

**Task: **Retrieve the minimum value for all the variables for each department and gender.

Here have specified 2 variables Department and Gender in our group by

`mydata.groupby(['Department','Gender']).min()`

By default group by is returning minimum values for all the columns. To fetch minimum values for a single column (say Annual Salary ), after group_by we select the column name and then calculate the summary statistic.

`mydata.groupby(['Department','Gender'])['Annual Salary '].min()`

**Calculating multiple summary statistics:**

**Task: **Calculate mean and median values for each Department and Gender combination.

To calculate multiple summary statistics we can use **agg( ) **function. Inside agg we have defined the summary statistics which we want to retrieve.

`mydata.groupby(['Gender','Department']).agg(['mean','median'])`

To learn more about group by in detail you can refer:

__Group by and aggregate functions in Python__

**Calculating Percentiles**

Using **quantile( ) **function we can get percentiles or quantiles for our data.

**Task: **Calculate 1st quartile or 25th quantile for all the variables.

`mydata.quantile(0.25)`

__Output:__

Gender 1.0 Annual Salary 35880.0 Performance Rating 3.2 Name: 0.25, dtype: float64

**Task: **Calculate 25th,40th, 8th and 99 quantile for all the variables.

`mydata.quantile([0.25,0.4,0.8,0.99])`

**Task: **Calculate 25th,40th, 8th and 99 quantile for annual salary

`mydata['Annual Salary '].quantile([0.25,0.4,0.8,0.99])`

__Output:__

0.25 35880.0 0.40 46430.0 0.80 71600.0 0.99 127912.0 Name: Annual Salary , dtype: float64

**Cumulative sum**

Let us firstly sort our data by department and annual salary. (We have sorted salaries in ascending order)

`dummydata = mydata[['Department','Annual Salary ']].sort_values(['Department','Annual Salary '], ascending = [True,False])`

**We can calculate cumulative sum using cumsum( ) function.**

Task: Calculate cumulative sum for the above data.

`dummydata['Annual Salary '].cumsum()`

**Group by and cumulative sum**

**Task: **Calculate cumulative sum for each department.

For this we have firstly used **groupby( ) **command and then applied **cumsum( ) **function.

```
dummydata['Department_wise_cumsum'] = dummydata.groupby('Department').cumsum()
dummydata
```

**Dealing with missing values**

A missing value can be created using Numpy's .nan function. So let us firstly import Numpy as np

`import numpy as np`

__Dataset:__

Let us create a dataset where missing values are created by np.nan

```
data1 = {'Product_category': ['Makeup', 'Biscuits', 'Household_Cleaning', 'Cold_drinks'],
'Sales': [10000, 2000, 17891, np.nan],
'Returns' : [678, np.nan, 678,234]}
data1 = pd.DataFrame(data1)
data1
```

__isnull ( ) function__

isnull( ) function returns True when a missing value is encountered, otherwise returns False.

`data1.isnull()`

For index 1 and 3 we have Returns and Sales value as True respectively.

Applying isnull( ) on a column

`data1.Sales.isnull()`

In the data, in sales column we had a missing value at index =3 thus we have a True at that position.

__Filtering for rows having missing values.__

**Task: **Filter for rows where Sales values are missing.

We can filter the rows using .loc and apply our Boolean vector of True and False (which we have got the in the previous code)

`data1.loc[data1.Sales.isnull(),]`

__notnull ( )__

notnull is complementary of isnull( ) .When the data is missing then notnull( ) returns a False, while for non-missing data it returns True.

__Filtering for rows having non-missing values.__

**Task: **Filter for rows where Sales values are not missing.

We can filter the rows using .loc and apply our Boolean vector of True and False i.e. data1.Sales.notnull()

`data1.loc[data1.Sales.notnull(),]`

__Getting total number of missing values.__

Using sum( ) and isnull( ) function we can get the total number of missing values in a column

```
sum(data1.Sales.isnull())
data1.Sales.isnull().sum() #Alternatively
```

Output: 1

__Getting total number of non-missing values.__

Using sum( ) and notnull( ) function we can get the total number of non-missing values in a column

```
sum(data1.Sales.notnull())
data1.Sales.notnull().sum()
```

Output: 3

__Filling missing values__

We can fill missing value in a data using fillna( ) function.

`data1.fillna(value="Missing")`

__Dropping missing values__

Using dropna( ) function we can drop the rows having missing values.

By specifying how = 'all' we are telling Python to delete the rows where all of the columns have missing values (that is entire row is full of NaN)

`data1.dropna(how='all')`

Since we did not have any row where all columns have missing values thus no row got deleted.

By specifying how = 'any' we are telling Python to delete the rows where at least one missing is present.

`data1.dropna(how='any')`

In the above 2 codes we are considering all the columns for missing values. To consider only some of the columns we define **subset = **['list of columns to be considered']

For the code chunk below, those rows will be deleted where there is missing value is Sales column.

`data1.dropna(subset = ['Sales'],how = 'all')`

In the following code, those rows will get deleted where there is missing value in either of the 2 columns (as how = "any")

`data1.dropna(subset = ['Sales','Returns'],how = 'any')`

**inplace = True**

By default Python does not make changes in the original dataset as inplace = False. **To make changes in the original data we need to specify inplace = True**

```
data1.dropna(subset = ['Sales','Returns'],how = 'all',inplace = True)
data1
```

**Dealing with duplicate values**

Let us create a dataset which we will use

```
data = pd.DataFrame({
"Product_Category" : ["Makeup","Makeup","Makeup","Cold_drinks","Sauces","Sauces"],
"Sales" : [15000,15000,10000,23454,7800,9000]})
data
```