Pandas is one of the most powerful Python libraries which make it a programmer's choice! It is built over Numpy and matplotlib and is used for data manipulations.

In this tutorial we will learn with plethora of examples to understand how data can be wrangled!

__Key topics covered:__

__Reading a dataset:__

__Reading a dataset:__

In this tutorial we will make use of following CSV file:

Let us read our file using** pandas' read_csv **function. Do specify the file path where your file is located:

```
import pandas as pd
mydata = pd.read_csv("C:\\Users\\Employee_info.csv")
```

**Viewing only first or last N rows.**

**head( ) function**

In Python head( ) function is used to view first N rows. By default head( ) returns first 5 rows.

`mydata.head()`

We can modify the number of rows in head( ) . In the following code, I have specified that I want to view first 2 rows only.

`mydata.head(2)`

**tail( ) function**

tail( ) function is used to view last N rows. By default tail( ) returns last 5 rows.

`mydata.tail()`

In the following code, we have specified that we want to view last 2 rows only.

`mydata.tail(2)`

**What is the shape of my data?**

Shape of a data defines the number of rows and columns.

Using **shape function **we can get the shape of our **mydata/**

`mydata.shape`

** Output :** (45, 5)

Shape function returns 2 values: where first value (value at index 0 ) is the number of rows, while second value (value at index 1) denotes the number of columns.

Following code returns the number of rows.

`mydata.shape[0]`

** Output:** 45

By specifying [1] with shape, we can get the number of columns.

`mydata.shape[1]`

** Output:** 5

**Knowing our data's structure**

We can get the column names using **columns **function as follows:

`mydata.columns`

__Output: __

Index(['Employee Name', 'Gender', 'Department', 'Annual Salary ', 'Performance Rating'], dtype='object')

Row names are referred as **index** in Python. Using **index** function one can get the row names of the data.

`mydata.index`

__Output: __

RangeIndex(start=0, stop=45, step=1)

*Range (0,45) means a counting from 0 to 44 (45 is excluded)

**dtypes function**

We can understand the type of our data using **dtypes **function

`mydata.dtypes`

In the following output 'object' means character;

int64 means integer

and float64 means float (decimal)values

are there in the corresponding column.

__Output:__

Employee Name object Gender int64 Department object Annual Salary int64 Performance Rating float64 dtype: object

**astype function**

We can change the type of our columns using **astype function.**

In the following code we are converting annual salary to a float.

` mydata['Annual Salary '] = mydata['Annual Salary '].astype(float)`

`mydata['Annual Salary '].dtypes`

__Output:__** **dtype('float64')

**Unique**

**unique function**

To get the unique values in a column we can use unique function

** Task:** Get the unique department names in

**mydata.**

`mydata.Department.unique()`

** Output**: array(['Consulting', 'Technology', 'Outsourcing'], dtype=object)

**nunique function**

To count the number of unique values we have nunique function

** Task:** Count the unique department names in

**mydata.**

`mydata.Department.nunique()`

** Output**: 3

**Getting a frequency distribution:**

**value_counts**

Using **value_counts **one can get the frequency distribution of a variable.

For eg. In mydata for department columns: Consulting is appearing 17 times, Technology 17 times and outsourcing 11 times.

`mydata.Department.value_counts()`

__Output:__

Consulting 17 Technology 17 Outsourcing 11 Name: Department, dtype: int64

We can sort the data on the basis of frequency in ascending order by specifying ascending = True.

`mydata.Department.value_counts(ascending=True)`

__Output:__

Outsourcing 11 Technology 17 Consulting 17 Name: Department, dtype: int64

**crosstab**

Using **crosstab **one can get the frequency distribution of 2 variables.

`pd.crosstab(mydata.Department,mydata.Gender)`

**Selecting only numeric or character columns**

**_get_numeric_data**

To get only numeric columns in our data we can use **_get_numeric_data( )** function

```
numeric_only = mydata._get_numeric_data()
numeric_only.head()
```

Alternatively

**select_dtypes **:

By specifying exclude = 'object' (i.e. excluding character columns) we can get the numeric columns.

```
numeric_only = mydata.select_dtypes(exclude='object')
numeric_only.head()
```

By specifying include = 'object' we can get the character columns.

```
character_only = mydata.select_dtypes(include='object')
character_only.head()
```

To get numeric columns we can select multiple dtypes by providing the list as follows:

```
import numpy as np
numeric_only = mydata.select_dtypes(include=['int','float',np.number])
numeric_only.head()
```

**Dropping rows or columns using drop( )**

Using **drop **function we can drop rows and columns from our data.

** Task:** Drop department columns from our data.

`mydata.drop('Department',axis = 1).head()`

Alternatively, you can specify axis = 'columns' to delete the column.

`mydata.drop('Department',axis = "columns").head()`

__Dropping multiple columns:__

To drop multiple columns we specify the list of columns to be deleted:

** Task:** Drop the columns department and gender from mydata

`mydata.drop(['Department','Gender'],axis = "columns").head()`

__Dropping rows:__

To drop rows from our data we specify axis = 0 or axis = "index"

** Task: **Drop first 10 rows from mydata

`mydata.drop(range(10),axis = "index").head()`

Alternatively,

`mydata.drop(range(10),axis = 0).head()`

**Renaming columns in a dataframe**

Let us create a copy of our dataset as mydata_copy.

`mydata_copy= mydata.copy()`

**Renaming all the columns**

We can fetch the column names as follows:

`mydata_copy.columns`

__Output:__

Index(['Emp Name', 'Sex', 'Dept', 'Salary', 'Rating'], dtype='object')

To rename all the columns we can assign a list of new column names as follows:

Note: The column names must be in exact order as the previous column names

`mydata_copy.columns = ['Name','Gender','Dept','Salary','Rating']`

`mydata_copy.head(2)`

**Renaming only some of the columns**

To rename only some of the columns we use rename( ) function. In rename( ) we define columns = {our dictionary of key-values}

**Keys are Old column names**

**Values are New column names**

Let us rename columns Name and Gender to Emp Name and Sex respectively.

`mydata_copy.rename(columns={"Name" : "Emp Name","Gender" : "Sex"})`

By default** inplace = False** which means the changes won't be made in original dataset. Thus we need to save the new data.

To make changes in original data we set **inplace = True.**

`mydata_copy.rename(columns={"Name" : "Emp Name","Gender" : "Sex"},inplace = True)`

`mydata_copy.columns`

__Output__

`Index(['Emp Name', 'Sex', 'Dept', 'Salary', 'Rating'], dtype='object')`

**Creating new columns in a dataframe**

Let us create a copy of our dataset as mydata_copy and rename our columns.

```
mydata_copy= mydata.copy()
mydata_copy.columns = ['Name','Gender','Dept','Salary','Rating']
```

Let us create new column **Monthly_Salary ** = Salary/12

**Using square brackets [ ]:**

To create a new colujmn we need to define new column name in [ ] square brackets. and define our formula as follows:

`mydata_copy['Monthly_Salary'] = mydata_copy['Salary']/12`

**Note:** We cannot create new columns by defining data_name.new_column_name. **We MUST use [ ] square brackets to create new columns.**

Following code will lead to an error.

`mydata_copy.Monthly_Salary_exp= mydata_copy['Salary']/12`

**Using eval function:**

In the **eval **function we can specify our column names directly without mentioning the name of our data repeatedly.

In the LHS, we have used eval( ) function i.e. we will not need to write mydata_copy['column name'] .

`mydata_copy['Monthly_Salary2'] = mydata_copy.eval('Salary/12')`

**Using assign function: **

If we have to create multiple columns then we can use assign function as follows:

__Syntax:__

*new_data= old_data.assign(*
*new_col_1= calculation_for_new_col1, new_col_2 = calculation_for_new_col2)*

`mydata_copy2 = mydata_copy.assign(Monthly_Salary3 = mydata_copy.Salary/12)`

`mydata_copy2.head(3)`

**Filtering data **

__Filtering with loc__

**To filter the data with loc we need to provide the column or index names!!!!!**

__Basic sytax for filtering: __

For filtering the data we need to provide 2 positions separated by a comma inside a comma i.e .[ a,b] where a denotes the row names or Boolean vector for rows, while b is the column names or Boolean vector for columns. To select all the rows or all the columns we denote it by a colon ( : )

**Task:** Filter the data for all the rows and 2 columns: Employee name and gender.

Since we are selecting all the rows thus we have written a colon ( :) instead of 'a'.

To select multiple columns we have providing our list ['Employee Name','Gender'] in place of 'b'

`mydata.loc[:,['Employee Name','Gender']]`

**Task:** Filter the data for first 5 rows and 2 columns: Employee name and gender.

Since we are selecting first 5 rows only thus we have defined our range 0:5 in place of 'a' which means start from row 0 till row 4 (5 is excluded)

`mydata.loc[0:5,['Employee Name','Gender']]`

Alternatively

range(5) means 0:5 (i.e. 5 is excluded)

`mydata.loc[range(5),['Employee Name','Gender']]`

**Task:** Filter the data for first 5 rows and all the columns

`mydata.loc[range(5),:]`

__Filtering for consecutive columns:__

To select consecutive columns we can write columnName1 : columnNameN by which all columns starting from columnName1 till columnNameN will be selected.

**Task:** Filter first 5 rows and columns from employee name till annual salary.

Here we have written "Employee Name" : "Annual Salary " to retrieve all the column between them (including both of them)

`mydata.loc[range(5),"Employee Name" : "Annual Salary "]`

__Filtering using Boolean vectors.__

**Task:** Filter the data for the rows where gender is 1.

In the below code **mydata.Gender == 1** returns a Boolean vector which takes the value True when gender is 1, otherwise False.

Wherever our Boolean vector has True, only those rows will be selected.

`mydata.loc[mydata.Gender == 1,:]`

__Filtering for multiple conditions__

**AND condition - when multiple condition need to be true **

**Task:** Filter the data for the rows where gender is 1 and department is consulting.

In the below code **(mydata.Gender == 1) & (mydata.Department == 'Consulting') **returns a Boolean vector which takes the value True when both gender is 1 and Department is consulting, otherwise False.

`mydata.loc[(mydata.Gender == 1) & (mydata.Department == 'Consulting'),:]`

**Task:** In the above query select only employee name and salary

`mydata.loc[(mydata.Gender == 1) & (mydata.Department == 'Consulting'),['Employee Name','Annual Salary ']]`

**OR condition - when at least one condition needs to be true **

**Task:** Filter the data for the rows where either gender is 1 or Salary is more than 110000.

Since we need to apply an 'or' condition i.e. at least one of the conditions should be true thus in Python we use '|' to denote 'or'

`mydata.loc[(mydata.Gender == 1) | (mydata['Annual Salary '] > 110000),['Employee Name','Annual Salary ']]`

**IN CONDITION**

**Task: **Filter for rows where department is either consulting or outsourcing.

We can achieve this using or '|' condition as follows:

`mydata.loc[(mydata.Department == 'Outsourcing') | (mydata.Department == 'Consulting'),:]`

But suppose we have to choose for multiple values (say 10 departments) then writing OR condition would be too tedious.

To simplify it we have **.isin( ) **function in Python. Inside isin( ) we specify the list of values which need to be searched.

In the following code we have written **mydata.Department.isin(['Outsourcing','Consulting']) **i.e. this will return a Boolean vector where department name is either outsourcing or consulting.

`mydata.loc[mydata.Department.isin(['Outsourcing','Consulting']),:]`

**Excluding the rows satisfying a particular condition**

**Task: **Filter for rows where department is neither consulting nor outsourcing.

It means we do not want the department names to be consulting and outsourcing. To do this, we can firstly create a vector using .isin( ) which will return True where department is consulting or outsourcing, and then specify to exclude such rows i.e. provide negation.

To provide a negation we write a (~) symbol before the Boolean vector.

`mydata.loc[~mydata.Department.isin(['Outsourcing','Consulting']),:]`

**pd.query**

We can filter our rows using pd.query( ) function

**Task: **Filter for rows where Gender is 1.

`mydata.query('Gender ==1')`

__Filtering with iloc__

**To filter the data with iloc we need to provide the column or index names!!!!!**

**Task:** Filter for first 5 rows. (Since our index names are 0:44, thus iloc and loc lead to same output)

`mydata.iloc[range(5)]`

**Task:** Filter for first 5 rows and first 3 columns.

To filter for first 3 columns we have defined 0:3 after comma.

`mydata.iloc[0:5,0:3]`

**Obtaining indices using np.where**

**Task: **Filter for rows where department is consulting.

Numpy's where function returns the indices wherever a particular condition is found True. We can then use those indices in iloc to filter the rows.

`import numpy as np`

`np.where(mydata['Department'] == 'Consulting')`

__Output:__

(array([ 0, 2, 3, 7, 8, 10, 17, 19, 21, 25, 28, 31, 32, 33, 34, 36, 41], dtype=int64),)

`mydata.iloc[np.where(mydata['Department'] == 'Consulting')]`

**Managing index in a dataframe.**

**Setting the index in an already existing dataset**

Using **set_index**( ) we can set our index in already existing dataset.

`data1 = mydata.set_index(["Employee Name","Gender"]).sort_index()`

`data1.head(10)`

**Resetting the index**

We can also reset the index i.e. make them columns in our data using **reset_index( ) function.**

By default inplace = False, which means to save the changes we need to create a new dataset.

To make the changes in the original dataset we mention** inplace = True**

```
data1.reset_index(inplace = True)
data1.head()
```

To learn more about how to deal with index for filtering them refer to this tutorial:

**Sorting the data**

We can sort the data on the basis of a single column using **sort_values**( ) function

**Task: **Sort **mydata **by Annual Salary

`mydata.sort_values("Annual Salary ",ascending = False)`

By default ascending = True and inplace = False. To sort the data in descending order we need to specify ascending = False. To make the changes in original data you need to set inplace = True, otherwise you need to save your data in a new dataset.

**Sorting data by multiple variables**

**Task: **Sort **mydata **by Department and Annual Salary

`mydata.sort_values(['Department',"Annual Salary "],ascending = False)`

**Task: **Sort a single column: Annual Salary

`mydata['Annual Salary '].sort_values()`

**Ranking the data**

Let us firstly create a copy of our dataset:

`data3 = mydata.copy()`

To provide ranks to a dataset we can use ** rank( ) function.**

By default ascending = True which means lowest value will be provided rank 1.

```
data3['Rank_of_salary'] = data3['Annual Salary '].rank(ascending=True)
data3.head()
```

__Ranking and group by__

**Task:** Provide ranks on the basis of annual salary within each department.

In the following code we have grouped our annual salary by department and then have applied **rank( ) function.** Note: we have kept ascending = False.

`data3['Rank_by_department'] = data3['Annual Salary '].groupby(data3.Department).rank(ascending =False)`

Let us filter the above data for consulting department:

`consulting = data3.loc[data3.Department == 'Consulting',:]`

Using **sort_values( ) function** let us sort the data by rank and observe the output.

`consulting.sort_values('Rank_by_department')`

**Getting descriptive statistics**

**describe( ) function**

We can get basic summary statistics for our data using describe( ) function

`mydata.describe()`

By default describe returns summary statistics for numeric columns. To get summary statistics for character or categorical column we need to specify** include ='object'**

`mydata.describe(include='object')`

To get only one statistic(say mean) we can calculate it as follows:

`mydata.mean()`

** Output**:

Gender 1.511111 Annual Salary 55723.733333 Performance Rating 3.986667 dtype: float64

** Task: **Calculate median for all numeric columns in

**mydata**

`mydata.median()`

__Output:__

Gender 2.0 Annual Salary 50400.0 Performance Rating 4.5 dtype: float64

** Task**: Calculate average annual salary using mydata.

`mydata['Annual Salary '].mean()`

** Output:** 55723.73333333333

** Task**: Calculate average annual salary and performance rating using mydata.

```
mydata[['Annual Salary ','Performance Rating']].mean()
mydata.loc[:,['Annual Salary ','Performance Rating']].mean()
```

__Output:__

Annual Salary 55723.733333 Performance Rating 3.986667 dtype: float64

**agg function**

We can calculate multiple summary statistics using agg( ) function.

**Task: **Calculate mean and median for all the columns.

`mydata.agg(['mean','median'])`

**Task: **Calculate mean and median for annual salary and performance rating,

`mydata[['Annual Salary ','Performance Rating']].agg(['mean','median'])`

**Task: **Calculate mean and median for annual salary and minimum and maximum performance rating.

For this we can define a dictionary where keys are the column names and values are the list of statistics for the corresponding keys.

Eg. We have the key as Annual Salary and its values are ['mean' , 'median'] ( a list)

`mydata.agg({"Annual Salary " : ['mean','median'], "Performance Rating" : ['min','max']})`

**Group by **

**Task: **Retrieve the minimum value for all the variables for each department.

Since we want to divide our data on the basis of each department and then get the minimum values, for this we have **grouped our data by **department and then applied the min( ) function to get minimum values for all the columns.

`mydata.groupby('Department').min()`

**Grouping by multiple variables**

**Task: **Retrieve the minimum value for all the variables for each department and gender.

Here have specified 2 variables Department and Gender in our group by

`mydata.groupby(['Department','Gender']).min()`

By default group by is returning minimum values for all the columns. To fetch minimum values for a single column (say Annual Salary ), after group_by we select the column name and then calculate the summary statistic.

`mydata.groupby(['Department','Gender'])['Annual Salary '].min()`

**Calculating multiple summary statistics:**

**Task: **Calculate mean and median values for each Department and Gender combination.

To calculate multiple summary statistics we can use **agg( ) **function. Inside agg we have defined the summary statistics which we want to retrieve.

`mydata.groupby(['Gender','Department']).agg(['mean','median'])`

To learn more about group by in detail you can refer:

**Calculating Percentiles**

Using **quantile( ) **function we can get percentiles or quantiles for our data.

**Task: **Calculate 1st quartile or 25th quantile for all the variables.

`mydata.quantile(0.25)`

__Output:__

Gender 1.0 Annual Salary 35880.0 Performance Rating 3.2 Name: 0.25, dtype: float64

**Task: **Calculate 25th,40th, 8th and 99 quantile for all the variables.

`mydata.quantile([0.25,0.4,0.8,0.99])`

**Task: **Calculate 25th,40th, 8th and 99 quantile for annual salary

`mydata['Annual Salary '].quantile([0.25,0.4,0.8,0.99])`

__Output:__

0.25 35880.0 0.40 46430.0 0.80 71600.0 0.99 127912.0 Name: Annual Salary , dtype: float64

**Cumulative sum**

Let us firstly sort our data by department and annual salary. (We have sorted salaries in ascending order)

`dummydata = mydata[['Department','Annual Salary ']].sort_values(['Department','Annual Salary '], ascending = [True,False])`

**We can calculate cumulative sum using cumsum( ) function.**

Task: Calculate cumulative sum for the above data.

`dummydata['Annual Salary '].cumsum()`

**Group by and cumulative sum**

**Task: **Calculate cumulative sum for each department.

For this we have firstly used **groupby( ) **command and then applied **cumsum( ) **function.

```
dummydata['Department_wise_cumsum'] = dummydata.groupby('Department').cumsum()
dummydata
```

**Dealing with missing values**

A missing value can be created using Numpy's .nan function. So let us firstly import Numpy as np

`import numpy as np`

__Dataset:__

Let us create a dataset where missing values are created by np.nan

```
data1 = {'Product_category': ['Makeup', 'Biscuits', 'Household_Cleaning', 'Cold_drinks'],
'Sales': [10000, 2000, 17891, np.nan],
'Returns' : [678, np.nan, 678,234]}
data1 = pd.DataFrame(data1)
data1
```

__isnull ( ) function__

isnull( ) function returns True when a missing value is encountered, otherwise returns False.

`data1.isnull()`

For index 1 and 3 we have Returns and Sales value as True respectively.

Applying isnull( ) on a column

`data1.Sales.isnull()`

In the data, in sales column we had a missing value at index =3 thus we have a True at that position.

__Filtering for rows having missing values.__

**Task: **Filter for rows where Sales values are missing.

We can filter the rows using .loc and apply our Boolean vector of True and False (which we have got the in the previous code)

`data1.loc[data1.Sales.isnull(),]`

__notnull ( )__

notnull is complementary of isnull( ) .When the data is missing then notnull( ) returns a False, while for non-missing data it returns True.

__Filtering for rows having non-missing values.__

**Task: **Filter for rows where Sales values are not missing.

We can filter the rows using .loc and apply our Boolean vector of True and False i.e. data1.Sales.notnull()

`data1.loc[data1.Sales.notnull(),]`

__Getting total number of missing values.__

Using sum( ) and isnull( ) function we can get the total number of missing values in a column

```
sum(data1.Sales.isnull())
data1.Sales.isnull().sum() #Alternatively
```

Output: 1

__Getting total number of non-missing values.__

Using sum( ) and notnull( ) function we can get the total number of non-missing values in a column

```
sum(data1.Sales.notnull())
data1.Sales.notnull().sum()
```

Output: 3

__Filling missing values__

We can fill missing value in a data using fillna( ) function.

`data1.fillna(value="Missing")`

__Dropping missing values__

Using dropna( ) function we can drop the rows having missing values.

By specifying how = 'all' we are telling Python to delete the rows where all of the columns have missing values (that is entire row is full of NaN)

`data1.dropna(how='all')`

Since we did not have any row where all columns have missing values thus no row got deleted.

By specifying how = 'any' we are telling Python to delete the rows where at least one missing is present.

`data1.dropna(how='any')`

In the above 2 codes we are considering all the columns for missing values. To consider only some of the columns we define **subset = **['list of columns to be considered']

For the code chunk below, those rows will be deleted where there is missing value is Sales column.

`data1.dropna(subset = ['Sales'],how = 'all')`

In the following code, those rows will get deleted where there is missing value in either of the 2 columns (as how = "any")

`data1.dropna(subset = ['Sales','Returns'],how = 'any')`

**inplace = True**

By default Python does not make changes in the original dataset as inplace = False. **To make changes in the original data we need to specify inplace = True**

```
data1.dropna(subset = ['Sales','Returns'],how = 'all',inplace = True)
data1
```

**Dealing with duplicate values**

Let us create a dataset which we will use

```
data = pd.DataFrame({
"Product_Category" : ["Makeup","Makeup","Makeup","Cold_drinks","Sauces","Sauces"],
"Sales" : [15000,15000,10000,23454,7800,9000]})
data
```

**duplicated( ) function**

Pandas' duplicated( ) function when used with a dataframe returns a boolean series where True indicates that entire row has been duplicated.

`data.duplicated()`

In the above dataset our row at index 1 (second row) is a duplicate thus duplicated( ) has returned a series where second element is True and rest are False (False denotes that the value is a unique value)

We can use the above boolean series to filter our dataset: To get a dataset with only duplicates :

`data.loc[data.duplicated(),:]`

By default Python considers first occurence as unique occurence (ie. row with index 0) and all other repetitions as duplicates (row with index 1). This is because by default **keep = "first"** in duplicated( ) function.

To keep the last occurrence as unique and all others as duplicate we set **keep = "last"**

`data.duplicated(keep = "last")`

Note that now row with index 0 has True (i.e. it is a duplicate) while row with index 1 is False (it is being considered as unique)

`data.loc[data.duplicated(keep = "last"),:]`

When keep = False then Python treat all of the repetitions (including first and last occurrence as duplicates)

`data.duplicated(keep = False)`

**Dropping duplicate values**

Using pandas' drop_duplicates( ) we can drop the duplicate values.

In the following code keep = "first" means first occurrence (at index 0 ) would be a unique value and hence would not be dropped.

`data.drop_duplicates(keep = "first")`

keep = "last" means last occurrence (at index 1) would be a unique value and would not be dropped.

`data.drop_duplicates(keep = "last")`

keep = False means all occurrence (at index 0 and 1) would be a treated as duplicates and hence would be dropped.

`data.drop_duplicates(keep = False)`

**Finding duplicates on the basis of some columns:**

To find duplicated on the basis of Product_Category column we apply duplicated function to the single column and not on entire data:

`data.Product_Category.duplicated(keep =False)`

Note: We have defined keep = False. Since Makeup and Sauces are repeated thus we have a True corresponding to their indices.

**Task:** Filter for those product categories which are unique.

In the code below we are filtering for duplicate product categories using duplicated( ) function and have then provided a negation using ' **~ ' symbol.**

`data.loc[~data.Product_Category.duplicated(keep =False),:]`

**Creating pivot tables**

We can create pivot tables using pandas' pivot_table function.

__Syntax__

__Syntax__

```
data.pivot_table(values = "list of column names for aggregation ",
index = "List of categorical column for rows",
columns="List of categorical column for column",
aggfunc = [List of functions to be evaluated],
margins = [True or False for getting grand total i.e. margin],
margins_name="Column name for Grand Total or Margins")
```

**Task 1: **Get average salary for each department.

`mydata.pivot_table(values = "Annual Salary ",index = "Department")`

By default Python calculates averages for numeric columns. Thus to get an average salary we have defined values = "Annual Salary "

We want to show the data for each department in rows : thus we defined index = "Department"

**Task 2: **Get median salary for each department.

To get median salary we have define **aggfunc = np.median **in the code below.

```
import numpy as np
mydata.pivot_table(values = "Annual Salary ",index = "Department",aggfunc = np.median)
```

**Task 3: **Get average and median salary for each department.

To calculate multiple statistics we have defined the **list of functions [np.median , np.mean] in aggfunc**

`mydata.pivot_table(values = "Annual Salary ",index = "Department",aggfunc = [np.median,np.mean])`

**Task 4: **Get average and median salary for each department and gender.

To create a 2X2 view of department and age we have defined **columns = "Gender".**

**Note:** If we wanted we could have specified index = ["Department" ,"Gender"] but that would have resulted in a long table instead of a 2X2 table.

`mydata.pivot_table(values = "Annual Salary ",index = "Department",columns="Gender",aggfunc = [np.median,np.mean])`

To learn about pivot tables in detail you can refer:

**Creating dummy variables**

**To understand what are dummy variables in detail you can refer: **__What are dummy variables and creating dummy variables in Python.__

Using pandas' **get_dummies**( ) function we can create dummy variables with a single line of code.

Let us create another copy of our data

`data2 =mydata.copy()`

Using pandas' get_dummies( ) function we can create dummy variables with a single line of code.

For each department name have added a prefix **"Dep"**

`pd.get_dummies(data2.Department,prefix = "Dep")`

get_dummies( ) only creates dummy variables. To append it in our data we use pandas' concat function:

Let us firstly save our dummy variables in a dataset.

`dummy_variables = pd.get_dummies(data2.Department,prefix = "Dep")`

We now concatenate our original data using pd.concat( ) , by defining axis = 1 or axis = "columns" we are telling Python to add the columns horizontally (and not append them as rows).

```
pd.concat([data2,dummy_variables],axis = 1)
#alternatively
pd.concat([data2,dummy_variables],axis = "columns")
```

When Dep_Consulting = 1 and Dep_Technology is 0 then it is self-implied that dep_outsourcing will be 0. Thus in this case we only need 3-1 = 2 dummy variables. We can drop the first column by specifying **drop_first = True.**

```
dummy_variables = pd.get_dummies(data2.Department,prefix = "Dep",drop_first=True)
pd.concat([data2,dummy_variables],axis = "columns")
```

**Merging or joining datasets**

Let us consider 2 datasets data1 containing sales and data2 containing returns for various product categories

```
data1 = pd.DataFrame({
'Product_category': ['Makeup','Medicines','Syrups'],
'Sales': [20000,6200,70000]})
data2 = pd.DataFrame({
'Product_category': ['Makeup','Vegetables','Medicines'],
"Returns" : [2300,630,2400]})
```

**Inner Join**

Inner Join takes common elements in both the tables and join them. By default pd.merge( ) returns inner join as output.

`data1.merge(data2,on = "Product_category")`

alternatively, you can specify how = "inner"

`data1.merge(data2,on = "Product_category",how ="inner")`

**Output Explanation: **Since both the data have only 2 product categories in common : Makeup and medicines thus we have only 2 rows in our output.

**Left Join**

Left join takes all the records from left table (specified before merge keyword) irrespective of availability of record in right table (specified inside parenthesis of merge)

If the record is not available in the right table then missing values get generated for such records.

`data1.merge(data2,on = "Product_category",how ="left")`

**Output Explanation:** In** data1 (left table) **we have 3 product categories available thus we have only those 3 categories available in output. Since there is no information about **Syrups in data2 (right table) **thus we have NaN (missing values) generated for Returns for it.

**Right Join**

Rihgt join takes all the records from right table (specified inside parenthesis of merge) irrespective of availability of record in left table (specified before merge keyword)

If the record is not available in the left table then missing values get generated for such records.

`data1.merge(data2,on = "Product_category",how ="right")`

**Output Explanation:** In** data2 (right table) **we have 3 product categories available thus we have only those 3 categories available in output. Since there is no information about **Vegetables in data1 (left table) **thus we have NaN (missing values) generated for Sales for it.

**Outer Join or Full Join**

Outer join or Full join takes all the records from both the tables. Missing values get generated if a record is available in only one table.

`data1.merge(data2,on = "Product_category",how ="outer")`

**Output Explanation:** In** data1 (left table) and data2 (right table) **we have 5 unique product categories available thus we have 5 categories available in output.

Since there is no information about **Syrups in data2 (right table) **thus we have NaN (missing values) generated for Returns for it.

Similarly there is no Sales information available for Vegetables.

**Anti - Join**

Anti join comprises of those records which are present in one table but not in other.

Let us firstly understand what is **indicator =True** means in merge( ) function:

`data1.merge(data2, on = "Product_category",how = "outer",indicator=True)`

In **_merge** column following the meaning of its values:

both: Record is available in both the tables

left_only: Record is available only in left table(data1)

right_only: Record is available only in right table (data2)

Let us save this result in a new dataset called **merged**

`merged = data1.merge(data2, on = "Product_category",how = "outer",indicator=True)`

**Anti join 1: **

In the following code we are fetching those rows which belong to DATA1 and are not present in DATA2 by filtering for rows where _merge column has values: "left_only"

`merged.loc[merged["_merge"] == "left_only",:]`

**Output Explanation:** In** data1 (left table) Syrups** is present whose information is absent **in data2 (right table) **thus we have only that row in the output.

**Anti join 2: **

In the following code we are fetching those rows which belong to DATA2 and are not present in DATA1 by filtering for rows where _merge column has values: "right_only"

`merged.loc[merged["_merge"] == "right_only",:]`

**Output Explanation:** In** data2(right table) Vegetables **is present whose information is absent **in data1 (left table) **thus we have only that row in the output.

**To understand more about joins : one to many mapping, many to many mapping , when variables to be joined have different names etc. refer to this detailed tutorial:**

**Concatenating and appending datasets**

**Using pd.concat( ) **

Using pd.concat( ) we can append our data row-wise as well as column-wise.

Let us consider 3 datasets:

```
data1 = pd.DataFrame({'Product_category': ['Makeup','Medicines','Syrups'],
'Sales': [20000,6200,70000]})
data2 = pd.DataFrame({'Product_category': ['Biscuits','Vegetables','Fruits'],
'Sales': [18000,9000,30000]})
data3 = pd.DataFrame({'Product_category': ['Cold_Drinks','Chocolates','Books'],
'Sales': [14000,23000,14000]})
```

Let us concatenate the data. **By default pd.concat( ) appends the data by rows.**

`pd.concat([data1,data2,data3])`

Note the indices for the resultant dataset: They are 0, 1 and 2. When another data is appended below it then indices are taken for another dataset.

By default pd.concat( ) concatenates data by row. i.e. axis = "index" or axis = 0.

To add columns to the data we define axis = "columns" or axis = 1

`pd.concat([data1,data2,data3],axis = "columns") #or axis = 1`

**Using append( ) function**

Pandas' append function is used only to append the datasets row-wise. You cannot append them column-wise.

Let us consider following 2 datasets:

```
data1 = pd.DataFrame({'Product_category': ['Makeup','Medicines','Syrups'],
'Sales': [20000,6200,70000]})
data2 = pd.DataFrame({'Product_category': ['Biscuits','Vegetables','Fruits'],
'Sales': [18000,9000,30000],
"Returns" : [2300,630,2400]})
```

To learn in detail about various options available in pd.concat and pd.append( ) refer to this tutorial:

## ความคิดเห็น