• Ekta Aggarwal

Transposing / Reshaping data in Python (wide and long formats)


In this tutorial we shall try to understand what are long and wide format datasets and how to convert them from wide to long and long to wide in Python.


Key concepts:


Long format:


Following is the example of a long data where both the attributes- department and month information are present as seperate columns and we have a different column for values.







Wide format:


In a wide format at least one of our attributes is present as multiple columns (here we have seperate columns for each month) and values are filled in these 2X2 table (where values are our number of cases corresponding to each department and month).




Many a times we need to transpose our data i.e. switch from long or wide format or vice - versa.


Transposing data in Python


Firstly we will load pandas library for transposing our data.

import pandas as pd

Long to wide - Using pivot function:


For transposing data we shall use the following data:

df = pd.DataFrame({"Year" : [2019,2020,2021,2019,2020,2021,2019,2020,2021],
    "Department" : ['HR','HR','HR','Sales','Sales','Sales','Marketing','Marketing','Marketing'],                 
                 "Employee_Count" : [30,40,50,100,120,125,200,180,190],
                  "Average_Salary" : [10000,12000,14000,20000,25000,27000,22000,21000,20000]})
                  
print(df)

Syntax of pivot function:

Following is the syntax of pivot function taken from pandas' documentation

dataframe_name.pivot(index, columns, values)
index: Column to use to make new frame’s index. 
columns: Column to use to make new frame’s columns.
values: Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.


In the following code, we are specifying that 'Year' Column in our dataframe df, should be our index i..e, they will represent new dataframe's rows.

While values in 'Department' column would be individual columns eg. HR, Marketing and Sales would be 3 different columns.

If we do not specify anything in 'values' then all the other columns are populated in the new dataframe's values, for each row and columns. Thus, here Employee_Count and Average_Salary are used to fill the values.

df.pivot(index = 'Year',columns = 'Department')



In the following code, we are specifying that 'Year' Column in our dataframe df, should be our index i..e, they will represent new dataframe's rows.

While values in 'Department' column would be individual columns eg. HR, Marketing and Sales would be 3 different columns.

We are specifying that we need only 'Employee_Count' to be shown as values for each rows and columns in the new dataframe.

df.pivot(index = 'Year',columns = 'Department', values = 'Employee_Count')



Wide to long - Using melt function:


For transposing data we shall use the following data:

df = pd.DataFrame({"Year" : [2019,2020,2021],
    "HR" : [30,40,50],                 
     "Marketing"  : [200,180,190],
    "Sales" : [100,120,125]})
    
    print(df)

Syntax of melt function:

Following is the syntax of melt function taken from pandas' documentation

pd.melt(dataframe, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
id_vars: Column(s) to use as identifier variables for each row.
value_vars: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
value_name: Name to use for the ‘value’ column.
var_name: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
col_level: If columns are a MultiIndex then use this level to melt.
ignore_index: If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.


In the following code, we are specifying:

id_vars = 'Year' as it will have different years as different rows. Note Year column in the new dataframe can have duplicate values as we are going from wide to long format.

value_vars = ['HR','Sales','Marketing] tells Python that we need these 3 columns to be used as a single new column or variable (see column 2 in the below dataset)

It has automatically detected that the values inside the cells in original dataframe will form the values columns.

pd.melt(df,id_vars = 'Year',value_vars = ['HR','Marketing','Sales'])

Note we have renamed our second column of new variable as Department by specifying var_name = 'Department'

Similarly, we have renamed the Value column by specifying value_name = 'Employee_Count'

pd.melt(df,id_vars = 'Year',value_vars = ['HR','Marketing','Sales'],value_name = 'Employee_Count', var_name = 'Department')

Tags: