top of page
Writer's pictureEkta Aggarwal

dplyr elaborated!

dplyr is one of the most popular libraries in R which is widely used for data manipulation and data wrangling. In this tutorial you will learn about how to prepare data using dplyr with elaborated examples.


Let us firstly install and load dplyr.

install.packages("dplyr")
library(dplyr)

Dataset

In this tutorial we will use R's inbuilt dataset: mtcars.

View(head(mtcars))

It contains information of mileage(mpg), displacement (disp), horsepower(hp), number of cylinders(cyl), automatic or manual (am) etc. of 32 cars.


Selecting some of the columns


Using dplyr's select( ) function we can select and remove the columns:


Syntax:

select(data set name, columns to be selected)


In the following code we are selecting only 3 columns: mpg, disp and am from mtcars dataset

View(select(mtcars,mpg,disp,am))

Pipe operator

We can also use pipe operator i.e. %>% which is specifically used in dplyr. It helps in writing easily readable codes which are less messy. It forms a sequence of codes which will be executed one after another.


Firstly we mention the dataset name (i.e. mtcars) followed by pipe operator (i.e. %>% ) and then our function. Since we have mentioned the dataset name in the beginning thus in select( ) function we don't need to re-specify it.

View(mtcars %>% select(mpg,disp,am))



Task: Removing columns mpg, disp and am from the data.

To remove the columns we can mention their names along with a hyphen ( - ) in select statement.

View(select(mtcars,-mpg,-disp,-am))

Alternatively

View(mtcars %>% select(-mpg,-disp,-am))


Rearranging columns


We can also rearrange the columns using select( ) statement.


Firstly we specify our list of columns which we need in the beginning followed by the keyword everything( ) which denotes all other columns should come after them.


Task: Make cyl and am as first two column and then rest should come later.

View(mtcars %>% select(cyl,am,everything()))

Renaming columns


Method 1: Using rename( ) function

We can rename( ) datasets using rename(datasetname, new_name = oldname)


Task: Rename mpg to mileage and hp to horsepower


View(mtcars %>% rename(mileage = mpg, horsepower = hp))
#Alternatively
View(rename(mtcars,mileage = mpg, horsepower = hp))

Method 2: Using select( ) function


You can also specify new column name = old column name in select statement, followed by everything( ) . But this method will change the order of the columns.

View(mtcars %>% select(mileage = mpg,horsepower = hp ,everything()))

Getting distinct values


Using distinct( ) function we can find distinct values or combinations for a given set of variables.


Syntax:

distinct(dataset name, columns for which we want distinct combinations)


Task: Find all the distinct combinations for columns am and cyl

mtcars %>% distinct(am,cyl)
#Alternatively
distinct(mtcars,am,cyl)

n_distinct( ) returns the number of distinct values.

n_distinct(mtcars$am,mtcars$cyl)

Output: 6


Removing duplicates


We can remove duplicate rows in our data by:

View(mtcars %>% distinct())

Here all the columns are considered to remove duplicate values and get unique rows only.

(Since mtcars did not have any duplicate rows thus nothing was deleted)


Filtering the data


Using filter( ) function we can filter the rows in our dataset where a particular condition is satisfied.


Syntax:

filter(dataset, condition)


Task: Filter for rows where am is 0 i.e. car is manual

View(filter(mtcars,am == 0))
#Alternatively
View(mtcars %>% filter(am == 0))

Not equal to condition in filter

In R not equal to sign is represented by !=

Task: Filter for rows where am is not 0 i.e. car is automatic

View(mtcars %>% filter(am != 0))

Task: Filter for rows where mileage is more than average mileage of all the cars.

mean(mtcars$mpg) will give us the average mileage of all the cars.

View(mtcars %>% filter(mpg > mean(mtcars$mpg)))

Case: When a particular condition is not true.

To exclude the cases when a particular condition is satisfied we write a ! symbol in front of the condition in R.

Task: Filter for rows where mileage is NOT more than average mileage of all the cars.

View(mtcars %>% filter(!mpg > mean(mtcars$mpg)))

Case: Multiple conditions!

AND ( &) - When multiple conditions need to be true at one time we use & operator.

Task: Filter for rows where number of cylinders is 6 and mileage is more than average mileage of all the cars.

View(mtcars %>% filter(cyl == 6 & mpg > mean(mtcars$mpg)))

OR ( | ) - When at least one of the conditions need to be true at then we use | operator.

Task: Filter for rows where either number of cylinders is 6 or mileage is more than average mileage of all the cars.

View(mtcars %>% filter(cyl == 6 |  mpg > mean(mtcars$mpg)))

%in% - To filter for multiple values being included in a vector we use %in% .

Task: Filter for rows where number of cylinders is either 4 or 6.

View(mtcars %>% filter(cyl %in% c(4,6)))

If we do not wish to use %in% then the same code can be written as:

View(mtcars %>% filter(cyl == 4 | cyl == 6))

But good programmers generally make use of %in% operator.


Why pipe operators are preferred?


Task: Filter for the data where am == 0 and select columns mpg, disp, am,cyl and hp

View(mtcars %>% filter(am == 0) %>% select(mpg,disp,am,cyl,hp))

Notice how pipe operator is working : Firstly it selects mtcars, then filters for the rows where am is 0 and then selects the 5 columns.


Without using pipe operator we can write the above code as:

View(select(filter(mtcars,am==0),mpg,disp,am,cyl,hp))

But the above method looks too messy. pipe operator helps in providing easy readability and makes our codes look elegant.


Sorting the data


Using arrange( ) function we can sort the data in. By default arrange( ) sorts the data in ascending order.

Task: Sort the data by am and mpg

View(mtcars %>% arrange(am,mpg))

Above command sorts the data first by am, now where am == 0 then it sorts those rows by mpg in ascending order. Similarly for am == 1.


Descending order: To denote descending order we can write a hyphen ( - ) in front of the column name whose values are needed in descending order.

Task: Sort the data by am in ascending order and mpg in descending order.

View(mtcars %>% arrange(am,-mpg))

Creating new variables


Using mutate( ) function we can create new variables in our data.


Task: Create a new variable mpg/cyl in the data

View(mtcars %>% mutate(new_var = mpg/cyl))

transmute( ) function creates a new column but it returns only single column. It does not include all the other columns from the dataset like mutate( ) .

View(mtcars %>% transmute(new_var = mpg/cyl))

Using mutate_all( ) one can apply one single function on all the columns of the dataset. But it does not create new columns - rather it replaces the values in the actual columns/

Task: Replace all the column values by their mean

View(mtcars %>% mutate_all(mean, na.rm = TRUE))

mutate_if( )

If we want to apply some functions only on some of the columns which specify a particular condition then we need to use mutate_if( )

Syntax:

mutate_if(data set name, condition to be applied on the columns, function to be applied)


Let us create a different data mydata where cyl and am are converted to characters

mydata = mtcars
mydata$cyl = as.character(mydata$cyl)
mydata$am = as.character(mydata$am)

Task: Replace the original values by number of distinct values in the character columns

Using is.character function we can filter for columns having character values.

View(mydata %>% mutate_if(is.character, n_distinct))

mutate_at

mutate_at is used to apply the condition only on selected columns defined in vars( ) statement.


Task: Replace the values in the columns mpg, cyl and wt by their respective means

View(mtcars%>% mutate_at(vars(mpg,cyl,wt), mean, na.rm = TRUE))

Task: Replace the values in the columns whose name ends by 'p' by their respective means

To find columns whose name ends with 'p' we use dplyr's ends_with function

View(mtcars%>% mutate_at(vars(ends_with("p")), mean, na.rm = TRUE))

Similarly to find columns whose name starts with a particular character or pattern, we have starts_with( ) function



Rowwise


If we try to find max(mtcars$mpg * 10, mtcars$disp ) we will get 472, because it is the maximum value in both the vectors. But to do such computations for each row we need to use rowwise( ) function.


Task: Create a new column max_value which returns maximum of mpg*10 or disp for each row.

View(mtcars %>% rowwise() %>%  mutate(max_value = max(mpg*10,disp)))

We can also achieve the same result using if_else ( ) function as follows:


if_else


Syntax:

if_else(condition to be evaluated, value when condition is true, value when condition is false, missing = "value to be used in case of missing values")


Task: Create a new column max_value which returns maximum of mpg*10 or disp for each row.

View(mtcars %>% mutate(max_value = if_else(mpg*10 > disp,mpg*10,disp)))

We have missing = option in if_else( ). Suppose if a row has NA then the condition can't be evaluated and would return NA, thus instead of returning NA we can fill the NA values by a default number.

In the following code we are filling our NA values by 9999

View(mtcars %>% mutate(max_value = if_else(mpg*10 > disp,mpg*10,disp,missing = 9999)))


Summarising the data


Using summarise( ) function we can find various summary statistics for our data.


Task: Calculate mean and median mileage for mtcars

mtcars %>% summarise(mean(mpg),median(mpg))
#Alternatively
summarise(mtcars,mean(mpg),median(mpg))

Output:

mean(mpg) median(mpg)

20.09062 19.2


n( ) function

Using n( ) function is dplyr we can find count of the rows.


Task: Calculate average and median mileage and count of rows.

mtcars %>% summarise(avg_mileage = mean(mpg),median_mileage = median(mpg),count = n())

#Alternatively 

summarise(mtcars,avg_mileage = mean(mpg),median_mileage = median(mpg),count = n())

Output:

avg_mileage median_mileage count

20.09062 19.2 32


summarise_at

IF we want to calculate same summary statistics for multiple variables we can resort to summarise_at( ) function

In vars( ) statement we define the columns on which functions need to be applied.

In funs( ) statement we define the function names.

We can add a suffix to our output column in funs( ) statement.

In the below code we have added suffixes count, avg and minimum in front of the column names


Task: Calculate count, mean and minimum values for columns mpg and disp

mtcars %>% summarise_at(vars(mpg,disp),funs(count = n(),avg = mean(.,na.rm = T),minimum = min))

Output:

mpg_count disp_count mpg_avg disp_avg mpg_minimum disp_minimum

32 32 20.09062 230.7219 10.4 71.1


In the latest version funs( ) has been updated by list( )

where we need to define the functions by a tilde(~) and a dot (. )

Eg. ~n( ) , ~mean(.), ~median(.) etc.


mtcars %>% summarise_at(vars(mpg,disp),
list(~n(),avg  = ~mean(.,na.rm = T),minimum  = ~min(.)))

summarise_all

To apply the functions on all the columns we use summarise_all( )

Task: Calculate count, and mean for all the columns

mtcars %>% summarise_all(list(count = ~n() ,avg =  ~mean(.,na.rm = T)))

summarise_if

To apply the functions on the columns satisfying a particular condition we use summarise_if( )


Let us create a different data mydata where cyl and am are characters

mydata = mtcars
mydata$cyl = as.character(mydata$cyl)
mydata$am = as.character(mydata$am)

Task: Calculate count, and mean for all the numeric columns

mydata %>% summarise_if( is.numeric,list(~n(),avg  = ~mean(.,na.rm = T),minimum  = ~min(.)))

is.numeric is used to filter for numeric columns



Group by


Group by is used to divide our data on the basis of categorical column(s) and then apply the functions on the grouped data to get summary statistics for each group.

Let us understand by an example:


Task: For each combination of am and cyl find average, median and minimum mileage.

In the following code, firstly we are grouping our data by am and cyl. After this we are calculating mean, median and minimum mileage (mpg) for each of the combinations of am and cyl.

Since we have 6 combinations of am-cyl thus our resultant data will have 6 rows.

View(mtcars %>% group_by(am,cyl) %>% summarise(avg_mileage = mean(mpg),
          median_mileage = median(mpg),
           min_mileage = min(mpg)) )                         

Task: For each combination of am-cyl calculate count and average of mileage and disp

View(mtcars %>% 
group_by(am,cyl) %>% 
summarise_at(vars(mpg,disp), 
list(~n(),avg = ~mean(.,na.rm = T))))                          

We can also provide a list of consecutive columns in vars( ) using a colon ie. mpg:hp denotes all columns from mpg till hp.

Task: For each combination of am-cyl calculate count and average of columns from mpg till hp.

To do so we have written vars(mpg:hp) which indicates that all the columns between mpg and hp should be used.

View(mtcars %>% 
group_by(am,cyl) %>% 
summarise_at(vars(mpg:hp), 
list(~n(),avg = ~mean(.,na.rm = T)))  )                       

Do function( )


To do some computations within each group we use do( ) function:

Task: For each combination of am and cyl arrange the rows by mileage(mpg).

d2  = mtcars %>% group_by(am,cyl) %>% do(arrange(.,mpg))
View(d2)   

In the output for each combination of am and cyl, mpg values have been sorted in ascending order.


Joins using dplyr


Let us consider 2 dataframes:

a = data.frame(Employee_ID = 1001:1005,Name = c("A","B","C","D","E"))

b = data.frame(Employee_ID  = 1003:1007,Salary = c(15000,14000,20000,30000,23000))

Our data looks as follows:


Basic Syntax for all the joins:

join_name(left_table,right_table, by = " ")

join_name: Specify the join name (left_join, right_join, full_join, inner_join)

by: List of common column names to be matched.


1. Left join:

Left join takes all the records from left table irrespective of availability of record in right table.

left_join(a,b,by = c("Employee_ID"))

We can also skip the by option provided there are no other columns (except Employee_ID) which have same column name in both the table.

left_join(a,b)

Output:

Employee_ID Name Salary

1 1001 A NA

2 1002 B NA

3 1003 C 15000

4 1004 D 14000

5 1005 E 20000



2. Right join:

Right join takes all the records from Right table irrespective of availability of record in left table.

right_join(a,b,by = c("Employee_ID"))

Output:

Employee_ID Name Salary

1 1003 C 15000

2 1004 D 14000

3 1005 E 20000

4 1006 <NA> 30000

5 1007 <NA> 23000


3. Inner join:

Inner Join takes common elements in both the tables and join them.

inner_join(a,b,by = c("Employee_ID"))

Output:

Employee_ID Name Salary

1 1003 C 15000

2 1004 D 14000

3 1005 E 20000


4. Full join:

Full join takes all the records from both the tables

full_join(a,b,by = c("Employee_ID"))

Output:

Employee_ID Name Salary

1 1001 A NA

2 1002 B NA

3 1003 C 15000

4 1004 D 14000

5 1005 E 20000

6 1006 <NA> 30000

7 1007 <NA> 23000


5. Cross join:

a1 =data.frame(Numbers = 1:3)
a2  =data.frame(Letters = c("a","b","c"))

dataframe a1 had 3 unique numbers and a2 had 3 unique letters. A cross join refers to all 3X3 = 9 permutation combinations of these numbers and letters.


We can achieve a cross join in R using full_join ( ) only. But since we do not have any common variable to achieve a full join thus we need to create a temporary variable in both the variables to join the 2 tables. In the following code our temp variable has the value 1

a1$temp = 1
a2$temp = 1

Now we can do a full join with common key as temp and then we can drop the temp column.

full_join(a1,a2,by = "temp")

Output:

Numbers temp Letters

1 1 1 a

2 1 1 b

3 1 1 c

4 2 1 a

5 2 1 b

6 2 1 c

7 3 1 a

8 3 1 b

9 3 1 c


6. Anti - Join


Anti join comprises of those records which are present in one table but not in other.


Anti join 1:

In the following code we are fetching those rows which belong to a and are not present in b

anti_join(a,b,by = c("Employee_ID"))

Output:

Employee_ID Name

1 1001 A

2 1002 B


Anti join 2:

In the following code we are fetching those rows which belong to b and are not present in a

anti_join(b,a,by = c("Employee_ID"))

Output:

Employee_ID Salary

1 1006 30000

2 1007 23000



INTERSECT, UNION AND UNION ALL


All of these 3 functions take 2 vectors as an input.


INTERSECT:

It returns the common elements available in the 2 vectors. You can consider it like an inner join which returns the common keys.

intersect(a$Employee_ID,b$Employee_ID)

Output:

1003 1004 1005


UNION:

It returns all the elements available in the 2 vectors but keeps only unique values.

union(a$Employee_ID,b$Employee_ID)

Output:

1001 1002 1003 1004 1005 1006 1007


UNION ALL:

It returns all the elements available in the 2 vectors and retains the duplicate values.

union_all(a$Employee_ID,b$Employee_ID)

Output:

1001 1002 1003 1004 1005 1003 1004 1005 1006 1007



Ranking the data


There are various rank functions available in dplyr to rank the observations.


rank( ) function:

In case of ties rank( ) function provides an average rank to all the observations which are causing the tie. In the example below, 200 is being repeated 4 times (its actual rank would have been 2, 3, 4 and 5 if there were no ties) hence we assign an average of the ranks as 3.5 i.e., (2+3+4+5)/4 and next element i.e., 500 is assigned the next rank (i.e., 6, as we had already exhausted ranks 2,3, 4 and 5)


rank(c(100,200,200,200,200,500,600,700))

Output: 1.0 3.5 3.5 3.5 3.5 6.0 7.0 8.0


If we specify ties.method = "first" then it assigns the rank on first come first serve basis for ties.

rank(c(100,200,200,200,200,500,600,700),ties.method = "first")

Output: 1 2 3 4 5 6 7 8


If we specify ties.method = "last" then it assigns the rank on first come first serve basis for ties.

 rank(c(100,200,200,200,200,500,600,700),ties.method = "last")

Output: 1 5 4 3 2 6 7 8


If we specify ties.method = "min" then it assigns the minimum rank out of the ties (i.e., minimum of 2, 3, 4 and 5 i.e., 2)

rank(c(100,200,200,200,200,500,600,700),ties.method = "min")

Output: 1 2 2 2 2 6 7 8


If we specify ties.method = "max" then it assigns the maximum rank out of the ties (i.e., maximum of 2, 3, 4 and 5 i.e., 5)

rank(c(100,200,200,200,200,500,600,700),ties.method = "max")

Output: 1 5 5 5 5 6 7 8


Task: Rank the mtcars dataset on the basis of mpg and sort the dataset as per their ranks.

View(mtcars %>% mutate(Rank = rank(mpg)) %>% arrange(Rank))

dense_rank( )

dense_rank( ) gives equal weightage in case of ties, and for succeeding elements it provides the next rank. i.e., dense rank will never provide ranks as decimals or averages and ranks are always a continuous counting like 1,2,3, etc.


In the following code all 200s are given same rank i.e., 2 and 500 is given the next rank i.e., 3

dense_rank(c(100,200,200,200,200,500,600,700))

Output:

1 2 2 2 2 3 4 5


Task: Rank the mtcars dataset on the basis of mpg (use dense rank ) and sort the dataset as per their ranks.

View(mtcars %>% mutate(Dense_rank = dense_rank(mpg)) %>% arrange(Dense_rank))


row_number( )


row_number( ) functions assigns the row number to all the observations.

View(mtcars %>% mutate(Row_number = row_number())

Kommentare


bottom of page