Filtering data in SAS (using WHERE and IF keywords)
Updated: Jan 21, 2021
If you are familiar with SQL then you must be aware 'WHERE' keyword is used to retrieve those rows in a dataset which satisfy a particular condition. Luckily, SAS offers exactly same utility with its where keyword
Buy one get one offer in SAS! You can use the keyword IF to carry out the same operation.
For this tutorial we shall be using the already built in SAS dataset BASEBALL from the library SASHELP.
Firstly let us initialize out libpath:
libname mylib '/home/u50132927/My_datasets';
Filtering numeric columns in SAS...
Let us filter the rows where number of hits (nhits) are greater than 75.
'data' statement creates a new dataset named 'subset' in my library. 'set' statement copies the already existing dataset from sashelp.baseball. Using WHERE keyword you specify you condition.
data mylib.subset; set sashelp.baseball; where NHITS > 75; run;
A similar output can be ascertained using IF keyword:
data mylib.subset; set sashelp.baseball; IF NHITS > 75; run;
You can also write GT instead of '>' symbol
data mylib.subset; set sashelp.baseball; where NHITS GT 75; run;
In a similar fashion you can use the following operators for subsetting:
Filter the rows where number of hits is between 75 and 100.
data mylib.subset; set sashelp.baseball; where NHITS between 75 and 100; run;
Filtering string / character columns in SAS...
Let us filter the rows where team is Cleveland:
data mylib.subset; set sashelp.baseball; where Team = "Cleveland"; run;
TASK: In the above code, replace Cleveland by CLEVELAND! Did you notice something?
Resultant dataset has 0 rows! Why?
Because SAS is case sensitive when something is mentioned in quotes.
Filtering the rows where team is either Cleveland or Seattle.
We shall use IN operator to filter multiple values. It can be used even with numeric columns.
data mylib.subset; set sashelp.baseball; where Team in ("Cleveland","Seattle"); run;
To filter for all the teams except Cleveland and Seattle we use NOT IN keyword in WHERE statement:
data mylib.subset; set sashelp.baseball; where Team not in ("Cleveland","Seattle"); run;
What if I want to filter the information of the players whose NAME start with Tr? For this we use LIKE keyword:
data mylib.subset; set sashelp.baseball; where NAME like "Tr%"; run;
In above code, % means any number of characters can now follow 'Tr'
Similarly to find the information of players whose name end with Al we use:
where NAME like "%Al";
To get the list of players whose name contains 'Al' in between, write:
where NAME like "%Al%";
Filtering for multiple conditions...
Filter the rows where players belong to either of the team Seattle or Cleveland and number of hits is less than 75:
For this we shall use AND keyword
data mylib.subset; set sashelp.baseball; where NHITS < 75 AND Team in ("Cleveland","Seattle"); run;
Filter the rows where players have number of hits less than 75 or they belong to either of the team Seattle or Cleveland: For this we shall use OR keyword
data mylib.subset; set sashelp.baseball; where NHITS < 75 OR Team in ("Cleveland","Seattle"); run;
Filtering with missing data...
Let us firstly create a dataset names 'missing' containing missing observations in column 'team' when number of hits is less than 75
data mylib.missing; set sashelp.baseball; if nhits<75 then team = "" ; run;
Filter the data where there are missing observations in column TEAM:
data mylib.subset; set mylib.missing; where team IS MISSING; run;
For this we use the IS MISSING keyword of SAS.
Filter the rows where team name is not missing, we use IS NOT MISSING keyword:
data mylib.subset; set mylib.missing; where team IS NOT MISSING; run;
Which one to use: IF or WHERE?
Case 1: When you are using already existing columns for filtering:
Although in such a scenario where you only need to filter an already existing column, both where and if lead to same results, but on working with large datasets WHERE is more efficient than IF.
Following is the comparison of the log, when WHERE and IF statements are used.
WHERE firstly subsets the data (BASEBALL) and then reads it, while IF firstly reads all of the observations from the dataset (BASEBALL) and then filters it. Thus using WHERE statement in such scenarios is preferable.
Case 2: When you are using newly created columns for filtering:
Let us create a new column 'NEWCOL' = NHITS + 10 and try to fetch the rows where NEWCOL < 85:
Using where statement in such a scenario will cause an error because SAS searches for the variable 'NEWCOL' in the dataset BASEBALL to filter, but it could not find it.
However, if you run the following command, it shall publish the output, because you are not creating any new column.