Subsetting Data


The IF and WHERE statements can be used to subset a dataset. This is useful when the original dataset is large and you only need to work with a part of it. SAS operates more efficiently with smaller datasets.

Syntax

IF statement:
IF expression ;
WHERE statement:
WHERE expression;
The expression or condition part of the statement is similar to that of the IF-THEN-ELSE statement. When no THEN or THEN-ELSE follows the expression, the dataset only keeps observations satisfying the condition of the expression.

For example,

IF 20 < age < 40;
This IF statement would only keep observations where the value of the variable Age is between 20 and 40.
If the goal was not to keep specific observations, but to remove them, a IF-THEN statement can be used with the command DELETE to remove them.

Example,

IF 20 < age < 40 THEN DELETE;
Now the datasets will remove any observations where the value of Age is between 20 and 40. Obviously, this can also be accomplished by using:  IF NOT (20 < age < 40); . Often one is more useful in a particular situations.

The WHERE statement is used just like the IF statement, with one exception. The IF statement is processed AFTER SAS has already read in the entire dataset and the WHERE statement is processed WHILE SAS is reading in the dataset. So, if the dataset being subsetted is large, the WHERE statement only requires that SAS read in the observations that satisfy the expression of the WHERE statement. The IF statement would first read in all the observations of the large dataset and then subset it to satisfy the IF statement. Since the WHERE statement is processed as SAS is reading in data, it can only be used when reading in a dataset.

Example:

DATA subdata;
  SET perm.bigdata;
  WHERE Sex = "Female";
  
RUN;
If the datasets perm.bigdata contained data from both male and female subjects, the new dataset subdata would only read in the observations with Sex="Female".
Only one WHERE statement can be used in a dataset--if there are more than one, then the last one is used. Sometimes it is convenient to use separate statements for a complicated conditional expression. The WHERE statement has a companion statement called the WHERE ALSO statement, which allows additional conditions to be processed with the WHERE statement. It is used just like the WHERE statement, with only the additional keyword ALSO used. More than one WHERE ALSO statements can be used in a data step.

Example:

DATA subdata;
  SET perm.bigdata;
  WHERE Sex = "Female";
  WHERE ALSO 20 < age < 40;
  
RUN;
Now, the only observations satisfying both Female between the ages of 20 and 40 will be read in to the new dataset subdata.
One additional feature of the WHERE statement is that it can be used both for data and proc steps. So if you want to run the data through a procedure, but only want a subset of the data to be processed, you can use a WHERE statement. Example,
PROC PRINT data=perm.bigdata;
  WHERE Sex = "Female";
TITLE "Female subjects only";
RUN;
Only the observations with Sex = "Female" will be printed.