Reading a dataset with the BY statement
SAS reads data in one observation at a time, in the order given.
Sometimes it is advantageous to have the data read in a particular
order. This can be done using PROC SORT and the BY statement.
Example:
PROC SORT DATA=original;
BY State;
RUN;
DATA newdata;
SET original;
BY State;
RUN;
PROC PRINT data=newdata;
RUN;
This example sorts the dataset "original" by the variable State.
Assuming State is a character variable with the two letter
abbreviations for the US states, the dataset is read in alphabetical
order into the dataset "newdata". When the dataset is is printed, it
is still in alphabetical order.
Notes:
- If you want the sorted dataset to have a name other than it's original name, use the
OUT option on the PROC SORT statement line. You then need to use
the new dataset name in the DATA and PROC steps.
PROC SORT DATA=original OUT=SortedData;
- Both numeric and character variables can be sorted.
- If you wanted to sort the dataset in reverse order, use the
keyword DESCENDING after the BY and before the variable name.
Use the same BY statement in the DATA step.
BY DESCENDING State;
- To sort by more than one variable, simply list them all on the
same BY line. The DESCENDING keyword can be used in front of any
variable that you want sorted in reverse.
BY State Zipcode;
Put the variables in the order that makes the most sense. For
this example, it makes more sense to first sort by state and then
by zipcode because each state has many different zipcodes. But
not as much sense to sort by zipcode and then state, since no
zipcode can have more than one state.
It is not necessary to use all the variables a dataset is
sorted by in the BY statement of a DATA step. However, if you
skip one, you must leave out all variables to the right of it on
the BY line of the PROC SORT. For example:
PROC SORT DATA=original;
BY State Zipcode Plus4;
DATA newdata;
SET original;
BY Zipcode;
RUN;
This example will not work since State was skipped, which means
Zipcode and Plus4 must also be skipped. It would have to be
coded as:
PROC SORT DATA=original;
BY State Zipcode Plus4;
DATA newdata;
SET original;
BY State Zipcode;
RUN;
- A BY statement can also be used in the PROC PRINT. This would
cause PROC PRINT to separate each value of State by a BY line
(unless this option was turned off). This will be discussed
more at a later time.
A BY statement in a DATA step gives you two automatic SAS variables:
FIRST.var and LAST.var. These are boolean variables that contain the
values "1" or "0". FIRST.var is a "1" if the current observations is
the first observation containing the current value of the variable
"var", and "0" otherwise. Similarly, LAST.var is "1" if it is the last
observations with the current value of "var" and "0" otherwise.
Example,
State Zipcode FIRST.State LAST.State
CA 90066 1 0
CA 90501 0 0
CA 91436 0 0
CA 95030 0 0
CA 95054 0 1
MA 01720 1 0
MA 01887 0 1
MD 21014 1 1
NY 10019 1 0
NY 10920 0 0
NY 11763 0 0
NY 12561 0 1
WA 98052 1 0
WA 98188 0 1
A simple way to check to see if you have only one observation for a
particular BY variable is to see if FIRST.var=LAST.var=1. For example,
assume you have a dataset that contains grades for a class and each
observation is a student with an unique ID. Then if you sort BY ID,
you can check
IF NOT(FIRST.ID=LAST.ID=1) THEN MultIDs=1;.
If
you find any observations where the condition is true, that indicates
that you have multiple observations for at least 1 student ID. You can
then print these observations by using a WHERE statement in a PROC
PRINT.
PROC PRINT;
WHERE MultIDs=1;
RUN;
One last note about SAS automatic variables is that they exist only in the datastep.