Reading a dataset with the BY statement


SAS reads data in one observation at a time, in the order given. Sometimes it is advantageous to have the data read in a particular order. This can be done using PROC SORT and the BY statement.

Example:

PROC SORT DATA=original;
  BY State;
RUN;

DATA newdata;
  SET original;
  BY State;
RUN;

PROC PRINT data=newdata;
RUN;
This example sorts the dataset "original" by the variable State. Assuming State is a character variable with the two letter abbreviations for the US states, the dataset is read in alphabetical order into the dataset "newdata". When the dataset is is printed, it is still in alphabetical order.
Notes:
  1. If you want the sorted dataset to have a name other than it's original name, use the OUT option on the PROC SORT statement line. You then need to use the new dataset name in the DATA and PROC steps.
    PROC SORT DATA=original OUT=SortedData;
  2. Both numeric and character variables can be sorted.
  3. If you wanted to sort the dataset in reverse order, use the keyword DESCENDING after the BY and before the variable name. Use the same BY statement in the DATA step.
    BY DESCENDING State;
  4. To sort by more than one variable, simply list them all on the same BY line. The DESCENDING keyword can be used in front of any variable that you want sorted in reverse.
    BY State Zipcode;
    Put the variables in the order that makes the most sense. For this example, it makes more sense to first sort by state and then by zipcode because each state has many different zipcodes. But not as much sense to sort by zipcode and then state, since no zipcode can have more than one state.

    It is not necessary to use all the variables a dataset is sorted by in the BY statement of a DATA step. However, if you skip one, you must leave out all variables to the right of it on the BY line of the PROC SORT. For example:

    PROC SORT DATA=original;
      BY State Zipcode Plus4;
      
    DATA newdata;
      SET original;
      BY Zipcode;
    RUN;
    
    This example will not work since State was skipped, which means Zipcode and Plus4 must also be skipped. It would have to be coded as:
    PROC SORT DATA=original;
      BY State Zipcode Plus4;
      
    DATA newdata;
      SET original;
      BY State Zipcode;
    RUN;
    
  5. A BY statement can also be used in the PROC PRINT. This would cause PROC PRINT to separate each value of State by a BY line (unless this option was turned off). This will be discussed more at a later time.
A BY statement in a DATA step gives you two automatic SAS variables: FIRST.var and LAST.var. These are boolean variables that contain the values "1" or "0". FIRST.var is a "1" if the current observations is the first observation containing the current value of the variable "var", and "0" otherwise. Similarly, LAST.var is "1" if it is the last observations with the current value of "var" and "0" otherwise.

Example,

State	Zipcode	  FIRST.State	LAST.State

CA 90066 1 0 CA 90501 0 0 CA 91436 0 0 CA 95030 0 0 CA 95054 0 1 MA 01720 1 0 MA 01887 0 1 MD 21014 1 1 NY 10019 1 0 NY 10920 0 0 NY 11763 0 0 NY 12561 0 1 WA 98052 1 0 WA 98188 0 1
A simple way to check to see if you have only one observation for a particular BY variable is to see if FIRST.var=LAST.var=1. For example, assume you have a dataset that contains grades for a class and each observation is a student with an unique ID. Then if you sort BY ID, you can check
IF NOT(FIRST.ID=LAST.ID=1) THEN MultIDs=1;.
If you find any observations where the condition is true, that indicates that you have multiple observations for at least 1 student ID. You can then print these observations by using a WHERE statement in a PROC PRINT.
PROC PRINT;
  WHERE MultIDs=1;
RUN;

One last note about SAS automatic variables is that they exist only in the datastep.