Merging SAS datasets (Sections 5.4 & 5.5)

When you have 2 or more datasets containing data on the same subjects, you can combine the datasets by merging them together. Merging is different than stacking or interleaving because (1) usually the datasets have only one or two variables in common and (2) the observations match, or correspond to the same subject.

Syntax

DATA merged_dataset;
  MERGE data1 data2 ... ;
  BY  byvar1 ...;

RUN;


Example 1:1-1 Match Merge
data1:	
ID	Class	Name	 
1	 A	John	   
2	 C	Jill	   
3	 B	Paul 	 
4	 C	Andrea

data2:	
ID	Test1	Test2
1	 93	 85  
2	 86	 75 
3	 74	 88
4	 85	 93 
 
DATA merged;
  MERGE data1 data2;
  BY id;
RUN;

merged:
ID	Class	Name	Test1	Test2 
1	 A	John	 93	 85  
2	 C	Jill 	 86 	 75 
3	 B	Paul	 74	 88 
4	 C	Andrea	 85	 93 
Note that the variable ID identifies which observations on each of the datasets should be merged together. For this example, there is only one observation from each dataset to be joined. This is called a 1-1 match merge. Each of the datasets MUST be sorted on the BYVAR before the MERGE can work.





Example 2: 1-many Match Merge
data1:	
ID	Class	Name	 
1	 A	John	   
2	 C	Jill	   
3	 B	Paul 	 
4	 C	Andrea

data2:	
ID	Test	Score	 
1	 1	 93
1	 2	 85	    
2	 1	 86	   
3	 2	 74
3	 1	 82
3	 3	 81	  
4	 2	 85
4	 1	 93	   
 
DATA merged;
  MERGE data1 data2;
  BY id;
RUN;

merged:
ID	Class	Name	Test	Score
	 
1	A	John	 1	 93
1	A	John	 2	 85	    
2	C	Jill	 1	 86	   
3	B	Paul	 2	 74
3	B	Paul 	 1	 82
3	B	Paul 	 3	 81	  
4	C 	Andrea	 2	 85
4	C 	Andrea	 1	 93
Once again, the ID variable found in both datasets identifies which observations to merge together. For this example, data1 has only 1 observation per unique ID, but data2 has between 1 and 3. This is called a 1-many match merge. SAS first reads an observation from data1 and then looks for a matching observation from data2. SAS continues to ready observations from data2 until it finds the first non-matching observation. SAS then reads a new observation from data1. Again, the datasets must be sorted on the BYVAR.

If there are more than one observation per ID in first dataset, SAS first does a 1-1 match merge and then uses the last observation in data1 to match to the remaining observations in data2. Or vice-versa if SAS encounters the last observation in data2 first. This is usually NOT what you want to do, so you may have to specify additional BYVARs or check your data for errors before merging.



Example 3:1-1 Match Merge with duplicate variables
data1:	
ID	Class	Name	 
1	 A	John	   
2	 C	Jill	   
3	 B	Paul 	 
4	 C	Andrea

data2:	
ID	Class	Test1	Test2
1	lab1	 93	 58  
2	lab3	 86	 92 
3	lab2	 74	 88
4	lab3	 85	 68 
 
DATA merged;
  MERGE data1 data2;
  BY id;
RUN;

merged:
ID	Class	Name	Test1	Test2 
1	lab1	John	 93	 85  
2	lab3	Jill 	 86 	 75 
3	lab2	Paul	 74	 88 
4	lab3	Andrea	 85	 93 
Note that the variable Class appears both in data1 and data2. When a variable appears in more than one dataset and is not the BYVAR, SAS keeps the value found in the last dataset. This is probably something you don't want to do, so you should rename non-BYVARs that are found in more than one dataset so that this situation does not occur.


Example 4:Match Merge with multiple BYVARs
data1:	
ID	Class	Name	 
1	 A	John
1	 C	Sandy
2	 A	Carol
2	 B	Gail
2	 C	Jill	   
3	 B	Paul 
4	 B	Ryan	 
4	 C	Andrea

data2:	
ID	Class	Test1	Test2
1	  A	 93	 85  
1	  c	 85	 93
2	  A	 86	 92 
2	  B	 75	 83
2	  C 	 94	 95
3	  B	 74	 88
4	  B	 89	 98
4	  C	 85	 68 
 
DATA merged;
  MERGE data1 data2;
  BY id class;
RUN;






merged:
ID	Class	Name	Test1	Test2 

1	 A	John	 93	 85  
1	 C	Sandy	 .	 .  
1	 c	 	 85	 93  
2	 C	Jill 	 86 	 92 
2	 C	Jill 	 75 	 83 
2	 C	Jill 	 94 	 95 
3	 B	Paul	 74	 88 
4	 C	Andrea	 89	 98 
4	 C	Andrea	 85	 68 

When merging with multiple BYVARs, SAS simply finds unique combinations of the BYVARS and then does a 1-1 or 1-many match merge. Note that the BYVARs must match exactly. This is especially important when dealing with character variables. Note that in data2, the 2nd observations has the value of class as a lower case 'c' instead of an upper case 'C' as found in data1. Even though these two observations should probably match, they cannot because 'c' is not equal to 'C'.

Again, the observations in each dataset must be sorted by all the BYVARS in the order listed.