Chapter 3 Sampling

One can use almost any program to draw a random sample. In Stata, you could write your own basic routine to sample observations, making use of Stata’s random number generator (e.g., gen random=runiform()). However, there is also a useful packaged command, samplepps, that streamlines the process for you and makes it easier to sample with probability proportional to size.
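samplepps does not ship with Stata, so it has to be installed once before you can use it. The line below is a sketch that assumes the command is distributed through the SSC archive; if it is not found there, findit samplepps should locate it.

```stata
* Assumption: samplepps is available from the SSC archive
ssc install samplepps
```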

Let’s start by opening a dataset that is a list of most of the local communities (towns and cities) in the United States.

use http://people.umass.edu/schaffne/communities.dta

Now, let’s say that we wanted to use these communities to conduct a cluster sample. In essence, we want to select 100 communities (the idea is that we would then travel to those communities and randomly select households to interview). samplepps is specifically designed to conduct sampling proportional to size. In this case, we have the population of each community as one of the variables in our dataset (variable name: population). So, to take our sample of 100 communities, we could do the following:

samplepps insample, ncases(100) size(population)

This creates a new variable called insample which equals 1 if the observation in the dataset was selected into your sample (so there should be 100 cases with 1s).
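A quick way to confirm that the draw worked is to count the selected cases. This is just a check on the variable created above; it assumes insample equals 1 for selected communities, as described.

```stata
* Should report 100 if the sample was drawn as requested
count if insample == 1

* Frequency table of the selection indicator
tabulate insample, missing
```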

But what if you don’t want to sample proportional to the size of the community? That is, what if you want each community to have an equal chance of being selected? Well, this is easy enough; you just have to trick samplepps by giving all the cases in your dataset the same size. So you could simply do this:

gen size=1

samplepps insample2, ncases(100) size(size)
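For comparison, here is a minimal sketch of the do-it-yourself route: an equal-probability sample drawn with Stata’s random number generator. The variable names random and insample3 and the seed value are arbitrary choices for illustration.

```stata
* Set a seed so the same sample can be reproduced later (value is arbitrary)
set seed 20240101

* One uniform random draw per community
gen random = runiform()

* Put the communities in random order and flag the first 100
sort random
gen byte insample3 = (_n <= 100)
```

Note that sort reorders the dataset, so keep an identifier variable around if the original order matters to you.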

Now, if you want to see the impact of sampling proportional to size versus just taking a random sample where all communities have an equal chance of being selected, compare the average population of the communities that have a 1 for insample with the average for those that have a 1 for insample2. The latter communities have much smaller populations, of course, because sampling proportional to size gives larger communities a greater chance of selection.
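One simple way to make that comparison uses only the variables created above:

```stata
* Average population in the sample drawn proportional to size
summarize population if insample == 1

* Average population in the equal-probability sample
summarize population if insample2 == 1
```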

Finally, it is worth noting that you can use the if qualifier with samplepps, and this can come in handy if you want to do stratified sampling. For example, in this dataset, some regions have many more communities than others. But you could use an if condition to sample within each region (one region at a time).
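As a rough sketch of how that might look, the loop below samples 25 communities within each region and combines the results into one indicator. It assumes the dataset has a numeric variable named region (the actual variable name and coding may differ), and the names instrat and tmp and the figure of 25 per region are hypothetical choices for illustration.

```stata
* Assumption: region is a numeric variable identifying each community's region
levelsof region, local(regions)

* Combined selection indicator across all strata
gen byte instrat = 0

foreach r of local regions {
    * Draw 25 communities proportional to size within this region only
    samplepps tmp if region == `r', ncases(25) size(population)
    replace instrat = 1 if tmp == 1
    drop tmp
}
```

If region is stored as a string, the condition would need to be written with quotes around the value instead.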