Problem Set 6: Sampling from a Voter File

This content is from Fall 2022.

You can find instructions for obtaining and submitting problem sets here.

You can find the GitHub Classroom link to download the template repository on the Ed Board.

Background

In this exercise, we will focus on sampling and sampling distributions when we have access to an entire census for a given population. In this case, the data/fulton.csv file contains anonymized data on all registered voters in Fulton County, GA from 1994. The variables in this dataset are:

Name      Description
turnout   did person vote (1) or not (0) in 1994?
black     is this person black (1) or not (0)?
sex       is this person a woman (1) or not (0)?
age       age of registered voter
dem       registered as a Democrat (1) or not (0)?
rep       registered as a Republican (1) or not (0)?
urban     registered in a city (1) or not (0)?

For the purposes of this exercise, we will treat this data as the population of interest. Doing so is an increasingly common approach for polling, where pollsters are now using the voter file as a sampling frame to conduct their polls. We will repeatedly sample from this population to better understand sampling uncertainty.

Note: please follow the directions carefully about setting the seed for the sampling-based questions.
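To see why the seed matters, here is a minimal sketch in base R. The vectors draw_a, draw_b, and draw_c are hypothetical names used only for illustration. Resetting the seed to the same value before a random draw reproduces that draw exactly, which is what makes sampling results comparable across machines.

```r
# Hypothetical illustration: the same seed gives the same random draw.
set.seed(02138)  # R reads 02138 as the number 2138
draw_a <- sample(1:100, size = 5)

set.seed(02138)
draw_b <- sample(1:100, size = 5)

# No seed reset here, so this draw continues the random stream
# and will very likely differ from draw_a.
draw_c <- sample(1:100, size = 5)

identical(draw_a, draw_b)  # same seed, same sample
```

If the seed is set anywhere other than where the question asks, or if extra random calls run between set.seed() and the sampling step, the draws will generally not match.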

Question 1 (7 points)

Load the voter list data into R using read_csv and save the data as fulton.

Create a density histogram of age with a bin width of 1 and save this plot as age_hist (use the aesthetic mapping y = ..density.. in geom_histogram to accomplish this). Create a barplot for turnout with the proportion on the y-axis and save this plot as turnout_hist (use the aesthetic mapping y = ..prop.. in geom_bar to achieve this). Make sure both of these plots are shown in the PDF.
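As a rough sketch of the two plot types, the code below uses a made-up data frame called toy (with hypothetical columns hours and voted), not the voter file. It writes the computed variables as after_stat(density) and after_stat(prop), which recent ggplot2 releases treat as the newer spelling of ..density.. and ..prop..; this assumes ggplot2 3.3.0 or later is installed.

```r
library(ggplot2)

# Hypothetical toy data for illustration only.
set.seed(1)
toy <- data.frame(
  hours = round(runif(500, min = 0, max = 40)),
  voted = rbinom(500, size = 1, prob = 0.4)
)

# Density histogram: bar areas sum to 1 rather than bar heights being counts.
toy_hours_hist <- ggplot(toy, aes(x = hours)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1)

# Bar plot of proportions: voted is numeric, so all rows form one group
# and the bar heights sum to 1.
toy_voted_bar <- ggplot(toy, aes(x = voted)) +
  geom_bar(aes(y = after_stat(prop)))

toy_hours_hist
toy_voted_bar
```

If the x variable were a factor instead of a number, each bar would become its own group and every proportion would be 1; adding group = 1 to the aesthetic mapping is the usual workaround in that case.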

In the write-up, state how many units are in the population (that is, how many rows are in the fulton data).

Rubric: 1pt for Rmd file compiling (autograder); 1pt for data loaded (autograder); 2pts for age_hist (autograder); 2pts for turnout_hist (autograder); 1pt for number of rows reported (PDF).

Question 2 (5 points)

Use summarize() to calculate the population mean and standard deviation of age and turnout (that is, the mean and SD of these variables in the fulton data). Save the resulting tibble as pop_parameters; the tibble output should look like:

# A tibble: 1 × 4
  age_pop_mean age_pop_sd turnout_pop_prop turnout_pop_sd
         <dbl>      <dbl>            <dbl>          <dbl>
1         XX.X       XX.X            X.XXX          X.XXX

Make sure that the column names are the same for the autograder. (Hint: you can summarize multiple variables in the same call to summarize.) Use knitr::kable() to present the values in a nicely formatted table, with digits = 2 to create nicely rounded numbers and with informative column names (they may need to be abbreviated to fit on the page).
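The hint about summarizing several variables at once can be sketched as follows. The tibble toy and its columns height and smoker are hypothetical, as are the output column names; only summarize() and knitr::kable() come from the question itself.

```r
library(dplyr)

# Hypothetical toy data for illustration only.
toy <- tibble(
  height = c(150, 160, 170, 180),
  smoker = c(0, 1, 0, 0)
)

# One summarize() call can return several named summaries in a single row.
toy_summary <- toy |>
  summarize(
    height_mean = mean(height),
    height_sd = sd(height),      # sd() divides by n - 1
    smoker_prop = mean(smoker),  # the mean of a 0/1 variable is a proportion
    smoker_sd = sd(smoker)
  )

# kable() rounds with digits and relabels columns with col.names.
knitr::kable(
  toy_summary,
  digits = 2,
  col.names = c("Mean height", "SD height", "Prop. smokers", "SD smoker")
)
```

Keeping the underlying tibble's column names unchanged and relabeling only inside kable() lets the saved object and the printed table serve different purposes.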

Rubric: 4pts for correct pop_parameters tibble (autograder); 1pt for nicely formatted table (PDF).

Question 3 (8 points)

In the first line of the code chunk for this question use the following code:

library(infer)
set.seed(02138)

Then use rep_slice_sample() to take 1,000 samples of size 20 from fulton and calculate the sample mean of age and the sample mean/proportion of turnout for each of these samples. Save these as variables named age_mean and turnout_prop and save the resulting tibble as samples_n20.
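As a sketch of the repeated-sampling pattern, the code below draws from a simulated population called toy_pop rather than from fulton. The names toy_pop, score, passed, score_mean, and passed_prop are hypothetical, and the seed, sample size, and number of replicates are deliberately different from those the question requires.

```r
library(dplyr)
library(infer)

# Hypothetical simulated population for illustration only.
set.seed(1)
toy_pop <- tibble(
  score = rnorm(5000, mean = 50, sd = 10),
  passed = rbinom(5000, size = 1, prob = 0.3)
)

# Draw 100 samples of size 10. rep_slice_sample() stacks the samples and
# labels each one in a replicate column; grouping by that column gives
# one summary row per sample.
toy_samples <- toy_pop |>
  rep_slice_sample(n = 10, reps = 100) |>
  group_by(replicate) |>
  summarize(
    score_mean = mean(score),
    passed_prop = mean(passed)
  )

toy_samples
```

Each row of toy_samples is one draw from the sampling distribution of the two statistics, so a histogram of a column approximates that sampling distribution.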

Create a density histogram of the age means with a bin width of 1 and save this as age_mean_hist. Create a density histogram of the turnout proportions and save this as turnout_prop_hist. Make sure both of these plots are shown in the PDF and that they have informative labels.

In the write-up, compare the sampling distributions of the sample mean and sample proportion here to the population distributions from question 1. Are the shapes of the sampling distributions similar to or different from the population distributions? If different, how are they different?

Rubric: 2pts for correct samples_n20 output (autograder); 2pts for correct age_mean_hist (autograder); 2pts for correct turnout_prop_hist (autograder); 1pt informative labels (PDF); 1pt comparison to population distributions (PDF).

Question 4 (7 points)

Use the summarize() function on samples_n20 to calculate the average (named ev_age and ev_turnout) and standard deviation (named se_age and se_turnout) of each sample mean/proportion across the repeated samples. Save this tibble as samp_dist_summary; it should look like this:

# A tibble: 1 × 4
  ev_age se_age ev_turnout se_turnout
   <dbl>  <dbl>      <dbl>      <dbl>
1   X.XX   X.XX      X.XXX      X.XXX

Make sure that the column names are the same for the autograder. Use knitr::kable() to present the values in a nicely formatted table with digits = 2 to create nicely rounded numbers.

Compare the mean and SD of these sampling distributions to the population means and SDs from question 2. Are these distributions centered on the same value? Which has more spread, the population distribution of age/turnout or the sampling distributions of their means?

Rubric: 4pts for correct samp_dist_summary tibble (autograder); 1pt for nicely formatted table (PDF); 2pts for discussion (PDF).

Question 5 (3 points)

The central limit theorem says that sums and means tend to be normally distributed, so about 68% of the sampling distribution of the mean should be within 1 standard deviation of its expected value. Let’s see if this approximation is good for our sampling distributions.

Use mutate to create a new variable in samples_n20 called age_error that is the absolute value of the difference between age_mean and the average of age_mean. Then create a variable called within_1sd_age that is TRUE if this absolute difference is less than or equal to the standard deviation of age_mean. Finally, take the mean of this variable to obtain the proportion of sample means that are within one SD of the mean of their distribution. Save the resulting 1x1 tibble as age_clt_n20.

To get you started, you can calculate the absolute value of the difference between a variable x and its mean using the following code:

mydata |>
  mutate(
    error = abs(x - mean(x))
  )

Report this proportion in the main text and comment on whether it is similar to what the CLT would predict.
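Extending the starter snippet, the sketch below shows the full pattern on simulated data: flag the values within one SD of their mean, then average the flag. The names toy_means, within_1sd, prop_within, and normal_benchmark are hypothetical, and the data are simulated normal draws rather than sample means from fulton.

```r
library(dplyr)

# Hypothetical stand-in for a column of sample means.
set.seed(1)
toy_means <- tibble(x = rnorm(1000, mean = 40, sd = 3))

toy_within <- toy_means |>
  mutate(
    error = abs(x - mean(x)),    # distance from the mean
    within_1sd = error <= sd(x)  # TRUE when within one SD
  ) |>
  summarize(prop_within = mean(within_1sd))  # mean of TRUE/FALSE is a proportion

# Exact share of a normal distribution within one SD of its mean.
normal_benchmark <- pnorm(1) - pnorm(-1)

toy_within
normal_benchmark
```

Because toy_means is drawn from a normal distribution, prop_within should land close to normal_benchmark; with real sample means the gap between the two is what indicates how well the normal approximation works.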

Rubric: 2pts for correct age_clt_n20 tibble (autograder); 1pt for the reporting value and commenting on comparison to CLT (PDF).

Question 6 (5 points)

In this question you will repeat the exercise from question 5, using turnout_prop instead.

Use mutate to create a new variable in samples_n20 called turnout_error that is the absolute value of the difference between turnout_prop and the average of turnout_prop. Then create a variable called within_1sd_turnout that is TRUE if this absolute difference is less than or equal to the standard deviation of turnout_prop. Finally, take the mean of this variable to obtain the proportion of sample proportions that are within one SD of the mean of their distribution. Save the resulting 1x1 tibble as turnout_clt_n20.

Report this proportion in the main text and comment on whether it is similar to what the CLT would predict. If it differs from the result for age, can you think of any difference between the two variables that might cause the CLT approximation to be better for one than the other?

Rubric: 2pts for correct turnout_clt_n20 tibble (autograder); 1pt for the reporting value and commenting on comparison to CLT (PDF); 1pt reasoning on why they might differ (PDF).

Question 7 (Extra credit, 5pts)

This problem is optional. Any points earned on this problem can be applied to lost points on other parts of the problem set. You cannot earn more than the maximum score on the problem set.

Write the following line at the beginning of the code chunk for this problem:

set.seed(02138)

Then create a tibble called samples_n200 that replicates the sampling exercise of question 3 but with 1,000 samples of size 200. With this tibble, replicate the analysis of questions 5 and 6 to get the proportions of age_mean and turnout_prop values that are within 1 SD of the means of their distributions. You can save the resulting 1x1 tibbles for age as age_clt_n200 and for turnout as turnout_clt_n200.

In the write-up, report these values and answer the following questions. Does the normal approximation seem better here than with a sample size of 20? Which variable sees more improvement in the approximation?

Rubric: 1pt for samples_n200 (autograder); 1pt for age_clt_n200 (autograder); 1pt for turnout_clt_n200; 2pts for write up (PDF).