```
library(infer)
set.seed(02138)
```

# Problem Set 6: Sampling from a Voter File

You can find instructions for obtaining and submitting problem sets here.

You can find the GitHub Classroom link to download the template repository on the Ed Board

## Background

In this exercise, we will focus on sampling and sampling distributions when we have access to an entire census for a given population. In this case, the `data/fulton.csv`

file contains anonymized data on all registered voters in Fulton County, GA from 1994. The variables in this dataset are:

Name | Description |
---|---|

`turnout` |
did person vote (1) or not (0) in 1994? |

`black` |
is this person black (1) or not (0)? |

`sex` |
is this person a woman (1) or not (0)? |

`age` |
age of registered voter |

`dem` |
registered as a Democrat (1) or not (0)? |

`rep` |
registered as a Republican (1) or not (0)? |

`urban` |
registered in a city (1) or not (0)? |

For the purposed of this exercise, we will treat this data as the population of interest. Doing so is an increasingly common approach for polling, where pollsters are now using the voter file as a sampling frame to conduct their polls. We will repeated sample from this population to better understand sampling uncertainty.

**Note:** please follow the directions carefully about setting the seed for the sampling based questions.

## Question 1 (7 points)

Load the voter list data into R using `read_csv`

and save the data as `fulton`

.

Create a density histogram of age with a bin width of 1 and save this plot as `age_hist`

(use the aesthetic mapping `y = ..density..`

in to accomplish this). Create a barplot for turnout with the proportion on the y-axis (use the aesthetic mapping `y = ..prop..`

in `geom_barplot`

to achieve this). Make sure both of these plots are shown in the PDF.

In the write-up, state how many units are in the population (that is, how many rows are in the `fulton`

data).

**Rubric:** 1pt for Rmd file compiling (autograder); 1pt for data loaded (autograder); 2pts for `age_hist`

(autograder); 2pts for `turnout_hist`

(autograder); 1pt for number of rows reported (PDF).

## Question 2 (5 points)

Use `summarize()`

to calculate the population mean and standard deviation of `age`

and `turnout`

(that is the mean and SD of these variables in the `fulton`

data) and save the resulting tibble as `pop_parameters`

with the tibble output looking like:

```
# A tibble: 1 × 4
age_pop_mean age_pop_sd turnout_pop_prop turnout_pop_sd
<dbl> <dbl> <dbl> <dbl>
1 XX.X XX.X X.XXX X.XXX
```

Make sure that the column names are the same for the autograder. (**Hint:** you can summarize multiple variables in the same call to `summarize`

.) Use `knitr::kable()`

to present the values in nicely formatted table with `digits = 2`

to create nicely rounded numbers and informative column names (they may need to be abbreviated to fit on the page).

**Rubric:** 4pts for correct `pop_parameters`

tibble (autograder); 1pts for nicely formatted table (PDF).

## Question 3 (8 points)

In the first line of the code chunk for this question use the following code:

Then use `rep_slice_sample()`

to take 1,000 samples of size 20 from `fulton`

and calculate the sample mean of `age`

and the sample mean/proportion of turnout for each of these samples. Save these as variables named `age_mean`

and `turnout_prop`

and save the resulting tibble as `samples_n20`

.

Create a density histogram of the age means and with a bin width of 1 and save this as `age_mean_hist`

. Create a density histogram of the turnout proportions and save this as `turnout_prop_hist`

. Make sure both of these plots are shown in the PDF and that they have informative labels.

In the write-up, compare the sampling distribution of the sample mean and sample proportion here to the population distributions from question 1. Are the shapes of the sampling distributions similar or different to the population distributions? If different, how are they different?

**Rubric:** 2pts for correct `sample_n20`

output (autograder); 2pts for correct `age_hist`

(autograder); 2pts for correct `turnout_hist`

(autograder); 1pt informative labels (PDF); 1pt comparison to population distributions (PDF).

## Question 4 (7 points)

Use the `summarize()`

function on `samples_n20`

to calculate the average (named `ev_age`

and `ev_turnout`

) and standard deviation (named `se_age`

and `se_turnout`

) of each sample mean/proportion across the repeated samples. Save this tibble as `samp_dist_summary`

and it should look like this:

```
# A tibble: 1 × 4
ev_age se_age ev_turnout se_turnout
<dbl> <dbl> <dbl> <dbl>
1 X.XX X.XX X.XXX X.XXX
```

Make sure that the column names are the same for the autograder. Use `knitr::kable()`

to present the values in nicely formatted table with `digits = 2`

to create nicely rounded numbers.

Compare the mean and SD of these sampling distributions to the population means and SDs from the previous question. Are these distributions centered on the same value? Which has more spread, the population distribution of age/turnout or the sampling distributions of their means?

**Rubric**: 4pts for correct `samp_dist_summary`

tibble (autograder); 1pt for nicely formatted table (PDF); 2pts for discussion (PDF).

## Question 5 (3 points)

The central limit theorem says that sums and means tend to be normally distributed so that 68% of the sampling distribution of the mean should be within 1 standard deviation of the expected value of the expected value. Let’s see if this if this approximation is good for our sampling distributions.

Use `mutate`

to create a new variable in `samples_n20`

called `age_error`

that is the absolute value of the difference between `age_mean`

and the average of the `age_mean`

. Then create a variable called `within_1sd_age`

that is `TRUE`

is this absolute difference is less than or equal to the standard deviation of `age_mean`

. Finally, take the mean of this variable to obtain the proportion of sample means that are within one SD of the mean of their distribution. Save the resulting 1x1 tibble as `age_clt_n20`

.

To get you started, you can calculate the absolute value of the difference between a variable `x`

and its mean using the following code:

```
|>
mydata mutate(
error = abs(x - mean(x))
)
```

Report this proportion in the main text and comment on whether it is similar to what the CLT would predict.

**Rubric:** 2pts for correct `age_clt_n20`

tibble (autograder); 1pt for the reporting value and commenting on comparison to CLT (PDF).

## Question 6 (5 points)

In this question you will repeat the exercise from question 4, using `turnout_prop`

instead.

Use `mutate`

to create a new variable in `samples_n20`

called `turnout_error`

that is the absolute value of the difference between `turnout_prop`

and the average of the `turnout_prop`

. Then create a variable called `within_1sd_turnout`

that is `TRUE`

is this absolute difference is less than or equal to the standard deviation of `turnout_prop`

. Finally, take the mean of this variable to obtain the proportion of sample means that are within one SD of the mean of their distribution. Save the resulting 1x1 tibble as `turnout_clt_n20`

.

Report this proportion in the main text and comment on whether it is similar to what the CLT would predict. If this is different than age, can you think of anything about the two variables that differ that might cause the CLT approximation to be better for one than the other?

**Rubric:** 2pts for correct `turnout_clt_n20`

tibble (autograder); 1pt for the reporting value and commenting on comparison to CLT (PDF); 1pt reasoning on why they might differ (PDF).

## Question 7 (Extra credit, 5pts)

**This problem is optional. Any points earned on this problem can be applied to lost points on other parts of the problem set. You cannot earn more than the maximum score on the problem set.**

Write the following line at the beginning of the code chunk for this problem:

`set.seed(02138)`

Then create a tibble called `samples_n200`

that replicates the exercise the sampling of question 3 but with 1,000 samples of size 200. With this tibble replicate the analysis of questions 4 and 5 to get the proportion of sample means/proportions `age_mean`

and `turnout_prop`

that are within 1 SD of the means of those distributions. You can save the resulting 1x1 tibbles for age as `age_clt_n200`

and for turnout as `turnout_clt_n200`

.

In the write-up, report these values and answer the following questions. Does the normal approximation seem better here than with a sample size of 20? Which variable sees more improvement in the approximation?

**Rubric:** 1pt for `sample_n200`

(autograder); 1pt for `age_clt_n200`

(autograder); 1pt for `turnout_clt_n200`

; 2pts for write up (PDF).