Teaching fellows

  • Dominic Valentino (Head TF)
  • Angelo Dagonel
  • Dorothy Manevich
  • Sooahn Shin
  • Dan Baissa

Course details

  • Tue/Thu
  • September 1st-December 20th, 2022
  • 12:00–1:15 PM
  • Emerson 210
  • Slack

Course assistants

  • Reem Ali
  • Autumn Dorsey
  • Osvaldo Mendoza
  • Sarah Ramberran
  • Felicia Roman
  • Dominic Skinnon
  • Angie Shin
  • Andy Wang
  • Jason Wang

Course objectives

In this course, you will learn the basics of data science as applied to the social sciences. This involves two broad skill sets: (1) learning the computing and programming tools to both manage and analyze data; and (2) understanding the conceptual foundations of why we might manage or analyze data in one way versus another. This course will address both of these topics.

Specifically, at the end of the course you should be able to:

  • Summarize and visualize data
  • Wrangle messy data into tidy forms
  • Evaluate claims about causality
  • Use linear regression to analyze data
  • Understand uncertainty in data analysis and how to quantify it
  • Use professional tools for data analysis such as R, RStudio, git, and GitHub

We will also attempt to inspire a passion for data analysis and create a community among the students to deepen their learning.


In this course, you will be expected to

  • complete eight problem sets,
  • complete ten weekly tutorials,
  • take two take-home, open-book exams, and
  • write one final data analysis project.


No prerequisites will be assumed. If you are unfamiliar with downloading and installing software programs on your Mac or PC, you may want to allocate additional time to make sure those aspects of the course go smoothly. In particular, we have developed a Problem Set 0 to guide you through installing R, RStudio, and git to give you a sense of the tools we will be using.


This course satisfies the Methods requirement for the Government department and the “Quantitative Reasoning with Data” requirement in the Harvard College Curriculum. It also counts toward completion of the Government department data science track.

Speaker Series

We are planning a Gov 50 Speaker Series that will host optional events with guests from the world of data science. More information coming soon…

Course structure

Flow of the Course

The course will follow the same basic flow each week, with small differences depending on whether a tutorial or problem set is due.

  • Monday: Complete reading/watching course material, complete tutorial (if due).
  • Tuesday: Lecture meets.
  • Wednesday: Submit problem set by 11:55 PM (if due).
  • Thursday: Lecture meets; problem sets and exams posted; sections meet.
  • Friday: Sections meet.
  • Sunday: Submit exams by 11:55 PM (if due).


We will assign short weekly tutorials to assess your knowledge of the material covered in that week's reading and video materials. While you are expected to complete them on time, they will be graded on completion, not on the correctness of your answers. They will be due Mondays at 11:59pm ET.


We will meet twice a week for lectures where I will combine presenting material and doing live coding demonstrations. Ideally, you should bring your laptop to class and be ready to code along with me!


Sections will be small-group settings to practice the tools and techniques that we cover in class. These meetings are essential to gaining the skills you need for the problem sets, exams, and final project.

Problem Sets

Only reading about data science is about as instructive as reading a lot about hammers or watching someone else wield a hammer. You need to get your hands on a hammer or two. Thus, in this course, you will have mostly-weekly problem sets that will give you an opportunity to apply the statistical techniques you are learning. They will usually be focused on data analysis in general and will often involve a real dataset.

We encourage students to rely on peer working groups as they work on these homeworks, but each student will submit their own work individually. We will facilitate the formation of homework peer groups.

Grace policy

When calculating the final homework portion of the overall grade, we will drop the lowest score and use the remaining scores. Thus, if you have an emergency that forces you to miss one homework, your grade will not be severely affected.
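
As a rough illustration of the drop-lowest rule, the sketch below computes a problem set average in Python. The function name and the scores are hypothetical, and it assumes every problem set is weighted equally and scored out of 100, which this syllabus does not specify.

```python
def problem_set_average(scores):
    """Average the scores after dropping the single lowest one.

    Assumes equal weighting across problem sets (an assumption,
    not stated course policy).
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


# Hypothetical scores for eight problem sets, one of them missed (0).
scores = [90, 85, 0, 95, 80, 100, 75, 90]
average = problem_set_average(scores)
print(round(average, 2))
```

In this made-up case the missed problem set is the one dropped, so the average is taken over the other seven scores.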


There will be two take-home exams during the course. Each exam will be similar to a problem set in format, and it will be open book and open internet, but you will not be allowed to collaborate with other students or communicate with anyone else about the exam. You will be given several days to complete each exam. We will provide more information about each exam as it approaches.

Exam      Release Date               Due Date
Exam 1    Thu, Oct 13th 5:00pm ET    Sun, Oct 16th 11:59pm ET
Exam 2    Thu, Dec 1st 5:00pm ET     Sun, Dec 4th 11:59pm ET


We will be using the Ed platform for discussions on course material. There is a user's guide to help you orient yourself to the platform. We will enroll you in the platform toward the start of the semester.


You (the student) and I (the instructor) should care the most about what you learn, not what numerical/letter summary of that learning you get at the end of the semester. So I would love to not have grades at all, but unfortunately we humans are very good at procrastinating on our good intentions when there is no incentive not to. Thus, we have grades to help solve this commitment problem and to encourage you to put effort into learning the course material.

Here is how each portion of the course contributes to the overall grade:

Category         Percent of Final Grade
R Tutorials      10%
Problem Sets     40%
Exams            30%
Final Project    20%
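
To make the weights concrete, here is a minimal sketch of the weighted average they imply. The category scores are made up, each is assumed to be on a 0 to 100 scale, and the sketch ignores the bump-up policy; it is not the staff's actual grade calculation.

```python
# Weights from the grading table, as fractions of the final grade.
WEIGHTS = {
    "R Tutorials": 0.10,
    "Problem Sets": 0.40,
    "Exams": 0.30,
    "Final Project": 0.20,
}


def overall_grade(category_scores):
    """Weighted average of category scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)


# Hypothetical category scores for one student.
example_scores = {
    "R Tutorials": 100,
    "Problem Sets": 88,
    "Exams": 82,
    "Final Project": 90,
}
overall = overall_grade(example_scores)
print(round(overall, 1))
```

Because problem sets carry the largest weight, a few points there move the overall number more than the same change in any other category.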

We will use Gradescope for submission of the various assignments throughout the semester. Once enrollment is finished, Gradescope will automatically connect through Canvas.

Bump-up policy: We reserve the right to “bump up” the grades of students who have made valuable contributions to the course in lecture, sections, study halls, or discussion/Slack. This also applies to students who show tremendous progress and growth over the semester.

Final Project

The final project is a data analysis project about whatever data excites you. That could be public opinion in U.S. elections, United Nations voting patterns, patterns of racial discrimination in police behavior, the distribution of salaries in basketball, or the economics of the Marvel Cinematic Universe. No matter the topic, you will formulate a key research question, find data on that question, answer the question using the tools of the course, and present those results for public consumption.

The goal is for this to be a professional project that you can use to showcase the skills that you have gained to potential employers. Your final submission will be a publicly available article/website that contains: (1) a brief introduction to the research question and data collected; (2) a visualization of the data in question that speaks to your research question; (3) a presentation (as a table or graph) of a regression model assessing your question along with a plain-English interpretation; and (4) a brief (one-paragraph) section that describes limitations of your analysis and threats to inference, and states how your analysis could be improved (e.g., improved data that would be useful to collect). Finally, there must be a link to the GitHub repo that contains the source code to load, clean, and analyze the data and produce the article.

The data collection and cleaning must be meaningful; it is not sufficient to simply use a pre-cleaned dataset from an R package. Self-collected data is allowed and even encouraged, though beware that it can be used only for the final project itself unless you go through human subjects approval from the University.

Milestone         Due Date
Proposal          Fri, Oct 28th 11:59pm ET
Draft Analyses    Fri, Nov 18th 11:59pm ET
Final Report      Wed, Dec 14th 11:59pm ET

Course Policies

Regrading Policy

If you feel there has been an error in the grading of one of your assignments, you may request (in writing) a regrade of the assignment. Please send your request in an email to Professor Blackwell and the Head TF. A member of the teaching staff will regrade the entire assignment, not just the part you are disputing. Therefore, your regrade might increase or decrease the overall grade on the assignment.

Office Hours and Availability

My office hours are Tuesdays and Thursdays 1:30pm-3:00pm ET. If you have questions about the course material, computational issues, or other course-related issues, please do not hesitate to set up an appointment with any of us.

If you have a general question, you should post it to the Ed discussion board (where you can make your question anonymous if you would like). This is almost always the fastest way to get an answer. However, you can also email me directly if you have a question about a personal situation.

Academic Honesty

The work that you do on the problem sets should be your own. You may seek help from others so long as this does not result in someone else completing your work for you. When asking for help, you may show others your code to help diagnose a bug or highlight a potential issue, but you should not view their (working) code. You should cite any discussions you have with other students in your problem set and note if they helped you with your code. You should never copy and paste code from another student or elsewhere (e.g., websites, former students).

I also strongly suggest that you make a solo effort at all the problems before consulting others. The exams will be very difficult if you have no experience working on your own. There is no collaboration allowed on the exams.

Course materials


We will use the following books in this class:

The following books are optional, but may be helpful to build your understanding of the material:

  • Freedman, David, Robert Pisani, and Roger Purves. 2007. Statistics. 4th edition. W.W. Norton & Company.
  • Gonick, Larry, and Woollcott Smith. 1993. The Cartoon Guide to Statistics. HarperPerennial.


We’ll use R in this class to conduct data analysis. R is free, open source, and available on all major platforms (including Solaris, so no excuses). RStudio (also free) is a graphical interface to R that is widely used to work with the R language. You can find a virtually endless set of resources for R and RStudio on the internet. For beginners, there are several web-based tutorials that teach the basic syntax of R. We’ll post more R resources on the course website. We will also use git and GitHub to manage our projects.

You can get set up with all of these tools by completing Problem Set 0.

Mental Health

College is a stressful time in one’s life, and mixing it with a global pandemic, remote learning, and dislocation makes this one of the most fraught semesters any of us have probably faced. We acknowledge that nothing is quite normal and that there may be times when you feel overwhelmed by this course or by life more generally. Please feel free to reach out to any of the course staff if you want to talk about any issues you are having with the course or anything else. We will always try to help, and we are committed to being extra accommodating this semester on course policy issues. Please just get in touch.

Of course, there are other resources at Harvard if you need them. A few are listed below:


Harvard University values inclusive excellence and providing equal educational opportunities for all students. Our goal is to remove barriers for disabled students related to inaccessible elements of instruction or design in this course. If reasonable accommodations are necessary to provide access, please contact the Disability Access Office (DAO). Accommodations do not alter fundamental requirements of the course and are not retroactive. Students should request accommodations as early as possible, since they may take time to implement. Students should notify DAO at any time during the semester if adjustments to their communicated accommodation plan are needed.