lect03, Tue 04/07

The Data Science Lifecycle and Sampling

04:08 min into the video

Announcements

Topics for Today

Where does randomness come from?

From Last Week

Key questions:

Where is the randomness?

Random Variables

Probability tells us what could have happened. Useful for contextualizing what did happen.

If I can state the probability of getting the sample that I got, that gives me a notion of how accurate and representative my sample is.
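As a small illustration of what "the probability of the sample" means, the sketch below enumerates every simple random sample from a group of 10 units. The sample size of 3 is an assumption chosen for illustration, not something taken from the lecture.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N = 10  # population size (e.g., a group of 10 mothers)
n = 3   # hypothetical sample size, chosen only for illustration

# Under simple random sampling, every subset of size n is equally likely.
samples = list(combinations(range(N), n))
prob_each_sample = Fraction(1, comb(N, n))

# Chance that one particular unit (unit 0) ends up in the sample.
inclusion_chance = Fraction(sum(1 for s in samples if 0 in s), len(samples))

print(prob_each_sample, inclusion_chance)  # 1/120 3/10
```

Because every sample has a known probability, the chance that any given unit is included (here n/N) is also known, which is what makes statements about accuracy possible.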

Can we make up for not having a probability sample by using big data (e.g., an administrative dataset)?

Sample and Population Averages

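The gap between these is based on three things:

Data quality: the correlation between whether a unit is recorded in the sample and its value

Data quantity: how much of the population is sampled

Problem difficulty: the standard deviation of the values in the population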

Meng 2018, Annals of Applied Statistics
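The decomposition in Meng (2018) writes the gap as (sample average − population average) = ρ × sqrt((N − n)/n) × σ, where ρ is the correlation between the inclusion indicator and the value, N and n are the population and sample sizes, and σ is the population standard deviation. The sketch below checks this identity numerically. The population and the self-selection rule are made up for illustration, and the function name `meng_decomposition` is hypothetical.

```python
import math
import random

def meng_decomposition(values, included):
    """Split (sample mean - population mean) into three factors."""
    N = len(values)
    n = sum(included)
    f = n / N
    pop_mean = sum(values) / N
    sample_mean = sum(y for y, r in zip(values, included) if r) / n

    # Problem difficulty: population standard deviation (divide by N).
    sigma_y = math.sqrt(sum((y - pop_mean) ** 2 for y in values) / N)
    # Data quality: correlation between the 0/1 inclusion indicator and the value.
    cov_ry = sum((r - f) * (y - pop_mean) for y, r in zip(values, included)) / N
    rho = cov_ry / (math.sqrt(f * (1 - f)) * sigma_y)
    # Data quantity: shrinks toward 0 as the sample approaches the population.
    quantity = math.sqrt((N - n) / n)

    return {
        "gap": sample_mean - pop_mean,
        "data_quality": rho,
        "data_quantity": quantity,
        "problem_difficulty": sigma_y,
        "product": rho * quantity * sigma_y,
    }

rng = random.Random(0)
# Made-up population of 10,000 values.
values = [rng.gauss(50, 10) for _ in range(10_000)]
# Hypothetical self-selection: larger values are more likely to be recorded.
included = [1 if rng.random() < (0.7 if y > 50 else 0.3) else 0 for y in values]

parts = meng_decomposition(values, included)
print(parts)
```

The `gap` and `product` entries agree up to floating-point error, since the identity is algebraic and holds for any population and any inclusion pattern with 0 < n < N.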

Administrative Data on Birth Registration

Do birth certificates cover all families in the U.S.?

UNICEF: worldwide, 1 in 4 children under age 5 do not officially exist. (Discovered via field work?)

They are not in our administrative data.

Large Administrative Data vs Small SRS

A tradeoff between data quantity and data quality for using sample averages to estimate population averages

Administrative Data - biased, low variance

Small SRS - unbiased, higher variance

“The bigger the data the surer we fool ourselves!” (Quote from a recent paper)
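A minimal simulation sketch of the quantity versus quality tradeoff: a large self-selected "administrative" dataset is compared with a small simple random sample for estimating a population average. The population, the self-selection probabilities, and the sample sizes are all assumptions made up for illustration.

```python
import math
import random
import statistics

def compare_estimators(pop_size=10_000, srs_size=100, trials=100, seed=1):
    rng = random.Random(seed)
    # Made-up population of values.
    population = [rng.gauss(50, 10) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)

    admin_errors, srs_errors = [], []
    for _ in range(trials):
        # Administrative data: units self-select, and (by assumption) units
        # with larger values are somewhat more likely to be recorded.
        admin = [y for y in population if rng.random() < (0.6 if y > 50 else 0.4)]
        # Small SRS: every subset of size srs_size is equally likely.
        srs = rng.sample(population, srs_size)
        admin_errors.append(statistics.fmean(admin) - true_mean)
        srs_errors.append(statistics.fmean(srs) - true_mean)

    def summarize(errors):
        return {
            "bias": statistics.fmean(errors),
            "sd": statistics.pstdev(errors),
            "rmse": math.sqrt(statistics.fmean(e * e for e in errors)),
        }

    return summarize(admin_errors), summarize(srs_errors)

admin_summary, srs_summary = compare_estimators()
print("administrative:", admin_summary)
print("small SRS:     ", srs_summary)
```

Under these assumptions the administrative estimate barely varies from trial to trial but is consistently off, while the SRS estimate is noisier but centered on the truth, so the small SRS ends up with the smaller overall error.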

COVID-19 pandemic

Potential problem with relying on counts from people who voluntarily get tested? Lots of undiagnosed people.

Germany is doing an SRS: testing a random subset of the population

Randomly testing 20 people gives about the same accuracy as we are getting from the administrative data on those who voluntarily get tested.

How do we solve probability problems?

33:44 in the video

Basic Approaches

Review from 120A

Once we have the data, the outcome is determined for that sample, but probability contextualizes how confident we can be about our inference.

Recall our group of 10 mothers

What is the chance that the second mom selected has 1 child?

Symmetry & Analogy

The above question is analogous to the following problem

4 scenarios indistinguishable “mathematically”

Chance the second draw is 1.
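The symmetry argument can be checked by brute force: enumerate every ordered way of drawing without replacement and count how often a given draw equals 1. The child counts below are hypothetical, since these notes do not list the values for the 10 mothers, and `chance_draw_equals` is a helper name introduced here.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical numbers of children for the 10 mothers (not from the lecture).
children = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]

def chance_draw_equals(values, position, target):
    """Exact chance that the draw at `position` (0-based) equals `target`
    when drawing without replacement, with all orderings equally likely."""
    draws = list(permutations(range(len(values)), position + 1))
    hits = sum(1 for d in draws if values[d[position]] == target)
    return Fraction(hits, len(draws))

first = chance_draw_equals(children, 0, 1)
second = chance_draw_equals(children, 1, 1)
print(first, second)  # 3/10 3/10
```

With no information about the first draw, the second draw is just as likely to be any particular mother as the first draw is, so the two chances match.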