lect03, Tue 04/07
The Data Science Lifecycle and Sampling
04:08 min into the video
Announcements
 All the important links are on the website at the top: bookmark the site.

Instead of going directly to ds100.lsit.ucsb.edu, use the bit.ly link so that you always land on the most up-to-date version.
 Homework 1 is due Friday
 Important: please separately upload a PDF-only copy to the Gradescope assignment titled “Homework 1 PDF only”
 Office hours all week
 course calendar and schedule on the website
 Tutors attend the lectures: you can ask them questions during the lecture
 Tuesday: Noemi
 Thursday: Arthur

Use the additional Zoom reactions: go slower, etc.
 Due dates on the website will be updated.
Topics for Today
 Probability review (PSTAT 120A)
 Random variables, probability distribution, expectation and variance
 Review Error, Loss, and Risk and the Relationship between the Data and the “World” (Data science life cycle)
 An Example (10:37 in the video)
 population
 sampling frame
Where does randomness come from?
From Last Week
Key questions:
 What is the population of interest?
 What is the sampling frame?

How are the data generated?
 These questions need to be answered for every data analysis, not just survey data!
Where is the randomness?
Random Variables

“Informally, a variable whose values depend on outcomes of a random phenomenon” (Wikipedia)

Possible outcomes of a yet-to-be-performed “experiment”

Random variables are still relevant after conducting the “experiment”. Why?
Probability tells us what could have happened. Useful for contextualizing what did happen.
 Accuracy
 Representativeness
If we know the probability of obtaining the sample that we got, we have a notion of how accurate and how representative that sample is.
 Big data
 Government agency / company
 nonrandom sample
Can big data (e.g., an administrative dataset) make up for not having a probability sample?
Sample and Population Averages
The gap between the sample average and the population average depends on three things:
 Data quality measure (how representative is our sampling? Biased sampling?)
 Data quantity measure (how big is the sample relative to the population; the “n”)
 Problem difficulty measure (how variable is the response)
Meng (2018), Annals of Applied Statistics
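The three factors above can be checked numerically. A minimal sketch, assuming the identity from Meng (2018) in the form gap = rho * sqrt((1 - f) / f) * sigma, where rho is the population correlation between the inclusion indicator and the response (data quality), f = n/N is the sampled fraction (data quantity), and sigma is the population standard deviation of the response (problem difficulty). The population values and the inclusion rule below are made up for illustration.

```python
import math
import random

random.seed(1)

# Hypothetical population of N responses (values invented for illustration).
N = 1_000
y = [random.gauss(50, 10) for _ in range(N)]

# Hypothetical non-random inclusion: units with larger y are more likely
# to end up in the sample (r_i = 1 means unit i is in the sample).
r = [1 if random.random() < (0.6 if yi > 50 else 0.2) else 0 for yi in y]
n = sum(r)
f = n / N  # sampled fraction

pop_mean = sum(y) / N
sample_mean = sum(yi for yi, ri in zip(y, r) if ri) / n

# Problem difficulty: population standard deviation of y (divisor N).
sigma_y = math.sqrt(sum((yi - pop_mean) ** 2 for yi in y) / N)

# Data quality: population correlation between inclusion and response.
cov_ry = sum((ri - f) * (yi - pop_mean) for ri, yi in zip(r, y)) / N
rho = cov_ry / (math.sqrt(f * (1 - f)) * sigma_y)

# Data quantity: shrinks as the sampled fraction f grows.
quantity = math.sqrt((1 - f) / f)

gap = sample_mean - pop_mean
product = rho * quantity * sigma_y

print(f"sample mean - population mean = {gap:.6f}")
print(f"quality * quantity * difficulty = {product:.6f}")
```

The two printed numbers agree up to floating-point error because the identity is algebraic, not approximate. Under simple random sampling rho is close to zero on average, which is why a small random sample can beat a large non-random one.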
Sample and Population Averages
 Random sampling ensures high data quality by eliminating selection bias and confounding

When combining data sources for population inferences, the relatively tiny but higher-quality sources should be given far more weight than their sizes suggest.
 Active area of research
Administrative Data on Birth Registration

Can use large databases of all recorded births

This is (“big”) data. It’s meant to be comprehensive. Is it?

“Which one should we trust more, a 5% survey sample or an 80% administrative dataset?” (Meng, 2018)
 For example, 80% administrative dataset = birth certificates for 80% of US families (non-random)
 5% survey sample = random sample
Birth certificates for 80% of all families in the U.S.?
UNICEF: 1 in 4 children under age 5 do not officially exist. (Discovered via the field work?)
They are not in our administrative data.
Large Administrative Data vs Small SRS
A tradeoff between data quantity and data quality for using sample averages to estimate population averages
Administrative data: biased, low variance
“The bigger the data the surer we fool ourselves!” (Quote from a recent paper)
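The tradeoff can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the true rate, and the inclusion probabilities of the "administrative" dataset are all hypothetical, chosen so that the administrative data cover roughly 80% of the population but miss units with y = 1 more often.

```python
import math
import random

random.seed(0)

# Hypothetical population: binary response with a true rate near 0.3.
N = 20_000
TRUE_RATE = 0.3
population = [1 if random.random() < TRUE_RATE else 0 for _ in range(N)]
pop_mean = sum(population) / N


def admin_sample(pop):
    """Large non-random sample: units with y = 1 are recorded less often.

    The inclusion probabilities (0.6 and 0.9) are assumptions made up for
    this illustration; they give roughly 80% overall coverage.
    """
    return [y for y in pop if random.random() < (0.6 if y == 1 else 0.9)]


def rmse(estimates, truth):
    """Root mean squared error of a list of estimates around the truth."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


REPS = 100
admin_means, srs_means = [], []
for _ in range(REPS):
    a = admin_sample(population)
    admin_means.append(sum(a) / len(a))
    s = random.sample(population, N // 20)  # 5% simple random sample
    srs_means.append(sum(s) / len(s))

rmse_admin = rmse(admin_means, pop_mean)
rmse_srs = rmse(srs_means, pop_mean)
print(f"population mean:       {pop_mean:.4f}")
print(f"RMSE, ~80% admin data: {rmse_admin:.4f}")
print(f"RMSE, 5% SRS:          {rmse_srs:.4f}")
```

In this setup the administrative estimates barely vary from run to run, but they are all off in the same direction, so their error is dominated by bias. The SRS estimates vary more but are centered on the population mean.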
COVID-19 pandemic
 Number of people in the US who tested positive
 Administrative data from the hospitals
 Estimate = (number who tested positive) / (US population)
Potential problem with this approach? Lots of undiagnosed people
Germany is doing an SRS: testing a random subset of the population
Randomly testing 20 people gives about the same accuracy that we are getting from the administrative data of those who voluntarily get tested.
How do we solve probability problems?
33:44 in the video
Basic Approaches
Review from 120A
Once we have the data, the outcome is determined for that sample, but probability contextualizes how confident we can be about our inference.
 Symmetry and Analogy
 Counting and equally likely
 Trees and conditional probability
 Computer simulation and RNG (“approximate” counting)
Recall our group of 10 mothers

Population of 10
 Select a mother at random from the 10, record her number of kids
 Do not replace
 Repeat for a total of 3 draws (a survey with n = 3)
What is the chance the second mom that was selected has 1 child?
Symmetry & Analogy
The above question is analogous to the following problems:
 Urn with 10 marbles, one for each mother, indistinguishable except for the number written on each
 Box with 10 indistinguishable tickets, except for the number on it (raffle)
 Deck of 10 indistinguishable cards, except for the number on the flip side
The 4 scenarios are indistinguishable “mathematically”
Chance the second draw is 1.
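The symmetry argument can be checked by counting and by simulation. A minimal sketch: the numbers of kids for the 10 mothers are not given in these notes, so the list below is hypothetical. By symmetry, the second draw is equally likely to be any of the 10 mothers, so the chance should equal the fraction of mothers with 1 child.

```python
import random
from fractions import Fraction
from itertools import permutations

random.seed(0)

# Hypothetical number of kids for each of the 10 mothers (made-up values).
kids = [0, 1, 1, 1, 2, 2, 2, 3, 3, 4]

# Counting: enumerate every equally likely ordered draw of 3 mothers
# without replacement and count those whose second draw has 1 child.
draws = list(permutations(range(len(kids)), 3))
favorable = sum(1 for d in draws if kids[d[1]] == 1)
exact = Fraction(favorable, len(draws))

# Symmetry: the second draw is just as likely as the first to be any mother.
by_symmetry = Fraction(kids.count(1), len(kids))

# Simulation ("approximate" counting): random.sample returns the items in
# selection order, so index 1 is the second mother drawn.
TRIALS = 100_000
hits = sum(1 for _ in range(TRIALS) if random.sample(kids, 3)[1] == 1)
simulated = hits / TRIALS

print(f"exact (counting): {exact}")
print(f"by symmetry:      {by_symmetry}")
print(f"simulated:        {simulated:.3f}")
```

The counting answer matches the symmetry answer exactly, and the simulated proportion should land close to it.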