lect02, Thu 04/02
Data Life Cycle
Lecture 2
Announcements
- Office hours begin today. Schedule here: https://ucsb-ds.github.io/s20/info/staff/
- Lecture recordings should be accessible when using UCSB netid.
- Use the same zoom link for lab and office hours
- Try submitting first lab to Gradescope by 5pm Friday.
- Homework 1 out in the next 1-2 days.
Question: The computer reports “No required application” installed when opening the downloaded notebook. How do I open it? We’ll try to resolve this before the end of the week.
Data Life Cycle
A continuous loop that starts with trying to gain insight into the data.
- Question Formulation
- Data Design/Generation (experimentation)
- Data Analysis (EDA, visualization)
- Generalization
- Refine the question and repeat the process.
- Steps 1 and 4 are crucial, but the focus of this class is steps 2 and 3.
START SIMPLE
QUESTION: What is the typical family size (children only)?
Want to understand our population (family size); get out the survey to collect samples; get the data and then generalize to the population.
Probability (sampling distribution) ==> Data “Statistic” Inference (estimation, hypothesis testing) ==> Population
DATA:
Surveyed all 250 students at UCSB, asking each for their family size
A single sample / counts.
ANALYSIS
Bar chart / histogram visualization provides a quick summary of the distribution
Can we provide a summary statistic?
One number, simple numeric summary
What number should we choose? Mean, median
DETOUR:
Why is the sample mean such a desirable summary?
Summarizing the Data
- DATA: x1, x2, …, xn where n is 250 in our example
- Want a single numeric statistic; let’s call it c.
- ERROR: x1 - c, x2 - c, …, xn - c
- (The difference between each family size xi and the summary statistic c)
- LOSS function: takes a real number (the error) and maps it to a non-negative number (the cost of making that error).
Example:
x1 = 2, c = 2, so the loss L is 0 (x1 - c)
x2 = 4, c = 2, so the loss L is 2 (x2 - c)
Average loss aka empirical risk
Empirical, because we are looking at the observed sample
Minimize the empirical risk: want to minimize over all c, solve the minimization problem (sum over all losses)
What’s a reasonable loss function to use?
- non-negative, increasing in the size of the error
- MSE: (mean) squared error
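The pieces so far (error, loss, average loss) can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the names `squared_loss` and `empirical_risk` and the five family sizes are made up for the example.

```python
def squared_loss(error):
    """Squared error loss: non-negative, and larger for larger errors."""
    return error ** 2


def empirical_risk(data, c, loss=squared_loss):
    """Average loss over the observed sample when it is summarized by c."""
    return sum(loss(x - c) for x in data) / len(data)


# Hypothetical family sizes (not the class survey data).
family_sizes = [2, 4, 1, 3, 2]

# Average squared error when the sample is summarized by c = 2.
print(empirical_risk(family_sizes, c=2))
```

Passing a different `loss` function to `empirical_risk` changes the notion of cost without changing the averaging step.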
Minimize the Average Loss
Squared error loss properties:
- maps to non-negative values
- continuous function (and continuous derivative, i.e. “smooth”)
- deep connection to gaussian/normal
Over all possible values of summary statistic, c, minimize the squared error
Q: Isn’t it the formula for the variance?
(c can be arbitrary, so it’s not quite the variance; see below)
Refresher
“Sample mean” (x bar): Linear operation
Linear operation property: Scaling and shifting each observation scales and shifts the sample mean.
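In symbols, the linearity property can be sketched as follows. The constants a (scale) and b (shift) are notation introduced here for illustration.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
y_i = a\,x_i + b
\;\Longrightarrow\;
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} (a\,x_i + b) = a\,\bar{x} + b .
```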
Minimize the Average Loss
How to minimize a function? Look for the place where the derivative is 0.
Calculus + algebra skills FTW!
Shows that c = the sample mean
When L is the squared loss, c = x-bar minimizes the average loss, i.e. the empirical risk.
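The calculus step can be written out as below. This is a sketch of the standard argument, using R(c) as shorthand (introduced here) for the empirical risk under squared loss.

```latex
\begin{aligned}
R(c) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - c)^2 \\
\frac{dR}{dc} &= -\frac{2}{n}\sum_{i=1}^{n} (x_i - c) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} x_i = n\,c
  \;\Longrightarrow\; c = \bar{x} \\
\frac{d^2R}{dc^2} &= 2 > 0 \quad \text{(so the stationary point is a minimum)}
\end{aligned}
```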
What does the empirical risk look like?
- expected value
The Sample Average Minimizes Empirical Risk
The empirical risk at c = x-bar is less than or equal to the empirical risk at any other choice of c. Because c can be arbitrary, the empirical risk is not quite the variance; it equals the variance only when c is the mean.
Specific for the squared loss function. If that’s not the loss function, the result might/will be different.
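A quick numerical check of both claims, as a sketch under stated assumptions: the sample is hypothetical, the minimization is a brute-force search over a grid of candidate c values, and absolute error is used as one example of a different loss (its minimizer is the sample median, a standard result).

```python
from statistics import mean, median, pvariance


def empirical_risk(data, c, loss):
    """Average loss over the sample when it is summarized by c."""
    return sum(loss(x - c) for x in data) / len(data)


def squared_loss(error):
    return error ** 2


def absolute_loss(error):
    return abs(error)


# Hypothetical sample (not the class survey data).
data = [1, 1, 2, 3, 7]

# Candidate summaries c = 0.00, 0.01, ..., 8.00.
grid = [i / 100 for i in range(801)]

best_squared = min(grid, key=lambda c: empirical_risk(data, c, squared_loss))
best_absolute = min(grid, key=lambda c: empirical_risk(data, c, absolute_loss))

print("squared loss minimizer: ", best_squared, " sample mean:  ", mean(data))
print("absolute loss minimizer:", best_absolute, " sample median:", median(data))

# At c = mean, the empirical risk under squared loss equals the variance
# (with divisor n, which is what pvariance computes).
risk_at_mean = empirical_risk(data, mean(data), squared_loss)
print(risk_at_mean, pvariance(data))
```

The single large value 7 pulls the squared-loss minimizer above the absolute-loss minimizer, which is one way the choice of loss changes the summary.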
Data Life Cycle
- Question Formulation (Family size)
- Data Design/Generation (survey)
- Data Analysis (loss function + minimizing risk –> got “c”)
- In some cases we do not want/need to generalize.
If we have data for all individuals, we don’t need to generalize. ==> “Finite sample inference”
What does our data say about larger population?
Consider the Question Carefully
- What is the typical family size (children only)?
Families: all shapes, all sizes
Angelina Jolie: 3 adopted and 3 biological children.
- How well can we measure this?
- What are we trying to measure?
Who do I pick to figure out the measure?
Focus the Question
Questions for our question to help us focus:
- WHERE
- WHEN
- WHO
- WHAT
From a female fertility perspective: estimate the number of children that a woman gives birth to.
- WHERE: in the USA
- WHEN: in 2015
- WHO: Females, aged 40-44 (likely they’ve had all the children they are going to have)
- WHAT: number of births
- Q2: how does it compare to births 50 years ago?
Focus the Question
The Question gives focus to the Population that we want to study.
Data Life Cycle
Back to our diagram.
Population: women in USA in 2015. Survey them and make a larger inference about the population.
How well does UCSB represent the group of interest?
- Mothers of children at UCSB
- Measure the mothers via the children
- Mothers who are 40-44 in 2015
How might these characteristics impact the estimate of the number of children a US woman bears in her lifetime in 2015?
Bias up, Bias down, No impact
Breakout discussion results
42-45 min into the video
- Multiple siblings on campus: their family might be represented multiple times
- The survey doesn’t account for women with no children
- Not representative of the demographics: wealth, race, etc. (factors that affect the number of children a woman has)
- Age: 40-44? Those mothers would have had their children at age 20-24 (for the children to be in college now)
- What kinds of bias? Over/underestimation?
“Size-biased sampling” (multiple siblings)
(Positive bias, we’ll overestimate)
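The direction of this bias can be checked with a small exact calculation. This is a sketch, assuming a made-up population of 100 women (the counts in `women_by_children` are hypothetical) and assuming each child is equally likely to end up in the survey.

```python
# Hypothetical population of 100 women: number of women with k children.
women_by_children = {0: 15, 1: 20, 2: 35, 3: 20, 4: 10}

total_women = sum(women_by_children.values())
total_children = sum(k * count for k, count in women_by_children.items())

# Average number of children per woman in the population.
true_mean = total_children / total_women

# Sampling a child uniformly at random and asking about the child's mother:
# a mother of k children is reached through any of her k children, so the
# women with k children carry weight k * count (and weight 0 when k = 0).
size_biased_mean = (
    sum(k * (k * count) for k, count in women_by_children.items())
    / total_children
)

print(true_mean, size_biased_mean)
```

Women with no children get weight zero and large families are over-weighted, so both effects push the estimate upward in this example.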
According to Pew Research Center
Women with less education tend to have more children.
26% of women with a high school diploma have 4+ children vs. 8% of women with a postgraduate degree
How might this impact the class average?
http://www.pewsocialtrends.org/2015/05/07/family-size-among-mothers/
The data used in these analyses are designed to assess women’s fertility, and as such a “mother” is here defined as any woman who has given birth. However, many women who do not bear their own children are indeed mothers.
Population of Interest
- Population of interest (All women in 2015 in USA)
- The individuals that we want to study
- The Sample is a subset of the Frame
- Mothers of UCSB students
- Individuals who aren’t reachable through the sampling frame
- Individuals not in the population
Sample: how we select individuals from the frame.
Possible case
All the same: Sample = Sampling Frame = Population
Where/when have we recently seen this?
- class survey
- Census!
Who are we missing? How might this affect our results?
Common Assumption when sampling
Sampling Frame = Population
Scenario: Access to all members of the Population when sampling
(Common Assumption)
Administrative Data
Sampling Frame = Sample
Government data: e.g., birth certificate
Most common scenario
Little overlap
How are the data generated?
- What is the population of interest?
- What is the sampling frame?
- How are the data generated?
It doesn’t matter what fancy tools / analyses you use unless you understand where the data came from.
DETOUR:
- The simple random sample
- Why is a probability sample so desirable?
The Simple Random Sample (SRS)
- Suppose we have a population with N subjects
- We are able to sample n of them
- The SRS is a random sample where every unique subset of n subjects has the same chance of appearing in the sample
- There are “N choose n” = N! / (n! (N-n)!) possible samples, each equally likely; as a result, each person is equally likely to be in the sample (with chance n/N)
Example: N = 4 and n = 2 gives 6 possible samples.
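The N = 4, n = 2 example can be enumerated directly. A minimal sketch using the standard library; the subject labels A to D are placeholders invented for the illustration.

```python
from itertools import combinations
from math import comb

population = ["A", "B", "C", "D"]  # N = 4 hypothetical subjects
n = 2

# Every unique subset of size n; under an SRS each is equally likely.
samples = list(combinations(population, n))
print(samples)
print(len(samples), comb(len(population), n))

# The chance that a given subject is included is the fraction of
# equally likely samples that contain that subject.
shares = {}
for subject in population:
    shares[subject] = sum(subject in s for s in samples) / len(samples)
print(shares)
```

The same `comb` call gives the count of possible samples for any N and n, for example `comb(10, 3)`.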
The Advantages of a SRS
- Representative: The sample tends to look like the population
- Statistics based on the sample tend to be close to statistics based on the population
- We can provide typical deviations of sample statistics from population values.
- AND MORE…
Start Simple
Suppose our population contains only 10 mothers and we take a Simple Random Sample of 3 for our survey.
A table with true numbers.
10 choose 3 possible samples (120 possible samples)
If I got “unlucky”, I chose 2 mothers with 1 child and 1 mother with 2. (Why “unlucky”? Because I selected the individuals with the fewest children, so my sample average is much smaller than the true average in the population.)
Can I compute the chance of getting “unlucky”? That’s what we did in 120A and what 120B gets at.
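For a population this small, the chance can be found by listing all 120 samples. The sketch below assumes a hypothetical population of 10 mothers, since the lecture's table is not reproduced in these notes; the values were chosen so that the population mean is 2.3 and are otherwise made up.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical number of children for each of 10 mothers (assumed values;
# the lecture's actual table is not reproduced here). The mean is 2.3.
population = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4]

# combinations works on positions, so mothers with equal counts stay distinct.
samples = list(combinations(population, 3))

# "Unlucky" sample: two mothers with 1 child and one mother with 2.
unlucky = [s for s in samples if sorted(s) == [1, 1, 2]]
p_unlucky = Fraction(len(unlucky), len(samples))

# Averaging the sample mean over all equally likely samples.
sample_means = [Fraction(sum(s), 3) for s in samples]
average_of_means = sum(sample_means) / len(sample_means)

print(len(samples), p_unlucky, float(average_of_means))
```

In this made-up population the unlucky draw is rare, and the sample means average out to the population mean.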
Formal Set Up
Like a Lotto Ball machine with the number of children as the number on the ball.
Drawing them without replacement
X1: The number of children for the first mother chosen
X2: The number of children for the second mother chosen
X3: The number of children for the third mother chosen
Random variable: we don’t know what we’ll get.
Truth table
Probability Distribution
What is the expected value of X1?
- Sum over all possibilities times the probability of that possibility (from the truth table).
- On average, in our example, the family size is 2.3
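The expected value calculation can be sketched as follows. The population values are hypothetical (the lecture's table is not reproduced in these notes) and were picked so that the result matches the 2.3 quoted above.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical number of children for each of 10 mothers (assumed values).
population = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4]

# The first draw is equally likely to be any of the 10 mothers, so
# P(X1 = k) is the share of mothers with k children.
counts = Counter(population)
distribution = {
    value: Fraction(count, len(population))
    for value, count in sorted(counts.items())
}

# Sum over all possible values times the probability of each value.
expected_x1 = sum(value * prob for value, prob in distribution.items())

print(distribution)
print(float(expected_x1))
```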
Wrap-up
Key questions:
- What is the population of interest?
- What is the sampling frame?
- How are the data generated?
Where is the randomness coming from?
DETOUR CONTINUED:
Why is the expected value a desirable summary of a probability distribution?
Random Variables
- Random variable: X
- Random ERROR: X - c
- LOSS: L(X - c)
Summarizing the Probability Distribution
EXPECTED LOSS, AKA RISK: E[L(X - c)]
Minimize the risk
Properties of Expected Value
The Expected Value Minimizes Risk
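For squared loss, the argument mirrors the sample version. A sketch of the standard decomposition, which uses the fact that E[X - E[X]] = 0:

```latex
\begin{aligned}
E\big[(X - c)^2\big]
  &= E\big[(X - E[X])^2\big]
     + 2\,(E[X] - c)\,E\big[X - E[X]\big]
     + (E[X] - c)^2 \\
  &= \operatorname{Var}(X) + (E[X] - c)^2 .
\end{aligned}
```

The second term is zero only at c = E[X], so the expected value minimizes the risk, and the minimum risk is Var(X).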
Can we make up for no Probability Sample with Big Data?