lect02, Thu 04/02

Data Life Cycle

Lecture 2

Announcements

Question: After downloading the notebook, the computer reports “No required application” installed. How do I open it? We’ll try to resolve this before the end of the week.

Data Life Cycle

A continuous loop that starts with trying to gain insight into the data.

  1. Question Formulation
  2. Data Design/Generation (experimentation)
  3. Data Analysis (EDA, visualization)
  4. Generalization

START SIMPLE

QUESTION: What is the typical family size (children only)?

We want to understand our population (family size): send out a survey to collect a sample, get the data, and then generalize to the population.

Probability (sampling distribution) ==> Data “Statistic” Inference (estimation, hypothesis testing) ==> Population

DATA:

Surveyed all 250 students at UCSB and asked for their family size

A single sample / counts.

ANALYSIS

Bar chart / histogram visualization provides a quick summary of the distribution

Can we provide a summary statistic?

One number, simple numeric summary

What number should we choose? Mean, median
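A minimal sketch of the two candidate summaries, plus a quick text version of the bar chart. The responses in `family_sizes` are made up for illustration; they are not the class survey data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical survey responses (children per family); NOT the class data.
family_sizes = [1, 2, 2, 3, 1, 4, 2, 6, 2, 3]

# Text bar chart: one row per family size, one '#' per response.
counts = Counter(family_sizes)
for size in sorted(counts):
    print(f"{size}: {'#' * counts[size]}")

sample_mean = mean(family_sizes)      # sensitive to the one family of 6
sample_median = median(family_sizes)  # middle value of the sorted responses
print(f"mean = {sample_mean}, median = {sample_median}")
```

With these made-up responses the mean sits above the median because a single large family pulls it up.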

DETOUR:

Why is the sample mean such a desirable summary?

Summarizing the Data

Example:

x1 = 2, c = 2, so the loss L is x1 - c = 0
x2 = 4, c = 2, so the loss L is x2 - c = 2

Average loss aka empirical risk

Empirical, because we are looking at the observed sample

Minimize the empirical risk: want to minimize over all c, solve the minimization problem (sum over all losses)

What’s a reasonable loss function to use?

Minimize the Average Loss

Squared error loss properties:

Over all possible values of summary statistic, c, minimize the squared error
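A small sketch of the empirical risk under squared error loss, using the two observations from the example (x1 = 2, x2 = 4). The function names `squared_loss` and `empirical_risk` are chosen here for illustration.

```python
def squared_loss(x, c):
    """Squared error loss for one observation x and summary c."""
    return (x - c) ** 2


def empirical_risk(xs, c, loss=squared_loss):
    """Average loss of the summary c over the observed sample xs."""
    return sum(loss(x, c) for x in xs) / len(xs)


xs = [2, 4]  # the two observations from the example

# Try a few candidate summaries c and report the average squared loss.
for c in (2, 3, 4):
    print(f"c = {c}: empirical risk = {empirical_risk(xs, c)}")
```

Among these three candidates, c = 3 (the sample mean of 2 and 4) gives the smallest average loss.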

Q: Isn’t it the formula for the variance?

(c can be arbitrary, so it’s not quite the variance; see below)

Refresher

“Sample mean” (x bar): Linear operation

Linear operation property: Scaling and shifting each observation scales and shifts the sample mean.
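The linearity property written out, as a sketch. The constants a (scale) and b (shift) are notation introduced here, not from the slides.

```latex
\[
\frac{1}{n}\sum_{i=1}^{n} (a\,x_i + b)
  = a\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + b
  = a\,\bar{x} + b
\]
```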

Minimize the Average Loss

How to minimize a function? Look for the place where the derivative is 0.

Calculus + algebra skills FTW!

Shows that c = the sample mean

When L is the squared loss, c = x-bar minimizes the average loss, i.e., the empirical risk.
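A sketch of the calculus step for the squared loss. R(c) is notation introduced here for the empirical risk as a function of c.

```latex
\begin{align*}
R(c) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - c)^2 \\
\frac{dR}{dc} &= -\frac{2}{n}\sum_{i=1}^{n} (x_i - c) = -2\,(\bar{x} - c) \\
\frac{dR}{dc} = 0 &\iff c = \bar{x},
\qquad \frac{d^2R}{dc^2} = 2 > 0 \ \text{(so this is a minimum)}
\end{align*}
```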

What does the empirical risk look like?

The Sample Average Minimizes Empirical Risk

The empirical risk at c = x-bar is less than or equal to the empirical risk at any other value of c. Because c can be arbitrary, the empirical risk is not quite the variance; it equals the variance of the observed data only when c is the mean.

This result is specific to the squared loss function. With a different loss function, the minimizer may be different.
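A numerical check of both points, as a sketch on a made-up sample. It searches a grid of candidate values c and compares the minimizer under squared loss with the minimizer under absolute loss; absolute loss is used here only as one example of a different loss function.

```python
from statistics import mean, median, pvariance

xs = [1, 2, 2, 3, 10]  # hypothetical sample with one large family


def empirical_risk(xs, c, loss):
    """Average loss of the summary c over the sample xs."""
    return sum(loss(x, c) for x in xs) / len(xs)


def squared(x, c):
    return (x - c) ** 2


def absolute(x, c):
    return abs(x - c)


# Candidate summaries c = 0.0, 0.1, ..., 10.0
grid = [i / 10 for i in range(0, 101)]
best_squared = min(grid, key=lambda c: empirical_risk(xs, c, squared))
best_absolute = min(grid, key=lambda c: empirical_risk(xs, c, absolute))

print("squared loss minimizer:", best_squared, "| sample mean:", mean(xs))
print("absolute loss minimizer:", best_absolute, "| sample median:", median(xs))

# At c = mean, the average squared loss is the variance with divisor n.
risk_at_mean = empirical_risk(xs, mean(xs), squared)
print("risk at mean:", risk_at_mean, "| pvariance:", pvariance(xs))
```

On this sample the squared loss picks the mean, while the absolute loss picks the median.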

Data Life Cycle

  1. Question Formulation (Family size)
  2. Data Design/Generation (survey)
  3. Data Analysis (loss function + minimizing risk ==> got “c”)
  4. In some cases we do not want/need to generalize.

If we have data for all individuals, we don’t need to generalize. ==> “Finite sample inference”

What does our data say about the larger population?

Consider the Question Carefully

Families: all shapes, all sizes

Angelina Jolie: 3 adopted and 3 biological children.

Who do I pick to figure out the measure?

Focus the Question

Questions for our question to help us focus:

From a female fertility perspective: estimate the number of children that a woman gives birth to.

Focus the Question

The Question gives focus to the Population that we want to study.

Data Life Cycle

Back to our diagram.

Population: women in USA in 2015. Survey them and make a larger inference about the population.

How well does UCSB represent the group of interest?

How might these characteristics impact the estimate of the number of children a US woman bears in her lifetime in 2015?

Bias up, Bias down, No impact

Breakout discussion results

42-45 min into the video

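“Size-biased sampling”: a family with multiple siblings has more chances of having a child in the class, so larger families are overrepresented.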

(Positive bias, we’ll overestimate)
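A sketch of why sampling through students overestimates. The counts in `women_by_children` are invented for illustration. Asking a randomly chosen child "how many children did your mother have?" weights each woman by her number of children, and women with no children are never reached at all.

```python
from fractions import Fraction

# Hypothetical population: number of children -> number of women.
women_by_children = {0: 15, 1: 25, 2: 35, 3: 15, 4: 10}

n_women = sum(women_by_children.values())
n_children = sum(k * n for k, n in women_by_children.items())

# Target quantity: average number of children per woman.
mean_per_woman = Fraction(n_children, n_women)

# What we get by sampling children: each woman with k children is
# reported k times, so her answer k carries weight k * n.
mean_seen_by_children = Fraction(
    sum(k * k * n for k, n in women_by_children.items()), n_children
)

print("average per woman:", float(mean_per_woman))
print("average reported by a random child:", float(mean_seen_by_children))
```

In this made-up population the child-based average is 23/9, noticeably above the per-woman average of 9/5.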

According to Pew Research Center

Women with less education tend to have more children.

26% of women with a high school diploma have 4+ children vs. 8% of women with a postgraduate degree

How might this impact the class average?

http://www.pewsocialtrends.org/2015/05/07/family-size-among-mothers/

The data used in these analyses are designed to assess women’s fertility, and as such a “mother” is here defined as any woman who has given birth. However, many women who do not bear their own children are indeed mothers.

Population of Interest

Sample: how we select individuals from the frame.

Possible case

All the same: Sample = Sampling Frame = Population

Where/when have we recently seen this?

Who are we missing? How might this affect our results?

Common Assumption when sampling

Sampling Frame = Population

Scenario: Access to all members of the Population when sampling

(Common Assumption)

Administrative Data

Sampling Frame = Sample

Government data: e.g., birth certificate

Most common scenario

Little overlap

How are the data generated?

It doesn’t matter what fancy tools / analyses you use unless you understand where the data came from.

DETOUR:

  1. The simple random sample

  2. Why is a probability sample so desirable?

The Simple Random Sample (SRS)

Example: N = 4 and n = 2 gives 6 possible samples (4 choose 2), each equally likely.
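The six possible samples can be listed directly. A minimal sketch; the labels A to D for the four individuals are placeholders.

```python
from fractions import Fraction
from itertools import combinations

population = ["A", "B", "C", "D"]  # N = 4, placeholder labels
n = 2

# Every unordered subset of size n is one possible simple random sample.
samples = list(combinations(population, n))
for s in samples:
    print(s)

# Under an SRS each sample is equally likely.
p_each = Fraction(1, len(samples))
print("number of samples:", len(samples), "| chance of each:", p_each)
```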

The Advantages of a SRS

Start Simple

Suppose our population contains only 10 mothers and we take a Simple Random Sample of 3 for our survey.

A table with true numbers.

10 choose 3 possible samples (120 possible samples)

If I got “unlucky”, I chose 2 mothers with 1 child and 1 mother with 2. (Why “unlucky”? Because I selected the individuals with the smallest numbers of children, so my sample average is much smaller than the true average in the population.)

Can I compute the chance of getting “unlucky”? That’s what we did in 120A and what 120B gets at.
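With only 120 possible samples, the chance can be found by listing them all. A sketch under an assumption: the table of true values is not reproduced in these notes, so the list `children` below is a hypothetical population of 10 mothers, and "unlucky" is taken to mean drawing the sample with the smallest possible average.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical numbers of children for the 10 mothers (NOT the slide's table).
children = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]

# All 10-choose-3 samples, identified by the mothers' positions in the list.
samples = list(combinations(range(len(children)), 3))
sample_means = [Fraction(sum(children[i] for i in s), 3) for s in samples]

population_mean = Fraction(sum(children), len(children))
lowest = min(sample_means)

# Each sample is equally likely, so the chance is (favorable) / (total).
p_unlucky = Fraction(sum(m == lowest for m in sample_means), len(samples))

print("possible samples:", len(samples))
print("population mean:", float(population_mean))
print("lowest sample mean:", float(lowest), "| chance:", p_unlucky)
```

In this made-up population the lowest sample (two mothers with 1 child, one with 2) has chance 1/60.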

Formal Set Up

Like a Lotto Ball machine with the number of children as the number on the ball.

Drawing them without replacement

X1: The number of children for the first mother chosen
X2: The number of children for the second mother chosen
X3: The number of children for the third mother chosen

Random variable: we don’t know what we’ll get.

Truth table

Probability Distribution

What is the expected value of X1?
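A sketch of the computation, again on a hypothetical population because the slide's table is not reproduced in these notes. On the first draw every mother is equally likely, so the distribution of X1 is just the population's distribution, and its expected value is the population mean. Enumerating all ordered draws without replacement shows the same holds for X2.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Hypothetical numbers of children for the 10 mothers (NOT the slide's table).
children = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]
N = len(children)

# Probability distribution of X1: each mother has chance 1/N of being first.
dist_x1 = {k: Fraction(cnt, N) for k, cnt in sorted(Counter(children).items())}
expected_x1 = sum(k * p for k, p in dist_x1.items())

# All ordered draws of 3 mothers without replacement, each equally likely.
ordered = list(permutations(range(N), 3))
expected_x2 = Fraction(sum(children[d[1]] for d in ordered), len(ordered))

for k, p in dist_x1.items():
    print(f"P(X1 = {k}) = {p}")
print("E[X1] =", expected_x1, "| E[X2] =", expected_x2)
```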

Wrap-up

Key questions:

Where is the randomness coming from?


DETOUR CONTINUED:

Why is the expected value a desirable summary of a probability distribution?

Random Variables

Random variable: X
Random error: X - c
Loss: (X - c)^2 (squared error loss)

Summarizing the Probability Distribution

EXPECTED LOSS:

AKA RISK

Minimize the risk

Properties of Expected Value

The Expected Value Minimizes Risk
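A sketch of the argument for squared error loss, mirroring the sample version. The symbol \mu = E[X] is notation introduced here.

```latex
\begin{align*}
R(c) = E\big[(X - c)^2\big]
  &= E\big[(X - \mu)^2\big] + 2(\mu - c)\,E[X - \mu] + (\mu - c)^2 \\
  &= \operatorname{Var}(X) + (\mu - c)^2
  \qquad \text{since } E[X - \mu] = 0 \\
  &\ge \operatorname{Var}(X),
  \quad \text{with equality exactly when } c = \mu = E[X]
\end{align*}
```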

Can we make up for no Probability Sample with Big Data?