Previous Lecture

lect05, Tue 04/14

Pandas and Question formulation

Question formulation

What do we want to know? Are we generating new hypotheses?

Data Acquisition / Cleaning

How will we collect the data?

Related to metrics for success.

Population, frame, sample.

How do we organize the data for analysis?

Exploratory Data Analysis

What’s the best way to visualize the data?

Inference and Prediction

How robust are our conclusions / what is our uncertainty?

Can we come up with a robust answer despite the uncertainty?

Pandas

Pandas Data Structures

Data Frame: 2D tabular data
Series: 1D data
Indices: a sequence of labels
loc vs. iloc (think of the i in iloc as indicating an integer position; loc selects by label)
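
A minimal sketch of these structures and of the loc/iloc distinction. The toy data, the "CMPSC" and "MATH" major labels, and the integer row labels are hypothetical, chosen only so that labels and positions differ.

```python
import pandas as pd

# Hypothetical toy data; the row labels 10, 20, 30 are deliberately
# different from the positions 0, 1, 2.
df = pd.DataFrame(
    {"Major": ["STSDS", "CMPSC", "MATH"], "Random Number": [0.3, -1.2, 0.8]},
    index=[10, 20, 30],
)

series = df["Random Number"]  # a Series: one column of 1D data
labels = df.index             # an Index: the sequence of row labels

by_label = df.loc[20, "Major"]  # loc: row LABEL 20, column label "Major"
by_position = df.iloc[1, 0]     # iloc: second row, first column (integer positions)

print(by_label, by_position)
```

Here both lookups reach the same cell, but df.loc[1, "Major"] would raise a KeyError because no row carries the label 1.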

Goals for today

Discuss aggregation

A case study

Method chaining

Also sometimes called “piping”: making multiple method calls sequentially, where each call operates on the object returned by the previous one.
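
As an illustration, the same filter, sort, and truncate steps can be written with intermediate variables or as one chain. This is only a sketch on hypothetical toy data; the threshold 1.5 and the major labels "A" and "B" are made up.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Major": ["A", "B", "A", "B", "A"],
    "Random Number": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Step by step, naming every intermediate result.
filtered = df[df["Random Number"] > 1.5]
ordered = filtered.sort_values("Random Number", ascending=False)
step_by_step = ordered.head(2)

# The same pipeline as a single chain: each method is called on the
# object returned by the previous call.
chained = (
    df
    .loc[lambda d: d["Random Number"] > 1.5]
    .sort_values("Random Number", ascending=False)
    .head(2)
)

print(chained)
```

Wrapping the chain in parentheses lets each call sit on its own line, which tends to keep longer pipelines readable.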

Groupby

Group by Major
Mean of “Random Number”
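
A small sketch of this aggregation, assuming a table with a "Major" column and a "Random Number" column. The values and the major labels "A" and "B" are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two majors.
df = pd.DataFrame({
    "Major": ["A", "B", "A", "B", "A"],
    "Random Number": [1.0, 2.0, 3.0, 6.0, 5.0],
})

# Split the rows by major, then average "Random Number" within each group.
group_means = df.groupby("Major")["Random Number"].mean()

# Group sizes (N_g) are useful when judging how noisy each mean is.
group_sizes = df.groupby("Major").size()

print(group_means)
print(group_sizes)
```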

Each x_i is drawn from a standard Normal distribution (mean 0, std 1)

E[x_i] = 0, Var(x_i) = 1

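Y = (1/N_g) * sum(x_i), where the sum runs over the N_g values of i in group g

E[Y] = (1/N_g) * sum(E[x_i]) = 0

Var(Y) = (1/N_g**2) * sum(Var(x_i)) = (1/N_g**2) * N_g = 1/N_g, assuming the x_i are independent

SD(Y) = sqrt(Var(Y)) = 1/sqrt(N_g)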

STSDS =
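
One way to check the derivation numerically is to simulate many groups of the same size and compare the spread of their means with 1/sqrt(group size). This is a sketch under stated assumptions: the group size of 25, the number of trials, and the seed are arbitrary choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n_g = 25                        # hypothetical group size
n_trials = 20_000               # number of simulated groups

# Each row is one simulated group of n_g independent standard Normal draws.
samples = rng.standard_normal((n_trials, n_g))

# Y for each simulated group: the mean of its n_g values.
group_means = samples.mean(axis=1)

empirical_mean = group_means.mean()
empirical_sd = group_means.std()
theoretical_sd = 1 / np.sqrt(n_g)

print(f"mean of Y: {empirical_mean:.4f} (theory: 0)")
print(f"SD of Y:   {empirical_sd:.4f} (theory: {theoretical_sd:.4f})")
```

Smaller groups give noisier means, so a major with few students can show a group mean far from 0 even though every x_i has mean 0.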

Multi-index

groupby
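
A minimal sketch of how a multi-index can arise from groupby: grouping by more than one column returns a result indexed by every combination of the grouping keys. The "Year" column and all values here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "Year" is an invented second grouping column.
df = pd.DataFrame({
    "Major": ["A", "A", "B", "B", "B"],
    "Year": [1, 2, 1, 1, 2],
    "Random Number": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Grouping by two columns yields a Series whose index is a MultiIndex
# with levels "Major" and "Year".
means = df.groupby(["Major", "Year"])["Random Number"].mean()

one_cell = means.loc[("B", 1)]  # select with a tuple of labels, one per level
flat = means.reset_index()      # move the index levels back into ordinary columns
wide = means.unstack("Year")    # pivot the "Year" level into columns

print(means)
print(wide)
```

reset_index and unstack are two common ways to get back to a flat table when the multi-index is awkward for a later step.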