lect05, Tue 04/14
Pandas and Question formulation
Question formulation
 What do we want to know? Are we generating new hypotheses?
Data Acquisition / Cleaning
 How will we collect the data?
Related to metrics for success.
Population, frame, sample.
 How do we organize the data for analysis?
Exploratory Data Analysis
 What’s the best way to visualize the data?
Inference and Prediction
 How robust are our conclusions / what is our uncertainty?
Can we come up with a robust answer despite the uncertainty?
Pandas
Pandas Data Structures
 Data Frame: 2D tabular data

 Series: 1D data
 Index: a sequence of labels
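A minimal sketch of how the three structures relate, assuming a small made-up table (the values and the row labels "a", "b", "c" are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data; not taken from the lecture's dataset.
df = pd.DataFrame(
    {"Major": ["CS", "Stats", "CS"], "Random Number": [0.5, -1.2, 0.3]},
    index=["a", "b", "c"],
)

col = df["Random Number"]  # selecting one column gives a Series (1D data)
labels = df.index          # the row labels form an Index
columns = df.columns       # the column labels are also an Index

print(type(df).__name__, df.shape)
print(type(col).__name__, col.shape)
print(list(labels), list(columns))
```

The Series keeps the same row labels as the DataFrame it came from, which is what lets pandas align data by label rather than by position.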
loc vs. iloc (think of the i in iloc as indicating an integer position)
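The difference is easiest to see when the row labels are not 0, 1, 2. The sketch below assumes a hypothetical table whose row labels are 10, 20, 30:

```python
import pandas as pd

# Hypothetical toy data with integer row labels that differ from positions.
df = pd.DataFrame(
    {"Major": ["CS", "Stats", "Math"], "Random Number": [0.5, -1.2, 0.3]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "Major"]   # row LABELED 20, column labeled "Major"
by_position = df.iloc[2, 0]      # third row, first column (integer positions)

label_slice = df.loc[10:20]      # label slices include both endpoints
position_slice = df.iloc[0:1]    # position slices exclude the end, like Python lists

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note that slicing behaves differently: loc includes the end label, while iloc follows the usual half-open Python convention.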
Goals for today
Discuss aggregation
 A case study
Method chaining
Also sometimes called “piping”. Making multiple method calls sequentially, where each call acts on the object returned by the previous one.
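One possible illustration of chaining, assuming a made-up table with "Major" and "Random Number" columns. Each call returns a new object, and the next call acts on it:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Major": ["CS", "Stats", "CS", "Math", "Stats"],
    "Random Number": [0.5, -1.2, 0.3, 2.0, 0.8],
})

result = (
    df
    .loc[lambda d: d["Random Number"] > 0]          # keep positive values only
    .sort_values("Random Number", ascending=False)  # largest first
    .reset_index(drop=True)                         # renumber rows 0, 1, 2, ...
    .head(3)                                        # top three rows
)

print(result)
```

Wrapping the chain in parentheses allows one method per line, which tends to be easier to read and to comment than a single long line or a series of temporary variables.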
Groupby
 Group by Major
 Mean of “Random Number”
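A small sketch of this aggregation, assuming a hypothetical table with a "Major" column and a "Random Number" column (the values here are invented so the means are easy to check by hand):

```python
import pandas as pd

# Hypothetical toy data; the real lecture data is not reproduced here.
df = pd.DataFrame({
    "Major": ["CS", "Stats", "CS", "Math", "Stats", "CS"],
    "Random Number": [1.0, -1.0, 2.0, 0.5, 3.0, 3.0],
})

# Split the rows by Major, then average "Random Number" within each group.
group_means = df.groupby("Major")["Random Number"].mean()

# Group sizes matter when comparing means: small groups are noisier.
group_sizes = df.groupby("Major").size()

print(group_means)
print(group_sizes)
```

The result is a Series indexed by the group key, with one row per distinct Major.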
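x_i comes from a standard Normal distribution (mean 0, std 1)
E[x_i] = 0, Var(x_i) = 1
Y = (1/N_g) * sum(x_i), where the sum runs over the rows i in group g
E[Y] = (1/N_g) * sum(E[x_i]) = 0
Var(Y) = (1/N_g**2) * N_g = 1/N_g, assuming the x_i are independent
SD(Y) = sqrt(Var(Y)) = 1/sqrt(N_g)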
STSDS =
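The claim that the variance of a group mean is 1/N can be checked numerically. The following is only a sketch under stated assumptions: the group size n_g = 25, the number of simulated groups, and the seed are arbitrary choices, and the draws are assumed independent.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_g = 25           # hypothetical group size
n_trials = 20_000  # number of simulated groups

# Each row is one simulated group of n_g independent standard Normal draws.
samples = rng.standard_normal(size=(n_trials, n_g))
group_means = samples.mean(axis=1)  # one Y per simulated group

empirical_mean = group_means.mean()
empirical_var = group_means.var()
empirical_sd = group_means.std()

print("mean of Y:", empirical_mean, "theory:", 0.0)
print("var of Y: ", empirical_var, "theory:", 1 / n_g)
print("SD of Y:  ", empirical_sd, "theory:", 1 / np.sqrt(n_g))
```

The practical consequence for groupby results: majors with few students will show group means that stray further from 0 purely by chance.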
MultiIndex
groupby
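Grouping by more than one column is a common way to end up with a MultiIndex. The sketch below assumes a hypothetical second grouping column, "Year", which is not part of the lecture's data:

```python
import pandas as pd

# Hypothetical toy data; "Year" is an invented second grouping column.
df = pd.DataFrame({
    "Major": ["CS", "CS", "Stats", "Stats", "CS"],
    "Year": [1, 2, 1, 1, 1],
    "Random Number": [1.0, 2.0, 3.0, 5.0, 3.0],
})

# Grouping by two columns gives a result indexed by (Major, Year) pairs.
means = df.groupby(["Major", "Year"])["Random Number"].mean()

cs_year1 = means.loc[("CS", 1)]  # select one (Major, Year) combination
cs_only = means.loc["CS"]        # select by the outer level only
flat = means.reset_index()       # turn the index levels back into columns

print(means)
print(flat)
```

Calling reset_index is often the simplest way back to an ordinary table when the hierarchical index gets in the way.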