lect05, Tue 04/14
Pandas and Question formulation
Question formulation
- What do we want to know? Are we generating new hypotheses?
Data Acquisition / Cleaning
- How will we collect the data?
Related to metrics for success.
Population, frame, sample.
- How do we organize the data for analysis?
Exploratory Data Analysis
- What’s the best way to visualize the data?
Inference and Prediction
- How robust are our conclusions / what is our uncertainty?
Can we come up with a robust answer despite the uncertainty?
Pandas
Pandas Data Structures
- Data Frame: 2D tabular data
- Series: 1D data
- Index: a sequence of labels
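As a minimal sketch of how these structures relate to each other, the example below builds a tiny table and pulls out one column and the row labels. The column names, row labels, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame(
    {"Major": ["CS", "Math", "CS"], "Random Number": [0.5, -1.2, 0.3]},
    index=["alice", "bob", "carol"],
)

col = df["Random Number"]  # selecting a single column gives a 1D Series
labels = df.index          # the row labels are stored in an Index

print(type(df).__name__)      # DataFrame
print(type(col).__name__)     # Series
print(type(labels).__name__)  # Index
print(df.shape)               # (3, 2)
```

Note that the Series keeps the same row labels as the table it came from, which is what lets pandas align data by label.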
loc vs. iloc (think of the i in iloc as indicating an integer)
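A short sketch of the difference, using a hypothetical table with string row labels: loc looks things up by label, while iloc looks them up by integer position.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame(
    {"Major": ["CS", "Math", "CS"], "Random Number": [0.5, -1.2, 0.3]},
    index=["alice", "bob", "carol"],
)

by_label = df.loc["bob", "Random Number"]  # row label, column label
by_position = df.iloc[1, 1]                # row position, column position
print(by_label, by_position)               # both refer to the same cell

# Slicing differs: loc includes the end label, iloc excludes the end position.
print(df.loc["alice":"bob"].shape)  # (2, 2)
print(df.iloc[0:1].shape)           # (1, 2)
```

The slicing behavior is the detail that most often causes surprises when switching between the two.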
Goals for today
Discuss aggregation:
- A case study
Method chaining
Also sometimes called “piping”. Making multiple method calls sequentially, where each call returns an object that the next call operates on.
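To illustrate, here is a sketch that writes the same three steps twice: once with intermediate variables and once as a single chain. The data and the particular steps (filter, sort, take the top rows) are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame(
    {"Major": ["CS", "Math", "CS", "Math"],
     "Random Number": [0.5, -1.2, 0.3, 2.0]}
)

# Step by step, naming every intermediate result.
positive = df[df["Random Number"] > 0]
ordered = positive.sort_values("Random Number", ascending=False)
step_by_step = ordered.head(2)

# The same steps as one chain: each method is called on the object
# returned by the previous one.
chained = (
    df[df["Random Number"] > 0]
    .sort_values("Random Number", ascending=False)
    .head(2)
)

print(chained)
```

Wrapping the chain in parentheses lets each call sit on its own line, which tends to be easier to read and to comment out while debugging.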
Groupby
- Group by Major
- Mean of “Random Number”
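A minimal sketch of this computation, assuming a table with a "Major" column and a "Random Number" column filled with standard Normal draws. The list of majors, the number of rows, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)    # seed chosen arbitrarily
majors = ["CS", "Math", "Stats"]  # hypothetical majors

df = pd.DataFrame({
    "Major": rng.choice(majors, size=300),
    "Random Number": rng.standard_normal(300),
})

# Split the rows by major, then average "Random Number" within each group.
group_means = df.groupby("Major")["Random Number"].mean()
print(group_means)
```

The result is a Series indexed by major, with one mean per group. Because the numbers are random, the means should be near 0 but not exactly 0.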
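$x_i$ comes from a random Normal distribution (mean 0, std 1), so
$E[x_i] = 0$, $\mathrm{Var}(x_i) = 1$
$Y = \frac{1}{N_g} \sum_{i \in g} x_i$, where the sum runs over the $N_g$ rows $i$ in group $g$
$E[Y] = \frac{1}{N_g} \sum_{i \in g} E[x_i] = 0$
$\mathrm{Var}(Y) = \frac{1}{N_g^2} \cdot N_g = \frac{1}{N_g}$ (the $x_i$ are independent, so their variances add)
$\mathrm{SD}(Y) = \sqrt{\mathrm{Var}(Y)} = \frac{1}{\sqrt{N_g}}$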
STSDS =
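The calculation above can be checked with a quick simulation. This is only a sketch: it assumes independent standard Normal draws, and the group size, number of simulated groups, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily
n_g = 25                         # hypothetical group size
n_trials = 20_000                # number of simulated groups

# Each row is one simulated group of n_g independent N(0, 1) draws.
samples = rng.standard_normal((n_trials, n_g))
group_means = samples.mean(axis=1)

empirical_sd = group_means.std()
theoretical_sd = 1 / np.sqrt(n_g)  # sqrt(Var(Y)) with Var(Y) = 1/N

print(empirical_sd, theoretical_sd)
```

The two printed values should be close to each other, and the average of the group means should be close to 0. Smaller groups give noisier means, which is why small majors tend to show the most extreme averages in the groupby result.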
Multi-index
groupby
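One common way a multi-index shows up is grouping by more than one column. The sketch below assumes a hypothetical second column, "Year", alongside "Major". The values are made up.

```python
import pandas as pd

# Hypothetical toy data; "Year" is an invented second grouping column.
df = pd.DataFrame({
    "Major": ["CS", "CS", "Math", "Math", "CS"],
    "Year": [1, 2, 1, 1, 1],
    "Random Number": [0.5, -1.0, 0.3, 0.7, 1.5],
})

# Grouping by two columns gives a result indexed by (Major, Year) pairs.
means = df.groupby(["Major", "Year"])["Random Number"].mean()
print(type(means.index).__name__)  # MultiIndex

# Select one (Major, Year) combination with a tuple of labels.
print(means.loc[("CS", 1)])        # 1.0

# Select every Year for one Major with just the outer label.
print(means.loc["CS"])

# Move the inner level into columns to get a regular 2D table.
wide = means.unstack("Year")
print(wide)
```

Combinations that never occur in the data, such as Math in year 2 here, appear as NaN after unstacking.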