lect01, Tue 03/31
Introductions and Course Overview
Starts about 10 min into the video
- Alex Franks (Prof. Franks)
- Yekaterina Kharitonova (Prof. K)
- Franky (Meng) - STAT
- Brian (Lim) - CS
- Piazza for all course-related announcements and questions
- Course website: https://ucsb-ds.github.io/s20
- JupyterHub (for labs and homework): ds100.lsit.ucsb.edu
- (+ Gradescope for submissions)
Textbook / references
- Reference text: Python Data Science Handbook
- Lecture notes
- 50% - approximately 5 homeworks
- 25% - Labs on Wednesdays (not graded on attendance, submitted online, shorter than HW)
- 20% - Final project (due Wed, June 10)
- like a “double homework”
- more info later in the quarter
- 5% - Participation
- homework assignments usually span 1-2 weeks
- late homework will be accepted up to 2 days late
- less than 24 hours late: 80% credit
- 24-48 hours late: 60% credit
- after 48 hours: no credit (0%)
- contact the instructors in case of an emergency
- Labs are required and need to be turned in at the end of the lab as a record of your attendance.
- Section attendance: try to stay with the time you are assigned on GOLD.
- Submit the lab by Friday at 5pm
- Lecture attendance is strongly encouraged but not the only way to participate.
- join the lecture and ask questions (raise your hand on Zoom to be unmuted to ask your question)
- join the lab and ask questions
- participate on Piazza (asking/answering questions)
- office hours
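The grading breakdown and late-credit schedule listed above can be sketched as a small calculation. This is only an illustration: the names `WEIGHTS`, `late_credit`, and `course_grade` are made up, the example scores are hypothetical, and treating exactly 24 hours late as the 60% tier is an assumption, since the notes do not say which side of the boundary it falls on.

```python
# Weights taken from the grading breakdown (fractions of the course grade).
WEIGHTS = {
    "homework": 0.50,
    "labs": 0.25,
    "final_project": 0.20,
    "participation": 0.05,
}


def late_credit(hours_late: float) -> float:
    """Fraction of credit kept for work submitted `hours_late` hours after the deadline."""
    if hours_late <= 0:
        return 1.0  # on time
    if hours_late < 24:
        return 0.8  # less than 24 hours late
    if hours_late <= 48:
        return 0.6  # 24-48 hours late (boundary handling is an assumption)
    return 0.0      # after 48 hours: no credit


def course_grade(scores: dict) -> float:
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student, not real data.
example = {"homework": 90, "labs": 100, "final_project": 85, "participation": 100}
print(course_grade(example))

# A homework scored 95 but turned in 30 hours late keeps 60% of its credit.
print(95 * late_credit(30))
```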
In-class group work
Students will be split into groups of about 3 to introduce themselves and discuss the question that was posed.
Timestamp in the video 32:36
Programming Languages for Data Science
- R (PSTAT 10), created specifically for statistical computing and graphics
- Python (CS 8), “general purpose” programming language + important packages
Who is this class for?
- advanced beginners
- not for those who have taken advanced stats or CS courses
Potential List of Topics
Taking a computational / conceptual approach instead of a rigorous mathematical treatment
What is a data scientist?
T-shaped vs $\Pi$-shaped
What is data science?
Domain expertise helps to know which tools to use and when.
Pokémon-card-style collection of visualizations explaining data science
Learning from Data
The way to learn about the world is to take a claim and prove that it’s false.
Induction. Can we generalize from specific instances?
How is knowledge created from a sociological perspective?
- science is a communal activity
- e.g., threshold for p-values
$H_0$ (null hypothesis): “All swans are white.”
$H_1$ (alternative hypothesis): “Not all swans are white.”
$H_0$ (null hypothesis): “The ivory-billed woodpecker is extinct.”
Can I be sure? Maybe I have been missing it.
Induction and Evidence
- data provides evidence for the truth of a conclusion
- argue about what is probable rather than what is impossible
- often about inferring general principles from specific observations
- Bayes’ Theorem for describing probabilities based on evidence
Inference: “Black swans are rare.” Fraction of black swans in the world?
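The swan example shows how Bayes' Theorem lets us argue about what is probable rather than what is impossible. Below is a minimal sketch, not something from the lecture: the function name `posterior_all_white`, the 50/50 prior, and the assumed 1% black-swan fraction under the alternative are all invented for illustration. Each white swan seen makes "all swans are white" more probable, but never certain.

```python
def posterior_all_white(n_white_seen: int, prior: float = 0.5,
                        black_fraction: float = 0.01) -> float:
    """Posterior probability that all swans are white after seeing
    `n_white_seen` white swans in a row and no black ones.

    Bayes' Theorem: P(H | data) = P(data | H) * P(H) / P(data).
    `prior` and `black_fraction` are made-up illustrative values.
    """
    # If all swans are white, every observed swan is white with probability 1.
    like_all_white = 1.0
    # If a fraction of swans is black, n white swans in a row is less likely.
    like_some_black = (1.0 - black_fraction) ** n_white_seen
    evidence = prior * like_all_white + (1.0 - prior) * like_some_black
    return prior * like_all_white / evidence


for n in (0, 10, 100, 1000):
    print(n, round(posterior_all_white(n), 4))
```

Under this setup a single black swan would drive the posterior to zero, since its likelihood under "all swans are white" is zero. That asymmetry is why claims are easier to refute than to prove.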
The role of models
- Make assumptions about how the data are generated.
- Models can still be used to develop statistical tests
- Can also be used to make predictions / forecasts and describe sources of variability
- Can (and should) be continuously refined
- probability (sampling distribution): population ==> data
Given facts about the world, what might I see?
- inference: data ==> population
Learn the facts about the world from what I see.
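The two directions noted above, from facts about the world to data and from data back to facts about the world, can be sketched with a short simulation. This is only an illustration using Python's standard library: the helper `sample_swans`, the "true" black-swan fraction of 0.05, and the sample size are assumptions, not values from the lecture.

```python
import random


def sample_swans(n: int, black_fraction: float, seed: int = 0) -> list:
    """Simulate observing n swans; True means the swan is black.

    This is the model: an assumption about how the data are generated.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.random() < black_fraction for _ in range(n)]


# Probability direction: given a (hypothetical) fact about the world,
# what data might I see?
true_fraction = 0.05
sample = sample_swans(n=1000, black_fraction=true_fraction)

# Inference direction: given only the data, estimate the fact about the world.
estimate = sum(sample) / len(sample)
print("estimated fraction of black swans:", estimate)
```

Rerunning with different seeds gives slightly different estimates. That spread is what the sampling distribution describes.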
DS 100 philosophy
- Raw data is not information (pandas)
- Information is not knowledge (EDA)
- visualization (Altair)
- Knowledge is not understanding (domain expertise)
- Understanding is not wisdom (Ethical data science, consequences, data privacy)
Wisdom and Data Science
Questions to ask yourself
Mark Twain’s quote about statistics.
UC Berkeley Gender Bias case (1973)
- graduate school admissions
What possible truths are consistent with this information? (Timestamp in the video: 1:10:11?)
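One possibility consistent with an aggregate gap in admission rates is that the groups applied to departments with very different admission rates. The sketch below uses invented counts (they are not the 1973 Berkeley figures) and hypothetical department names. In it, women are admitted at a higher rate within every department, yet at a lower rate overall.

```python
# Invented counts: (applicants, admitted) per department and group.
# These are NOT the 1973 Berkeley figures.
applications = {
    "Dept X": {"men": (80, 48), "women": (20, 14)},  # easier to get into
    "Dept Y": {"men": (20, 4), "women": (80, 20)},   # harder to get into
}


def admit_rate(applied: int, admitted: int) -> float:
    """Fraction of applicants who were admitted."""
    return admitted / applied


def overall_rate(group: str) -> float:
    """Admission rate for a group, pooled across all departments."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admit_rate(applied, admitted)


# Rates within each department.
for name, dept in applications.items():
    for group, (applied, admitted) in dept.items():
        print(name, group, f"{admit_rate(applied, admitted):.0%}")

# Rates after pooling the departments.
for group in ("men", "women"):
    print("overall", group, f"{overall_rate(group):.0%}")
```

The pooled numbers alone cannot distinguish this scenario from one with a gap inside each department, so the aggregate figure is consistent with more than one underlying truth.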