Website for the UCSB Data Science Capstone Preparation Workshop
Welcome!
pip or Anaconda.
Image source: Twitter @oxOspy
Python is a well-rounded language with simple syntax and a gentle learning curve, but compared to traditional general-purpose programming languages like Java or C/C++, it is slow. Procedures that take seconds to complete in Java or C++ may take a minute when run in Python. With only its native data structures, Python is also less efficient than Matlab for scientific computation (work that deals with matrices). This inefficiency stems largely from the fact that Python is a dynamically typed language: no type declaration of variables is required, types are checked at run time rather than at compile time, and a variable can hold data of different types at different points in your code. The language trades some computing efficiency for freedom when coding.
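To make the idea of dynamic typing concrete, here is a minimal sketch using only built-in Python. The function name `add` and the variable names are made up for illustration.

```python
# A name is only a label; the type belongs to the object it currently refers to.
value = 42
print(type(value).__name__)   # the name refers to an int

value = "forty-two"
print(type(value).__name__)   # the same name now refers to a str


def add(a, b):
    # No type declarations: Python decides what "+" means when the call runs.
    return a + b


sum_of_ints = add(1, 2)       # integer addition
joined = add("a", "b")        # string concatenation

# A type mismatch is only detected at run time, when the line executes.
try:
    add(1, "b")
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```

Because the interpreter has to inspect the types of `a` and `b` on every call, each operation carries overhead that a statically typed, compiled language avoids.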
NumPy was introduced to alleviate the above problem and is the essential building block for scientific computing in Python. NumPy offers new data types (numpy.int16, numpy.int32, numpy.float64, etc.) that are used alongside Python’s native data types. NumPy’s n-dimensional array is a better alternative to Python’s lists (or nested lists) for scientific computing. A NumPy array holds data of a single type, while a Python list can hold data of different types at once; this restriction is what makes NumPy arrays better at storing and manipulating matrices. Note that NumPy’s core is implemented in C, so its operations are quite fast.
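The contrast between a list and an array can be sketched as follows. The variable names and values are illustrative only, and the example assumes NumPy is installed.

```python
import numpy as np

mixed_list = [1, "two", 3.0]      # a Python list can mix types freely
numbers = [1, 2, 3, 4]

# Every element of the array is stored with the same fixed-size type.
arr = np.array(numbers, dtype=np.int32)
print(arr.dtype)

# With a list, element-wise work needs an explicit Python-level loop...
doubled_list = [2 * x for x in numbers]
# ...while the array applies the operation to all elements in compiled code.
doubled_arr = 2 * arr

# A nested list becomes a 2-D array, which behaves like a matrix.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
product = matrix @ matrix         # matrix multiplication
print(matrix.shape, matrix.dtype)
```

The single-type restriction is what lets NumPy store the data in one contiguous block of memory and hand the whole operation to compiled code.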
You can read more here
From now on, you will see that nearly every scientific library in Python is either built upon NumPy or can be integrated with it.
SciPy (Science + Python) is a library developed for scientific computing in Python. SciPy is built on NumPy, taking advantage of NumPy’s efficient data structures and computations. The library consists of a large number of modules, each of which corresponds to a particular scientific topic. Some useful modules that we may use in data science:
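As a rough sketch of how SciPy modules operate on NumPy arrays, the example below uses two modules chosen here for illustration, `scipy.linalg` and `scipy.optimize`. The small linear system and the quadratic function are made up.

```python
import numpy as np
from scipy import linalg, optimize

# scipy.linalg: solve the linear system A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)

# scipy.optimize: find the minimum of a one-variable function.
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)
print(result.x)
```

Both calls take and return NumPy arrays or plain floats, so results can be passed straight on to other libraries in the ecosystem.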
Pandas is a library for handling tabular data in Python. It is also built on top of NumPy. Pandas’ main data structure is the pandas.DataFrame, which is similar to R’s data.frame. Pandas integrates well with other scientific libraries like SciPy, Matplotlib, or Scikit-learn.
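A minimal sketch of a DataFrame and its link to NumPy is shown below. The column names and measurements are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements: one row per person, one column per variable.
data = np.array([[1.70, 65.0],
                 [1.82, 80.0],
                 [1.65, 58.0]])

# Wrap the NumPy array in a DataFrame with labeled columns.
df = pd.DataFrame(data, columns=["height_m", "weight_kg"])

# Column arithmetic is vectorized, just like with NumPy arrays.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Rows can be filtered with a boolean condition.
tall = df[df["height_m"] > 1.68]

# The underlying values can be handed back to NumPy at any time.
values = df.to_numpy()
print(df)
```

Moving between `np.array` and `pd.DataFrame` in this way is what makes it straightforward to pass tabular data on to the other libraries.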
Matplotlib was originally written to emulate the plotting interface of Matlab, which used to be the go-to language for scientific computing before Python’s data science ecosystem became popular. As people moved to Python, they wanted a familiar way to make plots in the new language, so Matplotlib was modeled on Matlab’s plotting commands. As a result, programming with Matplotlib does not always feel very Pythonic.
We have put together a comprehensive slide deck that summarizes the basic syntax and functions of the above Python libraries.
You can access the slides here.