Learning Goals

Website for the UCSB Data Science Capstone Preparation Workshop

Getting Started

Development environment

Code development is usually done with an integrated development environment (IDE). The Python interpreter is software that executes code written in the Python programming language. An interactive interpreter is a program that allows the user to execute one line of code at a time.

Note that the final version of Python 3.0 was released on December 3rd, 2008. Python 3 is not backwards compatible: many Python 2.7 programs cannot run unmodified on Python 3.0 or later interpreters.

You can also write your code using a plain-text editor and then run it on the command line (e.g., via a Terminal).

Problem-solving

Develop pseudocode

We highly encourage you to adopt iterative program development, where you break up the problem into smaller parts, and then develop and test each part individually, iteratively improving and refining it. For instance, if you need to count how many students from each class level are on a roster stored as a CSV file, the first step of iterative program development is to test that you can open and close the file, and read/print its contents. (See the note in the Workflow section regarding working with large files.) The next step is to figure out and note down what kind of processing you need to do, what values you need to retrieve, store, etc. Write down these notes as instructions that you can give to someone else to follow.

A set of instructions that we might later translate into code can be represented by pseudocode: it is not real code but provides the steps that can then be converted into any programming language.

Pseudocode can be used as a way to represent an algorithm, which describes a solution to a problem.
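As a rough illustration of turning pseudocode into code, here is a minimal sketch of the roster example above. The roster contents, the column names (name, level), and the function name count_levels are all assumptions made for this example; a real roster would be read from a file and may be laid out differently. The pseudocode steps appear as comments.

```python
import csv
import io

# Hypothetical roster contents; in practice this would come from a CSV file.
SAMPLE_ROSTER = """name,level
Ana,Senior
Ben,Junior
Cy,Senior
Dee,Sophomore
"""


def count_levels(csv_file):
    """Return a dict mapping each class level to its number of students."""
    # Pseudocode step 1: read the roster one row at a time.
    reader = csv.DictReader(csv_file)
    counts = {}
    for row in reader:
        # Step 2: retrieve the value we care about (the class level).
        level = row["level"]
        # Step 3: store/update the running count for that level.
        counts[level] = counts.get(level, 0) + 1
    return counts


# io.StringIO lets the sample text stand in for an open file.
level_counts = count_levels(io.StringIO(SAMPLE_ROSTER))
print(level_counts)
```

Each commented step can be written and tested on its own before the next one is added, which is the iterative approach described above.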

When you approach a solution to any problem, try to apply the following steps:

  1. Write down the goal: what is the solution supposed to address or look like? Is it a stepping stone to another process? If so, how do they align?
  2. What is given? E.g., what format is it in? Can/should it be changed to another format?
  3. What do you need to do? This section includes the pre-/post-processing steps, data structures you need to use, functions to call, etc.
  4. How to verify correctness? It is helpful to think critically of the results that you get, so how would you test that your solution is indeed correct? Can you compute things manually? Use another tool to compare the output?
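One possible way to approach the verification step is sketched below: the same result is computed with a hand-written loop and with another tool (collections.Counter from the standard library), and then compared against a total that can be checked by hand. The data is hypothetical and kept small on purpose.

```python
from collections import Counter

levels = ["Senior", "Junior", "Senior", "Sophomore"]  # hypothetical data

# Our own solution: count with an explicit loop.
manual_counts = {}
for level in levels:
    manual_counts[level] = manual_counts.get(level, 0) + 1

# Check 1: compare against another tool (collections.Counter).
assert manual_counts == dict(Counter(levels))

# Check 2: a sanity check that can be computed by hand:
# the counts must add up to the number of students.
assert sum(manual_counts.values()) == len(levels)

print("All checks passed:", manual_counts)
```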

As a data scientist, it is crucial for you to develop the ability to outline the steps that are necessary for a solution. Even if you are not able to program them yourself, it is invaluable to be able to provide unambiguous, orderly instructions.

Before you ever get to coding a solution, you need to come up with its components first. It’s too easy to get lost in the code without pseudocode or an algorithm as a roadmap.

Read Documentation

It is essential for you to be able to read documentation to understand what some functions require as input, what they produce as output, and what alternative parameters they might accept.

Python 3 documentation: https://docs.python.org/3. Note that Python has many built-in functions and you can learn more about them by clicking on the function name: https://docs.python.org/3/library/functions.html.
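As a small example of what the documentation can tell you, the entry for the built-in sorted() function lists two optional keyword arguments, key and reverse. The sketch below uses both; the list of words is made up for illustration.

```python
words = ["capstone", "data", "science"]  # hypothetical input

# The short built-in description is also available from within Python.
print(sorted.__doc__)

# Per the documentation, sorted() accepts optional 'key' and 'reverse'
# arguments: here we sort by word length, longest first.
by_length_desc = sorted(words, key=len, reverse=True)
print(by_length_desc)
```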

Each third-party package, e.g., numpy or scipy, typically has its own documentation site.

Programming Style

Python has an official PEP 8 – Style Guide for Python Code. Look it over and refer to it periodically. It shows correct and incorrect ways to format code and includes recommendations that we wish everyone followed, e.g.:

We recommend using parentheses to make the order of evaluation explicit, rather than relying on precedence rules. It is always better to make your logic more explicit, and sometimes, not having parentheses can lead to unexpected results.

Do you think that the following expressions produce the same or different results?

4**5**2
(4**5)**2
4**(5**2)
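One way to check is to evaluate the three expressions and compare them. In Python, the exponentiation operator groups from right to left, so the unparenthesized form matches only one of the two parenthesized ones. The variable names below are ours.

```python
no_parens = 4**5**2      # grouped right to left, i.e., 4**(5**2)
left_first = (4**5)**2   # 1024**2
right_first = 4**(5**2)  # 4**25

print(no_parens == right_first)  # True
print(no_parens == left_first)   # False
print(left_first, right_first)
```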

Comments

Your comments should be informative and provide the reader with an explanation that’s not necessarily evident from the code.

The # (hash/pound) character starts a single-line comment. The interpreter ignores comments, but they are essential for explaining parts of your code. Get into the habit of using them regularly to document your work.

You can also use a documentation string (docstring) to document your code. Conventionally, docstrings are used as part of function documentation rather than as regular comments. A docstring is delimited by a matching set of three single or double quotes: ''' This is a docstring, which should always be on its own line.''' Because a triple-quoted string can span multiple lines, it is sometimes also used to create multiline (block) comments or to temporarily comment out parts of code.
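A minimal sketch of both kinds of documentation follows. The function rectangle_area is a hypothetical example, not part of the workshop materials.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle with the given width and height."""
    # A regular comment: visible only to someone reading the source.
    return width * height


# Unlike a comment, the docstring is stored on the function object,
# which is how tools such as help() can display it.
print(rectangle_area.__doc__)
print(rectangle_area(3, 4))
```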

Naming Convention

Please see the corresponding section in the PEP 8 – Style Guide for Python Code.

Variable names (identifiers):

Conventionally, variable names that are in all caps are used for constants, which shouldn’t change throughout the program.
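For instance, following the PEP 8 conventions (all caps with underscores for constants, lowercase with underscores for regular variables), a short sketch might look like this. The names and values are made up for illustration.

```python
MAX_CLASS_SIZE = 30  # constant: all caps, never reassigned

enrolled_students = 27  # regular variable: lowercase with underscores
seats_remaining = MAX_CLASS_SIZE - enrolled_students
print(seats_remaining)
```

Note that Python does not enforce constants; the all-caps name is only a signal to the reader that the value should not be changed.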

Workflow

Do not write all of your code all at once and then run it.

Write your code incrementally (remember, incremental program development) and execute it frequently (e.g., every 3-5 lines). This will help you avoid or catch errors, especially syntax and runtime errors.

Write down the steps of your pseudocode/algorithm first, before you start coding. Thinking through these steps will help you structure your code and make you realize what data/clarifications you might need.

If you are developing code to process a really large dataset, then make a small copy of that dataset (one that has just a few lines). This way you can rapidly develop your solution without having to wait for the entire file to be processed each time.
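One way to make such a small copy is sketched below. The function name make_sample, the file names, and the generated "large" file are all assumptions for this example; the demonstration writes to a temporary directory so that it can run anywhere.

```python
import itertools
import os
import tempfile


def make_sample(source_path, sample_path, num_lines=5):
    """Copy the first num_lines lines of source_path into sample_path."""
    with open(source_path) as source, open(sample_path, "w") as sample:
        # islice stops after num_lines, so the rest of the file is never read.
        sample.writelines(itertools.islice(source, num_lines))


# Demonstration with a hypothetical "large" file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    big_path = os.path.join(tmp_dir, "big.csv")
    small_path = os.path.join(tmp_dir, "small.csv")
    with open(big_path, "w") as big_file:
        big_file.write("id,value\n")
        big_file.writelines(f"{i},{i * i}\n" for i in range(1000))

    make_sample(big_path, small_path, num_lines=5)
    with open(small_path) as small_file:
        sample_lines = small_file.readlines()

print(sample_lines)
```

Keep in mind that the first few lines of a file are not always representative of the whole dataset, so run the finished solution on the full file as well.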

In programming, every little detail counts, so you must get into the habit of paying close attention to detail. Doing so helps you make fewer mistakes and write correct code, so you spend less time debugging.

Side note: you can run a web search for “programmer attention to details” to see tests that some employers give during an interview.

Python Errors

Syntax is the set of rules a programming language uses to define how symbols can be combined to create a proper line of code.

A syntax error is the easiest type of error to deal with, since syntax errors are found by the interpreter before the program is ever run. The error message will report the number of the offending line: keep in mind that sometimes the actual error is on the line before the one reported in the error message.

A runtime error occurs when the syntax is correct and the code starts running, but the program attempts an impossible operation, such as dividing by zero or multiplying strings together (like '5' * 'hours'). This usually causes your code to crash.
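The two operations mentioned above can be tried safely by catching the resulting exceptions, as in the sketch below. The helper try_operation is a hypothetical name introduced only for this example; without the try/except, each failing line would stop the program.

```python
def try_operation(description, operation):
    """Run operation() and report a runtime error instead of crashing."""
    try:
        result = operation()
    except (ZeroDivisionError, TypeError) as error:
        # type(error).__name__ is the name of the exception class.
        return f"{description}: {type(error).__name__}"
    return f"{description}: {result}"


messages = [
    try_operation("dividing by zero", lambda: 5 / 0),
    try_operation("multiplying strings", lambda: "5" * "hours"),
    try_operation("valid division", lambda: 10 / 4),
]
for message in messages:
    print(message)
```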

Side note: the term “bug” is often associated with computer scientist Grace Hopper, whose team found an actual bug (a moth) that was interfering with one of the relays of the computer they were working on. The word itself was already in use among engineers before then: http://en.wikipedia.org/wiki/Computer_bug.

Common error types:

Retype the statements, correcting the syntax error in each print statement.

print('Predictions are hard.")
    print(Especially about the future.)
the_answer = 42
print('the Answer to the Ultimate Question of Life, the Universe and Everything is:' user_num)
print(the_answer + the_answer + "is", the_answer*2 ))

A logic error can be the most difficult to debug, since the program runs but doesn’t perform as intended. Paying careful attention and frequently running your code after writing just a few lines can help minimize the number of such errors.

Some examples of logic errors:
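One illustration of our own (the scores are made up): the code below runs without any error message, yet the first computation produces the wrong average because division is performed before addition.

```python
scores = [80, 90, 100]  # hypothetical data

# Logic error: division binds tighter than addition, so only the last
# score is divided by 3. The program runs, but the result is wrong.
wrong_average = scores[0] + scores[1] + scores[2] / 3

# Intended computation: parentheses make the order of evaluation explicit.
right_average = (scores[0] + scores[1] + scores[2]) / 3

print(wrong_average)
print(right_average)
```

Computing the expected answer by hand for a small input like this one is often the quickest way to notice such an error.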

More information, including how to deal with exceptions: https://docs.python.org/3/tutorial/errors.html

Check your understanding