Jupyter Notebooks - clizarraga-UAD7/Workshops GitHub Wiki

Jupyter Project (Image Credit: Wikimedia Foundation, CC)

(URL: https://github.com/clizarraga-UAD7/Workshops/wiki/Jupyter-Notebooks)

Getting started with Jupyter Notebooks: A Python Programming Environment for Data Analysis and Modeling.


There are multiple scientific programming environments for carrying out data analysis and modeling. Python is one of the top-rated general-purpose programming languages and is widely used in Data Science. Using Python with a Jupyter Notebook provides an efficient way of presenting the input code together with its results, offering a reproducible, interactive computing environment as well as an explainable technical document for readers to follow. In this workshop, participants will use a Python programming environment based on Jupyter Notebooks to perform data analytics and data visualization.

Learning objectives

  • List offline and online Jupyter Notebook environments that support Python programming.

  • Understand how to start and run a Jupyter Notebook session.

  • Describe the Jupyter Notebooks user interface.

  • Identify the main general-purpose Python libraries and their purpose.

  • Demonstrate how to read data files into Jupyter using the Pandas library.

  • Try out the NumPy library for numerical/mathematical functions.

  • Show the use of the SciPy library for scientific computing capabilities.

  • Exhibit the use of Matplotlib and Seaborn visualization libraries to make simple data plots.

  • Indicate how to save a Jupyter Notebook and end your Jupyter session.

Please see the Slides of the Workshop


What are Jupyter Notebooks?

Jupyter Notebooks are a product of the Jupyter Project, a community dedicated to producing open-source interactive development environments for science and scientific computing. The project supports a group of programming languages, mainly Julia, Python and R. Today, there are 150+ programming language options that run on Jupyter Notebooks.

In what sort of Applications are Jupyter Notebooks being used?

You can find a wide variety of scientific applications where Jupyter Notebooks are being used:

  • Artificial Intelligence and Machine Learning
  • Biology, Chemistry and Physics
  • Earth Sciences and Geospatial Analysis
  • Economics and Finance
  • Linguistics, Natural Language Processing and Text Mining
  • Mathematics and Statistics
  • Psychology and Neurosciences
  • Signal, Sound and Video Analysis
  • Many others...

A Jupyter Notebook is composed of two types of cells: a Code cell, where the user enters segments of code, and a Text cell, where the user enters text formatted with the Markdown language.
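The two cell types can be illustrated with a minimal sketch of how a notebook file (.ipynb) stores them as JSON. The sketch assumes the nbformat 4 layout; the helper names make_notebook, markdown_cell and code_cell are hypothetical and are not part of any Jupyter library.

```python
import json


def markdown_cell(text):
    """Build a Text (Markdown) cell. Hypothetical helper for illustration."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(code):
    """Build a Code cell that has not been executed yet. Hypothetical helper."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": code,
    }


def make_notebook(cells):
    """Wrap a list of cells in a minimal notebook dictionary (assumed nbformat 4)."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = make_notebook([
    markdown_cell("# My analysis\nText cells are written in *Markdown*."),
    code_cell("print('Hello from a code cell')"),
])

# A notebook file is this dictionary serialized as JSON.
notebook_json = json.dumps(notebook, indent=1)

# Read it back and list the type of each cell.
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
print(cell_types)
```

Writing notebook_json to a file with the .ipynb extension should give a file that Jupyter can open, although in practice notebooks are normally created from the Jupyter interface itself.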

The next generation of Jupyter Notebooks is named JupyterLab. It is the interface in common use today, even though people often still refer to it as a Jupyter Notebook (see the JupyterLab Documentation).


Jupyter Notebook origins.

Jupyter Notebooks started as IPython Notebooks back in 2011 and became the Jupyter Project in 2015. The design goes back to Fernando Perez (UC Berkeley), who around 2001 tried to replicate a Wolfram Mathematica Notebook (Mathematica Notebook Examples).

There are other Notebooks that are used for more specialized programming environments:


Jupyter Notebooks on local computer.

If you want to work locally on your computer, there are several options for installing Jupyter Notebooks. We mention two of them.

First method: Anaconda Python. After downloading and installing it, run the command jupyter lab from a terminal window. A new browser tab will open with Jupyter. Any package that is not already installed can be installed with the conda install command from a command-line interface (i.e. a terminal window).
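Before running conda install, it can help to check which packages are actually missing. The following is a minimal sketch using only the Python standard library; the function name missing_packages and the list of package names are assumptions chosen for illustration.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical list of import names to check in the current environment.
wanted = ["numpy", "pandas", "scipy", "matplotlib", "seaborn"]
print(missing_packages(wanted))
```

Note that the import name and the conda package name are not always the same (for example, the sklearn module is provided by the scikit-learn package), so the result is only a starting point for deciding what to install.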

Second method: Jupyter SciPy Notebook - Docker image. This is a basic Jupyter Notebook with common Python libraries installed. First, you need to install Docker Desktop if you don't have it already. Then download the latest Jupyter Notebook image with SciPy installed from DockerHub by running the following command in a terminal:

docker pull jupyter/scipy-notebook:latest

You will see that it is downloading a series of different software layers that compose the Docker image.

After downloading the image, go to your desired working directory on your computer. The next step is to launch a Docker container by running:

docker run -it --rm -v "${PWD}":/home/jovyan/work -p 8888:8888 jupyter/scipy-notebook

Your files will be stored in the local directory from which you launched the Docker container, because the -v option mounts that directory at /home/jovyan/work inside the container. You can find more information about Docker containers in these notes.

Then open a new browser tab at the URL localhost:8888. It will ask you for a token: copy the series of 48 characters that appears after token= in the terminal output. Alternatively, simply copy the full URL printed in the running terminal, similar to http://127.0.0.1:8888/lab?token=7d5a143bfe924787eba5e20110407204854bc7664fa8f1d4, and paste it into a new browser tab.
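If you prefer to pull the token out of the printed URL programmatically, the following minimal sketch uses Python's standard urllib.parse module. The function name extract_token is hypothetical, and the URL is the example shown above.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(url):
    """Return the value of the `token` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("token")
    return values[0] if values else None


url = "http://127.0.0.1:8888/lab?token=7d5a143bfe924787eba5e20110407204854bc7664fa8f1d4"
token = extract_token(url)
print(token)
print(len(token))
```

The returned string is what you would paste into the token field of the Jupyter login page.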

There are several Docker images of Jupyter Notebooks to choose from. The Docker image jupyter/scipy-notebook has a size of 948MB and runs on top of an Ubuntu OS.

The Jupyter Datascience Notebook - Docker Image has a larger size of 1.49GB. It includes Python, R and Julia programming language options.

Available online Jupyter Notebook programming environments.

There are several cloud-based options for using Jupyter Notebooks via the web browser, which need no installation or configuration.

Open Platforms Use:

Reserved to University of Arizona users:

  • Cyverse.Org. Cyberinfrastructure support for Life Science research, National Science Foundation.
  • UA HPC. University of Arizona High Performance Computing.

Python Libraries

Main libraries used in Python.

Python has a collection of Libraries that are used in data analysis and modeling.
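As a small illustration of how two of these libraries work together, the sketch below reads tabular data with pandas and summarizes it with NumPy. The CSV content, the column names and the variable names are made up for the example; in practice pd.read_csv would usually be given the path of a real data file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a real data file on disk.
CSV_TEXT = """site,temperature_c
A,20.5
B,22.0
C,19.5
D,24.0
"""

# pandas: read tabular data into a DataFrame.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# NumPy: numerical functions applied to the column values.
temps = df["temperature_c"].to_numpy()
mean_temp = float(np.mean(temps))
max_temp = float(np.max(temps))

print(df.shape)
print(mean_temp, max_temp)
```

Both libraries need to be installed in the environment where the notebook runs.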

Other specialized Machine Learning Libraries in Python.

The development of specialized applications of big data analysis in machine learning modeling is very dynamic. There is also a large set of libraries, of which we mention only a few.


Jupyter Notebook Examples.

You can download the following Notebooks to your Google Colab session to follow the tutorial and do the proposed exercises found there.


Known datasets.

There are many sources of popular datasets used for learning Data Analysis in Python.

More datasets: KDnuggets: Complete Collection of Data Repositories (Part 1)

Built-in datasets in Python

Some libraries in Python come with a collection of datasets that help us practice data analysis.

  • pydataset: Package with 700+ datasets
  • seaborn: Data Visualization package
  • sklearn: Machine Learning package
  • statsmodels: Statistical Model package
  • nltk: Natural Language Toolkit package

For example, the pydataset library:

# Install pydataset package (if not installed)
!pip install pydataset
# Import package
from pydataset import data
# Check out datasets
data()
	dataset_id	title
0	AirPassengers	Monthly Airline Passenger Numbers 1949-1960
1	BJsales	Sales Data with Leading Indicator
2	BOD	Biochemical Oxygen Demand
3	Formaldehyde	Determination of Formaldehyde
4	HairEyeColor	Hair and Eye Color of Statistics Students
...	...	...
752	VerbAgg	Verbal Aggression item responses
753	cake	Breakage Angle of Chocolate Cakes
754	cbpp	Contagious bovine pleuropneumonia
755	grouseticks	Data on red grouse ticks from Elston et al. 2001
756	sleepstudy	Reaction times in a sleep deprivation study

# Load as a dataframe
df = data('iris')
df.head(n=25)

Using seaborn datasets:

# Import seaborn
import seaborn as sns
# Check out available datasets
print(sns.get_dataset_names())

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']

Python Cheat Sheets.


General References

Extra:


Created: 01/22/2022 (C. Lizarraga); Last update: 10/04/2022 (C. Lizarraga)

CC BY-NC-SA