A03_MLuong - mluong56/BIDD330_Spring2024 GitHub Wiki
Module 03 Assignment
Assignment 03 requires you to install python, pip, & Jupyter Notebook. The first assignment in Jupyter Notebook is more about setting things up than creating visualizations. Operationally, we need to keep python, pip updated and install several packages on our machine. Additionally, we need to connect to our SQL Server Instance. Again, I always like to repeat this. This is a SQL Server Database certificate. Therefore, we always build for future scaling. Do not upload csv files. Always use a data warehouse for the backend.
Besides the Jupyter Notebook you need to insert the Bing Covid database into your local instance using the documentation from the final. Then create a backup file as a data Extract data tier application. Next write a store procedure to change the datatypes to a more appropriate type.
Last of all, do a couple page write up on the following questions: • What did you learn about python & Jupyter Notebook? • What did you learn about inserting .csv files into SQL Server? How can we apply this to the final? • If you chose Jupyter Notebook as the final, how would you build it? • Would you recommend students attempt to build a demo of Databricks or do you think this is above the scope of this certificate?
REQUIREMENTS: Jupyter Notebook file completed. 40 POINTS C:\Python or where your python.exe was placed File Name should be: HW3_TimPauley.ipynp Follow the instructions in the homework. Overall, I am looking for you to demonstrate you can query a database in Jupyter Notebook. Your own, not mine. In a folder entitled: Jupyter_Notebook Your Database .bacpac file which should have your Team Color and Name : 35 POINTS C:\Backup In class we will complete the demo: InClassDemo_HowToInsertDataIntoSQLServer. If you do not complete it in class, this is homework File Name should be: Black_Tim.bacpac In a folder: Backup • Write Up: 25 POINTS Summary of what you learned 2 helpful screenshots 2 pages At your GutHub account optionally word.doc
REVIEW THE FOLLOWING ARTICLES/VIDEOS: • Scan https://en.wikipedia.org/wiki/Project_Jupyter • Also review https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.html • Why Jupyter is data scientists’ computational notebook of choice: https://www.nature.com/articles/d41586-018-07196-1 o Review this example from the article in this GitHub repo: https://github.com/jperkel/example_notebook
ALSO SCAN THE FOLLOWING HELPFUL RESOURCES: • https://towardsdatascience.com/do-you-know-python-has-a-built-in-database-d553989c87bd • https://www.dataquest.io/blog/python-pandas-databases/ • https://datatofish.com/pandas-dataframe-to-sql/
CREATE A JUPYTER NOTEBOOK DOCUMENT:
Step 1: Open Command Prompt
Note: I always confirm Python is updated and running correctly. Then I quit
C:\Users\BIDD24>pyhton 'pyhton' is not recognized as an internal or external command, operable program or batch file.
C:\Users\BIDD24>python Python 3.12.6 (tags/v3.12.6:a4a2d2b, Sep 6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information.
quit()
Step 2: Install packages
Note: I always check my pip (package installer is up-to-date). This is optional, but as a power user I always like to double check
C:\Users\BIDD24>pip install update Requirement already satisfied: update in c:\users\bidd24\appdata\local\programs\python\python312\lib\site-packages (0.0.1) Requirement already satisfied: style==1.1.0 in c:\users\bidd24\appdata\local\programs\python\python312\lib\site-packages (from update) (1.1.0)
[notice] A new release of pip is available: 24.2 -> 25.0.1 [notice] To update, run: python.exe -m pip install --upgrade pip
C:\Users\BIDD24>
pip install pyodoc
pip install pandas
pip install sqlalchemy
pip install matplotlib
pip install warnings
pip install notebook
Step 3: Run Jupyter Notebook
jupyter notebook
Step 4 Create new file for your homework. This should be something like HW3_TimPauley
Your file will be saved at this location by default on the Azure VM. If you are not using the Azure VM it should be in a similar place.
This is why we might have put our Python folder close to the C;/ root. If you did not, it probably is in the users folder our close to it. You can also use a Python command to search for it
In this assignment you will be running code from the database you built. My database is black_pauley. Yet your database will be your team name and your last name, just like mine
Reminder. Make sure the UW VPN is running
Step 5 Run a query for Fact Unemployment. Perhaps do a TOP 10
You will be graded on the correct table pull. I realize you might have named you table different. Therefore, you will never be graded on the correct table name. Yet the table name should make sense. I renamed mine raw. To highlight the datatypes are incorrect and I will need to parse out the correct data types. This would make sense to any developer or database administrator.
import pyodbc conn = pyodbc.connect('DRIVER={SQL Server};SERVER=uwc-studentsql.continuum.uw.edu\uwcbiddsql;DATABASE=black_pauley;UID=BICert;PWD=BICert') ###conn = pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER=test;DATABASE=test;UID=user;PWD=password')
cursor = conn.cursor() cursor.execute("SELECT TOP 10 * FROM [black_pauley].[dbo].[Unemployment_raw]") for i in cursor: print(i)
This code above represents my check for all the packages we will need. I am explicitly calling this out. Normally this is just the best practices. If I see a package is returning an error, I fix it. Maybe my python is out-of-date, maybe I have created a firewall, maybe I have installed multiple version of python (in which case you should uninstall all python programs and restart)
Step 6: Do a Jupyter Magic command. You could borrow one of mine. Yet the point is to type it in. This would be an excellent idea if you forgot a package. A Jupyter magic command allows you to not have to restart your command window. Rather you can install it directly from this interface
!pip install ipython-sql
Step 7: Query the Fact Covid table
Query a table
cursor = conn.cursor() cursor.execute('SELECT TOP 10 * FROM [black_pauley].[dbo].[bing_covid_raw]')
for i in cursor: print(i)
Using Intermediate Packages like Pandas
Perhaps you will recall us checking all our packages we installed with pip. Now we will use those intermediate packages. For your reference this is a basic data integration. I am hoping folks are keeping up as if you are not, doing the advanced stuff will be impossible. If you made it here, maybe you have what it takes to be a data scientist 😊
Note: On this exercise you will have to switch databases or parse the data.
Reasoning: We have not parsed the strings. Did you catch that? Hopefully you are following along. I in fact, didn’t catch it at first. Yet, this is the solution I came up with. The goal is to get the code to work. What other ways could you accomplish this 😊
import pyodbc import pandas as pd from sqlalchemy import create_engine import matplotlib.pyplot as plt import warnings warnings.filterwarnings("ignore", category=DeprecationWarning)
This should be run on your own time and Jupyter Magic Commands should fix the problem. We say goodbye to the beginners and strive for the mountain.
Step 8: Set a pandas data frame to unemployment table
#Not necessary every time we import the package. Yet I did it just in case you skipped a step where we imported all packages #import panda #import warning #The code below gets rid of the warning. It the advanced part of this class we will switch packages. Yet this demos how to handle warning messages warnings.filterwarnings("ignore", category=DeprecationWarning)
Select from SQL Server using the Pandas unemployment table
dataframeUnemployment = pd.read_sql("SELECT U.[State], SUM(U.[Initial Claims]) [ClaimsYTD] FROM [Black_Unemployment].[dbo].[FactUnemployment] U GROUP BY U.[State];", conn) print(dataframeUnemployment)
Step 9: Apply advanced windows functions, temp tables and CTE to Pandas
For this exercise the correct output is a failure. The goal here is to attempt to create a function in Pandas, insert into the SQL code link and watch it fail. The advanced part of the class will be finding a better package
Step 10: Using TOP 5 commands
This should be simple and just select the top 5 using the head command to output the data frame
SUBMIT YOUR DOCUMENT TO CANVAS:
Use the process of zipping your folder and submitting to the assignment 3 link. The homework folder should be: BIDD330_HW3_TimPauley. With subfolders for each part of the assignment. You should also have a folder with supporting SQL code. I am unaware of any Jupyter Notebook developers who do not test there code in SQL first.
• Ideally, I would like you to copy / paste this in GutHub. Yet, if you have a problem publishing it, please keep it here