jupyter_labs_eda_sql_coursera_sqllite - Fatimahhg/special-potato GitHub Wiki
SpaceX has gained worldwide attention for a series of historic milestones.
It is the only private company ever to return a spacecraft from low-earth orbit, which it first accomplished in December 2010. SpaceX advertises Falcon 9 rocket launches on its website with a cost of 62 million dollars wheras other providers cost upward of 165 million dollars each, much of the savings is because Space X can reuse the first stage.
Therefore if we can determine if the first stage will land, we can determine the cost of a launch.
This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.
This dataset includes a record for each payload carried during a SpaceX mission into outer space.
This assignment requires you to load the spacex dataset.
In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. Click on the link below to download and save the dataset (.CSV file):
!pip install sqlalchemy==1.3.9
%load_ext sql
import csv, sqlite3con = sqlite3.connect("my_data1.db") cur = con.cursor()
!pip install -q pandas==1.1.5
%sql sqlite:///my_data1.db
import pandas as pd df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_2/data/Spacex.csv") #df.to_sql("SPACEXTBL", con, if_exists='replace', index=False,method="multi")
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101 entries, 0 to 100 Data columns (total 10 columns):
0 Date 101 non-null object 1 Time (UTC) 101 non-null object 2 Booster_Version 101 non-null object 3 Launch_Site 101 non-null object 4 Payload 101 non-null object 5 PAYLOAD_MASS__KG_ 101 non-null int64 6 Orbit 101 non-null object 7 Customer 101 non-null object 8 Mission_Outcome 101 non-null object 9 Landing_Outcome 101 non-null object dtypes: int64(1), object(9) memory usage: 8.0+ KB
df.head(2)
Date | Time (UTC) | Booster_Version | Launch_Site | Payload | PAYLOAD_MASS__KG_ | Orbit | Customer | Mission_Outcome | Landing_Outcome | |
---|---|---|---|---|---|---|---|---|---|---|
2010-06-04 | 18:45:00 | F9 v1.0 B0003 | CCAFS LC-40 | Dragon Spacecraft Qualification Unit | 0 | LEO | SpaceX | Success | Failure (parachute) | |
2010-12-08 | 15:43:00 | F9 v1.0 B0004 | CCAFS LC-40 | Dragon demo flight C1, two CubeSats, barrel of... | 0 | LEO (ISS) | NASA (COTS) NRO | Success | Failure (parachute) |
Estimated time needed: 60 minutes.
Introduction
Using this Python notebook you will:
Understand the Spacex DataSet Load the dataset into the corresponding table in a Db2 database Execute SQL queries to answer assignment questions Overview of the DataSet
SpaceX has gained worldwide attention for a series of historic milestones.
It is the only private company ever to return a spacecraft from low-earth orbit, which it first accomplished in December 2010. SpaceX advertises Falcon 9 rocket launches on its website with a cost of 62 million dollars wheras other providers cost upward of 165 million dollars each, much of the savings is because Space X can reuse the first stage.
Therefore if we can determine if the first stage will land, we can determine the cost of a launch.
This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.
This dataset includes a record for each payload carried during a SpaceX mission into outer space.
Download the datasets
This assignment requires you to load the spacex dataset.
In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. Click on the link below to download and save the dataset (.CSV file):
!pip install sqlalchemy==1.3.9 Connect to the database
Let us first load the SQL extension and establish a connection with the database
%load_ext sql import csv, sqlite3
con = sqlite3.connect("my_data1.db") cur = con.cursor() !pip install -q pandas==1.1.5 %sql sqlite:///my_data1.db import pandas as pd df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_2/data/Spacex.csv") #df.to_sql("SPACEXTBL", con, if_exists='replace', index=False,method="multi") df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 101 entries, 0 to 100 Data columns (total 10 columns):
0 Date 101 non-null object 1 Time (UTC) 101 non-null object 2 Booster_Version 101 non-null object 3 Launch_Site 101 non-null object 4 Payload 101 non-null object 5 PAYLOAD_MASS__KG_ 101 non-null int64 6 Orbit 101 non-null object 7 Customer 101 non-null object 8 Mission_Outcome 101 non-null object 9 Landing_Outcome 101 non-null object dtypes: int64(1), object(9) memory usage: 8.0+ KB df.head(2) Date Time (UTC) Booster_Version Launch_Site Payload PAYLOAD_MASS__KG_ Orbit Customer Mission_Outcome Landing_Outcome 0 2010-06-04 18:45:00 F9 v1.0 B0003 CCAFS LC-40 Dragon Spacecraft Qualification Unit 0 LEO SpaceX Success Failure (parachute) 1 2010-12-08 15:43:00 F9 v1.0 B0004 CCAFS LC-40 Dragon demo flight C1, two CubeSats, barrel of... 0 LEO (ISS) NASA (COTS) NRO Success Failure (parachute) Note:This below code is added to remove blank rows from table
%sql create table SPACEXTABLE as select * from SPACEXTBL where Date is not null Tasks
Now write and execute SQL queries to solve the assignment tasks.
Note: If the column names are in mixed case enclose it in double quotes For Example "Landing_Outcome"
Task 1
Display the names of the unique launch sites in the space mission set(df['Launch_Site']) {'CCAFS LC-40', 'CCAFS SLC-40', 'KSC LC-39A', 'VAFB SLC-4E'} Task 2
Display 5 records where launch sites begin with the string 'CCA' df['Launch_Site'].value_counts() CCAFS SLC-40 34 CCAFS LC-40 26 KSC LC-39A 25 VAFB SLC-4E 16 Name: Launch_Site, dtype: int64 CCA1 = df[df['Launch_Site'].str.contains('CCAFS SLC-40')] CCA1.head() Task 3
Display the total payload mass carried by boosters launched by NASA (CRS) #by NASA (CRS) df_CRS = df[df['Customer'].str.contains('CRS')] df_Mass = df_CRS'PAYLOAD_MASS__KG_','Booster_Version' df_Mass Task 4
Display average payload mass carried by booster version F9 v1.1 df_F9 = df[df['Booster_Version'].str.contains('F9 v1.1')] df_F9['PAYLOAD_MASS__KG_'].mean() 2534.6666666666665 Task 5
List the date when the first succesful landing outcome in ground pad was acheived.
Hint:Use min function
df['Landing_Outcome'].value_counts() df_Land = df[df['Landing_Outcome'] == 'Success (ground pad)'] df_Land['Date'].min() '2015-12-22' Task 6
List the names of the boosters which have success in drone ship and have payload mass greater than 4000 but less than 6000 df_boost = df[(df['PAYLOAD_MASS__KG_'] > 4000) & (df['PAYLOAD_MASS__KG_']<6000) & (df['Landing_Outcome'] == 'Success (drone ship)')] set(df_boost['Booster_Version']) {'F9 FT B1021.2', 'F9 FT B1031.2', 'F9 FT B1022', 'F9 FT B1026'} Task 7
List the total number of successful and failure mission outcomes df['Mission_Outcome'].value_counts() Success 98 Failure (in flight) 1 Success (payload status unclear) 1 Success 1 Name: Mission_Outcome, dtype: int64 dF_success = df[df['Mission_Outcome'].str.contains('Success')] len(dF_success['Mission_Outcome']) 100 Task 8
List the names of the booster_versions which have carried the maximum payload mass. Use a subquery df_Load = df[df['PAYLOAD_MASS__KG_'] == df['PAYLOAD_MASS__KG_'].max()] print(set(df_Load['Booster_Version'])) {'F9 B5 B1048.5', 'F9 B5 B1049.5', 'F9 B5 B1049.4', 'F9 B5 B1051.3', 'F9 B5 B1056.4', 'F9 B5 B1051.4', 'F9 B5 B1060.2 ', 'F9 B5 B1058.3 ', 'F9 B5 B1048.4', 'F9 B5 B1051.6', 'F9 B5 B1049.7 ', 'F9 B5 B1060.3'} Task 9
List the records which will display the month names, failure landing_outcomes in drone ship ,booster versions, launch_site for the months in year 2015.
Note: SQLLite does not support monthnames. So you need to use substr(Date, 6,2) as month to get the months and substr(Date,0,5)='2015' for year.
import datetime df_date = df'Date','Booster_Version','Launch_Site','Landing_Outcome' df_date df_date['Date'] = pd.to_datetime(df_date['Date'] , format='%Y-%m-%d') :1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_date['Date'] = pd.to_datetime(df_date['Date'] , format='%Y-%m-%d') df_2015 = df_date[df_date['Date'].dt.strftime('%Y') == '2015'] df_month = df_2015[df_2015['Landing_Outcome'] == 'Failure (drone ship)'] df_month['Date']
13 2015-01-10 16 2015-04-14 Name: Date, dtype: datetime64[ns] Task 10
Rank the count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between the date 2010-06-04 and 2017-03-20, in descending order. df['Landing_Outcome'].value_counts() df_new = df[(df['Date'] > '2010-06-04') & (df['Date'] < '2017-03-20') & (df['Landing_Outcome'] == 'Success (ground pad)')] df_new df_new['Landing_Outcome'].count() 3 Reference Links
Hands-on Lab : String Patterns, Sorting and Grouping
Hands-on Lab: Built-in functions
Hands-on Lab : Sub-queries and Nested SELECT Statements
Hands-on Tutorial: Accessing Databases with SQL magic
Hands-on Lab: Analyzing a real World Data Set
Author(s)
Lakshmi Holla Other Contributors
Rav Ahuja Change log
Date Version Changed by Change Description 2021-07-09 0.2 Lakshmi Holla Changes made in magic sql 2021-05-20 0.1 Lakshmi Holla Created Initial Version © IBM Corporation 2021. All rights reserved.