Data Engineering Review - tps-train/edp GitHub Wiki


Only this page is required for the 3rd assessment

Quick intro to the absolute basics of Python


How does Jupyter notebook or Google Colab work with Python, and why would you want to use these instead of Python on the command line?

  • It allows running code block by block (cell by cell)
  • As well as running Python code, it lets me write documentation in Markdown
    • I can combine documentation with runnable code to present information to people live, and include visualizations and narrative text.

What is meant by REPL in Python terms, when we are talking about the interpreter? READ, EVAL, PRINT, LOOP

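  • Typing python on the command line starts the interpreter, which reads each line we enter, evaluates it, prints the result and loops back for more input, so we can write code interactively.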


What characters are allowed in variable names?

  • numbers, letters and underscores

Can I start my variable name with a digit?

  • NO

How can I write a long number, e.g. 2,345,343, in a readable way in Python?

  • 2_345_343
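The naming and number-formatting rules above can be checked interactively. This is a minimal sketch; the variable name population and the sample strings are only illustrative.

```python
# Underscores in a numeric literal are ignored by the interpreter;
# they only make the number easier for people to read.
population = 2_345_343
print(population)          # the underscores are not part of the value
print(f"{population:,}")   # format with thousands separators for display

# str.isidentifier() reports whether a string is a syntactically valid name.
print("total_2".isidentifier())   # letters, digits and underscore are fine
print("2total".isidentifier())    # a name cannot start with a digit
```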

Data types

Does Python have data types in the sense of Java or C, such as int or bool or float, when declaring your variables? No. Values still have types, but the type belongs to the value, not to the variable.

Do I have to declare the type of a variable to use it? No

So when do data types matter in Python?

  • When combining values of different types, e.g. concatenating a string and an integer raises the exception TypeError.
c = a+b # with a a str and b an int, this would generate a TypeError
c = a+str(b) # is ok

This is known as explicit type conversion (casting); Python will not convert the integer to a string for us.
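A small runnable version of the example above, assuming a holds a string and b holds an integer (the values are made up for illustration):

```python
a = "Total: "
b = 42

try:
    c = a + b            # str + int is not defined, so Python raises TypeError
except TypeError as err:
    print("TypeError:", err)

c = a + str(b)           # convert the int to a str first
print(c)

# An f-string formats the value for us, which is often tidier.
print(f"{a}{b}")
```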

What are the types in Python?

  • Numeric types (built in; in Python these are objects, not primitives as in Java or C)
    • bool, int, float, complex
  • Sequence types
    • str, bytes, list, tuple
    • Initialize str as name = ""
    • Initialize bytes as data = b""
    • Lists as mylist = []
    • Tuple as mythrees = ()
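  • Mapping types
    • dict
    • Initialize with mydict = {} (avoid calling the variable dict, which would hide the built-in type)
  • Set types
    • set, frozenset
    • An empty set is initialized with set(); {} creates an empty dict and () creates an empty tuple
    • A non-empty set can be written with braces, e.g. {1, 2, 3}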
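To see which brackets produce which type, the empty values can be created and inspected directly. This is a small sketch; the variable names follow the notes above, with mydict, myset and frozen added for illustration.

```python
name = ""                       # str
data = b""                      # bytes
mylist = []                     # list
mythrees = ()                   # tuple
mydict = {}                     # dict: empty braces always mean a dict
myset = set()                   # set: there is no literal for an empty set
frozen = frozenset({1, 2, 3})   # immutable set

for value in (name, data, mylist, mythrees, mydict, myset, frozen):
    print(type(value).__name__, repr(value))
```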


What are the 2 types of parameters that we can specify when passing inputs into functions?

  • Positional
    def add2nums(a,b):
       return a+b # do something here
    • Values are matched to the parameters by position, e.g. add2nums(23,42)
  • Named parameters
    • Allow us to use the variable name of the input parameters to assign values to them.
    • add2nums(a=23,b=42)

Positional parameters allow for optional parameters, how do we do this? How can I create a function that has an unknown quantity of positional parameters?

def add2nums(*args):

args is just the variable name, it can be whatever you want it to be.

But what data type will args be inside the body of your function?

  • Tuple (an immutable sequence; it is not a list)

What is **kwargs? Keyword arguments. It does not have to be called kwargs; again it can be whatever you want to call the variable in your function definition.

Which means what data type will I be working with in my Python function?

  • Dictionary (key/value pairs)

Default input parameter values. Adding =value to the parameter name. e.g.

def add2nums(a,b=42):

If you have a default the input parameter can be skipped to use the default, e.g. x=add2nums(13), or x=add2nums(a=13)
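The parameter styles above can be tried in one short script. This is a sketch only: add_nums is a hypothetical helper written for this illustration, and add2nums is given a simple body so that it runs.

```python
def add_nums(*args, **kwargs):
    """Sum the positional arguments; keyword arguments are only reported."""
    print(type(args).__name__, args)       # positional arguments arrive as a tuple
    print(type(kwargs).__name__, kwargs)   # keyword arguments arrive as a dict
    return sum(args)


def add2nums(a, b=42):
    """b is optional because it has a default value."""
    return a + b


total = add_nums(1, 2, 3, label="demo")
print(total)

print(add2nums(13))           # b falls back to its default
print(add2nums(a=13))         # same call using a named parameter
print(add2nums(a=23, b=42))   # both parameters named
```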

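What do we call an anonymous function in Python? lambda function.

mylist = list(map(lambda x: x+x, (1,2,3,4,5)))

We can also assign a function definition to a variable, using lambdas, e.g. double = lambda x: x+x

Note that simply placing a lambda and a tuple inside brackets, [ lambda x: x+x, (1,2,3,4,5) ], only creates a two-item list; the lambda is never called.

If I build a new list by writing an expression and a for loop inside square brackets, what operation do we call this?

mynewlist = [ x+x for x in (1,2,3,4,5) ]
  • List comprehension
mylist = [ i*i for i in range(100) ]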
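A short sketch comparing a lambda applied with map() to the equivalent list comprehension. The names double, numbers, mapped and comprehended are illustrative only.

```python
# A lambda is an anonymous function; assigning it to a name lets us reuse it.
double = lambda x: x + x

numbers = (1, 2, 3, 4, 5)

# map() calls the function on every item; list() collects the results.
mapped = list(map(double, numbers))

# A list comprehension expresses the same thing without a separate function.
comprehended = [x + x for x in numbers]

print(mapped)
print(comprehended)

squares = [i * i for i in range(100)]
print(squares[:5])
```

Both forms give the same list here; the comprehension is usually considered the more readable of the two for simple expressions.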

Libraries - Modules.

Modules contain functions, variables (constants) and classes.

A collection (or directory) of modules is called a PACKAGE.

How do I include a module into my Python script?

  • import

What ways can we import modules?

  • import moduleName
    • Functions and variables have to be explicitly prefixed with the moduleName, e.g. moduleName.functionName()
  • import moduleName as alias
    • example: import datetime as dt
  • from moduleName import functions
    • Or we could specify * after the import keyword to include all public names
    • What namespace does the above import the functions/variables into? The current namespace (__main__ for a script)
    • What issues might occur if we import in this manner?
      • Naming conflict. I might already have a function called bob() in my script and moduleName might also have a bob() function.
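The import styles and the naming conflict can be demonstrated with standard-library modules. This is a sketch; the locally defined sqrt() is a deliberately bad idea, included only to show the clash.

```python
import math                 # names must be prefixed: math.sqrt
import datetime as dt       # alias: dt.date instead of datetime.date
from math import sqrt       # sqrt is copied into the current namespace

print(math.pi)
print(dt.date(2024, 1, 31).isoformat())
print(sqrt(16))             # works without the module prefix


# Naming conflict: defining our own sqrt silently replaces the imported one.
def sqrt(x):
    return "my own sqrt"


print(sqrt(16))             # now calls our function, not math's
print(math.sqrt(16))        # the prefixed form is unaffected
```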

If a module does not import in my code, what do I need to do to make it available to me? e.g. I've done import pandas, but it reports back that pandas is not installed.

  • We need to use pip install moduleName, e.g. pip install pandas

How do I control what modules I've installed using pip?

  • pip freeze
    • Allows me to list all modules that have been installed
  • pip freeze > requirements.txt
    • This allows us to save our modules and versions into a file for others to install using pip install -r requirements.txt

In some data science and financial institutions, Anaconda is used instead; it is another distribution of Python and has its own package manager, conda.

What about ensuring that I only get the modules that are required by my script, how do I create an environment that ignores the operating system Python libraries/modules?

  • How do I set up a virtual environment?
    python -m venv NameOfYourEnvironment
  • To use the environment on Windows:
    NameOfYourEnvironment\Scripts\activate
  • To use on Linux:
    . NameOfYourEnvironment/bin/activate
    # or
    source NameOfYourEnvironment/bin/activate

Data Engineering

5 parts of the Hadoop Ecosystem - Tools that can make use of Hadoop

  • Pig
  • Apache Spark
  • Sqoop
  • Hive
  • Flume

Components of Hadoop

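  • MapReduce
    • Processing model of the Hadoop system
    • Map tasks process blocks of data in parallel, reduce tasks combine their results
    • Data Engineering role tool
  • YARN
    • Yet Another Resource Negotiator
    • Manages the cluster's resources and schedules jobs
  • HDFS
    • Hadoop Distributed File System, which Hadoop uses to store the data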

Hadoop partitions data - another name for partitioning = Sharding.
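One common way to shard records is to hash a key and take the remainder by the number of shards. The sketch below illustrates that general idea only; it is not how HDFS places blocks, and shard_for, the customer ids and the shard count are all made up for this example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number in a stable, repeatable way."""
    # The built-in hash() of a str changes between runs, so use a real digest.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}

for customer_id in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(customer_id, NUM_SHARDS)].append(customer_id)

print(shards)
```

The same key always lands on the same shard, which is what lets a reader find the data again without searching every node.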

What are the 3 Vs of Big Data?

  • Velocity
    • Speed of change/delivery of data
  • Volume
    • Amount of data being stored
  • Variety
    • Different types and formats of data (structured, semi-structured, unstructured) and so different types of storage - RDBMS (table/row), Columnar store, Document stores, ....

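The CAP theorem allows us to identify what type of database will suit the data's needs; a distributed database can fully guarantee only two of the three at the same time:

  • C = Consistency
  • A = Availability
  • P = Partition tolerance (the system keeps working when the network between nodes is split; it does not stand for performance)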

What is a Data Engineer

  • Look for data sources of relevance
    • Your client needs extra data beyond what you have in your organization
    • Data to complement your own data - e.g. geographic financial data to determine likelihood of monthly payments
  • ETL
    • Extract Transform and Load pipelines
    • Taking the data from differing data sources - performing some wrangling/munging, code transformation, filtering - loading into customer ready data sources (e.g. from JSON to Parquet)
    • Transformation may include normalizing and repairing data;
      • Find missing values
      • Remove duplicates
      • Eliminate outliers
  • Providing data products
    • example is CDDB
      • Crawl internet and other sources for artist information and tracks, etc
      • Provide a database and interface
      • Provide API for application developers to use the data source
      • Provide billing model
  • You will need to code
    • Python or JavaScript are most common
    • You might end up coding in something completely different
      • Logstash = it has its own transformation language (it is an ETL tool); this is a DSL = Domain Specific Language
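The transformation steps listed under ETL above (find missing values, remove duplicates, eliminate outliers) can be sketched in plain Python. The sample rows are invented, and the outlier rule (more than 3 median absolute deviations from the median) is only one possible choice, assumed here for illustration.

```python
import statistics

# Hypothetical extracted rows; in practice these might come from JSON or CSV.
raw_rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},     # missing value
    {"id": 3, "amount": 12.0},
    {"id": 3, "amount": 12.0},     # duplicate of id 3
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 9.0},
    {"id": 6, "amount": 500.0},    # outlier
]

# 1. Find (and here, drop) rows with missing values.
complete = [row for row in raw_rows if row["amount"] is not None]

# 2. Remove duplicates, keeping the first row seen for each id.
seen_ids = set()
unique = []
for row in complete:
    if row["id"] not in seen_ids:
        seen_ids.add(row["id"])
        unique.append(row)

# 3. Eliminate outliers using the median absolute deviation (MAD).
amounts = [row["amount"] for row in unique]
median = statistics.median(amounts)
mad = statistics.median(abs(a - median) for a in amounts)
cleaned = [row for row in unique if abs(row["amount"] - median) <= 3 * mad]

print([row["id"] for row in cleaned])
```

A real pipeline would more likely use a library such as pandas for these steps, and would decide per column whether to drop, fill or flag problem values.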

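Data workflow systems normally model a pipeline as a Directed Acyclic Graph, or DAG for short: tasks are the nodes, dependencies are the directed edges, and no task can depend on itself through a cycle.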
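A DAG of tasks can be ordered so that every task runs after the tasks it depends on. The sketch below uses graphlib from the standard library (Python 3.9+); the task names are hypothetical and this is not the API of any particular workflow tool.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Adding a dependency from extract back to report creates a cycle,
# so the graph is no longer a DAG and cannot be ordered.
pipeline_with_cycle = dict(pipeline, extract={"report"})
cycle_detected = False
try:
    list(TopologicalSorter(pipeline_with_cycle).static_order())
except CycleError:
    cycle_detected = True
    print("cycle detected: not a valid DAG")
```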

RDBMS typically store data row by row; Parquet is a columnar file format (not a database) that stores the values of each column together.
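The difference between the two layouts can be pictured with plain Python structures. This is a conceptual sketch with made-up data; it does not read or write real Parquet files.

```python
# Row-based layout: each record is kept together (as in an RDBMS table).
rows = [
    {"id": 1, "city": "Leeds", "amount": 10.0},
    {"id": 2, "city": "York", "amount": 12.5},
    {"id": 3, "city": "Leeds", "amount": 7.5},
]

# Columnar layout: all values of one column are kept together.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns)

# Summing one column touches every whole record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)

# ...but only a single list in the columnar layout.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

This is why columnar formats tend to suit analytical queries that read a few columns across many rows, while row storage suits reading or writing whole records.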