Data Science - bobbae/gcp GitHub Wiki

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and applies that knowledge across a broad range of application domains.

Data science is about gathering data, analyzing it and making decisions. It is also about finding patterns in data in order to make predictions. Several excellent, freely available data science curricula are online.

https://developers.google.com/learn/topics/datascience

https://towardsdatascience.com/is-data-science-really-a-science-9c2249ee2ce4

https://github.com/datasciencemasters/go

https://www.unifyingdatascience.org/html/index.html

Data Science-Driven Organization

A data science-driven organization is an entity that maximizes the value from the data available while using machine learning and analytics to create a sustainable competitive advantage.

Data Engineering

Data engineers build and maintain the systems that allow data scientists to access and interpret data. The role generally involves creating data models, building data pipelines and overseeing ETL (extract, transform, load).
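
As a rough illustration of the extract, transform, load steps named above, here is a minimal sketch using only the Python standard library. The sample data, the `orders` table and the function names are assumptions made up for this example; a production pipeline would typically read from real sources and load into a warehouse such as BigQuery.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount_usd
1,alice,19.99
2,bob,
3,alice,5.01
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert types."""
    cleaned = []
    for row in rows:
        if not row["amount_usd"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount_usd"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
load(transform(extract(RAW_CSV)), conn)

# A data scientist can now query the cleaned data.
totals = conn.execute(
    "SELECT customer, ROUND(SUM(amount_usd), 2) FROM orders GROUP BY customer"
).fetchall()
print(totals)
```

Keeping the three stages as separate functions makes each one easy to test and replace on its own.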

Data Analytics

What is the difference between data analytics and data science? And what is the difference between data science and data engineering?

https://cloud.google.com/training/data-engineering-and-analytics

Google Smart Analytics Platform

The GCP smart analytics platform can help strip out layers of complexity and analyze data to solve problems in a broad range of application areas, such as anomaly detection, data monetization, general analytics, log analytics, pattern recognition, predictive forecasting, real-time clickstream analytics, time-series analytics and working with data lakes.

These topics require a diverse range of knowledge and skills from many disciplines. The need to distinguish data engineers from analysts and scientists diminishes when the work is this multidisciplinary.

Data Science platform

https://medium.com/adevinta-tech-blog/enabling-data-science-on-google-cloud-platform-at-adevinta-c14e67703fb2

AI Platform

AI Platform is a development platform for building AI apps that run on Google Cloud and on-premises. It helps you take ML projects to production quickly and cost-effectively.

AI Platform Training offers built-in algorithms, so you can train a model on your data without writing training code.

AI Hub

AI Hub offers a collection of components for developers and data scientists building artificial intelligence (AI) systems.

AI Explanations

https://www.youtube.com/watch?v=XXvFHqLv9p8

Document AI

Document AI lets you unlock insights from documents with machine learning. Google Cloud’s Vision OCR (optical character recognition) and form parser technology uses deep-learning neural network algorithms to perform text, character, and image recognition in over 200 languages. Using the same deep machine learning technology that powers Google Search and Assistant, Google Cloud’s Document AI products enable you to derive insights from your unstructured documents.

Dialogflow

Dialogflow is a natural language understanding platform that makes it easy to design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your product.

Business messages

https://cloud.google.com/blog/products/workspace/strengthening-conversations-with-customers-using-ai-powered-business-messages

Machine Learning

Machine Learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
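
To make "learning from data" concrete, the sketch below fits a straight line to a handful of points with ordinary least squares and then predicts an unseen value. The data and the names `fit_line`, `hours` and `scores` are invented for illustration; real projects would normally use a library such as scikit-learn or TensorFlow.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data that roughly follows y = 2x + 1.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(hours, scores)  # the "learning" step
prediction = slope * 6.0 + intercept        # apply the model to new input
print(f"slope={slope:.2f} intercept={intercept:.2f} prediction={prediction:.2f}")
```

The model is just two numbers learned from the examples; no rule relating hours to scores was written by hand.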

Machine learning engineering and data science share many areas of interest, but they also differ in some respects.

AutoML Tables

AutoML Tables enables your entire team to automatically build and deploy state-of-the-art machine learning models on structured data at massively increased speed and scale.

Cloud Inference API

Time-series analysis is essential for the day-to-day operation of many companies. The most popular use cases include analyzing foot traffic and conversion for retailers, detecting data anomalies, identifying correlations in real time over sensor data, and generating high-quality recommendations. With Cloud Inference API, you can gather insights in real time from your typed time-series datasets.
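
The following is not the Cloud Inference API. It is a small local sketch of one of the use cases mentioned above, detecting anomalies in a time series, using a trailing-window z-score. The function name, window size, threshold and foot-traffic numbers are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical hourly foot-traffic counts with one obvious spike.
foot_traffic = [100, 102, 98, 101, 99, 103, 100, 250, 101, 99]
anomalies = find_anomalies(foot_traffic)
print(anomalies)
```

Only the spike at index 7 is flagged; the values right after it are not, because the spike inflates the standard deviation of their trailing window.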

Scikit Learn

The scikit-learn Python library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
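
As a brief sketch of the classification case, the example below trains a logistic regression model on the iris dataset that ships with scikit-learn. It assumes scikit-learn is installed; the split ratio and `random_state` are arbitrary choices for reproducibility, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small bundled dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same `fit` / `predict` pattern, so swapping in a different classifier usually changes only the line that constructs `model`.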

TensorFlow

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive ecosystem of tools and libraries for building and deploying ML-powered applications.

Cloud GPUs

Compute Engine provides graphics processing units (GPUs) that you can add to your virtual machine instances. You can use these GPUs to accelerate specific workloads on your instances such as machine learning and data processing.

https://www.youtube.com/watch?v=jUZhe1aTnFk

Cloud TPU

Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are designed from the ground up with the benefit of Google’s deep experience and leadership in machine learning.

https://www.youtube.com/watch?v=2kSo7Az4ZOs

R

https://www.tidyverse.org/

Keras

Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.

Datalab

Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models.

Colab

Colab hosts Jupyter notebooks, which let you combine executable code and rich text in a single document.

Jupyter

Jupyter Notebooks combine code, data and visualizations for reproducible analytics.

Cookiecutter Data Science

The Cookiecutter Data Science template is a logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
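
To give a feel for what a standardized project structure looks like, the sketch below creates a simplified directory layout in a temporary folder. The `LAYOUT` list is an assumed, partial subset in the spirit of the template, not the template itself; the real project is generated by the cookiecutter tool and contains more files.

```python
import tempfile
from pathlib import Path

# Assumed, simplified subset of a data science project layout.
LAYOUT = [
    "data/raw",         # original, immutable data
    "data/interim",     # intermediate, partially transformed data
    "data/processed",   # final datasets used for modeling
    "notebooks",        # exploratory notebooks
    "models",           # trained models and their outputs
    "reports/figures",  # generated graphics for reports
    "src",              # source code for the project
]


def scaffold(root):
    """Create the layout under `root` and return the directories made."""
    root = Path(root)
    for rel in LAYOUT:
        (root / rel).mkdir(parents=True, exist_ok=True)
    (root / "README.md").write_text("# Project\n")
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(tmp)
    print(created)
```

Separating raw, interim and processed data makes it clearer which files are inputs and which can be regenerated.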

Data Science at Command Line

You may become a more efficient, practical and productive data scientist by learning to leverage the power of the command line.

https://www.datascienceatthecommandline.com/

CLI tools

Some CLI tools can be useful in data science.

Datamash

GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.
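
For readers who have not used datamash, the Python sketch below shows roughly what a grouped mean such as `datamash -s -g 1 mean 2` computes over tab-separated input. It is an illustration of the operation only, not datamash's implementation, and the sample data is made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tab-separated input: a group key and a numeric value per line.
TSV = "a\t1\nb\t10\na\t3\nb\t20\n"


def grouped_mean(text):
    """Group rows by the first field and average the second field."""
    groups = defaultdict(list)
    for line in text.splitlines():
        key, value = line.split("\t")
        groups[key].append(float(value))
    # Sort by key so the output order is deterministic.
    return {key: mean(vals) for key, vals in sorted(groups.items())}


result = grouped_mean(TSV)
print(result)
```

On the command line the same summary is a one-liner, which is the main appeal of datamash for quick checks.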

Parallel

GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
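
The idea of running independent jobs side by side can be sketched in Python as well. The example below is not GNU parallel; it is an assumed analogue that launches one child process per input and limits concurrency to two workers, similar in spirit to the `-j` option. The job itself (squaring a number) is a placeholder.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Placeholder job: a tiny Python one-liner that squares its argument.
JOB = "import sys; print(int(sys.argv[1]) ** 2)"


def run_job(arg):
    """Run one job as a separate process and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", JOB, arg],
        capture_output=True,
        text=True,
        check=True,  # raise if the job exits with a non-zero status
    )
    return result.stdout.strip()


inputs = ["1", "2", "3", "4"]

# Threads only wait on child processes, so two workers means
# at most two jobs run at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    outputs = list(pool.map(run_job, inputs))  # results keep input order

print(outputs)
```

GNU parallel additionally handles remote hosts, output grouping and retries, none of which this sketch attempts.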

Awk

Awk is a record-processing language created by Aho, Weinberger, and Kernighan in the 1970s; the name AWK comes from the initials of their surnames. Data scientists have recently rediscovered awk for quick processing of text data.
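
Awk's model is a pattern and an action applied to every record. The Python sketch below mirrors an awk one-liner such as `awk -F, '$3 > 100 { sum += $3 } END { print sum }'`, with comments mapping each step back to awk. The sales records and the function name are invented for the example.

```python
# Hypothetical comma-separated records: region, product, amount.
RECORDS = """\
east,widget,120
west,gadget,80
east,gadget,150
"""


def sum_large_sales(text, threshold=100.0):
    """Sum the third field of every record where it exceeds `threshold`."""
    total = 0.0
    for line in text.splitlines():   # awk reads one record per line
        fields = line.split(",")     # awk: -F, sets the field separator
        amount = float(fields[2])    # awk: $3 (awk fields are 1-indexed)
        if amount > threshold:       # awk pattern: $3 > 100
            total += amount          # awk action: { sum += $3 }
    return total                     # awk: END { print sum }


total = sum_large_sales(RECORDS)
print(total)
```

The awk version fits on one line because the loop over records and the field splitting are built into the language.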

Data Lineage

Data lineage uncovers the life cycle of data—it aims to show the complete data flow, from start to finish. Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all transformations the data underwent along the way—how the data was transformed, what changed, and why.
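
A minimal sketch of the idea, assuming a toy in-memory dataset: each transformation records what was done, why, and how the row count changed, so the full path from source to result can be replayed. The class names, the `sensors.csv` source label and the cleaning steps are hypothetical; dedicated lineage tools capture far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    operation: str  # what was done
    reason: str     # why it was done
    rows_in: int
    rows_out: int


@dataclass
class TrackedDataset:
    source: str
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, operation, reason, func):
        """Apply `func` to the rows and return a new dataset whose
        lineage includes a record of this transformation."""
        new_rows = func(self.rows)
        step = LineageStep(operation, reason, len(self.rows), len(new_rows))
        return TrackedDataset(self.source, new_rows, self.lineage + [step])


raw = TrackedDataset("sensors.csv (hypothetical)", [3.0, None, 4.5, -1.0])
clean = (
    raw.apply("drop_nulls", "sensor was offline",
              lambda rs: [r for r in rs if r is not None])
       .apply("drop_negatives", "readings must be non-negative",
              lambda rs: [r for r in rs if r >= 0])
)

print(f"source: {clean.source}")
for step in clean.lineage:
    print(f"{step.operation}: {step.rows_in} -> {step.rows_out} rows ({step.reason})")
```

Because `apply` returns a new dataset instead of mutating the old one, the raw data and every intermediate lineage record stay intact.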

https://www.keboola.com/blog/data-lineage-tools

Data Science Tutorial

https://www.guru99.com/data-science-tutorial.html

JavaScript for Data Science

https://js4ds.org/

Data Science related Math

Basic Math

https://towardsdatascience.com/mathematics-for-data-science-e53939ee8306

Scalar and Tensor

https://medium.datadriveninvestor.com/from-scalar-to-tensor-fundamental-mathematics-for-machine-learning-with-intuitive-examples-part-163727dfea8d

Norm and Orthogonality

https://towardsdatascience.com/from-norm-to-orthogonality-fundamental-mathematics-for-machine-learning-with-intuitive-examples-57bb898e69f2

Eigendecomposition to Determinant

https://towardsdatascience.com/from-eigendecomposition-to-determinant-fundamental-mathematics-for-machine-learning-with-1b6b449a82c6

Calculus

https://www.kdnuggets.com/2022/02/mlm-hidden-building-block-machine-learning.html

Single Variable Calculus

Calculus 1A: Differentiation

Calculus 1B: Integration

Calculus 1C: Coordinate Systems & Infinite Series

Linear Algebra

https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/

Multivariable Calculus

http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/index.htm

Statistics & Probability

Introduction to Probability

Intro to Descriptive Statistics

Intro to Inferential Statistics

Cheatsheets

https://www.kdnuggets.com/2022/02/complete-collection-data-science-cheat-sheets-part-1.html

Books

https://www.kdnuggets.com/2022/03/best-data-science-books-beginners.html

Streamlit

https://docs.streamlit.io/

Build ML Webapps

https://www.kdnuggets.com/2022/03/build-machine-learning-web-app-5-minutes.html

https://medium.com/talabat-tech/data-apps-from-local-to-live-in-10-minutes-a886d5453c7

Tutorials

Qwiklabs

Data Science Qwiklabs.