Data Science - bobbae/gcp GitHub Wiki
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and to apply that knowledge and those insights across a broad range of application domains.
Data science is about data gathering, analysis and decision-making. It is also about finding patterns in data in order to make predictions. Several excellent, freely available data science curricula are online.
https://developers.google.com/learn/topics/datascience
https://towardsdatascience.com/is-data-science-really-a-science-9c2249ee2ce4
https://github.com/datasciencemasters/go
https://www.unifyingdatascience.org/html/index.html
Data Science-Driven Organization
A data science-driven organization is an entity that maximizes the value from the data available while using machine learning and analytics to create a sustainable competitive advantage.
Data Engineering
Data engineers build and maintain the systems that allow data scientists to access and interpret data. The role generally involves creating data models, building data pipelines and overseeing ETL (extract, transform, load).
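As a rough illustration of the extract, transform and load steps named above, here is a minimal sketch that uses only the Python standard library. The CSV content, the `visits` table and the function names are hypothetical and exist only for this example; a real pipeline would read from actual sources and would usually run on a managed service.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """username,country,visits
alice,us,3
bob,de,
carol,US,5
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with missing visits and normalize country codes."""
    cleaned = []
    for row in rows:
        if not row["visits"]:
            continue  # skip incomplete records
        cleaned.append((row["username"], row["country"].upper(), int(row["visits"])))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits (username TEXT, country TEXT, visits INTEGER)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Once loaded, the data can be queried by analysts and data scientists.
total_by_country = dict(
    conn.execute("SELECT country, SUM(visits) FROM visits GROUP BY country").fetchall()
)
print(total_by_country)
```

Keeping the three stages as separate functions makes each one easy to test and replace on its own, which is the main reason pipelines are usually structured this way.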
Data Analytics
What is the difference between data analytics and data science? What is the difference between data science and data engineering?
https://cloud.google.com/training/data-engineering-and-analytics
Google Smart Analytics Platform
The GCP smart analytics platform can help strip out layers of complexity and analyze data to solve problems across a broad range of applications, such as anomaly detection, data monetization, general analytics, log analytics, pattern recognition, predictive forecasting, real-time clickstream analytics, time-series analytics and working with data lakes.
These topics require a diverse range of knowledge and skills from many disciplines. The need to distinguish data engineers from analysts and scientists diminishes when faced with endeavors of such multi-disciplinary scope.
Data Science platform
AI Platform
AI Platform is a development platform for building AI apps that run on Google Cloud and on-premises. It helps you take your ML projects to production quickly and cost-effectively.
AI Platform training with built-in algorithms.
AI Hub
AI Hub offers a collection of components for developers and data scientists building artificial intelligence (AI) systems.
AI Explanations
https://www.youtube.com/watch?v=XXvFHqLv9p8
Document AI
Document AI lets you unlock insights from documents with machine learning. Google Cloud’s Vision OCR (optical character recognition) and form parser technology uses industry-leading deep-learning neural network algorithms to perform text, character, and image recognition in over 200 languages with exceptional accuracy. Using the same deep machine learning technology that powers Google Search and Assistant, Google Cloud’s Document AI products enable you to derive valuable insights from your unstructured documents.
Dialogflow
Dialogflow is a natural language understanding platform that makes it easy to design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your product.
Business messages
Machine Learning
Machine Learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
Machine learning engineering and data science share common areas of interest, along with some differences.
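To make the idea of "learning from data" concrete, the sketch below fits a straight line to a handful of points with ordinary least squares and then uses the fitted parameters to predict a new value. The data values and the `fit_line` and `predict` names are made up for illustration; this is one of the simplest possible models, not a recommendation for real workloads.

```python
# Toy data following roughly y = 2x + 1 (hypothetical values for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": the parameters are estimated from the data, not hand-coded.
slope, intercept = fit_line(xs, ys)


def predict(x):
    """Apply the learned parameters to a new input."""
    return slope * x + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}, predict(5)={predict(5.0):.2f}")
```

The same pattern of fitting on known data and then predicting on new inputs carries over to the libraries and services listed in this section.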
AutoML Tables
AutoML Tables enables your entire team to automatically build and deploy state-of-the-art machine learning models on structured data at massively increased speed and scale.
Cloud Inference API
Time-series analysis is essential for day-to-day operation of many companies. Most popular use cases include analyzing foot traffic and conversion for retailers, detecting data anomalies, identifying correlations in real time over sensor data, or generating high-quality recommendations. With Cloud Inference API, you can gather insights in real time from your typed time-series datasets.
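One of the use cases mentioned above, detecting anomalies in a time series, can be sketched with a trailing-window z-score. This example does not call the Cloud Inference API; it is a generic standard-library illustration, and the `find_anomalies` function, its default window and threshold, and the foot-traffic numbers are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A flat window: any change at all is treated as anomalous.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical hourly foot-traffic counts with one obvious spike.
foot_traffic = [20, 22, 21, 19, 20, 21, 95, 20, 22, 21]
print(find_anomalies(foot_traffic))
```

A trailing window keeps the check usable on streaming data, since each point is compared only with values that arrived before it.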
scikit-learn
The scikit-learn Python library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
TensorFlow
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive ecosystem of tools and libraries to build and deploy ML-powered applications.
Cloud GPUs
Compute Engine provides graphics processing units (GPUs) that you can add to your virtual machine instances. You can use these GPUs to accelerate specific workloads on your instances such as machine learning and data processing.
https://www.youtube.com/watch?v=jUZhe1aTnFk
Cloud TPU
Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are designed from the ground up with the benefit of Google’s deep experience and leadership in machine learning.
https://www.youtube.com/watch?v=2kSo7Az4ZOs
R
Keras
Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.
Datalab
Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models.
Colab
Colab supports Jupyter notebooks, which allow you to combine executable code and rich text.
Jupyter
Jupyter Notebooks combine code, data and visualizations for reproducible analytics.
Cookiecutter Data Science
The Cookiecutter Data Science template is a logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Data Science at Command Line
You may become a more efficient, practical and productive data scientist by learning to leverage the power of the command line.
https://www.datascienceatthecommandline.com/
CLI tools
Some CLI tools can be useful in data science.
Datamash
GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.
Parallel
GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
Awk
Awk is a record-processing tool written by Aho, Weinberger, and Kernighan in the 1970s. The name AWK comes from the initials of their surnames. Data scientists have rediscovered awk recently.
Data Lineage
Data lineage uncovers the life cycle of data—it aims to show the complete data flow, from start to finish. Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all transformations the data underwent along the way—how the data was transformed, what changed, and why.
https://www.keboola.com/blog/data-lineage-tools
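The idea of recording how data was transformed, what changed, and why can be sketched as a small data structure that travels with the data. The `LineageStep` and `TrackedDataset` classes, the `orders.csv` source name, and the sample rows are hypothetical and exist only for this example; dedicated lineage tools capture far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    """One recorded transformation: what ran, why, and its effect on row count."""
    name: str
    reason: str
    rows_in: int
    rows_out: int


@dataclass
class TrackedDataset:
    """A dataset that carries its own lineage record alongside the rows."""
    source: str
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, name, reason, func):
        """Run `func` on the rows and return a new dataset with the step recorded."""
        new_rows = func(self.rows)
        step = LineageStep(name, reason, len(self.rows), len(new_rows))
        return TrackedDataset(self.source, new_rows, self.lineage + [step])


raw = TrackedDataset("orders.csv", [{"amount": 10}, {"amount": None}, {"amount": 25}])
clean = raw.apply(
    "drop_missing",
    "rows without an amount cannot be aggregated",
    lambda rows: [r for r in rows if r["amount"] is not None],
)
scaled = clean.apply(
    "to_cents",
    "downstream reports expect integer cents",
    lambda rows: [{"amount": r["amount"] * 100} for r in rows],
)

# Walk the recorded flow from source to the final dataset.
for step in scaled.lineage:
    print(f"{scaled.source}: {step.name} ({step.rows_in} -> {step.rows_out} rows): {step.reason}")
```

Because `apply` returns a new dataset instead of mutating the old one, every intermediate version keeps an accurate history of how it was produced.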
Data Science Tutorial
https://www.guru99.com/data-science-tutorial.html
JavaScript for Data Science
Data Science related Math
Basic Math
https://towardsdatascience.com/mathematics-for-data-science-e53939ee8306
Scalar and Tensor
Norm and Orthogonality
Eigendecomposition to Determinant
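For readers who prefer code to notation, the three topics just listed can be previewed with a few lines of plain Python. The helper names and the example vectors and matrix are assumptions chosen for illustration, and the eigenvalue formula shown only covers symmetric 2x2 matrices.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean (L2) norm: the square root of a vector's dot product with itself."""
    return math.sqrt(dot(v, v))


def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] in closed form."""
    center = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return center + radius, center - radius


u, v = [3.0, 4.0], [-4.0, 3.0]
print(norm(u))    # length of u
print(dot(u, v))  # a zero dot product means u and v are orthogonal

# For the matrix [[2, 1], [1, 2]], compare the determinant with the eigenvalues.
lam1, lam2 = symmetric_2x2_eigenvalues(2.0, 1.0, 2.0)
det = 2.0 * 2.0 - 1.0 * 1.0
print(lam1 * lam2, det)  # the determinant equals the product of the eigenvalues
```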
Calculus
https://www.kdnuggets.com/2022/02/mlm-hidden-building-block-machine-learning.html
Single Variable Calculus
Calculus 1C: Coordinate Systems & Infinite Series
Linear Algebra
https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/
Multivariable Calculus
http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/index.htm
Statistics & Probability
Intro to Descriptive Statistics
Intro to Inferential Statistics
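As a small taste of descriptive statistics, the sketch below summarizes a sample with Python's built-in `statistics` module. The page-view numbers are hypothetical; they include one outlier to show why the mean and the median can tell different stories.

```python
import statistics

# Hypothetical sample: daily page views over one week, with one outlier day.
page_views = [120, 135, 128, 150, 142, 128, 310]

summary = {
    "mean": statistics.mean(page_views),
    "median": statistics.median(page_views),
    "mode": statistics.mode(page_views),
    "stdev": statistics.stdev(page_views),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")

# The single outlier pulls the mean well above the median.
print(summary["mean"] > summary["median"])
```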
Cheatsheets
https://www.kdnuggets.com/2022/02/complete-collection-data-science-cheat-sheets-part-1.html
Books
https://www.kdnuggets.com/2022/03/best-data-science-books-beginners.html
Streamlit
Build ML Webapps
https://www.kdnuggets.com/2022/03/build-machine-learning-web-app-5-minutes.html
https://medium.com/talabat-tech/data-apps-from-local-to-live-in-10-minutes-a886d5453c7
Tutorials
- https://cloud.google.com/vertex-ai/docs
- https://cloud.google.com/training/machinelearning-ai#data-scientist-learning-path
- https://codelabs.developers.google.com/?cat=machinelearning
- https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/notebooks/official/pipelines
- https://cloud.google.com/blog/topics/developers-practitioners/lets-get-it-started-triggering-ml-pipeline-runs
- https://cloud.google.com/blog/products/ai-machine-learning/building-the-data-science-driven-organization
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://www.w3schools.com/datascience/
- https://www.tutorialspoint.com/python_data_science/index.htm
- https://www.classcentral.com/course/data-science-crash-course-4392
- https://towardsdatascience.com/10-resources-to-learn-data-science-on-google-cloud-c19fb3033df5