2.1.2.1. Data Science Tools: Open Source Tools

Categories of Data Science Tools

Open source tools are available for various data science tasks.

Data Management is the process of persisting and retrieving data.

Data Integration and Transformation, often referred to as Extract, Transform, and Load, or “ETL,” is the process of retrieving data from remote data management systems, transforming it, and loading it into a local data management system.

Data Visualization is part of an initial data exploration process, as well as being part of a final deliverable.

Model Building is the process of creating a machine learning or deep learning model using an appropriate algorithm with a lot of data.
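
As a concrete illustration (not part of the original course notes), here is a minimal model-building sketch using scikit-learn; the dataset and algorithm are arbitrary choices made only for the example.

```python
# Minimal model-building sketch: train a classifier on a toy dataset.
# Assumes scikit-learn is installed; dataset and algorithm are arbitrary choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)                      # the "model building" step
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```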

Model deployment makes such a machine learning or deep learning model available to third-party applications.

Model monitoring and assessment ensures continuous performance quality checks on the deployed models. These checks are for accuracy, fairness, and adversarial robustness.

Code asset management uses versioning and other collaborative features to facilitate teamwork.

Data asset management brings the same versioning and collaborative components to data. Data asset management also supports replication, backup, and access right management.

Development environments, commonly known as Integrated Development Environments, or “IDEs”, are tools that help the data scientist to implement, execute, test, and deploy their work.

Execution environments are tools where data preprocessing, model training, and deployment take place.

Finally, there is fully integrated, visual tooling available that covers all the previous tooling components, either partially or completely.


Open Source Tools for Data Science

Data Management - The most widely used open source data management tools are relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the Hadoop Distributed File System (HDFS) or cloud file systems like Ceph. Finally, Elasticsearch is mainly used for storing text data and creating a search index for fast document retrieval.
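
To make “persisting and retrieving data” concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a self-contained stand-in for a relational database such as MySQL or PostgreSQL; the table and column names are invented for the example.

```python
# Persist and retrieve data with a relational database.
# sqlite3 stands in for MySQL/PostgreSQL so the example is self-contained.
import sqlite3

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements (sensor, value) VALUES (?, ?)",
    [("temp", 21.5), ("temp", 22.1), ("humidity", 48.0)],
)
conn.commit()

# Retrieve: average value per sensor.
for sensor, avg in conn.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor"
):
    print(sensor, avg)
conn.close()
```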

Data Integration and Transformation - Data scientists often propose the term “ELT” (Extract, Load, Transform), stressing the fact that data is simply dumped somewhere first and that the data engineer or data scientist is responsible for transforming it afterwards. Prominent open source tools in this space include Apache Airflow, originally created by Airbnb; KubeFlow, which enables you to execute data science pipelines on top of Kubernetes; Apache Kafka, which originated at LinkedIn; Apache NiFi, which delivers a very nice visual editor; Apache Spark SQL, which enables you to use ANSI SQL and scales up to compute clusters with thousands of nodes; and Node-RED, which also provides a visual editor. Node-RED consumes so few resources that it even runs on small devices like a Raspberry Pi.
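
As a hedged sketch of what an extract-transform-load pipeline looks like in Apache Airflow (assuming Airflow 2.x; the DAG id, task names, and callables are invented for illustration):

```python
# Minimal Airflow 2.x DAG sketch: three tasks wired as extract -> transform -> load.
# The callables only print; in a real pipeline they would move actual data.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from a remote data management system")

def transform():
    print("clean and reshape the data")

def load():
    print("write the data into a local data management system")

with DAG(
    dag_id="etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```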

Data Visualization - We have to distinguish between programming libraries, where you need to write code, and tools that provide a user interface. Among the tools with a user interface, Hue can create visualizations from SQL queries. Kibana, a data exploration and visualization web application, is limited to Elasticsearch as its data source. Finally, Apache Superset is a data exploration and visualization web application.
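
For contrast with the UI-driven tools above, here is what the “programming library” side looks like, a minimal matplotlib sketch (the data is made up):

```python
# Minimal code-based visualization with matplotlib (made-up data).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10.2, 11.5, 9.8, 12.4]

plt.bar(months, revenue)
plt.title("Revenue by month")
plt.ylabel("Revenue (in millions)")
plt.savefig("revenue.png")  # or plt.show() in an interactive session
```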

Model Deployment - Once you’ve created a machine learning model capable of predicting some key aspects of the future, you should make that model consumable by other developers by turning it into an API. Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for other libraries is on the roadmap. Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache Spark ML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy Spark ML models is by using MLeap. Finally, TensorFlow can serve any of its models using TensorFlow Serving. You can deploy TensorFlow models to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even to a web browser using TensorFlow.js.
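
As a generic, framework-agnostic sketch of “turning a model into an API” (this is not how PredictionIO, Seldon, or TensorFlow Serving work internally; Flask and the endpoint name are assumptions made for illustration):

```python
# Minimal sketch: expose a trained scikit-learn model as a REST endpoint with Flask.
from flask import Flask, request, jsonify
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model at startup (a placeholder for loading a persisted model).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [5.1, 3.5, 1.4, 0.2]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```

A third-party application could then POST JSON such as {"features": [5.1, 3.5, 1.4, 0.2]} to http://localhost:5000/predict and receive the prediction back.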

Model Monitoring and Assessment - Once you’ve deployed a machine learning model, you need to keep track of its prediction performance as new data arrives so that outdated models can be detected and replaced. ModelDB is a machine learning model metadata database where information about the models is stored and can be queried. It natively supports Apache Spark ML Pipelines and scikit-learn. A generic, multi-purpose tool called Prometheus is also widely used for machine learning model monitoring, although it’s not specifically made for this purpose. Model performance is not exclusively measured through accuracy: model bias against protected groups like gender or race is also important. The IBM AI Fairness 360 open source toolkit addresses exactly this; it detects and mitigates bias in machine learning models. Machine learning models, especially neural-network-based deep learning models, can be subject to adversarial attacks, where an attacker tries to fool the model with manipulated data or by manipulating the model itself. The IBM Adversarial Robustness 360 Toolbox can be used to detect vulnerability to adversarial attacks and help make the model more robust. Machine learning models are often considered to be a black box that applies some mysterious “magic.” The IBM AI Explainability 360 Toolkit makes the machine learning process more understandable by finding similar examples within a dataset that can be presented to a user for manual comparison. It can also train a simpler, interpretable machine learning model that explains how different input variables affect the final decision of the original model.
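
Here is a hand-rolled sketch of the kind of checks such tools automate: accuracy on a fresh batch of labeled post-deployment data, plus a simple fairness metric (disparate impact) computed by hand. The arrays, the protected attribute, and the thresholds are invented; this is not the AI Fairness 360 API, which provides such metrics out of the box.

```python
# Hand-rolled monitoring sketch: accuracy on new data plus a simple fairness check.
# Data, protected attribute, and thresholds are made up for illustration.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # labels observed after deployment
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model predictions on the same records
group  = np.array([0, 0, 1, 1, 0, 1, 1, 0])   # protected attribute (two groups, 0/1)

accuracy = (y_true == y_pred).mean()

# Disparate impact: ratio of favorable-outcome rates between the two groups.
rate_g1 = y_pred[group == 1].mean()
rate_g0 = y_pred[group == 0].mean()
disparate_impact = rate_g1 / rate_g0

print(f"accuracy={accuracy:.2f}, disparate impact={disparate_impact:.2f}")
if accuracy < 0.8 or not 0.8 <= disparate_impact <= 1.25:
    print("alert: model may be outdated or biased; schedule retraining/review")
```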

Code Asset Management - For code asset management – also referred to as version management or version control – Git is now the standard. Multiple services have emerged to support Git, with the most prominent being GitHub, which provides hosting for software development version management. The runner-up is definitely GitLab, which has the advantage of being a fully open source platform that you can host and manage yourself. Another choice is Bitbucket.

Data Asset Management - Data has to be versioned and annotated with metadata. Apache Atlas is a tool that supports this task. Another interesting project, ODPi Egeria, is managed through the Linux Foundation and is an open ecosystem. It offers a set of open APIs, types, and interchange protocols that metadata repositories use to share and exchange data. Finally, Kylo is an open source data lake management software platform that provides extensive support for a wide range of data asset management tasks.

Development Environments - One of the most popular current development environments that data scientists use is “Jupyter.” Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages through “kernels.” Kernels shouldn’t be confused with operating system kernels: Jupyter kernels encapsulate the interactive interpreters for the different programming languages. A key property of Jupyter Notebooks is the ability to unify documentation, code, output from the code, shell commands, and visualizations in a single document. JupyterLab is the next generation of Jupyter Notebooks and, in the long term, will actually replace them. The architectural changes being introduced in JupyterLab make Jupyter more modern and modular. From a user’s perspective, the main difference introduced by JupyterLab is the ability to open different types of files, including Jupyter Notebooks, data, and terminals, and arrange them on the canvas. Although Apache Zeppelin has been fully reimplemented, it is inspired by Jupyter Notebooks and provides a similar experience. One key differentiator is the integrated plotting capability: in Jupyter Notebooks you are required to use external libraries, whereas in Apache Zeppelin plotting doesn’t require coding. You can also extend these capabilities with additional libraries. RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011. It primarily runs R and all associated R libraries; Python development is possible as well, but R is the language most tightly integrated into the tool, providing an optimal user experience. RStudio unifies programming, execution, debugging, remote data access, data exploration, and visualization in a single tool. Spyder tries to mimic the behaviour of RStudio to bring its functionality to the Python world. Although Spyder does not have the same level of functionality as RStudio, data scientists do consider it an alternative; in the Python world, however, Jupyter is used more frequently. Spyder integrates code, documentation, visualizations, and other components into a single canvas.

Execution Environments - The well-known cluster-computing framework Apache Spark is among the most active Apache projects and is used across all industries, including in many Fortune 500 companies. The key property of Apache Spark is linear scalability: if you double the number of servers in a cluster, you also roughly double its performance. After Apache Spark began to gain market share, Apache Flink was created. The key difference between the two is that Apache Spark is a batch data processing engine, capable of processing huge amounts of data file by file, whereas Apache Flink is a stream processing engine, with its main focus on processing real-time data streams. Although both engines support both data processing paradigms, Apache Spark is usually the choice in most use cases. One of the latest developments in data science execution environments is “Ray,” which has a clear focus on large-scale deep learning model training.
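
A minimal PySpark sketch of the batch-processing side (assuming a local Spark installation; the file name and columns are made up):

```python
# Minimal batch job with Apache Spark (PySpark), run locally for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch_sketch").master("local[*]").getOrCreate()

# Hypothetical input: a CSV of sales records with 'region' and 'amount' columns.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate total sales per region; on a cluster, Spark distributes this work.
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```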

Fully Integrated Visual Tools - These tools support most of the important tasks covered above, including data integration and transformation, data visualization, and model building. KNIME originated at the University of Konstanz in 2004. It has a visual user interface with drag-and-drop capabilities as well as built-in visualization capabilities. KNIME can be extended by programming in R and Python, and has connectors to Apache Spark. Another example of this group of tools is Orange. It’s less flexible than KNIME, but easier to use. This covers the most common data science tasks and the open source tools relevant to those tasks.


Quiz