BigData - xinshuaiqi/My_books GitHub Wiki

Big Data & Data Science

How to get started learning big data? (Zhihu)

XSEDE Big Data 2018 Feb

Johns Hopkins Data Science Lab

Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data

From scratch to becoming a data sci… (www.zhihu.com/question/27630156)

Hadoop

  • focused on disk-based data
  • a basic map-reduce scheme
  • HBase
  • Hive
  • Hadoop Distributed File System (HDFS)
hdfs dfs -ls
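The map-reduce scheme listed above can be sketched in plain Python as a word count. This is only an illustration of the three phases (map, shuffle, reduce) on a single machine; it does not use Hadoop, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big ideas", "data on disk"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce steps would run on different nodes and the intermediate pairs would be written to disk between phases, which is the "disk-based" focus noted in the list.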

Apache Spark


Machine learning and statistics algorithms available in Spark (MLlib):

  • summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  • classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  • collaborative filtering techniques, including alternating least squares (ALS)
  • cluster analysis methods, including k-means and Latent Dirichlet Allocation (LDA)
  • dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)
  • feature extraction and transformation functions
  • optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)

An RDD (Resilient Distributed Dataset) is, essentially, Spark's representation of a set of data spread across multiple machines, with APIs that let you act on it.
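To make the idea concrete, here is a toy, single-process model of such a dataset in plain Python. It is not the Spark API: `ToyRDD` and its methods are hypothetical names chosen to resemble the concept. Data is split into partitions (standing in for machines), transformations such as `map` and `filter` are only recorded, and work happens when an action such as `collect` or `reduce` is called.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD; an illustration, not the Spark API."""

    def __init__(self, partitions, ops=()):
        self._partitions = partitions  # list of lists, one per "machine"
        self._ops = tuple(ops)         # recorded (kind, function) transformations

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Spread the data round-robin across num_partitions partitions."""
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        """Transformation: recorded now, applied later."""
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        """Transformation: recorded now, applied later."""
        return ToyRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, partition):
        """Apply every recorded transformation to one partition."""
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        """Action: compute each partition and gather the results."""
        return [x for part in self._partitions for x in self._compute(part)]

    def reduce(self, fn):
        """Action: reduce within each partition, then combine the partials.

        Raises TypeError if the dataset is empty.
        """
        computed = (self._compute(part) for part in self._partitions)
        partials = [reduce(fn, items) for items in computed if items]
        return reduce(fn, partials)


if __name__ == "__main__":
    rdd = ToyRDD.parallelize(range(1, 11), num_partitions=3)
    even_squares = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(sorted(even_squares.collect()))
    print(even_squares.reduce(lambda a, b: a + b))
```

The two-stage `reduce` (per partition first, then across partitions) mirrors why the reducing function needs to be associative when data lives on several machines.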

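In contrast to Hadoop's MapReduce, which writes intermediate data to disk after each job finishes, Spark uses in-memory computing and can analyze data in memory before it is ever written to disk. Spark is reported to run programs in memory up to 100 times faster than Hadoop MapReduce, and up to 10 times faster even when running on disk. For distributed storage, Spark can interface with HDFS, Cassandra, OpenStack Swift, and Amazon S3.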

Pachyderm/Kubernetes

http://www.pachyderm.io/

https://pachyderm.readthedocs.io/en/latest/index.html

Pachyderm runs the pipeline in a distributed, streaming fashion: as new data is added, the pipeline automatically processes it and outputs the results.
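The "process new data as it arrives" behaviour can be sketched with a small Python toy. This is not Pachyderm's API or pipeline format: `ToyPipeline`, `run`, and the dictionary standing in for an input repository are all hypothetical, and here `run` is called by hand, whereas Pachyderm triggers processing itself when data is added.

```python
class ToyPipeline:
    """Hypothetical sketch of a data-driven pipeline; not Pachyderm's API."""

    def __init__(self, transform):
        self._transform = transform  # function applied to each input item
        self._processed = set()      # names of inputs already handled
        self.output = {}             # input name -> transformed result

    def run(self, input_repo):
        """Process only the items added since the previous run.

        Returns the names of the newly processed items.
        """
        new_names = sorted(set(input_repo) - self._processed)
        for name in new_names:
            self.output[name] = self._transform(input_repo[name])
            self._processed.add(name)
        return new_names


if __name__ == "__main__":
    repo = {"a.txt": "hello"}
    pipeline = ToyPipeline(str.upper)
    print(pipeline.run(repo))   # first run sees a.txt
    repo["b.txt"] = "world"     # new data is added
    print(pipeline.run(repo))   # second run handles only b.txt
    print(pipeline.output)
```

Tracking which inputs were already processed is what lets the second run skip `a.txt`; it is a simplified stand-in for the incremental processing the note describes.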

https://pachyderm.readthedocs.io/en/latest/getting_started/beginner_tutorial.html

https://kubernetes.io/

What is Kubernetes

https://www.youtube.com/watch?v=R-3dfURb2hA