BigData - xinshuaiqi/My_books GitHub Wiki

Big Data & Data Science

Starting from scratch to become a data sci… (从零开始，成为数据科…), Zhihu: www.zhihu.com/question/27630156

Hadoop

* focused on disk-based data
* built on a basic MapReduce scheme
* HBase
* Hive
* Hadoop Distributed File System (HDFS)

hdfs dfs -ls
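
The "basic map-reduce scheme" mentioned above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration, and a real Hadoop job would run these phases across many machines with data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data science"]))
```

The map and reduce steps only ever look at one line or one key at a time, which is what makes the scheme easy to distribute.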

Apache Spark

* wiki
* XSEDE HPC Workshop BIG DATA Spark
* XSEDE Big Data 201…
* Johns Hopkins Data Science Lab
* Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning

GraphX is Apache Spark's API for graphs and graph-parallel computation, with a built-in library of common algorithms.

MLlib is Apache Spark's scalable machine learning library. It includes:

* summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
* classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
* collaborative filtering techniques including alternating least squares (ALS)
* cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
* dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)
* feature extraction and transformation functions
* optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
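
To make the first item in the list above concrete, here is a rough sketch of how summary statistics and a Pearson correlation can be computed partition by partition and then merged, which is the general idea that lets such statistics scale out. It is plain Python, not MLlib's API; `partial_sums`, `merge`, and `summarize` are hypothetical names, and the sum-of-squares formula used here is the simplest one rather than the most numerically stable.

```python
import math


def partial_sums(partition):
    """Per-partition pass: accumulate the count and sums needed later."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in partition:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def merge(a, b):
    """Combine the partial sums of two partitions element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def summarize(partitions):
    """Merge all partial sums, then derive means and the Pearson correlation."""
    total = (0.0,) * 6
    for part in partitions:
        total = merge(total, partial_sums(part))
    n, sx, sy, sxx, syy, sxy = total
    mean_x, mean_y = sx / n, sy / n
    cov = sxy / n - mean_x * mean_y
    var_x = sxx / n - mean_x ** 2
    var_y = syy / n - mean_y ** 2
    return {
        "count": int(n),
        "mean_x": mean_x,
        "mean_y": mean_y,
        "pearson_r": cov / math.sqrt(var_x * var_y),
    }


if __name__ == "__main__":
    # Two partitions of (x, y) pairs, as if they lived on two machines.
    print(summarize([[(1, 2), (2, 4)], [(3, 6), (4, 8)]]))
```

Only six numbers per partition have to be sent to the merging step, regardless of how many rows each partition holds.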

An RDD (Resilient Distributed Dataset) is essentially Spark's representation of a data set, spread across multiple machines, with APIs that let you act on it. (ref)
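
The description above can be mimicked with a toy class. This is only a sketch of the idea, assuming that in-process lists stand in for the shares held by different machines; `ToyRDD` is a made-up name and is not Spark's implementation. Unlike Spark, where transformations are lazy, this toy evaluates everything eagerly.

```python
import math
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: data split into partitions plus a small API."""

    def __init__(self, partitions):
        # Each inner list plays the role of one machine's share of the data.
        self._partitions = partitions

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        """Split data into contiguous chunks, one per partition."""
        data = list(data)
        size = math.ceil(len(data) / num_partitions) or 1
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return ToyRDD([[fn(x) for x in part] for part in self._partitions])

    def filter(self, pred):
        """Keep only the elements for which pred is true."""
        return ToyRDD([[x for x in part if pred(x)] for part in self._partitions])

    def reduce(self, fn):
        """Reduce inside each non-empty partition, then combine the partial results."""
        partials = [reduce(fn, part) for part in self._partitions if part]
        return reduce(fn, partials)

    def collect(self):
        """Gather all partitions back into one local list."""
        return [x for part in self._partitions for x in part]


if __name__ == "__main__":
    rdd = ToyRDD.parallelize(range(1, 7), num_partitions=3)
    evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(evens_squared.collect(), evens_squared.reduce(lambda a, b: a + b))
```

The method names mirror the ones commonly used on Spark RDDs (`map`, `filter`, `reduce`, `collect`), but the behaviour here is a simplification.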

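Unlike Hadoop's MapReduce, which writes intermediate data to disk after each job finishes, Spark uses in-memory computing and can analyze data in memory before it is ever written to disk. Running programs in memory, Spark can be up to 100 times faster than Hadoop MapReduce; even when running on disk, it can be up to 10 times faster. For distributed storage, Spark can interface with HDFS, Cassandra, OpenStack Swift, and Amazon S3.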
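
The difference between the two execution styles can be sketched in a few lines. This is a toy illustration in plain Python, assuming a two-stage job with made-up function names; it shows only where the intermediate data lives and says nothing about real-world performance.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First stage: square every number."""
    return [n * n for n in numbers]


def stage_two(numbers):
    """Second stage: add everything up."""
    return sum(numbers)


def run_via_disk(numbers):
    """MapReduce-style: write the intermediate result to disk, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.json")
        with open(path, "w") as fh:
            json.dump(stage_one(numbers), fh)
        with open(path) as fh:
            intermediate = json.load(fh)
    return stage_two(intermediate)


def run_in_memory(numbers):
    """Spark-style: hand the intermediate result straight to the next stage."""
    return stage_two(stage_one(numbers))


if __name__ == "__main__":
    data = [1, 2, 3]
    print(run_via_disk(data), run_in_memory(data))
```

Both functions produce the same answer; the in-memory version simply skips the round trip through the file system between stages.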

Pachyderm/Kubernetes

http://www.pachyderm.io/

https://pachyderm.readthedocs.io/en/latest/index.html

Pachyderm runs the pipeline in a distributed, streaming fashion: as new data is added, the pipeline automatically processes it and outputs the results.
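
The "process new data automatically" behaviour can be pictured with a small toy. This is a sketch of the idea only, not Pachyderm's API: `IncrementalPipeline`, `commit`, and `output` are hypothetical names, and everything here runs in a single process instead of on a cluster.

```python
class IncrementalPipeline:
    """Toy pipeline that processes only the inputs it has not seen before."""

    def __init__(self, transform):
        self._transform = transform
        self._results = {}  # input name -> transformed output

    def commit(self, files):
        """Take a mapping of name -> content and process just the new names."""
        processed = []
        for name, content in files.items():
            if name not in self._results:
                self._results[name] = self._transform(content)
                processed.append(name)
        return processed

    @property
    def output(self):
        """Return a copy of all results produced so far."""
        return dict(self._results)


if __name__ == "__main__":
    pipeline = IncrementalPipeline(str.upper)
    print(pipeline.commit({"a.txt": "hello"}))
    # The second commit contains one old and one new file; only the new one runs.
    print(pipeline.commit({"a.txt": "hello", "b.txt": "world"}))
    print(pipeline.output)
```

The second commit triggers work only for `b.txt`, which mirrors the idea that results for already-processed data do not need to be recomputed.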

https://pachyderm.readthedocs.io/en/latest/getting_started/beginner_tutorial.html

https://kubernetes.io/

What is Kubernetes

https://www.youtube.com/watch?v=R-3dfURb2hA