BigData - xinshuaiqi/My_books GitHub Wiki
Big Data & Data Science
Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data
[Starting from zero to become a data science "expert"](www.zhihu.com/question/27630156)
Hadoop
- focused on disk-based data
- a basic map-reduce scheme
- HBase
- HIVE
- Hadoop Distributed File System (HDFS)
hdfs dfs -ls
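- Hadoop quick start
- Hadoop Map/Reduce tutorial
- Hadoop beginner tutorial
- Getting started in practice with Hadoop, the open-source distributed computing framework (recommended) 1 2 3
- The darling of the big data era: an introduction to Hadoop and shared practical experience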
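The "basic map-reduce scheme" listed above can be sketched in plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce) using word counting as the task. It does not use Hadoop's API, and the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster, map and reduce tasks run on many machines and
    # intermediate results go through disk; here everything is in memory.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big ideas", "data beats ideas"]
    print(word_count(sample))
```

Because each `map_phase` call looks at one line only and each `reduce_phase` call looks at one key only, both phases can in principle be spread across machines, which is the point of the scheme.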
Apache Spark
- wiki
- XSEDE HPC Workshop BIG DATA Spark
- GraphX is Apache Spark's API for graphs and graph-parallel computation, with a built-in library of common algorithms.
- MLlib is Apache Spark's scalable machine learning library. It includes:
  * summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  * classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
  * collaborative filtering techniques including alternating least squares (ALS)
  * cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
  * dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)
  * feature extraction and transformation functions
  * optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
An RDD (Resilient Distributed Dataset) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. ref
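To make the idea of "a set of data spread across multiple machines, with APIs to act on it" more concrete, here is a toy, single-process sketch. It is not Spark's API: `ToyPartitionedDataset` and its methods are hypothetical names, and the "machines" are just in-memory lists. It only illustrates that the data lives in partitions and that each operation works partition by partition.

```python
from functools import reduce


class ToyPartitionedDataset:
    """Hypothetical stand-in for an RDD: a list of partitions, each a list.

    In Spark the partitions would live on different machines; here they
    are plain in-memory lists inside one process.
    """

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions):
        items = list(items)
        # Round-robin split so every partition gets a share of the data.
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Each partition is transformed independently of the others.
        return ToyPartitionedDataset([fn(x) for x in p] for p in self.partitions)

    def filter(self, pred):
        return ToyPartitionedDataset(
            [x for x in p if pred(x)] for p in self.partitions
        )

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then merge the partials.
        # fn should be associative and commutative for this to be valid.
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)

    def collect(self):
        # Gather every element back into one local list.
        return [x for p in self.partitions for x in p]


if __name__ == "__main__":
    ds = ToyPartitionedDataset.from_items(range(1, 11), num_partitions=3)
    total = ds.filter(lambda x: x % 2 == 0).map(lambda x: x * x).reduce(
        lambda a, b: a + b
    )
    print(total)
```

The sketch leaves out what makes real RDDs useful at scale, such as lazy evaluation, fault tolerance, and actual distribution across a cluster.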
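Whereas Hadoop MapReduce writes intermediate data to disk after each step of a job, Spark uses in-memory computing and can analyze data in memory before it is written to disk. Running in memory, Spark can be up to 100 times faster than Hadoop MapReduce; even when running on disk, it can be up to 10 times faster. For distributed storage, Spark can interface with HDFS, Cassandra, OpenStack Swift, Amazon S3, and others.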
- SQL RUNOOB Tutorial qxs
- Starting from zero to become a data science "expert"
- Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data
Pachyderm/Kubernetes
https://pachyderm.readthedocs.io/en/latest/index.html
Pachyderm runs the pipeline in a distributed, streaming fashion:
as new data is added, the pipeline automatically processes it and outputs the results.
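The "process new data automatically" behaviour described above can be illustrated with a small Python toy. This is not Pachyderm's API or pipeline spec: `ToyIncrementalPipeline`, `add_data`, and `run` are hypothetical names, and everything runs in one process. It only shows the idea that adding data triggers processing, and that inputs already processed are not processed again.

```python
class ToyIncrementalPipeline:
    """Hypothetical sketch: process each input item once, as it arrives."""

    def __init__(self, transform):
        self.transform = transform
        self.inputs = {}   # name -> data added to the input
        self.outputs = {}  # name -> result written by the pipeline

    def add_data(self, name, data):
        # Adding data triggers processing right away.
        self.inputs[name] = data
        return self.run()

    def run(self):
        # Only inputs that have no output yet are processed.
        new_names = [n for n in self.inputs if n not in self.outputs]
        for name in new_names:
            self.outputs[name] = self.transform(self.inputs[name])
        return new_names


if __name__ == "__main__":
    pipeline = ToyIncrementalPipeline(str.upper)
    print(pipeline.add_data("a.txt", "hello"))
    print(pipeline.add_data("b.txt", "world"))
    print(pipeline.outputs)
```

In this toy, the second `add_data` call processes only `b.txt`; the result for `a.txt` is reused rather than recomputed.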
https://pachyderm.readthedocs.io/en/latest/getting_started/beginner_tutorial.html
What is Kubernetes