Fewer, larger Hadoop clusters are better than many smaller clusters
1 large prod cluster, 1 large dev cluster (2013)
Can't have data scientists learning the intricacies of Hadoop
"Worse is Better" (WIB) article/concept
"MIT School" - everything has to be perfect, acceptable to push complexity to implementation to maintain a perfect API (C/C++)
"NJ School" (Bell Labs) - willing to push responsibilities to user for sake of keeping impl simple (LISP)
Hadoop is an example of WIB - easy interface, i.e. Map-Reduce (MR), like REST (limited method set)
Original Hadoop required writing Map-Reduce jobs by hand, which was tricky (especially for data scientists) - see the sketch below
Solution: PIG - higher level interface to Hadoop at Yahoo
Solution: Hive - same at Facebook
Lineage of query languages - PIG -> HIVE -> Presto?
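To make the contrast concrete, here is a minimal sketch (not from the talk) of a hand-written word count in Hadoop Streaming style - a separate mapper and reducer reading stdin and writing stdout - which Pig/Hive/Presto collapse into a one-line query like SELECT word, count(*) FROM words GROUP BY word:

    # Hypothetical Hadoop Streaming word count: run once with "map" and once with
    # "reduce"; the framework sorts the mapper output by key between the two steps.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()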
Presto is an MPP (massively parallel processing) database running on top of HDFS
Presto ~ Vertica or Greenplum
Runs everything as prestouser (unlike Spark/Hive/PIG which run as individual userId) but then superimposes authentication/authorization roles on top
This implies that UDFs (user-defined functions) can't be added to Presto without vetting
So if you're comfortable with SQL (and its standard built-in functions) then there's no reason not to use Presto
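A minimal sketch of that workflow, assuming the presto-python-client package and an invented coordinator host, catalog, and table:

    import prestodb  # pip install presto-python-client

    # Work executes on the cluster as prestouser; authorization is layered on
    # top of the identity passed here (host/catalog/table are hypothetical).
    conn = prestodb.dbapi.connect(
        host="presto.example.com",
        port=8080,
        user="alice",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()
    # Standard SQL and built-in functions only -- no custom UDFs without vetting.
    cur.execute("SELECT country, count(*) FROM page_views GROUP BY country")
    for row in cur.fetchall():
        print(row)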
Spark is more efficient because it caches data in memory, as opposed to materializing intermediate results to disk as MR does
MR's designers knew that this "forced materialization" to disk was a performance problem, but they erred on the side of recoverability (persistence) if jobs failed
Another issue with MR was that it only allowed one reduce stage per map stage, so multi-stage pipelines had to be chained as separate jobs
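A small PySpark sketch of the difference (dataset path and columns invented): the input is cached in memory and reused across two aggregation stages, with no forced materialization to disk in between.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

    events = spark.read.parquet("/data/events")  # hypothetical dataset
    events.cache()                               # keep in memory for reuse

    # Two independent "reduce" stages over the same cached input -- something
    # classic MR would split into separate jobs with disk writes in between.
    by_user = events.groupBy("user_id").count()
    by_page = events.groupBy("page_id").agg(F.countDistinct("user_id").alias("users"))

    by_user.show()
    by_page.show()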
Simple (WIB) APIs also make it easier to evolve an API (fewer users using fewer interface components)
Spark == "Hadoop version 2"
Clean and elegant API
Single managed project (no PIG no HIVE), single repo
Dali project at LI (originally "Data Access at LI"; history of suffixing LI projects with "LI")
Attempt to combine best aspects of relational DBs with best of big-data architectures
Leverage abstractions from relational, e.g. tables/views
Decoupling of logical views from underlying data
Big data: more interfaces than just SQL, more file formats, more storage layers
Decouple implementations from APIs
Look at the queries people run against a dataset to inform format of stored data (FWC - not stored data, cached! data), etc.
Decouple schema/API of a dataset from its actual implementation (i.e. views on top of HDFS)
Hide details of HDFS/Hadoop/Spark
File formats, cluster where a query executes, how a dataset is partitioned
FWC - This is like my "SQL on top of simple directory of files with intermediate cache files" idea
Through a service maintain a mapping from a particular dataset name to location/file-format/exec-cluster
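A hypothetical sketch (not the actual Dali API) of that mapping: clients ask for a dataset by name, and a catalog service decides location, file format, and execution cluster.

    from dataclasses import dataclass

    @dataclass
    class PhysicalDataset:
        location: str     # HDFS path or blob-store URI
        file_format: str  # e.g. "avro" or "orc"
        cluster: str      # where queries against it should run

    # Invented entries for illustration; in practice this lives behind a service.
    CATALOG = {
        "tracking.page_views": PhysicalDataset(
            location="abfs://shared@account.dfs.core.windows.net/page_views",
            file_format="orc",
            cluster="prod-spark",
        ),
    }

    def resolve(name: str) -> PhysicalDataset:
        # Callers never hard-code paths, formats, or clusters.
        return CATALOG[name]

    print(resolve("tracking.page_views"))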
All public data has to be accessed through Dali APIs now as they migrate to Azure
Manage client side dependencies in more controlled fashion
For example, "physical"/hardware system-level dependencies go through a VPN (so they can see who is accessing data and how)
Make clients as thin as possible and push impl to services side
Provide dataset API as opposed to file system API, avoid monolithic write locks
Think in terms of datasets & partitions rather than directories & files (move the thinking up one level of abstraction)
Another problem with the filesystem-based approach is permissioning: with file/directory ACLs you can't grant access to a subset of columns (a record-oriented API enables column-level and row-level access control policies)
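A hypothetical sketch of why the record-oriented API helps: the read path can enforce a column-level policy by projecting away columns the caller's role isn't entitled to, which directory ACLs can't express (dataset, roles, and columns are invented).

    COLUMN_POLICY = {
        "members.profiles": {
            "analyst": {"member_id", "country", "industry"},
            "admin":   {"member_id", "country", "industry", "email"},
        },
    }

    def read_dataset(name, role, rows):
        allowed = COLUMN_POLICY[name][role]
        # Row-level filters could be applied here as well.
        return [{k: v for k, v in row.items() if k in allowed} for row in rows]

    rows = [{"member_id": 1, "country": "US", "industry": "software", "email": "a@b.c"}]
    print(read_dataset("members.profiles", "analyst", rows))  # email is stripped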
New: Azure Blob Storage (ABS) for shared data + Azure Data Lake Storage (ADLS) for user home dirs
Map from dataset names to ABS (FWC - similar to S3)
Monitoring latency from client side very important in cloud to prevent/identify abusive clients quickly (the client software notifies central command if a client is being abusive)
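A hypothetical sketch of that client-side reporting: every storage call is timed locally and the measurement is posted to an invented central monitoring endpoint.

    import json, time, urllib.request

    MONITOR_URL = "https://monitor.example.com/ingest"  # hypothetical endpoint

    def report(metric):
        req = urllib.request.Request(
            MONITOR_URL,
            data=json.dumps(metric).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def timed_read(read_fn, dataset):
        start = time.monotonic()
        try:
            return read_fn(dataset)
        finally:
            report({"dataset": dataset,
                    "latency_ms": (time.monotonic() - start) * 1000})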
Spark on YARN - so people don't spin up individual clusters => utilization rates of 90% (vs. 20-30% typical)
Every event (events from the LI website) gets sent to Kafka and from there ingested into HDFS
Thousands of Kafka event topics
Why save all old events?
Especially when materializing to other data formats
Using Apache Gobblin to convert to Avro files, with further plans to ingest to ORC (a columnar format, like Parquet)
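Gobblin does the real ingestion; this is only a PySpark sketch of the format-conversion step (invented paths, and it assumes the external spark-avro package is on the classpath): read Avro, rewrite as columnar ORC.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-to-orc-sketch").getOrCreate()

    raw = spark.read.format("avro").load("/data/kafka/page_view_event/hourly")
    raw.write.mode("overwrite").orc("/data/warehouse/page_view_event")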
Previous ingestion engine was implemented in Azkaban, but over time they couldn't find the source code for some workflows
So in some cases they re-linked JAR files (created a custom JAR re-linker called ByteRay) to migrate from Hadoop 1.0 to 2.0
Because nobody works at LI forever - they call it a "tour of duty"
In conclusion/summary/future
Hadoop provided a distributed filesystem when what was needed was a distributed dataset system
Migrating from distributed filesystem to distributed blob (object) store
Borrow ideas from conventional relational databases combined with Hadoop/Spark
HDFS - scale-limiting factor is the NameNode and managing that namespace
Low latency - read data as soon as it lands; isn't streaming really just the limit as latency -> 0?
Not having to wait for it to be in a special format that a DB requires