Fewer, larger Hadoop clusters are better than many smaller clusters
1 large prod cluster, 1 large dev cluster (2013)
Can't have data scientists learning the intricacies of Hadoop
"Worse is Better" (WIB) article/concept
"MIT School" - everything has to be perfect, acceptable to push complexity to implementation to maintain a perfect API (C/C++)
"NJ School" (Bell Labs) - willing to push responsibilities to user for sake of keeping impl simple (LISP)
Hadoop is an example of WIB - easy interface, i.e. Map-Reduce (MR), like REST (limited method set)
Original Hadoop required writing Map-Reduce jobs by hand, which was tricky (especially for data scientists) - see the sketch below
Solution: PIG - higher level interface to Hadoop at Yahoo
Solution: Hive - same at Facebook
Lineage of query languages - PIG -> HIVE -> Presto?
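To make the contrast concrete, here is a minimal sketch (not from the talk) of a hand-written word count in Hadoop Streaming style - a separate mapper and reducer reading stdin and writing stdout - which Pig/Hive/Presto collapse into a one-line query like SELECT word, count(*) FROM words GROUP BY word:

    # Hypothetical Hadoop Streaming word count: run once with "map" and once with
    # "reduce"; the framework sorts the mapper output by key between the two steps.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()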
Presto is an MPP (massively parallel processing) database running on top of HDFS
Presto ~ Vertica or Greenplum
Runs everything as prestouser (unlike Spark/Hive/PIG which run as individual userId) but then superimposes authentication/authorization roles on top
This implies that UDFs (user-defined functions) can't be added to Presto without vetting
So if you're comfortable with SQL (and its standard built-in functions) then there's no reason not to use Presto
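A minimal sketch of that workflow, assuming the presto-python-client package and an invented coordinator host, catalog, and table:

    import prestodb  # pip install presto-python-client

    # Work executes on the cluster as prestouser; authorization is layered on
    # top of the identity passed here (host/catalog/table are hypothetical).
    conn = prestodb.dbapi.connect(
        host="presto.example.com",
        port=8080,
        user="alice",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()
    # Standard SQL and built-in functions only -- no custom UDFs without vetting.
    cur.execute("SELECT country, count(*) FROM page_views GROUP BY country")
    for row in cur.fetchall():
        print(row)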
Spark is more efficient because it caches data in memory, as opposed to materializing intermediate results to disk as MR does
MR's designers knew that this "forced materialization" to disk was a performance problem, but they erred on the side of recoverability (persistence) if jobs failed
Another issue with MR was that it only allowed one reduce stage per map stage, so multi-stage pipelines had to be chained as separate jobs
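A small PySpark sketch of the difference (dataset path and columns invented): the input is cached in memory and reused across two aggregation stages, with no forced materialization to disk in between.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

    events = spark.read.parquet("/data/events")  # hypothetical dataset
    events.cache()                               # keep in memory for reuse

    # Two independent "reduce" stages over the same cached input -- something
    # classic MR would split into separate jobs with disk writes in between.
    by_user = events.groupBy("user_id").count()
    by_page = events.groupBy("page_id").agg(F.countDistinct("user_id").alias("users"))

    by_user.show()
    by_page.show()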
Simple (WIB) APIs also make it easier to evolve an API (fewer users using fewer interface components)
Spark == "Hadoop version 2"
Clean and elegant API
Single managed project (no PIG no HIVE), single repo
Dali project at LI (originally "Data Access at LI"; history of suffixing LI projects with "LI")
Attempt to combine best aspects of relational DBs with best of big-data architectures
Leverage abstractions from relational, e.g. tables/views
Decoupling of logical views from underlying data
Big data: more interfaces than just SQL, more file formats, more storage layers
Decouple implementations from APIs
Look at the queries people run against a dataset to inform format of stored data (FWC - not stored data, cached! data), etc.
Decouple schema/API of a dataset from its actual implementation (i.e. views on top of HDFS)
Hide details of HDFS/Hadoop/Spark
File formats, cluster where a query executes, how a dataset is partitioned
FWC - This is like my "SQL on top of simple directory of files with intermediate cache files" idea
Through a service maintain a mapping from a particular dataset name to location/file-format/exec-cluster
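A hypothetical sketch (not the actual Dali API) of that mapping: clients ask for a dataset by name, and a catalog service decides location, file format, and execution cluster.

    from dataclasses import dataclass

    @dataclass
    class PhysicalDataset:
        location: str     # HDFS path or blob-store URI
        file_format: str  # e.g. "avro" or "orc"
        cluster: str      # where queries against it should run

    # Invented entries for illustration; in practice this lives behind a service.
    CATALOG = {
        "tracking.page_views": PhysicalDataset(
            location="abfs://shared@account.dfs.core.windows.net/page_views",
            file_format="orc",
            cluster="prod-spark",
        ),
    }

    def resolve(name: str) -> PhysicalDataset:
        # Callers never hard-code paths, formats, or clusters.
        return CATALOG[name]

    print(resolve("tracking.page_views"))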
All public data has to be accessed through Dali APIs now as they migrate to Azure
Manage client side dependencies in more controlled fashion
For example, "physical"/hardware system-level dependencies go through a VPN (so they can see who is accessing data and how)
Make clients as thin as possible and push impl to services side
Provide dataset API as opposed to file system API, avoid monolithic write locks
Think in terms of datasets & partitions rather than directories & files (move the thinking up one level of abstraction)
Another problem with the filesystem-based approach is permissioning: with file/directory ACLs you can't grant access to a subset of columns (a record-oriented API enables column-level and row-level access control policies)
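A hypothetical sketch of why the record-oriented API helps: the read path can enforce a column-level policy by projecting away columns the caller's role isn't entitled to, which directory ACLs can't express (dataset, roles, and columns are invented).

    COLUMN_POLICY = {
        "members.profiles": {
            "analyst": {"member_id", "country", "industry"},
            "admin":   {"member_id", "country", "industry", "email"},
        },
    }

    def read_dataset(name, role, rows):
        allowed = COLUMN_POLICY[name][role]
        # Row-level filters could be applied here as well.
        return [{k: v for k, v in row.items() if k in allowed} for row in rows]

    rows = [{"member_id": 1, "country": "US", "industry": "software", "email": "a@b.c"}]
    print(read_dataset("members.profiles", "analyst", rows))  # email is stripped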
New: Azure Blob Storage (ABS) for shared data + Azure Data Lake Storage (ADLS) for user home dirs
Map from dataset names to ABS (FWC - similar to S3)
Monitoring latency from client side very important in cloud to prevent/identify abusive clients quickly (the client software notifies central command if a client is being abusive)
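A hypothetical sketch of that client-side reporting: every storage call is timed locally and the measurement is posted to an invented central monitoring endpoint.

    import json, time, urllib.request

    MONITOR_URL = "https://monitor.example.com/ingest"  # hypothetical endpoint

    def report(metric):
        req = urllib.request.Request(
            MONITOR_URL,
            data=json.dumps(metric).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def timed_read(read_fn, dataset):
        start = time.monotonic()
        try:
            return read_fn(dataset)
        finally:
            report({"dataset": dataset,
                    "latency_ms": (time.monotonic() - start) * 1000})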
Spark on YARN - so people don't spin up individual clusters => utilization rates of 90% (vs. 20-30% typical)
Every event (events from the LI website) gets sent to Kafka and from there ingested into HDFS
Thousands of Kafka event topics
Why save all old events?
Especially when materializing to other data formats
Using Apache Gobblin to convert to Avro files, with further plans to ingest to ORC (a columnar format, like Parquet)
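Gobblin does the real ingestion; this is only a PySpark sketch of the format-conversion step (invented paths, and it assumes the external spark-avro package is on the classpath): read Avro, rewrite as columnar ORC.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-to-orc-sketch").getOrCreate()

    raw = spark.read.format("avro").load("/data/kafka/page_view_event/hourly")
    raw.write.mode("overwrite").orc("/data/warehouse/page_view_event")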
Previous ingestion engine was implemented in Azkaban, but over time they couldn't find the source code for some workflows
So in some cases they re-linked JAR files (created a custom JAR re-linker called ByteRay) to migrate from Hadoop 1.0 to 2.0
Because nobody works at LI forever - they call it a "tour of duty"
In conclusion/summary/future
Hadoop provided a distributed filesystem when what was needed was a distributed dataset system
Migrating from distributed filesystem to distributed blob (object) store
Borrow ideas from conventional relational databases combined with Hadoop/Spark
HDFS - scale-limiting factor is the NameNode and managing that namespace
Low latency - read data as soon as it lands; isn't streaming really just the limit as latency -> 0?
Not having to wait for it to be in a special format that a DB requires