2020.12.28 Notes on LinkedIn Data Platform with Carl Steinbach

https://softwareengineeringdaily.com/2019/10/23/linkedin-data-platform-with-carl-steinbach/

  • 3 types of systems at LinkedIn (LI)
    • online serving systems (servers)
    • nearline streaming systems (real-time)
    • offline batch analytics systems (Hadoop in '09)
  • Fewer, larger Hadoop clusters are better than many smaller ones
  • 1 large prod cluster, 1 large dev cluster (2013)
  • Can't have data scientists learning the intricacies of Hadoop
  • "Worse is Better" (WIB) article/concept
    • "MIT School" - everything has to be perfect, acceptable to push complexity to implementation to maintain a perfect API (C/C++)
    • "NJ School" (Bell Labs) - willing to push responsibilities to user for sake of keeping impl simple (LISP)
  • Hadoop is an example of WIB - a deliberately simple interface, i.e. Map-Reduce (MR), like REST's limited method set
  • But original Hadoop required writing raw Map-Reduce jobs, which was tricky (especially for data scientists)
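
A sense of how much ceremony even a trivial raw MR job involved: below is a minimal word count written against the Hadoop Streaming contract (tab-separated key/value lines on stdin/stdout, with the reducer's input sorted by key). The script names and launch style are the standard Streaming conventions, not anything LI-specific.

```python
#!/usr/bin/env python
# mapper.py - emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py - sum counts; Hadoop Streaming delivers mapper output sorted by
# key, so all lines for a given word arrive contiguously
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

All of that is just `SELECT word, count(*) ... GROUP BY word` - which is exactly the gap the next two bullets (Pig, Hive) were built to close.
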
  • Solution: Pig - a higher-level interface to Hadoop, developed at Yahoo
  • Solution: Hive - the same idea (a SQL-like layer), developed at Facebook
  • Lineage of query languages: Pig -> Hive -> Presto?
  • Presto is an MPP (massively parallel processing) SQL query engine running on top of HDFS
    • Presto ~ Vertica or Greenplum
    • Runs everything as a single prestouser (unlike Spark/Hive/Pig jobs, which run as the individual user's ID) but superimposes its own authentication/authorization roles on top
    • This implies that UDFs (user-defined functions) can't be added without vetting in Presto
    • So if you're comfortable with SQL (and its standard built-in functions) then there's no reason not to use Presto
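
A minimal sketch of what "just use SQL" looks like from a Python client, using PyHive's Presto DB-API module. The coordinator host, catalog/schema, and table name are hypothetical placeholders, not real LI endpoints.

```python
from pyhive import presto  # pip install 'pyhive[presto]'

# Hypothetical coordinator and table; real values depend on the deployment.
conn = presto.connect(host="presto-coordinator.example.com", port=8080,
                      username="fcrimins", catalog="hive", schema="tracking")
cur = conn.cursor()

# Only the built-in SQL functions are available; a custom UDF would need
# vetting before it could be installed cluster-wide (per the note above).
cur.execute("""
    SELECT page_key, count(*) AS views
    FROM page_view_events
    WHERE event_date = date '2020-12-01'
    GROUP BY page_key
    ORDER BY views DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```
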
  • Spark is more efficient: it caches intermediate data in memory, whereas MR materializes it to disk between stages
    • MR's designers knew this "forced materialization" to disk was a performance problem, but they erred on the side of recoverability (persistence) if jobs failed
    • Another issue with MR was that it allowed only one map stage and one reduce stage per job, so multi-stage pipelines had to be chained as separate jobs
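
A small PySpark sketch of both points: a filtered intermediate result is cached in memory and then reused by two aggregation ("reduce"-like) stages within a single application, rather than one map+reduce per job with a forced write to HDFS in between. The dataset path and column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

# Hypothetical tracking dataset.
events = spark.read.parquet("hdfs:///data/tracking/page_view_events")

# Keep the filtered/projected intermediate in memory; MR would have had to
# materialize it to disk before the next map+reduce pair could start.
recent = (events
          .filter(F.col("event_date") >= "2020-12-01")
          .select("member_id", "page_key")
          .cache())

# Two aggregations ("reduce" stages) reuse the same cached intermediate.
recent.groupBy("page_key").count().show(10)
recent.groupBy("member_id").agg(F.countDistinct("page_key")).show(10)
```
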
  • Simple (WIB) APIs also make it easier to evolve an API (fewer users using fewer interface components)
  • Spark == "Hadoop version 2"
    • Clean and elegant API
    • Single managed project (no PIG no HIVE), single repo
  • Dali project at LI (originally "Data Access at LI"; history of suffixing LI projects with "LI")
    • Attempt to combine best aspects of relational DBs with best of big-data architectures
    • Leverage abstractions from relational, e.g. tables/views
    • Decoupling of logical views from underlying data
    • From big data: more interfaces than just SQL, more file formats, more storage layers
    • Decouple implementations from APIs
    • Look at the queries people run against a dataset to inform format of stored data (FWC - not stored data, cached! data), etc.
    • Decouple schema/API of a dataset from its actual implementation (i.e. views on top of HDFS)
    • Hide details of HDFS/Hadoop/Spark
      • File formats, cluster where a query executes, how a dataset is partitioned
      • FWC - This is like my "SQL on top of simple directory of files with intermediate cache files" idea
    • A service maintains the mapping from a particular dataset name to its location / file format / execution cluster (see the sketch after this list)
    • All public data now has to be accessed through Dali APIs as they migrate to Azure
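
Not the actual Dali API - just a toy sketch of the indirection it provides: clients ask for a dataset by name, and a catalog (in reality a service) decides where it lives, in what format, and on which cluster, so those details can change without touching client code. Every name, URI, and field below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DatasetLocation:
    """Physical details a client should never have to hard-code."""
    uri: str          # e.g. an HDFS path or an Azure storage URI
    file_format: str  # e.g. "avro" or "orc"
    cluster: str      # which cluster queries against it should execute on

# Hypothetical catalog contents; in reality this mapping lives behind a service.
_CATALOG = {
    "tracking.page_view_events": DatasetLocation(
        uri="abfs://tracking@lidata.dfs.core.windows.net/page_view_events",
        file_format="orc",
        cluster="prod-batch",
    ),
}

def resolve(dataset_name: str) -> DatasetLocation:
    """Look up a dataset by name; moving or re-formatting the underlying data
    only requires updating the catalog entry, not every consumer."""
    return _CATALOG[dataset_name]

print(resolve("tracking.page_view_events").uri)
```
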
  • Manage client-side dependencies in a more controlled fashion
    • For example, "physical"/hardware system-level dependencies are implemented as a VPN (so they can see who is accessing data and how)
    • Make clients as thin as possible and push implementation to the service side
    • Provide a dataset API as opposed to a filesystem API; avoid monolithic write locks
    • Think in terms of datasets & partitions rather than directories & files (move the thinking up one level of abstraction)
    • Another problem with the filesystem-based approach is permissioning: with ACLs you can't grant access to a subset of columns, whereas a record-oriented API enables column-level and row-level access control policies
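
A hypothetical sketch (none of these names come from the podcast) of why a record-oriented dataset API can enforce finer-grained policy than filesystem ACLs: the read path applies column- and row-level rules per record, whereas a directory ACL is all-or-nothing per file.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Record = dict  # a record-oriented view of the data, not directories and files

@dataclass
class AccessPolicy:
    allowed_columns: set                              # columns this principal may see
    row_filter: Callable[[Record], bool] = lambda r: True

# Hypothetical policy: this principal sees only two columns, and only EU rows.
POLICIES = {
    ("alice", "tracking.page_view_events"): AccessPolicy(
        allowed_columns={"page_key", "event_date"},
        row_filter=lambda r: r.get("region") == "EU",
    ),
}

def read_dataset(principal: str, dataset: str,
                 rows: Iterable[Record]) -> Iterable[Record]:
    """Dataset-level read: policy is enforced per record and per column,
    something an all-or-nothing file ACL cannot express."""
    policy = POLICIES[(principal, dataset)]
    for row in rows:
        if policy.row_filter(row):
            yield {k: v for k, v in row.items() if k in policy.allowed_columns}

sample = [{"member_id": 1, "page_key": "feed",
           "event_date": "2020-12-01", "region": "EU"}]
print(list(read_dataset("alice", "tracking.page_view_events", sample)))
```
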
  • New: Azure Blob Storage (shared data, ABS) + Azure Data Lake Storage (user home dirs, ADLS)
    • Map from dataset names to ABS (FWC - similar to S3)
    • Monitoring latency from the client side is very important in the cloud, to prevent/identify abusive clients quickly (the client software notifies central command if a client is being abusive)
    • Spark on YARN - so people don't spin up individual clusters => utilization rates of 90% (vs. 20-30% typical)
    • Every event (events from the LI website) gets sent to Kafka and then from there ingested to HDFS
    • Thousands of Kafka event topics
    • Why save all old events?
      • Especially when materializing to other data formats
      • Using Apache Gobblin to convert them to Avro files, with further plans to convert to ORC (a columnar format, like Parquet) - a toy sketch of this ingestion shape follows this list
    • The previous ingestion engine was implemented as Azkaban workflows, but over time the source code for some workflows couldn't be found
      • So in some cases they re-linked JAR files (created a custom JAR re-linker called ByteRay) to migrate from Hadoop 1.0 to 2.0
    • Because nobody works at LI forever - they call it a "tour of duty"
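
Not Gobblin itself - just a toy sketch of the shape of that Kafka -> Avro hop, using the kafka-python and fastavro libraries. The topic name, event schema, broker address, and output path are hypothetical, and the events are assumed to be JSON-encoded purely to keep the sketch short.

```python
import json
from kafka import KafkaConsumer             # pip install kafka-python
from fastavro import parse_schema, writer   # pip install fastavro

# Hypothetical schema for a page-view tracking event.
schema = parse_schema({
    "type": "record", "name": "PageViewEvent",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "page_key", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
})

consumer = KafkaConsumer(
    "PageViewEvent",                              # hypothetical topic name
    bootstrap_servers=["kafka.example.com:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,                   # stop once the topic is drained
)

# Batch the events and land them as a single Avro file; a real ingestion job
# would write to HDFS/ADLS and roll files by size and time.
batch = [msg.value for msg in consumer]
with open("page_view_events.avro", "wb") as out:
    writer(out, schema, batch)
```
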
  • In conclusion/summary/future
    • Hadoop provided a distributed filesystem when what was needed was a distributed dataset system
    • Migrating from distributed filesystem to distributed blob (object) store
    • Borrow ideas from conventional relational databases combined with Hadoop/Spark
    • HDFS - the scale-limiting factor is the NameNode and managing that namespace
    • Low latency - read data as soon as it lands; isn't streaming really just the limit as latency -> 0?
      • Not having to wait for it to be in a special format that a DB requires