Data Management - WEHI-RCPStudentInternship/data-commons GitHub Wiki
This page shows how data could be managed in the Data Commons, using the Data Lakehouse concept as a reference model.
- The two major proponents championing the data lakehouse idea are Databricks (originator of the Delta Lake concept) and AWS.
- In general, a data lakehouse system consists of five layers.
- Ingestion layer: The first layer in the system pulls data from a variety of sources and delivers it to the storage layer. Unifying batch and streaming data processing, this layer may use different protocols to connect to a range of internal and external sources.
- Storage layer: The lakehouse design keeps all kinds of data in low-cost object stores. Client tools can then read these objects directly from the store using open file formats, so multiple APIs and consumption-layer components can access and use the same data. The schemas of structured and semi-structured datasets are kept in the metadata layer, and components apply them to the data while reading it.
- Metadata layer: The metadata layer is the foundation of a data lakehouse and what sets this architecture apart. It is a unified catalog that provides metadata (data describing other data) for all objects in the lake storage and gives users the ability to implement management features.
- API layer: This layer hosts various APIs that let end users process tasks faster and run more advanced analytics. Metadata APIs help identify which data items a particular application needs and how to retrieve them.
- Consumption layer: The consumption layer hosts client tools and applications such as Power BI and Tableau.
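The metadata layer described above can be sketched as a minimal catalog: a mapping from dataset names to storage locations, file formats, and schemas that any consumer consults before reading raw objects from the store. The dataset name, path, and fields below are purely illustrative, not any real product's catalog format.

```python
import json

# Toy metadata catalog: each entry records where a dataset lives in the
# object store, its open file format, and the schema to apply when reading it
# (all names here are hypothetical)
CATALOG = {
    "patients": {
        "path": "s3://lake/patients/",
        "format": "parquet",
        "schema": {"patient_id": "string", "age": "int"},
    }
}

def describe(dataset: str) -> str:
    # Serialize one catalog entry, e.g. for a metadata API response
    return json.dumps(CATALOG[dataset], sort_keys=True)

def schema_of(dataset: str) -> dict:
    # Consumers apply this schema to the raw objects while reading them
    return CATALOG[dataset]["schema"]
```

The point of the sketch is that the catalog, not the object store itself, is what lets many different consumption-layer tools agree on how to interpret the same files.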
When to Consider a Data Lakehouse
- You already use a data lake and want to complement it with SQL performance capabilities while saving money on building and maintaining the two-tier architecture with warehouses.
- You want to get rid of data redundancy and inconsistency due to the use of multiple systems.
- Your company needs versatile data management and analytics, covering use cases from BI to AI.
- You want to improve data security, reliability, and compliance while still keeping big data in low-cost lake storage.
A relatively simple and fast way to implement the lakehouse architecture is to choose an out-of-the-box product offered by industry vendors. Such products typically come with an open data-sharing protocol, open APIs, and many native connectors to different databases, applications, and tools, including Apache Spark, Hive, Athena, Snowflake, Redshift, and Kafka. That makes it much easier for data engineering teams to build and manage data pipelines.
- Databricks with their Delta Lake platform
- AWS S3
- Azure Data Lake Storage
- Google Storage
Amazon S3
- Short introduction video: https://www.youtube.com/watch?v=e6w9LwZJFIA
- Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.
- How to use: To store data in Amazon S3, you create a bucket and can then upload any number of objects into it. Amazon S3 is key-value object storage: each object in a bucket is identified by a unique key. Bucket names are globally unique; since all AWS accounts share the same namespace, no two buckets can have the same name.
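The bucket-and-key model above can be sketched with boto3, the AWS SDK for Python. The bucket name, key, and region below are hypothetical placeholders; the call may fail if the globally unique bucket name is already taken by another account.

```python
def object_key(prefix: str, filename: str) -> str:
    # S3 has no real directories; "/" in a key only creates a prefix hierarchy
    return f"{prefix}/{filename}"

def create_bucket_and_upload(bucket: str, key: str, body: bytes,
                             region: str = "ap-southeast-2") -> None:
    import boto3  # AWS SDK for Python (pip install boto3)
    s3 = boto3.client("s3", region_name=region)
    # Bucket names share one global namespace, so creation can fail
    # if another account already owns this name
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    s3.put_object(Bucket=bucket, Key=key, Body=body)
```

For example, `create_bucket_and_upload("my-data-commons-lake", object_key("genomics", "sample1.bam"), data)` stores the object under the key `genomics/sample1.bam`.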
Features
- Storage management and monitoring
- Storage analytics and insights
- Storage classes
- Access management and security
- Data processing
With S3 Object Lambda you can add your own code to S3 GET, HEAD, and LIST requests to modify and process data as it is returned to an application. You can use custom code to modify the data returned by standard S3 GET requests to filter rows, dynamically resize images, redact confidential data, and much more. You can also use S3 Object Lambda to modify the output of S3 LIST requests to create a custom view of objects in a bucket and S3 HEAD requests to modify object metadata like object name and size.
- Query in place
- Data transfer
- Data exchange
- Performance
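The "Data processing" feature above (S3 Object Lambda) can be sketched as a Lambda handler that rewrites an object on the fly during a GET request. The email-redaction rule is a toy assumption; the event fields (`getObjectContext`, `inputS3Url`, `outputRoute`, `outputToken`) follow the documented Object Lambda event shape, and the response is returned with the S3 `WriteGetObjectResponse` API.

```python
import re
import urllib.request

def redact_emails(text: str) -> str:
    # Toy transformation: mask anything that looks like an email address
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)

def handler(event, context):
    # S3 Object Lambda hands the function a presigned URL for the original object
    ctx = event["getObjectContext"]
    original = urllib.request.urlopen(ctx["inputS3Url"]).read().decode("utf-8")

    import boto3  # AWS SDK for Python (pip install boto3)
    boto3.client("s3").write_get_object_response(
        Body=redact_emails(original),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
```

The caller still issues an ordinary S3 GET against an Object Lambda Access Point; the redaction happens transparently before the data reaches the application.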
Metadata
Metadata is stored as key-value pairs and can be used to store any type of information that describes the object. Some examples of metadata include the object’s creation date, size, content type, and custom user-defined metadata.
AWS S3 Metadata
There are two categories of system metadata:
- System controlled: Metadata such as the object-creation date is system controlled, meaning that only Amazon S3 can modify the value.
- User controlled: Other system metadata, such as the storage class configured for the object and whether server-side encryption is enabled, have values that you control.
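Both kinds of metadata show up in a single `HeadObject` response: system-controlled fields like `LastModified` sit alongside the user-defined `Metadata` dict. A minimal sketch with boto3 (bucket, key, and metadata values below are hypothetical):

```python
def normalize_user_metadata(meta: dict) -> dict:
    # S3 stores user-defined metadata as lowercase string key-value pairs
    # (sent as x-amz-meta-* headers; boto3 strips the prefix for you)
    return {k.lower(): str(v) for k, v in meta.items()}

def upload_and_inspect(bucket: str, key: str, body: bytes, meta: dict) -> dict:
    import boto3  # AWS SDK for Python (pip install boto3)
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=body,
                  Metadata=normalize_user_metadata(meta))
    head = s3.head_object(Bucket=bucket, Key=key)
    # LastModified and ContentLength are system controlled;
    # Metadata echoes back the user-defined pairs
    return {"last_modified": head["LastModified"],
            "size": head["ContentLength"],
            "user": head["Metadata"]}
```

Note that user-defined metadata is immutable once the object is written; changing it requires copying the object over itself with new metadata.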
Data Lakes on AWS S3
- Amazon S3 is a strong foundation for data lakes because of its durability, availability, scalability, security, compliance, and audit capabilities.
Features
- Seamlessly integrate and move data: You can import any amount of data, in real-time or batch, with AWS Glue. Data can be collected from multiple sources and moved into the data lake in its original format. AWS analytics services can also be used to query your data lake directly. Having data integration, discovery, preparation, and transformation tools like AWS Glue allows you to scale while saving time defining data structures, schema, and transformations. AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
- Easily enable purpose-built analytics: It’s easy for diverse users across your organization, like data scientists, data developers, and business analysts, to access data with their choice of purpose-built AWS analytics tools and frameworks. You can easily and quickly run analytics without the need to move your data to a separate analytics system.
- Discover, catalog, and secure data: AWS Glue provides a streamlined and centralized data catalog so you can better understand the data in your data lake. AWS Lake Formation lets you centralize data governance and security so you can deploy data with confidence.
AWS Lake Formation easily creates secure data lakes, making data available for wide-ranging analytics.
- Quickly deploy machine learning: Data lakes on AWS allow you to innovate faster with the most comprehensive set of AI and ML services.
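The Glue discovery-and-catalog workflow above can be sketched as a crawler definition: point a crawler at an S3 prefix and let it infer table schemas into a Data Catalog database. The crawler name, IAM role ARN, path, and database name below are hypothetical placeholders.

```python
def crawler_definition(name: str, role_arn: str,
                       s3_path: str, database: str) -> dict:
    # Minimal Glue crawler: scan one S3 prefix and write the inferred
    # table schemas into a Data Catalog database
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run(definition: dict) -> None:
    import boto3  # AWS SDK for Python (pip install boto3)
    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])
```

Once the crawler finishes, analytics services such as Athena can query the cataloged tables in place, without moving the data out of the lake.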
Building Data Lakes on AWS
Amazon Omics
- AWS HealthOmics helps healthcare and life science organizations store, query, and analyze genomic, transcriptomic, and other omics data at scale. By removing the undifferentiated heavy lifting, it lets you generate deeper insights from omics data to improve health and advance scientific discoveries.
- Introduction Article: https://aws.amazon.com/blogs/industries/part-1-introducing-amazon-omics-from-sequence-data-to-insights-securely-and-at-scale/
Features
- Purpose-built storage
AWS HealthOmics storage is compatible with bioinformatics file formats such as FASTQ, BAM, and CRAM and allows you to store, discover, and share this data efficiently and at low cost. These file formats are stored as read-set objects within a sequence store. You can also store reference genomes in the FASTA format. Data is imported as immutable objects with unique identifiers to support workloads that require strict data provenance. Access to individual data objects, including references and read-set objects, can be controlled using tags and attribute-based access controls through AWS Identity and Access Management (IAM). To reduce long-term storage costs, data objects that have not been accessed within 30 days are automatically moved to an archive storage class. Archived objects can be reactivated at any time with an API call.
- Bioinformatics workflows
AWS HealthOmics helps you run bioinformatics workflows at scale. You can choose Ready2Run workflows or bring-your-own private workflows to process your biological data without the need to manage the underlying infrastructure.
- Analysis at scale
- Data collaboration and provenance
AWS HealthOmics makes it easier for researchers to add collaborators, set up their permissions, and share data securely with them. This simplifies making your omics data findable, accessible, interoperable, and reusable (FAIR). With domain-specific metadata, you can link AWS HealthOmics data stores with other omics and healthcare data to facilitate multiomic and multimodal analysis.
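The sequence-store workflow described under "Purpose-built storage" can be sketched with the boto3 `omics` client: create a sequence store, then import paired-end FASTQ files as read sets. The store name, role ARN, S3 paths, and subject/sample IDs below are hypothetical; FASTQ imports do not require a reference genome, whereas BAM/CRAM imports do.

```python
def fastq_source(r1: str, r2: str, subject_id: str, sample_id: str) -> dict:
    # One paired-end read set to import (FASTQ needs no referenceArn)
    return {
        "sourceFileType": "FASTQ",
        "sourceFiles": {"source1": r1, "source2": r2},
        "subjectId": subject_id,
        "sampleId": sample_id,
    }

def import_read_sets(store_name: str, role_arn: str, sources: list) -> str:
    import boto3  # AWS SDK for Python (pip install boto3)
    omics = boto3.client("omics")
    store = omics.create_sequence_store(name=store_name)
    # The service reads the FASTQ files from S3 using the given IAM role
    # and stores them as immutable read-set objects with unique identifiers
    job = omics.start_read_set_import_job(
        sequenceStoreId=store["id"], roleArn=role_arn, sources=sources)
    return job["id"]
```

After 30 days without access, the resulting read sets move automatically to the archive storage class, and a single API call reactivates them when needed.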