# Structural Best Practices

**Best practice:** Store external data in the raw folder as immutable, timestamped, partitioned files in formats such as Parquet or CSV, with clear naming conventions and minimal transformation.

Here’s a breakdown of best practices for storing external data in the raw folder within Microsoft Azure, particularly in Azure Data Lake Storage or Blob Storage:


## 🧱 Structural Best Practices

- **Immutable Storage:** Treat the raw zone as append-only; never overwrite or delete files. This preserves traceability and supports reprocessing.
- **Timestamped File Naming:** Use filenames that include the ingestion timestamp (e.g., `sourceA_20251105_1145.csv`) to track data lineage and ingestion time.
- **Partitioning by Date/Source:**
  - Organize folders by source and date: `raw/sourceA/year=2025/month=11/day=05/`
  - This improves query performance and simplifies downstream processing (see the path-building sketch after this list).
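
As an illustration of the naming and partitioning conventions above, here is a minimal Python sketch; the `raw_path` helper is hypothetical and exists only to show the layout:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def raw_path(source: str, ingested_at: datetime, extension: str = "parquet") -> str:
    """Build an immutable, timestamped raw-zone path (hypothetical helper)."""
    ts = ingested_at.astimezone(timezone.utc)
    folder = PurePosixPath(
        "raw", source, f"year={ts:%Y}", f"month={ts:%m}", f"day={ts:%d}"
    )
    filename = f"{source}_{ts:%Y%m%d_%H%M}.{extension}"
    return str(folder / filename)

print(raw_path("sourceA", datetime(2025, 11, 5, 11, 45, tzinfo=timezone.utc)))
# raw/sourceA/year=2025/month=11/day=05/sourceA_20251105_1145.parquet
```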

## 📁 Format and Compression

- **Preferred Formats:**
  - **Parquet:** Efficient for analytics; supports schema evolution.
  - **CSV or JSON:** Acceptable for semi-structured or loosely typed data.
- **Compression:** Use gzip or Snappy to reduce storage costs and improve I/O performance (see the sketch after this list).
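
For example, pandas (with pyarrow installed) can write both formats with compression; a minimal sketch:

```python
import pandas as pd  # assumes pandas and pyarrow are installed

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Parquet with Snappy compression (columnar, analytics-friendly)
df.to_parquet("sourceA_20251105_1145.parquet", compression="snappy")

# CSV with gzip compression for loosely typed data
df.to_csv("sourceA_20251105_1145.csv.gz", index=False, compression="gzip")
```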

## 🔐 Security and Access Control

- **Access Policies:**
  - Apply least-privilege access using Azure RBAC or POSIX-style ACLs (see the sketch after this list).
  - Separate raw-data access from the curated zones to prevent accidental modification.
- **Data Classification:** Tag sensitive data early using metadata or folder-level classification.
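
A minimal sketch of setting a directory ACL with the `azure-storage-file-datalake` SDK; the account URL, container, and directory names are placeholders:

```python
# pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL; substitute your storage account.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Restrict the raw source folder: owner full access,
# owning group read/execute, everyone else no access.
directory = service.get_file_system_client("raw").get_directory_client("sourceA")
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```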

## 🔄 Ingestion and Metadata

- **Automated Ingestion Pipelines:** Use Azure Data Factory, Synapse Pipelines, or Event Grid for scheduled or event-driven ingestion.
- **Metadata Capture:** Store schema, source system, and ingestion metadata alongside the data (e.g., in a `_metadata.json` sidecar file or in Microsoft Purview); a sidecar sketch follows this list.
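
One possible shape for the `_metadata.json` sidecar; the field names here are illustrative, not a fixed standard:

```python
import json
from datetime import datetime, timezone

# Illustrative sidecar contents; adapt the fields to your ingestion contract.
metadata = {
    "source_system": "sourceA",
    "file": "sourceA_20251105_1145.parquet",
    "ingested_at": datetime.now(timezone.utc).isoformat(),
    "schema": {"id": "int64", "value": "string"},
    "row_count": 2,
}

with open("_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```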

## 🧪 Validation and Logging

- **Audit Logging:** Log ingestion events, file sizes, and checksums for traceability.
- **Validation Checks:** Perform basic schema and format validation before promoting data to curated zones (see the sketch after this list).
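
A minimal sketch combining both ideas; `EXPECTED_COLUMNS` stands in for whatever schema contract you agree with the source system:

```python
import hashlib
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion-audit")

EXPECTED_COLUMNS = {"id", "value"}  # assumed contract for sourceA

def validate_and_log(path: Path) -> bool:
    """Log size and checksum for the audit trail, then check the expected schema."""
    data = path.read_bytes()
    log.info(
        "ingested %s size=%d sha256=%s",
        path.name, len(data), hashlib.sha256(data).hexdigest(),
    )
    df = pd.read_parquet(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        log.error("%s failed validation; missing columns: %s", path.name, missing)
        return False
    return True
```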

## 🧭 Example Folder Structure

```
/raw/
  └── sourceA/
      └── year=2025/
          └── month=11/
              └── day=05/
                  ├── sourceA_20251105_1145.parquet
                  └── _metadata.json
```

For more detailed guidance, see Microsoft's official [Data Lake Storage best practices](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices).