# Structural Best Practices

**Best practice:** Store external data in the raw folder as immutable, timestamped, partitioned files in formats such as Parquet or CSV, with clear naming conventions and minimal transformation.

Here’s a breakdown of best practices for storing external data in the raw folder within Microsoft Azure, particularly in Azure Data Lake Storage or Blob Storage:


## 🧱 Structural Best Practices

- **Immutable Storage:** Treat the raw zone as append-only; never overwrite or delete files. This preserves traceability and supports reprocessing.
- **Timestamped File Naming:** Use filenames that include the ingestion timestamp (e.g., `sourceA_20251105_1145.csv`) to track data lineage and ingestion time.
- **Partitioning by Date/Source:**
  - Organize folders by source and date: `raw/sourceA/year=2025/month=11/day=05/`
  - This improves query performance and simplifies downstream processing (see the path-building sketch after this list).
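
As an illustration of the naming and partitioning conventions above, here is a minimal Python sketch; the `raw_path` helper is hypothetical and exists only to show the layout:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def raw_path(source: str, ingested_at: datetime, extension: str = "parquet") -> str:
    """Build an immutable, timestamped raw-zone path (hypothetical helper)."""
    ts = ingested_at.astimezone(timezone.utc)
    folder = PurePosixPath(
        "raw", source, f"year={ts:%Y}", f"month={ts:%m}", f"day={ts:%d}"
    )
    filename = f"{source}_{ts:%Y%m%d_%H%M}.{extension}"
    return str(folder / filename)

print(raw_path("sourceA", datetime(2025, 11, 5, 11, 45, tzinfo=timezone.utc)))
# raw/sourceA/year=2025/month=11/day=05/sourceA_20251105_1145.parquet
```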

## 📁 Format and Compression

- **Preferred Formats:**
  - **Parquet:** Efficient for analytics; supports schema evolution.
  - **CSV or JSON:** Acceptable for semi-structured or loosely typed data.
- **Compression:** Use gzip or Snappy to reduce storage costs and improve I/O performance (see the sketch after this list).
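
For example, pandas (with pyarrow installed) can write both formats with compression; a minimal sketch:

```python
import pandas as pd  # assumes pandas and pyarrow are installed

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Parquet with Snappy compression (columnar, analytics-friendly)
df.to_parquet("sourceA_20251105_1145.parquet", compression="snappy")

# CSV with gzip compression for loosely typed data
df.to_csv("sourceA_20251105_1145.csv.gz", index=False, compression="gzip")
```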

## 🔐 Security and Access Control

- **Access Policies:**
  - Apply least-privilege access using Azure RBAC or POSIX-style ACLs (see the sketch after this list).
  - Separate raw-data access from the curated zones to prevent accidental modification.
- **Data Classification:** Tag sensitive data early using metadata or folder-level classification.
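
A minimal sketch of setting a directory ACL with the `azure-storage-file-datalake` SDK; the account URL, container, and directory names are placeholders:

```python
# pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL; substitute your storage account.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Restrict the raw source folder: owner full access,
# owning group read/execute, everyone else no access.
directory = service.get_file_system_client("raw").get_directory_client("sourceA")
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```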

## 🔄 Ingestion and Metadata

- **Automated Ingestion Pipelines:** Use Azure Data Factory, Synapse Pipelines, or Event Grid for scheduled or event-driven ingestion.
- **Metadata Capture:** Store schema, source system, and ingestion metadata alongside the data (e.g., in a `_metadata.json` sidecar file or in Microsoft Purview); a sidecar sketch follows this list.
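
One possible shape for the `_metadata.json` sidecar; the field names here are illustrative, not a fixed standard:

```python
import json
from datetime import datetime, timezone

# Illustrative sidecar contents; adapt the fields to your ingestion contract.
metadata = {
    "source_system": "sourceA",
    "file": "sourceA_20251105_1145.parquet",
    "ingested_at": datetime.now(timezone.utc).isoformat(),
    "schema": {"id": "int64", "value": "string"},
    "row_count": 2,
}

with open("_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```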

## 🧪 Validation and Logging

- **Audit Logging:** Log ingestion events, file sizes, and checksums for traceability.
- **Validation Checks:** Perform basic schema and format validation before promoting data to curated zones (see the sketch after this list).
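
A minimal sketch combining both ideas; `EXPECTED_COLUMNS` stands in for whatever schema contract you agree with the source system:

```python
import hashlib
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion-audit")

EXPECTED_COLUMNS = {"id", "value"}  # assumed contract for sourceA

def validate_and_log(path: Path) -> bool:
    """Log size and checksum for the audit trail, then check the expected schema."""
    data = path.read_bytes()
    log.info(
        "ingested %s size=%d sha256=%s",
        path.name, len(data), hashlib.sha256(data).hexdigest(),
    )
    df = pd.read_parquet(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        log.error("%s failed validation; missing columns: %s", path.name, missing)
        return False
    return True
```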

## 🧭 Example Folder Structure

```
/raw/
  └── sourceA/
      └── year=2025/
          └── month=11/
              └── day=05/
                  ├── sourceA_20251105_1145.parquet
                  └── _metadata.json
```

For more detailed guidance, see Microsoft's official [Data Lake Storage best practices](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices).