01-new-azure-Structural Best Practices
Best practice: Store external data in the raw folder as immutable, timestamped files in a partitioned layout, using formats such as Parquet or CSV, with clear naming conventions and minimal transformation.
Here’s a breakdown of best practices for storing external data in the raw folder within Microsoft Azure, particularly in Azure Data Lake Storage or Blob Storage:
🧱 Structural Best Practices
- Immutable Storage: Treat the raw zone as append-only. Never overwrite or delete files. This ensures traceability and supports reprocessing.
- Timestamped File Naming: Use filenames that include ingestion timestamps (e.g., `sourceA_20251105_1145.csv`) to track data lineage and ingestion time.
- Partitioning by Date/Source:
  - Organize folders by source and date: `raw/sourceA/year=2025/month=11/day=05/` (a path-building sketch follows after this list).
  - This improves query performance and simplifies downstream processing.
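As an illustration of the naming and partitioning conventions above, here is a minimal Python sketch that builds a timestamped, partitioned raw path and uploads a file with the `azure-storage-file-datalake` SDK. The account URL, the `raw` file system (container), and the source name are placeholder assumptions, not values from this wiki.

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://mydatalake.dfs.core.windows.net"  # placeholder account
FILE_SYSTEM = "raw"  # container acting as the raw zone


def build_raw_path(source: str, ingested_at: datetime, extension: str = "parquet") -> str:
    """Build a partitioned, timestamped path such as
    sourceA/year=2025/month=11/day=05/sourceA_20251105_1145.parquet."""
    stamp = ingested_at.strftime("%Y%m%d_%H%M")
    return (
        f"{source}/year={ingested_at:%Y}/month={ingested_at:%m}/day={ingested_at:%d}/"
        f"{source}_{stamp}.{extension}"
    )


def upload_raw_file(local_path: str, source: str) -> str:
    """Upload a local file to the raw zone; never overwrites existing files."""
    service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
    file_system = service.get_file_system_client(FILE_SYSTEM)
    remote_path = build_raw_path(source, datetime.now(timezone.utc))
    file_client = file_system.get_file_client(remote_path)
    with open(local_path, "rb") as data:
        # overwrite=False keeps the raw zone append-only: re-ingestion creates a new
        # timestamped file instead of mutating an existing one.
        file_client.upload_data(data, overwrite=False)
    return remote_path


# Example usage (hypothetical local export):
# upload_raw_file("exports/sourceA_latest.parquet", "sourceA")
```

Keeping `overwrite=False` means a re-run lands a new timestamped file rather than mutating an existing one, which preserves the append-only guarantee of the raw zone.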
📁 Format and Compression
- Preferred Formats:
  - Parquet: Efficient for analytics, supports schema evolution.
  - CSV or JSON: Acceptable for semi-structured or loosely typed data.
- Compression: Use gzip or snappy to reduce storage costs and improve I/O performance (see the sketch after this list).
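As a minimal sketch of the format and compression guidance, the snippet below writes snappy-compressed Parquet (and a gzip-compressed CSV fallback) with pandas; the sample DataFrame and file names are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for an external extract.
df = pd.DataFrame(
    {
        "customer_id": [101, 102, 103],
        "amount": [19.99, 5.00, 42.50],
    }
)
df["ingested_at"] = pd.Timestamp.now(tz="UTC")

# Parquet with snappy compression (pyarrow is the default engine when installed).
df.to_parquet("sourceA_20251105_1145.parquet", compression="snappy", index=False)

# gzip-compressed CSV is an acceptable fallback for loosely typed sources.
df.to_csv("sourceA_20251105_1145.csv.gz", compression="gzip", index=False)
```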
🔐 Security and Access Control
- Access Policies:
  - Apply least privilege access using Azure RBAC or ACLs.
  - Separate raw data access from curated zones to prevent accidental modification.
- Data Classification:
  - Tag sensitive data early using metadata or folder-level classification (a tagging sketch follows after this list).
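For illustration only, the sketch below applies a restrictive POSIX-style ACL to a raw source directory and tags it with a classification label via path metadata; the account URL, directory name, and label values are assumptions, not a prescriptive policy.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)
raw_zone = service.get_file_system_client("raw")
source_dir = raw_zone.get_directory_client("sourceA")

# Least privilege: owner read/write/execute, owning group read/execute, others nothing.
source_dir.set_access_control(acl="user::rwx,group::r-x,other::---")

# Classify early so downstream zones can apply handling rules; labels are illustrative.
source_dir.set_metadata({"classification": "confidential", "owner_team": "data-platform"})
```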
🔄 Ingestion and Metadata
- Automated Ingestion Pipelines:
  - Use Azure Data Factory, Synapse Pipelines, or Event Grid for scheduled or event-driven ingestion.
- Metadata Capture:
  - Store schema, source system, and ingestion metadata alongside the data (e.g., in a `_metadata.json` file or Azure Purview); a sidecar sketch follows after this list.
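The following sketch writes a `_metadata.json` sidecar next to an ingested raw file; the field names are illustrative rather than a fixed schema, and the same information could instead be registered in a catalog such as Azure Purview.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_metadata_sidecar(data_file: str, source_system: str, schema: dict) -> Path:
    """Write a _metadata.json sidecar describing an ingested raw file."""
    path = Path(data_file)
    metadata = {
        "file_name": path.name,
        "source_system": source_system,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": path.stat().st_size,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "schema": schema,  # e.g. {"customer_id": "int64", "amount": "double"}
    }
    sidecar = path.with_name("_metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


# Example (hypothetical path and schema):
# write_metadata_sidecar(
#     "raw/sourceA/year=2025/month=11/day=05/sourceA_20251105_1145.parquet",
#     source_system="sourceA",
#     schema={"customer_id": "int64", "amount": "double"},
# )
```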
🧪 Validation and Logging
- Audit Logging:
  - Log ingestion events, file sizes, and checksums for traceability.
- Validation Checks:
  - Perform basic schema and format validation before moving to curated zones (see the sketch after this list).
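As a minimal sketch of both practices, the function below checks that an ingested Parquet file parses and contains an expected set of columns, and writes an audit log entry with size and checksum; the expected column names are hypothetical.

```python
import hashlib
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("raw_ingestion")


def validate_and_log(data_file: str, expected_columns: set[str]) -> bool:
    """Basic schema/format validation plus an audit log entry for a raw file."""
    path = Path(data_file)
    checksum = hashlib.sha256(path.read_bytes()).hexdigest()

    try:
        df = pd.read_parquet(path)
    except Exception:
        logger.exception("Format validation failed for %s", path.name)
        return False

    missing = expected_columns - set(df.columns)
    if missing:
        logger.error("Schema validation failed for %s: missing columns %s", path.name, missing)
        return False

    logger.info(
        "Ingested %s (%d bytes, %d rows, sha256=%s)",
        path.name, path.stat().st_size, len(df), checksum,
    )
    return True


# Example:
# validate_and_log("sourceA_20251105_1145.parquet", {"customer_id", "amount"})
```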
🧭 Example Folder Structure
    /raw/
    └── sourceA/
        └── year=2025/
            └── month=11/
                └── day=05/
                    ├── sourceA_20251105_1145.parquet
                    └── _metadata.json
For more detailed guidance, Microsoft’s official [Data Lake Storage best practices](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices) offer excellent insights.