🗄️ Database and File Management

Technical Design Document for Large Audio/Video File Upload and Processing System

📝 Introduction

This document provides a comprehensive technical design for handling large audio/video file uploads, transcription output storage, and related file/database operations in the TAP system.

Table of Contents

  1. 📝 Introduction
  2. 🚧 Technical Challenges and Solutions
  3. 🔧 Implementation Details
  4. 🚀 Deployment Considerations
  5. ✅ Conclusion

🚧 Technical Challenges and Solutions

2.1 Reliable Large File Uploads

Challenges:

  • Large files are prone to upload failures due to network instability or server timeouts.
  • Ensuring that the entire file is uploaded without corruption.
  • Allowing users to resume interrupted uploads without starting over.

Solutions:

  • Chunked File Upload with Resumable Capability

    Method:

    • Split files into smaller chunks on the client side.
    • Upload chunks sequentially or in parallel to the server.
    • Implement mechanisms to track and resume interrupted uploads.

    Reasons:

    • Improves reliability by reducing the amount of data retransmitted after a failure.
    • Allows for efficient use of network resources.
    • Enhances user experience by providing feedback on upload progress.

Implementation Steps:

  1. Client-Side Chunking:

    • Use the browser File API (Blob.slice together with FileReader in JavaScript) to read and divide files into manageable chunks (e.g., 5 MB per chunk).
    • Generate a unique identifier (upload_id) for each file upload session, possibly using a UUID.
  2. Server-Side Chunk Handling (see the sketch after this list):

    • Create API endpoints to receive each chunk along with metadata:
      • upload_id: Unique identifier for the upload session.
      • chunk_number: The sequence number of the current chunk.
      • total_chunks: The total number of chunks for the file.
      • file_name: Original file name (for reference).
    • Store each chunk temporarily in a designated directory associated with the upload_id.
  3. Resume Support:

    • Implement an endpoint that allows the client to query which chunks have been successfully uploaded.
    • The client can then resume uploading from the last successful chunk.
  4. Chunk Assembly:

    • Once all chunks are received, the server assembles them into the final file in the correct order.
    • Ensure atomic operations during assembly to prevent data corruption.
  5. Error Handling and Feedback:

    • Provide clear error messages to the client in case of failures.
    • Implement retry logic for failed chunk uploads.
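
A minimal server-side sketch of steps 2 and 3, assuming chunks are kept on the local file system until assembly; the /tmp/uploads path and the function names are illustrative only, not the project's actual implementation:

import os

CHUNK_DIR = "/tmp/uploads"  # assumed temporary location for incoming chunks

def save_chunk(upload_id: str, chunk_number: int, data: bytes) -> None:
    """Persist one chunk under a directory named after the upload session."""
    upload_dir = os.path.join(CHUNK_DIR, upload_id)
    os.makedirs(upload_dir, exist_ok=True)
    chunk_path = os.path.join(upload_dir, f"{chunk_number:06d}.part")
    tmp_path = chunk_path + ".tmp"
    # Write to a temporary name first, then rename, so an interrupted request
    # never leaves a half-written chunk that looks complete.
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, chunk_path)

def received_chunks(upload_id: str) -> list:
    """Return the chunk numbers already stored, so the client can resume."""
    upload_dir = os.path.join(CHUNK_DIR, upload_id)
    if not os.path.isdir(upload_dir):
        return []
    return sorted(int(name.split(".")[0]) for name in os.listdir(upload_dir)
                  if name.endswith(".part"))

The resume endpoint (step 3) can return received_chunks(upload_id) to the client, which then re-sends only the missing chunk numbers.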

2.2 Server-Side File Storage Strategy

Challenges:

  • Efficiently storing and managing large audio/video files.
  • Ensuring quick and direct access for processing programs.
  • Preventing file name conflicts and unauthorized access.

Solutions:

  • Store Files Directly in the File System

    Method:

    • Use the server's file system to store uploaded files.
    • Organize files in a structured directory hierarchy based on user IDs and upload dates.
    • Use unique identifiers for file names to avoid conflicts and enhance security.

    Reasons:

    • Direct file system access provides efficient read/write operations for large files.
    • Simplifies file handling for processing programs that need direct access to the files.
    • File systems are optimized for handling large files and support various file operations needed by the processing programs.

Implementation Steps:

  1. Directory Structure:

    • Define a structured path for storing files:

      /storage/
        └── {user_id}/
            └── {YYYY}/
                └── {MM}/
                    └── {DD}/
                        └── {upload_id}/
                            └── {secure_file_name}.{ext}
      
    • Organizing files by user and date improves manageability and scalability.

  2. File Naming Convention (see the path-building sketch after this list):

    • Use a secure hash (e.g., SHA-256) or UUID as the file name to prevent name collisions and enhance security.
    • Store the original file name in metadata or the database for reference.
  3. File Permissions and Access Control:

    • Set appropriate file system permissions to restrict unauthorized access.
    • Files should not be accessible directly through the web server.
    • Access to files should be controlled through authenticated and authorized application logic.
  4. Efficient File Access for Processing:

    • Ensure that processing programs have the necessary permissions to read the files.
    • Optimize file paths and access methods to reduce latency and improve processing speed.
  5. Storage Capacity Planning:

    • Monitor storage usage and plan for scaling storage resources as needed.
    • Consider using high-performance storage solutions (e.g., SSDs) for faster read/write operations.
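
A sketch of steps 1 and 2, building the dated directory path and a collision-free stored name; /storage as the base path follows the layout above, and the helper name is an assumption:

import os
import uuid
from datetime import datetime, timezone

STORAGE_ROOT = "/storage"  # base path from the directory structure above

def build_storage_path(user_id: int, upload_id: str, original_filename: str) -> str:
    """Return the full path at which an uploaded file should be stored."""
    now = datetime.now(timezone.utc)
    # Keep only the extension from the user-supplied name; the stored name
    # itself is a random UUID, so it cannot collide and is hard to guess.
    _, ext = os.path.splitext(original_filename)
    secure_name = f"{uuid.uuid4().hex}{ext.lower()}"
    directory = os.path.join(
        STORAGE_ROOT, str(user_id),
        f"{now:%Y}", f"{now:%m}", f"{now:%d}", upload_id)
    os.makedirs(directory, mode=0o700, exist_ok=True)
    return os.path.join(directory, secure_name)

The original file name is recorded only in the database (step 2), while the value returned here becomes the stored file path.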

2.3 Database Design for Transcribed Text

Challenges:

  • Efficiently storing and retrieving potentially large transcribed text data.
  • Ensuring performance for read and write operations on long text fields.
  • Supporting search and analysis on transcribed text if required.

Solutions:

  • Utilize PostgreSQL's Text Handling Capabilities

    Method:

    • Use the TEXT data type in PostgreSQL for storing transcribed text.
    • Implement indexing strategies if full-text search capabilities are needed.

    Reasons:

    • PostgreSQL efficiently handles large text data with the TEXT type.
    • Provides robust support for text search and indexing features.
    • Separating transcribed text storage from file storage simplifies data management.

Implementation Steps:

  1. Database Schema Design:

    • Transcriptions Table:

      CREATE TABLE transcriptions (
          transcription_id SERIAL PRIMARY KEY,
          file_id INTEGER REFERENCES files(file_id),
          transcribed_text TEXT NOT NULL,
          created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
      );
      
    • Files Table (Metadata Only):

      CREATE TABLE files (
          file_id SERIAL PRIMARY KEY,
          user_id INTEGER REFERENCES users(user_id),
          upload_id UUID UNIQUE NOT NULL,
          original_filename VARCHAR(255) NOT NULL,
          storage_path VARCHAR(1024) NOT NULL,
          file_size BIGINT NOT NULL,
          status VARCHAR(50) NOT NULL DEFAULT 'uploaded',
          upload_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW()
      );
      
      • The storage_path field stores the file system path to the file.
      • No binary file data is stored in the database.
  2. Indexing (a search-query sketch follows this list):

    • If full-text search is needed, create a GIN index on the transcribed_text column:

      CREATE INDEX transcriptions_text_search_idx ON transcriptions USING GIN (to_tsvector('english', transcribed_text));
      
    • Enhances search performance on large text fields.

  3. Performance Optimization:

    • Regularly analyze and vacuum the database to maintain performance.
    • Monitor query performance and optimize as needed.
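
A sketch of how the GIN index above would be queried from application code; psycopg2 and the connection parameters are assumptions, not confirmed project dependencies:

import psycopg2

def search_transcriptions(conn, query: str):
    """Full-text search over transcribed_text, using the expression matched by the GIN index."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT t.transcription_id, f.original_filename, t.created_at
            FROM transcriptions t
            JOIN files f ON f.file_id = t.file_id
            WHERE to_tsvector('english', t.transcribed_text)
                  @@ plainto_tsquery('english', %s)
            """,
            (query,))
        return cur.fetchall()

# Usage sketch:
# conn = psycopg2.connect("dbname=app_db user=app_user password=...")
# rows = search_transcriptions(conn, "project kickoff meeting")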

2.4 Security Considerations

Challenges:

  • Preventing the upload and storage of malicious files (e.g., viruses, trojans).
  • Protecting system integrity and safeguarding user data.
  • Ensuring that only authorized users can access or manipulate files.

Solutions:

  • Implement File Scanning and Validation

    Method:

    • Scan uploaded files for malware after assembly.
    • Validate file types and content before processing.
    • Restrict file access through proper permissions and authentication mechanisms.

    Reasons:

    • Detects and prevents the storage and execution of harmful content.
    • Ensures system stability and security.
    • Maintains user trust by safeguarding their data.

Implementation Steps:

  1. Virus Scanning:

    • Integration:

      • Install antivirus software on the server (e.g., ClamAV).
    • Scanning Process:

      • After the file has been fully assembled, perform a virus scan:

        clamscan --stdout --no-summary /path/to/assembled/file
        
      • Parse the scan results to determine if the file is safe (see the scanning sketch after this list).

    • Handling Infected Files:

      • If a file is found to be infected:

        • Delete the file securely from the file system.
        • Update the file's status in the database to indicate the failure.
        • Notify the user and/or system administrators about the issue.
  2. File Type Validation:

    • Use file type detection libraries (e.g., python-magic) to verify the actual content type of the file matches the expected audio/video formats.

      import magic
      
      def is_valid_file_type(file_path):
          mime_type = magic.from_file(file_path, mime=True)
          allowed_types = ['audio/', 'video/']
          return any(mime_type.startswith(allowed) for allowed in allowed_types)
      
    • Reject files that do not conform to allowed types before processing.

  3. Access Control:

    • Implement authentication mechanisms to ensure that only logged-in users can upload files.
    • Enforce authorization checks so that users can only access their own files.
    • Use secure communication protocols (e.g., HTTPS) to protect data in transit.
  4. Processing in a Secure Environment:

    • Process files in isolated environments (e.g., Docker containers) to minimize the risk of malicious code affecting the host system.
    • Limit the permissions of processing programs to only what is necessary.
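
A sketch of the scanning call in step 1, assuming ClamAV's clamscan is installed and on PATH; clamscan exits with 0 when no threat is found, 1 when a threat is detected, and 2 on error, which is what the check below relies on:

import subprocess

def is_clean(file_path: str) -> bool:
    """Run ClamAV over an assembled file and report whether it is clean."""
    result = subprocess.run(
        ["clamscan", "--stdout", "--no-summary", file_path],
        capture_output=True, text=True)
    if result.returncode == 0:
        return True   # no threats found
    if result.returncode == 1:
        return False  # threat names are listed in result.stdout
    # Any other exit code means the scan itself failed; treat the file as
    # unverified rather than silently accepting it.
    raise RuntimeError(f"clamscan failed: {result.stderr.strip()}")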

🔧 Implementation Details

3.1 File Upload Mechanism

Chunked Upload Protocol:

  • Metadata for Each Chunk:

    • upload_id: Unique identifier for the upload session.
    • chunk_number: Sequence number of the current chunk.
    • total_chunks: Total number of chunks.
    • chunk_size: Size of the current chunk.
    • file_size: Total size of the file.
    • file_name: Original file name.

Server-Side API Endpoints (a minimal endpoint sketch follows the list):

  1. Initiate Upload Session:

    • Endpoint to register a new upload session.
    • Returns an upload_id to the client.
    • Stores initial metadata (e.g., total chunks expected) in a temporary data store (database table or in-memory cache).
  2. Upload Chunk:

    • Endpoint to receive and store each chunk.
    • Validates upload_id and chunk_number.
    • Saves the chunk to a temporary location in the file system.
  3. Query Upload Status:

    • Endpoint for the client to check which chunks have been received.
    • Facilitates resuming interrupted uploads.
  4. Complete Upload:

    • Endpoint to signal that all chunks have been uploaded.
    • Triggers the assembly of chunks into the final file.
    • Initiates virus scanning and validation processes.
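
A compact sketch of the four endpoints; Flask, the routes, and the in-memory sessions dictionary are illustrative assumptions only (in practice the session metadata would live in a database table, and the project's framework may differ):

import os
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
CHUNK_DIR = "/tmp/uploads"   # same assumed temporary chunk location as the section 2.1 sketch
sessions = {}                # upload_id -> metadata; use a persistent store in practice

@app.post("/uploads")
def initiate_upload():
    upload_id = str(uuid.uuid4())
    sessions[upload_id] = request.get_json()   # file_name, file_size, total_chunks
    os.makedirs(os.path.join(CHUNK_DIR, upload_id), exist_ok=True)
    return jsonify({"upload_id": upload_id}), 201

@app.put("/uploads/<upload_id>/chunks/<int:chunk_number>")
def upload_chunk(upload_id, chunk_number):
    if upload_id not in sessions:
        return jsonify({"error": "unknown upload_id"}), 404
    path = os.path.join(CHUNK_DIR, upload_id, f"{chunk_number:06d}.part")
    with open(path, "wb") as f:
        f.write(request.get_data())            # raw chunk bytes in the request body
    return jsonify({"received": chunk_number}), 200

@app.get("/uploads/<upload_id>/status")
def upload_status(upload_id):
    chunk_dir = os.path.join(CHUNK_DIR, upload_id)
    received = (sorted(int(n.split(".")[0]) for n in os.listdir(chunk_dir)
                       if n.endswith(".part"))
                if os.path.isdir(chunk_dir) else [])
    return jsonify({"received_chunks": received})

@app.post("/uploads/<upload_id>/complete")
def complete_upload(upload_id):
    # Trigger chunk assembly, checksum verification and virus scanning here
    # (sections 3.4 and 3.5), then create the files record in the database.
    return jsonify({"status": "assembling"}), 202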

Chunk Assembly Process:

  • Verify that all chunks have been received by comparing the number of chunks stored with total_chunks.
  • Read chunks in the correct order and write them to the final file location.
  • Perform the assembly operation atomically to prevent partial writes (see the sketch below).
  • Delete temporary chunk files after successful assembly.
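
A sketch of the assembly step, using a write-then-rename pattern so the final path only ever holds a complete file; os.replace is atomic when the temporary and final paths are on the same file system, which is assumed here, and chunk numbering is assumed to start at 0:

import os

def assemble_chunks(chunk_dir: str, total_chunks: int, final_path: str) -> None:
    """Concatenate all chunks in order and publish the result atomically."""
    parts = [os.path.join(chunk_dir, f"{i:06d}.part") for i in range(total_chunks)]
    missing = [p for p in parts if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError(f"{len(missing)} chunk(s) still missing")
    tmp_path = final_path + ".assembling"
    with open(tmp_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in fixed-size blocks so memory use stays flat for large files.
                while block := src.read(1024 * 1024):
                    out.write(block)
    os.replace(tmp_path, final_path)   # atomic publish of the finished file
    for part in parts:                 # remove temporary chunk files
        os.remove(part)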

Error Handling:

  • If any chunk is missing or corrupted, notify the client to re-upload the specific chunk.
  • Implement timeouts or expiration for incomplete uploads to clean up unused temporary data.

3.2 File Storage Format and Organization

File System Storage:

  • Advantages:

    • Direct access to files for processing without the overhead of database retrieval.
    • Efficient handling of large files due to file system optimizations.
    • Simplifies backup and restoration processes.

Storage Path Example:

/storage/
  └── {user_id}/
      └── {YYYY}/
          └── {MM}/
              └── {DD}/
                  └── {upload_id}/
                      └── {secure_file_name}.{ext}

Implementation Steps:

  1. Determine Storage Base Path:

    • Define a root directory (e.g., /storage/) outside of the web server's document root.
  2. Set Up Directory Hierarchy:

    • When a new file is uploaded, create directories based on user ID and date if they do not already exist.
    • Use secure methods to create directories to prevent race conditions.
  3. File Naming and Metadata:

    • Rename files using a secure hash or UUID to avoid conflicts and enhance security.
    • Store the mapping between the original file name and the stored file name in the database.
  4. File Access Permissions:

    • Set file permissions so that only the application and authorized processes can read/write the files.
    • Ensure that files cannot be accessed directly via URLs.
  5. Processing Programs Access:

    • Provide processing programs with the necessary file paths and permissions.
    • Optimize file read/write operations if processing is resource-intensive.
  6. Storage Maintenance:

    • Monitor disk usage and implement policies for archiving or deleting old files if necessary.
    • Consider implementing quotas to prevent individual users from consuming excessive storage (a monitoring sketch follows this list).
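
A sketch for step 6, checking overall free space and per-user usage; the /storage root and any alert threshold are assumptions:

import os
import shutil

STORAGE_ROOT = "/storage"

def free_space_ratio(path: str = STORAGE_ROOT) -> float:
    """Fraction of the storage volume still free; alert when it drops below a chosen threshold."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def user_storage_bytes(user_id: int, root: str = STORAGE_ROOT) -> int:
    """Total bytes currently stored for one user, for quota enforcement."""
    user_dir = os.path.join(root, str(user_id))
    total = 0
    for dirpath, _dirnames, filenames in os.walk(user_dir):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total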

3.3 Database Schema Design

Files Table:

  • Stores metadata about uploaded files, not the files themselves.
CREATE TABLE files (
    file_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    upload_id UUID UNIQUE NOT NULL,
    original_filename VARCHAR(255) NOT NULL,
    storage_path VARCHAR(1024) NOT NULL,
    file_size BIGINT NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'uploaded',
    upload_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    processing_start_time TIMESTAMP WITH TIME ZONE,
    processing_end_time TIMESTAMP WITH TIME ZONE
);
  • Fields Explanation:

    • file_id: Unique identifier for the file record.
    • user_id: References the user who uploaded the file.
    • upload_id: Unique identifier for the upload session, used for resuming uploads.
    • original_filename: The name of the file as provided by the user.
    • storage_path: The file system path where the file is stored.
    • file_size: Size of the file in bytes.
    • status: Current status of the file (e.g., 'uploaded', 'processing', 'completed', 'failed').
    • upload_timestamp: Timestamp when the file was uploaded.
    • processing_start_time and processing_end_time: Timestamps for tracking processing duration.

Transcriptions Table:

  • Stores the transcribed text associated with a file.
CREATE TABLE transcriptions (
    transcription_id SERIAL PRIMARY KEY,
    file_id INTEGER REFERENCES files(file_id),
    transcribed_text TEXT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

Indexes and Constraints:

  • Create indexes on commonly queried fields (e.g., user_id, status) to improve query performance.
  • Use foreign key constraints to maintain referential integrity between files and transcriptions.

3.4 Data Integrity and Completeness

Checksum Verification:

  • Purpose:

    • Ensures that the file received is complete and uncorrupted.
  • Implementation Steps:

    1. Client-Side (Optional):

      • Calculate a checksum (e.g., MD5, SHA-256) of the entire file before uploading.
      • Send the checksum value to the server along with upload initiation.
    2. Server-Side:

      • After assembling the file from chunks, calculate the checksum (see the sketch below).
      • Compare it with the client-provided checksum (if available).
      • Alternatively, calculate and store the server-side checksum for future reference.
  • Handling Mismatches:

    • If checksums do not match, mark the upload as failed.
    • Notify the client to re-upload the file.
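
A sketch of the server-side verification, streaming the assembled file through hashlib so large files never have to be read into memory at once; the client value is assumed to be a hex SHA-256 digest:

import hashlib
from typing import Optional

def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Hex SHA-256 digest of a file, computed in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(block_size):
            digest.update(block)
    return digest.hexdigest()

def verify_upload(path: str, client_checksum: Optional[str]) -> bool:
    """Compare against the client's checksum when provided; otherwise accept and record ours."""
    server_checksum = sha256_of_file(path)
    if client_checksum is None:
        return True   # nothing to compare; store server_checksum for future reference
    return server_checksum == client_checksum.lower()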

Transactional Operations:

  • Use database transactions when updating multiple related records to ensure atomicity.
  • For example, when updating the file status and creating a transcription record, perform both operations within a transaction, as in the SQL below and the application-level sketch that follows it.
BEGIN;

-- Update file status
UPDATE files SET status = 'completed', processing_end_time = NOW() WHERE file_id = {file_id};

-- Insert transcription record
INSERT INTO transcriptions (file_id, transcribed_text) VALUES ({file_id}, {transcribed_text});

COMMIT;
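
The same transaction expressed at the application level; assuming psycopg2, using the connection as a context manager commits on success and rolls back if an exception is raised, keeping the two statements atomic:

def store_transcription(conn, file_id: int, transcribed_text: str) -> None:
    """Mark the file as completed and save its transcription in one transaction.

    conn is expected to be an open psycopg2 connection.
    """
    with conn:                         # commit on success, rollback on error
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE files SET status = 'completed', processing_end_time = NOW() "
                "WHERE file_id = %s",
                (file_id,))
            cur.execute(
                "INSERT INTO transcriptions (file_id, transcribed_text) "
                "VALUES (%s, %s)",
                (file_id, transcribed_text))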

Error Handling and Logging:

  • Implement comprehensive error handling throughout the upload and processing workflows.
  • Log errors and exceptions to assist with debugging and auditing.

3.5 Handling Malicious Files

Antivirus Scanning:

  • Integration with ClamAV:

    • Install ClamAV on the server.
    • Update virus definitions regularly using freshclam.
  • Scanning Process:

    • After the file is assembled and before processing, scan the file:

      clamscan --stdout --no-summary /path/to/file
      
    • Parse the output to determine if any threats were detected.

  • Handling Detected Threats:

    • If a file is identified as malicious:

      • Delete the file securely.
      • Update the file's status in the database to 'infected' or 'failed'.
      • Notify the user that their file could not be processed due to security concerns.

File Type Validation:

  • Verify that the file's MIME type and content match allowed audio/video formats.

  • Implementation Steps:

    1. Detect MIME Type:

      • Use libraries such as python-magic to detect the MIME type.
    2. Validate Against Allowed Types:

      • Define a list of allowed MIME types (e.g., audio/mpeg, video/mp4), as in the sketch after this list.
      • Reject files that do not match the allowed types.
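
A sketch of steps 1 and 2 with an explicit whitelist (rather than the prefix check shown in section 2.4); the exact set of allowed types is an assumption and should be replaced by the formats the transcription pipeline actually supports:

import magic  # python-magic

ALLOWED_MIME_TYPES = {
    "audio/mpeg", "audio/wav", "audio/x-wav", "audio/mp4",
    "video/mp4", "video/quicktime", "video/x-matroska",
}

def validate_mime_type(file_path: str) -> str:
    """Return the detected MIME type, or raise if it is not on the whitelist."""
    mime_type = magic.from_file(file_path, mime=True)
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"Unsupported file type: {mime_type}")
    return mime_type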

Processing in Isolation:

  • Run file processing tasks in sandboxed environments (e.g., Docker containers) to limit potential damage from malicious files.

  • Benefits:

    • Limits the scope of any potential exploit.
    • Makes it easier to manage dependencies and environments for processing tasks.

Access Control and Permissions:

  • Ensure that the application and processing services run with the least privileges necessary.

  • Avoid running processes as the root user.

🚀 Deployment Considerations

4.1 Database Deployment on Server

Installation of PostgreSQL:

  • Install PostgreSQL on the server:

    sudo apt-get update
    sudo apt-get install postgresql postgresql-contrib
    

Configuration:

  • Configure postgresql.conf and pg_hba.conf to set up authentication and network settings.

  • Security Best Practices:

    • Use strong passwords for database users.
    • Restrict remote access if not necessary.
    • Enable SSL for database connections if transmitting sensitive data over networks.

Database Setup Steps:

  1. Create a Database User:

    CREATE USER app_user WITH PASSWORD 'secure_password';
    
  2. Create the Application Database:

    CREATE DATABASE app_db OWNER app_user;
    
  3. Set Up Permissions:

    • Grant the necessary permissions to app_user on app_db.

Maintenance:

  • Regularly back up the database using tools like pg_dump.
  • Monitor database performance and adjust configurations as needed.

4.2 File Storage Setup

Directory Creation:

  • Create the root storage directory:

    sudo mkdir -p /storage
    

Setting Permissions:

  • Assign ownership to the application user (e.g., app_user):

    sudo chown -R app_user:app_group /storage
    
  • Set appropriate permissions:

    sudo chmod -R 700 /storage
    

Storage Considerations:

  • Disk Space Monitoring:

    • Implement monitoring solutions to track disk usage.
    • Set up alerts for low disk space conditions.
  • Scalability:

    • Plan for additional storage capacity as the application grows.
    • Consider using network-attached storage (NAS) or storage area networks (SAN) if required.
  • Backup and Recovery:

    • Implement regular backups of the storage directories.
    • Use incremental backups to save space and time.
    • Test backup restoration procedures periodically.

Security Measures:

  • Ensure that storage directories are not accessible via the web server.
  • Use firewalls and security groups to restrict access to storage resources.

✅ Conclusion

This technical design document outlines a robust approach to handling large audio and video file uploads, storage, and processing, storing files directly in the file system rather than in the database. By implementing chunked uploads with resumable capability, organizing files in a structured directory hierarchy, and using PostgreSQL for metadata and transcribed text, the system addresses the key challenges of reliability, efficiency, data integrity, and security. The implementation steps above are intended to guide developers in building a scalable, maintainable, and secure system that meets the specified requirements.