HMIS release of information documentation files: upload strategy - servinglynk/hslynk-open-source-docs GitHub Wiki

Purpose

The idea behind this wiki page is to give direction on how to store HMIS release of information files and other client documentation data within HMISWarehouse.

Solution

The idea is to store the files in PostgreSQL via a LOB/BYTEA column. When designing a "perfect" solution for storing files, I've defined the following desired features:

  • transactional
  • properly enforced referential integrity
  • low memory requirements (no problems with memory limits)
  • support for streaming (in ideal case)

I came up with a solution based on two tables - "Files" and "File_data". The "Files" table contains information about the files (metadata), but not the data itself or any reference to the filesystem or a BLOB:

Here's how it will be done:

CREATE TABLE Files (
    id       SERIAL      PRIMARY KEY,
    filename VARCHAR(64) NOT NULL,
    filesize INT         NOT NULL,
    created  TIMESTAMP   NOT NULL,
    ... other metadata ...
);

The data of the file is contained in the "File_data" table, split into segments of arbitrary length:

CREATE TABLE File_data (
    file_id  INT,
    segment  INT,
    data     BYTEA,
    PRIMARY KEY (file_id, segment)
);

ALTER TABLE File_data ADD FOREIGN KEY (file_id) REFERENCES Files(id);
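The segmenting itself happens in the application layer before the rows are inserted. A minimal sketch of that logic in Python (the 8 kB segment size and the `split_into_segments` helper name are illustrative choices, not part of the schema above):

```python
# Split a file's bytes into fixed-size segments, one per File_data row,
# keyed by a zero-based segment index.
SEGMENT_SIZE = 8 * 1024  # 8 kB - an arbitrary but memory-friendly choice


def split_into_segments(data: bytes, segment_size: int = SEGMENT_SIZE):
    """Yield (segment_index, chunk) pairs covering all of `data`."""
    for index, offset in enumerate(range(0, len(data), segment_size)):
        yield index, data[offset:offset + segment_size]


# Each pair maps directly onto one INSERT, e.g.:
#   INSERT INTO File_data (file_id, segment, data) VALUES (%s, %s, %s)
```

Inserting the Files row and all of its File_data rows in one transaction is what gives us the transactional property from the feature list.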

Regarding the desired features mentioned above, the first two (transactional behavior and referential integrity) are obviously met. The third (low memory requirements) depends solely on the segment size, as the segment size determines the amount of memory required - for example, an 8 kB segment requires about 32 kB of memory.

The last requirement (support for streaming) is not met - at least not literally, as it is not possible to open the file as a stream (as is possible with a BLOB). But when using sufficiently small segments and reading them one at a time, the effect is close enough.
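The segment-by-segment read can be sketched as a small copy loop. A hedged illustration, assuming the segments arrive ordered by their index (as a `SELECT ... ORDER BY segment` over File_data would return them, ideally via a server-side cursor); the `stream_segments` name is hypothetical:

```python
import io


def stream_segments(segments, out):
    """Copy file segments into `out` one chunk at a time.

    `segments` stands in for rows fetched with:
        SELECT data FROM File_data WHERE file_id = %s ORDER BY segment
    using a server-side cursor, so only one segment is held in memory
    at any moment. Returns the number of bytes written, which should
    match Files.filesize for an intact file.
    """
    written = 0
    for chunk in segments:
        out.write(chunk)
        written += len(chunk)
    return written
```

For example, `out` could be an HTTP response body, so a large release-of-information document never has to be materialized in full on the application server.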

Why PostgreSQL?

PostgreSQL offers fast retrieval of LOB data compared to many other RDBMSs. LOBs can store large volumes of data, but an important caveat is that LOBs must be accessed within a SQL transaction block. Since it supports large amounts of data, we could also incorporate versioning of documents.

Where does Big Data Hadoop (HDFS/HBase) fit in this picture?

Once the data is loaded into PostgreSQL, we can then store it in HDFS using a sync process. HDFS is the Hadoop Distributed File System, and it is a file system rather than a database. Hadoop is more scalable and also provides replication support.

Conclusion

Storing data in the Files and File_data tables could be handled by a separate microservice, just like the notification/audit service. All these documents could live in a completely separate PostgreSQL database, which could run on a different database server to provide additional security for the documents. A separate database would also ensure we don't run into space constraints in the existing HMIS DB as our data grows.

P.S.: I'll have to draw an architecture diagram to accompany this.
