File Uploads - opengovsg/GoGovSG GitHub Wiki

In the near future, GoGovSG will release a feature that allows users to upload files and share them via shortlinks. This page documents the design decisions and thought process behind the implementation of this feature.

Constraints

When designing our implementation of this feature, these were the constraints we took into account.

  1. One-to-one mapping between a shortlink and an S3 bucket's object key - This lets us tell at a glance which link an object belongs to. It also gives us another guarantee: if a particular shortlink has not been taken yet, the corresponding S3 object key is available as well.
  2. Deletion of shortlinks is not allowed.

S3 configuration

The S3 bucket was originally configured with a bucket-wide public-read policy. This was in alignment with the philosophy of GoGovSG being a public link shortener. However, if we wished to allow file URLs to be disabled, we would need to be able to make individual S3 objects private, which can only be done per object through its access control list (ACL). A bucket policy that grants public read applies regardless of an object's ACL, so a private ACL could not take effect under the original configuration. This necessitated a switch: the bucket is now configured so that all objects are private by default, and each object needs the 'public-read' ACL set in order to be visible.
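A minimal sketch of how enabling and disabling a file link could map onto object ACLs under this configuration. The `ObjectAclStore` interface and the in-memory class are hypothetical stand-ins rather than the AWS SDK's API; in a real deployment the `setAcl` call would correspond to S3's PutObjectAcl operation and would be asynchronous.

```typescript
// The two canned ACL values this page is concerned with.
type ObjectAcl = 'private' | 'public-read'

// Hypothetical, minimal view of the bucket (not the AWS SDK interface).
interface ObjectAclStore {
  setAcl(key: string, acl: ObjectAcl): void
  getAcl(key: string): ObjectAcl
}

// In-memory stand-in: objects without an explicit ACL are private,
// mirroring the "private by default" bucket configuration.
class InMemoryAclStore implements ObjectAclStore {
  private acls: Record<string, ObjectAcl> = {}

  setAcl(key: string, acl: ObjectAcl): void {
    this.acls[key] = acl
  }

  getAcl(key: string): ObjectAcl {
    return this.acls[key] || 'private'
  }
}

// Enabling a file link makes its object readable; disabling hides it again.
function setFileLinkEnabled(
  store: ObjectAclStore,
  key: string,
  enabled: boolean,
): void {
  store.setAcl(key, enabled ? 'public-read' : 'private')
}

function isPubliclyReadable(store: ObjectAclStore, key: string): boolean {
  return store.getAcl(key) === 'public-read'
}
```

The point of the sketch is that visibility becomes a per-object decision, which is what lets a single file link be disabled without touching the rest of the bucket.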

Phases in a file upload

In light of the constraints and S3 configuration, the file upload process is split into three operations.

  1. Creation of the shortUrl - This serves as a way for us to 'reserve' both the shortlink and the S3 object key. If this operation fails, we know that there might be a collision on the object key, and therefore should not perform the upload operation.
  2. Upload file to S3 - In this upload step, the client could either obtain a pre-signed URL to upload the file directly to S3, or send the file to the server and have it forwarded to the bucket.
  3. Set object's ACL to be 'public-read'

Ensuring atomicity

Because this upload operation spans multiple services, we need a guarantee of atomicity. We do not want shortlinks pointing to nonexistent S3 objects, nor orphaned S3 objects that do not belong to any shortlink.

Option 1: Client-side upload

One option we considered was to let the upload task be done directly from the client. This would entail the following steps:

  1. Client makes a regular request to create a link. This would count as a reservation of the shortlink.
  2. Client requests a pre-signed URL from the server, which will allow the client to send an authorised upload request to S3.
  3. Client makes another API call to the server to trigger an S3 update of the ACL.

Benefits: Having the upload done from the client saves server bandwidth, since the binary data goes directly to the S3 bucket.

Problems encountered: It is difficult, if not impossible, to guarantee atomicity across the entire upload operation. If something goes wrong in step 2 or 3, constraint 2 (deletion of shortlinks is not allowed) makes it impossible to roll back the creation of the shortlink in step 1.
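The failure mode can be illustrated with a small sketch. Everything here is a hypothetical in-memory model (`ShortlinkTable`, `clientSideUpload`, and the callback types are illustrative names, not GoGovSG code), and the calls are synchronous for brevity where real requests would be asynchronous.

```typescript
// Hypothetical stand-in for the shortlink table.
// Per constraint 2, it deliberately has no delete operation.
class ShortlinkTable {
  private taken: Record<string, true> = {}

  create(shortUrl: string): void {
    if (this.exists(shortUrl)) {
      throw new Error('shortlink already taken: ' + shortUrl)
    }
    this.taken[shortUrl] = true
  }

  exists(shortUrl: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.taken, shortUrl)
  }
}

// Stand-ins for the direct-to-S3 upload and the ACL update request.
type DirectUpload = (key: string, body: string) => void
type MakePublic = (key: string) => void

type ClientUploadOutcome = 'ok' | 'dangling-shortlink'

function clientSideUpload(
  table: ShortlinkTable,
  shortUrl: string,
  body: string,
  upload: DirectUpload,
  makePublic: MakePublic,
): ClientUploadOutcome {
  // Step 1: the reservation is durable as soon as this call returns.
  table.create(shortUrl)
  try {
    // Step 2: upload directly to S3 using the pre-signed URL.
    upload(shortUrl, body)
    // Step 3: ask the server to set the object's ACL to 'public-read'.
    makePublic(shortUrl)
  } catch {
    // Step 1 cannot be undone: the table offers no way to delete the link.
    return 'dangling-shortlink'
  }
  return 'ok'
}
```

If the upload callback throws, the shortlink still exists in the table but has no publicly readable object behind it, which is the inconsistent state described above.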

Option 2: Server-side upload

If the upload is done server-side, we can use a DB transaction at the application level to ensure atomicity across the entire upload flow.

  1. Client sends shortlink and file to server.
  2. Server opens a DB transaction and creates the shortlink within it. This reserves the shortlink, because the DB's ACID guarantees rule out dirty reads.
  3. Server uploads file to S3.
  4. Depending on the outcome of the upload operation, the server either commits the transaction or rolls it back (which 'undoes' the link creation).
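The steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `FakeDb`, `FakeTransaction`, and `serverSideUpload` are hypothetical names, the database is an in-memory model, and the calls are synchronous for brevity where real DB and S3 calls would be asynchronous.

```typescript
function hasKey(obj: Record<string, true>, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key)
}

// Hypothetical transaction: writes stay pending until commit.
class FakeTransaction {
  private committed: Record<string, true>
  private pending: string[] = []

  constructor(committed: Record<string, true>) {
    this.committed = committed
  }

  createLink(shortUrl: string): void {
    if (hasKey(this.committed, shortUrl) || this.pending.indexOf(shortUrl) !== -1) {
      throw new Error('shortlink already taken: ' + shortUrl)
    }
    this.pending.push(shortUrl)
  }

  commit(): void {
    for (const shortUrl of this.pending) {
      this.committed[shortUrl] = true
    }
    this.pending = []
  }

  rollback(): void {
    // Discard pending writes; nothing becomes visible to other readers.
    this.pending = []
  }
}

// Hypothetical in-memory database holding committed shortlinks.
class FakeDb {
  private committed: Record<string, true> = {}

  begin(): FakeTransaction {
    return new FakeTransaction(this.committed)
  }

  hasLink(shortUrl: string): boolean {
    return hasKey(this.committed, shortUrl)
  }
}

// Stand-in for the server-to-S3 upload.
type S3Upload = (key: string, body: string) => void

// Returns true if the link and file were both created, false otherwise.
function serverSideUpload(
  db: FakeDb,
  upload: S3Upload,
  shortUrl: string,
  body: string,
): boolean {
  const tx = db.begin()
  try {
    tx.createLink(shortUrl) // step 2: reserve the shortlink inside the transaction
    upload(shortUrl, body) // step 3: upload the file to S3
    tx.commit() // step 4: upload succeeded, make the link visible
    return true
  } catch {
    tx.rollback() // step 4: something failed, undo the link creation
    return false
  }
}
```

Because the link is only committed after the upload returns, a failed upload leaves no shortlink behind, and a shortlink collision is detected before any file is sent to the bucket.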

Benefits: Atomicity can be guaranteed on the server via a database transaction.

Drawbacks: More server bandwidth and RAM are used, since files are sent to the server and then relayed to S3.

Decision

The team decided on option 2 (server-side uploads) on the following grounds:

  1. Ensuring atomicity in the application is of utmost importance: at no point should the state of the database and the files be out of sync.
  2. File uploads are limited to 10MB, making the resource consumption much less of a problem.
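A size check of this kind could be enforced on the server before the file is relayed to S3. The sketch below is an assumption-laden illustration: the constant and function names are hypothetical, and "10MB" is interpreted here as 10 * 1024 * 1024 bytes, which the page does not specify.

```typescript
// Assumption: "10MB" is read as 10 * 1024 * 1024 bytes.
const MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024

// Returns true if a file of the given size may be uploaded.
// NaN and negative sizes are rejected because both comparisons fail for them.
function isWithinUploadLimit(sizeInBytes: number): boolean {
  return sizeInBytes >= 0 && sizeInBytes <= MAX_FILE_SIZE_BYTES
}
```

Rejecting oversized files up front bounds the bandwidth and memory the server spends on any single upload.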