Data Storage with S3
As its title suggests, Amazon Simple Storage Service (S3) is a straightforward, robust utility for storing and retrieving data in the cloud. The basic building block of S3 is the S3 bucket, the container where we store information. Within a bucket you can create “prefixes,” which act like folders and are used for organization. Finally, a file stored in the cloud, such as a fastq or fast5 file, is called an “object.” File size should rarely be an issue, since a single object in S3 can be up to 5 TB. However, as with computing power, every bit of storage used must be paid for, so it is best to keep data in the cloud only while it is being used.
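Putting those pieces together, the full path (or URI) to an object combines the bucket name, any prefixes, and the object name. As a purely hypothetical example (the bucket, prefix, and file names below are made up):

s3://gbi-example-bucket/raw-reads/sample1.fastq

Here gbi-example-bucket is the bucket, raw-reads/ is a prefix, and sample1.fastq is the object.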
There are two main ways to create buckets, upload data into them, and delete them: from the AWS online interface (the S3 console), and from the command line.
Creating a bucket from the AWS online interface
- Type S3 in the search bar and navigate to the S3 dashboard, which should list all the S3 buckets in the account that you have permission to see.
- Click the orange button that says “Create bucket.” You should see the Create bucket interface now.
- Come up with a simple name for your bucket describing who is using it and what you will be storing in it (e.g., name-genome-assembly). AWS S3 bucket names must be globally unique, meaning the name of every bucket on AWS (including ones from other people's accounts) has to be different. Names also cannot contain spaces or uppercase letters.
- Use the AWS region "US West (N. California) us-west-1".
- Keep “Block all public access” checked.
- Keep “Bucket Versioning” disabled.
- Add a cost-allocation tag (as described here) so that we can track the usage of this bucket.
- Leave “Default encryption” disabled.
- Click “Create bucket.”
You’ll then be led back to the S3 dashboard where you can see your new bucket listed.
To upload data, click on your bucket from the S3 dashboard. This takes you to a new dashboard showing all of the objects currently stored in the bucket. Click the orange “Upload” button, then either drag and drop your file from your computer, or click “Add files,” navigate to the file you want to upload, and click “Open.” Once your file is listed, click “Upload” at the bottom of the dashboard and wait for the upload to complete. When it finishes (it will say “Upload succeeded”), you can click the orange “Close” button.
To delete data, click on your bucket from the S3 dashboard again. Check the box next to the file you want to delete and press the white/grey “Delete” button. You will have to confirm deleting the file by typing “delete” when prompted.
Lastly, to delete a bucket, go to the main S3 dashboard where all the buckets are listed. Check the box next to the bucket you want to delete (and make sure it is the correct one), and press the delete button at the top of the dashboard. It will prompt you to type the name of the bucket you are deleting before allowing you to delete it.
That’s it!
Interacting with S3 using the command line
To interact with S3 from the command line, you will need to be logged into an AWS EC2 instance that has read/write access to S3, via ssh or EC2 Instance Connect (as described here).
First, the instance needs to have the AWS CLI package:
sudo apt install awscli
If this is giving you trouble, try the following (for Mac):
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
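Either way, you can check that the installation worked by printing the CLI's version (the exact output will vary with the version installed):

aws --version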
Then you can use the following command to list all of the S3 buckets that you have access to on the account:
aws s3 ls
To create a new bucket:
aws s3 mb s3://[bucket-name] --region [region-name]
Here [bucket-name] is the name of the bucket you want to create, which must still follow the naming rules described in the console instructions above (globally unique, no spaces or uppercase letters). For example, to create the bucket my-first-bucket in the us-west-1 region, the command would be:
aws s3 mb s3://my-first-bucket --region us-west-1
To list the objects (the files) that are in your bucket:
aws s3 ls s3://[bucket-name]
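Prefixes work here too. As a hypothetical example, to list only the objects under a raw-reads/ prefix in a bucket named my-first-bucket:

aws s3 ls s3://my-first-bucket/raw-reads/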
To upload a file to your bucket:
aws s3 cp [my-file] s3://[bucket-name]/[path-to-desired-location]
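For example, to upload a hypothetical file reads.fastq into that same raw-reads/ prefix (all names here are placeholders):

aws s3 cp reads.fastq s3://my-first-bucket/raw-reads/reads.fastq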
To delete a file from your bucket:
aws s3 rm s3://[bucket-name]/[path-to-file-you-want-to-delete]/[filename.filetype]
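Continuing the hypothetical example, deleting that same file would look like:

aws s3 rm s3://my-first-bucket/raw-reads/reads.fastq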
Finally, to delete a bucket, it must first be empty of all objects; then you can use:
aws s3 rb s3://[bucket-name]
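If the bucket still contains objects, you can empty it first. Be very careful with the first command below: it deletes every object in the bucket, so double-check the bucket name before running it.

aws s3 rm s3://[bucket-name] --recursive
aws s3 rb s3://[bucket-name]

The rb command also accepts a --force flag that empties and removes the bucket in one step.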
To copy multiple files with similar names, you can use wildcards:
aws s3 cp s3://[bucket-name] ./ --recursive --exclude="*" --include="your_file_pattern*"
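As a hypothetical example, to download every fast5 file from my-first-bucket into the current directory:

aws s3 cp s3://my-first-bucket ./ --recursive --exclude="*" --include="*.fast5"

The --exclude="*" filter first excludes everything, and the --include filter then re-admits the files matching the pattern; filters that appear later in the command take precedence over earlier ones.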
How do we store data at the GBI?
The hope is that if you are reading this page, you will be doing analysis work for the GBI! That means you will be generating results of some sort. To help you store those results on our S3, here is an overview of the structure of our S3 infrastructure.
We have 4 main buckets of information. These consist of:
gbi-raw-data
This is where we store the sequencing data for the rare and endangered plants that the GBI is working with. Students do not have access to store or edit any files in this bucket, as this data is sensitive. These files can be copied and worked with, just not edited in any way. For any files that you feel should be stored in this bucket, please talk to your professor. Within this bucket you will find a folder for every plant that we have data on, and each folder contains that plant's corresponding raw sequencing data (or any other relevant data files).

gbi-analysis
This bucket has a similar structure to the gbi-raw-data bucket, with one folder for each plant. Instead of containing the raw data associated with that plant, these folders hold all analysis results. Most students will also not have access to edit files within this bucket. If you are doing work that will generate files that you think should be stored here, please ask your professor.

gbi-faculty-projects and gbi-student-projects
These two buckets contain folders for the students and faculty who are doing analysis work on the GBI AWS account. If you are a student and you have been given permission to store data on our AWS account, please create a folder with your name on it inside of the gbi-student-projects bucket and store your data and results there.
It is worth looking in the gbi-analysis bucket to see how we organize results within the plant analysis folders. You will likely see folders containing information that is very specific to whatever experiment or analysis is being done. This might include folder names with the software being tested, certain variables that are being changed, dates, plant names, the type of analysis (assembly vs. annotation, for example), etc. Please be very descriptive when storing your data, and do not be afraid to use lots of folders to create good organization. It will save you lots of time in the future (and potentially save your entire project if you ever need to go back and see what you did or re-do analysis work!)
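As a purely hypothetical sketch (the name, plant, software, and dates below are all made up), a well-organized student folder in the gbi-student-projects bucket might look like:

gbi-student-projects/
    your-name/
        example-plant/
            2021-06-flye-assembly/
            2021-07-assembly-quality-check/

Each layer of prefixes records one piece of context (who, which plant, when, and what analysis), which makes results easy to find later.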
More information:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html
https://www.sqlshack.com/learn-aws-cli-interact-with-aws-s3-buckets-using-aws-cli/