Working with remote inputs
Snakemake has built-in support to handle remote file access for many protocols:
- Amazon Simple Storage Service (AWS S3)
- Google Cloud Storage (GS)
- Microsoft Azure Storage
- File transfer over SSH (SFTP)
- Read-only web (HTTP[S])
- File transfer protocol (FTP)
- GenBank / NCBI Entrez
- Dropbox
- XRootD
- WebDAV
- GFAL
- GridFTP
- iRODS
- EGA
Using a remote file in a rule requires the following steps:
- Import a Python module for the specific remote access protocol in the snakefile; these modules are included in Snakemake's distribution
- Initialize a remote provider object in the snakefile
- Declare the input file as remote in the rule with a syntax that depends on the protocol; generally, the syntax looks like this:
    input: <remote_provider_object_name>.remote('path/to/file.txt')
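As a minimal sketch of these three steps with a provider that needs credentials, the Snakefile fragment below reads an input from AWS S3. The bucket name, object path, rule name, and output path are placeholders, and reading the keys from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables is an assumption made for illustration. It also assumes a Snakemake version that still ships the `snakemake.remote` modules (they were replaced by storage plugins in Snakemake 8) and that `boto3` is installed.

```snakemake
import os

# Step 1: import the provider module for the protocol (here AWS S3)
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

# Step 2: initialize the provider object; credentials are read from the
# environment here (an assumption for this sketch, not a requirement)
S3 = S3RemoteProvider(
    access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

rule count_lines_from_s3:  # hypothetical rule name
    input:
        # Step 3: declare the input as remote; "bucket-name/file.txt" is a placeholder
        S3.remote("bucket-name/file.txt")
    output:
        "results/line_count.txt"
    shell:
        "wc -l {input} > {output}"
```

Only the import and the provider's constructor arguments change from one protocol to another; the rule itself keeps the same shape.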
The following example from Snakemake's documentation shows how to use a remote file over HTTP:
    from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider  # Import provider module

    HTTP = HTTPRemoteProvider()  # Initialize provider object

    rule remote_input_rule:
        input:
            HTTP.remote("www.example.com/path/to/document.txt")  # Declare input as remote
        output:
            'results/output.txt'
        shell:
            'grep "Snakemake" {input} > {output}'

When working with remote files, Snakemake downloads a local copy of the file into the current working directory, under a path that mirrors the remote address (www.example.com/path/to/document.txt for the example above). By default, the local copy is deleted once the jobs that require it have completed. To keep the local copy, pass keep_local=True to the <provider>.remote() function.
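The `keep_local` option can be sketched by adapting the HTTP example above. The rule name is hypothetical and the URL is the same placeholder as before; the only functional change is the extra keyword argument.

```snakemake
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

rule remote_input_kept:  # hypothetical rule name
    input:
        # keep_local=True asks Snakemake not to delete the downloaded copy
        HTTP.remote("www.example.com/path/to/document.txt", keep_local=True)
    output:
        "results/output.txt"
    shell:
        'grep "Snakemake" {input} > {output}'
```

Keeping the copy can be convenient for large inputs that several later runs reuse, at the cost of the extra disk space.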