Working with remote inputs

Snakemake has built-in support for accessing remote files over many protocols and services:

- Amazon Simple Storage Service (AWS S3)
- Google Cloud Storage (GS)
- Microsoft Azure Storage
- File transfer over SSH (SFTP)
- Read-only web (HTTP[S])
- File transfer protocol (FTP)
- GenBank / NCBI Entrez
- Dropbox
- XRootD
- WebDAV
- GFAL
- GridFTP
- iRODS
- EGA

Using a remote file in a rule requires the following steps:

  • Import the Python module for the specific remote access protocol in the Snakefile; these modules are included in Snakemake's distribution
  • Initialize a remote provider object in the Snakefile
  • Declare the input file as remote in the rule; the exact syntax depends on the protocol, but it generally looks like this: input: <remote_provider_object_name>.remote('path/to/file.txt')

The following example from Snakemake's documentation shows how to use a remote file over HTTP:

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider  # Import provider module

HTTP = HTTPRemoteProvider()  # Initialize provider object

rule remote_input_rule:
    input:
        HTTP.remote("www.example.com/path/to/document.txt")  # Declare input as remote
    output:
        'results/output.txt'
    shell:
        'grep "Snakemake" {input} > {output}'
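
The same three steps apply to other providers; only the import, the provider arguments, and the path format change. The sketch below does this for AWS S3. It assumes a Snakemake release that still ships the snakemake.remote modules (they were replaced by storage plugins in Snakemake 8). The credentials, bucket name, rule name, and output path are placeholders for illustration, not values from this workshop:

```python
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider  # Import provider module

# Initialize provider object; both credentials are hypothetical placeholders
S3 = S3RemoteProvider(access_key_id="MYACCESSKEY", secret_access_key="MYSECRET")

rule remote_s3_input_rule:
    input:
        S3.remote("bucket-name/path/to/document.txt")  # Path starts with the (placeholder) bucket name
    output:
        'results/output_s3.txt'
    shell:
        'grep "Snakemake" {input} > {output}'
```

Inside the shell command, {input} still expands to the path of the local copy, so the rule body does not need to know that the file came from a remote source.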

When working with remote files, Snakemake downloads a local copy of the file to the current directory, in a folder named after the remote access protocol (e.g. http for HTTP remote files). By default, the local copy is deleted after the job requiring it is completed. To keep a local copy of the remote file, use the parameter keep_local=True in the <provider>.remote() function.
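
As a minimal sketch, here is the HTTP rule from above with the local copy retained. Only the keep_local=True argument differs; the URL is the same placeholder as in the original example, and the snippet assumes a Snakemake release that still provides the snakemake.remote modules:

```python
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider  # Import provider module

HTTP = HTTPRemoteProvider()  # Initialize provider object

rule remote_input_rule:
    input:
        # keep_local=True asks Snakemake not to delete the downloaded copy after the job
        HTTP.remote("www.example.com/path/to/document.txt", keep_local=True)
    output:
        'results/output.txt'
    shell:
        'grep "Snakemake" {input} > {output}'
```

Keeping the local copy can be useful when inspecting the downloaded file or debugging a rule.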
