Configuration - CERIT-SC/funnel-gdi GitHub Wiki

Funnel ships with a built-in default configuration and can run without any external configuration. For most deployments, however, the defaults are not enough.

Configuration From YAML File

You can begin with the configuration provided in the default-config.yaml file. Create a local YAML file, remove the configuration parts you don't need, and modify the ones that need adjusting for your scenario (the following sections give an overview of the setup). Finally, pass the location of the configuration file to the funnel command using either the -c or --config flag:

./funnel -c my-config.yaml server run

The position of the flag is not important as long as it is followed by the file path: ./funnel server run -c my-config.yaml is also valid.

Configuration From Flags

Though not commonly used, there are also command-line flags that can override the configuration-file options. The list of flags is displayed with:

./funnel server run --help

Logger

This section of the configuration file fine-tunes the verbosity and format of Funnel's logging:

  • Level: verbosity as debug, info (default), warn, or error.
  • OutputFile: when non-empty, redirects logging to the specified file path. By default it is empty, and logs are written to the console through STDERR.
  • Formatter: when json is specified, logs are formatted as JSON records. Otherwise, log records are printed in a multi-line attribute-value format (which is not configurable).

Logger:
  Level: debug
  OutputFile: ""
  Formatter: json

Database

The database is declared through the Database field and a corresponding database-specific section. Possible values for the field:

Local file-based databases:

  • badger
  • boltdb (default)

Locally deployable databases:

  • elastic (Elasticsearch)
  • mongodb (MongoDB)

Cloud services:

  • datastore (Google Cloud Datastore)
  • dynamodb (Amazon DynamoDB)

Funnel uses the database for storing tasks and logs. The data size depends on the usage activity. Per task, the data size should be quite modest.

Depending on your choice, also review and configure the corresponding database-specific section, and remove the other database sections from the configuration file.

For local databases, make sure that the configured paths refer to mounted volumes, for example under the work directory; otherwise the data is lost when the Funnel container is removed or upgraded.
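As a sketch, assuming the host directory /opt/funnel is mounted into the container (e.g. docker run -v /opt/funnel:/opt/funnel ...), a persistent Badger setup could look like:

```yaml
Database: badger
Badger:
  # Path under a mounted volume, so the data survives container removal/upgrade
  Path: /opt/funnel/badger.db
```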

Storages (Input And Output)

Funnel comes with support for many storage backends, which are enabled by default. Therefore, to disable a storage it is not enough to remove its configuration section: you need to explicitly set Disabled: true.
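For example, to explicitly disable the FTP storage (a minimal sketch; repeat the pattern for any backend you don't need):

```yaml
FTPStorage:
  Disabled: true
```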

Local Storage (File-System)

Specify where task executors can read input-files and write exported output-files on the local file-system.

This is not to be confused with the files that task executors (e.g. containers) create and modify during execution, which are contained within the work directory.

LocalStorage:
  Disabled: false
  AllowedDirs:
    - /mnt/funnel-files/

This storage is used when a file URL uses the file protocol. For example: file:///mnt/funnel-files/project-x/specimen.dat.
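To illustrate, a TES task input referencing this storage might look like the following sketch (the URL and path are made-up examples; path is where the file appears inside the executor's container):

```json
{
  "inputs": [
    {
      "url": "file:///mnt/funnel-files/project-x/specimen.dat",
      "path": "/inputs/specimen.dat"
    }
  ]
}
```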

More details are available in the official Funnel documentation.

HTTP

Specify the timeout for retrieving a file over HTTP (using the GET method):

HTTPStorage:
  Disabled: false
  Timeout: 30s

This storage is used when a file URL uses the http or https protocol. For example: https://ftp.example.org/project-x/specimen.dat.

NOTE: credentials in the URL are not supported for initiating Basic authentication.

FTP

FTP can be used for fetching and uploading inputs and outputs of executors.

The provided URL may contain credentials for interacting with the service, for example ftp://user:[email protected]/project-x/specimen.dat. Without credentials in the URL, the user and password from the configuration (by default anonymous:anonymous) are used.

FTPStorage:
  Disabled: false
  Timeout: 30s
  User: anonymous
  Password: anonymous

This storage is used when a file URL uses the ftp or sftp protocol. For example: sftp://ftp.example.org/project-x/specimen.dat.

Sensitive Data Archive (SDA)

This archive can be used only for input files (i.e. it is a read-only storage). In addition, the user must be authenticated via Life Science AAI, so that SDA can use the access token to inspect the user's passport and visas and verify the user's permission to access the data.

To activate this storage, the URL to the SDA service (sda-download) is required, and the service must be available when Funnel is launched.

SDAStorage:
  ServiceURL: https://sda.example.org/
  Timeout: 30s

This storage is used when a file URL uses the sda protocol. For example: sda://DATASET_ID_001/specimen.dat.

Note that users need to rely on the SDA service defined in the Funnel configuration.

NOTE: the custom-developed SDA-plugin supports Crypt4gh encryption/decryption. When the reference file ends with .c4gh, the plugin also tries to decrypt the file so that the executor would not have to. Failure to decrypt the file results in the failure of the task.

HTSGET

This service filters source BAM/VCF files to reduce the amount of data to be downloaded. The first request to the HTSGET service therefore returns a JSON document describing the additional data requests that Funnel must send (usually to the data storage) to fetch the relevant parts of the target file.

To activate this storage, the URL to the HTSGET service is required, and the service must be available when Funnel is launched.

HTSGETStorage:
  ServiceURL: https://htsget.example.org/
  Timeout: 30s

This storage is used when a file URL uses the htsget protocol. Valid examples include:

  1. htsget://reads/DATASET_2000/synthetic-bam?class=header
  2. htsget://variants/DATASET_2000/synthetic-vcf?referenceName=chr20

Note that users need to rely on the HTSGET service defined in the Funnel configuration.

More information is available in the HTSGET specification.

NOTE: the custom-developed HTSGET-plugin supports Crypt4gh encryption/decryption. When the reference file ends with .c4gh, the plugin also tries to decrypt the file so that the executor would not have to. Failure to decrypt the file results in the failure of the task.

S3 And Other Cloud Storages

As these storages are well documented, we just reference them here:

  • AmazonS3 – Amazon S3 buckets
  • GenericS3 – other S3-compatible services (e.g. Ceph, MinIO)
  • Swift – OpenStack Swift

These storages are not explicitly disabled by default. Except for Amazon S3, they do not become active on their own, as they need to be configured with access credentials. Access to public Amazon S3 buckets is enabled by default; for example, the following is a valid Amazon S3 URL: s3://example-bucket/hello.txt.

For convenience, here is a configuration snippet for disabling these providers:

AmazonS3:
  Disabled: true

GenericS3: []

Swift:
  Disabled: true

NOTE: it is not possible to disable specific use-cases (reading or writing) of these storages. However, access can be restricted by:

  • S3 role permissions (the role is defined by the configured Key and Secret);
  • disabling S3 storages and forcing users to read files from public buckets over HTTP instead.

Computation

Only one computation method can be defined per Funnel instance. By default, it relies on Docker: the docker command is expected to be available on the system path, and the Docker daemon must be running.

Computation is defined as follows (local Docker in this example):

Compute: local

If you run Funnel from a container, be sure to share the Docker socket file with the container process:

docker run -v /var/run/docker.sock:/var/run/docker.sock ...

Container-based computation is also configurable, but usually the only reason to change it is to use a different command-line tool. Here is the default Docker-based setup for local computations:

Worker:
  Container:
    DriverCommand: docker
    RunCommand: >-
      run -i --read-only
      {{if .RemoveContainer}}--rm{{end}}
      {{range $k, $v := .Env}}--env {{$k}}={{$v}} {{end}}
      {{range $k, $v := .Tags}}--label {{$k}}={{$v}} {{end}}
      {{if .Name}}--name {{.Name}}{{end}}
      {{if .Workdir}}--workdir {{.Workdir}}{{end}}
      {{range .Volumes}}--volume {{.HostPath}}:{{.ContainerPath}}:{{if .Readonly}}ro{{else}}rw{{end}} {{end}}
      {{.Image}} {{.Command}}
    PullCommand: pull {{.Image}}
    StopCommand: rm -f {{.Name}}
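As an illustration, here is how one might switch to Podman (an untested sketch; Podman's CLI is largely Docker-compatible, but verify the subcommands and flags against your Podman version):

```yaml
Worker:
  Container:
    # Podman accepts the same run/pull/rm subcommands used in the template above
    DriverCommand: podman
```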

Alternative computation management options (valid values for Compute: local, htcondor, slurm, pbs, gridengine, manual, aws-batch):

  1. AWS Batch
  2. Grid Engine
  3. HTCondor
  4. PBS/Torque
  5. Slurm
  6. Cluster of Funnel Nodes

HTTP and gRPC Servers

Funnel runs both an HTTP-based Task Execution Service (TES) API and a gRPC-based API behind it. The APIs are defined in gRPC proto files:

  • tes.proto defines the TES API.
  • scheduler.proto defines endpoints for listing nodes (GET /v1/nodes and GET /v1/nodes/{id}).
  • events.proto defines internal API for managing computational events.

In terms of deployment, the gRPC API is relevant only if you plan to deploy a multi-node Funnel cluster: the events API is used for publishing events from nodes.

Therefore, just focus on configuring the HTTP server:

Server:
  ServiceName: tes.example.org
  HostName: tes.example.org
  HTTPPort: 8000
  RPCPort: 9090
  DisableHTTPCache: true

The last parameter controls whether HTTP caching is turned off for TES API responses. If you plan to use the web-based dashboard, true is the best value, as otherwise the browser may serve stale responses from its cache.

Access Control

Funnel, by default, allows anyone to use its API and dashboard UI. To introduce user-based access control, there are two options:

  1. define users in the configuration for HTTP Basic authentication;
  2. define an OIDC service in the configuration for delegating user authentication.

Finally, once user authentication is enabled, also define the preferred task-access mode for users. More details below.

Basic Authentication

Define users under the Server configuration. Optionally, a user can be marked as an administrator. Typically, users can see only their own tasks, while admins can see (and cancel) the tasks of all users.

Server:
  BasicAuth:
    - User: admin
      Password: admin-pass-example
      Admin: true
    - User: user1
      Password: user1-pass-example

If the list of users is empty, Basic authentication is NOT enforced.

Credentials must be passed to the Funnel API using the HTTP header:
Authorization: Basic encoded-credentials-here.
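The encoded credentials are the Base64 form of user:password. A quick sketch using user1's credentials from the example above:

```shell
# Build the HTTP Basic Authorization header value from "user:password"
creds=$(printf '%s' 'user1:user1-pass-example' | base64)
echo "Authorization: Basic $creds"
# prints: Authorization: Basic dXNlcjE6dXNlcjEtcGFzcy1leGFtcGxl
```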

OIDC Authentication

This assumes that there is an authentication service that supports the OIDC standard. The Funnel service instance must be registered at the OIDC provider. Typically, a redirect URL must be registered, and it should end with /login, for example: https://funnel.example.org/login. The OIDC provider provides the client ID and secret values, and a standard configuration URL. Register that information in the Funnel configuration file under the Server section:

Server:
  OidcAuth:
    ServiceConfigURL: https://www.example.org/oidc/.well-known/openid-configuration
    ClientId: to-be-copied-from-oidc
    ClientSecret: to-be-copied-from-oidc
    RequireScope: email
    RequireAudience:
    RedirectURL: https://funnel.example.org/login
    Admins:
      - [email protected]

Note that the RedirectURL must be valid and must match the one registered at the OIDC provider. RequireScope defines one or more space-separated authentication scope values. RequireAudience is optional and can be used to enforce that the user-presented access token was issued for the expected audience.

The Admins section is optional: it can be used to list user IDs (the sub claim) that are elevated to the admin role, just like with Basic authentication.

User's Access Token must be presented to the Funnel API using the HTTP header:
Authorization: Bearer access-token-here.
The dashboard stores the access-token in an HTTP cookie named jwt.

The expiration time of the cookie is determined by the OIDC service. Funnel does not invalidate the token, as that could break running tasks that use token-based authentication for accessing storage.

Task-Access Mode

Funnel provides the following options for the Server.TaskAccess option:

  • All (default) – all authenticated users can view and cancel all tasks;
  • Owner – tasks are visible only to the users who created them (admins are not privileged);
  • OwnerOrAdmin – extends Owner by allowing admin users to see and cancel everything.

The recommended option for most cases is OwnerOrAdmin.
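For example:

```yaml
Server:
  TaskAccess: OwnerOrAdmin
```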

Summary

Here is a simple configuration for running Funnel locally with Docker as the computation service.

Compute: local
Database: badger
Badger:
  Path: /opt/funnel/badger.db
Logger:
  Level: info
  Formatter: json
EventWriters:
  - log
  - badger

Server:
  ServiceName: tes.example.org
  HostName: tes.example.org
  HTTPPort: 8000
  RPCPort: 9090
  DisableHTTPCache: true
  TaskAccess: OwnerOrAdmin

  # Keep either BasicAuth or OidcAuth but not both.
  BasicAuth:
    - User: admin
      Password: admin-pass-example
      Admin: true

  OidcAuth:
    ServiceConfigURL: https://www.example.org/oidc/.well-known/openid-configuration
    ClientId: to-be-copied-from-oidc
    ClientSecret: to-be-copied-from-oidc
    RequireScope: email
    RequireAudience:
    RedirectURL: https://funnel.example.org/login
    Admins:
      - [email protected]

Worker:
  WorkDir: /opt/funnel/work

LocalStorage:
  Disabled: false
  AllowedDirs:
    - /mnt/funnel-files/

HTTPStorage:
  Disabled: false
  Timeout: 30s

HTSGETStorage:
  ServiceURL: https://htsget.example.org/
  Timeout: 30s

SDAStorage:
  ServiceURL: https://sda.example.org/
  Timeout: 30s

GenericS3: []
AmazonS3:
  Disabled: true