Architecture: graphql

sm-graphql

This project is organized into vertical slices of functionality, grouped by Business Domain. This is an intentional design choice to minimize the amount of layering and boilerplate that is common in web server projects. Most simple CRUD logic should be implemented directly inside GraphQL resolver functions.

sm-graphql starts several servers/processes, all managed in the server.js file:

  • Apollo GraphQL API server
  • Apollo GraphQL Subscription API server (WebSockets)
  • HTTP registration/login/etc REST API
  • HTTP "Storage Server" for raw optical image upload
  • (in the Storage Server) Uppy Companion server for signing upload URLs for direct dataset/molDB upload to S3
  • A scheduled "cron" job for sending reminder emails to users to publish their data if they have old private projects

Additionally, TypeORM runs any new database migrations on startup.

The GraphQL API can easily be explored at https://metaspace2020.eu/graphql (or your local dev server's equivalent). Set "request.credentials": "include" in the playground settings and it will use your login details from the browser cookies.

Security

Almost all security-related logic happens in sm-graphql:

  • User creation/login is handled by the REST API in src/modules/auth/controller.ts
  • Authentication is handled by Passport middleware based on each request's cookies/JWTs/Api-Keys
  • Authorization needs to be handled explicitly in GraphQL resolver functions. This is usually done when retrieving the entity, e.g.:
    • The ElasticSearch queries in esConnector.ts filter datasets/annotations to only include results that should be visible to the current user. Some controllers will even query and discard the result just to check that the dataset is allowed to be accessed.
    • When there are multiple different levels of access privilege, it should be explicit in the function names, e.g. getDatasetForEditing, which will raise an exception if the user isn't allowed to edit the dataset (see the sketch after this list).
    • Operations that call sm-api must still handle authorization! sm-api doesn't do any authorization itself.
  • As an optimization, some resolvers pass authorization information to their child resolvers through a scopeRole field - see the "Scope roles" section under Code Patterns below.
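
For illustration, here's a hedged sketch of what an explicit authorization check in a mutation resolver might look like. getDatasetForEditing is the helper named above; its exact signature, the mutation name, the ctx.user property and the update logic are assumptions, not the actual METASPACE code:

// Sketch only: fetch the entity through a privilege-specific helper, which throws
// if the current user isn't allowed to edit it, before touching any data.
const MutationResolvers = {
  async renameDataset(source: unknown, args: { id: string, name: string }, ctx: Context): Promise<boolean> {
    const dataset = await getDatasetForEditing(ctx.entityManager, ctx.user, args.id) // throws if not editable
    await ctx.entityManager.update(DatasetModel, dataset.id, { name: args.name })    // hypothetical update
    return true
  },
}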

Authentication methods

Cookie

Managed by the Passport library, this works like every other website: the cookie content includes a signed session ID, and the actual session data is stored in Redis. Cookies are the primary authentication mechanism - (non-anonymous) JWTs and Api-Keys can only be generated by a user authenticated with a cookie.

The cookie is the same whether a user logs in with Google or Email+Password.

JWT

GraphQL requests from webapp use a JWT for authentication. This isn't really needed anymore - previously webapp and graphql were separate and webapp handled authentication itself. It just hasn't been worth the work to clean up - getting access to the cookies in the GraphQL Subscription Server has been difficult or impossible in the past, though the subscription server library has probably fixed that by now.

Python Client also uses JWTs if Email+Password authentication is used. For Api-Key authentication, the JWT isn't needed.

Api-Key

API Keys use a similar authentication code path to JWTs, but have significant restrictions (only specific mutations are allowed, some queries are blocked, all usages are logged) to limit the impact if they're leaked. They're intended for use with the Python Client.

Project review link

The project publication workflow allows a user to create a share link for a project. Anyone who accesses this link is allowed to see the datasets in the project - the authorization details are persisted in the user's session, even if they're not logged in.

Email validation link

Not intended to be used as an ongoing login mechanism, but for new users' convenience, clicking the email validation link will give them a logged-in cookie for up to 1 hour after account creation. From a security perspective, this technically counts as an authentication method.

Code Patterns

Bindings, Sources and Models

It's common for one business entity to have multiple representations at the various interfaces. These follow a naming convention:

import { Dataset } from '../../../binding' // The "Binding" is the GraphQL schema type - no suffix is used
import { DatasetSource } from '../../../bindingTypes' // The "Source" is the type returned by resolvers, which may have additional fields for internal use e.g. authentication data
import { Dataset as DatasetModel } from '../model' // The "Model" is the TypeORM DB Entity class

Scope roles

When a parent resolver needs to share authentication data with one or more child resolvers, it's done through a "ScopeRole" field. E.g. only group members may see a user's email address, but the group member status is easiest to query when selecting the user. Resolvers like Group.members include a scopeRole field in their returned "source" object, which is eventually used in User.email to check if the email should be visible.
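
A rough sketch of the pattern, assuming hypothetical helper functions and illustrative scopeRole values (not the exact METASPACE definitions):

// Illustrative sketch only - helper functions and scopeRole values are approximations.
const GroupResolvers = {
  async members(group: GroupSource, args: any, ctx: Context): Promise<UserSource[]> {
    const viewerIsMember = await isGroupMember(ctx, group.id) // hypothetical helper
    const members = await findGroupMembers(ctx, group.id)     // hypothetical helper
    // Attach the viewer's relationship to each returned "source" object
    return members.map(user => ({ ...user, scopeRole: viewerIsMember ? 'GROUP_MEMBER' : 'OTHER' }))
  },
}

const UserResolvers = {
  email(user: UserSource): string | null {
    // The child resolver only has to inspect the scopeRole attached by its parent
    return user.scopeRole === 'GROUP_MEMBER' ? user.email : null
  },
}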

GraphQL Resolvers

GraphQL queries are executed hierarchically, e.g. for this query:

query {
  currentUser {
    id
    email
    projects {
      role
      project { id name }
    }
  }
}

the Apollo GraphQL server performs these operations:

  1. Call the Query.currentUser resolver (which returns a UserModel object). The GraphQL Schema defines the return value as User, so nested fields use User resolvers.
  2. Check for a User.id resolver - there is none, so the raw id value from UserModel is used instead
  3. Check for a User.email resolver - it exists, so it's called with the UserModel object as the source parameter
  4. Check for a User.projects resolver - it exists, so it's called with the UserModel object as the source. It returns an array of UserProjectModel objects. The GraphQL return type is [UserProject], so it's handled as an array and the nested fields use UserProject resolvers
  5. For each UserProjectModel object in User.projects' return value:
    1. Check for a UserProject.role resolver - it doesn't exist, so the role field from UserProjectModel is used
    2. Check for a UserProject.project resolver - it exists, so it's called with the UserProjectModel as the source. It returns a ProjectModel object, and the schema says the GraphQL return type is Project
    3. Check for a Project.id resolver - it doesn't exist, so the id field from the ProjectModel is used
    4. Check for a Project.name resolver - it doesn't exist, so the name field from the ProjectModel is used

This approach allows API consumers (i.e. webapp and python-client) to specify exactly what data they need. Well-written GraphQL resolvers are extremely flexible and often don't need any server-side code changes as the client-side application evolves.

The biggest drawback is that the hierarchical method calling makes it very easy to hit the "SELECT N+1" problem, e.g. in the above query the UserProject.project resolver is called for every project - if there are 10 projects and UserProject.project contains an SQL query, then 10 SQL queries will be issued. There's an easy solution though - see the Caching/Dataloaders section.
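
For illustration, a resolver shape like the following triggers the problem (a hedged sketch - the projectId field, model names and TypeORM call are assumptions, not the actual METASPACE code):

// Sketch only: this resolver runs once per UserProject in the parent's result,
// so resolving 10 projects issues 10 separate SQL queries.
const UserProjectResolvers = {
  async project(userProject: UserProjectModel, args: any, ctx: Context): Promise<ProjectModel | null> {
    return (await ctx.entityManager.findOne(ProjectModel, { where: { id: userProject.projectId } })) ?? null
  },
}

Replacing the per-item lookup with a per-request DataLoader (shown in the Caching and DataLoaders section) batches all of those IDs into a single query.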

Resolver type definitions

The type annotations around resolvers are a bit janky because we want to enforce that the code is type-safe against the .graphql definitions, but "graphql-binding" was the only .graphql-to-TypeScript interface compiler available when this was written. graphql-binding has many shortcomings in the generated types, and updated versions of the library don't seem to support our use-case, so we're currently stuck with it. If you want to fix this, there's an idea task that suggests a newer library that looks suitable. It's also viable to just write our own compiler, as GraphQL schemas are actually very simple to process if you use the programmatic API.
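
To give a flavour of that, here's a minimal sketch using the graphql-js programmatic API to walk a schema's types - only a demonstration of how approachable the API is, not a binding generator:

import { buildSchema, isObjectType } from 'graphql'

// Parse some SDL and print each object type with its fields.
// A real replacement for graphql-binding would emit TypeScript declarations instead.
const schema = buildSchema(`
  type Query { currentUser: User }
  type User { id: ID! email: String }
`)

for (const type of Object.values(schema.getTypeMap())) {
  if (isObjectType(type) && !type.name.startsWith('__')) {
    const fields = Object.values(type.getFields())
      .map(field => `${field.name}: ${String(field.type)}`)
    console.log(`${type.name} { ${fields.join(' ')} }`)
  }
}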

The compiled bindings are stored in graphql/src/binding.ts and are generated by yarn run gen-binding, which runs automatically when the Docker container starts or detects changes to .graphql files.

Here's an example resolver definition:

const ProjectResolvers: FieldResolversFor<Project, ProjectSource> = {
  async hasPendingRequest(project: ProjectSource, args: any, ctx: Context, info: GraphQLResolveInfo): Promise<boolean | null> {
    // ...
  },
}

The FieldResolversFor<TBinding, TSource> type allows TypeScript to enforce that the contained functions are loosely compatible with the GraphQL schema.

  • TBinding (Project from binding.ts in this case) is used to ensure the args and return type of each function matches the GraphQL schema
  • TSource (ProjectSource in this case) is used for type-checking that all resolvers handle the "Source" (first argument) correctly

The "resolver function" (hasPendingRequest) accepts up to 4 arguments:

  • source (called project in this case) - see TSource explanation above
  • args - arguments to this resolver. This only contains a value if the GraphQL schema includes arguments for this resolver. It's mostly used for Queries/Mutations, but some fields also use it, e.g. Annotation.colocalizationCoeff. NOTE: args has the most problems with bad types generated in binding.ts - e.g. String is translated to string[]|string and ID is translated to string|number. Often it's better to manually write type definitions for this argument (see the sketch after this list).
  • ctx - the "Context" object for the current request. It's shared between all resolver calls in this request (allowing it to be used for caching), and includes useful stuff like the entityManager connection to the database, and the user details.
  • info - almost never needed. It contains metadata about the whole GraphQL query, including which sub-fields will be resolved from the current resolver's returned object.
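
For example, rather than using the generated args type, a resolver can declare its own interface for its arguments. A hedged sketch (the argument names, the AnnotationSource type and the helper are illustrative, not the actual Annotation.colocalizationCoeff definition):

// Sketch: a hand-written args interface is often clearer than the loose types
// generated into binding.ts (e.g. string|number for ID arguments).
interface ColocalizationCoeffArgs {
  colocalizationAlgo: string | null
  databaseId: number
}

const AnnotationResolvers = {
  async colocalizationCoeff(annotation: AnnotationSource, args: ColocalizationCoeffArgs, ctx: Context): Promise<number | null> {
    // args.databaseId is now a plain number rather than string|number
    return await getColocalizationCoeff(ctx, annotation, args) // hypothetical helper
  },
}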

Caching and DataLoaders

It's common for GraphQL resolvers within the same query to need access to the same data, but it's not easy to just select the data once and pass it to the functions that need it because resolvers are so isolated. Context manages a cache that's specific to the current request to help with these cases, e.g.

const getMolecularDbById = async(ctx: Context, dbId: number): Promise<MolecularDB> => {
  return await ctx.contextCacheGet('getMolecularDbById', [dbId],
    (dbId: number) =>
      ctx.entityManager.getCustomRepository(MolecularDbRepository)
        .findDatabaseById(ctx, dbId)
  )
}

This ensures that no matter how many times the outer getMolecularDbById is called during the request, the inner findDatabaseById function will only be called once per value of dbId.

Virtually all data in graphql is dependent on the current user's permissions, so no global caches have been set up.

DataLoaders can combine many independent function calls into a single call that receives an array of the individual calls' parameters, allowing optimizations such as using a single SQL query to fetch many rows by ID. The DataLoader library's documentation has a good explanation of the pattern. Use Context.contextCacheGet to create a single instance of the DataLoader for each request, e.g.

const getDbDatasetById = async(ctx: Context, id: string): Promise<DbDataset | null> => {
  const dataloader = ctx.contextCacheGet('getDbDatasetByIdDataLoader', [], () => {
    return new DataLoader(async(datasetIds: string[]): Promise<any[]> => {
      const results = await ctx.entityManager.query('SELECT...', [datasetIds])
      const keyedResults = _.keyBy(results, 'id')
      return datasetIds.map(id => keyedResults[id] || null)
    })
  })
  return await dataloader.load(id)
}

Points of interest:

src/modules/auth

Contains authentication middleware and a non-GraphQL REST API for registration, login, JWT issuing, etc.

src/modules/webServer

Contains the Storage Server and code to run Uppy Companion

schemas

Contains the GraphQL schema files. These are compiled by Apollo into a single schema at runtime.

Webapp's tests also use a compiled version of these schema files so that they can run a mock GraphQL server for the tests to call. The schema is kept in webapp/tests/utils/graphql-schema.json (not stored in Git) and is generated by running yarn run gen-graphql-schema in the graphql project. Webapp automatically calls this as part of yarn run test.

metadataSchemas

The Dataset Upload page was originally planned to be much more dynamic, holding different fields of metadata for different dataset types, different projects, etc. A JSON schema was developed for configuring this form. We didn't need the dynamic configurability in the end, so the schema very rarely changes. However, many parts of the upload page are still dynamically built based on the JSON schema.

These files are usually generated on container startup as part of the deref-schema script in package.json. You can rebuild them manually with yarn run deref-schema.
