Architecture: graphql
This project is organized into vertical slices of functionality, grouped by Business Domain. This is an intentional design choice to minimize the amount of layering and boilerplate that is common in web server projects. Most simple CRUD logic should be implemented directly inside GraphQL resolver functions.
sm-graphql starts several servers/processes, all managed in the server.js file:
- Apollo GraphQL API server
- Apollo GraphQL Subscription API server (WebSockets)
- HTTP registration/login/etc REST API
- HTTP "Storage Server" for raw optical image upload
- (in the Storage Server) Uppy Companion server for signing upload URLs for direct dataset/molDB upload to S3
- A scheduled "cron" job for sending reminder emails to users to publish their data if they have old private projects
Additionally, TypeORM runs any new database migrations on startup.
The GraphQL API can be easily explored at https://metaspace2020.eu/graphql (or your local dev server equivalent). Set `"request.credentials": "include"` in the settings and it will use your login details from the browser cookies.
Almost all security-related logic happens in sm-graphql:
- User creation/login is handled by the REST API in `src/modules/auth/controller.ts`
- Authentication is handled by Passport middleware based on each request's cookies/JWTs/Api-Keys
- Authorization needs to be handled explicitly in GraphQL resolver functions. This is usually done when retrieving the entity (see the sketch after this list), e.g.:
  - The ElasticSearch queries in `esConnector.ts` filter datasets/annotations to only include results that should be visible to the current user. Some controllers will even query and discard the result just to check that the dataset is allowed to be accessed.
  - When there are multiple different levels of access privilege, it should be explicit in the function names, e.g. `getDatasetForEditing`, which will raise an exception if the user isn't allowed to edit the dataset.
  - Operations that call sm-api must still handle authorization! sm-api doesn't do any authorization itself.
- As an optimization, some resolvers pass authorization information to their child resolvers through a `scopeRole` field (see the ScopeRole explanation below).
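A rough sketch of the "authorize while retrieving the entity" pattern (the mutation name, import paths and the exact signature of `getDatasetForEditing` are assumptions for illustration, not the project's actual code):

```typescript
import { Context } from '../../context'                                   // assumed path
import { Dataset as DatasetModel } from '../model'                        // assumed path
import { getDatasetForEditing } from './operation/getDatasetForEditing'   // assumed path

const MutationResolvers = {
  async renameDataset(source: unknown, args: { datasetId: string, name: string }, ctx: Context) {
    // Authorization happens while retrieving the entity - this call is assumed to
    // throw if the current user isn't allowed to edit the dataset.
    const dataset = await getDatasetForEditing(ctx.entityManager, ctx.user, args.datasetId)

    // The caller is now known to have edit rights, so the (hypothetical) update can proceed.
    await ctx.entityManager.update(DatasetModel, dataset.id, { name: args.name })
    return dataset
  },
}
```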
Cookie sessions are managed by the Passport library and work like on every other website: the cookie content includes a signed session ID, and the actual session data is stored in Redis. Cookies are the primary authentication mechanism - (non-anonymous) JWTs and Api-Keys can only be generated by a user authenticated with a cookie.
The cookie is the same whether a user logs in with Google or Email+Password.
GraphQL requests from webapp use a JWT for authentication. This isn't really needed anymore - previously webapp and graphql were separate and webapp handled authentication. It's just more work to clean up - getting access to the cookies in the GraphQL Subscription Server has been difficult/impossible in the past, though the subscription server library has probably fixed that by now.
Python Client also uses JWTs if Email+Password authentication is used. For Api-Key authentication, the JWT isn't needed.
API Keys use a similar authentication code path to JWTs, but have significant restrictions (only specific mutations are allowed, some queries are blocked, all usages are logged) to limit the impact if they're leaked. They're intended for use with the Python Client.
The project publication workflow allows a user to create a share link for a project. Anyone who accesses this link is allowed to see the datasets in that project - the authorization details are persisted in the visitor's session, even if they're not logged in.
Email validation links aren't intended to be used continually as a login method, but for new users' convenience, clicking the email validation link will give them a logged-in cookie up to 1 hour after account creation. This technically counts as an authentication method from a security perspective.
It's common for one business entity to have multiple representations at the various interfaces. These follow a naming convention:
```typescript
import { Dataset } from '../../../binding'            // The "Binding" is the GraphQL schema type - no suffix is used
import { DatasetSource } from '../../../bindingTypes' // The "Source" is the type returned by resolvers, which may have additional fields for internal use, e.g. authentication data
import { Dataset as DatasetModel } from '../model'    // The "Model" is the TypeORM DB Entity class
```

When a parent resolver needs to share authentication data with one or more child resolvers, it's done through a "ScopeRole" field. E.g. only group members may see a user's email address, but the group-member status is easiest to query when selecting the user. Resolvers like `Group.members` include a `scopeRole` field in their returned "source" object, which is eventually used in `User.email` to check whether the email should be visible.
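A condensed sketch of that flow (only the `Group.members`, `User.email` and `scopeRole` names come from the real code - the scope-role value, import paths and query details are assumptions for illustration):

```typescript
import { Group, User } from '../../../binding'
import { FieldResolversFor, GroupSource, UserSource } from '../../../bindingTypes' // assumed location
import { User as UserModel } from '../model'                                       // assumed model import

const GroupResolvers: FieldResolversFor<Group, GroupSource> = {
  async members(group, args, ctx) {
    const users = await ctx.entityManager.find(UserModel, { where: { groupId: group.id } }) // hypothetical query
    // Attach the caller's relationship to the group so that child resolvers can use it
    return users.map(user => ({ ...user, scopeRole: 'GROUP_MEMBER' }))                      // value is an assumption
  },
}

const UserResolvers: FieldResolversFor<User, UserSource> = {
  email(user, args, ctx) {
    // Only expose the email address to group members (or to the user themselves)
    const visible = user.scopeRole === 'GROUP_MEMBER' || ctx.user?.id === user.id
    return visible ? user.email : null
  },
}
```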
GraphQL queries are executed hierarchically, e.g. for this query:
```graphql
query {
  currentUser {
    id
    email
    projects {
      role
      project { id name }
    }
  }
}
```

The Apollo GraphQL server performs these operations:
- Call the `Query.currentUser` resolver (which returns a `UserModel` object). The GraphQL schema defines the return value as `User`, so nested fields use `User` resolvers.
- Check for a `User.id` resolver - there is none, so the raw `id` value from `UserModel` is used instead
- Check for a `User.email` resolver - it exists, so it's called with the `UserModel` object as the `source` parameter
- Check for a `User.projects` resolver - it exists, so it's called with the `UserModel` object as the `source`. It returns an array of `UserProjectModel` objects. The GraphQL return type is `[UserProject]`, so it's handled as an array and the nested fields use `UserProject` resolvers
- For each `UserProjectModel` object in `User.projects`' return value:
  - Check for a `UserProject.role` resolver - it doesn't exist, so `role` from `UserProjectModel` is used
  - Check for a `UserProject.project` resolver - it exists, so it's called with the `UserProjectModel` as the `source`. It returns a `ProjectModel` object, and the schema says the GraphQL return type is `Project`
    - Check for a `Project.id` resolver - it doesn't exist, so the `id` field from the `ProjectModel` is used
    - Check for a `Project.name` resolver - it doesn't exist, so the `name` field from the `ProjectModel` is used
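In resolver code, that roughly corresponds to the following (a simplified sketch - resolver bodies, import paths and query details are assumptions; only the existence or absence of each resolver reflects the walkthrough above):

```typescript
import { User, UserProject } from '../../../binding'
import { FieldResolversFor } from '../../../bindingTypes'                                              // assumed location
import { User as UserModel, UserProject as UserProjectModel, Project as ProjectModel } from '../model' // assumed imports
import { Context } from '../../context'                                                                // assumed path

const QueryResolvers = {
  async currentUser(source: unknown, args: {}, ctx: Context): Promise<UserModel | null> {
    return ctx.user ? await ctx.entityManager.findOne(UserModel, ctx.user.id) : null
  },
}

const UserResolvers: FieldResolversFor<User, UserModel> = {
  // No `id` resolver - UserModel.id is used automatically
  email(user, args, ctx) { /* visibility checks omitted */ return user.email },
  async projects(user, args, ctx): Promise<UserProjectModel[]> {
    return await ctx.entityManager.find(UserProjectModel, { where: { userId: user.id } })
  },
}

const UserProjectResolvers: FieldResolversFor<UserProject, UserProjectModel> = {
  // No `role` resolver - UserProjectModel.role is used automatically
  async project(userProject, args, ctx): Promise<ProjectModel> {
    return await ctx.entityManager.findOneOrFail(ProjectModel, userProject.projectId)
  },
}
// No Project.id / Project.name resolvers - the ProjectModel fields are used directly
```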
This approach allows API consumers (i.e. webapp and python-client) to specify exactly what data they need. Well-written GraphQL resolvers are extremely flexible and often don't need any server-side code changes as the client-side application evolves.
The biggest drawback is that the hierarchical method calling makes it very easy to hit the "SELECT N+1" problem, e.g. in the above query the `UserProject.project` resolver is called for every project - if there are 10 projects and `UserProject.project` contains an SQL query, then 10 SQL queries will be issued. There's an easy solution though - see the Caching/DataLoaders section.
The type annotations around resolvers are a bit janky because we want to enforce that the code is type-safe against the .graphql definitions, but "graphql-binding" was the only .graphql-to-TypeScript interface compiler available when this was written. graphql-binding has many shortcomings in the generated types, and updated versions of the library don't seem to support our use-case, so we're currently stuck with it. If you want to fix this, there's an idea task that suggests a newer library that looks suitable. It's also viable to just write our own compiler, as GraphQL schemas are actually very simple to process if you use the programmatic API.
The compiled bindings are stored in `graphql/src/binding.ts` and are generated by `yarn run gen-binding`, which runs automatically when the Docker container starts or when it detects changes to .graphql files.
Here's an example resolver definition:
```typescript
const ProjectResolvers: FieldResolversFor<Project, ProjectSource> = {
  async hasPendingRequest(project: ProjectSource, args: any, ctx: Context, info: GraphQLResolveInfo): Promise<boolean | null> {
    ...
  },
}
```

The `FieldResolversFor<TBinding, TSource>` type allows TypeScript to enforce that the contained functions are loosely compatible with the GraphQL schema:
- `TBinding` (`Project` from `binding.ts` in this case) is used to ensure the `args` and return type of each function match the GraphQL schema
- `TSource` (`ProjectSource` in this case) is used for type-checking that all resolvers handle the "Source" (first argument) correctly
The "resolver function" (hasPendingRequest) accepts up to 4 arguments:
- `source` (called `project` in this case) - see the `TSource` explanation above
- `args` - arguments to this resolver. This only contains a value if the GraphQL schema includes arguments for this resolver. It's mostly used for Queries/Mutations, but some fields also use it, e.g. `Annotation.colocalizationCoeff`. NOTE: `args` has the most problems with bad types generated in `binding.ts` - e.g. `String` is translated to `string[]|string` and `ID` is translated to `string|number`. Often it's better to manually write type definitions for this argument (see the sketch after this list).
- `ctx` - the "Context" object for the current request. It's shared between all resolver calls in this request (allowing it to be used for caching), and includes useful stuff like the `entityManager` connection to the database and the `user` details.
- `info` - almost never needed. It contains metadata about the whole GraphQL query, including which sub-fields will be resolved from the current resolver's returned object.
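For example, a hand-written `args` type might look like this (a sketch - the filter field names and import paths are assumptions, not the real schema):

```typescript
import { Annotation } from '../../../binding'
import { AnnotationSource, FieldResolversFor } from '../../../bindingTypes' // assumed location
import { Context } from '../../context'                                     // assumed path

// Hand-written argument types are usually more precise than the ones generated in binding.ts
interface ColocalizationCoeffArgs {
  colocalizationCoeffFilter: {
    colocalizedWith: string
    fdrLevel: number | null
  } | null
}

const AnnotationResolvers: FieldResolversFor<Annotation, AnnotationSource> = {
  colocalizationCoeff(annotation, args: ColocalizationCoeffArgs, ctx: Context): number | null {
    if (args.colocalizationCoeffFilter == null) {
      return null
    }
    // ... look up the coefficient using the precisely-typed filter ...
    return null
  },
}
```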
It's common for GraphQL resolvers within the same query to need access to the same data, but it's not easy to just select the data once and pass it to the functions that need it because resolvers are so isolated. Context manages a cache that's specific to the current request to help with these cases, e.g.
```typescript
const getMolecularDbById = async (ctx: Context, dbId: number): Promise<MolecularDB> => {
  return await ctx.contextCacheGet('getMolecularDbById', [dbId],
    (dbId: number) =>
      ctx.entityManager.getCustomRepository(MolecularDbRepository)
        .findDatabaseById(ctx, dbId)
  )
}
```

This ensures that no matter how many times the outer `getMolecularDbById` is called during the request, the inner `findDatabaseById` function will only be called once per value of `dbId`.
Virtually all data in graphql is dependent on the current user's permissions, so no global caches have been set up.
DataLoaders can combine many independent function calls into a single function call that receives an array of the calls' parameters, allowing optimizations such as using a single SQL query to get many rows by ID. Use `Context.contextCacheGet` to create a single instance of the DataLoader for each request, e.g.
```typescript
const getDbDatasetById = async (ctx: Context, id: string): Promise<DbDataset | null> => {
  const dataloader = ctx.contextCacheGet('getDbDatasetByIdDataLoader', [], () => {
    return new DataLoader(async (datasetIds: string[]): Promise<any[]> => {
      const results = await ctx.entityManager.query('SELECT...', [datasetIds])
      const keyedResults = _.keyBy(results, 'id')
      return datasetIds.map(id => keyedResults[id] || null)
    })
  })
  return await dataloader.load(id)
}
```

The auth module (`src/modules/auth`) contains authentication middleware and a non-GraphQL REST API for registration, login, JWT issuing, etc.
A separate module contains the Storage Server and the code to run Uppy Companion.
The GraphQL schema is defined in .graphql schema files, which Apollo compiles into a single schema at runtime.
Webapp's tests also use a compiled version of these schema files so that they can run a mock graphql server for the tests to call. The schema is kept in `webapp/tests/utils/graphql-schema.json` (not stored in Git) and is generated by running `yarn run gen-graphql-schema` in the graphql project. Webapp automatically calls this as part of `yarn run test`.
The Dataset Upload page was originally planned to be much more dynamic, holding different fields of metadata for different dataset types, different projects, etc. A JSON schema was developed for configuring this form. We didn't need the dynamic configurability in the end, so the schema very rarely changes. However, many parts of the upload page are still dynamically built based on the JSON schema.
These files are usually generated on container startup as part of the `deref-schema` script in `package.json`. You can rebuild them manually with `yarn run deref-schema`.