Sensitive data - inbo/vlaams-biodiversiteitsportaal GitHub Wiki

## Context

One of the main biodiversity data providers for the Flemish Government does not want their data to become publicly available at the highest resolution.
Therefore, in order for us to still make use of their data, we need a way to show the data in high resolution only to authorised users.
The general public should only be able to see the data in low resolution.

An important point to note is that users must be able to really work with the high-resolution data through the portal; they already have ways of downloading the data locally.
An important reason for having a portal of our own is to allow employees and other government agents to use biodiversity data without having to do their own data wrangling / processing.

## Examples of user permissions on Solr data

https://doc.sitecore.com/xp/en/developers/latest/sitecore-experience-manager/use-permissions-for-search.html#use-permissions-when-you-search-from-the-api

## Approaches

### Existing support (Australia and UK)

The Living Atlas already has some support for sensitive data.
During data processing, a combination of species, data-resources and locations can be marked as sensitive.
The pipeline will then either generalise or remove the high-resolution data.

The high-resolution data is still retained in separate fields on the same occurrence records in the Solr DB, but it is not shown by default, nor used in any of the queries or facets available to the user.
The sensitive data is only made available when viewing individual occurrences, or when downloading the data as a DwC archive.

Technically this works by associating certain roles with a "sensitiveFq", e.g.:

```
sensitive:"generalised" AND (cl927:"Australian Capital Territory" OR cl927:"Jervis Bay Territory") AND -(dataResourceUid:dr359 OR dataResourceUid:dr571 OR dataResourceUid:dr570)
```

When downloading data, this sensitiveFq is used to split the single query into two:
one returning the sensitive data, and the other returning the non-sensitive data.
The results are then merged together.
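As a sketch of this split (not the actual biocache-service code; the function name and query composition are assumptions for illustration), the two download queries can be derived from the user's query and the role's sensitiveFq like so:

```python
def split_download_query(user_query: str, sensitive_fq: str) -> tuple[str, str]:
    """Hypothetical sketch: derive the two download queries.

    The first query returns the records the user may see in high
    resolution; the second returns the remaining (generalised) records.
    The results of both are merged afterwards.
    """
    sensitive = f"({user_query}) AND ({sensitive_fq})"
    non_sensitive = f"({user_query}) AND NOT ({sensitive_fq})"
    return sensitive, non_sensitive

# Example with a deliberately simplified sensitiveFq:
hi_res, generalised = split_download_query("genus:Lutra", 'sensitive:"generalised"')
```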

When viewing individual occurrences, the sensitiveFq is used to perform a first query, and only if this comes back empty
is a second query without the fq performed to fetch the generalised data.

| Pro 👍 | Con 👎 |
| --- | --- |
| Already developed and in use | Performance? The number of queries doubles for users with sensitive-data roles (when they don't have access to that particular record). |
| Not much duplication of data | Only works for viewing individual occurrences and for downloads. I do not think this method can work with the normal querying functionality without breaking sorting / pagination / etc. Typically those require a single query, so the DB can make sure they are handled correctly. |
| High flexibility in specifying user permissions: the sensitiveFq can be made as complex as you want | The existing sensitive-data generalisation does not support things like all records in a data-resource, or custom grids (UTM, etc.) |

### Separate Solr collections

We could simply run two copies of the Solr collection,
one containing the sensitive data and one containing the non-sensitive data.
The biocache-service instance serving the sensitive data would simply require an additional role,
which is easy to do with the current config.

Data ingestion would run as normal without any changes, but would replicate the data into a second collection. For the sensitive data, we would additionally run the pipelines with a different configuration,
making sure one collection only receives the generalised data while the other retains the high-resolution data.
Possibly we could accomplish the same without duplicating the normal / non-sensitive data by using collection aliases,
but I am not sure how that would work in practice.

Next we would modify the biocache-service to select the correct collection based on the user role.
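The selection logic would be tiny; a minimal sketch, assuming a hypothetical role name and collection names (none of these identifiers come from the actual config):

```python
# Assumed role and collection names, for illustration only.
SENSITIVE_ROLE = "ROLE_SENSITIVE_DATA"

def select_collection(user_roles: set[str]) -> str:
    """Route queries to the sensitive or the public Solr collection
    based on the user's roles."""
    if SENSITIVE_ROLE in user_roles:
        return "biocache-sensitive"
    return "biocache"
```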

| Pro 👍 | Con 👎 |
| --- | --- |
| Minimal changes to the current code: only the biocache-service needs to be modified to select the correct collection based on the user's role | Very coarse-grained access control: all or nothing |
| All queries throughout the platform will work transparently | Duplicate Solr indexes can be very expensive in terms of storage and indexing time (indexing is currently the longest part of the pipeline, ~4 hours) |

### Separate Solr collections without duplication

As a possible alternative to the above approach, we could have separate Solr collections for the sensitive data, but without copies of the non-sensitive data.

That way we avoid duplication, but it becomes impossible to query across sensitive and non-sensitive data at the same time.
We would therefore have to provide some custom front-end tooling to allow users to select which collection they want to query.

| Pro 👍 | Con 👎 |
| --- | --- |
| No more expensive duplication | Cannot query across sensitive and non-sensitive data at the same time |
| Can have multiple different sets of data / roles | |

### Store high-resolution data in separate fields (France??)

Somewhat similar to the existing sensitive-data approach, but instead you could have many different fields for different roles, locations, etc. The normal fields would contain the generalised data, while the high-resolution data would be retained in separate fields. Queries would then be modified to use the correct fields based on the user's role. This modification could be performed automatically, or when a user manually chooses it.

| Pro 👍 | Con 👎 |
| --- | --- |
| No expensive duplication | Cannot query across sensitive and non-sensitive data at the same time |
| Can have multiple different sets of data / roles | Potentially complicated logic to modify all queries to use the correct fields |

### Add allowed roles to the Solr index

We can also add the roles allowed to see a record to the occurrence data stored inside Solr.
That way, we can always use the correct version of the data for a given user's roles.

This works by always adding an fq to all queries, relying on some three-valued logic to select the right versions.
The fq looks like this:

```
(*:* NOT dynamicProperties_rbac:*)
  OR (dynamicProperties_rbac:true AND dynamicProperties_rbac_allowed:(<user_roles>))
  OR (dynamicProperties_rbac:false AND !dynamicProperties_rbac_allowed:(<user_roles>))
```

The `<user_roles>` placeholder would be replaced by the roles of the current user, concatenated with the OR operator.
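Building that fq string could be sketched as follows (the function name and the naive quoting are assumptions for illustration; the field names match the fq above):

```python
def build_rbac_fq(user_roles: list[str]) -> str:
    """Sketch of the three-valued RBAC fq described above.

    Records without the rbac flag always match; flagged records match
    only for the version (high-resolution or generalised) that fits
    the user's roles. Roles are quoted naively here.
    """
    roles = " OR ".join(f'"{role}"' for role in user_roles) or '""'
    return (
        "(*:* NOT dynamicProperties_rbac:*)"
        f" OR (dynamicProperties_rbac:true AND dynamicProperties_rbac_allowed:({roles}))"
        f" OR (dynamicProperties_rbac:false AND !dynamicProperties_rbac_allowed:({roles}))"
    )
```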

We would have to pre-process the sensitive data archives to duplicate them and provide the necessary additional properties.
But no changes are needed to the pipeline code, thanks to the use of dynamicProperties_* fields. So for every sensitive dataset we would need to provide one generalised and one high-resolution version, with the additional dynamic properties and the same occurrenceIds. Alternatively, we could provide a single data resource containing both versions of the data, but with different occurrenceIds (e.g. adding a :GENERALIZED suffix).

| Pro 👍 | Con 👎 |
| --- | --- |
| Queries automatically run against the correct data, across all services | Performance impact of adding an fq to all queries? |
| No duplication of all data, only of the sensitive records | Sensitive and non-sensitive records have different UUIDs, which can be confusing for users (and might cause issues with things like duplicate detection in the pipelines) |
| Relatively simple change; can even be made configurable and possibly upstreamed, given enough support | Changes in the data <> role links require re-indexing |
| Can differentiate record access based on roles, down to an individual record | Might not scale well when individual users have a large number of roles |

Working PoC: https://github.com/StefanVanDyck/biocache-service/tree/poc-rbac

### More flexible sensitiveFq-type queries

Instead of roles, the sensitiveFq approach with very dynamic queries could perhaps also be made to work using the three-valued logic.
It would probably be less performant, but it could avoid re-indexing when mapping records to roles and allow for much more complicated rules.
For role-based access we could still use this sensitiveFq, simply using dynamicProperties_rbac_allowed as the sensitiveFq,
making the role-based approach a special case of the more generic sensitiveFq approach.
The main difference from the original sensitiveFq approach is that we duplicate the records in the Solr index, so all the filtering happens inside Solr, in a single query.

Something like this:

```
(*:* NOT dynamicProperties_sensitive:*)
  OR (dynamicProperties_sensitive:true AND <fq>)
  OR (dynamicProperties_sensitive:false AND !(<fq>))
```
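Wrapping an arbitrary sensitiveFq in this three-valued template could look like the sketch below (function name assumed; field names match the template above). The role-based approach then falls out as the special case where the inner fq is a `dynamicProperties_rbac_allowed` filter on the user's roles:

```python
def build_sensitive_fq(inner_fq: str) -> str:
    """Sketch: embed an arbitrary sensitiveFq in the three-valued logic.

    Records without the sensitive flag always match; flagged records
    match only when the high-resolution / generalised version agrees
    with the inner fq.
    """
    return (
        "(*:* NOT dynamicProperties_sensitive:*)"
        f" OR (dynamicProperties_sensitive:true AND ({inner_fq}))"
        f" OR (dynamicProperties_sensitive:false AND !({inner_fq}))"
    )

# Role-based access as a special case of the generic approach:
rbac_fq = build_sensitive_fq('dynamicProperties_rbac_allowed:("ROLE_A")')
```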

The pros and cons would be the same as above, except for the re-indexing and the scaling with many user roles.

## Conclusion

I think the best approach would be to use the three-valued logic with the dynamicProperties_* fields.
It should give us the most flexibility with the least duplication.
The only potential downsides are the query performance and some changes to the deduplication code in the pipeline.
But we can only really find out how much of an issue these are once we start implementing.
