Enter neutralization data (plasma or serum samples) - Jefflier/covid-drdb-payload GitHub Wiki

Plasma or serum samples are obtained from study subjects before or after SARS-CoV-2 infection, or after immunization with any kind of vaccine. The neutralization titers are the most important data. Other metadata are also required because they describe the subject conditions under which plasma neutralization was measured. The subject history includes, for example, the infecting variant, vaccine, dosage, disease severity, age, and infection or immunization date.

Steps

1. Read the paper and annotate key information
2. Extract neutralization data

2.1 Heterogeneous data representation
2.2 A note on figures
2.3 Figures with paired points
2.4 Dose-response curve
2.5 Estimate date
2.6 Surrogate neutralization test

3. Format data

3.1 Metadata tables
- 3.1.1 Common tables
3.2 Neutralization potency tables

4. Submit data

Read the paper and annotate key information

Please pay attention to the following information in the Abstract, Methods and Materials, Results, or supplementary content:

Paper metadata
- first author's name
- DOI, for publications without DOI, please provide the URL
- year of publication
Neutralizing potency
- the most important value is the 50% neutralization concentration (IC50)
- we also record IC80, IC90, etc., if available
Neutralization assay
- Variants and mutations of SARS-CoV-2 used in the assay
- assay type, for example live virus, pseudovirus
- assay procedure and details
Infection metadata
- infection date
- infection species
- infection SARS-CoV-2 variant (PANGOLIN or WHO name)
Immunization metadata
- vaccine name
- shot number (1st shot, 2nd shot, 1st booster shot, 2nd booster shot)
- vaccination date

Extract neutralization data

Heterogeneous data representation

Papers represent neutralization titers in different ways. Some representations are easy to extract from; others need more work. Some papers report only fold changes rather than neutralization titers. Neutralization titers are considered finer-grained than fold-change data.

- table of plasma conditions, titers, and variants
- dose-response curve of each plasma
- neutralization titer figure with paired points
- neutralization titer figure with unpaired points
- neutralization titer figure with average values (geomean, mean, etc.), which we call aggregated data
- neutralization titers from the body text
- neutralization titer fold change between control and test variants

Despite different representations, the key information is the same:

- plasma and the subject exposure metadata
- control and test variants and their mutations
- neutralization assay type
- neutralization titer values
- LLOQ (lower limit of quantification)

Granularity

In general, there are four levels of granularity for neutralization titers:

Aggregation \ Pairing	Yes	No
No	I	II
Yes	III	IV

"Aggregation" means that the reported titer for convalescent plasma (CP) or vaccinee plasma (VP) is an average over subjects who share the same exposure conditions. For example, a study that reports only the geometric mean titer (GMT) provides aggregated data.

"Pairing" means that the results of a plasma tested against different variants can be linked to the same subject. For example, a figure that shows individual points not connected by lines provides unpaired data, because we cannot tell whether two points tested against different variants come from the same plasma sample.

Important note

Level I (unaggregated, paired data) provides the most detail and is the preferred form. If level I is not achievable, level II (unaggregated, unpaired data) or level III (aggregated, paired data) is acceptable, with level II preferred over level III. Level IV is reserved for data sets where only fold changes are available. Unlike level I, II, and III data, which can be deposited into the rx_potency table, level IV data can only be deposited into the rx_fold table.
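The granularity table and the preference order can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this guide; the function names `granularity_level`, `target_table`, and `best_level` are hypothetical and are not part of any project tooling.

```python
# Preference order stated in the note above: I is best, IV is the last resort.
PREFERENCE = ["I", "II", "III", "IV"]


def granularity_level(aggregated: bool, paired: bool) -> str:
    """Map the two properties from the granularity table to a level."""
    if not aggregated:
        return "I" if paired else "II"
    return "III" if paired else "IV"


def target_table(level: str) -> str:
    """Levels I-III go to rx_potency; level IV (fold change only) goes to rx_fold."""
    return "rx_fold" if level == "IV" else "rx_potency"


def best_level(available_levels):
    """Pick the finest granularity among the representations a paper offers."""
    return min(available_levels, key=PREFERENCE.index)
```

For example, a paper that offers both a GMT summary table (level III) and a figure with unlinked individual points (level II) would be entered from the figure, because `best_level(["III", "II"])` returns `"II"`.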

Comment: Some papers report the same data more than once using different representations. To avoid duplication, enter only the representation with the finest granularity.

Source data files

Some journals publish source data files with the paper; please look for these first. Some papers provide raw data in supplementary tables, so please download and check the supplementary materials before entering neutralization data.

Figures

Most papers provide figures plotted with titer data points. Individual data points can be extracted using image processing software and converted back to titer values. Extracting data from figures is difficult, so before diving into the figures, try to find source data files (usually Excel files) or tables.

Summary tables

Summary tables usually report aggregated data.

A note on figures

Because researchers use a variety of data-processing and data-visualization programs, figures differ in style from paper to paper. A figure may be rasterized or vectorized, so the extraction technique has to be adapted to each paper.

For rasterized figures, you can extract data using image editors (Adobe Illustrator, Adobe Photoshop). First, mark the points and measure their x-y coordinates. Second, use a formula to convert the coordinates to the actual IC50 values. You can use the same method to read the limit of quantification (LLOQ or ULOQ) from the figure if it is shown.

For vectorized figures, you can use the same method as for rasterized figures. You can also find the individual parts of each point in the layers panel, which saves time when measuring the x-y coordinates. You can also write scripts for the image editor to measure and calculate the data automatically.
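The conversion formula mentioned above is not spelled out in this guide. One common approach, sketched here as an assumption, is to calibrate the axis with two labeled tick marks and interpolate between them, on a log10 scale when the titer axis is logarithmic. The function name `pixel_to_value` and its parameters are hypothetical.

```python
import math


def pixel_to_value(y_px, ref1, ref2, log_scale=True):
    """Convert a measured pixel coordinate to an axis value.

    ref1 and ref2 are (pixel, axis_value) pairs for two labeled ticks
    on the same axis. Set log_scale=False for a linear axis.
    """
    (p1, v1), (p2, v2) = ref1, ref2
    if log_scale:
        # Interpolate in log10 space because the ticks are evenly spaced there.
        v1, v2 = math.log10(v1), math.log10(v2)
    fraction = (y_px - p1) / (p2 - p1)
    value = v1 + fraction * (v2 - v1)
    return 10 ** value if log_scale else value


# Example with made-up measurements: the tick "10" sits at pixel 400 and the
# tick "1000" at pixel 200, so a point at pixel 300 lies halfway on the log axis.
titer = pixel_to_value(300, ref1=(400, 10), ref2=(200, 1000))
```

The same function can be applied to a dashed limit-of-quantification line if the figure shows one.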

Figures with paired points

It is common for papers to use lines to link/pair data points from the same plasma sample. Please use numeric suffixes like _1, _2 to distinguish different samples from the same figure, or add the section name like _figure1A_, _Fig2B_ to distinguish plasmas from different figures.

Some publications may overlay the paired points data with a box-plot. We prefer the paired points if you can measure the value, but if it's hard to distinguish points, you can enter the average value (geomean, mean, etc) in the box plot. Please ignore the confidence interval and p-values.

Dose-response curve

Please pay attention to the value at 50% neutralization. If the curve doesn't cross 50% neutralization, neutralization was not detected in the assay; in that case, use the LLOQ as the neutralization potency value.
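When only the plotted points of a dose-response curve can be read, the 50% crossing can be approximated by interpolating between the two neighboring points. The sketch below assumes reciprocal dilutions on a log10 axis and uses simple log-linear interpolation, which is a simplification and not the curve fit a paper may have used. The function name `interpolate_nt50` is hypothetical. It returns `None` when no 50% crossing exists, so the quantification limit can be recorded instead, as described above.

```python
import math


def interpolate_nt50(dilutions, neutralization_pct):
    """Estimate the reciprocal dilution at which neutralization is 50%.

    dilutions: reciprocal dilutions (e.g. 20 for 1:20).
    neutralization_pct: percent neutralization measured at each dilution.
    Returns None if the points never cross 50%.
    """
    points = sorted(zip(dilutions, neutralization_pct))
    for (d1, n1), (d2, n2) in zip(points, points[1:]):
        crosses = (n1 - 50) * (n2 - 50) <= 0 and n1 != n2
        if crosses:
            # Fraction of the way from the first point to the 50% level.
            fraction = (n1 - 50) / (n1 - n2)
            log_d = math.log10(d1) + fraction * (math.log10(d2) - math.log10(d1))
            return 10 ** log_d
    return None
```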

Estimate date

For infection, if the patient was infected with the original variant or the B.1 variant, the estimated date is before 2020-09-30. For the Alpha and Beta variants, it is between about 2020-12-01 and 2021-04-01. For the Delta variant, it is about 2021-07-01. For the Omicron BA.1 variant, it is between about 2021-12-01 and 2022-02-28. For the Omicron BA.2 variant, it is after about 2022-03-01. Use this rule only if the paper does not report an exact date, an estimated date, an average date, a date range, or even the month. Because the waves overlap, the date cannot be estimated precisely. Conversely, you can estimate the variant from the infection date reported in the paper; if there is uncertainty, please use Unknown variant. For animal model studies, the infection date can be assumed to be close to the publication date.
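The infection-date rule above can be kept as a small lookup table so that the same windows are applied consistently across papers. This is only a transcription of the dates in this section; the dictionary keys are illustrative labels and are not necessarily the exact `var_name` values used in the `variants` table, and the function name is hypothetical.

```python
from datetime import date

# (earliest, latest) windows transcribed from the rule above.
# None marks an open-ended side of the window.
INFECTION_WINDOWS = {
    "Original": (None, date(2020, 9, 30)),
    "B.1": (None, date(2020, 9, 30)),
    "Alpha": (date(2020, 12, 1), date(2021, 4, 1)),
    "Beta": (date(2020, 12, 1), date(2021, 4, 1)),
    "Delta": (date(2021, 7, 1), date(2021, 7, 1)),
    "Omicron BA.1": (date(2021, 12, 1), date(2022, 2, 28)),
    "Omicron BA.2": (date(2022, 3, 1), None),
}


def estimated_infection_window(variant):
    """Return the (earliest, latest) window, or (None, None) if the label is not listed."""
    return INFECTION_WINDOWS.get(variant, (None, None))
```

Any date taken from this table is an estimate, so the matching `infection_date_cmp` value would be `~` rather than `=`.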

For vaccination, because most vaccines were approved for emergency use around January 2021, we can assume the first shot was given between 2021-01-01 and 2021-05-01, the second shot about 1 month later, and the first booster shot about 6 months after the second shot. Use this rule only if the paper does not report an exact date, an estimated date, an average date, a date range, or even the month.
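The vaccination rule can be applied mechanically once a first-shot date has been chosen from the assumed window. The sketch below uses only the standard library; the helper names `add_months` and `estimate_vaccination_dates` are hypothetical, and clamping to the last day of a shorter month is a choice made for this example.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months to a date, clamping the day to the month length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def estimate_vaccination_dates(first_shot):
    """Return estimated dates keyed by dosage number (1, 2, 3)."""
    second_shot = add_months(first_shot, 1)   # about 1 month after the first shot
    booster = add_months(second_shot, 6)      # about 6 months after the second shot
    return {1: first_shot, 2: second_shot, 3: booster}
```

For a first shot assumed on 2021-03-01, this gives 2021-04-01 for the second shot and 2021-10-01 for the first booster. Because these are estimates, the matching `vaccination_date_cmp` value would be `~`.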

For breakthrough infections, some papers report the average time between exposures; you can use these intervals to estimate the dates of infection and vaccination.

Surrogate neutralization test

The surrogate neutralization test (sVNT) data are not neutralization titers but percentages of neutralization. Normally, the test is run at a fixed dilution, for example 1:20, and compares the neutralization potency of different plasma samples or of the same plasma against different variants. In this situation, when filling in the rx_potency table, set potency_type to NC; if the fixed dilution is known, use NCxx, for example NC20. The potency is the percentage value, and the potency_unit is percent.
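As a small illustration of the naming rule for sVNT data, the three affected `rx_potency` fields can be derived as follows. The function name `svnt_potency_fields` is hypothetical; the field names and values follow the description above.

```python
def svnt_potency_fields(percent_neutralization, dilution=None):
    """Build the rx_potency fields for one sVNT measurement.

    dilution is the reciprocal of the fixed dilution (20 for 1:20),
    or None when the paper does not report it.
    """
    potency_type = "NC" if dilution is None else f"NC{dilution}"
    return {
        "potency_type": potency_type,
        "potency": percent_neutralization,
        "potency_unit": "percent",
    }
```

A plasma showing 85% neutralization at 1:20 would therefore be entered with `potency_type` `NC20`, `potency` 85, and `potency_unit` `percent`.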

Format data

Please use this Excel template specifically for plasma to format the data.

In this section, we describe each table and its columns. The primary key or joint primary keys of a table are highlighted in bold.

Metadata tables

Common tables

Please read Enter neutralization data (metadata tables)

Neutralization potency tables

⚠️ Note: if the potency (for example, GMT) is available, you do not need to fill in the rx_fold table.

`rx_potency`

Column name	Description	Format	Default	Comment
ref_name	RefID			Enter the `ref_name` from the `articles` table
rx_name	Free text describing the plasma; you can include the infecting variant name, vaccine name, dosages, etc. to distinguish different plasmas
iso_name	`iso_name` of the tested virus; the name should be in the `isolates` table
section	Figure, table, supplementary content or paragraph number from where the data are extracted
assay_name	Must be a value from the `assay_name` column in the 'assay' tables
potency_type			NT50
potency	Neutralization titer
cumulative_count	Number of data points that share the same value
potency_upper_limit	ULOQ (upper limit of quantification)
potency_lower_limit	LLOQ (lower limit of quantification)
potency_unit			`NULL`
date_added		`YYYY-MM-DD`
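The `cumulative_count` column suggests that data points with identical values are stored once with a count. A minimal sketch of that collapsing step is shown below, assuming rows are plain dictionaries keyed by the column names above. The function name `collapse_rows` and the choice of key columns are assumptions made for this example, not the project's own tooling.

```python
from collections import Counter
from datetime import date

# Columns that must all match for two data points to count as "the same value".
KEY_COLUMNS = (
    "ref_name", "rx_name", "iso_name", "section",
    "assay_name", "potency_type", "potency",
)


def collapse_rows(rows, date_added):
    """Merge identical data points and record how many were merged."""
    counts = Counter(tuple(row[col] for col in KEY_COLUMNS) for row in rows)
    collapsed = []
    for key, count in counts.items():
        record = dict(zip(KEY_COLUMNS, key))
        record["cumulative_count"] = count
        record["date_added"] = date_added.isoformat()  # YYYY-MM-DD
        collapsed.append(record)
    return collapsed
```

For paired (level I) data each plasma has its own `rx_name`, so every row keeps a `cumulative_count` of 1; the count only grows for unpaired points that share an `rx_name` and a value.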

`rx_fold`

Column name	Description	Format	Default
ref_name	RefID
rx_name	Free text describing the plasma; you can include the infecting variant name, vaccine name, dosages, etc. to distinguish different plasmas
control_iso_name	`iso_name` of the control virus; the name should be in the `isolates` table
iso_name	`iso_name` of the test virus; the name should be in the `isolates` table
section	Figure, table, supplementary content or paragraph number from where the data are extracted
assay_name	Must be a value from the `assay_name` column in the `assays.csv` table
potency_type			NT50
fold_cmp	If the test NT50 < LLOQ, use ">"; otherwise use "="
fold	Fold change (control NT50 / test NT50)
resistance_level		`NULL`
ineffective		`NULL`
cumulative_count	Number of data points that share the same value
date_added		`YYYY-MM-DD`

`ref_isolate_pairs`

This table records which isolate is the control and which is the test. It is used together with the rx_potency table; if all data are in the rx_fold table, this table can be ignored.

Column name	Description	Comment
ref_name	RefID
control_iso_name	control `iso_name` from `rx_potency` table	Most of the time, it is the wild-type virus or a virus with the D614G mutation
iso_name	test `iso_name` from `rx_potency` table

`subject_plasma`

A subject_name represents an individual or a group of people sharing the same infection or immunization (exposure) history.

Column name	Description	Format	Default
ref_name	RefID
subject_name	Unique identifier for subjects sharing the same exposure history	Freetext
rx_name	`rx_name` in `rx_potency` table
collection_date_cmp	If the accurate plasma isolation dates are reported in the paper then use '=', else use '~'		'~'
collection_date		`YYYY-MM-DD`
location	Country name where the plasma was isolated
cumulative_group	Same as `subject_name`
section	Figure, table, supplementary content or paragraph number from where the data are extracted

`subject_infections`

Column name	Description	Format	Default	Comment
ref_name	RefID
subject_name	`subject_name` from `subject_plasma` table
infection_date_cmp	If the accurate infection dates are reported in the paper then use '=', else use '~'
infection_date		`YYYY-MM-DD`
infected_var_name	`var_name` from `variants` table			If the infection variant name is not reported, use `Unknown variant`
location	Country name
immune_status			`NULL`
severity	Mild, Moderate, Hospitalized, Non-Hospitalized
section	Figure, table, supplementary content or paragraph number from where the data are extracted

`subject_vaccines`

Column name	Description	Format	Comment
ref_name	RefID
subject_name	`subject_name` from `subject_plasma` table
vaccination_date_cmp	If the accurate immunization dates are reported in the paper then use '=', else use '~'
vaccination_date		`YYYY-MM-DD`
vaccine_name	`vaccine_name` from `vaccines` table		If the vaccine name is not reported, use `Unknown vaccine`
dosage	1st shot as 1, 2nd shot as 2, booster shot as 3, etc	Integer
location	Country name
section	Figure, table, supplementary content or paragraph number from where the data are extracted
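Several columns in the subject tables have a fixed format (`=` or `~` for the `_cmp` columns, `YYYY-MM-DD` for dates, an integer for `dosage`). Before submitting, a quick check such as the sketch below can catch formatting slips. It assumes one `subject_vaccines` row represented as a dictionary; the function name `validate_vaccine_row` is hypothetical, and this is not the consistency check the maintainers run.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_vaccine_row(row):
    """Return a list of formatting problems found in one subject_vaccines row."""
    errors = []
    if row.get("vaccination_date_cmp") not in ("=", "~"):
        errors.append("vaccination_date_cmp must be '=' or '~'")

    date_text = str(row.get("vaccination_date") or "")
    if not DATE_PATTERN.fullmatch(date_text):
        errors.append("vaccination_date must look like YYYY-MM-DD")
    else:
        try:
            datetime.strptime(date_text, "%Y-%m-%d")
        except ValueError:
            errors.append("vaccination_date is not a real calendar date")

    dosage = row.get("dosage")
    if isinstance(dosage, bool) or not isinstance(dosage, int) or dosage < 1:
        errors.append("dosage must be an integer starting at 1")
    return errors
```

An empty list means the row passed these checks; it does not confirm that `vaccine_name` exists in the `vaccines` table, which would need a lookup against that table.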

Submit data

If you're not familiar with programming, please skip this step and attach the Excel file to the issue page. Please also let the admin know that the data file is ready to use. We will convert the data file into a database-friendly format and check its consistency.

Please see how to submit the data in Enter neutralization data (submit data)