5.2.An example of extracting ICD codes from UK Biobank tabular data - neurohub/neurohub_documentation GitHub Wiki
The purpose of this exercise is to extract data from the tabular data in Beluga of subjects (participants) encoded U071 (COVID-19, virus identified) correlating with the second-scan date using the xsv
command. Please find a Step-by-Step process for your reference below.
The ICD (International Classification Disease) and IDP(imaging-derived phenotypes) information are available in current.csv
file, a big CSV table (separate from the bulk data).
Available on /lustre03/project/6008063/neurohub/UKB/Tabular/current.csv
, you will find the most updated UK Biobank data set available on Beluga.
The command will show the headers of the CSV file, the complete list of data fields with their column number.
xsv headers current.csv
Currently 21,407 data fields are associated with about 0.5 million UK Biobank subjects.
3. Diagnoses - main ICD10 (Data field 41202)
The first question would be to know if the data field is available in our data set.
You can search for the data field using the grep
command:
xsv headers current.csv | grep 41202
Files that contain data on an individual participant are named in this fashion:
<Datafield column number>''<FIELD-ID>''<INSTANCE-ID>''<ARRAY-ID>
The outcome will be the 41202 data field associated with the array, here a total of 79 arrays. (The first column is the data field column number.)
18504 41202-0.1
18505 41202-0.2
18506 41202-0.3
18507 41202-0.4
.......
18579 41202-0.76
18580 41202-0.77
18581 41202-0.78
18582 41202-0.79
You can use xsv
and select the data from the range of column associated, here column 18503 to 18582.
xsv select 18503-18582 current.csv
To get the subject number (EID) you can select 1, (as the patient number is the first column of the file)
xsv select 1,18503-18582 current.csv
xsv select 1,18503-18582 current.csv | grep U071
6. Extract subjects correlating with the second-scan date (Data field 41280)
xsv headers current.csv | grep 41280
grep
patients diagnosed in 2021 for example:
xsv select 1,19888-20146 current.csv | grep 2021
xsv select 1,18503-18582,1,19888-20146 current.csv | grep U071 | grep 2021-06
5061 participants with an ICD code of U07.1 (covid tested positive)