File format - VUmcCGP/wisecondor GitHub Wiki
Numpy compressed formats are somewhat unusual for Python users. This document is meant to help you understand what data is saved in the files and how to obtain them should you wish to. Either for custom report creation or for additions to the tool.
Import the required module:
import numpy
Rid us of some function pointing variable errors:
toolConvert = '<toolConvert>'
toolTest = '<toolTest>'
toolNewref = '<toolNewref>'
Load a file:
theFile = numpy.load('./afile.npz')
Ask what data is in the file:
theFile.keys()
Numpy has a limited specification of what is in the file, a few workarounds are needed.
Sometimes it's fine to just access the data directly, sometimes it's not. This depends on whether or not the data accessed is a default numpy array structure. If it is not, use .item()
:
theFile['theData'].item()
This should return the data in it's original structure.
Data contained in the first npz created from a bam file using convert
:
-
runtime
, adict
specifying information about the system when the conversion was ran. -
arguments
, adict
with all arguments used to call the script. -
quality
, adict
with integers showing the amount of reads lost per filter. -
sample
, adict
with anint32
array per chromosome, telling the amount of reads per bin.
runtime {
'username': 'rstraver',
'version': '5cce5e3',
'hostname': 'marvin',
'datetime': datetime.datetime(2016, 12, 12, 17, 11, 52, 131386)
}
arguments {
'retthres': 4,
'retdist': 4,
'binsize': 50000.0,
'outfile': './testSamples/sampleFile.npz',
'func': '<toolConvert>',
'infile': './testSamples/sampleFile.bam'
}
quality {
'filter_rmdup': 1189893,
'filter_mapq': 3401000,
'no_coordinate': 12580916L,
'mapped': 43366013L,
'unmapped': 0L,
'post_retro': 38754546,
'pair_fail': 0,
'pre_retro': 43365720
}
sample {
'1': array([0, ..., 0], dtype=int32),
'2': array([0, ..., 0], dtype=int32),
...,
'22': array([0, ..., 0], dtype=int32),
'X': array([0, ..., 0], dtype=int32),
'Y': array([0, ..., 0], dtype=int32)
}
Data contained in the npz generated using test
:
-
runtime
, adict
specifying information about the system when the conversion was ran. -
arguments
, adict
with all arguments used to call the script. -
binsize
, anint
that represents the binsize this sample was tested at. -
asdef
, afloat
that represents the average standard deviation in this sample. -
aasdef
, afloat
, equalsasdef * threshold_z
, relative effect size required on average to make a call for a single bin using the z-score method. -
threshold_z
, afloat
representing the treshold used to determine significant variations. -
results_calls
, anarray
ofarrays
offloat
values, every array defines a call.
Format:[chromosome, startBin, endBin, zScore, effectSize]
, cast first 3 values toint
if desired. -
results_cwz
,array
offloat
values representing chromosome wide z-scores, using stouffers z-score to combine all z-scores on a chromosome. -
results_r
,array
ofarrays
offloat
values, one array per chromosome, representing relative read depth change per bin. -
results_z
,array
ofarrays
offloat
values, one array per chromosome, representing z-score per bin.
runtime {
'username': 'rstraver',
'version': '5cce5e3',
'hostname': 'marvin',
'datetime': datetime.datetime(2016, 12, 12, 17, 20, 21, 492569)
}
arguments {
'minzscore': None,
'minrefbins': 25,
'reference': './dataFiles/reference.npz',
'outfile': './testSamples/sampleFile_out.npz',
'func': '<toolTest>',
'mineffectsize': 0,
'repeats': 5,
'infile': './testSamples/sampleFile.npz',
'multitest': 1000
}
binsize
250000
asdef
0.030506466625100436
aasdef
0.15509784520217804
threshold_z
5.084097319699588
results_calls [
[1.00000000e+00, 5.40000000e+01, 5.40000000e+01,
1.06117139e+01, 3.82842947e-01],
...,
[2.20000000e+01, 8.60000000e+01, 8.60000000e+01,
-6.10927573e+00, -1.46670420e-01]
]
results_cwz [
-2.00089991,
...,
1.28552298
]
results_r [
array([0.00000000e+00, ..., 0.00000000e+00]),
...,
array([0.00000000e+00, ..., 0.00000000e+00])
]
results_z [
array([0.00000000e+00, ..., 0.00000000e+00]),
...,
array([0.00000000e+00, ..., 0.00000000e+00])
]
-
runtime
, adict
specifying information about the system when the conversion was ran. -
arguments
, adict
with all arguments used to call the script. -
binsize
, anint
that represents the binsize this reference was created at. -
chromosome_sizes
, anarray
ofint
values to tell the amount of bins per chromosome. -
masked_sizes
, anarray
ofint
values to tell the amount of bins per chromosome after masking. -
mask
, anarray
ofbool
values to tell what bins are masked out for further analyses. -
pca_components
, anarray
ofarrays
offloat
values describing PCA components. -
pca_mean
, anarray
offloat
values describing PCA means. -
indexes
, anarray
ofarrays
ofint
values telling the indexes of bins selected for every bin. Every array contains the refset for one bin, all chromosomes are concatenated into one large array. -
distances
, anarray
ofarrays
offloat
values telling the distance squared between target and reference bin for every bin selected inindexes
.
runtime {
'username': 'rstraver',
'version': 'b12fadf',
'hostname': 'marvin',
'datetime': datetime.datetime(2016, 12, 8, 15, 40, 19, 693540)
}
arguments {
'infiles': ['./refSamples/firstRefSample.npz',
...,
'./refSamples/lastRefSample.npz'],
'partfile': './dataFiles/reference_part',
'cpus': 1,
'binsize': 250000,
'outfile': './dataFiles/reference.npz',
'parts': 1,
'func': '<toolNewref>',
'part': [1, 1],
'refsize': 100,
'prepfile': './dataFiles/reference_prep.npz'
}
binsize
250000
chromosome_sizes [
998, ..., 206
]
masked_sizes [
913, ..., 141
]
mask [
True, ..., False
]
pca_components [
[0.00395148, ..., 0.01222231],
...,
[0.00389488, ..., 0.01998623]
]
pca_mean [
1.22552300e-05, ..., 8.10593756e-05
]
indexes [
[395, 446, 426 ..., 466, 467, 468],
...,
[315, 377, 422 ..., 331, 338, 340]
]
distances [
[0.00000000e+00, ..., 1.23259516e-32],
...,
[5.83621524e-24, ..., 5.86302828e-24]
]