Flow file format - mjsull/HapFlow GitHub Wiki
A flow file contains exactly one of each of the following lines:
'C <chromosome_name>'
chromosome_name - the name of the chromosome reads are aligned to
'I <max_var> <depth_count_median> <depth_count>'
max_var – maximum variations found at a single site
depth_count_median – median read coverage at variant sites
depth_count – maximum read coverage of a variant
'G <max_var_1> <max_var_2> …. <max_var_err>'
max_var_n – maximum read depth of nth most common variant
max_var_err – maximum read depth of errors
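As a minimal sketch of reading these three header lines, assuming fields are separated by whitespace as in the line templates above: `parse_header` and its dictionary keys are hypothetical names chosen for illustration and are not part of HapFlow. The median is parsed as a float in case it is not a whole number, which is also an assumption.

```python
def parse_header(lines):
    """Collect the C, I and G lines of a flow file into a dictionary.

    Hypothetical helper, not part of HapFlow. Assumes whitespace-separated
    fields and a chromosome name without embedded whitespace.
    """
    header = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        tag, values = fields[0], fields[1:]
        if tag == "C":
            header["chromosome"] = values[0]
        elif tag == "I":
            max_var, depth_median, depth_max = values
            header["max_var"] = int(max_var)
            header["depth_count_median"] = float(depth_median)
            header["depth_count"] = int(depth_max)
        elif tag == "G":
            # Every value but the last is the depth of the nth most common
            # variant; the last value is the depth of errors.
            *variant_depths, error_depth = [int(v) for v in values]
            header["max_var_depths"] = variant_depths
            header["max_var_err"] = error_depth
    return header


header = parse_header(["C fake_chromosome", "I 2 10 11", "G 7 4 0"])
print(header["chromosome"], header["max_var_depths"], header["max_var_err"])
```

Lines with any other identifier are ignored by this sketch, so it can be run over a whole file without first separating out the header.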
A line for each variant
'V pos,var1,var2,….'
pos – variant position
var1 – most common variant
var2 – second most common variant
etc.
An asterisk (*) is placed next to the variant present in the reference
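A variant line can be split into its position, its alleles ordered from most to least common, and the index of the reference allele. The following is a sketch only: `parse_variant` is a hypothetical name, and it assumes the reference marker is always a single trailing asterisk on one allele.

```python
def parse_variant(line):
    """Parse 'V pos,var1,var2,...' into (pos, alleles, ref_index).

    Hypothetical helper, not part of HapFlow. ref_index is the position in
    `alleles` of the allele marked with '*', or None if no allele is marked.
    """
    tag, body = line.strip().split(None, 1)
    if tag != "V":
        raise ValueError("not a variant line: %r" % line)
    pos, *raw_alleles = body.split(",")
    alleles = []
    ref_index = None
    for i, allele in enumerate(raw_alleles):
        if allele.endswith("*"):
            ref_index = i
            allele = allele[:-1]  # drop the reference marker
        alleles.append(allele)
    return int(pos), alleles, ref_index


print(parse_variant("V 50,A,T*"))
```

Because the alleles are listed in order of frequency, the list index of an allele is the digit used for it in a flow string.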
Finally, there is a line for each flow
'F pos,flow,count,group'
pos – start position of the flow
flow – a string of digits and symbols separated by commas
count – number of reads in the flow
group – number of the group the flow is assigned to
The flow string is built from the following digits and symbols:
digit – 0 indicates the most common variant is found at this position, 1 indicates the second, 2 indicates the third etc.
x – indicates that the bases at this position don't match any called variant or the reference base
_ – indicates that the variant falls between the two reads of this pair
+ – indicates that subsequent variants are on the forward strand
+s – indicates that subsequent variants are on the forward strand and the read starts at the variant
- – indicates that subsequent variants are on the reverse strand
-s – indicates that subsequent variants are on the reverse strand and the read starts at the variant
e – indicates that the read ends on the last variant (the one immediately before the e)
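Since the flow string itself contains commas, one way to split a flow line is to take the first field as the position, the last two fields as count and group, and everything in between as the flow symbols. The sketch below assumes exactly that layout; `parse_flow` is a hypothetical name and not part of HapFlow.

```python
def parse_flow(line):
    """Parse 'F pos,flow,count,group' into a dictionary.

    Hypothetical helper, not part of HapFlow. Assumes the last two
    comma-separated fields are always count and group.
    """
    tag, body = line.strip().split(None, 1)
    if tag != "F":
        raise ValueError("not a flow line: %r" % line)
    fields = body.split(",")
    if len(fields) < 4:
        raise ValueError("flow line has too few fields: %r" % line)
    return {
        "pos": int(fields[0]),
        "symbols": fields[1:-2],  # digits, x, _, +, +s, -, -s, e
        "count": int(fields[-2]),
        "group": int(fields[-1]),
    }


flow = parse_flow("F 50,+,0,_,-,0,3,1")
print(flow["symbols"], flow["count"], flow["group"])
```

The leading and trailing whitespace is stripped before splitting, so a line written with a space before the `F` is parsed the same way.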
Example read alignments and flow representation (above) and resulting flow file (below)
C fake_chromosome
I 2 10 11
G 7 4 0
V 50,A,T*
V 70,G*,C
V 100,TAA*,TAATAA
F 50,+,0,_,-,0,3,1
F 50,+s,0,1,1,3,2
F 50,+,1,0,-,0,4,3
F 70,+,1,1,e,1,3
Below is a detailed look at each line
C fake_chromosome
C - line identifier
fake_chromosome - The name of the reference.
I 2 10 11
I - line identifier
2 - The maximum number of variants at any position
10 - The median read coverage of all variants
11 - The maximum read coverage of all variants
G 7 4 0
G - line identifier
7 - Maximum reads assigned to the most common allele at any position
4 - Maximum reads assigned to the second most common allele at any position
0 - Maximum reads with an allele not identified in the VCF file at any position
V 50,A,T*
The first variant is at position 50 in the reference. The called alleles at this site are A (most common) and T (second most common). T is the allele in the reference genome.
V 70,G*,C
The second variant is at position 70 in the reference. The called alleles at this site are G (most common) and C (second most common). G is the allele in the reference genome.
V 100,TAA*,TAATAA
The third variant is at position 100 in the reference. The called alleles at this site are TAA (most common) and TAATAA (second most common). TAA is the allele in the reference genome.
F 50,+,0,_,-,0,3,1
Pink flow
F - line identifier
50 - The first variant of this flow is at position 50
+ - the first pair of reads in this flow align to the forward strand
0 - indicates the allele at position 50 is the most common allele
_ - indicates the allele at position 70 is not covered by the read pair
- - the second pair of reads in this flow align to the reverse strand
0 - indicates the allele at position 100 is the most common allele
3 - indicates the number of reads represented by this flow
1 - As there are no previously defined groups, the first flow is assigned to group 1
F 50,+s,0,1,1,3,2
Dark green flow
F - line identifier
50 - The first variant of this flow is at position 50
+s - the first pair of reads in this flow align to the forward strand. The variant falls on the first base of each read.
0 - indicates the allele at position 50 is the most common allele
1 - indicates the allele at position 70 is the second most common allele
1 - indicates the allele at position 100 is the second most common allele
3 - indicates the number of reads represented by this flow
2 - As this flow contains an allele at a variant site not consistent with a previously defined group, it is assigned the next available group number.
F 50,+,1,0,-,0,4,3
Orange flow
F - line identifier
50 - The first variant of this flow is at position 50
+ - the first pair of reads in this flow align to the forward strand
1 - indicates the allele at position 50 is the second most common allele
0 - indicates the allele at position 70 is the most common allele
- - the second pair of reads in this flow align to the reverse strand
0 - indicates the allele at position 100 is the most common allele
4 - indicates the number of reads represented by this flow
3 - As this flow contains an allele at a variant site not consistent with a previously defined group, it is assigned the next available group number.
F 70,+,1,1,e,1,3
Light green flow
F - line identifier
70 - The first variant of this flow is at position 70
+ - the first pair of reads in this flow align to the forward strand.
1 - indicates the allele at position 70 is the second most common allele
1 - indicates the allele at position 100 is the second most common allele
e - indicates the allele at position 100 falls on the last base(s) of this read
1 - indicates the number of reads represented by this flow
3 - As alleles at all common variant sites are consistent with a previously defined group, it is assigned that group's number.
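The walkthroughs above suggest how a flow string maps onto variant sites: each digit, `x` or `_` consumes the next variant position starting from the flow's start position, while `+`, `+s`, `-`, `-s` and `e` only annotate the reads. The following sketch decodes a flow under that reading. This rule is inferred from the examples and is not a statement of how HapFlow itself processes flows; `decode_flow` is a hypothetical name.

```python
def decode_flow(variants, start_pos, symbols):
    """Map the symbols of one flow onto variant positions.

    Hypothetical helper, not part of HapFlow.
    variants  - list of (pos, alleles) sorted by position, with alleles
                ordered from most to least common
    start_pos - position of the first variant of the flow
    symbols   - flow symbols without the trailing count and group
    Returns a list of (pos, call), where call is the allele string,
    None for '_' (site not covered) or 'x' (no called variant matched).
    """
    positions = [pos for pos, _ in variants]
    alleles_at = dict(variants)
    index = positions.index(start_pos)
    calls = []
    for symbol in symbols:
        if symbol in ("+", "+s", "-", "-s", "e"):
            continue  # strand and read-boundary markers consume no site
        pos = positions[index]
        if symbol.isdigit():
            call = alleles_at[pos][int(symbol)]
        elif symbol == "_":
            call = None
        elif symbol == "x":
            call = "x"
        else:
            raise ValueError("unknown flow symbol: %r" % symbol)
        calls.append((pos, call))
        index += 1
    return calls


variants = [(50, ["A", "T"]), (70, ["G", "C"]), (100, ["TAA", "TAATAA"])]
print(decode_flow(variants, 50, ["+", "0", "_", "-", "0"]))  # pink flow
print(decode_flow(variants, 70, ["+", "1", "1", "e"]))       # light green flow
```

For the pink flow this yields A at position 50, no call at position 70 and TAA at position 100, matching the description of that flow above.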