Flow file format - mjsull/HapFlow GitHub Wiki

A flow file starts with one of each of the following lines:

'C <chromosome_name>'

chromosome_name - the name of the chromosome reads are aligned to

'I <max_var> <depth_count_median> <depth_count>'

max_var – maximum variations found at a single site

depth_count_median – median read coverage at variant sites

depth_count – maximum read coverage of a variant

'G <max_var_1> <max_var_2> …. <max_var_err>'

max_var_n – maximum read depth of nth most common variant

max_var_err – maximum read depth of errors
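The three header lines can be read with a few lines of Python. This is only a minimal sketch, not HapFlow's own parser: the function name `parse_header` and the dictionary keys are hypothetical, and it assumes the fields are whitespace-separated as shown above and that the median may be non-integer.

```python
def parse_header(lines):
    """Parse the C, I and G lines of a flow file into a dict (hypothetical helper)."""
    header = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, values = fields[0], fields[1:]
        if kind == "C":
            # Name of the chromosome the reads are aligned to
            header["chromosome"] = " ".join(values)
        elif kind == "I":
            header["max_var"] = int(values[0])
            # Assumption: the median may be fractional, so parse it as a float
            header["depth_count_median"] = float(values[1])
            header["depth_count"] = int(values[2])
        elif kind == "G":
            depths = [int(v) for v in values]
            # All but the last value are per-variant maxima; the last is the error depth
            header["max_var_depths"] = depths[:-1]
            header["max_var_err"] = depths[-1]
    return header


header = parse_header(["C fake_chromosome", "I 2 10 11", "G 7 4 0"])
print(header)
```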

A line for each variant

'V pos,var1,var2,….'

pos – variant position

var1 – most common variant

var2 – second most common variant

etc.

An asterisk (*) is placed next to the variant present in the reference
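A V line can be split into its position, its alleles in order of frequency, and the index of the reference allele. The sketch below is illustrative only. `parse_variant_line` is a hypothetical name, and it assumes that a trailing `*` always marks the reference allele and is never part of the allele itself.

```python
def parse_variant_line(line):
    """Return (position, alleles, ref_index) for a 'V pos,var1,var2,...' line."""
    line = line.strip()
    if not line.startswith("V "):
        raise ValueError("not a variant line: %r" % line)
    pos, *raw_alleles = line[2:].split(",")
    alleles = []
    ref_index = None  # stays None if no allele carries the '*' marker
    for index, allele in enumerate(raw_alleles):
        if allele.endswith("*"):
            ref_index = index
            allele = allele[:-1]  # drop the reference marker
        alleles.append(allele)
    return int(pos), alleles, ref_index


print(parse_variant_line("V 50,A,T*"))
print(parse_variant_line("V 100,TAA*,TAATAA"))
```

The list index of each allele matches the digit used for it in flow lines (0 for the most common variant, 1 for the second, and so on).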

Finally there is a line for each flow

'F pos,flow,count,group'

pos – start position of the flow

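flow – a string of digits and symbols separated by commas (the symbols are listed below)

count – number of reads in the flow

group – number of the group the flow is assigned to

The flow string uses the following digits and symbols: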

digit – 0 indicates the most common variant is found at this position, 1 indicates the second, 2 indicates the third etc.

x – indicates that the bases at this position don't match any called variant or the reference base

_ – indicates that the variant falls between the two reads of this pair

+ – indicates subsequent variants are on the forward strand

+s – indicates subsequent variants are on the forward strand and the read starts at the variant

- – indicates that subsequent variants are on the reverse strand

-s – indicates that subsequent variants are on the reverse strand and the read starts at the variant

e – indicates the read ends on the last variant
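Because the flow string itself contains commas, an F line is easiest to split by position: the first field is the start position, the last two are the read count and group, and everything in between is the flow. The following is a minimal sketch under that reading of the format. `parse_flow_line` is a hypothetical helper, not part of HapFlow.

```python
def parse_flow_line(line):
    """Return (pos, tokens, count, group) for an 'F pos,flow,count,group' line."""
    kind, body = line.strip().split(None, 1)
    if kind != "F":
        raise ValueError("not a flow line: %r" % line)
    fields = body.split(",")
    pos = int(fields[0])      # start position of the flow
    count = int(fields[-2])   # number of reads in the flow
    group = int(fields[-1])   # group the flow is assigned to
    tokens = fields[1:-2]     # digits and symbols of the flow string
    return pos, tokens, count, group


pos, tokens, count, group = parse_flow_line("F 50,+,0,_,-,0,3,1")
print(pos, tokens, count, group)
```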

Example read alignments and flow representation (above) and resulting flow file (below)

C fake_chromosome
I 2 10 11
G 7 4 0
V 50,A,T*
V 70,G*,C
V 100,TAA*,TAATAA
F 50,+,0,_,-,0,3,1
F 50,+s,0,1,1,3,2
F 50,+,1,0,-,0,4,3
F 70,+,1,1,e,1,3
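As a quick check on the example file, the read counts of the F lines can be summed per group. This is a sketch only: `reads_per_group` is a hypothetical function, and the example file is embedded as a string rather than read from disk.

```python
from collections import defaultdict

EXAMPLE = """\
C fake_chromosome
I 2 10 11
G 7 4 0
V 50,A,T*
V 70,G*,C
V 100,TAA*,TAATAA
F 50,+,0,_,-,0,3,1
F 50,+s,0,1,1,3,2
F 50,+,1,0,-,0,4,3
F 70,+,1,1,e,1,3
"""


def reads_per_group(text):
    """Sum the read counts of all F lines, keyed by group number."""
    totals = defaultdict(int)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("F "):
            continue  # only flow lines carry count and group
        fields = line[2:].split(",")
        count, group = int(fields[-2]), int(fields[-1])
        totals[group] += count
    return dict(totals)


totals = reads_per_group(EXAMPLE)
print(totals)  # {1: 3, 2: 3, 3: 5}
```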

Below is a detailed look at each line

C fake_chromosome

C - line identifier

fake_chromosome - The name of the reference.

I 2 10 11

I - line identifier

2 - The maximum number of variants at any position

10 - The median read coverage of all variants

11 - The maximum read coverage of all variants

G 7 4 0

G - line identifier

7 - Maximum reads assigned to the most common allele at any position

4 - Maximum reads assigned to the second most common allele at any position

0 - Maximum reads with an allele not identified in the VCF file at any position

V 50,A,T*

The first variant is at position 50 in the reference, called alleles at this site are A (most common) and T (second most common). T is the allele in the reference genome.

V 70,G*,C

The second variant is at position 70 in the reference, called alleles at this site are G (most common) and C (second most common). G is the allele in the reference genome.

V 100,TAA*,TAATAA

The third variant is at position 100 in the reference, called alleles at this site are TAA (most common) and TAATAA (second most common). TAA is the allele in the reference genome.

F 50,+,0,_,-,0,3,1

The pink flow

F - line identifier

50 - The first variant of this flow is at position 50

+ - the first pair of reads in this flow align to the forward strand

0 - indicates the allele at position 50 is the most common allele

_ - indicates the allele at position 70 is not covered by the read pair

- - the second pair of reads in this flow align to the reverse strand

0 - indicates the allele at position 100 is the most common allele

3 - indicates the number of reads represented by this flow

1 - As there are no previously defined groups, the first flow is assigned to group 1

F 50,+s,0,1,1,3,2

Dark green flow

F - line identifier

50 - The first variant of this flow is at position 50

+s - the first pair of reads in this flow align to the forward strand. The variant falls on the first base of each read.

0 - indicates the allele at position 50 is the most common allele

1 - indicates the allele at position 70 is the second most common allele

1 - indicates the allele at position 100 is the second most common allele

3 - indicates the number of reads represented by this flow

2 - As this flow contains an allele at a variant site not consistent with a previously defined group it is assigned the next available group number.

F 50,+,1,0,-,0,4,3

Orange flow

F - line identifier

50 - The first variant of this flow is at position 50

+ - the first pair of reads in this flow align to the forward strand

1 - indicates the allele at position 50 is the second most common allele

0 - indicates the allele at position 70 is the most common allele

- - the second pair of reads in this flow align to the reverse strand

0 - indicates the allele at position 100 is the most common allele

4 - indicates the number of reads represented by this flow

3 - As this flow contains an allele at a variant site not consistent with a previously defined group it is assigned the next available group number.

F 70,+,1,1,e,1,3

Light green flow

F - line identifier

70 - The first variant of this flow is at position 70

+ - the first pair of reads in this flow align to the forward strand.

1 - indicates the allele at position 70 is the second most common allele

1 - indicates the allele at position 100 is the second most common allele

e - indicates the allele at position 100 falls on the last base(s) of this read

1 - indicates the number of reads represented by this flow

3 - As alleles at all common variant sites are consistent with a previously defined group, it is assigned that group's number.
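The walk-through above can be sketched in code: starting at the flow's first variant, each digit, x or _ is applied to the next variant site in order, while the strand markers and e do not advance to a new site. This is an illustrative reading of the format, not HapFlow's implementation. `decode_flow` and the `VARIANTS` table are hypothetical names, and the table is filled in by hand from the V lines of the example.

```python
# Alleles per variant site, most common first (taken from the example V lines)
VARIANTS = {50: ["A", "T"], 70: ["G", "C"], 100: ["TAA", "TAATAA"]}

MARKERS = {"+", "+s", "-", "-s", "e"}  # strand / read-boundary markers


def decode_flow(start, tokens, variants):
    """Map each variant position covered by a flow to the allele it carries.

    None means the site falls between the two reads of a pair ('_');
    'x' means the bases match neither a called variant nor the reference.
    """
    positions = sorted(variants)
    index = positions.index(start)  # the flow begins at this variant site
    calls = {}
    for token in tokens:
        if token in MARKERS:
            continue  # markers do not consume a variant site
        pos = positions[index]
        if token.isdigit():
            calls[pos] = variants[pos][int(token)]
        elif token == "_":
            calls[pos] = None
        elif token == "x":
            calls[pos] = "x"
        else:
            raise ValueError("unknown flow token: %r" % token)
        index += 1
    return calls


pink = decode_flow(50, ["+", "0", "_", "-", "0"], VARIANTS)
light_green = decode_flow(70, ["+", "1", "1", "e"], VARIANTS)
print(pink)
print(light_green)
```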
