A Guide to the Code - Winterflower/dna-adapter GitHub Wiki
Welcome to the DNAdapter wiki! This page will contain up-to-date documentation on the internals of the DNAdapter codebase. If you want to contribute to DNAdapter, have a read through these pages. If something is unclear or poorly explained, please feel free to email me at camillamon[at]gmail.com or info[at]winterflower.net.
The Preprocessing Module
Let's explore the preprocess module function by function. You will need the following:
- A text editor you like working in (I use vim or Atom)
- Python 2.7 installed on your machine
- Knowledge of how to run Python scripts and how to work in the interactive Python shell
generate_kmers
Purpose: Generates a list of all possible words of length k from the DNA nucleotide alphabet
Arguments: k, the length of the kmer
Returns: A list of all possible kmers of length k
Usage example:
import preprocess as pr
#generate all possible DNA 2-mers
print pr.generate_kmers(2)
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA',
'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
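The wiki does not show the body of generate_kmers, so the following is only a minimal sketch of one way to produce the same output, assuming the alphabet is ordered A, C, G, T. The name generate_kmers_sketch is hypothetical and is not part of the preprocess module. The sketch runs on both Python 2.7 and Python 3.

```python
import itertools

DNA_ALPHABET = "ACGT"  # assumed ordering, chosen to match the 2-mer output above


def generate_kmers_sketch(k):
    """Return every DNA word of length k in lexicographic order."""
    # itertools.product yields tuples such as ('A', 'C'); join each into a string
    return ["".join(letters) for letters in itertools.product(DNA_ALPHABET, repeat=k)]


kmers = generate_kmers_sketch(2)
```

Because each position can hold one of four bases, the list has 4**k entries, so it grows quickly with k.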
revcomp(seq)
Purpose: Generate the reverse complement of a DNA sequence in the 5' to 3' direction
Arguments: The DNA sequence as a string
Returns: The reverse complement DNA sequence as a string
Usage example:
pr.revcomp('AAAT')
'ATTT'
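As an illustration of what the function computes, here is a minimal sketch of a reverse complement, assuming upper-case input that contains only A, C, G and T. The names COMPLEMENT and revcomp_sketch are hypothetical; the actual revcomp in preprocess may be written differently and may handle other characters.

```python
# Assumed complement table for the four unambiguous DNA bases
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp_sketch(seq):
    """Return the reverse complement of an upper-case DNA string."""
    # Walk the sequence backwards and swap each base for its complement
    return "".join(COMPLEMENT[base] for base in reversed(seq))


result = revcomp_sketch("AAAT")
```

Some sequences, such as 'AT' or 'ACGT', are their own reverse complement, which matters for the mapping table described next.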
generate_rcmap_table(kmerlen, kmers, print_dict=False)
Purpose: Generates a mapping from a kmer to its reverse complement
Arguments:
- kmerlen : the length of the kmer (i.e. kmerlen=2 corresponds to a reverse complement table for all kmers of length 2); note that the function body in the walkthrough does not read this argument
- kmers : a list of all kmers of length kmerlen
- print_dict : a boolean which indicates whether or not to print the kmer mapping
Returns: a list with one entry per kmer; entry i holds the index of kmer i or the index of its reverse complement, whichever is smaller
Code walkthrough:
def generate_rcmap_table(kmerlen, kmers, print_dict=False):
    """
    Returns a mapping list which maps a k-mer to its reverse complement.

    :author Dongwon Lee (2011)
    Method has been modified by Camilla Montonen (2014)

    Arguments:
    kmerlen -- integer, length of k-mer
    kmers -- list, a full set of k-mers generated by generate_kmers
    print_dict -- boolean, indicates whether the kmer mapping dictionary should be printed

    Return:
    a list containing the mapping table

    >>> preprocess.generate_rcmap_table(2,preprocess.generate_kmers(2))
    [0, 1, 2, 3, 4, 5, 6, 2, 8, 9, 5, 1, 12, 8, 4, 0]
    >>> preprocess.generate_rcmap_table(2,preprocess.generate_kmers(2),True)
    {'AA': 0, 'AC': 1, 'GT': 11, 'AG': 2, 'CC': 5, 'CA': 4, 'CG': 6, 'TT': 15, 'GG': 10, 'GC': 9, 'AT': 3, 'GA': 8, 'TG': 14, 'TA': 12, 'TC': 13, 'CT': 7}
    [0, 1, 2, 3, 4, 5, 6, 2, 8, 9, 5, 1, 12, 8, 4, 0]
    """
    revcomp_func = revcomp

    # map each kmer to its position in the kmers list
    kmer_id_dict = {}
    for i in xrange(len(kmers)):
        kmer_id_dict[kmers[i]] = i

    # for each kmer, keep the smaller of its own index and its reverse complement's index
    revcomp_mapping_table = []
    for kmerid in xrange(len(kmers)):
        rc_id = kmer_id_dict[revcomp_func(kmers[kmerid])]
        if rc_id < kmerid:
            revcomp_mapping_table.append(rc_id)
        else:
            revcomp_mapping_table.append(kmerid)

    if print_dict:
        print kmer_id_dict
    return revcomp_mapping_table
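To show how such a table can be read, here is a small sketch that collapses per-kmer counts onto one representative per reverse-complement pair. This is one plausible use and not necessarily how DNAdapter uses the table. The helper names (revcomp_sketch, rcmap_table_sketch) and the counts are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
import itertools

# Assumed complement table for the four unambiguous DNA bases
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp_sketch(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def rcmap_table_sketch(kmers):
    """Same index logic as generate_rcmap_table, without the unused kmerlen."""
    kmer_id = {kmer: i for i, kmer in enumerate(kmers)}
    # Entry i is the smaller of i and the index of the reverse complement
    return [min(i, kmer_id[revcomp_sketch(kmer)]) for i, kmer in enumerate(kmers)]


kmers = ["".join(p) for p in itertools.product("ACGT", repeat=2)]
table = rcmap_table_sketch(kmers)

# Made-up counts for illustration: every 2-mer was seen exactly once
counts = [1] * len(kmers)

# Add each kmer's count onto the kmer that represents its pair
collapsed = {}
for kmer_index, canonical_index in enumerate(table):
    name = kmers[canonical_index]
    collapsed[name] = collapsed.get(name, 0) + counts[kmer_index]
```

For k=2 the 16 kmers collapse to 10 representatives: the four kmers that equal their own reverse complement (AT, CG, GC, TA) keep a count of 1, and the six remaining pairs each sum to 2.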