A Guide to the Code - Winterflower/dna-adapter GitHub Wiki

Welcome to the DNAdapter wiki! This page will contain up-to-date documentation on the internals of the DNAdapter codebase. If you want to contribute to DNAdapter, have a read through these pages. If something is unclear or poorly explained, please feel free to email me at camillamon[at]gmail.com or info[at]winterflower.net.

The Preprocessing Module


Let's explore the preprocess module function by function. You will need the following:

  • A text editor you like working in (I use vim or Atom)
  • Python 2.7 installed on your machine
  • Knowledge of how to run Python scripts and how to work in the interactive Python shell

generate_kmers

Purpose: Generates a list of all possible words of length k from the DNA nucleotide alphabet

Arguments: k, the length of the kmer

Returns: A list of all possible kmers of length k

Usage example:

import preprocess as pr

#generate all possible DNA 2-mers
print pr.generate_kmers(2)

['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA',
 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
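The body of generate_kmers is not reproduced on this page. As a rough sketch of how such a function could work (an assumption, not the actual DNAdapter implementation), itertools.product yields every length-k combination of the alphabet in the same lexicographic order as the output above. The name generate_kmers_sketch and the alphabet parameter are hypothetical.

```python
from itertools import product


def generate_kmers_sketch(k, alphabet="ACGT"):
    """Hypothetical stand-in for preprocess.generate_kmers.

    Returns all len(alphabet)**k words of length k, in lexicographic
    order with respect to the order of `alphabet`.
    """
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


kmers = generate_kmers_sketch(2)
print(len(kmers))  # 16, i.e. 4**2
```

The list grows as 4**k, so large values of k quickly become expensive to hold in memory.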

revcomp(seq)

Purpose: Generate the reverse complement of a DNA sequence in the 5' to 3' direction

Arguments: The DNA sequence as a string

Returns: The reverse complement DNA sequence as a string

Usage example:

pr.revcomp('AAAT')
'ATTT'
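The implementation of revcomp is not shown here either. A minimal sketch, assuming the input contains only the unambiguous bases A, C, G and T in upper case, reverses the string and swaps each base for its complement. The names revcomp_sketch and COMPLEMENT are hypothetical and may differ from what the preprocess module actually does.

```python
# Assumed complement table: upper-case, unambiguous DNA bases only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp_sketch(seq):
    """Hypothetical stand-in for preprocess.revcomp."""
    # Walk the sequence backwards and complement each base.
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(revcomp_sketch("AAAT"))  # ATTT
```

With this sketch, any other character (lower-case letters, N, and so on) raises a KeyError rather than being passed through silently.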

generate_rcmap_table(kmerlen, kmers, print_dict=False)

Purpose: Generates a mapping from a kmer to its reverse complement

Arguments:

  • kmerlen : the length of the kmer (i.e. kmerlen=2 generates a reverse complement table for all kmers of length 2)
  • kmers : a list of all kmers of length kmerlen
  • print_dict : a boolean which indicates whether or not to print the kmer mapping

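Returns: a list in which entry i holds the smaller of two indices: the index of kmer i itself and the index of its reverse complement. A kmer and its reverse complement therefore share the same entry.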

Code walkthrough:

def generate_rcmap_table(kmerlen, kmers, print_dict=False):
	"""
	Returns a mapping list which maps a k-mer to its reverse complement.

	:author Dongwon Lee (2011)
	Method has been modified by Camilla Montonen (2014)

	Arguments:
	kmerlen -- integer, length of k-mer
	kmers -- list, a full set of k-mers generated by generate_kmers
	print_dict -- boolean, indicates whether the kmer-to-index dictionary should be printed

	Return:
	a list containing the mapping table

	>>> preprocess.generate_rcmap_table(2,preprocess.generate_kmers(2))
	[0, 1, 2, 3, 4, 5, 6, 2, 8, 9, 5, 1, 12, 8, 4, 0]

	>>> preprocess.generate_rcmap_table(2,preprocess.generate_kmers(2),True)
	{'AA': 0, 'AC': 1, 'GT': 11, 'AG': 2, 'CC': 5, 'CA': 4, 'CG': 6, 'TT': 15, 'GG': 10, 'GC': 9, 'AT': 3, 'GA': 8, 'TG': 14, 'TA': 12, 'TC': 13, 'CT': 7}
	[0, 1, 2, 3, 4, 5, 6, 2, 8, 9, 5, 1, 12, 8, 4, 0]

	"""
	revcomp_func = revcomp

	kmer_id_dict = {}
	for i in xrange(len(kmers)):
		kmer_id_dict[kmers[i]] = i

	revcomp_mapping_table = []
	for kmerid in xrange(len(kmers)):
		rc_id = kmer_id_dict[revcomp_func(kmers[kmerid])]
		if rc_id < kmerid:
			revcomp_mapping_table.append(rc_id)
		else:
			revcomp_mapping_table.append(kmerid)
	if print_dict:
		print kmer_id_dict
	return revcomp_mapping_table
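To see what the table is useful for, here is a self-contained sketch that rebuilds it with the same min(index, reverse-complement index) rule as the walkthrough and then uses it to count the kmers of a sequence without regard to strand. This is one plausible use, not necessarily how DNAdapter consumes the table. The helpers rcmap_table_sketch and canonical_counts are hypothetical, and the code is written to run under both Python 2.7 and Python 3.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def rcmap_table_sketch(kmers):
    """Map each kmer index to min(own index, reverse-complement index)."""
    kmer_ids = {kmer: i for i, kmer in enumerate(kmers)}
    return [min(i, kmer_ids[revcomp(kmer)]) for i, kmer in enumerate(kmers)]


def canonical_counts(seq, k, kmers, table):
    """Count kmers in seq, folding each kmer onto its canonical partner."""
    kmer_ids = {kmer: i for i, kmer in enumerate(kmers)}
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer_id = kmer_ids[seq[start:start + k]]
        counts[kmers[table[kmer_id]]] += 1
    return counts


kmers = ["".join(p) for p in product("ACGT", repeat=2)]
table = rcmap_table_sketch(kmers)

print(len(set(table)))  # 10 distinct canonical 2-mers out of 16
print(dict(canonical_counts("AACGTT", 2, kmers, table)))
```

For the sequence AACGTT, the 2-mers GT and TT are folded onto AC and AA respectively, so the counts come out as AA: 2, AC: 2, CG: 1.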