Novel computational techniques for mapping and classification of Next-Generation Sequencing data

DNA sequencing

General approach

The general approach to sequencing is similar across all technologies. To date, it remains hard to sequence an entire DNA molecule as a whole, so the methods usually rely on sequencing fragments of copies of the original molecule. The sequencing process can be understood as “reading” these fragments and encoding the obtained information into data files, usually textual. Generally speaking, this “reading” consists of obtaining a technology-specific signal characterizing the DNA fragment and recoding this signal into a sequence of letters of the DNA alphabet (so-called base calling), possibly together with information about the reliability of individual letters. The exact steps of the process vary between technologies. For instance, the resulting signal can take various forms, such as a function of electrical current or a series of photographs.

The resulting DNA strings, commonly called reads, may not fully correspond to the original DNA molecule since individual steps of the process introduce sequencing errors. To design algorithms for read processing, we need to understand well the properties and particularities of the technologies. The main parameters are the statistical distribution of read lengths, the statistical properties of sequencing errors (probabilities of individual error types, their common patterns, etc.), sequencing biases (e.g., coverage bias), the amount of data produced within a single sequencing experiment, and the reliability of the provided base qualities. Moreover, some technologies are capable of providing reads in pairs that are proximal in the original DNA (so-called paired-end and mate-pair sequencing). The data obtained also differ by type of sequencing, such as whole genome sequencing (WGS), whole exome sequencing (WES), targeted sequencing (TS), whole transcriptome shotgun sequencing (WTSS, RNA-seq), methylation sequencing (MeS, BS-seq), and others [48].
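The reliability of individual letters mentioned above is commonly reported as a Phred score Q, which corresponds to an error probability of 10^(-Q/10). The following minimal sketch decodes such scores from a textual quality string. It assumes the widespread FASTQ convention of an ASCII offset of 33, which is not stated in the text, and the function name is purely illustrative.

```python
def phred_to_error_probs(quality_string, offset=33):
    """Convert a quality string to per-base error probabilities.

    Assumption: qualities are Phred scores stored as ASCII characters
    shifted by `offset` (33 in the common FASTQ convention).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]


read = "ACGT"
qualities = "II5+"  # Phred scores 40, 40, 20, 10 under offset 33

for base, prob in zip(read, phred_to_error_probs(qualities)):
    print(base, f"{prob:.4f}")
```

A base with score 20 is thus expected to be wrong once in a hundred calls, provided the reported qualities are themselves reliable.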

Sequencing technologies producing short reads

Illumina. Illumina sequencing (originally named Solexa) was released in 2006 [51] and has since come to dominate the market [46]. Its state-of-the-art sequencers produce reads of length 100 bp to 300 bp, and paired-end reads are supported. The overall error rate is very low; the most common errors are substitutions, with typical rates of 0.005 and 0.010 for the first and second end of a pair, respectively [56, 57]. The error rate increases towards the ends of the reads, but the errors can be corrected relatively easily [58, 59, 60, 61].

Ion Torrent. The Ion Torrent sequencing platform, first released in 2010 [51], can produce reads of length up to 400 bp. The most frequent errors are indels, which appear at a rate of 0.03 [62], whereas substitution errors are an order of magnitude less frequent. As the error rate can be reduced by quality clipping, some publications (e.g., [46, 63]) report an improved error rate of 0.01.
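Quality clipping can be sketched as trimming the low-quality tail of a read. The threshold of 20, the simple rule of cutting after the last sufficiently good base, and the function name are assumptions made for illustration only; actual trimming tools use more elaborate criteria.

```python
def clip_3prime(bases, phred_scores, min_q=20):
    """Trim trailing bases whose Phred score is below min_q.

    Only the 3' tail is removed; low-quality bases inside the read
    are kept (a deliberate simplification).
    """
    end = len(bases)
    while end > 0 and phred_scores[end - 1] < min_q:
        end -= 1
    return bases[:end], phred_scores[:end]


bases = "ACGTACGT"
scores = [38, 37, 35, 30, 25, 12, 22, 8]

clipped_bases, clipped_scores = clip_3prime(bases, scores)
print(clipped_bases)   # the last base (score 8) is removed
print(clipped_scores)
```

Removing the least reliable bases lowers the average error rate of the remaining data at the cost of shorter reads.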

SOLiD. SOLiD sequencers were introduced in 2006 [51]. They can produce reads of up to 100 bp, possibly paired-end, with a very low error rate (<0.001). A particularity of SOLiD sequencers is the alphabet used. In contrast to the other technologies, reads are encoded in color (di-nucleotide) space [64], i.e., transitions between adjacent nucleotides are stored instead of the nucleotides themselves. A major advantage of this encoding is that sequencing errors can be distinguished from single nucleotide variants: the former are observed as a single mismatch, whereas the latter cause two adjacent mismatches. On the other hand, read mappers without explicit support for SOLiD are not applicable, which strongly limits its usage.
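The difference between a variant and a sequencing error in color space can be illustrated with a short sketch. It uses the usual two-bit formulation of the color code (A=0, C=1, G=2, T=3, color = XOR of adjacent codes) and, as a simplification, ignores the leading primer base that real color-space reads carry. The sequences and helper names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode each pair of adjacent nucleotides as a color 0-3."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def mismatch_positions(x, y):
    """Positions at which two equally long color lists differ."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]


reference = "ACGGTCA"
variant = "ACGATCA"  # single nucleotide variant at position 3 (G -> A)

ref_colors = to_colors(reference)
var_colors = to_colors(variant)

# A real variant changes both transitions that involve the altered base.
print(mismatch_positions(ref_colors, var_colors))

# A sequencing error corrupts a single color measurement.
err_colors = list(ref_colors)
err_colors[2] ^= 1
print(mismatch_positions(ref_colors, err_colors))
```

The first call reports two adjacent color positions and the second only one, which is the property that allows errors to be told apart from variants.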

454. 454 sequencers were first introduced in 2005 [51]. In terms of the data produced, 454 lies on the border between short-read and long-read technologies. The sequencers provide reads of length up to 1,000 bp (depending on the exact sequencer type) with an error rate of about 0.01, the majority of errors being indels [46, 63]. Paired-end reads are supported.

Sequencing technologies producing long reads

Pacific Biosciences. The PacBio sequencing technology provides reads of length up to 20,000 bp with an error rate ranging from 0.11 to 0.15 [65]. A major advantage of PacBio is that errors are distributed randomly and are therefore easier to distinguish from genomic variants. Note that short reads can be used for their correction (see, e.g., [66]).
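To put the quoted error rates into perspective, the sketch below computes the expected number of errors in a single read and the probability that a read is entirely error-free. It pairs the upper read lengths with the error rates given above, and it assumes that errors occur independently and uniformly along the read; both are simplifications made for illustration.

```python
# (read length in bp, per-base error rate), taken from the figures quoted above
technologies = {
    "Illumina": (300, 0.005),
    "Ion Torrent": (400, 0.03),
    "454": (1000, 0.01),
    "PacBio": (20000, 0.11),
}


def expected_errors(length, rate):
    """Expected number of erroneous positions in one read."""
    return length * rate


def error_free_probability(length, rate):
    """Probability of a read without any error (independent errors assumed)."""
    return (1 - rate) ** length


for name, (length, rate) in technologies.items():
    errors = expected_errors(length, rate)
    clean = error_free_probability(length, rate)
    print(f"{name}: {errors:.1f} expected errors, P(error-free) = {clean:.3g}")
```

Under these assumptions, an error-free long read is practically never observed, which is why algorithms for long reads must tolerate errors by design.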

Oxford Nanopore. Oxford Nanopore produces very special sequencers, distinct from the other technologies in many aspects. First of all, Nanopore sequencers are the size of a smartphone, which makes them the most mobile sequencers on the market. The technology itself is based on decoding electrical signals from protein pores, which are embedded in an electrically resistant polymer membrane. A voltage applied across this membrane causes DNA molecules to pass through the pores in a single direction. The associated changes in electrical current at the pores are recorded, and the exact sequence of nucleotides is decoded from them. The properties of Oxford Nanopore data strongly depend on the specific chemistry used, with many possible combinations. The obtained reads can be up to 200,000 bp long [46] with a typical error rate of about 0.12, mainly represented by indels [46]. The most error-prone step of the sequencing process is decoding the electrical signal, mainly because it is hardly possible to maintain a constant speed of DNA passage during sequencing. Therefore, homopolymeric regions are especially hard to sequence, and thus to decode [67], correctly.

Oxford Nanopore sequencing is currently a rapidly developing technology, providing data of constantly increasing quality. The associated high error rate remains the major disadvantage in practical applications. Nevertheless, the high mobility and comparatively low prices compensate for this drawback and make Oxford Nanopore a very promising technology, highly suitable for “point-of-care” disease detection, field pathogen detection, civil and military protection, water quality surveillance [68], real-time disease surveillance (e.g., of Ebola [69]), or sequencing in space [70, 71]. Other particularities of the technology are so-called selective sequencing (ReadUntil) [72], i.e., quickly skipping molecules that are not of interest in order to accelerate sequencing, and direct methylation sequencing [73].
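The difficulty with homopolymeric regions can be illustrated in a simplified way: when the passage speed is not constant, the length of a run of identical bases has to be inferred from the duration of the signal, so miscalls typically appear as indels inside the run. The sketch below compares a hypothetical true sequence with a hypothetical base-called one by their runs of identical bases; the sequences and the function name are illustrative assumptions.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Return (base, run_length) pairs for consecutive identical bases."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


true_seq = "ACGTTTTTTGA"
called_seq = "ACGTTTTTGA"  # one T of the homopolymer was not called

true_runs = homopolymer_runs(true_seq)
called_runs = homopolymer_runs(called_seq)

print(true_runs)
print(called_runs)

# The order of bases is identical; only the length of one run differs.
same_order = [b for b, _ in true_runs] == [b for b, _ in called_runs]
print(same_order)
```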

Table of contents

I Introduction
1 Context, motivation and contributions
2 DNA sequencing
3 Main techniques of pairwise sequence comparison
4 Data structures for NGS data analysis
II Dynamic read mapping
5 Context and motivation
6 RNF: a framework to evaluate NGS read mappers
7 Ococo: the first online consensus caller
8 DyMaS: a dynamic mapping simulator
9 Discussion
III k-mer-based metagenomic classification
10 Overview
11 Spaced seeds for metagenomics
12 ProPhyle: a BWT-based metagenomic classifier
13 Discussion
IV Conclusions
V Appendices
A Languages of lossless spaced seeds
B Read Naming Format specification