Sequence evaluation

When processing sequences obtained from a vendor, it is useful to have an idea of how well the sequencing reactions worked, both in an absolute sense and relative to other recently obtained sequences in the same project. What follows is a primary sequence independent method of evaluating a collection (i.e. and order from an outside vendor) of sequences.

The first step is to align sequences by nucleotide index (ignoring the actual sequence). Start by reading the sequences into a list. I use the list s.b to hold forward (5’ to 3’) sequences, and the list s.f to hold the reverse (but in the 5’ to 3’ orientation, as sequenced) sequences:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
rm(list=ls(all=TRUE))
library(seqinr)

working.dir <- "B:/<my-working-dir>/"


back.files <- list.files( paste(working.dir, "back/", sep="" ))
for.files <- list.files( paste(working.dir, "for/", sep="" ))

> back.files[1:20]
[1] "MBC20120428a-A1-PXMF1.seq" "MBC20120428a-A10-PXMF1.seq"
[3] "MBC20120428a-A11-PXMF1.seq" "MBC20120428a-A12-PXMF1.seq"
[5] "MBC20120428a-A2-PXMF1.seq" "MBC20120428a-A3-PXMF1.seq"
[7] "MBC20120428a-A4-PXMF1.seq" "MBC20120428a-A5-PXMF1.seq"
[9] "MBC20120428a-A6-PXMF1.seq" "MBC20120428a-A7-PXMF1.seq"
[11] "MBC20120428a-A8-PXMF1.seq" "MBC20120428a-A9-PXMF1.seq"
[13] "MBC20120428a-B1-PXMF1.seq" "MBC20120428a-B10-PXMF1.seq"
[15] "MBC20120428a-B11-PXMF1.seq" "MBC20120428a-B12-PXMF1.seq"
[17] "MBC20120428a-B2-PXMF1.seq" "MBC20120428a-B3-PXMF1.seq"
[19] "MBC20120428a-B4-PXMF1.seq" "MBC20120428a-B5-PXMF1.seq"
>

Next determine the number of files read and create a list of that length to hold the sequences. Then read them in and inspect a sequence:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
s.b <- list()
length(s.b) <- length(back.files)
s.f <- list()
length(s.f) <- length(for.files)

for(i in 1:length(back.files)){
s.b[[i]] <- read.fasta(paste( working.dir, "back/", back.files[[i]], sep=""))
}

for(i in 1:length(for.files)){
s.f[[i]] <- read.fasta(paste( working.dir, "for/", for.files[[i]], sep=""))
}

> getSequence(s.b[[2]])[[1]][1:200]
[1] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "c" "n" "n" "n"
[19] "g" "t" "c" "c" "a" "c" "t" "g" "c" "g" "g" "c" "c" "g" "c" "c" "a" "t"
[37] "g" "g" "g" "a" "t" "g" "g" "a" "g" "c" "t" "g" "t" "a" "t" "c" "a" "t"
[55] "c" "c" "t" "c" "t" "t" "c" "t" "t" "g" "g" "t" "a" "g" "c" "a" "a" "c"
[73] "a" "g" "c" "t" "a" "c" "a" "g" "g" "c" "g" "c" "g" "c" "a" "c" "t" "c"
[91] "c" "g" "a" "t" "a" "t" "t" "g" "t" "g" "a" "t" "g" "a" "c" "t" "c" "a"
[109] "g" "t" "c" "t" "c" "c" "a" "c" "t" "c" "t" "c" "c" "c" "t" "g" "c" "c"
[127] "c" "g" "t" "c" "a" "c" "c" "c" "c" "t" "g" "g" "c" "g" "a" "g" "c" "c"
[145] "g" "g" "c" "c" "g" "c" "c" "a" "t" "c" "t" "c" "c" "t" "g" "c" "a" "g"
[163] "g" "t" "c" "t" "a" "g" "t" "c" "a" "g" "a" "g" "c" "c" "t" "c" "c" "t"
[181] "a" "c" "a" "t" "a" "a" "t" "g" "g" "a" "t" "a" "c" "a" "a" "c" "t" "a"
[199] "t" "a"


Note that ambiguities are indicated with an “n”. The sequence evaluation will involve counting the number of ambiguitites at each index position. The expectation is that initially - first 25 or so bases - will have a large number of ambiguities, falling to near zero at position 50. This is the run length required to get the primer annealed and incoporating nucleotides. Next will follow 800-1200 positions with near zero ambiguity count. How long exactly is a function of the sequencing quality. Towards the end of the run the ambiguities begin to rise as the polymerase loses energy. Finally the ambiguity count will fall as the reads terminate.

Create a vector nbsum that will tally the count of ambiguities at a given index. Then process through each sequence and count, at each index, the number of ambiguities. The total count of ambiguities is entered into nbsum at the corresponding index position.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
for( i in 1:length(s.b)){
for( j in 1:length( getSequence(s.b[[i]][[1]]))){
if( getSequence(s.b[[i]])[[1]][j] == "n") nbsum[j] <- nbsum[j] + 1
}
}

> nbsum[1:100]
[1] 167 168 168 168 168 166 163 164 160 153 149 142 131 150 135 125 120 111
[19] 99 93 79 80 59 51 61 52 48 38 26 20 17 17 22 20 18 14
[37] 16 15 13 11 14 16 23 21 13 12 6 7 5 6 9 5 3 3
[55] 1 4 3 1 1 3 3 3 2 0 1 0 0 0 0 1 1 1
[73] 5 5 12 14 21 25 24 28 29 31 21 20 8 8 7 10 4 2
[91] 2 3 5 1 3 0 3 1 1 0
>


x <- 1:1200
plot(nbsum[x])

Overlay the reverse reads in red.

1
2
3
4
5
6
7
8
9
10
nfsum <- vector( mode="integer", length=2000)

for( i in 1:length(s.f)){
for( j in 1:length( getSequence(s.f[[i]][[1]]))){
if( getSequence(s.f[[i]])[[1]][j] == "n") nfsum[j] <- nfsum[j] + 1
}
}

points(nfsum[x], col="red")

I have created a shiny app that implements the above code. Download it here.

Installation

Edit your channels.scm file to include the labsolns channel

Once edited:


$guix pull
$guix package -i seqeval
$source $HOME/.guix-profile/etc/profile

##run the bash script

$ seqeval.sh