Same classifier, same otu, different sequencing run = different classification

shira · May 5, 2018, 12:48pm

@BenKaehler @Nicholas_Bokulich
Thank you for your explanations and suggestions, I really appreciate it!

The library preparation for the problematic run was outsourced, so I don't have much detail about how this was done.

You suggested I run dada2 statistics - how do I do that, is it the same as --verbose? From just looking at the number of reads I get out of the dada2 run it is clear that when I trim more of the reverse reads the problem of 50% 'bad' reads goes away, along with 50% of the total reads, so this sits well with your explanation.

I would be grateful if you could look at my quality maps and suggest how to do the trimming. The amplicon is around 450bp. I will send you the file but also paste an image here so this can perhaps be useful for others too.

Ideally I would test various trimming strategies, but even when I do this on just one sample, it takes a long time. Can I just test dada2 on a subsample of a sample, or will this skew the results too much?

34%20PM
54%20PM

Nicholas_Bokulich · May 7, 2018, 2:14pm

It would be useful to get this information from your collaborators or service lab, if you can. That's probably easier than trying to retrospectively figure out if the reads are in mixed orientations.

Yes, --verbose for the 2018.2 and older releases of QIIME2. As of the 2018.4 release, this summary is output to --o-denoising-stats.

You need a minimum of 20 bp overlap for good merging of reads.
Depending on how desperate I am to get reads to merge/get reasonably long reads, I usually trim where quality begins dropping off (~260 nt forward, ~180 nt reverse) or at least where the black boxes in those qual plots begin to drop below Q=20 (~280-300 nt forward, ~220 reverse).

I would go with the latter in this case: the forward reads look really nice and you can probably trim at 301 nt if you wanted to (I might do 270 personally). Trim reverse by 220. This will give you enough overlap.

Yes, you can definitely do a subsample. How much of a subsample is needed to avoid biasing results I cannot say... but 10% of reads is probably a reasonable subsample.

I hope that helps!

shira · May 8, 2018, 12:21pm

Thank you for answering all of my many questions!
I would like to share some test runs I did, in case they may be helpful to someone. I managed to use one complete sample (187,000 reads based on demux summary) without sub-sampling it because the --p-n-threads 0 \threads really accelerated things. Here are the results:

fwd rev (overlap) #reads
301 223 (74) 56,947
301 200 (51) 59,562
301 190 (41) 60,169
301 180 (31) 51,548
294 190 (34) 54,969
291 180 (21) 100,374***
277 223 (50) 131,416***

***denotes parameters that gave rise to the 'flipped reads' taxa problem discussed above.

Nicholas_Bokulich · May 8, 2018, 9:21pm

Thanks for sharing your results @shira!

Very interesting... so it looks like for your purposes the dada2 runs that give fewer reads are preferable. Having dada2 toss out reads is not a bad thing unless if you are losing too much data... usually it means getting rid of junk, which is probably the case here.

shira · May 9, 2018, 3:20am

Thanks @Nicholas_Bokulich!
In my case it was easy to figure out that some denoise runs retained, or generated junk amplicons, since I had that strange taxonomy appear. However - is there a more direct method to look at the output of the dada2 run to see the quality of the amplicons, similar to the quality profile that you get after demultiplexing?

Nicholas_Bokulich · May 9, 2018, 11:41pm

Good question. I am not sure I have a great answer!

Looking at the dada2 summary and using feature-table summarize are going to be the best ways to evaluate any sort of output table. But that doesn't necessarily give you insight on "quality" (in the sense of accuracy) — though it does give you a sense of quality, e.g., for sequence coverage, number of chimera being removed, etc.
If you are analyzing a mock community or other sample with a known sequence composition, you can determine how close your denoised ASVs are to your expected sequences, using quality-control evaluate-seqs. See this tutorial.
Again, if you have samples with known composition (mock or simulated samples), you can evaluate composition to determine how close your observed vs. expected sample compositions are.

I hope that helps!

shira · June 9, 2018, 1:16pm

@Nicholas_Bokulich

It seems like there is no escape, and I will have to deal with the actual problem I have of mixed orientation reads. I would be grateful if you could direct me to other posts or relevant info on the subject. Many thanks!

Nicholas_Bokulich · June 12, 2018, 12:41am

Since you have mixed-orientation reads, I would recommend using the vsearch or blast-based classifiers in q2-feature-classifier, which can handle mixed-orientation reads (i.e., these perform both forward and reverse alignment against your reference sequences). We do not yet have a :qiime2: method for re-orienting reads, so just using an appropriate classifier would be the easiest thing to do — and those classifiers do perform quite well also.

There may be ways to perform read re-orientation outside of QIIME2, e.g., see this post for another user's reorientation script. However, I would personally not go that route, if only to avoid going outside of QIIME2 (to preserve provenance, avoid formatting snafus, etc).

Let me know if you have any more questions or concerns!

shira · June 12, 2018, 11:28am

Thanks for your quick reply @Nicholas_Bokulich.

If I understand correctly, what you suggest I do is process the raw data as you would normal paired-end reads, and only change the way the taxonomy is determined. Is it OK to assume that dada2 will deal well with the situation of mixed reads? Would you try and avoid a company that provides the raw data in mixed orientation format, if so, what are the reasons?

Following your suggestions I tried using vsearch. I could not find a tutorial to follow, so I came up with the following code. It's still running, I hope it works (I have a backslash \ at the end of each line, but its not showing here for some reason).

qiime feature-classifier classify-consensus-vsearch
--i-query rep-seqs.qza
--i-reference-reads vsearch_classification/99_otus.qza
--i-reference-taxonomy vsearch_classification/99_otu_taxonomy.qza
--p-perc-identity 0.97
--p-min-consensus 0.51
--p-maxaccepts 10
--p-threads 0
--o-classification vsearch-taxonomy.qza

Your help is much appreciated!

Nicholas_Bokulich · June 12, 2018, 7:45pm

Correct

Yes

However, this will inflate the number of features, altering (especially) alpha and (probably) beta diversity results based on ASVs. If that's not a problem for your experimental goals, then there's nothing to worry about.

If that is a problem, reorienting your reads prior to importing to QIIME2 may be a better choice, e.g., using the script in the post I linked to above. Or wait until we provide support for this in QIIME 2 (we will post here when that feature is added in a future release).

This thread may also contain some useful advice for now.

Not necessarily. There is nothing "wrong" with what they are doing, it's just unusual and hence we don't have the functionality required to process those reads in a streamlined way in QIIME 2 . We will in the future because a few users have asked

Yes, that should do it . If it's still running, it is still working! It will cause an error if you run out of memory or some other issue, so don't worry for now.

put code between lines containing three backtick (```) characters to make it appear like code on the page, and prevent reformatting.

I hope that helps!