Possible Analysis Pipeline for Ion Torrent 16S Metagenomics Kit Data in QIIME2?

Nicholas_Bokulich · April 16, 2020, 3:46pm

Hi @rparadiso,
I am pinging @cjone228 to describe her pipeline decisions better than I can! But here are my thoughts:

I think the issue with using mafft here is that the amplicons are targeting different sections of the 16S, so mafft will not be able to align those properly... fragment-insertion will insert those fragments into a reference tree built on full 16S sequences, so will accommodate this multi-amplicon design.

I think you've already seen @cjone228's description above (I'm just quoting here to give context for others following along!)

Having multiple amplicons means that counting ASVs will lead to highly inflated measures of richness (e.g., if you have 7 different amplicons in the pool, then richness scores will be 7X the actual richness). A phylogenetic alpha-diversity method like Faith's PD would be unaffected by this issue, but all other metrics will be. A few workarounds:

You could collapse your feature table on taxonomy and then calculate alpha diversity metrics. Ideally, this will "collapse" amplicons that hit the same reference sequence but at different sections of the 16S. It could still run into issues if some amplicons assign more deeply than others (we know this to be a problem!) so alpha diversity could still be inflated. One possibility is to use a taxonomy assignment method like classify-consensus-blast and take only the top hit, and only use these results for calculating alpha diversity; similar to using sepp the hope is that these would "splice" into the same reference taxonomy.
use closed-reference OTU clustering to cluster sequences into full-length reference sequences. Once more, we are "splicing" the amplicons from different regions into single full-length sequences. The problem: closed-ref OTU clustering has its own issues you are probably already familiar with (and if not you can read about these more in the tutorials at qiime2.org)
Lazy methods: the multiple amplicon issue will affect all samples evenly, so your richness estimates will be highly inflated in all samples, presumably at an even rate. If sequence depth is high enough, you could just carry on as normal, based on the assumption that this is a bias impacting all samples so you can compare within the same study using conventional alpha diversity metrics, but not between studies (since the richness is inflated!). Another option: try dividing observed richness by the number of amplicons to get an estimate of the richness you'd observe with a single amplicon. You could also try filtering out individual amplicons to see what your richness looks like with each individual amplicon to see how well this method works... getting some external validation (compare to other studies) is also recommended if you go that route.

Just a few ideas to think over! Interested to hear what @cjone228 and others think. I hope that helps.

rparadiso · April 16, 2020, 4:55pm

Hello Nicholas,

Should this be done at every taxonomic level?

Nicholas_Bokulich:

Lazy methods: the multiple amplicon issue will affect all samples evenly, so your richness estimates will be highly inflated in all samples, presumably at an even rate. If sequence depth is high enough, you could just carry on as normal, based on the assumption that this is a bias impacting all samples so you can compare within the same study using conventional alpha diversity metrics, but not between studies (since the richness is inflated!). Another option: try dividing observed richness by the number of amplicons to get an estimate of the richness you’d observe with a single amplicon. You could also try filtering out individual amplicons to see what your richness looks like with each individual amplicon to see how well this method works… getting some external validation (compare to other studies) is also recommended if you go that route.

It does not seem according to me a viable way as you should assume that all primers have the same efficiency in reaction, but we know that it is unlikely to be so.
Some time ago, in addition, I read of a job that must be done in which it seems that one region rather than the others are more "specific" (let me pass the term) for some taxes compared to others.

Thanks Nicholas for your answear!

Nicholas_Bokulich · April 16, 2020, 5:11pm

Only the level you are interested in, e.g., do you want to count the number of observed species? observed genera? I'd say probably only species level.

Good point. If all primers are used in a single PCR reaction and you assume that amplification efficiency varies between primers, then this could further complicate alpha diversity measurements.

KMaki · April 16, 2020, 6:06pm

Thanks so much for the detailed response! The range of amplified hypervariable regions mapped to the F and R primers on the ecoli genetic sequence is between 216 to 296. Are limits of 100-400 good in your opinion? I removed the --trunc flag from the greengenes extract option and kept the min and max length options consistent but it didn't change the % of unclassified taxonomy for V3 (currently between 20-25% for my mock sequences). The V2 region performed very well (about 98% classified) with the same parameters which I think is interesting.

I planned next to tweak the options in the vsearch classifier but wanted to clarify your suggestions:

In the qiime2 docs, under the qiime feature-classifier classify-consensus-vsearch command the following option is available:
--p-perc-identity PROPORTION Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.
[default: 0.8]
I don't see a p-perc identity without proportion on the docstring. The default for this is 0.8 so wouldn't lowering this value potentially get me more leniency and classified taxa?

Is one of these below --maxaccepts options what you mean by 'max-hits'? For the first option, setting this value to 1 would probably decrease my change of getting a sequence classified right?
--p-maxaccepts VALUE Int % Range(1, None) | Str % Choices('all') [default: 10]
--p-maxhits VALUE Int % Range(1, None) | Str % Choices('all') Maximum number of hits to show once the search is terminated. [default: 'all']

For the tophits options, it seems like its just a true/false right? Should I consider changing this to false?
--p-top-hits-only / --p-no-top-hits-only
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1. [default: False]

Thanks again for your help and suggestions on this!
Katherine

Nicholas_Bokulich · April 16, 2020, 6:16pm

That's strange... could indicate that the primers you used for extraction might be missing things in the reference database.

yes, it gives you more leniency but this should be balanced with max-accepts (or use top-hits-only) to avoid putting too many potential hits into the consensus.

maxaccepts controls how many hits VSEARCH accepts (using kmer matching) before running an alignment, and maxhits controls how many successful alignments it reports (handing off to q2-feature-classifier for consensus classification). Setting either of these to 1 will not decrease your chances of classifying a sequence, but will increase the likelihood of overclassifying (because you are just taking the top hit, so consensus classification is not performed)

It is false by default, so you would change to true to take only top hits.

Good luck!

cjone228 · April 16, 2020, 6:50pm

Hi everyone!!

@rparadiso First of all, thanks for the kind words and for contributing to our discussion!! Funnily enough, we were talking about some similar issues in our lab meeting this morning and were wondering about some related things... we understand that because we are sequencing multiple variable regions at once we must incorporate phylogeny in our core metrics analyses. We also understand the concept of the artificially increased richness in alpha diversity if you do not include phylogeny, but we were all a bit confused on how using an insertion tree will account for this:

@Nicholas_Bokulich Why is this? We were having trouble articulating the why to our PI.

After reading your reply to @rparadiso, several other questions sparked in our minds. A lot of what you explained were concepts we had thought of in our non-bioinformatician brains but hadn't been able to fully articulate...

So does this mean that sequences from different variable regions but from the same bacteria will be placed together on the reference tree when using sepp-fragment-insertion? Even though the ASVs themselves are very different in sequence?

As for the workarounds, we have questions/comments for each:

Is this essentially saying to do taxonomic classification early on in your analysis pipeline (ie. sort of in lieu of phylogeny?) and using this to take care of the richness inflation issue? And if so, for collapsing taxonomy, what would one do if one variable region identifies down to the species and another only identifies down to genus? Would you decide to only collapse down to genus to incorporate as much info as possible? Or when you say "take only the top hit" do you mean the top % ID on BLAST?

Putting all of the variable regions together into one big 16S gene is something we had considered a while back but didn't really know how to accomplish. We had considered SMURF, but @jwdebelius had replied to us and said it basically wasn't a viable option:

In our current pipeline, we use DADA2 denoise-pyro, which as we understand it creates ASVs as opposed to OTUs. Therefore, would we even be in a position to be able to perform closed-reference OTU picking, or would we need to use a different method of denoising that creates OTUs? Also, is closed-reference OTU clustering another way of accomplishing what we wanted to do with SMURF (put all variable regions together into one consensus 16S sequence), and if so, how does one do that?

We have noticed using Ion Reporter software, which breaks down results per primer, that the breadth and depth of taxonomic classification per variable region varies quite dramatically (ie. V9 does very poorly, V4 does well, etc.). Therefore, we unfortunately aren't sure that this workaround would really be sound...

Finally, the above comment implies separating the different variable regions if we are interpreting this correctly - is there a way to do this without knowing the exact primer sequences (as is the case in our situation)?

Sorry for the monstrously long post, and thanks in advance for your valuable advice! :qiime2:

Nicholas_Bokulich · April 16, 2020, 7:10pm

Phylogenetic methods would be mostly unaffected, because the different amplicons — by splicing into the reference tree — will essentially be collapsed by phylogeny, so distinct amplicons that splice to the same point in the tree will not exaggerate diversity estimates.

Exactly! If you try to align these individually (e.g., with mafft), they will probably not align at all, making a totally wacky phylogeny. But splice them into a reference tree and different amplicons from the same species will be "collapsed" into the correct position.

Yes. Not in lieu of phylogeny, but in addition (as is typical). Still use Faith's PD, but collapsing on taxonomy may be a way to still measure non-phylogenetic metrics like richness or Shannon. It is not a perfect fix but worth a try!

the hope is that you will get similar phylogenetic resolution for amplicons coming from the same species, which probably is not true. This is not a perfect approach but worth a try!

Collapsing at genus level is probably more reliable, but I am not sure I have not tested. I am only brainstorming here!

Set maxaccepts to 1 for classify-consensus-blast or maxhits to 1 for classify-consensus-vsearch

qiime vsearch cluster-features-closed-reference. See the tutorials and docs at qiime2.org for usage details.

This aligns them to a reference, unlike SMURF, which attempts to stitch them together de novo (i.e., reference free alignment)

You can denoise, then cluster into OTUs, that is fine.

No, it is not creating a consensus or stitching together an alignment, it is aligning to a full-length reference sequence.

Yes, you can extract variable regions from greengenes using known primers, then use qiime quality-control exclude-seqs to filter out those domains (ideally there should be enough overlap that this works but as with everything else I am suggesting possible workarounds and have not worked with this type of data so this should not be considered a definite fix! )

I hope that helps!

colinbrislawn · April 16, 2020, 8:08pm

Nick already answered this, but here's how I would summarize it for my PI:

Faith's Phylogenetic Distance (PD) reports branch length covered by the ASVs.
While sequencing more regions would lead to more ASVs, the phylogenetic branches covered by all these regions should be similar.

Similarly, you could merge ASVs by taxonomic annotation, then report alpha diversity of species found. This would be 'collapsed by taxonomy', compared to Faith's PD 'collapsed by phylogeny' that Nick described.

This has been a great thread! The challenges of multiple regions gives us a good chance to discuss underlying methods and limitations of the field.

jwdebelius · April 17, 2020, 2:45pm

cjone228:

Putting all of the variable regions together into one big 16S gene is something we had considered a while back but didn’t really know how to accomplish. We had considered SMURF, but @jwdebelius had replied to us and said it basically wasn’t a viable option:

jwdebelius:

I’ve worked some with SMURF over the past year. The short answer is that as of February 2020, SMURF is probably not a viable solution.

Let’s ignore the run-ability of the current Matlab code (and the that has filled my swear jar as a consequence), and just focus on the core issue.

You need primers to extract your reference region. Not just the knowledge of the region but the exact primer pairs and read length. Someone who knew the regions could do the theoretical extraction for you and then trim the primers, and then you could theoretically align all your reads against each region in a computational expensive but ultimately semi effective system…

In our current pipeline, we use DADA2 denoise-pyro, which as we understand it creates ASVs as opposed to OTUs. Therefore, would we even be in a position to be able to perform closed-reference OTU picking, or would we need to use a different method of denoising that creates OTUs? Also, is closed-reference OTU clustering another way of accomplishing what we wanted to do with SMURF (put all variable regions together into one consensus 16S sequence), and if so, how does one do that?

DADA2 works well for this, as far as I understand. However, re-clustering the OTUs from the mixed samples is likely not a viable solution. In SMURF, you re-normalize by region coverage and relative abundance (this may be less of a problem here, but let's assume some stochasticity). So, if I have a region that's poorly covered in my database, SMURF accounts for that in the calculation.

Im working on a re-implementation, and I know I keep giving a timeline, but all my work has gone sideways. So, I think the current solution is a good one, and I will let you know when I have q2-sidle (SMURF Implementation Done to acceLorate Effeciency) working the way it needs to be to go live.

Nicholas_Bokulich · April 17, 2020, 7:31pm

Hey @cjone228 and @rparadiso,
I have chatted with @jwdebelius and @colinbrislawn on the sidelines to agree on the best approach and to clarify some of our back-and-forth above.

It sounds like right now @jwdebelius and @colinbrislawn and I are all in favor of closed-reference OTU clustering as a potential solution, because it allows you to "collapse" these mixed amplicons by aligning them against full-length sequences; this is distinct from actually clustering OTUs de novo, which as @jwdebelius mentioned above would not work with mixed amplicons. This has not been benchmarked, though, so proceed with that in mind.

I want to reiterate that I have not personally worked with Ion Torrent data, and the solutions I described above to @rparadiso have not been tested... so those are not authoritative solutions, just ideas for improving alpha diversity estimates (because we can all agree that the mixed amplicons create a major challenge for alpha/beta diversity estimates with conventional approaches!).

Thanks all for the great discussion! One of these days I really hope this coalesces in a community tutorial on best practices for analyzing ion torrent 16S metagenomics kit data!

colinbrislawn · April 17, 2020, 8:39pm

This might be common knowledge already, but I wanted to articulate why I think close-ref OTU picking could be a good fit for multi-region-studies

Three high-level strategies for defining OTUs... are canonically described as de novo, closed-reference, and open-reference OTU picking... Each of these methods has benefits and drawbacks.
...
In closed-reference OTU picking, input sequences are aligned to pre-defined cluster centroids in a reference database. If the input sequence does not match any reference sequence at a user-defined percent identity threshold, that sequence is excluded. (peerj, 2014).

This is essentially 'counting database hits' so

resulting OTUs are 100% biased by the database
resulting OTUs are 100% consistent with the database
resulting OTUs are literally just the ones from the database

Modern ASV methods aim to be just as consistent without introducing database bias, but for this project we are knowingly using this strong bias to normalize across regions.

Let us know what you find!

Colin

P.S. You could get some popcorn and read this flaming review of closed-ref clustering, or don't because we use ASVs now!

Nicholas_Bokulich · April 17, 2020, 9:22pm

Thanks for clarifying @colinbrislawn! I agree, and I also generally avoid closed-ref OTU clustering in the modern era if I can, but this ion torrent kit seems like one of the special case where I think the strengths (collapsing disparate amplicons) could outweigh the weaknesses (database bias, reduced resolution vs. ASVs) so there are times when I still use and advocate closed-ref OTU clustering.

One thing to note: we have known for a long time that OTU clustering on its own leads to inflated diversity estimates and needs to be paired with other filtering (or denoising!) methods to reduce those errors, and closed-ref OTU clustering on its own suffers from the same issues.

However, in this case you are using closed-ref OTU clustering after denoising. So the contents of that flaming review do not really apply here... erroneous sequences are being filtered/corrected by your denoising method of choice, then closed-ref OTU clustering is being used strictly to "collapse" the ASVs into full-length 16S sequences, not as a pseudo-error-filtering method. This should all still be benchmarked to see how this performs for this ion torrent kit (@cjone228's mock communities will enable that endeavor!) but that review should not discourage this analysis.

rparadiso · April 23, 2020, 1:56pm

Hello to all,
I'm trying this pipeline on my data. This is the step (as suggeste above):
1- Import data
2- demux
3- dada2 denoise-pyro
4- qiime vsearch cluster-features-closed-reference
5- qiime fragment-insertion sepp

In the last step (5) I had an error message...

This is the script that I used:
qiime fragment-insertion sepp
--i-representative-sequences rep_seqs_cr_99.qza
--i-reference-database sepp-refs-gg-13-8.qza
--output-dir Sepp_feces
--p-threads 1

Plugin error from fragment-insertion:

Command '['run-sepp.sh', '/var/folders/nk/qyjfz4t11vn5vnx07bkphqrw0000gn/T/qiime2-archive-70i_7a8b/2d6587b0-3345-4d60-b53a-de3e4054cf84/data/dna-sequences.fasta', 'q2-fragment-insertion', '-x', '1', '-A', '1000', '-P', '5000', '-a', '/var/folders/nk/qyjfz4t11vn5vnx07bkphqrw0000gn/T/qiime2-archive-2wfekjne/a14c6180-506b-4ecb-bacb-9cb30bc3044b/data/aligned-dna-sequences.fasta', '-t', '/var/folders/nk/qyjfz4t11vn5vnx07bkphqrw0000gn/T/qiime2-archive-2wfekjne/a14c6180-506b-4ecb-bacb-9cb30bc3044b/data/tree.nwk', '-r', '/var/folders/nk/qyjfz4t11vn5vnx07bkphqrw0000gn/T/qiime2-archive-2wfekjne/a14c6180-506b-4ecb-bacb-9cb30bc3044b/data/raxml-info.txt']' returned non-zero exit status 1.

Debug info has been saved to /var/folders/nk/qyjfz4t11vn5vnx07bkphqrw0000gn/T/qiime2-q2cli-err-mf2m8927.log

Fist of all the pipeline is correct?
What happened?
I think that the problem is the step 4 because when I use sepp after dada it works.
Can someone help me?

Regards,

Rubina

Lauren · April 24, 2020, 6:30pm

Hi Everyone!

@cjone228 and I have another couple of points we would like clarified:

Our fastq files contain single-end mixed orientation reads (both forward and reverse). We imported our data using SingleEndFastqManifestPhred33V2 according to the QIIME2 the Importing Data Document. However, we recently noticed that the Importing Data Document states "In this variant of the fastq manifest format, the read directions must all either be forward or reverse." Is there another way we should be importing our data? Or is the only solution to re-orient our reads or split based on direction prior to importing?
In the event that we are able to import our fastq files as-is (i.e. in mixed orientation), we wanted to clarify whether or not DADA2 can handle mixed-orientation reads? (Based on what we read we don't think that it can...).

So, overall we are just trying to clarify whether it is inevitable that we will need to split our reads by direction at some point or another.

Thanks!
Lauren

P.S. @rparadiso - you are actually a step ahead of us, so we don't have an answer to your question! Hopefully someone else has some insight for you

Lauren · April 24, 2020, 6:35pm

Oops! I meant to thank @Nicholas_Bokulich, @colinbrislawn, and @jwdebelius for your clarifications on OTU clustering, SMURF, and phylogenetic analysis above - all of your posts were extremely helpful!

Lauren :qiime2:

Nicholas_Bokulich · April 24, 2020, 6:49pm

I am not 100% certain of the semantics there, but I think the point was that reads should not be pre-joined if they are imported in that format..

Your reads are all forward or reverse because in this case F/R mean the read direction on the sequencing instrument, not the orientation respective to the genome (which is mixed in this case).

So you are doing the right thing, and this is the correct format.

dada can handle mixed-orientation reads (respective to the genome), that is not a problem technically speaking. But mixed F + R reads and pre-joined reads will cause issues.

So again you are doing things correctly.

The only issue I can think of for mixed-orientation reads and dada2 is that you will get unique ASVs for reads from the same genome that are in reverse orientations. But in theory that is not a dada2 problem, it is an alpha diversity problem! (as I think we've discussed above but this topic is so long I can't remember anymore).

The issue is that you do not need SEPP, and should not use SEPP here. When you use closed-reference OTU clustering, the features are no longer ASVs that need to be aligned/spliced into a new phylogeny. The features are now the matching reference OTUs, and you adopt the reference phylogeny.

See this issue for more details on why you should not use SEPP after closed-reference OTU picking:

So what should you (and everyone else who wants to use this pipeline) do instead? You should use the reference trees that ship with your reference database of choice (e.g., in your case use the greengenes 99% OTU reference tree since you used that same database for clustering with vsearch).

Thanks everyone! I feel like we're making a lot of progress!

rparadiso · April 27, 2020, 7:42am

Thanks Nicholas for your correction.
Now I'm a little confused about how to continue...
I have to download the tree from Greengenes' database and then how do I continue?
Can I perform in the next step the core metrics?

qiime diversity core-metrics-phylogenetic
--i-phylogeny rooted-tree.qza (green genes tree?)
--i-table table.qza
--p-sampling-depth XXX
--m-metadata-file metadata.tsv
--output-dir core-metrics-results

Best,
Rubina

Nicholas_Bokulich · April 27, 2020, 2:51pm

Hi @rparadiso,

Yes just import the tree that corresponds to the reference sequences that you used (e.g., if you used the 99% ref, use the 99% OTUs tree) as a Phylogeny[Rooted] following the import instructions.

If you run into any subsequent errors with core-metrics, see this topic:

Good luck!

Lauren · April 27, 2020, 4:45pm

Thanks for the feedback @Nicholas_Bokulich!

You are correct - we did cover this earlier in the post

Thinking a little further ahead, if we import and perform dada on our genomic-mixed-orientation reads, do you foresee that the reverse reads will have problems aligning to the reference database when doing closed-reference OTU clustering? (i.e. - will we lose half of our data?)

Thanks as always!
Lauren :qiime2:

Lauren · April 27, 2020, 5:57pm

We may have just answered our question about closed reference OTU clustering - in the QIIME2 documents for closed reference clustering of features, the parameters section has the following option:

--p-strand TEXT Choices('plus', 'both')
Search plus (i.e., forward) or both (i.e., forward
and reverse complement) strands. [default: 'plus']

We assume if we choose 'both' that would allow for mixed orientation reads?

Next, we were looking for the greengenes reference database for the OTU clustering step. We could not find the 13_8 release on the gg website. We found this post and downloaded the file. Is this is the correct file to use?

The file contains the following folders:
Screen Shot 2020-04-27 at 1.48.10 PM

For our closed reference OTU clustering, we need to use one of the xx_otus.fasta files as our reference sequences. Which version should we use - the rep_set or the rep_set_aligned?

Thanks so much!
Lauren :qiime2:
Based on your recommendations above, we would want to use a rooted tree. Which version should we use: XX_otus.tree or xx_otus_unannotated.tree ?