Using FeatureData[Taxonomy] artifact in QIIME2?

emmzee · October 15, 2019, 1:34am

Hi all,

I have finally decided to jump on the hype train and use QIIME 2 to analyze some data. So far, I am a fan of QIIME 2 and all it can do. I have been following the Overview tutorial, but I have run into a few issues and can't seem to find the right place to get my questions answered.

To give you a little background: I am analyzing 18S amplicon sequences. I generated my .fastq file on an Ion Torrent PGM, demultiplexed in QIIME 1 with split_libraries.py, imported my .fna sequences file into QIIME 2, dereplicated my sequences, and used classify-sklearn against the SILVA 132 release to assign taxonomy. Here are my concerns:

1. Is it possible to get a per-ASV table (with assigned taxonomy, similar to the QIIME 1 format) from my FeatureData[Taxonomy] artifact? In other words, how can I convert my artifact to show the original ASVs that were collapsed to form the taxonomy I see after performing a taxonomic visualization (e.g., with bar plots)?
2. Is it possible to 'normalize' my table for uneven sequencing depth (whether or not it is converted to the format above)? I believe it's possible to export the artifact into an editable form and change it manually, but is there a way to do that in QIIME 2?
3. I originally used sample IDs with underscores. Can I change them to a different character in QIIME 2 within the generated FeatureData[Taxonomy] artifact?

Thank you for your time and patience. Please let me know if I can provide any more detail or clarify any of my questions.

Nicholas_Bokulich · October 15, 2019, 10:52pm

Welcome to the hype train, @emmzee!

You came to the right place.

Answers to your questions:

Use qiime metadata tabulate --m-input-file taxonomy.qza

Use qiime feature-table rarefy

But note that QIIME 2 methods generally perform their own normalization internally (e.g., the qiime diversity core-metrics pipelines rarefy as an initial step), so rarefying on your own is generally not needed unless you are exporting for external use (or, I suppose, you might want this for your taxonomy barplot?).
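
To make the idea behind rarefying concrete, here is a minimal Python sketch of subsampling each sample to an even depth without replacement. It only illustrates the concept and is not a description of what qiime feature-table rarefy does internally; the toy counts, the function name `rarefy_sample`, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter


def rarefy_sample(counts, depth, seed=0):
    """Subsample one sample's feature counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads; in this sketch
    such samples are simply dropped.
    """
    # Expand counts into one entry per read, e.g. {"ASV1": 2} -> ["ASV1", "ASV1"].
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


# Hypothetical toy table: sample ID -> {feature ID -> read count}.
table = {
    "S1": {"ASV1": 50, "ASV2": 30, "ASV3": 20},
    "S2": {"ASV1": 5, "ASV2": 10},
    "S3": {"ASV1": 500, "ASV3": 500},
}

rarefied = {sid: rarefy_sample(c, depth=100) for sid, c in table.items()}
```

After this runs, every retained sample has exactly 100 reads, and S2 (15 reads) maps to None because it falls below the chosen depth.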

There is a way to change the IDs in the table, but not the taxonomy. You will need to export the taxonomy, relabel, and re-import to QIIME 2.
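
As a rough illustration of the "export, relabel, re-import" route, the following Python sketch rewrites the ID column of an exported taxonomy TSV. It assumes the exported file is tab-separated with a header row and the IDs in the first column; the function name `relabel_ids`, the example IDs, and the choice of "." as the replacement character are all hypothetical.

```python
import csv
import io


def relabel_ids(tsv_text, old="_", new="."):
    """Replace `old` with `new` in the first (ID) column of a TSV, keeping the header."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # Only the ID column is changed; taxon strings keep their underscores.
    relabeled = [[row[0].replace(old, new)] + row[1:] for row in body]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(relabeled)
    return out.getvalue()


# Hypothetical exported content; the IDs and taxa are made up.
exported = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv_001\tD_0__Eukaryota;D_1__SAR\t0.98\n"
    "asv_002\tD_0__Eukaryota;D_1__Opisthokonta\t0.91\n"
)

relabeled_tsv = relabel_ids(exported)
```

Any relabeling of the taxonomy has to be mirrored in the feature table, otherwise the IDs in the two artifacts will no longer match.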

Now some answers to the questions you did not ask:

why not demux in QIIME 2?

No denoising or at least OTU clustering? Simply dereplicating is likely to leave in lots of noisy reads.

I hope that helps!

emmzee · October 16, 2019, 12:17am

Thank you for your detailed response, @Nicholas_Bokulich!

> Use qiime metadata tabulate --m-input-file taxonomy.qza

This is extremely useful, so thank you for suggesting it. Using the visualization output, I can find the Feature ID, its associated confidence level, and the assigned taxon. Is it possible to show the distribution of each feature ID across all samples in my study?

> Use qiime feature-table rarefy
>
> But note that QIIME 2 methods generally perform their own normalization on board (e.g., qiime diversity core-metrics pipelines rarefy as an initial step in the pipeline) so rarefying on your own is generally not needed unless you are exporting for external use (or I suppose you might want this for your taxonomy barplot?)

What about rarefaction-free techniques? For example, if I'm performing a PCoA and I'd prefer a cumulative or total sum scaling method, would that be an option in QIIME 2? I assume this is still in development, as suggested in this thread, and that currently the best option is to perform this outside of QIIME 2.

> why not demux in QIIME 2?

Are there any advantages to doing that? I intended to use QIIME 2 a long time ago, but it was less clear how to use Ion Torrent generated data in the pipeline. I still find QIIME 1 more straightforward, as I can easily control parameters like minimum sequence length, account for variable barcode lengths, allow for primer mismatches, and trim reverse primers.

> No denoising or at least OTU clustering? Simply dereplicating is likely to leave in lots of noisy reads.

While following the overview tutorial, it seemed better to just use straight ASVs for classification, but now I am worried that I did not denoise the samples. You essentially perform "100% OTU clustering" when dereplicating because you generate ASVs, isn't that conceptually correct? Incidentally, I decided to detect and remove chimeras earlier today. Should I denoise first and then perform chimera detection and removal, or am I safe denoising a chimera-filtered artifact?

Nicholas_Bokulich · October 16, 2019, 1:45am

Not for each individual feature; you can get this information from a feature table and plot it in Python, R, etc.
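
For illustration, a minimal Python sketch of combining a feature table with a taxonomy to get a per-feature view across samples, laid out like a classic QIIME 1 table (`#OTU ID`, one column per sample, `taxonomy`). The toy counts, the taxon strings, the function name `annotated_rows`, and the "Unassigned" fallback are assumptions made for the example, and the data are typed in by hand rather than read from exported artifacts.

```python
def annotated_rows(feature_table, taxonomy, samples):
    """Build QIIME 1-style rows: feature ID, one count per sample, taxonomy."""
    rows = [["#OTU ID"] + samples + ["taxonomy"]]
    for feature_id, counts in feature_table.items():
        # A feature that was not observed in a sample gets a count of 0.
        per_sample = [counts.get(s, 0) for s in samples]
        # Features missing from the taxonomy are labeled "Unassigned" here.
        label = taxonomy.get(feature_id, "Unassigned")
        rows.append([feature_id] + per_sample + [label])
    return rows


# Hypothetical toy data standing in for an exported feature table and taxonomy.
samples = ["S1", "S2", "S3"]
feature_table = {
    "ASV1": {"S1": 50, "S2": 5},
    "ASV2": {"S1": 30, "S2": 10, "S3": 12},
}
taxonomy = {"ASV1": "D_0__Eukaryota;D_1__SAR"}

rows = annotated_rows(feature_table, taxonomy, samples)
tsv = "\n".join("\t".join(str(v) for v in row) for row in rows)
print(tsv)
```

Each row of the result can then be plotted directly, one bar or point per sample.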

Sure. We don't have CSS or TSS (this is still an open issue and contributions are very welcome), but for a rarefaction-free method you can check out q2-deicode in the QIIME 2 library: https://library.qiime2.org
Robust Aitchison PCA Beta Diversity with DEICODE
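
Since total sum scaling would have to be done outside QIIME 2, here is a minimal Python sketch of what TSS amounts to: each count is divided by its sample's total, so every sample sums to 1. The toy table and the function name `total_sum_scale` are assumptions made for the example; this is not a QIIME 2 or DEICODE API.

```python
def total_sum_scale(counts):
    """Convert one sample's raw counts to proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty sample.
        return {feature: 0.0 for feature in counts}
    return {feature: n / total for feature, n in counts.items()}


# Hypothetical toy table: sample ID -> {feature ID -> read count}.
table = {
    "S1": {"ASV1": 50, "ASV2": 30, "ASV3": 20},
    "S2": {"ASV1": 5, "ASV2": 15},
}

scaled = {sid: total_sum_scale(c) for sid, c in table.items()}
```

Here S1 becomes proportions of 0.5, 0.3, and 0.2, and S2 becomes 0.25 and 0.75, regardless of the difference in sequencing depth between the two samples.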

The main reason would be to preserve your entire workflow in provenance, so that someone (you, a collaborator, you in 5 years, etc.) can figure out the entire workflow used to generate any QIIME 2 output. But functionally speaking, no, QIIME 2 does not do anything special during demux that QIIME 1 did not do. I suppose the other reason is that you can get support on this forum if you run into trouble, whereas QIIME 1 is no longer officially supported (though the QIIME 1 forum still exists). Speaking of which...

We don't have any official tutorials for Ion Torrent data, but there's a good deal of past Q&A on this forum that may be a good place to figure it all out. And of course you can always just ask to see if anyone knows.

In a sense yes — but denoising also filters out (and in the case of dada2 attempts to correct) errors in the reads.
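
A small Python sketch may help show why dereplication alone leaves noise in. Dereplicating only collapses exact duplicates, so a read that differs by a single base (possibly a sequencing error) survives as its own feature. The reads below are made up, and the function name `dereplicate` is hypothetical; this is not how vsearch is implemented.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundance counts."""
    return Counter(reads)


# Hypothetical reads: the third differs from the first by one trailing base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACGTACGT", "TTGCA"]

unique = dereplicate(reads)
for seq, count in unique.most_common():
    print(seq, count)
```

The one-off variant "ACGTACGA" is kept with a count of 1 next to its abundant neighbor; deciding whether such a variant is real or an error is the part that denoising adds.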

Denoise first. The denoisers in QIIME 2 have chimera checking built in, so you do not need to run chimera removal separately.

tabulate can actually operate on all sorts of QIIME 2 artifacts (as well as TSV metadata files) because many QIIME 2 artifacts can be read as metadata. Have fun with that method!

I hope that all helps!

emmzee · October 16, 2019, 3:08am

> denoise first. The denoisers in QIIME 2 have chimera checking built in, so you do not need to.

I have one (hopefully) last question regarding this. I imported my .fna file from QIIME 1 as an artifact of type SampleData[Sequences]. In the tutorial I have been following, the only path from there is to dereplicate my sequences using vsearch, as I have done. What is the 'correct' path to denoise those sequences, given either a dereplicated sequence set or a raw (not dereplicated) sequence set of type SampleData[Sequences]?

Nicholas_Bokulich · October 16, 2019, 3:27am

Oh good question — I had forgotten this and you have provided a better answer than I could to this question:

Indeed, the advantage of demultiplexing in QIIME 2 is that you keep the quality scores: demultiplexing with QIIME 1 prevents you from denoising in QIIME 2, since the denoisers need the quality scores as inputs and QIIME 1 drops them at demultiplexing.

So you either need to:

1. go back and demultiplex with QIIME 2
2. use OTU clustering in QIIME 2 (that's better than just dereplicating, but not as good as denoising)
3. use deblur outside of QIIME 2 (standalone deblur is installed along with QIIME 2 and does not need quality scores, so you can use it on these data and then import the biom table and representative sequences to QIIME 2 for further analysis).

So maybe that won't be the last question regarding this but that is okay, that's what we are here for.

emmzee · October 19, 2019, 10:57pm

> So you either need to:
>
> 1. go back and demultiplex with QIIME 2
> 2. use OTU clustering in QIIME 2 (that's better than just dereplicating, but not as good as denoising)
> 3. use deblur outside of QIIME 2 (standalone deblur is installed along with QIIME 2 and does not need quality scores so you can use it on these data then import the biom table and representative sequences to QIIME 2 for further analysis).
>
> So maybe that won't be the last question regarding this but that is okay, that's what we are here for.

Thank you for your support. I'm going to try solutions 1 and 3, starting with solution 1. The most challenging step for me is importing the data, as I'm not sure which method is best. I have one fastq with all the samples and their associated barcodes/amplicon library. I converted it with QIIME 1 into .fna and .qual (not sure if this is possible in QIIME 2). The barcodes are of variable length; can they somehow be imported into QIIME 2?

Edit: What I also wanted to ask is: if I convert from fastq to .fna and .qual and do not end up using the quality file (.qual), am I missing out on important information?

Nicholas_Bokulich · October 21, 2019, 3:39pm

Do you have a separate fastq of the barcodes for each sequence? If yes, you have EMP format sequences and this is described in the importing tutorial on qiime2.org

If not, are your barcodes contained within the sequence (e.g., at the 5' end of each sequence)? If yes, see the q2-cutadapt tutorial for guidance on importing and demultiplexing.

Don't convert to .fna and .qual! If you want to demultiplex in QIIME 2, you will need to keep your data as fastq.

I don't think the barcode length matters as long as each sample has a unique barcode, but let us know if you run into any issues related to this.
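
To illustrate what demultiplexing on 5' barcodes of different lengths involves, here is a minimal Python sketch that assigns a read to a sample by exact prefix match, trying longer barcodes first. It is not what q2-cutadapt does internally; the barcodes, sample IDs, and the function name `assign_sample` are made up for the example, and real tools also tolerate mismatches.

```python
def assign_sample(read, barcodes):
    """Return (sample ID, read without barcode), or (None, read) if nothing matches.

    `barcodes` maps barcode sequence -> sample ID. Longer barcodes are tried
    first so that a short barcode cannot shadow a longer one sharing its prefix.
    """
    for barcode in sorted(barcodes, key=len, reverse=True):
        if read.startswith(barcode):
            return barcodes[barcode], read[len(barcode):]
    return None, read


# Hypothetical variable-length barcodes.
barcodes = {"ACGT": "S1", "ACGTTG": "S2", "GGA": "S3"}

assigned = [assign_sample(r, barcodes) for r in ["ACGTTGCCC", "ACGTAACC", "TTTT"]]
```

Note that "ACGT" is a prefix of "ACGTTG" in this toy set; such overlaps are where variable-length barcodes can become ambiguous, which is why the longest match is tried first here.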

Yes, you would be missing out on lots of information, which is why you want to keep your data as fastq. After demultiplexing you can denoise your data with dada2, which will use the quality scores (as part of the fastq) to figure out the error rate and use that to correct error-riddled sequences.
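
As a small illustration of what is lost when only the sequences are kept, this Python sketch parses a FASTQ record, decodes its per-base quality scores, and then writes the FASTA equivalent. It assumes four-line records and Phred+33 encoding; the record itself and the function names `parse_fastq` and `to_fasta` are made up for the example.

```python
def parse_fastq(text):
    """Yield (read name, sequence, list of Phred scores) from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Phred+33 (assumed): quality = ASCII code of the character minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def to_fasta(records):
    """Write records as FASTA; the quality scores have nowhere to go."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _quals in records)


# Hypothetical record with two good bases and two poor ones.
fastq = "@read1\nACGT\n+\nII5#\n"

records = list(parse_fastq(fastq))
fasta = to_fasta(records)
```

In this record the last base has a quality of 2, while the first two have a quality of 40; once the data are reduced to sequences only, all four bases look equally trustworthy.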

Good luck!

emmzee · October 21, 2019, 5:21pm

Thank you again for the detailed response!

After hours of frustration, I failed to import my fastq file, but I managed to request separate fastq files for each sample from our sequencing technician, which were given to me in the form of unique fastq files for each barcode. This helped me avoid all the issues and import my sequences in single-ended form using a manifest file, demultiplex and use my primer sequences (and adapter/ reverse primer) to select my amplicons from the shared barcodes, and quality filter my sequences.
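
For readers in the same situation, a minimal Python sketch of building a single-end manifest from one fastq file per barcode. The header assumes the tab-separated manifest layout with `sample-id` and `absolute-filepath` columns; please verify it against the importing tutorial for your QIIME 2 version, since the layout has changed between releases. The file paths, the function name `build_manifest`, and deriving the sample ID from the file name are all assumptions made for the example.

```python
import os


def build_manifest(fastq_paths):
    """Return manifest text with one line per per-sample fastq file."""
    lines = ["sample-id\tabsolute-filepath"]
    for path in sorted(fastq_paths):
        # Assumption: the sample ID is the file name up to the first dot.
        sample_id = os.path.basename(path).split(".")[0]
        lines.append(f"{sample_id}\t{os.path.abspath(path)}")
    return "\n".join(lines) + "\n"


# Hypothetical per-barcode files as delivered by a sequencing facility.
paths = ["/data/run1/barcode02.fastq", "/data/run1/barcode01.fastq"]

manifest = build_manifest(paths)
print(manifest)
```

Writing the returned text to a file gives something that can be passed to the import step; the files themselves are not read or checked by this sketch.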

After sequence filtering, it seems I can only retain 10% of my sequences. I believe my sequence quality scores are generally low, but I'm working on fixing this by experimenting with various parameters, using previous posts on the forum as a guide. I will try solution #3 separately in the future and edit my post in case this ends up being used.

Nicholas_Bokulich · October 21, 2019, 6:00pm

Feel free to open a new topic on the forum if you run into trouble. I agree, existing posts probably have the solution (== truncate your reads more to remove low-quality tails).

Glad you could get the data in demultiplexed format! That saves you time and trouble.
