Possible Analysis Pipeline for Ion Torrent 16S Metagenomics Kit Data in QIIME2?

Hi @Lauren and @cjone228,

@KMaki and I are working with the same kind of data as you, and we wanted to ask: after running DADA2, what percentages did you all get in the "percent of input passed filter" column of the denoising stats .qzv?

We see that the nice Parkinson's Mouse tutorial gets an average of about 95%. Our Ion Torrent data averages anywhere from about 0.2% up to about 30% at best across all runs. What did you all see here after this step?

Thank you!
Jen

Ouch!

Try increasing the trimming parameters to remove low-quality starting and ending positions.

Also, just to check, are you using denoise-pyro? The PD Mouse tutorial uses Illumina data, which calls for different settings than pyrosequencing-style data like Ion Torrent.

Colin

2 Likes

Hi Jen! Sorry to hear you are having this issue - I just finished the DADA2 step for my samples, which span 4 separate chips, and for each of them roughly 70-80% of the reads pass the filter. This was true for @Lauren's samples too.

Here is an example of our code:

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs ~/work/Carli/Chip3/Chip3_demux.qza \
  --p-trim-left 15 \
  --p-trunc-len 0 \
  --o-table ~/work/Carli/Chip3/Chip3_table.qza \
  --o-representative-sequences ~/work/Carli/Chip3/Chip3_rep_seqs.qza \
  --o-denoising-stats ~/work/Carli/Chip3/Chip3_denoising_stats.qza \
  --verbose

I hope this helps!
Carli

1 Like

Hi, I tried to use your pipeline for the analysis of my data, but I have a problem with the import step.

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv \
  --output-path demux_1_15.qza \
  --input-format SingleEndFastqManifestPhred33

This is the error message:
There was a problem importing /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv:

/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv is not a(n) SingleEndFastqManifestPhred33 file:

Line 3 has 1 cells (['/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz']), expected 3.

Can you help me?

Thanks

Hi Rubina,

I think the problem is that your manifest file is a .csv file and not a .tsv file! Try converting it to a tab-separated .tsv file (and importing with the SingleEndFastqManifestPhred33V2 input format, which expects tabs) and hopefully that will work.
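For reference, here is what a minimal manifest for the tab-separated SingleEndFastqManifestPhred33V2 format might look like (the sample ID here is just a guess on my part; columns are separated by tabs):

sample-id	absolute-filepath
1_DX	/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz

You would then pass --input-format SingleEndFastqManifestPhred33V2 to qiime tools import.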

Best,
Carli

4 Likes

Hi Everyone!

@cjone228 and I want to clarify this point - we just realized that using the full-length sklearn classifier is not advised, because the pre-trained sklearn classifier doesn't do well with mixed-orientation reads (see post). We just ran into the problem where we were only getting ~60-70% classification with both Silva and Greengenes, and realized that mixed-orientation reads are likely the culprit. Thus, we will try to move forward with classify-consensus-vsearch. Just found this post - super helpful! :partying_face: :star_struck: :laughing: @KMaki have you gotten the vsearch consensus classifier to work? If so, do you mind sharing your final code? :pray:

Investigation of this hurdle brought up another question: do we need to deal with our mixed-orientation reads prior to the taxonomy steps? (Just a reminder that Ion Torrent produces single-end reads, but each sample includes reads from 12 different pooled amplicons: six forward primers and six reverse primers.) @Jen_S posted on this topic, and @Nicholas_Bokulich recommended separating the forward from the reverse reads prior to denoising, denoising them separately, and then merging. Is this something we should do? Could mixed-orientation reads be part of why only ~70% of our sequences are passing the DADA2 filter? (See the @cjone228 and @Jen_S discussion on that below.)

Thanks much!
Lauren :qiime2:

2 Likes

Hi @Lauren,

Ideally yes: having mixed-orientation reads will inflate diversity estimates (particularly alpha diversity), since each true ASV is being counted twice (once for each orientation). OTU clustering may be a way to correct this post-dada2, but I have not looked into it yet.

Just to clarify, this is if the F/R reads are being re-oriented as @Jen_S was doing... because I think dada2 would probably explode if it is fed re-oriented reads, since the quality profiles would be reversed as well. If you are sorting out orientation issues post-dada2, you don't need to separate for denoising.
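If you do want to separate reads by orientation before denoising, one hypothetical way is with standalone cutadapt, splitting reads based on which primer they begin with. This is just a sketch; the X's and Y's are placeholders for your actual forward and reverse primer sequences, and the file names are made up:

# Keep reads that start with the forward primer (the primer is trimmed off):
cutadapt -g ^XXXXXXX --discard-untrimmed \
  -o forward_oriented.fastq.gz mixed_reads.fastq.gz

# Keep reads that start with the reverse primer (the primer is trimmed off):
cutadapt -g ^YYYYYYY --discard-untrimmed \
  -o reverse_oriented.fastq.gz mixed_reads.fastq.gz

Each pool could then be imported and denoised separately, and the resulting tables merged afterwards.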

No, probably not. That's more likely related to trimming/truncation parameters and quality profile more generally.

2 Likes

Hi @Lauren !

I am happy you found the post helpful! :smiling_face_with_three_hearts: I am going to do my best to answer your questions, but will do so in reverse to line up with workflow.

The most effective adjustment that I made in DADA2 to increase the number of sequences passing denoising was decreasing my truncation length. Decreasing the truncation length from 250 bp to 150 bp increased the percentage of sequences that passed DADA2 filtering from ~2% up to percentages in the 80s. This was not true across all regions, but it helped greatly. I saw that you don't truncate, so it may be worthwhile to try truncating at 250 or 150 to see if that improves your filtering percentages without sacrificing accuracy (I think I remember reading that you have some mock samples, so you can evaluate this). We also trim at 15 bp and use the DADA2 denoise-pyro option.
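For example, a minimal sketch of those parameters (the file names here are placeholders, not our actual paths):

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 15 \
  --p-trunc-len 150 \
  --o-table table_trunc150.qza \
  --o-representative-sequences rep_seqs_trunc150.qza \
  --o-denoising-stats denoising_stats_trunc150.qza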

VSEARCH has worked well for us so far! We are still doing the comparisons to see how it lines up to our old workflow but so far we have been happy with the result.

The code for vsearch classification is below (the comments describe each input):

# --i-query: rep seqs artifact from the DADA2 output after merging all of the runs together (if applicable)
# --i-reference-reads: the Greengenes 99% OTUs that we converted to an artifact (code at bottom)
# --i-reference-taxonomy: reference taxonomy from Greengenes, also converted to a QIIME 2 artifact (code below)
qiime feature-classifier classify-consensus-vsearch \
  --i-query merged_rep_seqs_pyro_trun150.qza \
  --i-reference-reads gg_99_otus.qza \
  --i-reference-taxonomy gg_ref-taxonomy.qza \
  --o-classification vsearch_gg_classification_trun150.qza

Qiime visualization:

qiime metadata tabulate \
  --m-input-file vsearch_gg_classification_trun150.qza \
  --o-visualization vsearch_gg_classification_trun150.qzv

Code to import the Greengenes .fasta and taxonomy .txt files into QIIME 2 artifacts.
Note you get the Greengenes data from the QIIME 2 data resources here.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path gg_99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path gg_ref-taxonomy.qza

I hope this helps! Please let me know if you have any other questions

Sincerely,
Katherine :blush:

4 Likes

Thanks @KMaki!

It's about time we got a QIIME 2 tutorial for Ion Torrent sequencing put together, per the title of this topic. Maybe everyone who has posted to this topic should team up to write a tutorial that we can post in the "community tutorials" section of the forum (and eventually the QIIME 2 library)… anyone interested? It takes a village to write a tutorial… :houses:

6 Likes

@Jen_S and I are definitely interested in putting together a tutorial. We are currently trying to contact IonTorrent to see if there is a way to extract V regions with a plugin, or some other way that preserves their intellectual property while working within the QIIME 2 workflow. This will determine how we arrange the future workflow, so based on how that plays out, we can start working on the Ion Torrent analysis tutorial.

7 Likes

Hey @Nicholas_Bokulich and @colinbrislawn

I have a follow up question about my vsearch taxonomy classifier. One of my regions has a higher percentage of unclassified taxa than the other and I’m thinking I could potentially tweak some of the vsearch settings in the “qiime feature-classifier classify-consensus-vsearch” command.

I am currently using the default parameters (as in my code in my response to @Lauren).

I am curious whether you have any suggestions for parameters to adjust, as a starting point, to try to improve this unclassified taxa percentage?

My DADA2 stats in this region are pretty good, and nothing changed when I increased the truncation length of the segment pulled from Greengenes, so I think tweaking the vsearch parameters would be a good next testing point. :thinking:

Appreciate any thoughts/suggestions you may have, thank you!
Katherine

1 Like

Hi @Nicholas_Bokulich and @KMaki,

Sorry for my delayed response - thank you both very much for your input! :+1:

This makes sense - @cjone228 and I are completely new to bioinformatics, so we are learning along the way. It didn't occur to us until last week that we were importing mixed-orientation reads using SingleEndFastqManifestPhred33V2. We now see on the QIIME 2 docs importing data page that "In this variant of the fastq manifest format, the read directions must all either be forward or reverse."

I think that it makes sense for us to re-orient our reads as @Jen_S did, because approximately half of our reads are forward and half are reverse.

@KMaki, your reply post on VSEARCH is very clear (thank you!) - however I think @cjone228 and I need to take a step back and work on re-doing the initial steps before we get to this point.

@Nicholas_Bokulich, I would be happy to contribute to a tutorial, however I need to figure out what I am doing first! :laughing:

Now I have a few new discoveries / questions that go back to late February when I was going through output files from Ion Reporter with @colinbrislawn:
It occurred to me that I only listed a few of the files that IR spits out. Turns out it also spits out the following list:

I see that IR is using QIIME v 1.9.1, so is it possible for me to convert some of these files into QIIME 2-compatible files and use them? My concern is that IR does everything through its own server, so we wouldn't have access to any of the intermediate files (if there are any). Also, I am happy to open more of the folders and show you what is in them if that is helpful.

I see that I have some taxonomy .biom files, so yesterday I was trying to convert the OTU_family.biom file into a Qiime2 .qza file. I found this post and attempted to follow the instructions (see screenshot below)


I picked --type FeatureData[Taxonomy] because it doesn't look like my file has any sequences in it, and I picked --input-format BIOMV210Format because my file states "BIOM-Format 2.1.7" and it was made with QIIME v 1.9.1 (see .biom contents in screenshot below)

As always, if anyone has any insight I greatly appreciate it!
Take care,
Lauren

1 Like

Hey @Lauren - good sleuthing, but I think that might be a red herring - the original file was generated using biom 2.1.7, but if you look one line up you'll see that the current format is biom 1.0.0.

The post you linked to for importing taxonomy data from a biom table only works with biom 2 files - all you need to do is convert this biom 1 file to a biom 2 file using the biom CLI tool:

http://biom-format.org/documentation/biom_conversion.html
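For example, something along these lines should work (using the OTU_family.biom file name from your post; the output file name is arbitrary):

biom convert \
  -i OTU_family.biom \
  -o OTU_family_hdf5.biom \
  --to-hdf5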

Once converted you might be able to import as taxonomy, although based on the screenshot, I'm not 100% sure that'll work, because that special import invocation expects the taxonomy to be stored as metadata in the biom table, rather than as the IDs themselves.

2 Likes

Thanks @thermokarst!

I wondered about that... so thanks for clarifying!

Sounds good- will do :+1:

Sigh... ok - It is worth a try, but I won't expect too much based on your insight. :unamused:

Thanks again!
Lauren

2 Likes

Unclassified taxa can occur for a few reasons with the vsearch-based classifier:

  1. There really is no good match.
     - Seems unlikely if this is region-specific, but that could occur if the reference sequences have a jumble of different regions covered (e.g., different primers were used for sequencing those reference sequences). It looks like you are using Greengenes? So that is probably not the case, either.
     - Can occur if --p-perc-identity is set too high... but the default is really low, so that's probably not the issue.
     - Can also occur if reads are in the wrong orientation, but the default is to check both orientations, so that's not the issue here.
  2. There is more than one match at the basal taxonomic level (kingdom, in the case of Greengenes) above the consensus threshold. So theoretically you may be hitting both Bacteria and Archaea, leading to "unclassified" (because reliable classification cannot be achieved at even the basal level).
     - Since your query sequences (and possibly also your reference sequences) are in mixed orientations and you are searching in both directions, you could be picking up hits in both directions (only one of which would be valid!). I recommend increasing --p-perc-identity a little bit to see what happens, and also testing the maxhits and top-hits-only settings to see how they impact your results (see the sketch below this list). If you suddenly get much better results (without setting maxhits or maxaccepts to 1!!!) then you've struck gold.
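For example, a hypothetical starting point, reusing the file names from your earlier code (the parameter values here are illustrative, not recommendations):

qiime feature-classifier classify-consensus-vsearch \
  --i-query merged_rep_seqs_pyro_trun150.qza \
  --i-reference-reads gg_99_otus.qza \
  --i-reference-taxonomy gg_ref-taxonomy.qza \
  --p-perc-identity 0.9 \
  --p-top-hits-only \
  --o-classification vsearch_gg_classification_tuned.qza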

Could you clarify? Are you using extract-reads on the reference sequences? If so, don't use trunc-len on those reads (since your query seqs are in mixed orientations they can hit at either end of that read)

Yes that works too, but based on @thermokarst's notes, it looks like the biom tables you have are already collapsed on taxonomy so may not be enormously helpful.

:partying_face:
Let's circle back to this — I think the prospect of building a community tutorial (especially one written by different research groups who are bringing different perspectives to the table) is really exciting.

Good luck to you both! Let us know if you have any more questions.

4 Likes

Thanks so much for the input, I will try the vsearch parameters and report back.

As you guessed, we are currently pulling out segments corresponding to our desired V regions from the Greengenes "99_otus" artifact that we imported into QIIME 2, using "qiime feature-classifier extract-reads". Would you mind expanding on your statement:

Do you mean that since we have mixed orientations, there is a concern that the Greengenes sequences needed to classify taxa using vsearch may be missed if we truncate?

Should we not be using trunc-len at all for the extract-reads command?
Example code below (the X's would be the primer sequences):

# Remove --p-trunc-len 250? Keep --p-min-length 100 and --p-max-length 400?
qiime feature-classifier extract-reads \
  --i-sequences gg_99_otus.qza \
  --p-f-primer XXXXXXX \
  --p-r-primer XXXXXXX \
  --p-trunc-len 250 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads v2_gg_ref-seqs.qza

I got pretty good classification percentages in my other region, but maybe this was just luck?

Thanks so much!

Strongly agree with the above!!! :star_struck: :woman_technologist:

4 Likes

Yes. Normally (when read orientation is consistent) truncation can be used with extract-reads to match the exact same site and length as in the query. For example:

imaginary query:
ACGTACGTACGTACGTACGTACGTACGTcccccccccccccccc

imaginary reference:
gggggggggggggggACGTACGTACGTACGTACGTACGTACGTccccccccccccccccgggggggggggggg

(lowercase "g"s == primer sites)
(lowercase "c"s == part of amplicon sequence that corresponds to region that was truncated from query by dada2)
(uppercase == part of amplicon that appears in truncated forward reads/reference sequences)

Using extract-reads on that reference would yield an exact match (or, in real life, just match the same region/conditions and thus enable exact or similar matches between query and ref):
truncated query: ACGTACGTACGTACGTACGTACGTACGT
truncated refseq: ACGTACGTACGTACGTACGTACGTACGT

However, when orientations are mixed, you don't know which end of the amplicon your query sequences are on... you will have a mixture of:

ACGTACGTACGTACGTACGTACGTACGTcccccccccccccccc

and its reverse complement:

gggggggggggggggACGTACGTACGTACGTACGTACGTACGT

Which truncated (on 3' end of each read) will yield:

ACGTACGTACGTACGTACGTACGTACGT

and

gggggggggggggggACGTACGTACGTAC

So you are covering different parts of the complete amplicon. Hence, the reference sequences should be untruncated to cover the full amplicon so that you can hit any part.

Correct

Yes, remove

keep both, but adjust to expected ranges (this could also cause unassignments if the ranges are not set correctly). Check the literature for what the expected amplicon ranges are; setting broad limits probably does not hurt, since these are really just used as safeguards. Occasionally some primer set and reference database combinations can yield spurious hits that cause issues during classification... unusually short or long amplicons are a good indicator of spurious hits.
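So the adjusted command would look something like this (keeping your placeholders; the min/max values still need to be checked against the expected amplicon ranges):

qiime feature-classifier extract-reads \
  --i-sequences gg_99_otus.qza \
  --p-f-primer XXXXXXX \
  --p-r-primer XXXXXXX \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads v2_gg_ref-seqs.qza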

What region was good and what region was bad? Sometimes it's not luck, sometimes it does depend on region, primer, reference db, etc... e.g., I've seen issues with V1 primers on some databases before because some of the reference sequences might not have the correct forward primer included in the sequence. The default settings were designed based on benchmarks of different 16S domains as well as ITS... sort of general "catch all" settings... but for unusual amplicons (maybe?) or for other marker genes (not you, since you have 16S but just saying) these settings may need to be tweaked. Having a mock community to re-optimize for novel primer sets is ideal!

5 Likes

Hello to all,
I tried your pipeline on my data and it works!
First of all, thanks for this initiative. Several times over the years I have asked for something like it to be created, without success, so it really seems like a great thing to me!
I have two questions for you:

  1. Why do you use SEPP for the phylogenetic tree? Can't we use mafft?
  2. Is there a way to solve the problem with alpha diversity? (Faith's PD is not always enough.)

Thank you so much

Rubina

2 Likes

Hi @rparadiso,
I am pinging @cjone228 to describe her pipeline decisions better than I can! But here are my thoughts:

I think the issue with using mafft here is that the amplicons target different sections of the 16S gene, so mafft will not be able to align them properly... fragment-insertion will insert those fragments into a reference tree built on full-length 16S sequences, and so will accommodate this multi-amplicon design.
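For reference, the fragment-insertion command looks something like this (a sketch: the rep-seqs file name is a placeholder, and the reference database assumed here is the Greengenes SEPP reference from the QIIME 2 data resources):

qiime fragment-insertion sepp \
  --i-representative-sequences rep_seqs.qza \
  --i-reference-database sepp-refs-gg-13-8.qza \
  --o-tree insertion_tree.qza \
  --o-placements insertion_placements.qza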

I think you've already seen @cjone228's description above (I'm just quoting here to give context for others following along!)

Having multiple amplicons means that counting ASVs will lead to highly inflated measures of richness (e.g., if you have 7 different amplicons in the pool, then richness scores will be 7X the actual richness). A phylogenetic alpha-diversity metric like Faith's PD is unaffected by this issue, but all other metrics will be. A few workarounds:

  1. You could collapse your feature table on taxonomy and then calculate alpha diversity metrics. Ideally, this will "collapse" amplicons that hit the same reference sequence but at different sections of the 16S. It could still run into issues if some amplicons assign more deeply than others (we know this to be a problem!), so alpha diversity could still be inflated. One possibility is to use a taxonomy assignment method like classify-consensus-blast and take only the top hit, and use only those results for calculating alpha diversity; similar to using SEPP, the hope is that these would "splice" into the same reference taxonomy. (A sketch of this collapsing approach follows this list.)
  2. use closed-reference OTU clustering to cluster sequences into full-length reference sequences. Once more, we are "splicing" the amplicons from different regions into single full-length sequences. The problem: closed-ref OTU clustering has its own issues you are probably already familiar with (and if not you can read about these more in the tutorials at qiime2.org)
  3. Lazy methods: the multiple-amplicon issue will affect all samples evenly, so your richness estimates will be highly inflated in all samples, presumably at an even rate. If sequencing depth is high enough, you could just carry on as normal, on the assumption that this is a bias impacting all samples equally, so you can compare within the same study using conventional alpha diversity metrics, but not between studies (since the richness is inflated!). Another option: try dividing observed richness by the number of amplicons to get an estimate of the richness you'd observe with a single amplicon. You could also try filtering down to individual amplicons to see what your richness looks like with each one, to check how well this method works... getting some external validation (comparing to other studies) is also recommended if you go that route.
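As a sketch of option 1 (the file names are placeholders; note that in older QIIME 2 releases the metric is named observed_otus rather than observed_features):

# Collapse the feature table on taxonomy (e.g., level 6 = genus for Greengenes)
qiime taxa collapse \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table table_genus.qza

# Then compute a non-phylogenetic alpha diversity metric on the collapsed table
qiime diversity alpha \
  --i-table table_genus.qza \
  --p-metric observed_features \
  --o-alpha-diversity observed_features_genus.qza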

Just a few ideas to think over! Interested to hear what @cjone228 and others think. I hope that helps.

2 Likes

Hello Nicholas,

Should this be done at every taxonomic level?

Dividing by the number of amplicons does not seem to me like a viable approach, as you would have to assume that all the primers have the same efficiency in the reaction, and we know that is unlikely to be so.
In addition, some time ago I read about work suggesting that some regions are more "specific" (if you'll pardon the term) for certain taxa than others.

Thanks, Nicholas, for your answer!

1 Like