Possible Analysis Pipeline for Ion Torrent 16S Metagenomics Kit Data in QIIME2?

Hello Lauren,

This is a gold mine! :pick: :moneybag:

The headers is what I’m most interested in.

MG | * | V3 | F | * | 99.355 | Actinobacteria
     ^ always 2. I wonder if this is the version of the kit/primers?
                  ^ I wonder if this is the exact read count in the sample?

Other than these downstream files, do you also get raw files from this run? I wonder if you could use the V3 runs, for example, to align to raw reads and identify the primers that flank each one. Or do they remove their secret primers before you have a chance to see them?

If you had raw data, you could also check how many times each of these exact subsequences appear and see if my second hypothesis is correct.

Let me know if you have got the data and want help crafting a vsearch command!

Thanks!
Colin

:female_detective: :male_detective: :robot:

2 Likes

Hi all,

I am working on the analysis for the ATCC mock communities using the above mentioned pipeline:

I completed steps 1-3, and am now working on the sepp fragment insertion step. In case anyone is curious, here is a picture of the quality plot we got in our demux.qzv file:

Screen Shot 2020-02-27 at 4.28.40 PM

We then continued with the DADA2 step (using denoise-pyro) and got a feature table:

and rep-seqs file:

I am now encountering an error when trying to do the sepp-fragment-insertion step - we use MARCC, a cluster, in order to do a lot of our work, so the below screenshot is the output of a SLURM job that we submitted for this.

I realized that the code I was using from the sepp-fragment-insertion step from the Parkinson's Mouse tutorial is from QIIME2 2019.10, so in the 2019.7 version there is no option for --i-reference-database. I installed (or I think I installed) QIIME2 version 2019.10 into our MARCC account to fix this, but it still gave me the same error saying that this is not an option for an argument. Any ideas? Thank you so much!! :pray:

Carli

1 Like

Dear all - I am quite new in microbiome analysis, and started some time ago with IonT16S kit mainly because the PGM machine was free. So far, I have used IonReporter microbiome package to get an overall view of my data, but this package is 'limited/fixed' and I would like to make more flexible analyses. I was wondering why not use the 'consensus' Fastq files as input sequences in QIIME2, and not the single V regions (for which the F and R primer sequences are needed). Fastq file for each of my samples is >100MB, contains > 200,000 sequences and look like this:
Sample01
Unfortunately, the length of sequences are quite different, since do not correspond to a single V region. Could this be a problem in subsequent QIIME steps?
Thank you for any help and comment!

1 Like

Hi all,

Making more progress on analysis of the 2 ATCC samples in QIIME2! :partying_face: However, we have now run into kind of a strange error that we'd love some feedback on. Since I last posted, I was able to complete the sepp-fragment-insertion step. We then performed alpha rarefaction in order to determine the sampling depth for core metrics analysis. We decided that 40,000 reads was a good sampling depth for us (which is good because our goal for our actual clinical samples is to downsample to 40,000 reads/sample).

For core metrics analysis, we used the following code:

qiime diversity core-metrics-phylogenetic \
  --i-table ~/work/IT_pilot/IT_pilot_table.qza \
  --i-phylogeny ~/work/IT_pilot/IT_pilot_tree.qza \
  --m-metadata-file ~/work/IT_pilot/IT_pilot_metadata.tsv \
  --p-sampling-depth 40000 \
  --output-dir ~/work/IT_pilot/IT_pilot_core_metrics_results

It worked with no errors (yay!), but then when we tried to visualize any of the PCoA plots, we can't see any of the points :neutral_face: See example below:

Note that you can see "2/2 visible" in the upper left corner, and the dropdown menu in the upper right does know about our samples (there are only two in this little pilot analysis).

Does anyone have any ideas what the issue may be? @colinbrislawn? @jwdebelius?

Thanks,
Carli

2 Likes

Yes, a similar topic reported this behavior the other day. There seems to be a possible bug in emperor that causes it to fail to display plots containing only 2 samples, and it looks like @yoshiki is on the case. In the meantime, the workaround is to include > 2 samples...

Welcome to the forum @fulbag! Yes, the reads being different lengths and from different V regions will cause issues in downstream analysis. @cjone228 and others have been discussing strategies for processing these data throughout this topic.

Splitting up the data by V region and processing separately seems to be the consensus suggestion that I and others have recommended in the past. Other methods, like using SMURF, sound like they have been troublesome and are not currently possible to adapt to other protocols.

Another possibility would be to use closed-reference OTU clustering to essentially merge all V regions into full-length reference 16S sequences. You may lose some of your sequences (if they poorly match the reference) but it is worth giving it a try...

Disclaimer: I do not have personal experience processing ion torrent data, and @cjone228 and others may be able to offer more guidance... I just want to make sure your question is not lost in the fray!

4 Likes

Hi @fulbag!

I saw that @Nicholas_Bokulich sent a reply to you just as I was typing a reply as well :grin:

This is a great question! Technically, you can use these fastq files as input (just to clarify, these are the fastq files that you get directly from your sequencer / server - not out of Ion Reporter). And those are exactly the files that Carli is using as input in her first post in this thread:

However there are a couple of tricky things going on that complicates data interpretation:

  1. Because there are multiple variable regions being amplified from the same sample at the same time, reads from the same bacterium but different variable regions will be interpreted as "different" bacteria even though they come from the same source (I think this is a similar problem as differences in 16s copy number between bacteria but in a different way). I believe this is what SMURF is trying to overcome...
  2. The other issue is that comparing sequences from different V regions is difficult.
    @colinbrislawn recently gave me the following explanation:

Counting is the hardest part of bioinformatics. If you count each region separately, you will be underestimating richness in some regions, and double (or tripple!) counting ASVs in others.
Most (all?) stats methods in this field presume that you are measuring a single source of diversity:
So you count the number of unique :bird:s in each state
or you count the number of unique :bird:s in each national park
or you count the number of unique :bird:s in each watershed
All of these metrics make sense on their own.
(Total unique :bird: in Montana < Idaho, p=0.02)*
But some national parks are shared by several states! And watershed often always cover multiple states!
So while you have measured three things well, combining them is hard, if not impossible
…unless…
instead of states and parks and watersheds, you have a geolocation of each point.
Now you have a unified way to compare all :bird:s.
:earth_americas: :world_map: :round_pushpin:
And this unified method can scale beyond the geographical constructs of the US!
:earth_africa: :earth_asia: :round_pushpin:
The states, parks, and watersheds are your different sequences regions. But what you really need is something to unify all three.
And I think you can bring your :bird:s together with a :evergreen_tree:

Therefore, if you use the consensus fastq files as input, we think you are limited to core metrics analysis that relies on phylogenetic analysis (e.g. Faith's PD for alpha diversity and Unifrac for beta diversity)

Our goal for identifying sequences (and primer sequences) from the the different V regions is that if there was some insurmountable problem in performing the consensus analysis pipeline, then we would be able to at least pick sequences from one variable region to analyze our data. And also, if we discovered that there were some better way to analyze the data then everyone could start doing it! :raised_hands: :qiime2: :muscle:

4 Likes

Thank you for correcting me - :blush:,

As above, :blush:

So, according to your and @Nicholas_Bokulich comments, the fastq files are 'useless'... unless we may decipher how these .fastq consensus sequences are assembled by the PGM server.
From my previous experience on the T cell receptor V-D-J region (on the alpha and beta chains) sequencing (Sanger's sequencing - years ago), the consensus was (manually) built by placing the V-D-J sequences within the conserved domains.
Wrongly, I was assuming that here the .fastq consensus were built in a similar way, but now I realize that this is not true...

So, basically it is not (yet) possible to extract from these .fastq files a single V region for the pipeline analysis - since primer sequences are not present.
Correct? :roll_eyes:
Thanks.

1 Like

Really I tried to extract from my reads the v3 and v4 regions. I used some primers that I suppose to be used in the kit and the analysis work. The only problem was that a big part of you sequence are lose in the bioinformatic analysis

1 Like

Hi @Lauren and @cjone228,

@KMaki and I are working with the same data as you and wanted to ask you all after running dada2, what percentages did you get from the denoising output in the variable from the denoising stats qzv:
“Percent of input passed filter”

We see that the nice Parkinson’s tutorial has an average of percent of about 95%. Our Ion Torrent specific data has an average about .2% to our best which is about 30% across all runs. What did you all see here after this step?

Thank you!
Jen

Ouch!

Try increasing the trimming parameters to remove low quality starting and ending positions.

Also, just to check, are you using denoise-pyro? The PD-mouse tutorial uses Illumina data, which has different defaults than pyro data like Ion Torrent.

Colin

2 Likes

Hi Jen! Sorry to hear you are having this issue - I just finished the DADA2 step for my samples, which span 4 separate chips, and for each of them it seems as though typically 70-80% of the reads pass the filter. This was true for @Lauren's samples too.

Here is an example of our code:

qiime dada2 denoise-pyro \
  --p-trim-left 15 \
  --p-trunc-len 0 \
  --i-demultiplexed-seqs ~/work/Carli/Chip3/Chip3_demux.qza \
  --o-table ~/work/Carli/Chip3/Chip3_table.qza \
  --o-representative-sequences ~/work/Carli/Chip3/Chip3_rep_seqs.qza \
  --o-denoising-stats ~/work/Carli/Chip3/Chip3_denoising_stats.qza \
  --verbose

I hope this helps!
Carli

1 Like

Hi I try to use your pipeline for the analysis of my data, but I have a problem with the importing.

qiime tools import
–type ‘SampleData[SequencesWithQuality]’
–input-path /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv
–output-path demux_1_15.qza
–input-format SingleEndFastqManifestPhred33

This is the error message:
There was a problem importing /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv:

/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv is not a(n) SingleEndFastqManifestPhred33 file:

Line 3 has 1 cells ([’/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz’]), expected 3.

Can you help me?

Thanks

Hi Rubina,

I think the problem is that your manifest file is a .csv file and not a .tsv file! Try making it a .tsv file and hopefully that will work.

Best,
Carli

4 Likes

Hi Everyone!

@cjone228 and I want to clarify this point - we just realized that using the the full length sklearn classifier is not advised because the pre-trained sklearn classifier doesn't do well with mixed-orientation reads (see post). We just had the problem where we were only getting ~60-70% classification with both Silva and GG and realized that mixed orientation reads are likely the culprit. Thus, we will try to move forward with classify-consensus-vsearch. Just found this post - super helpful! :partying_face: :star_struck: :laughing: @KMaki have you gotten vsearch consensus classifier to work? If so, do you mind sharing your final code? :pray:

Investigation of this hurdle brought up another question: do we need to deal with our mixed orientation reads prior to the taxonomy steps? (Just a reminder that IT produces single end reads, but each sample includes reads from 12 different pooled amplicons: 6F primers and 6R primers.) @Jen_S posted on this topic and @Nicholas_Bokulich recommended separating out F from R reads prior to denoising, denoising separately, and then merging. Is this something we should do? Could mixed oriented reads be part of why only ~70% of our sequences are passing the dada2 filter? (see @cjone228 and @Jen_S discussion on that below).

Thanks much!
Lauren :qiime2:

2 Likes

Hi @Lauren,

Ideally yes, since having mixed-orientation reads will inflate (particularly alpha) diversity estimates, since each true ASV is being counted twice (once for each orientation). OTU clustering may be a way to correct this post-dada2, but I have not looked into it yet.

Just to clarify, this is if the F/R reads are being re-oriented as @Jen_S was doing... because I think dada2 would probably explode if it is fed re-oriented reads, since the quality profiles would be reversed as well. If you are sorting out orientation issues post-dada2, you don't need to separate for denoising.

No, probably not. That's more likely related to trimming/truncation parameters and quality profile more generally.

2 Likes

Hi @Lauren !

I am happy you found the post helpful! :smiling_face_with_three_hearts: I am going to do my best to answer your questions, but will do so in reverse to line up with workflow.

The most effective adjustment that I made in DADA2 to increase the amount of sequences that passed denoising was decreasing my trun length. Decreasing the truncation length from 250 bp to 150 bp increased the amount of sequences that passed through DADA2 filtering from ~2% up to percentages in the 80s. This was not true across all areas but helped greatly. I saw that you don’t truncate so it may be worthwhile to try seeing if truncating at 250 or 150 improves your filtering percentages without sacrificing accuracy (I think I remember reading you have some mock samples so you can evaluate this). We also trim at 15 bp and use the DADA2 –pyro option.

VSEARCH has worked well for us so far! We are still doing the comparisons to see how it lines up to our old workflow but so far we have been happy with the result.

The code for vsearch classification is below:
qiime feature-classifier classify-consensus-vsearch
–i-query merged_rep_seqs_pyro_trun150.qza \ #This is rep seqs artifact
that comes from DADA2 output after merging all of the runs together
(if applicable)
–i-reference-reads gg_99_otus.qza \ #This is the green genes 99%
OTUs that we converted to an artifact (Code at bottom)
–i-reference-taxonomy gg_ref-taxonomy.qza \ #Reference tax from
Greengenes also converted to qiime Artifact (code below)
–o-classification vsearch_gg_classification_trun150.qza #output

Qiime visualization:
qiime metadata tabulate
–m-input-file vsearch_gg_classification_trun150.qza
–o-visualization vsearch_gg_classification_trun150.qzv

Code to import Greengenes.fasta and taxonomy.txt into Qiime2 artifacts
Note you get the Greengenes data from Qiime2 data resources here

qiime tools import
–type ‘FeatureData[Sequence]’
–input-path 99_otus.fasta
–output-path gg_99_otus.qza

qiime tools import
–type ‘FeatureData[Taxonomy]’
–input-format HeaderlessTSVTaxonomyFormat
–input-path 99_otu_taxonomy.txt
–output-path ref-taxonomy.qza

I hope this helps! Please let me know if you have any other questions

Sincerely,
Katherine :blush:

4 Likes

Thanks @KMaki!

It’s about time we get a QIIME 2 tutorial for ion torrent sequencing put together, per the title of this topic. Maybe everyone who has posted to this topic should team up to write a tutorial that we can post in the “community tutorials” section of the forum (and eventually the QIIME 2 library)… anyone interested? It takes a village to write a tutorial… :houses:

6 Likes

@Jen_S and I are definitely interested in putting together a tutorial. We are currently trying to contact IonTorrent to see if there is a way to extract V regions with a plugin or some other way to preserve their intellectual property while working with the QIIME2 workflow. This will determine how we arrange the future workflow so based on what plays out we can start working on the IonTorrent analysis tutorial.

7 Likes

Hey @Nicholas_Bokulich and @colinbrislawn

I have a follow up question about my vsearch taxonomy classifier. One of my regions has a higher percentage of unclassified taxa than the other and I’m thinking I could potentially tweak some of the vsearch settings in the “qiime feature-classifier classify-consensus-vsearch” command.

I am currently using the default parameters (as I have in my code in my response to @Lauren

I am curious if you have any suggestions for parameters to adjust as a starting point to try and improve this unclassified tax %?

My DADA2 stats in this region are pretty good and nothing changed when I increased the truncation length of the segment pulled from greengenes so I think tweaking the vsearch parameters would be a good next testing point. :thinking:

Appreciate any thoughts/suggestions you may have, thank you!
Katherine

1 Like

Hi @Nicholas_Bokulich and @KMaki,

Sorry for my delayed response - thank you both very much for your input! :+1:

This makes sense - @cjone228 and I are completely new to bioinformatics so we are learning along the way. It didn't occur to us until last week that we were importing mixed reads using SingleEndFastqManifestPhred33V2. We now see on the Qiime2 Docs importing data page that "In this variant of the fastq manifest format, the read directions must all either be forward or reverse."

I think that it makes sense for us to re-orient our reads as @Jen_S did because approximately half of our reads are F and half R.

@KMaki, your reply post on VSEARCH is very clear (thank you!) - however I think @cjone228 and I need to take a step back and work on re-doing the initial steps before we get to this point.

@Nicholas_Bokulich, I would be happy to contribute to a tutorial, however I need to figure out what I am doing first! :laughing:

Now I have a few new discoveries / questions that go back to late February when I was going through output files from Ion Reporter with @colinbrislawn:
It occurred to me that I only listed a few of the files that IR spits out. Turns out it also spits out the following list:

I see that IR is using Qiime v 1.9.1, so is it possible for me to convert some of these files into Qiime2 compatible files and use them? My concern is that IR does everything through it's own server, so we wouldn't have access to any of the intermediate files (if there are any). Also, I am happy to open more of the folders and show you what is in them if that is helpful.

I see that I have some taxonomy .biom files, so yesterday I was trying to convert the OTU_family.biom file into a Qiime2 .qza file. I found this post and attempted to follow the instructions (see screenshot below)


I picked --type FeatureData [Taxonomy] because is doesn't look like my file has any sequences in it and I picked --input-format BIOMV210Format because my file states "BIOM-Format 2.1.7" and it was made with Qiime v 1.9.1 (see .biom contents in screenshot below)

As always, if anyone has any insight I greatly appreciate it!
Take care,
Lauren

1 Like