Possible Analysis Pipeline for Ion Torrent 16S Metagenomics Kit Data in QIIME2?

Hi All,

In an effort to keep topics related to analyzing Ion Torrent (IT) 16S Metagenomics Kit Data in one place, @cjone228 and I have an alternative question that we would like to ask in this thread:

Users have the option of utilizing ThermoFisher software called Ion Reporter to analyze their data. Here is the application note: https://www.thermofisher.com/content/dam/LifeTech/Documents/PDFs/Ion-16S-Metagenomics-Kit-Software-Application-Note.pdf and the white paper on their 16S metagenomics algorithms: https://tools.thermofisher.com/content/sfs/brochures/ion-reporter-16s-metagenomics-algorithms-whitepaper.pdf. From what I can tell, this software categorizes reads into OTUs and then maps them to a proprietary curated Greengenes database, a proprietary curated MicroSeq database, or both. From there, the user can download a slew of files.

I am wondering if it is possible to use Ion Reporter to demultiplex, separate reads by hypervariable region, generate OTUs, and perform taxonomic identification, but then use the resulting downloaded files + my own metadata file to feed into QIIME2 to perform further analysis such as alpha / beta diversity, longitudinal analysis, etc.

The files that are downloaded include the following:

A file named OTU_family_genus_species.biom (I converted it to .txt to visualize it; screenshot below).

A file named OTU_family_genus_species.txt (screenshot of the head below - note that the sample names are S0XX_XX_v2; the v2 here is part of the sample name, not a reference to the V2 hypervariable region).

Both of the above files appear to have taxonomic information from all samples and all primers/hypervariable regions combined into the same file. They also provide similar files that classify only to a specific taxonomic level (see list of files below).
[Screenshot: list of per-taxonomic-level OTU files]

Additionally, for every sample, there are the following files:

See below for an example of the head of S001_354_v2_by_primer.txt

Does it look like we may be able to use any of these files as input for QIIME2? Thanks for any and all input!

Lauren and Carli


Yes, this should work :100:
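
For example, the BIOM table can likely be imported directly. A sketch only - it assumes OTU_family_genus_species.biom is in BIOM v1.0 (JSON) format (if it is v2.1 HDF5, use --input-format BIOMV210Format instead), and the output name is made up:

qiime tools import \
  --type 'FeatureTable[Frequency]' \
  --input-path OTU_family_genus_species.biom \
  --input-format BIOMV100Format \
  --output-path ion_reporter_table.qza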


That list of files looks fascinating! So many questions... :female_detective: :male_detective:

Are all the same reads in each pair of *_family.fasta and *_genus_species.fasta files?

Is there any overlap of reads between the V2, V3, V4, V8, etc files?
If not, these could be split by region like we hoped for!

Thank you so much Lauren and Carli. This is very interesting!
Colin


My one concern with the move to QIIME 2 is that you don't have a tree (:broccoli:)! You need either a Greengenes ID, a corresponding reference ID, or a sequence to align.

Best,
Justine


Hi Colin and Justine,

Thank you for your replies! I looked more into the files and have more information:

I don't believe they are the same reads. I did a BLAST comparison search, and while many of the sequences are similar (as expected), none of them appear to have 100% homology over nearly the entire length of the sequence (e.g., some have 100% homology, but only for 140 out of 220 bases). The *_family.fasta contains approx. 100 sequences, whereas the *_genus_species.fasta contains approx. 400 sequences. Therefore, I believe the sequences in *_family.fasta are those that could not be classified further, down to the genus or species level.

No - I believe these are the fasta files mapped to each individual variable region. Each sequence has a description header (see pic below) that gives the variable region, whether the read is forward or reverse, some sort of alignment score(?), and the taxonomic identification. If you have any thoughts as to what the two asterisked numbers are, could you please let me know? >MG | * | V3 | F | * | 99.355 | Actinobacteria ...

One notable piece of information that this file does not include is the number of hits each sequence gets (i.e., a count). That does appear to be in the S001_ * by_primer.txt file, with the caveat that the counts seem to be per genus/species, not per sequence. (For instance, in the screenshot above you can see there are 4 sequences mapped to Collinsella aerofaciens. I think in the S001 *_by_primer.txt file there would be one row for Collinsella aerofaciens, with a count of 4.)

I agree with @jwdebelius that we don't have a tree. But since these files include sequences, V regions, and read direction, I think we may be able to make one?

Take care,
Lauren


Hello Lauren,

This is a gold mine! :pick: :moneybag:

The headers are what I'm most interested in.

MG | * | V3 | F | * | 99.355 | Actinobacteria
     ^ always 2. I wonder if this is the version of the kit/primers?
                  ^ I wonder if this is the exact read count in the sample?

Other than these downstream files, do you also get raw files from this run? I wonder if you could use the V3 runs, for example, to align to raw reads and identify the primers that flank each one. Or do they remove their secret primers before you have a chance to see them?

If you had raw data, you could also check how many times each of these exact subsequences appears, and see if my second hypothesis is correct.

Let me know if you have the data and want help crafting a vsearch command!
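
Something like this could be a starting point (a sketch only - the file names are made up, and it assumes one raw single-end fastq per sample):

# Convert the raw reads to fasta so vsearch can use them as a database
vsearch --fastq_filter raw_reads.fastq --fastaout raw_reads.fasta

# Map each raw read against the header sequences; require the header
# sequence to be fully covered (--target_cov 1.0) and identical (--id 1.0;
# vsearch's default identity definition ignores terminal gaps, so extra
# read bases outside the match don't count against it)
vsearch --usearch_global raw_reads.fasta \
  --db S001_354_v2_genus_species.fasta \
  --id 1.0 \
  --target_cov 1.0 \
  --strand both \
  --blast6out hits.b6

# Tally hits per header sequence (column 2 of blast6 output is the target)
cut -f2 hits.b6 | sort | uniq -c | sort -rn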

Thanks!
Colin

:female_detective: :male_detective: :robot:


Hi all,

I am working on the analysis for the ATCC mock communities using the above-mentioned pipeline:

I completed steps 1-3, and am now working on the sepp fragment insertion step. In case anyone is curious, here is a picture of the quality plot we got in our demux.qzv file:

[Screenshot: quality plot from demux.qzv]

We then continued with the DADA2 step (using denoise-pyro) and got a feature table:

and rep-seqs file:

I am now encountering an error at the sepp-fragment-insertion step. We use MARCC, a computing cluster, for much of our work, so the screenshot below is the output of the SLURM job we submitted for this step.

I realized that the sepp-fragment-insertion code I was using from the Parkinson's Mouse tutorial is for QIIME 2 2019.10, and in the 2019.7 version there is no --i-reference-database option. I installed (or I think I installed) QIIME 2 2019.10 in our MARCC account to fix this, but I still got the same error saying that --i-reference-database is not a valid option. Any ideas? Thank you so much!! :pray:
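
For reference, this is the 2019.10-style invocation we are attempting (a sketch with illustrative file names; sepp-refs-gg-13-8.qza is the reference file from the QIIME 2 data resources):

qiime fragment-insertion sepp \
  --i-representative-sequences Chip3_rep_seqs.qza \
  --i-reference-database sepp-refs-gg-13-8.qza \
  --o-tree insertion_tree.qza \
  --o-placements insertion_placements.qza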

Carli


Dear all - I am quite new to microbiome analysis, and started some time ago with the Ion Torrent 16S kit, mainly because the PGM machine was free. So far I have used the Ion Reporter microbiome package to get an overall view of my data, but this package is 'limited/fixed' and I would like to run more flexible analyses. I was wondering: why not use the 'consensus' fastq files as input sequences in QIIME2, rather than the single V regions (for which the F and R primer sequences are needed)? The fastq file for each of my samples is >100 MB, contains >200,000 sequences, and looks like this:
[Screenshot: example fastq records from Sample01]
Unfortunately, the sequence lengths are quite different, since they do not correspond to a single V region. Could this be a problem in subsequent QIIME 2 steps?
Thank you for any help and comments!


Hi all,

Making more progress on the analysis of the 2 ATCC samples in QIIME2! :partying_face: However, we have now run into kind of a strange error that we'd love some feedback on. Since I last posted, I was able to complete the sepp-fragment-insertion step. We then performed alpha rarefaction to determine the sampling depth for core-metrics analysis. We decided that 40,000 reads was a good sampling depth for us (which is convenient, because our goal for our actual clinical samples is to downsample to 40,000 reads/sample).
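
The rarefaction step looked roughly like this (a sketch; the max depth shown is illustrative, not our exact value):

qiime diversity alpha-rarefaction \
  --i-table ~/work/IT_pilot/IT_pilot_table.qza \
  --i-phylogeny ~/work/IT_pilot/IT_pilot_tree.qza \
  --p-max-depth 60000 \
  --m-metadata-file ~/work/IT_pilot/IT_pilot_metadata.tsv \
  --o-visualization ~/work/IT_pilot/alpha_rarefaction.qzv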

For core metrics analysis, we used the following code:

qiime diversity core-metrics-phylogenetic \
  --i-table ~/work/IT_pilot/IT_pilot_table.qza \
  --i-phylogeny ~/work/IT_pilot/IT_pilot_tree.qza \
  --m-metadata-file ~/work/IT_pilot/IT_pilot_metadata.tsv \
  --p-sampling-depth 40000 \
  --output-dir ~/work/IT_pilot/IT_pilot_core_metrics_results

It worked with no errors (yay!), but when we try to visualize any of the PCoA plots, we can't see any of the points :neutral_face: See the example below:

Note that you can see "2/2 visible" in the upper left corner, and the dropdown menu in the upper right does know about our samples (there are only two in this little pilot analysis).

Does anyone have any ideas what the issue may be? @colinbrislawn? @jwdebelius?

Thanks,
Carli


Yes, a similar topic reported this behavior the other day. There seems to be a possible bug in emperor that causes it to fail to display plots containing only 2 samples, and it looks like @yoshiki is on the case. In the meantime, the workaround is to include > 2 samples...

Welcome to the forum @fulbag! Yes, the reads being different lengths and from different V regions will cause issues in downstream analysis. @cjone228 and others have been discussing strategies for processing these data throughout this topic.

Splitting the data by V region and processing each region separately is the consensus suggestion that I and others have recommended in the past. Other methods, like SMURF, sound like they have been troublesome and cannot currently be adapted to other protocols.

Another possibility would be to use closed-reference OTU clustering to essentially merge all V regions into full-length reference 16S sequences. You may lose some of your sequences (if they match the reference poorly), but it is worth a try...
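
A rough sketch of what that could look like (file names are illustrative; it assumes dada2 outputs and the Greengenes 99% OTU reference imported as artifacts, as shown later in this thread):

qiime vsearch cluster-features-closed-reference \
  --i-sequences rep_seqs.qza \
  --i-table table.qza \
  --i-reference-sequences gg_99_otus.qza \
  --p-perc-identity 0.97 \
  --p-strand both \
  --o-clustered-table table_cr97.qza \
  --o-clustered-sequences rep_seqs_cr97.qza \
  --o-unmatched-sequences unmatched_cr97.qza

Setting --p-strand both would also let reads in either orientation hit the same reference sequence.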

Disclaimer: I do not have personal experience processing ion torrent data, and @cjone228 and others may be able to offer more guidance... I just want to make sure your question is not lost in the fray!


Hi @fulbag!

I saw that @Nicholas_Bokulich sent a reply to you just as I was typing a reply as well :grin:

This is a great question! Technically, you can use these fastq files as input (just to clarify, these are the fastq files that you get directly from your sequencer / server - not out of Ion Reporter). And those are exactly the files that Carli is using as input in her first post in this thread:

However, there are a couple of tricky things going on that complicate data interpretation:

  1. Because multiple variable regions are amplified from the same sample at the same time, reads from the same bacterium but different variable regions will be interpreted as "different" bacteria even though they come from the same source (I think this is a similar problem to differences in 16S copy number between bacteria, but in a different way). I believe this is what SMURF is trying to overcome...
  2. The other issue is that comparing sequences from different V regions is difficult.
    @colinbrislawn recently gave me the following explanation:

Counting is the hardest part of bioinformatics. If you count each region separately, you will be underestimating richness in some regions, and double (or triple!) counting ASVs in others.
Most (all?) stats methods in this field presume that you are measuring a single source of diversity:
So you count the number of unique :bird:s in each state
or you count the number of unique :bird:s in each national park
or you count the number of unique :bird:s in each watershed
All of these metrics make sense on their own.
(Total unique :bird: in Montana < Idaho, p=0.02)
But some national parks are shared by several states! And watersheds often cover multiple states!
So while you have measured three things well, combining them is hard, if not impossible
…unless…
instead of states and parks and watersheds, you have a geolocation of each point.
Now you have a unified way to compare all :bird:s.
:earth_americas: :world_map: :round_pushpin:
And this unified method can scale beyond the geographical constructs of the US!
:earth_africa: :earth_asia: :round_pushpin:
The states, parks, and watersheds are your different sequences regions. But what you really need is something to unify all three.
And I think you can bring your :bird:s together with a :evergreen_tree:

Therefore, if you use the consensus fastq files as input, we think you are limited to core-metrics analyses that rely on phylogenetic information (e.g., Faith's PD for alpha diversity and UniFrac for beta diversity).

Our goal in identifying sequences (and primer sequences) from the different V regions is that, if there were some insurmountable problem with the consensus analysis pipeline, we would at least be able to pick sequences from one variable region to analyze our data. And if we discovered some better way to analyze the data, then everyone could start doing it! :raised_hands: :qiime2: :muscle:


Thank you for correcting me - :blush:

As above - :blush:

So, according to your and @Nicholas_Bokulich's comments, the fastq files are 'useless'... unless we can decipher how these .fastq consensus sequences are assembled by the PGM server.
From my previous experience with T cell receptor V-D-J region sequencing (on the alpha and beta chains; Sanger sequencing, years ago), the consensus was built manually by placing the V-D-J sequences within the conserved domains.
I wrongly assumed that the .fastq consensus sequences here were built in a similar way, but now I realize that this is not the case...

So, basically, it is not (yet) possible to extract a single V region from these .fastq files for the pipeline analysis, since the primer sequences are not present.
Correct? :roll_eyes:
Thanks.


Actually, I tried to extract the V3 and V4 regions from my reads. I used some primers that I suppose are the ones used in the kit, and the analysis worked. The only problem was that a big part of the sequences was lost in the bioinformatic analysis. (A sketch of the kind of command I mean is below.)
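
Per-region extraction can be sketched with q2-cutadapt. This is illustrative only: the primer shown is the common V3 341F primer, which may or may not match the kit's proprietary primers, and the file names are made up:

qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux.qza \
  --p-front CCTACGGGAGGCAGCAG \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux_v3.qza

Note that --p-discard-untrimmed drops every read that does not start with the primer, which is consistent with losing a large fraction of the sequences.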


Hi @Lauren and @cjone228,

@KMaki and I are working with the same data as you. After running dada2, we wanted to ask you all: what percentages did you get in the denoising stats .qzv for
"Percent of input passed filter"?

We see that the nice Parkinson's tutorial averages about 95%. Our Ion Torrent data averages about 0.2%, and at best about 30%, across all runs. What did you all see at this step?

Thank you!
Jen

Ouch!

Try increasing the trimming parameters to remove low-quality starting and ending positions.

Also, just to check, are you using denoise-pyro? The PD-mouse tutorial uses Illumina data, which has different defaults than pyro data like Ion Torrent.

Colin


Hi Jen! Sorry to hear you are having this issue. I just finished the DADA2 step for my samples, which span 4 separate chips, and for each chip typically 70-80% of the reads pass the filter. This was true for @Lauren's samples too.

Here is an example of our code:

qiime dada2 denoise-pyro \
  --p-trim-left 15 \
  --p-trunc-len 0 \
  --i-demultiplexed-seqs ~/work/Carli/Chip3/Chip3_demux.qza \
  --o-table ~/work/Carli/Chip3/Chip3_table.qza \
  --o-representative-sequences ~/work/Carli/Chip3/Chip3_rep_seqs.qza \
  --o-denoising-stats ~/work/Carli/Chip3/Chip3_denoising_stats.qza \
  --verbose

I hope this helps!
Carli


Hi, I am trying to use your pipeline to analyze my data, but I have a problem with the import step.

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv \
  --output-path demux_1_15.qza \
  --input-format SingleEndFastqManifestPhred33

This is the error message:
There was a problem importing /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv:

/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv is not a(n) SingleEndFastqManifestPhred33 file:

Line 3 has 1 cells (['/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz']), expected 3.

Can you help me?

Thanks

Hi Rubina,

I think the problem is that your manifest is not actually comma-separated - the importer found only 1 cell on line 3 where it expected 3 (sample-id, absolute-filepath, direction). Try making it a tab-separated .tsv manifest and importing with the SingleEndFastqManifestPhred33V2 input format, and hopefully that will work.
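
For reference, the V2 manifest is a tab-separated file with a header row, roughly like this (a sketch using the path from your error message):

sample-id	absolute-filepath
1_DX	/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest_1_15.tsv \
  --output-path demux_1_15.qza \
  --input-format SingleEndFastqManifestPhred33V2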

Best,
Carli


Hi Everyone!

@cjone228 and I want to clarify this point - we just realized that using the full-length sklearn classifier is not advised, because the pre-trained sklearn classifier doesn't do well with mixed-orientation reads (see post). We had the problem of only getting ~60-70% classification with both Silva and GG, and realized that mixed-orientation reads are likely the culprit. Thus, we will try to move forward with classify-consensus-vsearch. Just found this post - super helpful! :partying_face: :star_struck: :laughing: @KMaki, have you gotten the vsearch consensus classifier to work? If so, do you mind sharing your final code? :pray:

Investigating this hurdle brought up another question: do we need to deal with our mixed-orientation reads prior to the taxonomy steps? (Just a reminder that IT produces single-end reads, but each sample includes reads from 12 different pooled amplicons: 6 forward and 6 reverse primers.) @Jen_S posted on this topic, and @Nicholas_Bokulich recommended separating F from R reads prior to denoising, denoising separately, and then merging. Is this something we should do? Could mixed-orientation reads be part of why only ~70% of our sequences are passing the dada2 filter? (See the discussion between @cjone228 and @Jen_S above.)

Thanks much!
Lauren :qiime2:


Hi @Lauren,

Ideally yes: mixed-orientation reads will inflate diversity estimates (particularly alpha), since each true ASV is counted twice (once per orientation). OTU clustering may be a way to correct this post-dada2, but I have not looked into it yet.

Just to clarify, this is if the F/R reads are being re-oriented, as @Jen_S was doing... because I think dada2 would probably explode if fed re-oriented reads, since the quality profiles would be reversed as well. If you are sorting out orientation issues post-dada2, you don't need to separate reads for denoising.
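
One possible post-dada2 approach is to export the rep seqs and re-orient them against a reference with plain vsearch. A sketch, not something I have tested - it assumes a reference fasta such as Greengenes 99_otus.fasta, and you would still need to re-import the oriented sequences and collapse any resulting duplicates:

qiime tools export --input-path rep-seqs.qza --output-path exported_rep_seqs

vsearch --orient exported_rep_seqs/dna-sequences.fasta \
  --db 99_otus.fasta \
  --fastaout rep_seqs_oriented.fasta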

No, probably not. That's more likely related to trimming/truncation parameters and quality profile more generally.


Hi @Lauren!

I am happy you found the post helpful! :smiling_face_with_three_hearts: I am going to do my best to answer your questions, but will do so in reverse order to line up with the workflow.

The most effective adjustment I made in DADA2 to increase the number of sequences that passed denoising was decreasing my truncation length. Decreasing the truncation length from 250 bp to 150 bp increased the percentage of sequences that passed DADA2 filtering from ~2% up to percentages in the 80s. This was not true across the board, but it helped greatly. I saw that you don't truncate, so it may be worthwhile to see whether truncating at 250 or 150 improves your filtering percentages without sacrificing accuracy (I think I remember reading that you have some mock samples, so you can evaluate this). We also trim at 15 bp and use DADA2's denoise-pyro option.
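
For concreteness, the DADA2 call with those settings would look something like this (a sketch; the input/output names are made up):

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 15 \
  --p-trunc-len 150 \
  --o-table table_trun150.qza \
  --o-representative-sequences rep_seqs_trun150.qza \
  --o-denoising-stats denoising_stats_trun150.qza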

VSEARCH has worked well for us so far! We are still doing the comparisons to see how it lines up with our old workflow, but so far we have been happy with the results.

The code for vsearch classification is below. Here, merged_rep_seqs_pyro_trun150.qza is the rep-seqs artifact from the DADA2 output after merging all of the runs together (if applicable), gg_99_otus.qza is the Greengenes 99% OTUs converted to an artifact, and gg_ref-taxonomy.qza is the Greengenes reference taxonomy, also converted to a QIIME 2 artifact (import code at the bottom):

qiime feature-classifier classify-consensus-vsearch \
  --i-query merged_rep_seqs_pyro_trun150.qza \
  --i-reference-reads gg_99_otus.qza \
  --i-reference-taxonomy gg_ref-taxonomy.qza \
  --o-classification vsearch_gg_classification_trun150.qza

Qiime visualization:

qiime metadata tabulate \
  --m-input-file vsearch_gg_classification_trun150.qza \
  --o-visualization vsearch_gg_classification_trun150.qzv

Code to import the Greengenes fasta and taxonomy .txt into QIIME 2 artifacts (note: you can get the Greengenes data from the QIIME 2 data resources here):

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path gg_99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path gg_ref-taxonomy.qza

I hope this helps! Please let me know if you have any other questions

Sincerely,
Katherine :blush:
