Possible Analysis Pipeline for Ion Torrent 16S Metagenomics Kit Data in QIIME2?

cjone228 · February 5, 2020, 12:53am

Hi QIIME2 community,

My colleague @Lauren I have been using the Ion torrent 16S Metagenomics Kit to sequence some clinical samples that we have. Our sample sets are both longitudinal. The kit consists of two primer sets, one for V2-4-8 and one for V3-6,7-9. The products of the reactions are pooled per sample, and then the rest of library prep is completed. Each sample gets its own barcode. Ion Torrent produces single end reads.

The issue here is that the reads from each of the 6 variable regions (in forward and reverse orientations) are all given to the user in one fastq file per sample on the server for the sequencing machine. Therefore, there are essentially 12 amplicons within the one fastq file. On top of this, the primer sequences from the kit are proprietary, so we cannot easily separate them (we are not bioinformaticists by training).

We spoke with @ebolyen and others on the QIIME2 team (@gregcaporaso, @thermokarst, @Mehrbod_Estaki) at the QIIME2 Workshop at the NIH in early January and got some ideas on how we could possibly analyze our data, and we wanted to re-hash it here in case anyone had any ideas on how we could move forward. We are graduate students, and this is also very valuable clinical data, so we would really like to be able to analyze it! We also know that others are having the same issues.

Possible Analysis Pipeline:

Import fastq files using manifest file format
Perform separate DADA2 step for each sequencing run, then merge DADA2 results

Use denoise-pyro option
p-trim-left 15
Do not truncate`

Explore feature table
Create phylogenetic tree using SEPP (fragment insertion)
Alpha rarefaction and selection of rarefaction depth using diagnostic curves
Core metrics analysis (only metrics which include phylogeny)

alpha diversity --> Faith's PD only
beta diversity --> Weighted and Unweighted Unifrac only

Taxonomic classification (from feature table in step 3) using pre-trained full length 16S Green Genes classifier

@Nicholas_Bokulich in your reply to the thread here, you suggest this classifier for Ion Torrent data but later suggest VSEARCH instead. Could you please clarify?

Going off of another thread we saw on here, other people are also wondering how to align reads to the 16S gene to possibly separate the amplicons. One new method we saw was SMURF. Any thoughts on that, anyone? Again, we do not have the primer sequences.

We just wanted to check and see if A) this is an acceptable way to move forward and B) if anyone has any further ideas on how we could separate the amplicons so we can do further (ie. longitudinal and ANCOM) analyses. Any and all input would be appreciated. Thank you!

Nicholas_Bokulich · February 5, 2020, 2:54am

Hi @cjone228,

The workflow you've described looks okay to me at first glance... I have not worked with ion torrent data or this kit, but based on my understanding this all seems fine.

I don't think I can add anything to what has already been discussed in several topics on the forum — e.g., the one you linked to above. Not having the primers complicates this. This makes q2-quality-control the best approach for separating... you would need to create reference sequences trimmed to the variable regions you know you have (even if you don't have the exact primers) and try to grab out reads that are close enough matches (be lenient with the perc-identity and coverage parameters)

I clarified at the bottom of that post:

I think you maybe need to know your primers for SMURF??? @jwdebelius may provide better advice regarding SMURF.

Good luck!

jwdebelius · February 5, 2020, 9:22am

Hi @cjone228,

I've worked some with SMURF over the past year. The short answer is that as of February 2020, SMURF is probably not a viable solution.

Let's ignore the run-ability of the current Matlab code (and the that has filled my swear jar as a consequence), and just focus on the core issue.

You need primers to extract your reference region. Not just the knowledge of the region but the exact primer pairs and read length. Someone who knew the regions could do the theoretical extraction for you and then trim the primers, and then you could theoretically align all your reads against each region in a computational expensive but ultimately semi effective system...

Practically, the MATLAB version is entangled in such a way that it's hard to run anything but the test data and the exact SMURF primers. The documentation is opaque, the code is convoluted, and I will venmo/swish/wire $5 to anyone who can figure it out. (I'm serious on this, please PM me).

I just got permission to work on a python re-implementation, but there are some issues to work through before that's available. I'm not sure Im ready to put a timeline on it because I'll just embarrass myself in the future, but soonish.

Sorry if this comes across as a bit angry or passionate, Ive spent more than a few hours trying to make sense of the matlab code and get my samples to run.

Best,
Justine

cjone228 · February 5, 2020, 8:25pm

Thanks for your reply, @Nicholas_Bokulich ! We are still a bit confused at this very section - at first you say "better to stick with vsearch", but then in point 2 you say "the pre-trained full-length 16S classifier would be best". I might just be confused, but I thought these were two separate things. Could you please clarify which we should use? Thank you so much!

cjone228 · February 5, 2020, 8:26pm

Thanks, Justine! Sounds like it is probably not the right tool for us. I feel your pain. @jwdebelius

Nicholas_Bokulich · February 5, 2020, 8:37pm

Fair enough! I think I was weaving two different thoughts together — e.g., classify-consensus-vsearch would be best, but if you use the sklearn classifier use the full-length classifier.

The same would go for you — you could use classify-consensus-vsearch to align against full-length 16S sequences, or use the sklearn classifier trained on the full-length 16S sequences. Both would classify your sequences regardless of location.

cjone228 · February 5, 2020, 8:43pm

Thank you so much! We plan to do some little test runs for this pipeline with some Ion Torrent data we have from a mock community and we will keep you guys updated on this thread.

Jen_S · February 7, 2020, 7:23pm

I also am using this kit and would love to hear any feedback from anyone who had some success with running data from this kit and QIIME2.
The kit contains at least 12 primers targeting 7 regions of the 16s rRNA gene. 6 primers in the forward direction and 6 primers in the reverse direction. To further complicate the matters is that the fwd and rev primers not paired, ie. the sequencing is bidirectional. One work around I came up with was if you are able to separate out your reverse read by region, to then reverse complement them and concatenate the reads tot he forward direction after trimming off primers from both sets.

Jen

Imindias · February 17, 2020, 12:21pm

Hi all!
sorry for the last response. We are using this kit in my lab, and it's a total nightmare.
The pipeline you suggest is ok for me, it's the same I am doing, although I have a question regarding the alpha diversity, why only Faith'sPD?
and then, taxonomic classification, I use vsearch, although together with the fact that we cannot truncate in DADA2, there are a lot of unclassified reads.
On the other hand, I have been trying to use the results from ion reporter, in order to have a better taxonomic resolution. You can introduce them in ANCOM and for longitudinal analysis, but not for other analysis.
All the best,
Isabel

cjone228 · February 17, 2020, 8:08pm

Hi Isabel! Thanks for your reply!!

The reason you can only reliably trust Faith's PD for the alpha diversity analysis is because it incorporates phylogeny - @ebolyen, could you maybe elaborate on why that's important?

I think VSEARCH is definitely an option, but from what we've heard so far we can use the pre-trained full length 16S Green Genes classifier since all of our variable regions are included within it. Why can't you truncate in DADA2? Are you getting an error? I don't think we truncated with what we've done so far (we were told we don't need to), but am curious as to why you can't/if this is bad.

This really piqued our interest because we REALLY want to do ANCOM and Longitudinal Analyses. Is there a file in Ion Reporter somewhere that we can feed into QIIME2 to do this??

Lauren · February 21, 2020, 7:31pm

Hi All,

In an effort to keep topics related to analyzing Ion Torrent (IT) 16S Metagenomics Kit Data in one place, @cjone228 and I have an alternative question that we would like to ask in this thread:

Users have the option of utilizing ThermoFisher software called Ion Reporter to analyze their data. Here is a reference for the application note: https://www.thermofisher.com/content/dam/LifeTech/Documents/PDFs/Ion-16S-Metagenomics-Kit-Software-Application-Note.pdf and another reference for their white paper on Metagenomics 16S algorithms: https://tools.thermofisher.com/content/sfs/brochures/ion-reporter-16s-metagenomics-algorithms-whitepaper.pdf . From what I can tell, this software categorizes reads into OTUs and then maps to either a proprietary curated Greengenes database, a proprietary curated MicroSeq database, or both. From there, the user can download a slew of files.

I am wondering if it is possible to use Ion Reporter to demultiplex, separate reads by hypervariable region, generate OTUs, and perform taxonomic identification, but then use the resulting downloaded files + my own metadata file to feed into QIIME2 to perform further analysis such as alpha / beta diversity, longitudinal analysis, etc.

The files that are downloaded include the following:

A file named OTU_family_genus_species.biom (I converted to .txt to visualize and took a screenshot below).

A file named OTU_family_genus_species.txt (screenshot of head below - note that the names of the samples are S0XX_XX_v2 - FYI the v2 here is part of the sample name, it is not referring to the v2 hypervariable region)

Both of the above files appear to have taxonomic information from all samples and all primers/hypervariable regions combined into the same file. They also provide similar files that only identify to the specific taxonomic level (see list of files below)

Additionally, for every sample, there are the following files:

See below for an example of the head of S001_354_v2_by_primer.txt

Does it look like we may be able to use any of these files as input for QIIME2? Thanks for any and all input!

Lauren and Carli

colinbrislawn · February 24, 2020, 12:22am

Yes, this should work

That list of files looks fascinating! So many questions...

Are all the same reads in each pair of *_family.fasta and *_genus_species.fasta files?

Is there any overlap of reads between the V2, V3, V4, V8, etc files?
If not, these could be split by region like we hoped for!

Thank you so much Lauren and Carli. This is very interesting!
Colin

jwdebelius · February 24, 2020, 10:02am

My one concern on the QIIME move is that you don't have a tree () ! You need either a greengenes ID, a corresponding ID, or a sequence to align.

Best,
Justine

Lauren · February 25, 2020, 4:55pm

Hi Colin and Justine,

Thank you for your replies! I looked more into the files and have more information:

I don't believe they are the same reads. I did a BLAST comparison search, and while many of the sequences are similar (as to be expected), none of them appear to have 100% homology for nearly the entire length of the sequence. (e.g. some of them have 100% homology, but for only 140 out of 220 bases). The *_family.fasta contains approx 100 sequences, whereas the *_genus_species.fasta contains approx 400 sequences. Therefore, I believe that the sequences in the *_family.fasta are those that could not be further classified into the genus or species level.

No - I believe these are the fasta files mapped to each individual variable region. The description of each sequence includes a header (see pic below) which includes the variable region, whether the read is forward or reverse, some sort of alignment score(?), and taxonomic identification along with the sequence. If you have any thoughts as to what the two numbers asterisked here are, can you please inform? >MG | * | V3 | F | * | 99.355 | Actinobacteria ...

One notable piece of information that this file does not include, is the number of hits each sequence gets (i.e. count). This does appear to be in the file S001_ * by_primer.txt, with the caveat that the counts seem to be for particular genus/species not sequence. (For instance in the screen pic above, you can see there are 4 sequences mapped to Collinsella aerofaciens. I think in the S001 *_by_primer.txt file, there would be one row for Collinsella aerofaciens and the count would be 4).

I agree with @jwdebelius that we don't have a tree. But since these files include sequences, V regions, and read direction I think we may be able to make one?

Take care,
Lauren

colinbrislawn · February 25, 2020, 6:59pm

Hello Lauren,

This is a gold mine!

The headers is what I'm most interested in.

MG | * | V3 | F | * | 99.355 | Actinobacteria
     ^ always 2. I wonder if this is the version of the kit/primers?
                  ^ I wonder if this is the exact read count in the sample?

Other than these downstream files, do you also get raw files from this run? I wonder if you could use the V3 runs, for example, to align to raw reads and identify the primers that flank each one. Or do they remove their secret primers before you have a chance to see them?

If you had raw data, you could also check how many times each of these exact subsequences appear and see if my second hypothesis is correct.

Let me know if you have got the data and want help crafting a vsearch command!

Thanks!
Colin

cjone228 · February 27, 2020, 9:33pm

Hi all,

I am working on the analysis for the ATCC mock communities using the above mentioned pipeline:

I completed steps 1-3, and am now working on the sepp fragment insertion step. In case anyone is curious, here is a picture of the quality plot we got in our demux.qzv file:

We then continued with the DADA2 step (using denoise-pyro) and got a feature table:

and rep-seqs file:

I am now encountering an error when trying to do the sepp-fragment-insertion step - we use MARCC, a cluster, in order to do a lot of our work, so the below screenshot is the output of a SLURM job that we submitted for this.

I realized that the code I was using from the sepp-fragment-insertion step from the Parkinson's Mouse tutorial is from QIIME2 2019.10, so in the 2019.7 version there is no option for --i-reference-database. I installed (or I think I installed) QIIME2 version 2019.10 into our MARCC account to fix this, but it still gave me the same error saying that this is not an option for an argument. Any ideas? Thank you so much!!

Carli

fulbag · February 29, 2020, 8:04pm

Dear all - I am quite new in microbiome analysis, and started some time ago with IonT16S kit mainly because the PGM machine was free. So far, I have used IonReporter microbiome package to get an overall view of my data, but this package is 'limited/fixed' and I would like to make more flexible analyses. I was wondering why not use the 'consensus' Fastq files as input sequences in QIIME2, and not the single V regions (for which the F and R primer sequences are needed). Fastq file for each of my samples is >100MB, contains > 200,000 sequences and look like this:

Unfortunately, the length of sequences are quite different, since do not correspond to a single V region. Could this be a problem in subsequent QIIME steps?
Thank you for any help and comment!

cjone228 · March 4, 2020, 8:15pm

Hi all,

Making more progress on analysis of the 2 ATCC samples in QIIME2! However, we have now run into kind of a strange error that we'd love some feedback on. Since I last posted, I was able to complete the sepp-fragment-insertion step. We then performed alpha rarefaction in order to determine the sampling depth for core metrics analysis. We decided that 40,000 reads was a good sampling depth for us (which is good because our goal for our actual clinical samples is to downsample to 40,000 reads/sample).

For core metrics analysis, we used the following code:

qiime diversity core-metrics-phylogenetic \
  --i-table ~/work/IT_pilot/IT_pilot_table.qza \
  --i-phylogeny ~/work/IT_pilot/IT_pilot_tree.qza \
  --m-metadata-file ~/work/IT_pilot/IT_pilot_metadata.tsv \
  --p-sampling-depth 40000 \
  --output-dir ~/work/IT_pilot/IT_pilot_core_metrics_results

It worked with no errors (yay!), but then when we tried to visualize any of the PCoA plots, we can't see any of the points See example below:

Note that you can see "2/2 visible" in the upper left corner, and the dropdown menu in the upper right does know about our samples (there are only two in this little pilot analysis).

Does anyone have any ideas what the issue may be? @colinbrislawn? @jwdebelius?

Thanks,
Carli

Nicholas_Bokulich · March 4, 2020, 8:28pm

Yes, a similar topic reported this behavior the other day. There seems to be a possible bug in emperor that causes it to fail to display plots containing only 2 samples, and it looks like @yoshiki is on the case. In the meantime, the workaround is to include > 2 samples...

Welcome to the forum @fulbag! Yes, the reads being different lengths and from different V regions will cause issues in downstream analysis. @cjone228 and others have been discussing strategies for processing these data throughout this topic.

Splitting up the data by V region and processing separately seems to be the consensus suggestion that I and others have recommended in the past. Other methods, like using SMURF, sound like they have been troublesome and are not currently possible to adapt to other protocols.

Another possibility would be to use closed-reference OTU clustering to essentially merge all V regions into full-length reference 16S sequences. You may lose some of your sequences (if they poorly match the reference) but it is worth giving it a try...

Disclaimer: I do not have personal experience processing ion torrent data, and @cjone228 and others may be able to offer more guidance... I just want to make sure your question is not lost in the fray!

Lauren · March 4, 2020, 8:43pm

Hi @fulbag!

I saw that @Nicholas_Bokulich sent a reply to you just as I was typing a reply as well

This is a great question! Technically, you can use these fastq files as input (just to clarify, these are the fastq files that you get directly from your sequencer / server - not out of Ion Reporter). And those are exactly the files that Carli is using as input in her first post in this thread:

However there are a couple of tricky things going on that complicates data interpretation:

Because there are multiple variable regions being amplified from the same sample at the same time, reads from the same bacterium but different variable regions will be interpreted as "different" bacteria even though they come from the same source (I think this is a similar problem as differences in 16s copy number between bacteria but in a different way). I believe this is what SMURF is trying to overcome...
The other issue is that comparing sequences from different V regions is difficult.
@colinbrislawn recently gave me the following explanation:

Counting is the hardest part of bioinformatics. If you count each region separately, you will be underestimating richness in some regions, and double (or tripple!) counting ASVs in others.
Most (all?) stats methods in this field presume that you are measuring a single source of diversity:
So you count the number of unique s in each state
or you count the number of unique s in each national park
or you count the number of unique s in each watershed
All of these metrics make sense on their own.
(Total unique in Montana < Idaho, p=0.02)*
But some national parks are shared by several states! And watershed often always cover multiple states!
So while you have measured three things well, combining them is hard, if not impossible
…unless…
instead of states and parks and watersheds, you have a geolocation of each point.
Now you have a unified way to compare all s.

And this unified method can scale beyond the geographical constructs of the US!

The states, parks, and watersheds are your different sequences regions. But what you really need is something to unify all three.
And I think you can bring your s together with a

Therefore, if you use the consensus fastq files as input, we think you are limited to core metrics analysis that relies on phylogenetic analysis (e.g. Faith's PD for alpha diversity and Unifrac for beta diversity)

Our goal for identifying sequences (and primer sequences) from the the different V regions is that if there was some insurmountable problem in performing the consensus analysis pipeline, then we would be able to at least pick sequences from one variable region to analyze our data. And also, if we discovered that there were some better way to analyze the data then everyone could start doing it! :qiime2: