In an effort to keep topics related to analyzing Ion Torrent (IT) 16S Metagenomics Kit Data in one place, @cjone228 and I have an alternative question that we would like to ask in this thread:
I am wondering if it is possible to use Ion Reporter to demultiplex, separate reads by hypervariable region, generate OTUs, and perform taxonomic identification, but then use the resulting downloaded files + my own metadata file to feed into QIIME2 to perform further analysis such as alpha / beta diversity, longitudinal analysis, etc.
The files that are downloaded include the following:
A file named OTU_family_genus_species.biom (I converted to .txt to visualize and took a screenshot below).
A file named OTU_family_genus_species.txt (screenshot of head below - note that the names of the samples are S0XX_XX_v2 - FYI the v2 here is part of the sample name, it is not referring to the v2 hypervariable region)
Both of the above files appear to have taxonomic information from all samples and all primers/hypervariable regions combined into the same file. They also provide similar files that classify only to a specific taxonomic level (see list of files below)
Additionally, for every sample, there are the following files:
Thank you for your replies! I looked more into the files and have more information:
I don't believe they are the same reads. I did a BLAST comparison search, and while many of the sequences are similar (as to be expected), none of them appear to have 100% homology for nearly the entire length of the sequence. (e.g. some of them have 100% homology, but for only 140 out of 220 bases). The *_family.fasta contains approx 100 sequences, whereas the *_genus_species.fasta contains approx 400 sequences. Therefore, I believe that the sequences in the *_family.fasta are those that could not be further classified into the genus or species level.
No - I believe these are the fasta files mapped to each individual variable region. The description of each sequence includes a header (see pic below) which includes the variable region, whether the read is forward or reverse, some sort of alignment score(?), and the taxonomic identification along with the sequence. If you have any thoughts as to what the two asterisked numbers here are, please let me know: >MG | * | V3 | F | * | 99.355 | Actinobacteria ...
One notable piece of information that this file does not include is the number of hits each sequence gets (i.e., its count). That does appear to be in the S001_*_by_primer.txt file, with the caveat that the counts seem to be per genus/species, not per sequence. (For instance, in the screen pic above you can see there are 4 sequences mapped to Collinsella aerofaciens; I think in the S001_*_by_primer.txt file there would be one row for Collinsella aerofaciens with a count of 4.)
I agree with @jwdebelius that we don't have a tree. But since these files include sequences, V regions, and read direction I think we may be able to make one?
MG | * | V3 | F | * | 99.355 | Actinobacteria
^ always 2. I wonder if this is the version of the kit/primers?
^ I wonder if this is the exact read count in the sample?
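To make it easier to tabulate these header fields across a whole fasta file, here is a small shell sketch. The field meanings (and the example value `4` standing in for the asterisked count field) are guesses based on the discussion above, not confirmed:

```shell
# One example header in the pipe-delimited format discussed above; the fifth
# field is shown with a hypothetical value of 4 (the real values were
# asterisked out in the post).
header='>MG | 2 | V3 | F | 4 | 99.355 | Actinobacteria'

# Split on " | " and print: variable region, read direction, and the two
# numeric fields whose meanings are still unclear.
echo "$header" | awk -F' \\| ' '{ print $3, $4, $2, $5 }'
```

Running the same awk one-liner over every `^>` line of a downloaded fasta would give a quick table of regions, directions, and the unknown fields.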
Other than these downstream files, do you also get raw files from this run? I wonder if you could use the V3 runs, for example, to align to raw reads and identify the primers that flank each one. Or do they remove their secret primers before you have a chance to see them?
If you had raw data, you could also check how many times each of these exact subsequences appear and see if my second hypothesis is correct.
Let me know if you have got the data and want help crafting a vsearch command!
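For what it's worth, here is a sketch of the kind of vsearch command I had in mind, assuming the downloaded sequences are in `queries.fasta` and the raw reads have been converted to `raw_reads.fasta` (both filenames are placeholders):

```shell
# Count how many raw reads exactly match each downloaded sequence.
# --search_exact reports only 100% identity, full-length query matches;
# --otutabout writes a per-query count table we can inspect.
vsearch --search_exact queries.fasta \
  --db raw_reads.fasta \
  --strand both \
  --otutabout exact_match_counts.tsv
```

If the counts in `exact_match_counts.tsv` line up with the numbers in the asterisked header field, that would support the read-count hypothesis.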
I am working on the analysis for the ATCC mock communities using the above mentioned pipeline:
I completed steps 1-3, and am now working on the sepp fragment insertion step. In case anyone is curious, here is a picture of the quality plot we got in our demux.qzv file:
We then continued with the DADA2 step (using denoise-pyro) and got a feature table:
I am now encountering an error when trying to do the sepp-fragment-insertion step - we use MARCC, a cluster, in order to do a lot of our work, so the below screenshot is the output of a SLURM job that we submitted for this.
I realized that the code I was using for the sepp-fragment-insertion step from the Parkinson's Mouse tutorial is from QIIME2 2019.10, and in the 2019.7 version there is no --i-reference-database option. I installed (or I think I installed) QIIME2 version 2019.10 on our MARCC account to fix this, but it still gave me the same error saying that this is not a valid option. Any ideas? Thank you so much!!
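For reference, a sketch of the 2019.10 invocation (artifact filenames here are placeholders; the reference database is the SEPP Greengenes 13_8 artifact from the QIIME 2 data resources page). If the option is still rejected, `qiime info` is worth running first to confirm which version is actually active in the environment:

```shell
# QIIME 2 2019.10 syntax for SEPP fragment insertion; 2019.7 does not
# have --i-reference-database, so the error usually means the old
# environment is still the one activated.
qiime fragment-insertion sepp \
  --i-representative-sequences rep-seqs.qza \
  --i-reference-database sepp-refs-gg-13-8.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza
```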
Dear all - I am quite new to microbiome analysis, and started some time ago with the IonT 16S kit mainly because the PGM machine was free. So far I have used the IonReporter microbiome package to get an overall view of my data, but this package is limited/fixed and I would like to run more flexible analyses. I was wondering why not use the 'consensus' fastq files as input sequences in QIIME2, rather than the single V regions (for which the F and R primer sequences are needed). The fastq file for each of my samples is >100 MB, contains >200,000 sequences, and looks like this:
Unfortunately, the sequence lengths are quite different, since they do not correspond to a single V region. Could this be a problem in subsequent QIIME2 steps?
Thank you for any help and comment!
Making more progress on analysis of the 2 ATCC samples in QIIME2! However, we have now run into kind of a strange error that we'd love some feedback on. Since I last posted, I was able to complete the sepp-fragment-insertion step. We then performed alpha rarefaction in order to determine the sampling depth for core metrics analysis. We decided that 40,000 reads was a good sampling depth for us (which is good because our goal for our actual clinical samples is to downsample to 40,000 reads/sample).
For core metrics analysis, we used the following code:
Note that you can see "2/2 visible" in the upper left corner, and the dropdown menu in the upper right does know about our samples (there are only two in this little pilot analysis).
Yes, a similar topic reported this behavior the other day. There seems to be a possible bug in emperor that causes it to fail to display plots containing only 2 samples, and it looks like @yoshiki is on the case. In the meantime, the workaround is to include > 2 samples...
Welcome to the forum @fulbag! Yes, the reads being different lengths and from different V regions will cause issues in downstream analysis. @cjone228 and others have been discussing strategies for processing these data throughout this topic.
Splitting up the data by V region and processing separately seems to be the consensus suggestion that I and others have recommended in the past. Other methods, like using SMURF, sound like they have been troublesome and are not currently possible to adapt to other protocols.
Another possibility would be to use closed-reference OTU clustering to essentially merge all V regions into full-length reference 16S sequences. You may lose some of your sequences (if they poorly match the reference) but it is worth giving it a try...
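A sketch of what that closed-reference clustering step could look like, assuming Greengenes 99% OTUs as the reference (all artifact names here are placeholders):

```shell
# Cluster mixed-region ASVs against full-length 16S reference sequences;
# reads from different V regions of the same organism should collapse
# onto the same reference OTU. Unmatched sequences are set aside.
qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences gg_99_otus.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-cr-97.qza \
  --o-clustered-sequences rep-seqs-cr-97.qza \
  --o-unmatched-sequences unmatched-cr-97.qza
```

Checking how many sequences land in `unmatched-cr-97.qza` would show how much data this approach sacrifices.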
Disclaimer: I do not have personal experience processing ion torrent data, and @cjone228 and others may be able to offer more guidance... I just want to make sure your question is not lost in the fray!
I saw that @Nicholas_Bokulich sent a reply to you just as I was typing one as well.
This is a great question! Technically, you can use these fastq files as input (just to clarify, these are the fastq files that you get directly from your sequencer / server - not out of Ion Reporter). And those are exactly the files that Carli is using as input in her first post in this thread:
However, there are a couple of tricky things going on that complicate data interpretation:
Because multiple variable regions are amplified from the same sample at the same time, reads from the same bacterium but different variable regions will be interpreted as "different" bacteria even though they come from the same source (I think this is similar to the problem of 16S copy number differences between bacteria, but in a different way). I believe this is what SMURF is trying to overcome...
The other issue is that comparing sequences from different V regions is difficult. @colinbrislawn recently gave me the following explanation:
Counting is the hardest part of bioinformatics. If you count each region separately, you will be underestimating richness in some regions, and double (or triple!) counting ASVs in others.
Most (all?) stats methods in this field presume that you are measuring a single source of diversity:
So you count the number of unique species in each state
or you count the number of unique species in each national park
or you count the number of unique species in each watershed
All of these metrics make sense on their own.
(Total unique species in Montana < Idaho, p=0.02)
But some national parks are shared by several states! And watersheds almost always cover multiple states!
So while you have measured three things well, combining them is hard, if not impossible …unless…
instead of states and parks and watersheds, you have a geolocation of each point.
Now you have a unified way to compare all species.
And this unified method can scale beyond the geographical constructs of the US!
The states, parks, and watersheds are your different sequenced regions. But what you really need is something to unify all three.
And I think you can bring your species together with a phylogenetic tree.
Therefore, if you use the consensus fastq files as input, we think you are limited to core metrics analyses that rely on phylogenetic information (e.g., Faith's PD for alpha diversity and UniFrac for beta diversity).
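A sketch of what that phylogeny-based core metrics run could look like (file names and the sampling depth of 40,000 from earlier in the thread are assumptions):

```shell
# Core diversity metrics computed from a phylogenetic tree: this includes
# Faith's PD (alpha) and UniFrac distances (beta), which are usable even
# when features come from different variable regions.
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny insertion-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 40000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results
```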
Our goal in identifying the sequences (and primer sequences) from the different V regions was that, if there were some insurmountable problem with the consensus analysis pipeline, we would at least be able to pick sequences from one variable region to analyze our data. And also, if we discovered some better way to analyze the data, then everyone could start doing it!
So, according to your and @Nicholas_Bokulich's comments, the fastq files are 'useless'... unless we can decipher how these .fastq consensus sequences are assembled by the PGM server.
From my previous experience with sequencing the T cell receptor V-D-J region (on the alpha and beta chains) by Sanger sequencing, years ago, the consensus was (manually) built by placing the V-D-J sequences within the conserved domains.
I wrongly assumed that the .fastq consensus here was built in a similar way, but now I realize that this is not true...
So, basically, it is not (yet) possible to extract a single V region from these .fastq files for the pipeline analysis, since the primer sequences are not present.
Correct?
Thanks.
Actually, I tried to extract the V3 and V4 regions from my reads. I used some primers that I suppose are the ones used in the kit, and the analysis worked. The only problem was that a large part of the sequences was lost in the bioinformatic analysis.
@KMaki and I are working with the same data as you and wanted to ask: after running DADA2, what percentages did you get for this variable in the denoising stats .qzv output:
“Percent of input passed filter”
We see that the nice Parkinson's tutorial has an average of about 95%. Our Ion Torrent data ranges from about 0.2% up to our best of about 30% across all runs. What did you all see here after this step?
Try increasing the trimming parameters to remove low quality starting and ending positions.
Also, just to check, are you using denoise-pyro? The PD-mouse tutorial uses Illumina data, which has different defaults than pyro data like Ion Torrent.
Hi Jen! Sorry to hear you are having this issue - I just finished the DADA2 step for my samples, which span 4 separate chips, and for each of them it seems as though typically 70-80% of the reads pass the filter. This was true for @Lauren's samples too.
This is the error message:
There was a problem importing /Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv:
/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/manifest_1_15.csv is not a(n) SingleEndFastqManifestPhred33 file:
Line 3 has 1 cells (['/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz']), expected 3.
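That "expected 3" error usually means a manifest row is not comma-delimited into three cells (a wrong delimiter or a missing column). For comparison, here is a minimal valid SingleEndFastqManifestPhred33 manifest built from the path in the error message (the sample ID is a guess), plus a quick column-count check:

```shell
# Write a minimal single-end manifest: a header row plus one data row with
# exactly three comma-separated cells (sample-id, absolute-filepath, direction).
cat > manifest_1_15.csv <<'EOF'
sample-id,absolute-filepath,direction
1_DX,/Users/rparadiso/Documents/Progetti/Microbiota_orecchio_cane/QIIME_analysis/FastQ/1_15/1_DX.fastq.gz,forward
EOF

# Sanity check: flag any row that does not have exactly 3 cells.
awk -F',' 'NF != 3 { print "bad row " NR; bad = 1 } END { exit bad }' manifest_1_15.csv \
  && echo "manifest OK"
```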
@cjone228 and I want to clarify this point - we just realized that using the full-length sklearn classifier is not advised because the pre-trained sklearn classifier doesn't do well with mixed-orientation reads (see post). We just had the problem where we were only getting ~60-70% classification with both Silva and GG and realized that mixed-orientation reads are likely the culprit. Thus, we will try to move forward with classify-consensus-vsearch. Just found this post - super helpful! @KMaki have you gotten the vsearch consensus classifier to work? If so, do you mind sharing your final code?
Investigation of this hurdle brought up another question: do we need to deal with our mixed-orientation reads prior to the taxonomy steps? (Just a reminder that IT produces single-end reads, but each sample includes reads from 12 different pooled amplicons: 6 F primers and 6 R primers.) @Jen_S posted on this topic and @Nicholas_Bokulich recommended separating F from R reads prior to denoising, denoising separately, and then merging. Is this something we should do? Could mixed-orientation reads be part of why only ~70% of our sequences are passing the DADA2 filter? (see @cjone228 and @Jen_S discussion on that below).
Ideally yes, since having mixed-orientation reads will inflate (particularly alpha) diversity estimates, since each true ASV is being counted twice (once for each orientation). OTU clustering may be a way to correct this post-dada2, but I have not looked into it yet.
Just to clarify, this is if the F/R reads are being re-oriented as @Jen_S was doing... because I think dada2 would probably explode if it is fed re-oriented reads, since the quality profiles would be reversed as well. If you are sorting out orientation issues post-dada2, you don't need to separate for denoising.
No, probably not. That's more likely related to trimming/truncation parameters and quality profile more generally.
I am happy you found the post helpful! I am going to do my best to answer your questions, but will do so in reverse to line up with workflow.
The most effective adjustment that I made in DADA2 to increase the number of sequences that passed denoising was decreasing my truncation length. Decreasing the truncation length from 250 bp to 150 bp increased the number of sequences that passed DADA2 filtering from ~2% up to percentages in the 80s. This was not true across all areas but helped greatly. I saw that you don't truncate, so it may be worthwhile to see if truncating at 250 or 150 improves your filtering percentages without sacrificing accuracy (I think I remember reading you have some mock samples, so you can evaluate this). We also trim at 15 bp and use the DADA2 denoise-pyro option.
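For anyone following along, those settings translate to something like the following (the input/output artifact names are placeholders):

```shell
# DADA2 for Ion Torrent single-end reads: trim the first 15 bp and
# truncate at 150 bp, per the settings that improved our filtering rates.
qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 15 \
  --p-trunc-len 150 \
  --o-table table_pyro_trun150.qza \
  --o-representative-sequences rep_seqs_pyro_trun150.qza \
  --o-denoising-stats stats_pyro_trun150.qza
```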
VSEARCH has worked well for us so far! We are still doing the comparisons to see how it lines up to our old workflow but so far we have been happy with the result.
The code for vsearch classification is below:
qiime feature-classifier classify-consensus-vsearch \
  --i-query merged_rep_seqs_pyro_trun150.qza \
  --i-reference-reads gg_99_otus.qza \
  --i-reference-taxonomy gg_ref-taxonomy.qza \
  --o-classification vsearch_gg_classification_trun150.qza

# --i-query: the rep-seqs artifact that comes from the DADA2 output after merging all of the runs together (if applicable)
# --i-reference-reads: the Greengenes 99% OTUs that we converted to an artifact (code at bottom)
# --i-reference-taxonomy: the reference taxonomy from Greengenes, also converted to a QIIME 2 artifact (code below)
# --o-classification: output
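The Greengenes import commands referenced in that post are not reproduced in this excerpt; they would typically look like the following, assuming the standard Greengenes 13_8 release files (input filenames are assumptions):

```shell
# Import the Greengenes 99% OTU sequences as a FeatureData[Sequence] artifact.
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path gg_99_otus.qza

# Import the matching taxonomy file (headerless, tab-separated) as a
# FeatureData[Taxonomy] artifact.
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path gg_ref-taxonomy.qza
```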