So, according to your and @Nicholas_Bokulich comments, the fastq files are ‘useless’… unless we may decipher how these .fastq consensus sequences are assembled by the PGM server.
From my previous experience on the T cell receptor V-D-J region (on the alpha and beta chains) sequencing (Sanger’s sequencing - years ago), the consensus was (manually) built by placing the V-D-J sequences within the conserved domains.
Wrongly, I was assuming that here the .fastq consensus were built in a similar way, but now I realize that this is not true…
So, basically it is not (yet) possible to extract from these .fastq files a single V region for the pipeline analysis - since primer sequences are not present.
Really I tried to extract from my reads the v3 and v4 regions. I used some primers that I suppose to be used in the kit and the analysis work. The only problem was that a big part of you sequence are lose in the bioinformatic analysis
@KMaki and I are working with the same data as you and wanted to ask you all after running dada2, what percentages did you get from the denoising output in the variable from the denoising stats qzv:
“Percent of input passed filter”
We see that the nice Parkinson’s tutorial has an average of percent of about 95%. Our Ion Torrent specific data has an average about .2% to our best which is about 30% across all runs. What did you all see here after this step?
Hi Jen! Sorry to hear you are having this issue - I just finished the DADA2 step for my samples, which span 4 separate chips, and for each of them it seems as though typically 70-80% of the reads pass the filter. This was true for @Lauren’s samples too.
@cjone228 and I want to clarify this point - we just realized that using the the full length sklearn classifier is not advised because the pre-trained sklearn classifier doesn't do well with mixed-orientation reads (see post). We just had the problem where we were only getting ~60-70% classification with both Silva and GG and realized that mixed orientation reads are likely the culprit. Thus, we will try to move forward with classify-consensus-vsearch. Just found this post - super helpful! @KMaki have you gotten vsearch consensus classifier to work? If so, do you mind sharing your final code?
Investigation of this hurdle brought up another question: do we need to deal with our mixed orientation reads prior to the taxonomy steps? (Just a reminder that IT produces single end reads, but each sample includes reads from 12 different pooled amplicons: 6F primers and 6R primers.) @Jen_Sposted on this topic and @Nicholas_Bokulich recommended separating out F from R reads prior to denoising, denoising separately, and then merging. Is this something we should do? Could mixed oriented reads be part of why only ~70% of our sequences are passing the dada2 filter? (see @cjone228 and @Jen_S discussion on that below).
Ideally yes, since having mixed-orientation reads will inflate (particularly alpha) diversity estimates, since each true ASV is being counted twice (once for each orientation). OTU clustering may be a way to correct this post-dada2, but I have not looked into it yet.
Just to clarify, this is if the F/R reads are being re-oriented as @Jen_S was doing… because I think dada2 would probably explode if it is fed re-oriented reads, since the quality profiles would be reversed as well. If you are sorting out orientation issues post-dada2, you don’t need to separate for denoising.
No, probably not. That’s more likely related to trimming/truncation parameters and quality profile more generally.
I am happy you found the post helpful! I am going to do my best to answer your questions, but will do so in reverse to line up with workflow.
The most effective adjustment that I made in DADA2 to increase the amount of sequences that passed denoising was decreasing my trun length. Decreasing the truncation length from 250 bp to 150 bp increased the amount of sequences that passed through DADA2 filtering from ~2% up to percentages in the 80s. This was not true across all areas but helped greatly. I saw that you don’t truncate so it may be worthwhile to try seeing if truncating at 250 or 150 improves your filtering percentages without sacrificing accuracy (I think I remember reading you have some mock samples so you can evaluate this). We also trim at 15 bp and use the DADA2 –pyro option.
VSEARCH has worked well for us so far! We are still doing the comparisons to see how it lines up to our old workflow but so far we have been happy with the result.
The code for vsearch classification is below:
qiime feature-classifier classify-consensus-vsearch
–i-query merged_rep_seqs_pyro_trun150.qza \ #This is rep seqs artifact
that comes from DADA2 output after merging all of the runs together
–i-reference-reads gg_99_otus.qza \ #This is the green genes 99%
OTUs that we converted to an artifact (Code at bottom)
–i-reference-taxonomy gg_ref-taxonomy.qza \ #Reference tax from
Greengenes also converted to qiime Artifact (code below)
–o-classification vsearch_gg_classification_trun150.qza #output
It’s about time we get a QIIME 2 tutorial for ion torrent sequencing put together, per the title of this topic. Maybe everyone who has posted to this topic should team up to write a tutorial that we can post in the “community tutorials” section of the forum (and eventually the QIIME 2 library)… anyone interested? It takes a village to write a tutorial…
@Jen_S and I are definitely interested in putting together a tutorial. We are currently trying to contact IonTorrent to see if there is a way to extract V regions with a plugin or some other way to preserve their intellectual property while working with the QIIME2 workflow. This will determine how we arrange the future workflow so based on what plays out we can start working on the IonTorrent analysis tutorial.
I have a follow up question about my vsearch taxonomy classifier. One of my regions has a higher percentage of unclassified taxa than the other and I’m thinking I could potentially tweak some of the vsearch settings in the “qiime feature-classifier classify-consensus-vsearch” command.
I am currently using the default parameters (as I have in my code in my response to @Lauren
I am curious if you have any suggestions for parameters to adjust as a starting point to try and improve this unclassified tax %?
My DADA2 stats in this region are pretty good and nothing changed when I increased the truncation length of the segment pulled from greengenes so I think tweaking the vsearch parameters would be a good next testing point.
Appreciate any thoughts/suggestions you may have, thank you!
Sorry for my delayed response - thank you both very much for your input!
This makes sense - @cjone228 and I are completely new to bioinformatics so we are learning along the way. It didn't occur to us until last week that we were importing mixed reads using SingleEndFastqManifestPhred33V2. We now see on the Qiime2 Docs importing data page that "In this variant of the fastq manifest format, the read directions must all either be forward or reverse."
I think that it makes sense for us to re-orient our reads as @Jen_S did because approximately half of our reads are F and half R.
@KMaki, your reply post on VSEARCH is very clear (thank you!) - however I think @cjone228 and I need to take a step back and work on re-doing the initial steps before we get to this point.
@Nicholas_Bokulich, I would be happy to contribute to a tutorial, however I need to figure out what I am doing first!
Now I have a few new discoveries / questions that go back to late February when I was going through output files from Ion Reporter with @colinbrislawn:
It occurred to me that I only listed a few of the files that IR spits out. Turns out it also spits out the following list:
I see that IR is using Qiime v 1.9.1, so is it possible for me to convert some of these files into Qiime2 compatible files and use them? My concern is that IR does everything through it's own server, so we wouldn't have access to any of the intermediate files (if there are any). Also, I am happy to open more of the folders and show you what is in them if that is helpful.
I see that I have some taxonomy .biom files, so yesterday I was trying to convert the OTU_family.biom file into a Qiime2 .qza file. I found this post and attempted to follow the instructions (see screenshot below)
I picked --type FeatureData [Taxonomy] because is doesn't look like my file has any sequences in it and I picked --input-format BIOMV210Format because my file states "BIOM-Format 2.1.7" and it was made with Qiime v 1.9.1 (see .biom contents in screenshot below)
Once converted you might be able to import as taxonomy, although based on the screenshot, I’m not 100% sure that’ll work, because that special import invocation expects the taxonomy to be stored as metadata in the biom table, rather than the IDs themselves.
Unclassified taxa can occur for a few reasons with the vsearch-based classifier:
there really is no good match.
Seems unlikely if this is region-specific, but that could occur if the reference sequences have a jumble of different regions covered (e.g., different primers were used for sequencing those reference sequences). It looks like you are using Greengenes? So that is probably not the case, either.
Can occur if --p-perc-identity is set too high… but the default is really low, so that’s probably not the issue.
Can also occur if reads are in the wrong orientation, but the default is to check both orientations, so not the issue here.
there is more than one match at the basal taxonomic level (kingdom in the case of greengenes) above the consensus threshold. So theoretically you may be hitting both bacteria and archaea, leading to “unclassified” (because reliable classification cannot be achieved at even the basal level).
Since your query sequences (and possibly also reference sequences) are mixed orientations and you are searching in both directions, you could be picking up hits in both directions (only one of which would be valid!). I recommend increasing --p-perc-identity a little bit to see what happens, also test out the max-hits and top-hits-only settings to see how they impact your results. If you suddenly get much better results (without setting max-hits or max-accepts to 1!!!) then you’ve struck gold.
Could you clarify? Are you using extract-reads on the reference sequences? If so, don’t use trunc-len on those reads (since your query seqs are in mixed orientations they can hit at either end of that read)
Yes that works too, but based on @thermokarst’s notes, it looks like the biom tables you have are already collapsed on taxonomy so may not be enormously helpful.
Let’s circle back to this — I think the prospect of building a community tutorial (especially one written by different research groups who are bringing different perspectives to the table) is really exciting.
Good luck to you both! Let us know if you have any more questions.
Thanks so much for the input, I will try to vsearch parameters and report back
As you guessed, currently we are pulling out segments from the greengenes “99_otus” artifact we import into QIIME2 corresponding to our desired V regions using “qiime feature-classifier extract-reads”. Would you mind expanding on your statement:
Do you mean since we have mixed orientations there is a concern that the greengenes sequences needed to classify taxa using vsearch may be missed if we truncate?
Should be not using the trun-len at all for the extract reads command ?
Example code below (X’s would be the primer sequences)
(lowercase "g"s == primer sites)
(lowercase "c"s == part of amplicon sequence that corresponds to region that was truncated from query by dada2)
(bold == part of amplicon that appears in truncated forward reads/reference sequences)
Using extract-reads on that reference would yield an exact match (or in real life just match the same region/conditions so enable exact or similar matches between query and ref):
truncated query: ACGTACGTACGTACGTACGTACGTACGT
truncated refseq: ACGTACGTACGTACGTACGTACGTACGT
However, when orientations are mixed, you don’t know which end of the amplicon your query sequences are on… you will have a mixture of:
and its reverse complement:
Which truncated (on 3’ end of each read) will yield:
So you are covering different parts of the complete amplicon. Hence, the reference sequences should be untruncated to cover the full amplicon so that you can hit any part.
keep both, but adjust to expected ranges (this could also cause unassignments if the ranges are not being set correctly). Check the lit for what the expected amplicon ranges are — setting broad limits probably does not hurt, these are really just used as safeguards, since occasionally some primer sets and reference database combinations can yield some spurious hits that cause issues during classification… unusually short or large amplicons are a good indicator of spurious hits.
What region was good and what region was bad? Sometimes it’s not luck, sometimes it does depend on region, primer, reference db, etc… e.g., I’ve seen issues with V1 primers on some databases before because some of the reference sequences might not have the correct forward primer included in the sequence. The default settings were designed based on benchmarks of different 16S domains as well as ITS… sort of general “catch all” settings… but for unusual amplicons (maybe?) or for other marker genes (not you, since you have 16S but just saying) these settings may need to be tweaked. Having a mock community to re-optimize for novel primer sets is ideal!
Hello to all,
I tried your pipeline on my data and it works …
first of all thanks to this initiative several times over the years I have asked for it to be created without success so it really seems to me a great thing!
I have two questions for you:
Why do you use sepp for the phylogenetic tree? can’t use mafft?
There is a way to solve the problem in the alpha diversity? (Faith’s PD is not always enough)