With many of our data sets, we often amplify non-target DNA, e.g., arthropods, nematodes, and host plants. In QIIME1, the best solution we found to this was exclude_seqs_by_blast.py, with an e-value of 1e-05-1e-10, which would remove most non-target DNA before assigning taxonomy, and also give an output of those sequences. Is there a similar function or work-around you can suggest in QIIME2?
At the moment we don’t have similar functionality in QIIME 2, though we plan to have this available in the 2017.8 release (due out at the end of August). Sorry for the inconvenience - we expect that that release will fill in a lot of the missing microbiome analysis functionality relative to QIIME 1.
In the meantime, I think the best way to achieve this would be to export your sequences, and then run
exclude_seqs_by_blast.py with QIIME 1. It’s straight-forward to have QIIME 1 and QIIME 2 installed side-by-side on the same computer, since they are both installable with
conda. See the QIIME 1 conda install instructions if you don’t already have it installed.
The way I would do this is as follows:
FeatureData[Sequences]artifact as a fasta file. See the exporting tutorial, but this will roughly be:
qiime tools export rep-set.qza --output-dir rep-set(assuming the file you want to export is called
Activate your QIIME 1 environment, and run
exclude_seqs_by_blast.pyon the fasta file that is generated in Step 1. This will create a few different files - the one we care about here (I think) is
matching.fna, which is all of the sequences that you want to retain in your downstream analysis. (If you instead want to discard the sequences that match your reference database, use
non-matching.fnafor the next step.)
matching.fnato a list of feature identifiers. You can do this as follows:
grep '^>' matching.fna | cut -c 2- > seqs-to-keep.txt
Activate your QIIME 2 environment, and run
qiime feature-table filter-features --m-metadata-file seqs-to-keep.txt ..., providing your
FeatureTable[Frequency]artifact as input. This will remove any features (ie, sequences) that did not match your BLAST database from the table, so all downstream analyses that use this table, including taxonomy visualizations, will not include those features.
Want to give that a try and let me know if you have any questions? Thanks for your patience as we get all of this functionality implemented in QIIME 2!
@gregcaporaso Thank you, this seems easy enough. The other concern I have is with BLAST. When running the exclude_seqs_by_blast.py, sequences are incorrectly sorted to the non-matching.fna file. I’m looking at AMF in soils right now and using the MAARJAM database. When I manually BLAST these sequences at the MAARJAM website, some of the non-matching.fna sequences match 100% with 100% coverage to fungi in that database so it doesn’t make sense that they were sorted out using the exclude_seqs_by_blast.py script (e = 1e-05). I am wondering if you can explain why that happens and if perhaps there is a way to improve upon this function in QIIME2? We are running into some real concerns regarding non-target DNA amplified from common universal AMF primers.
In the output from
exclude_seqs_by_blast.py there should be a file called
raw_blast_results.txt that you can review to get information on the results of BLAST search. I think there must a difference in how BLAST is being run in these two cases (in QIIME 1 and on the MAARJAM website), which is causing a difference in the scores, but I agree that the specific case that you’re pointing out sounds suspicious. If you notice anything that looks wrong in
raw_blast_results.txt, please follow up here so we can be sure that it’s addressed (and fixed if necessary for the QIIME 2 implementation of this functionality).
Hope this helps!
Looking at raw blast results, everything passes the e-score score threshold, so I'm assuming the good matches are filtered out based on the percent alignment. I ran it again to be sure, and sequences that match to my database (>99% coverage and identity) are still filtered out. I have tried to figure out why, but guess I'm not sure how 'percent_aligned' is defined in this script.
Here is the QIIME raw blast output for one of the sequences filtered out using default thresholds:
And here is the MAARJAM alignment for the same sequence:
Score = 450.596 bits (234), Expect = 6.4e-127
Identities = 238/240 (99.17%)
Strand = +/+
Query: 1 CCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGATCGAATTGTT 66
Sbjct: 23 CCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGATCGAATTGTT 88
Query: 67 GGTCGTGCTTCGGTACGCACTGGCATTATCGGTTCCTACCTTCTGAAGAACCGCGATGTCATTAAT 132
Sbjct: 89 GGTCGTGCTTCGGTACGCACTGGCATTATCGGTTCCTACCTTCTGAAGAACCGCGATGTCATTAAT 154
Query: 133 TTGGTGTCGTGGGGAATCAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCgcACGCT 198
Sbjct: 155 TTGGTGTCGTGGGGAATCAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCttACGCT 220
Query: 199 TGAATACATTAGCATGGAATAATGAAATAGGACGTTTGATTC 240
Sbjct: 221 TGAATACATTAGCATGGAATAATGAAATAGGACGTTTGATTC 262
Sorry, here is the best QIIME raw_blast output for that sequence, as sorted by e-value:
That script filters based on E-value and percent identity of the best match, where the default percent identity is 97%. In this case, reducing that value with the
--percent_aligned parameter should retain the sequence (and setting that value to 0.0 would disable the percent alignment filter and only filter based on E-value).
Hope this helps!
@gregcaporaso I understand that reducing the threshold would allow me to retain more of the sequences that I want, what I don’t understand is how the percent alignment is calculated and why sequences that match 99%+ to my database when blasted manually are filtered out using this script. Could it be that sequences are penalized here if they are shorter than the reference sequence?
@Lorinda, are you seeing a different percent identity for that subject/query sequence pair when BLASTing with MAARJAM?
@gregcaporaso, I am. In MAARJAM I get 99.17% identity, but in QIIME the best match is shown at 96.6%. Looking at the alignment from MAARJAM above, only 2/240 base pairs were mismatches (all of my seqs were trimmed to 240 bp).
I know it is not a big difference, I’m just curious what is driving it, since it seems so obvious that this sequence should be retained. I have manually blasted about 30 of the ‘non-matching’ sequences, and this doesn’t happen often, maybe for ~5% of the sequences.
Ok, thanks. I’m not sure exactly what is causing the difference, but I suspect that it’s differences in how the BLAST searches are run.
MAARJAM is using BLAST+, while QIIME 1 is using legacy BLAST (though QIIME 2 will use BLAST+ when we implement this functionality). That may or may not cause a difference.
exclude_seqs_by_blast.py is using
blastn, with the default
gapextend settings (5 and 2, respectively), and with a
word_size of 28 (which is not default). If you try calling
--word_size 11, does that change your results? Do you know how these parameter values compare to the parameters being using by MAARJAM?
@gregcaporaso Thanks for the quick reply.
Gap extend and gap open are both set at 1 in MAARJAM, but I’m not sure that applies here since the differences are mismatches and not gaps? I will try reducing the word size and see if that improves things. Thanks for you help.
The latest release of QIIME 2 (2017.10) includes a new plugin,
q2-quality-control, with similar support as QIIME 1’s
exclude_seqs_by_blast.py script. The relevant QIIME 2 command is
qiime quality-control exclude-seqs, check it out in the q2-quality-control tutorial!