Removing non-target DNA from representative sequences

With many of our data sets, we often amplify non-target DNA, e.g., arthropods, nematodes, and host plants. In QIIME1, the best solution we found to this was exclude_seqs_by_blast.py, with an e-value of 1e-05-1e-10, which would remove most non-target DNA before assigning taxonomy, and also give an output of those sequences. Is there a similar function or work-around you can suggest in QIIME2?

Hi @Lorinda,
At the moment we don’t have similar functionality in QIIME 2, though we plan to have this available in the 2017.8 release (due out at the end of August). Sorry for the inconvenience - we expect that that release will fill in a lot of the missing microbiome analysis functionality relative to QIIME 1.

In the meantime, I think the best way to achieve this would be to export your sequences, and then run exclude_seqs_by_blast.py with QIIME 1. It’s straight-forward to have QIIME 1 and QIIME 2 installed side-by-side on the same computer, since they are both installable with conda. See the QIIME 1 conda install instructions if you don’t already have it installed.

The way I would do this is as follows:

  1. Export your FeatureData[Sequences] artifact as a fasta file. See the exporting tutorial, but this will roughly be: qiime tools export rep-set.qza --output-dir rep-set (assuming the file you want to export is called rep-set.qza).

  2. Activate your QIIME 1 environment, and run exclude_seqs_by_blast.py on the fasta file that is generated in Step 1. This will create a few different files - the one we care about here (I think) is matching.fna, which is all of the sequences that you want to retain in your downstream analysis. (If you instead want to discard the sequences that match your reference database, use non-matching.fna for the next step.)

  3. Convert matching.fna to a list of feature identifiers. You can do this as follows:
    grep '^>' matching.fna | cut -c 2- > seqs-to-keep.txt

  4. Activate your QIIME 2 environment, and run qiime feature-table filter-features --m-metadata-file seqs-to-keep.txt ..., providing your FeatureTable[Frequency] artifact as input. This will remove any features (ie, sequences) that did not match your BLAST database from the table, so all downstream analyses that use this table, including taxonomy visualizations, will not include those features.

Want to give that a try and let me know if you have any questions? Thanks for your patience as we get all of this functionality implemented in QIIME 2!

1 Like

@gregcaporaso Thank you, this seems easy enough. The other concern I have is with BLAST. When running the exclude_seqs_by_blast.py, sequences are incorrectly sorted to the non-matching.fna file. I’m looking at AMF in soils right now and using the MAARJAM database. When I manually BLAST these sequences at the MAARJAM website, some of the non-matching.fna sequences match 100% with 100% coverage to fungi in that database so it doesn’t make sense that they were sorted out using the exclude_seqs_by_blast.py script (e = 1e-05). I am wondering if you can explain why that happens and if perhaps there is a way to improve upon this function in QIIME2? We are running into some real concerns regarding non-target DNA amplified from common universal AMF primers.

Hi @Lorinda,
In the output from exclude_seqs_by_blast.py there should be a file called raw_blast_results.txt that you can review to get information on the results of BLAST search. I think there must a difference in how BLAST is being run in these two cases (in QIIME 1 and on the MAARJAM website), which is causing a difference in the scores, but I agree that the specific case that you’re pointing out sounds suspicious. If you notice anything that looks wrong in raw_blast_results.txt, please follow up here so we can be sure that it’s addressed (and fixed if necessary for the QIIME 2 implementation of this functionality).

Hope this helps!

@gregcaporaso
Looking at raw blast results, everything passes the e-score score threshold, so I'm assuming the good matches are filtered out based on the percent alignment. I ran it again to be sure, and sequences that match to my database (>99% coverage and identity) are still filtered out. I have tried to figure out why, but guess I'm not sure how 'percent_aligned' is defined in this script.

Here is the QIIME raw blast output for one of the sequences filtered out using default thresholds:

And here is the MAARJAM alignment for the same sequence:

Score = 450.596 bits (234), Expect = 6.4e-127
Identities = 238/240 (99.17%)
Strand = +/+

Query: 1 CCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGATCGAATTGTT 66
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 23 CCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGATCGAATTGTT 88

Query: 67 GGTCGTGCTTCGGTACGCACTGGCATTATCGGTTCCTACCTTCTGAAGAACCGCGATGTCATTAAT 132
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 89 GGTCGTGCTTCGGTACGCACTGGCATTATCGGTTCCTACCTTCTGAAGAACCGCGATGTCATTAAT 154

Query: 133 TTGGTGTCGTGGGGAATCAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCgcACGCT 198
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||
Sbjct: 155 TTGGTGTCGTGGGGAATCAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCttACGCT 220

Query: 199 TGAATACATTAGCATGGAATAATGAAATAGGACGTTTGATTC 240
||||||||||||||||||||||||||||||||||||||||||
Sbjct: 221 TGAATACATTAGCATGGAATAATGAAATAGGACGTTTGATTC 262

Sorry, here is the best QIIME raw_blast output for that sequence, as sorted by e-value:

Hi @Lorinda,
That script filters based on E-value and percent identity of the best match, where the default percent identity is 97%. In this case, reducing that value with the --percent_aligned parameter should retain the sequence (and setting that value to 0.0 would disable the percent alignment filter and only filter based on E-value).

Hope this helps!

@gregcaporaso I understand that reducing the threshold would allow me to retain more of the sequences that I want, what I don’t understand is how the percent alignment is calculated and why sequences that match 99%+ to my database when blasted manually are filtered out using this script. Could it be that sequences are penalized here if they are shorter than the reference sequence?

@Lorinda, are you seeing a different percent identity for that subject/query sequence pair when BLASTing with MAARJAM?

@gregcaporaso, I am. In MAARJAM I get 99.17% identity, but in QIIME the best match is shown at 96.6%. Looking at the alignment from MAARJAM above, only 2/240 base pairs were mismatches (all of my seqs were trimmed to 240 bp).

I know it is not a big difference, I’m just curious what is driving it, since it seems so obvious that this sequence should be retained. I have manually blasted about 30 of the ‘non-matching’ sequences, and this doesn’t happen often, maybe for ~5% of the sequences.

Ok, thanks. I’m not sure exactly what is causing the difference, but I suspect that it’s differences in how the BLAST searches are run.

MAARJAM is using BLAST+, while QIIME 1 is using legacy BLAST (though QIIME 2 will use BLAST+ when we implement this functionality). That may or may not cause a difference.

exclude_seqs_by_blast.py is using blastn, with the default gapopen and gapextend settings (5 and 2, respectively), and with a word_size of 28 (which is not default). If you try calling exclude_seqs_by_blast.py with --word_size 11, does that change your results? Do you know how these parameter values compare to the parameters being using by MAARJAM?

@gregcaporaso Thanks for the quick reply. :slight_smile:
Gap extend and gap open are both set at 1 in MAARJAM, but I’m not sure that applies here since the differences are mismatches and not gaps? I will try reducing the word size and see if that improves things. Thanks for you help.

The latest release of QIIME 2 (2017.10) includes a new plugin, q2-quality-control, with similar support as QIIME 1’s exclude_seqs_by_blast.py script. The relevant QIIME 2 command is qiime quality-control exclude-seqs, check it out in the q2-quality-control tutorial! :tada:

2 Likes