Summarizing "misses" table help


(Mary DuPre) #1

Hi all,

I need some help summarizing my “misses” OTU table after filtering out sequences that matched the UNITE database at less than 80%. It was recommended that I check the total number of sequences that were filtered out of my samples. Below is the code that generated the “misses” file, as well as a screenshot that includes the summarized format. How can I summarize this table in qiime2 to visualize the total number of sequences filtered out?

Thank you,
Mary Ellyn

qiime quality-control exclude-seqs \
  --i-query-sequences table.qza \
  --i-reference-sequences sh_refs_qiime_ver7_dynamic_01.12.2017_NO_CONTAMS.qza \
  --p-method blast \
  --p-perc-identity 0.8 \
  --p-perc-query-aligned 0.8 \
  --o-sequence-hits hits-8-8.qza \
  --o-sequence-misses misses-8-8.qza


(Nicholas Bokulich) #2

You have two options:

  1. Export the filtered and unfiltered sequences and count the number of lines in each.
  2. Use qiime feature-table filter-features --i-table table.qza --m-metadata-file hits-8-8.qza to filter your feature table so that it only contains “hits”. Then use qiime feature-table summarize to summarize the filtered and unfiltered tables. That visualization will contain a count of total features, which you can use to compare these tables.
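
For option 1, a minimal sketch of the counting step, assuming the hits and misses artifacts export to FASTA files. The function name count_seqs and the export directory names are made up for illustration. It counts header lines rather than all lines, since each FASTA record starts with a ">" header. The export flags shown in the comments are the ones used by recent QIIME 2 releases and may differ in older versions.

```shell
# Count sequences in a FASTA file: every record begins with a ">" header line.
# count_seqs is a hypothetical helper name, not part of QIIME 2.
count_seqs() {
    grep -c '^>' "$1"
}

# Assumed usage once each artifact has been exported, for example:
#   qiime tools export --input-path hits-8-8.qza --output-path hits-export
#   qiime tools export --input-path misses-8-8.qza --output-path misses-export
#   count_seqs hits-export/dna-sequences.fasta
#   count_seqs misses-export/dna-sequences.fasta
```

Note that this counts unique feature sequences (hits versus misses), not the per-sample sequence counts stored in the feature table.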

I hope that helps!


(Mary DuPre) #3

Maybe I’m just confused by the hits/misses output, so if you could explain this process to me that would be very helpful.

I start out with my table.qza file where (when visualized) each sample has a sequence count of at least 2,000, going up to ~15,000. When I run the code I posted above, it tabulates the sequences into features (total hits and misses ~3500), which is not what I’m interested in. I want to see the whole dataset and the number of sequences that have been filtered out, in order to better understand the quality of my filtering process.

Let me know if you need anything from me to better understand my confusion. Thanks!


(Nicholas Bokulich) #4

Perhaps the confusion is on my end. exclude-seqs is going to just split your sequences into those that hit the reference sequences, and those that miss. Using filter-features will then filter the misses from your feature table so that it only contains hits. Running summarize on both of those tables will tabulate the total number of sequences in each table — so you can compare how filtering impacts sequence depth, and the difference between the two will indicate the number of sequences that have been removed. That sounds like the comparison that you want to make — if it is not, please share these summary files and maybe describe in more detail how this differs from your goals.
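
The comparison described above could be sketched roughly as follows. The output filenames (hits-table.qza and the two .qzv files) and the function names are assumptions chosen for illustration, and the totals passed to the arithmetic helpers are made-up numbers. In practice you would read the "total frequency" value from each summary visualization and plug those in.

```shell
# Filter the table down to "hits" and summarize both tables.
# Defined as a function so nothing runs until you call it; filenames are hypothetical.
compare_tables() {
    qiime feature-table filter-features \
        --i-table table.qza \
        --m-metadata-file hits-8-8.qza \
        --o-filtered-table hits-table.qza
    qiime feature-table summarize \
        --i-table table.qza \
        --o-visualization table-summary.qzv
    qiime feature-table summarize \
        --i-table hits-table.qza \
        --o-visualization hits-table-summary.qzv
}

# Sequences removed = total before filtering minus total after filtering.
seqs_removed() {
    echo $(( $1 - $2 ))
}

# Percentage of sequences removed, to one decimal place.
pct_removed() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

# Made-up totals, for illustration only:
seqs_removed 100000 92500   # prints 7500
pct_removed 100000 92500    # prints 7.5
```

The same subtraction can be done per sample using the per-sample counts listed in each summary.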