Where am I going wrong - filter features output empty (greengenes2)

I'm using QIIME 2 installed with conda. I need some help troubleshooting where I could be going wrong: when I try to use the output of the filter-features command, I get an error indicating that the file is empty. For now I'm simply not using the filter-features output.

So my questions are:

  • any ideas where I'm going wrong?
  • it seems I don't strictly need to filter features - so am I OK to ignore this and carry on, or does it mean I might be doing something else wrong that is affecting the results?

QIIME 2 version: 2023.9.2

My data is 16S V4 sequences, split across forward and reverse read FASTQ files.

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest.csv --output-path paired-end-demux.qza --input-format PairedEndFastqManifestPhred33V2

qiime dada2 denoise-paired
--verbose
--i-demultiplexed-seqs paired-end-demux.qza
--p-trunc-len-f 145
--p-trunc-len-r 145
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats denoising-stats.qza

qiime greengenes2 filter-features
--i-feature-table table.qza
--i-reference ../../references/2022.10.taxonomy.asv.nwk.qza
--o-filtered-feature-table table.filtered.qza

qiime feature-classifier classify-sklearn
--i-classifier ../../references/gg_2022_10_backbone.v4.nb.qza
--i-reads rep-seqs.qza
--o-classification taxonomy.qza

I should be using the filtered table here, BUT it comes back empty when doing taxa collapse, so I am using the unfiltered table:

qiime taxa collapse
--i-table table.qza
--i-taxonomy taxonomy.qza
--p-level 7
--o-collapsed-table species-table.qza

I am not sure that I am right, but it looks like you did not remove primers from the sequences with cutadapt after importing and before DADA2. Could you confirm that?
Since ASV IDs are hashes of the sequences themselves, your IDs may differ from the IDs in the Greengenes2 database if your sequences still contain primers.
Could you try to remove primers with cutadapt before DADA2?
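For example, a primer-removal step before DADA2 could be sketched as follows. This is only a sketch: the 515F/806R primer sequences shown are an assumption about the V4 protocol, so substitute the primers your lab actually used.

```shell
# Sketch: trim primers from both reads before DADA2
# (515F/806R sequences below are assumed - replace with your actual primers)
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --o-trimmed-sequences trimmed-demux.qza \
  --verbose
```

The trimmed artifact would then replace paired-end-demux.qza as the input to dada2 denoise-paired.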



Hi @rosew,

In the 2022.10 tree, we primarily placed 90, 100, and 150nt fwd reads relative to the EMP 16S V4 rRNA sequencing protocol. Reads not of those lengths, and not relative to 515F, are unlikely to be represented. We did not place any stitched reads either; if a phylogenetic taxonomy is desired, I recommend using only the fwd reads. Alternatively, reads can be recruited closed-reference using the non-v4-16s action. If you are just interested in taxonomy, the pre-trained naive Bayes model can be used (available under Data Resources), although we observed lower correlation to paired shotgun samples relative to the phylogenetic taxonomy (https://www.nature.com/articles/s41587-023-01845-1/figures/2).

It is necessary to filter or map the reads if the intention is to use either the phylogeny or taxonomy.
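As a sketch, the closed-reference route could look something like the following; the backbone artifact name is an assumption based on the Greengenes2 release files, so check the q2-greengenes2 help and Data Resources for the exact names.

```shell
# Sketch: closed-reference recruitment against the full-length backbone
# (backbone file name is an assumption - see the Greengenes2 Data Resources)
qiime greengenes2 non-v4-16s \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-backbone 2022.10.backbone.full-length.fna.qza \
  --o-mapped-table table.gg2.qza \
  --o-representatives rep-seqs.gg2.qza
```

The mapped table and representatives are then keyed to Greengenes2 identifiers, so downstream taxonomy and phylogeny lookups intersect with the reference.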



Hi Timur,
thanks so much for the suggestion! I didn't need to remove the primers as the lab told me the FASTQ files they provided are demultiplexed and do not contain primer sequences.


That is exactly what I hear from sequencing centers before I find and remove primers...
If the problem is not yet solved, I would try to search for primers with cutadapt and discard any sequences without primers (via cutadapt's settings). If the output file is very small, that means the primers were indeed already removed. If the output file is big enough and contains most of the reads, that means the primers were still in the sequences.
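As a concrete diagnostic on the forward reads, something like this could work (a sketch only; the 515F primer and the input artifact name are my assumptions):

```shell
# Sketch: keep ONLY reads in which the primer was found and trimmed;
# a near-empty output means the primers were already removed upstream.
qiime cutadapt trim-single \
  --i-demultiplexed-sequences forward-demux.qza \
  --p-front GTGYCAGCMGCCGCGGTAA \
  --p-discard-untrimmed \
  --o-trimmed-sequences primer-check.qza \
  --verbose
```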



hi Timur!
thanks for the follow up. I ran it and it didn't find much to remove, but the filter-features output was still empty like before. Below is the verbose output. I ran it on the forward reads only, as I also started experimenting with the suggestion from Daniel.

This is cutadapt 4.5 with Python 3.8.15
Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /data/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uvhs5hb4/BS12-MTV84_0_L001_R1_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA /data/tmp/qiime2/ec2-user/data/b9b59679-edf7-4e1d-85e4-bb6abe24eba7/data/BS12-MTV84_0_L001_R1_001.fastq.gz
Processing single-end reads on 1 core ...
Finished in 2.249 s (21.939 µs/read; 2.73 M reads/minute).

=== Summary ===

Total reads processed: 102,499
Reads with adapters: 55 (0.1%)

== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads written (passing filters): 102,499 (100.0%)

Total basepairs processed: 15,477,349 bp
Quality-trimmed: 0 bp (0.0%)
Total written (filtered): 15,477,180 bp (100.0%)

=== Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 55 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1

Overview of removed sequences
length count expect max.err error counts
3 53 1601.5 0 53
5 2 100.1 0 2
Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: cutadapt --cores 1 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /data/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uvhs5hb4/BS12-MTV84_0_L001_R1_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA /data/tmp/qiime2/ec2-user/data/b9b59679-edf7-4e1d-85e4-bb6abe24eba7/data/BS12-MTV84_0_L001_R1_001.fastq.gz

Saved SampleData[SequencesWithQuality] to: trimmed-forward.qza

hi Daniel,
thanks so much for your help.
It looks like it would be better to do the phylogenetic taxonomy, although at this point I am assigning just the taxonomy. I will look into doing the phylogenetic taxonomy next.
I tried using just the forward reads as you suggested, but it didn't make a difference and filter-features still results in an empty output.
I experimented with a different --p-trunc-len 150 with DADA2 as well, but it didn't help.
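For reference, the forward-reads-only run can be sketched as below (artifact names are illustrative, not the exact files used):

```shell
# Sketch: denoise the forward reads only (artifact names illustrative)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs forward-demux.qza \
  --p-trunc-len 150 \
  --o-table table.fwd.qza \
  --o-representative-sequences rep-seqs.fwd.qza \
  --o-denoising-stats denoising-stats.fwd.qza
```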

I don't suppose you have any other ideas?

Hi @rosew,

Could you share some of the generated ASVs?

The V4 representation in Greengenes2 2022.10 spans over 300,000 samples, representing over 20,000,000 ASVs, and is inclusive of the EMP / EMP500 sets among many other environmental sample sets. It would be unexpected for the data to not have reasonable representation, and concerning that there is no representation. I suspect there is a processing detail that we haven't isolated yet, and which we (i.e., the Greengenes2 team) need to better communicate.


hi Daniel,
thanks for helping further! I'm completely new to this so a lot could be going wrong that I'm missing.

Here's the first few ASVs from the exported fasta that I generated from rep-seqs.qza from DADA2.
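(The export itself was along the lines of the following sketch; the output directory name is illustrative.)

```shell
# Sketch: export the representative sequences to FASTA
qiime tools export \
  --input-path rep-seqs.qza \
  --output-path rep-seqs-export
# FeatureData[Sequence] exports as dna-sequences.fasta
head -n 6 rep-seqs-export/dna-sequences.fasta
```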



Hi @rosew,

Great, thank you! Those look to be 145nt long:


I'm not familiar with cutadapt, but it looks like a few nt on the 3' end are being trimmed.


I was using DADA2 to truncate to 145nt. I've now set it to 151, hoping it would make a difference, as per the fastqc output the quality seemed OK up to 151. I only tried cutadapt after it was suggested, but that didn't introduce the issue I have; the output below is now without cutadapt, and it's still empty. I used 145 as that's what I understood my lab was recommending, but I might be misunderstanding.

This is what it looks like now that I've set it to 151 - is this looking better? I had previously tried 150, as you mentioned having ASVs of 150nt length.


Hi @rosew,

We didn't place 151nt fragments, so they won't intersect with the tree. Would it be possible to share the 150nt ASVs?


Hi @wasade,
sorry! here's some examples truncated to 150.


I was able to find a matching taxonomy when I searched here: Greengenes development server


Great, thanks! Does filter-features work now?

Hi @wasade,
no, it's the same. I had previously tried it with 150 as well.

Ah, I think I see what's going on. These sequences use md5 hashes as their identifiers. Please use the 2022.10.taxonomy.asv.md5.qza file
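To illustrate the point (a minimal sketch using a made-up sequence, not one of your ASVs): feature IDs in DADA2/QIIME 2 are the md5 digest of the sequence text, so any difference in the sequence, such as a retained primer, yields a completely different ID:

```shell
# Feature IDs are the md5 of the sequence text (example sequence is made up)
seq="GTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCG"
printf '%s' "$seq" | md5sum | awk '{print $1}'
# The same sequence with a few extra leading bases hashes to a different ID
printf '%s' "AAC$seq" | md5sum | awk '{print $1}'
```

This is why a table keyed by md5 IDs must be filtered against an md5-keyed reference.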


@wasade, thank you! Where do I download that from? I don't see it here:

Sorry, typo! This file (http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.md5.nwk.qza)
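With that file, the filter step would be (a sketch mirroring your earlier command):

```shell
# Sketch: filter the table against the md5-keyed taxonomy tree
qiime greengenes2 filter-features \
  --i-feature-table table.qza \
  --i-reference 2022.10.taxonomy.md5.nwk.qza \
  --o-filtered-feature-table table.filtered.qza
```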


@wasade I actually tried that one yesterday already, as I thought it was the most likely match, but it also didn't make a difference.

Something is odd, as you were able to search the database, and the content of the website is a reflection of the files on the FTP. Could you share the 150nt FeatureData[Sequence] artifact?