Where am I going wrong - filter-features output empty (Greengenes2)

I'm using QIIME 2 installed with conda, and I need some help troubleshooting where I could be going wrong. When I try to use the output of the filter-features command, I get an error indicating that the file is empty. For now I'm simply not using the filter-features output.

So my questions are:

  • any ideas where I'm going wrong?
  • it seems I don't strictly need to filter features - so am I OK to ignore this and carry on, or does it mean something else is wrong that is affecting the results?

QIIME 2 version: 2023.9.2

My data is 16S V4 sequences, split between forward and reverse read FASTQ files.

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest.csv --output-path paired-end-demux.qza --input-format PairedEndFastqManifestPhred33V2

qiime dada2 denoise-paired --verbose \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 145 --p-trunc-len-r 145 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

qiime greengenes2 filter-features \
  --i-feature-table table.qza \
  --i-reference ../../references/2022.10.taxonomy.asv.nwk.qza \
  --o-filtered-feature-table table.filtered.qza

qiime feature-classifier classify-sklearn \
  --i-classifier ../../references/gg_2022_10_backbone.v4.nb.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

I should be using the filtered table here, BUT it comes back empty when doing taxa collapse, so I am using the unfiltered table:

qiime taxa collapse \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 7 \
  --o-collapsed-table species-table.qza

Hello!
I am not sure that I am right, but it looks like you did not remove the primers from the sequences with cutadapt after importing and before DADA2. Could you confirm that?
Since ASV IDs are hashes of the sequences themselves, your IDs may differ from the IDs in the Greengenes database if your sequences still contain primers.
Could you try removing the primers with cutadapt before DADA2?
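For instance (a minimal sketch with a made-up fragment; QIIME 2 derives ASV feature IDs as md5 digests of the sequence, and the primer shown is the 515F sequence that appears later in this thread):

```python
import hashlib

seq = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGG"  # hypothetical ASV fragment
primer = "GTGCCAGCMGCCGCGGTAA"               # 515F primer

# The feature ID is the md5 digest of the sequence itself, so any leftover
# primer bases produce a completely different ID.
id_clean = hashlib.md5(seq.encode()).hexdigest()
id_with_primer = hashlib.md5((primer + seq).encode()).hexdigest()
print(id_clean == id_with_primer)  # False: the ID no longer matches the database
```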

Best,

Hi @rosew,

In the 2022.10 tree, we primarily placed 90, 100, and 150nt fwd reads relative to the EMP 16S V4 rRNA sequencing protocol. Reads not of those lengths, and not relative to 515F, are unlikely to be represented. We did not place any stitched reads either; if a phylogenetic taxonomy is desired, I recommend using only the fwd reads. Alternatively, reads can be recruited closed-reference using the non-v4-16s action. If you are just interested in taxonomy, the pre-trained naive Bayes model can be used (available under Data Resources), although we observed lower correlation to paired shotgun samples relative to the phylogenetic taxonomy (https://www.nature.com/articles/s41587-023-01845-1/figures/2).

It is necessary to filter or map the reads if the intention is to use either the phylogeny or taxonomy.
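As a quick sanity check of the length constraint above (a hypothetical diagnostic in plain Python, not part of the greengenes2 plugin), you can see why a table of 145nt ASVs would come back empty:

```python
# Fwd-read fragment lengths placed in the 2022.10 tree, per the note above
PLACED_LENGTHS = {90, 100, 150}

def fraction_placeable(asv_seqs):
    """Fraction of ASVs whose length matches a placed fragment length."""
    if not asv_seqs:
        return 0.0
    return sum(len(s) in PLACED_LENGTHS for s in asv_seqs) / len(asv_seqs)

# 145nt ASVs cannot intersect the tree, so filtering drops everything
print(fraction_placeable(["A" * 145, "C" * 145]))  # 0.0
print(fraction_placeable(["G" * 150]))             # 1.0
```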

Best,
Daniel

Hi Timur,
thanks so much for the suggestion! I didn't need to remove the primers, as the lab told me the FASTQ files they provided are demultiplexed and do not contain primer sequences.

That is exactly what I hear from sequencing centers before I find and remove primers...
If the problem is not yet solved, I would try searching for the primers with cutadapt and discarding any sequences without primers (via cutadapt's settings). If the output file is very small, that means the primers were indeed already removed. If the output file is big enough and contains most of the reads, that means the primers were still in the sequences.
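A rough way to eyeball the same thing without cutadapt (a hypothetical sketch, not a substitute for it; the primer is the 515F sequence used elsewhere in this thread, with the IUPAC code M expanded to [AC]):

```python
import gzip
import re

# 515F primer GTGCCAGCMGCCGCGGTAA; M is the IUPAC code for A or C
PRIMER_RE = re.compile("GTGCCAGC[AC]GCCGCGGTAA")

def fraction_with_primer(fastq_gz_path, limit=10000):
    """Fraction of the first `limit` reads whose 5' end still carries the primer."""
    hits = total = 0
    with gzip.open(fastq_gz_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 != 1:          # sequence lines are every 4th line, offset 1
                continue
            total += 1
            if PRIMER_RE.search(line[:30]):  # primer should sit near the start
                hits += 1
            if total >= limit:
                break
    return hits / total if total else 0.0
```

A result close to 1.0 means the primers are still on the reads; close to 0.0 matches what the lab said.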

Best,

hi Timur!
thanks for the follow-up. I ran it, and it didn't find much to remove, but the filter-features output was still empty as before. Below is the verbose output. I ran it on the forward reads only, as I also started experimenting with Daniel's suggestion.

This is cutadapt 4.5 with Python 3.8.15
Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /data/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uvhs5hb4/BS12-MTV84_0_L001_R1_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA /data/tmp/qiime2/ec2-user/data/b9b59679-edf7-4e1d-85e4-bb6abe24eba7/data/BS12-MTV84_0_L001_R1_001.fastq.gz
Processing single-end reads on 1 core ...
Finished in 2.249 s (21.939 µs/read; 2.73 M reads/minute).

=== Summary ===

Total reads processed: 102,499
Reads with adapters: 55 (0.1%)

== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads written (passing filters): 102,499 (100.0%)

Total basepairs processed: 15,477,349 bp
Quality-trimmed: 0 bp (0.0%)
Total written (filtered): 15,477,180 bp (100.0%)

=== Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 55 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1

Overview of removed sequences
length count expect max.err error counts
3 53 1601.5 0 53
5 2 100.1 0 2
Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: cutadapt --cores 1 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /data/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uvhs5hb4/BS12-MTV84_0_L001_R1_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA /data/tmp/qiime2/ec2-user/data/b9b59679-edf7-4e1d-85e4-bb6abe24eba7/data/BS12-MTV84_0_L001_R1_001.fastq.gz

Saved SampleData[SequencesWithQuality] to: trimmed-forward.qza

hi Daniel,
thanks so much for your help.
It looks like it would be better to do the phylogenetic taxonomy, although at this point I am assigning just the taxonomy. I will look into the phylogenetic taxonomy next.
I tried using just the forward reads as you suggested, but it didn't make a difference and filter-features still produces an empty output.
I experimented with a different --p-trunc-len of 150 with DADA2 as well, but it didn't help.

I don't suppose you have any other ideas?

Hi @rosew,

Could you share some of the generated ASVs?

The V4 representation in Greengenes2 2022.10 spans over 300,000 samples, representing over 20,000,000 ASVs, and is inclusive of the EMP / EMP500 sets among many other environmental sample sets. It would be unexpected for the data to not have reasonable representation, and concerning that there is no representation. I suspect there is a processing detail that we haven't isolated yet, and which we (i.e., the Greengenes2 team) need to communicate better.

Best,
Daniel

hi Daniel,
thanks for helping further! I'm completely new to this, so a lot could be going wrong that I'm missing.

Here are the first few ASVs from the exported FASTA that I generated from rep-seqs.qza from DADA2.

>1843d8cb44910a7496c7fab6496d6d87
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATT
>c26282d21a58755f7574d41467a9e9ba
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTGGAGTCTTGTAGAGGGGGGTAGAATT
>f9d4dbf7cb87b624240737f71c24c2ec
AACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATT

Hi @rosew,

Great, thank you! Those look to be 145nt long:

>>> len("TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATT")
145

I'm not familiar with cutadapt, but it looks like a few nt on the 3' end are being trimmed.

Best,
Daniel

I was using DADA2 to truncate to 145nt. I've now set it to 151, hoping it would make a difference, since per the fastqc output the quality seemed OK up to 151. I only tried cutadapt after it was suggested, but that didn't introduce the issue I have; the output below is now without cutadapt, and it's still empty. I used 145 as that's what I understood my lab was recommending, but I might be misunderstanding.

This is what it looks like now that I've set it to 151 - is this looking better? I had previously tried with 150 as well, as you mentioned having ASVs of length 150.

>3929d29a31d4b66f7fdc71b92a3d41d0
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGT
>b27049b35bbb279f683ee183ad0e3243
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTGGAGTCTTGTAGAGGGGGGTAGAATTCCAGGT
>7f7ad2aa1b8e47800d098beabb1ab112
AACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGGT

Hi @rosew,

We didn't place 151nt fragments, so they won't intersect with the tree. Would it be possible to share the 150nt-length ASVs?

Best,
Daniel

Hi @wasade,
sorry! Here are some examples truncated to 150.

>c18afe570abfe82d2f746ecc6e291bab
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGG
>ece1af985b63ebccd2833e9b5f0432e3
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTGGAGTCTTGTAGAGGGGGGTAGAATTCCAGG
>b0645cbec7653c887cb868a248d95874
AACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGG
>e91f1c5b67ffecb5ca3ca3b393a50f00
TACGTAGGGGGCAAGCGTTGTCCGGAATAATTGGGCGTAAAGGGCGCGTAGGCGGCTCGGTAAGTCTGGAGTGAAAGTCCTGCTTTTAAGGTGGGAATTGCTTTGGATACTGTCGGGCTTGAGTGCAGGAGAGGTTAGTGGAATTCCCAG
>010f0ac2691bc0be12d0633d4b5d2cc4
AACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAGAACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGG

I was able to find a matching taxonomy when I searched here: Greengenes development server

Great, thanks! Does filter-features work now?

Hi @wasade,
no, it's the same. I had previously tried it with 150 as well.

Ah, I think I see what's going on. These sequences use md5 hashes as their identifiers. Please use the 2022.10.taxonomy.asv.md5.qza file.
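You can confirm the identifier scheme on your own rep-seqs (a hypothetical helper; export the FASTA from rep-seqs.qza with qiime tools export first):

```python
import hashlib

def ids_are_md5_of_seqs(id_to_seq):
    """True when every feature ID equals the md5 digest of its own sequence."""
    return all(hashlib.md5(seq.encode()).hexdigest() == fid
               for fid, seq in id_to_seq.items())

# Example with a made-up sequence: the key is md5("ACGTACGT")
demo = {hashlib.md5(b"ACGTACGT").hexdigest(): "ACGTACGT"}
print(ids_are_md5_of_seqs(demo))  # True
```

If it prints True for your data, the md5-keyed reference file is the one to pair with it.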

Best,
Daniel

@wasade, thank you! Where do I download that from? I don't see it here:
https://ftp.microbio.me/greengenes_release/2022.10/

Sorry, typo! This file: http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.md5.nwk.qza

Best,
Daniel

@wasade I actually tried that one yesterday already, as I thought it was the most likely match, but it also didn't make a difference.

Something is odd, as you were able to search the database, and the content of the website is a reflection of the files on the FTP. Could you share the 150nt FeatureData[Sequence] artifact?

Best,
Daniel