Missing Features from merged dada2 Output

Micro_Biologist · November 28, 2017, 3:24pm

Hello all,

Due to the limited compute power of my computer, I am forced to run my samples through DADA2 independently. I am then merging the outputs using feature-table merge and feature-table merge-seq-data.

However, when I then try to run them through vsearch OTU picking (de novo at 97%), it seems that there are more features in the sequence data than in the feature table. I have checked that all my samples are present in the feature table (they are), and I have also counted the features: around 20 that are present in the sequence data seem to be missing from the feature table.

The only thing I can think of is that I am merging my data wrong. Here is my feature table merge code:

qiime feature-table merge \
  --i-table1 table-1.qza \
  --i-table2 table-2.qza \
  --p-overlap-method sum \
  --o-merged-table 63table.qza

qiime feature-table merge \
  --i-table1 63table.qza \
  --i-table2 table-3.qza \
  --p-overlap-method sum \
  --o-merged-table 63table.qza

(the code continues in the same way for all 93 samples)
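
Typing out a long chain of merges by hand makes it easy to reuse or mistype a file name. As a minimal sketch, assuming the per-sample tables follow the table-N.qza naming used above, the chain could be generated rather than typed. `build_merge_commands` is a hypothetical helper, not part of QIIME 2; it only builds command strings from the flags shown in this post so they can be reviewed before anything is run.

```python
def build_merge_commands(table_paths, merged_path="63table.qza"):
    """Return the chain of merge command lines that folds every table into merged_path.

    Hypothetical helper: it only formats strings, it does not call QIIME 2.
    """
    if len(table_paths) < 2:
        raise ValueError("need at least two tables to merge")
    if len(set(table_paths)) != len(table_paths):
        # A repeated file name means one sample would be merged twice
        # and another would probably be left out.
        raise ValueError("a table file is listed more than once")

    template = (
        "qiime feature-table merge "
        "--i-table1 {left} --i-table2 {right} "
        "--p-overlap-method sum --o-merged-table {out}"
    )
    commands = []
    left = table_paths[0]
    for right in table_paths[1:]:
        commands.append(template.format(left=left, right=right, out=merged_path))
        left = merged_path  # every later step merges into the running result
    return commands


# Assumed naming scheme: table-1.qza ... table-93.qza
tables = ["table-{}.qza".format(i) for i in range(1, 94)]
commands = build_merge_commands(tables)
```

Printing `commands` gives one line per merge step. The duplicate check is there because a repeated or mislabelled input file is hard to spot by eye in a list this long.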

Has this happened to anyone else? Am I just doing this wrong?

Thanks in advance, Jono

thermokarst · November 28, 2017, 5:43pm

Hi @Micro_Biologist! Sorry to hear things aren't going so well for you. Unfortunately, I don't think the strategy of running your samples through DADA2 one at a time will work as well as running them all at once: DADA2's error model assumes that samples are processed on a per-run basis (@benjjneb might have something different to say on this, so I will ping him for an assist, too). One option is to run this step on an AWS instance with lots of resources. Another, easier option is to specify the --p-n-threads parameter when running q2-dada2; if you set this to 0, it will use all available cores. Check out the docs for more details!

How so? Is there an error when you run this command? If so, please copy-and-paste the complete command, and the complete error, when run with the --verbose flag.

Thanks!

benjjneb · November 29, 2017, 1:38am

We don't recommend this as it is not as accurate, but it will probably work fine as long as individual samples are reasonably deep (>10k reads per sample).

I don't really understand what this means. It might be best to restate this a bit more precisely, perhaps by indicating the exact commands that get you from "sequence data" to "feature table", and how you are tallying up the number of "features ... in the sequence data" vs. the number of "features ... in the feature table".

One potential explanation is that if you have run 97% OTU clustering on the output of DADA2, it will of course result in fewer features, because sequence variants that are at least 97% similar will be lumped into one OTU.
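
To make the point about 97% clustering concrete, here is a toy sketch in Python. It is not how vsearch clusters (the real tool aligns sequences and takes abundances into account); `identity` and `greedy_cluster` are hypothetical helpers, and the three sequences are made up and assumed to be equal-length and already aligned.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Keep a sequence as a new centroid only if no existing centroid is close enough."""
    centroids = []
    for seq in seqs:
        if not any(identity(seq, c) >= threshold for c in centroids):
            centroids.append(seq)
    return centroids


base = "ACGT" * 25            # 100 nt made-up sequence variant
near = "TT" + base[2:]        # 2 mismatches against base
far = "T" * 10 + base[10:]    # 8 mismatches against base

variants = [base, near, far]  # three distinct sequence variants going in
otus = greedy_cluster(variants)
```

Here `near` falls within 97% of `base` and is absorbed into it, while `far` stays separate, so three input features come out as two OTUs.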

Micro_Biologist · November 29, 2017, 10:01am

@thermokarst Unfortunately running an instance isn't possible, but thankfully accuracy is (oddly) not my major concern, as this is just a proof of concept that we can perform this analysis, and we are waiting for the go-ahead to buy an adequate computer.

I just checked a few samples and they seem to be above 10k reads, which I guess is good, but as I said, accuracy isn't my priority yet; I am just doing a proof of concept.

I tallied the number of features in the table and seqs file by opening them manually and scrolling to the end and calculating the difference between the 2.
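
Comparing counts only shows that something is missing, not which features. A minimal sketch of a more direct check, assuming the feature IDs have already been pulled out of the two artifacts (for example from exported files): `fasta_ids` and `compare_feature_ids` are hypothetical helpers, and the FASTA text and table IDs below are made-up stand-ins, apart from the one ID taken from the error message further down.

```python
def fasta_ids(fasta_text):
    """Collect the ID (first word after '>') of every FASTA record."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def compare_feature_ids(table_ids, sequence_ids):
    """Report which feature IDs appear on only one side."""
    table_ids, sequence_ids = set(table_ids), set(sequence_ids)
    return {
        "only_in_sequences": sorted(sequence_ids - table_ids),
        "only_in_table": sorted(table_ids - sequence_ids),
    }


# Made-up example data; 'hypothetical-feature-a' is not a real feature ID.
example_fasta = (
    ">0858a0b2d108da823681823ea26812cc\nACGT\n"
    ">hypothetical-feature-a\nGGCC\n"
)
example_table_ids = ["hypothetical-feature-a"]

report = compare_feature_ids(example_table_ids, fasta_ids(example_fasta))
```

Any ID listed under `only_in_sequences` came from a rep-seqs file whose matching table never made it into the merged table, which narrows down which sample to look at.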

Sorry if I have not adequately explained my problem. I have imported and run each sample through DADA2 (which I know isn't ideal, but I don't have the choice at the moment, and it won't be the case when we start doing the analysis). I am then merging all of the tables together into a file called "63table.qza" using the command in my OP. I am also merging all the rep-seqs files from DADA2 into a file called "63-rep-seqs.qza" using the feature-table merge-seq-data command. I won't post it all as I am merging 93 samples, but here is a sample:

qiime feature-table merge-seq-data \
  --i-data1 rep-seqs-1.qza \
  --i-data2 rep-seqs-2.qza \
  --o-merged-data 63-rep-seqs.qza

qiime feature-table merge-seq-data \
  --i-data1 63-rep-seqs.qza \
  --i-data2 rep-seqs-3.qza \
  --o-merged-data 63-rep-seqs.qza

I am then trying to run this command:

qiime vsearch cluster-features-de-novo \
  --i-table 63table.qza \
  --i-sequences 63-rep-seqs.qza \
  --o-clustered-table 63table-97.qza \
  --o-clustered-sequences 63rep-seqs-97.qza \
  --p-perc-identity 0.97 \
  --verbose

But I get the following error:

Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime2-2017.10/lib/python3.5/site-packages/q2_vsearch/_cluster_features.py", line 68, in _fasta_with_sizes
    feature_size = sizes[feature_id]
KeyError: '0858a0b2d108da823681823ea26812cc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime2-2017.10/lib/python3.5/site-packages/q2cli/commands.py", line 218, in __call__
    results = action(**arguments)
  File "", line 2, in cluster_features_de_novo
  File "/home/qiime2/miniconda/envs/qiime2-2017.10/lib/python3.5/site-packages/qiime2/sdk/action.py", line 220, in bound_callable
    output_types, provenance)
  File "/home/qiime2/miniconda/envs/qiime2-2017.10/lib/python3.5/site-packages/qiime2/sdk/action.py", line 355, in callable_executor
    output_views = self._callable(**view_args)
  File "/home/qiime2/miniconda/envs/qiime2-2017.10/lib/python3.5/site-packages/q2_vsearch/_cluster_features.py", line 89, in cluster_features_de_novo
    _fasta_with_sizes(str(sequences), fasta_with_sizes.name, table)
  File "/home/qiime2/miniconda/envs/qiime2-2017.10/lib/python3.5/site-packages/q2_vsearch/_cluster_features.py", line 73, in _fasta_with_sizes
    % feature_id)
ValueError: Feature 0858a0b2d108da823681823ea26812cc is present in sequences, but not in table. The set of features in sequences must be identical to the set of features in table.

Plugin error from vsearch:

Feature 0858a0b2d108da823681823ea26812cc is present in sequences, but not in table. The set of features in sequences must be identical to the set of features in table.

See above for debug info.

I think this is happening because there are fewer features in the feature table than in the rep-seqs data.
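
The two-part traceback above (a KeyError followed by a ValueError) is what Python prints when one exception is raised while another is being handled. The sketch below is not the q2-vsearch source; it is a guess at the general shape of such a check, reconstructed only from the traceback. `fasta_with_sizes`, its arguments, and the example data are hypothetical.

```python
def fasta_with_sizes(sequences, sizes):
    """Annotate each sequence with its total count from the table.

    sequences: dict of feature ID -> sequence (stand-in for the rep-seqs)
    sizes:     dict of feature ID -> total count (stand-in for the table)
    """
    lines = []
    for feature_id, seq in sequences.items():
        try:
            feature_size = sizes[feature_id]
        except KeyError:
            # Raising inside the except block is what yields the
            # "During handling of the above exception..." chain.
            raise ValueError(
                "Feature %s is present in sequences, but not in table."
                % feature_id
            )
        # ';size=N' is the header annotation vsearch reads as an abundance.
        lines.append(">%s;size=%d" % (feature_id, feature_size))
        lines.append(seq)
    return "\n".join(lines)


# Made-up data: the table lacks the ID reported in the error message.
sizes = {"hypothetical-feature-a": 12}
sequences = {
    "hypothetical-feature-a": "ACGT",
    "0858a0b2d108da823681823ea26812cc": "GGCC",
}

try:
    fasta_with_sizes(sequences, sizes)
    message = None
except ValueError as err:
    message = str(err)
```

Read this way, the KeyError is the dictionary lookup failing on the missing ID, and the ValueError is the friendlier message wrapped around it; both point to the same missing feature.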

Again thank you for your help both of you, really appreciated!

Jono

EDIT: Well, I apologize to both of you for wasting your time. It seems I accidentally named the table for 'sample90' table-1, and when I checked that all the samples were present I must have become seriously number blind!

Thanks again for your help!

Jono

thermokarst · November 29, 2017, 4:42pm

Thanks for the update, @Micro_Biologist. I just wanted to make sure you saw this part of my response:

Running with multiple threads can significantly speed up your denoising step. I recognize that this was not the primary issue you were reporting on here, but I just want to make sure you are aware of that parameter. Thanks!

Micro_Biologist · November 30, 2017, 8:43am

Yes thanks, I'm running with 6 threads (which effectively maxes out the i7 in the laptop I'm currently using, because of how hyper-threading works).
