Dereplicated files are bigger than original files?

emmlemore · January 23, 2024, 3:20pm

Hello, I'm currently trying to build a reference database in order to train a classifier for fungal ITS classification.

I downloaded the UNITE database and am following this RESCRIPt tutorial: Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs.

After the first dereplication step, I checked the size of each file before and after dereplication, and the dereplicated files are bigger than the originals. I am confused as to how this can be. Maybe I'm not understanding how the dereplication step works?

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose &> get_unite_data_verbose.log & disown

qiime rescript dereplicate \
    --i-sequences sequences.qza \
    --i-taxa taxonomy.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_derep.qza \
    --o-dereplicated-taxa taxonomy_derep.qza

ls -l sequences.qza
-rw-rw-r-- 1 36360579 Jan 19 10:01 sequences.qza

ls -l sequences_derep.qza 
-rw-rw-r-- 1 36373325 Jan 23 11:02 sequences_derep.qza

ls -l taxonomy.qza
-rw-rw-r-- 1 6341585 Jan 19 10:01 taxonomy.qza

ls -l taxonomy_derep.qza 
-rw-rw-r-- 1 10332230 Jan 23 11:02 taxonomy_derep.qza

du -h taxonomy.qza
6.1M	taxonomy.qza

du -h taxonomy_derep.qza 
9.9M	taxonomy_derep.qza

Nicholas_Bokulich · January 23, 2024, 4:28pm

Hi @emmlemore ,

File size can be misleading here, because the QIIME 2 artifact also contains provenance information, citations, etc. These adjunct files are not large, but it means that file size can theoretically increase even if the data contents remain the same or shrink after a processing step.

In your case: the dereplication step is probably not doing much. The UNITE data files are already clustered to some extent (hence the 97, 99, and dynamic cluster options), and so dereplication without first trimming will probably lead to no change. So you probably have the same sequences in the output file, but the provenance and other metadata files increase in size, as more information about the process gets added.
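One way to check where the bytes actually go is to unpack each artifact with `qiime tools extract` and compare the `data/` and `provenance/` directories. The following is a minimal sketch, not an official recipe: the helper name `artifact_size_breakdown` and the directory name `taxonomy_extracted` are made up for illustration, and it assumes the extraction directory contains a single `<uuid>/` subdirectory.

```shell
# Hypothetical helper: print the size (in KiB) of the data/ and provenance/
# directories of an artifact already unpacked with `qiime tools extract`.
# $1: extraction directory (assumed to hold one <uuid>/ subdirectory).
artifact_size_breakdown() {
    local extracted_dir="$1" part
    for part in data provenance; do
        # The glob matches the <uuid>/ directory created by the extraction.
        du -sk "$extracted_dir"/*/"$part"
    done
}

# Example usage (requires QIIME 2 on PATH; directory name is an assumption):
#   qiime tools extract --input-path taxonomy.qza --output-path taxonomy_extracted
#   artifact_size_breakdown taxonomy_extracted
```

Note that these are uncompressed sizes, while `ls -l` on the `.qza` reports the compressed archive, so the numbers will not add up to the file size exactly. Running the helper on both the original and the dereplicated artifact should still show which part grew.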

If you want to check how many sequences you have before/after a dereplication or other step, you could use the qiime feature-table tabulate-seqs action or qiime rescript evaluate-seqs.
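For a quick count on the command line, another option is to export the sequences and count the FASTA header lines. This is only a sketch under stated assumptions: the function names are hypothetical, and it assumes that exporting a sequence artifact with `qiime tools export` writes a file named `dna-sequences.fasta`.

```shell
# Hypothetical helper: count FASTA records, i.e. lines starting with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: export a sequence artifact to a temporary directory
# and count its records. Requires QIIME 2 on PATH.
count_artifact_seqs() {
    local qza="$1" outdir
    outdir="$(mktemp -d)"
    qiime tools export --input-path "$qza" --output-path "$outdir" > /dev/null
    # Assumes the exported file is named dna-sequences.fasta.
    count_fasta_records "$outdir/dna-sequences.fasta"
    rm -rf "$outdir"
}

# Example usage:
#   count_artifact_seqs sequences.qza
#   count_artifact_seqs sequences_derep.qza
```

If both calls print the same number, dereplication did not remove any sequences, whatever the file sizes suggest.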

emmlemore · January 23, 2024, 5:37pm

Thank you - that makes sense!
