Hello, I'm trying to build a reference database so that I can train a classifier for fungal ITS classification.
I downloaded the UNITE database and am following this RESCRIPt tutorial: "Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs".
After the first dereplication step, I checked the file sizes before and after dereplication, and the dereplicated files are bigger than the originals. I'm confused as to how this can happen. Am I misunderstanding how the dereplication step works? Here are the commands I ran and the resulting file sizes:
qiime rescript get-unite-data \
--p-version 9.0 \
--p-taxon-group eukaryotes \
--p-cluster-id dynamic \
--p-singletons \
--output-dir unite_rescript \
--verbose &> get_unite_data_verbose.log & disown
qiime rescript dereplicate \
--i-sequences sequences.qza \
--i-taxa taxonomy.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences sequences_derep.qza \
--o-dereplicated-taxa taxonomy_derep.qza
ls -l sequences.qza
-rw-rw-r-- 1 36360579 Jan 19 10:01 sequences.qza
ls -l sequences_derep.qza
-rw-rw-r-- 1 36373325 Jan 23 11:02 sequences_derep.qza
ls -l taxonomy.qza
-rw-rw-r-- 1 6341585 Jan 19 10:01 taxonomy.qza
ls -l taxonomy_derep.qza
-rw-rw-r-- 1 10332230 Jan 23 11:02 taxonomy_derep.qza
du -h taxonomy.qza
6.1M taxonomy.qza
du -h taxonomy_derep.qza
9.9M taxonomy_derep.qza
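
Since a .qza file is a zip archive, I realise the file size on disk may not track the number of records inside it, so comparing record counts might be a fairer check than `ls -l` or `du -h`. Below is a rough sketch of how I think that could be done. It assumes the sequence artifact stores its FASTA at `<uuid>/data/dna-sequences.fasta` and the taxonomy artifact stores a TSV with one header line at `<uuid>/data/taxonomy.tsv`. The helper names (`qza_cat`, `count_fasta_records`, `count_tsv_rows`) are my own and not part of QIIME 2 or RESCRIPt.

```shell
# Print one data file from inside a .qza (zip) archive to stdout.
# $1 = path to the artifact, $2 = file name under <uuid>/data/ (assumed layout).
qza_cat() {
    unzip -p "$1" "*/data/$2"
}

# Count FASTA records on stdin: one '>' header line per record.
count_fasta_records() {
    grep -c '^>' || true   # grep exits non-zero when the count is 0
}

# Count data rows of a TSV on stdin, skipping the single header line.
count_tsv_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Example usage (file names taken from the commands above):
#   qza_cat sequences.qza       dna-sequences.fasta | count_fasta_records
#   qza_cat sequences_derep.qza dna-sequences.fasta | count_fasta_records
#   qza_cat taxonomy.qza        taxonomy.tsv        | count_tsv_rows
#   qza_cat taxonomy_derep.qza  taxonomy.tsv        | count_tsv_rows
```

If the dereplicated counts come out lower than the originals, that would suggest the size difference comes from what is stored per record or from compression, not from extra sequences.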