Hello QIIME2 users,
As the subject line says, I am trying to create an amplicon-specific classifier from NCBI data that I imported with the RESCRIPt plugin. Specifically, I searched for annotated nucleotide sequences of the N2 fixation gene nifH from taxonomically classified organisms, which gave roughly 9,000 sequences, typically 150-500 bp in length.
Once in QIIME, I used some of the commands in the Silva tutorial for RESCRIPt to filter my database:
qiime rescript cull-seqs
qiime rescript filter-seqs-length-by-taxon
(average length of nifH gene is 344 bp, and the primers I used generated an amplicon length of 395 bp)
qiime rescript filter-taxa
qiime rescript dereplicate
(with --p-rank-handles 'silva' --p-mode 'lca' --p-perc-identity 0.99)
After that, I tried to extract the reads that contain the primer set I used (forward = IGK3, reverse = DVV; both primers are written in the 5' to 3' orientation):
qiime feature-classifier extract-reads \
--i-sequences My-sequences.qza \
--p-f-primer GCNWTHTAYGGNAARGGNGGNATHGGNAA \
--p-r-primer ATNGCRAANCCNCCRCANACNACRTC \
--o-reads My-sequences-IGK3f-DVVr.qza
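For context on how degenerate these two primers are, here is a minimal Python sketch that multiplies out the number of bases each IUPAC code can stand for. The helper name `primer_degeneracy` is my own and is not part of QIIME 2 or RESCRIPt, and I do not know whether the degeneracy has anything to do with the run time.

```python
# Number of concrete bases each IUPAC nucleotide code can represent.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def primer_degeneracy(primer: str) -> int:
    """Return how many explicit sequences a degenerate primer expands to."""
    total = 1
    for base in primer.upper():
        # Raises KeyError for anything that is not an IUPAC code (e.g. "I").
        total *= IUPAC_DEGENERACY[base]
    return total


# Primers exactly as passed to extract-reads above.
FORWARD_PRIMER = "GCNWTHTAYGGNAARGGNGGNATHGGNAA"
REVERSE_PRIMER = "ATNGCRAANCCNCCRCANACNACRTC"

if __name__ == "__main__":
    for name, primer in (("forward", FORWARD_PRIMER), ("reverse", REVERSE_PRIMER)):
        print(f"{name}: {len(primer)} nt, {primer_degeneracy(primer)} possible sequences")
```

With the primers above, this reports 73728 possible sequences for the forward primer and 8192 for the reverse primer.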
For some reason, though, even with My-sequences.qza being only 6 MB in size and with 30 GB of RAM and 4 CPUs available, the command never finishes (it has now been running for 24 hours), which seems far too long. Note that I replaced the original "I" (inosine) bases in the primers with "N" so that they are consistent with the IUPAC code. I also left out the --p-read-orientation 'forward' option from the Silva tutorial, because I don't know the orientation of the sequences I pulled from NCBI.
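In case it helps to reproduce the primer handling, this is a small Python sketch of the inosine-to-N substitution, plus an IUPAC-aware reverse complement for comparing a primer against sequences of unknown orientation. Both function names are hypothetical helpers of mine, not QIIME 2 functions, and the short test primer in the usage lines is made up for illustration.

```python
# IUPAC complement table: each code maps to the code for the complementary base set.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def inosine_to_n(primer: str) -> str:
    """Replace inosine ("I"), which is not an IUPAC nucleotide code, with "N"."""
    return primer.upper().replace("I", "N")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a sequence, keeping ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Made-up fragment, only to show the substitution.
    print(inosine_to_n("GCIWT"))
    # Reverse primer as passed to extract-reads, shown on the opposite strand.
    print(reverse_complement("ATNGCRAANCCNCCRCANACNACRTC"))
```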
Is there something wrong with my command, or is there a reason why it would run this long without finishing?
Any help would be greatly appreciated. Thank you!