Using RESCRIPt to create an amplicon specific classifier with NCBI data

Hello QIIME2 users,

As the subject line says, I am trying to create a amplicon specific classifier using NCBI data I imported using the RESCRIPt plugin. Specifically, I looked for annotated nucleotide sequences of the N2 fixation nifH gene that represented taxonomically classified organisms, so roughly ~9000 sequences that were on average 150 - 500 bp in length.

Once in QIIME, I used some of the commands in the Silva tutorial for RESCRIPt to filter my database:

qiime rescript cull-seqs

qiime rescript filter-seqs-length-by-taxon (average length of nifH gene is 344 bp, and the primers I used generated an amplicon length of 395 bp)

qiime rescript filter-taxa

qiime rescript dereplicate (--p-rank-handles 'silva', --p-mode 'lca' --p-perc-identity 0.99)

After which I tried to extract reads based on whether they contain the primer set I used (Forward = igk3 / Reverse = dvv, which are in the 5' - 3' orientation)

qiime feature-classifier extract-reads \
    --i-sequences My-sequences.qza \
    --o-reads My-sequences-IGK3f-DVVr.qza

For some reason though, despite the My-sequences.qza file only being 6MB in length and having 30GB of RAM and 4 CPUs, the code keeps running indefinitely (24 hours), which I think might be a little too long of a process. Also, I switched out the original "I" base for an "N" so that it was consistent with the IUPAC code. I also removed the command from the Silva tutorial --p-read-orientation 'forward', as I don't know what the orientation of the sequences I pulled from NCBI were.

Is there something wrong with my code, or a reason why this code would be stuck so long/not working as it stands?

Any help would be greatly appreciated. Thank you!

Hi @bkramer,

Just as a small heads up, I have moved your post to the Community Plugin Support category, since RESCRIPt is not one of the core plugins within QIIME 2. Someone in there should be able to provide some insight!



Thank you so much!! I wasn't sure what I'd do unless someone responded to my problem.

Hello QIIME users,

Since Wednesday (8/25/21) afternoon, I've been trying to extract reads to create an amplicon-specific classifier for the nifH gene. These reference reads were imported from NCBI using RESCRIPT, after which I culled, filtered, and dereplicated the dataset, until I got the file attached (see below). I also used the following script to extract my reads based on the IGK3/DVV primer set:

qiime feature-classifier extract-reads
--i-sequences /home/qiime2/Desktop/PostImportednifH_Ben/ncbi-refseqs-cleaned-filt-derep-99-uniq.qza
--o-reads /home/qiime2/Desktop/PostImportednifH_Ben/nifH-NCBI-seqs-IGK3f-DVVr.qza

At the moment, this code is still running (white bar is still flashing), yet the code has been running for almost 48 hours. Based on the size of the file, and assuming I didn't make a mistake with my code, does it make sense that it's taking this long?

Any help would be greatly appreciated![ncbi-refseqs-cleaned-filt-derep-99-uniq.qza|attachment]ncbi-refseqs-cleaned-filt-derep-99-uniq.qza (765.8 KB)

Hi @bkramer,

Sorry for the late reply. But here are some thoughts:

Just so others have a good reference to work from, here is a good review and comparison of nifH primers is here. You can see it took a lot of work to make the database.

I think the main issue is that these are very degenerate primers and are likely making it quite difficult to align against the downloaded sequences. For example, I tried running feature-classifier extract-reads on only ~250 of the sequences from the nifH database and it just kept running. I killed the job after more than an hour. Tip: when things appear to take a long time, make a smaller subset of the data and try again.

One option may be to manually run cutadapt outside of :qiime2:, as it can handle FASTA and FASTQ files, while the plugin only works with FASTQ. I suggest this on the off-chance that cutadapt might be better able to handle these extremely degenerate primers.

I'd recommend that a "full-length" classifier be made and make sure that is sufficient, and working before diving into making an amplicon specific classifier. Warning: some of the returned accessions might point to genomic (complete or partial) data... which will take a very long time to perform primer searches on (compounding the primer base degeneracy issue). May want to specify non-genomic data in the query? Note, this is what makes this database quite good, as it has already extracted the gene segments from genomic data.

One note of caution, many researchers often, remove the primer sequences from their (Sanger?) sequence data prior to uploading to GenBank, others will leave the primer sequences and make note of the primer location in the sequence. I mention this because... for those data in which the primers have already been removed, then you'll not be able to search for and remove the primers, as there is no longer anything to search for and remove. This may or may not be a problem depending on how searching and trimming is performed.

Alternatively, if you already have a nifH alignment (either from the curated nifH database, or your own), you can run rescript trim-alignment, to extract the amplicon region based on the alignment position. You can use an alignment viewer like Unipro UGENE to make sure everything looks good. Many sequences might be in mixed orientation and align poorly, you should be able to use the alignment tools or rescript orient-seqs to help get this oriented properly.


1 Like

Thank you @SoilRotifer !

I think my plan at the moment is to use the trim-alignment and orient-seqs commands prior to building my classifier. Is there a webpage where I can determine what inputs are required for both commands?

Some details are outlined here, otherwise simply add --help to the end of any given command, e.g.

qiime rescript trim-alignment --help

and the help text will provide you with everything you need to know. :slight_smile:


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.