Hi @bkramer,
Sorry for the late reply. But here are some thoughts:
Just so others have a good reference to work from, a good review and comparison of nifH primers is here. You can see it took a lot of work to make the database.
I think the main issue is that these primers are very degenerate, which likely makes it quite difficult to align them against the downloaded sequences. For example, I tried running feature-classifier extract-reads on only ~250 of the sequences from the nifH database and it just kept running. I killed the job after more than an hour. Tip: when things appear to take a long time, make a smaller subset of the data and try again.
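For the subsetting tip, here is a minimal sketch in plain Python that copies the first N records of a FASTA file into a smaller file. The function names and the file names in the usage comment are placeholders of mine, not part of any QIIME 2 plugin.

```python
from itertools import islice
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(in_handle: Iterable[str], out_handle: TextIO, n: int) -> int:
    """Write the first n records to out_handle and return the number written."""
    written = 0
    for header, seq in islice(read_fasta(in_handle), n):
        out_handle.write(f">{header}\n{seq}\n")
        written += 1
    return written


# Example usage (hypothetical file names):
# with open("nifH_db.fasta") as src, open("nifH_subset.fasta", "w") as dst:
#     subset_fasta(src, dst, 250)
```

Taking the first N records is the simplest choice; if the file is sorted by taxonomy you may prefer a random sample instead.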
One option may be to run cutadapt manually outside of QIIME 2, as it can handle both FASTA and FASTQ files, while the plugin only works with FASTQ. I suggest this on the off chance that cutadapt might be better able to handle these extremely degenerate primers.
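As a rough sketch of what a standalone cutadapt run could look like, the helper below only assembles the command line so you can inspect it before running it. The helper name, the primer, the file names, and the error rate of 0.2 are all assumptions for illustration; check the flags against the cutadapt version you have installed.

```python
import shlex
from typing import List


def build_cutadapt_cmd(primer: str, in_fasta: str, out_fasta: str,
                       max_error_rate: float = 0.2) -> List[str]:
    """Assemble (but do not run) a cutadapt command that trims a 5' primer."""
    return [
        "cutadapt",
        "-g", primer,                # 5' primer; the primer and anything before it are removed
        "-e", str(max_error_rate),   # allowed error rate (assumed value, tune as needed)
        "--discard-untrimmed",       # drop sequences in which the primer was not found
        "-o", out_fasta,
        in_fasta,
    ]


# Hypothetical primer and file names, for illustration only.
cmd = build_cutadapt_cmd("GAYCCSAARGC", "nifH_subset.fasta", "nifH_trimmed.fasta")
print(shlex.join(cmd))
# To actually run it (requires cutadapt on your PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Trying this on the small subset first should tell you quickly whether cutadapt copes with the degeneracy any better.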
I'd recommend first making a "full-length" classifier and making sure it is sufficient and working before diving into an amplicon-specific classifier. Warning: some of the returned accessions might point to genomic (complete or partial) data, which will take a very long time to perform primer searches on (compounding the primer degeneracy issue). You may want to specify non-genomic data in the query. Note that this is what makes this database quite good: it has already extracted the gene segments from genomic data.
One note of caution: many researchers remove the primer sequences from their (Sanger?) sequence data prior to uploading to GenBank, while others leave the primer sequences in and note the primer locations in the record. I mention this because, for data in which the primers have already been removed, you will not be able to search for and remove the primers, as there is no longer anything to find. This may or may not be a problem depending on how searching and trimming is performed.
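One quick way to gauge how many of your downloaded sequences still contain a primer is to expand the IUPAC codes into a regular expression and count the hits. This is only a sketch: it looks for exact matches in the forward orientation (no mismatches, no reverse complement), and the primer in the example is made up.

```python
import re
from typing import Iterable, Pattern

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer: str) -> Pattern[str]:
    """Turn a degenerate primer into a compiled regex, one character class per base."""
    return re.compile("".join(f"[{IUPAC[base]}]" for base in primer.upper()))


def fraction_with_primer(seqs: Iterable[str], primer: str) -> float:
    """Return the fraction of sequences containing an exact match to the primer."""
    pattern = primer_to_regex(primer)
    seqs = list(seqs)
    if not seqs:
        return 0.0
    hits = sum(1 for seq in seqs if pattern.search(seq.upper()))
    return hits / len(seqs)


# Made-up primer and sequences, for illustration only.
example = ["TTGATCCGAAAGCTT", "TTGACCCCAAGGCAA", "ACGTACGTACGT"]
print(fraction_with_primer(example, "GAYCCSAARGC"))
```

A low fraction would suggest that many submitters stripped the primers before upload, in which case a primer-based extraction will silently drop those records.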
Alternatively, if you already have a nifH alignment (either from the curated nifH database, or your own), you can run rescript trim-alignment to extract the amplicon region based on the alignment position. You can use an alignment viewer like Unipro UGENE to make sure everything looks good. Many sequences might be in mixed orientation and align poorly; you should be able to use the alignment tools or rescript orient-seqs to help get them oriented properly.
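To make the idea of trimming by alignment position concrete, here is a small conceptual sketch. It is not how rescript implements it; it just shows that once the sequences are aligned, the amplicon region is the same column range in every row. The function names and the toy alignment are mine.

```python
from typing import Dict


def trim_alignment(aligned: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Keep alignment columns [start, end) (0-based) from every aligned sequence."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("all sequences in an alignment must have the same length")
    return {name: seq[start:end] for name, seq in aligned.items()}


def degap(seq: str) -> str:
    """Remove gap characters to recover the unaligned sequence."""
    return seq.replace("-", "").replace(".", "")


# Toy alignment, for illustration only.
toy = {"s1": "--ACGTAC--", "s2": "TTAC-TACGG"}
region = trim_alignment(toy, 2, 8)
print({name: degap(seq) for name, seq in region.items()})
```

This is also why it does not matter whether the primers are still present in the original records: the coordinates come from the alignment, not from a primer search.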
-Mike