train a greengenes 2022 classifier for forward reads only

SoilRotifer · December 6, 2023, 6:19pm

If the majority of your reads are being assigned as "Unassigned" and "k__Bacteria", it could mean that your reads are in a mixed orientation with respect to the classifier. That is, both forward- and reverse-oriented reads appear in the R1 files, and likewise in the R2 files. You can read many forum threads about this here.

One way to sanity-check whether this is the case is to run qiime feature-classifier classify-consensus-vsearch ..., as vsearch does not care about read orientation. You can download the reference sequence and taxonomy files from here for use with classify-consensus-vsearch.
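A minimal sketch of that sanity check, wrapped in a small shell function. The function name and all file names are hypothetical placeholders; the reference files are assumed to have been imported as .qza artifacts already, and the output directory must not exist yet.

```shell
# Hypothetical helper: classify representative sequences with vsearch
# as an orientation-insensitive sanity check.
# Arguments: query sequences, reference reads, reference taxonomy, output dir.
vsearch_sanity_check() {
    local query="$1" ref_seqs="$2" ref_tax="$3" out_dir="$4"
    qiime feature-classifier classify-consensus-vsearch \
        --i-query "$query" \
        --i-reference-reads "$ref_seqs" \
        --i-reference-taxonomy "$ref_tax" \
        --output-dir "$out_dir"
}

# Example call (placeholder file names):
# vsearch_sanity_check rep-seqs.qza ref-seqs.qza ref-taxonomy.qza vsearch-check
```

The resulting classification can then be compared against the naive Bayes output for the same sequences.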

If you then obtain reasonable taxonomic classifications, your reads are most likely in mixed orientation relative to the naive Bayes classifier. If you get the same result, then there is likely another issue...

Some tricks to get around this issue are outlined in the following posts:

I'll provide a third option below, which is inspired by the above two posts. But the second post is likely the way to go.

The approach outlined below assumes two things:

You will be using cutadapt to remove your primer sequences.
You can import your data via a manifest file.

If both requirements are met, then you can proceed to make two manifest files.

One manifest should look something like this:

sample-id     forward-absolute-filepath       reverse-absolute-filepath
sample-1      $PWD/some/filepath/sample1_R1.fastq.gz  $PWD/some/filepath/sample1_R2.fastq.gz
sample-2      $PWD/some/filepath/sample2_R1.fastq.gz  $PWD/some/filepath/sample2_R2.fastq.gz
sample-3      $PWD/some/filepath/sample3_R1.fastq.gz  $PWD/some/filepath/sample3_R2.fastq.gz
sample-4      $PWD/some/filepath/sample4_R1.fastq.gz  $PWD/some/filepath/sample4_R2.fastq.gz

...and a second manifest that swaps the file names, such that R2 is under the forward-absolute-filepath column and R1 is under the reverse-absolute-filepath column:

sample-id     forward-absolute-filepath       reverse-absolute-filepath
sample-1      $PWD/some/filepath/sample1_R2.fastq.gz  $PWD/some/filepath/sample1_R1.fastq.gz
sample-2      $PWD/some/filepath/sample2_R2.fastq.gz  $PWD/some/filepath/sample2_R1.fastq.gz
sample-3      $PWD/some/filepath/sample3_R2.fastq.gz  $PWD/some/filepath/sample3_R1.fastq.gz
sample-4      $PWD/some/filepath/sample4_R2.fastq.gz  $PWD/some/filepath/sample4_R1.fastq.gz
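If you have many samples, writing both manifests by hand gets tedious. Here is a rough sketch of a helper that generates them. It assumes your files follow the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming used above and uses the file name prefix as the sample ID; the function name and output file names are made up for illustration.

```shell
# Hypothetical helper: write a normal and a swapped tab-separated manifest
# for every *_R1.fastq.gz / *_R2.fastq.gz pair found in a directory.
# Arguments: reads directory, normal manifest path, swapped manifest path.
write_manifests() {
    local reads_dir normal="$2" swapped="$3"
    local r1 r2 sample
    # Manifests need absolute file paths.
    reads_dir="$(cd "$1" && pwd)" || return 1
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$normal"
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$swapped"
    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -f "$r2" ]; then
            echo "warning: no R2 mate for $r1; skipping" >&2
            continue
        fi
        # Assumption: the file name prefix is an acceptable sample ID.
        sample="$(basename "${r1%_R1.fastq.gz}")"
        printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2" >> "$normal"
        printf '%s\t%s\t%s\n' "$sample" "$r2" "$r1" >> "$swapped"
    done
}

# Example call (placeholder paths):
# write_manifests some/filepath manifest-normal.tsv manifest-swapped.tsv
```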

Once you've imported both of these into QIIME 2, you can run cutadapt as you normally would on each of the two imported versions of the data.

Be sure that you set --p-discard-untrimmed! This way we'll remove all the read pairs that are in the wrong orientation, where the expected orientation is set by the manifest file. You should expect only about half of the read pairs to be trimmed and kept in each run. Repeat this command on the second batch of data, imported with the other manifest file.

Additionally, the reason for doing it this way is to avoid duplicate sequence names in one file.
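Roughly, the import and trimming steps could look like the sketch below. The function name, output file names, and primer arguments are placeholders, and I'm assuming Phred 33 quality scores (hence PairedEndFastqManifestPhred33V2). The same forward and reverse primers are used for both manifests, since the manifest decides which file is treated as "forward".

```shell
# Hypothetical helper: import one manifest, then trim primers and discard
# any read pair where a primer was not found.
# Arguments: manifest path, output prefix, forward primer, reverse primer.
import_and_trim() {
    local manifest="$1" prefix="$2" fwd_primer="$3" rev_primer="$4"
    qiime tools import \
        --type 'SampleData[PairedEndSequencesWithQuality]' \
        --input-format PairedEndFastqManifestPhred33V2 \
        --input-path "$manifest" \
        --output-path "${prefix}-demux.qza" || return 1
    qiime cutadapt trim-paired \
        --i-demultiplexed-sequences "${prefix}-demux.qza" \
        --p-front-f "$fwd_primer" \
        --p-front-r "$rev_primer" \
        --p-discard-untrimmed \
        --o-trimmed-sequences "${prefix}-trimmed.qza"
}

# Example calls (placeholder manifests; substitute your own primers):
# import_and_trim manifest-normal.tsv  normal  "$FWD_PRIMER" "$REV_PRIMER"
# import_and_trim manifest-swapped.tsv swapped "$FWD_PRIMER" "$REV_PRIMER"
```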

At this point, assuming no other issues are involved, you should be able to merge the two sets of data. Currently, there is no way to merge demuxed data in QIIME 2. So, you'd have to export the resulting demuxed fastq files, manually concatenate them for each sample, and then re-import them via a manifest. I suppose you could denoise each set separately and then merge the features, but I'm not sure if you'd be able to use the same trimming and truncation options.
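For the export-and-concatenate route, something like the following might work. It's only a sketch: it assumes both trimmed artifacts were exported with qiime tools export into separate directories and that the exported fastq.gz files carry identical names in both (worth checking before trusting the result). The function and directory names are made up. Gzip files can be concatenated directly with cat.

```shell
# Hypothetical helper: concatenate same-named fastq.gz files from two
# exported directories into a third directory.
# Arguments: first export dir, second export dir, output dir.
concat_exports() {
    local dir_a="$1" dir_b="$2" out_dir="$3"
    local f name
    mkdir -p "$out_dir" || return 1
    for f in "$dir_a"/*.fastq.gz; do
        [ -e "$f" ] || continue
        name="$(basename "$f")"
        if [ -f "$dir_b/$name" ]; then
            # Concatenated gzip members still form a valid gzip file.
            cat "$f" "$dir_b/$name" > "$out_dir/$name"
        else
            echo "warning: $name only found in $dir_a" >&2
            cp "$f" "$out_dir/$name"
        fi
    done
    # Pick up files that exist only in the second directory.
    for f in "$dir_b"/*.fastq.gz; do
        [ -e "$f" ] || continue
        name="$(basename "$f")"
        if [ ! -e "$dir_a/$name" ]; then
            echo "warning: $name only found in $dir_b" >&2
            cp "$f" "$out_dir/$name"
        fi
    done
}

# Example (placeholder names):
# qiime tools export --input-path normal-trimmed.qza  --output-path export-normal
# qiime tools export --input-path swapped-trimmed.qza --output-path export-swapped
# concat_exports export-normal export-swapped merged-fastq
```

You'd then write a new manifest pointing at the merged files and import that as usual.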

Others on the forum might have additional solutions too.