Build closed reference OTU database from HOMD for multiple hypervariable regions

Hi QIIME2 friends -

I’m conducting a meta-analysis on several datasets. All did 16S amplicon sequencing, but the hypervariable region sequenced varies between the datasets. Therefore I need to do closed-reference OTU picking. I would like to use HOMD as this is appropriate to the samples I’m analyzing.

My question is: Can anyone point me in the right direction for making closed reference database from the HOMD_16S_rRNA_RefSeq_V15.22.aligned.fasta and HOMD_16S_rRNA_RefSeq_V15.22.qiime.taxonomy files that can be used across all the different datasets? I will then merge the feature tables at the end to one big table and do downstream analyses from there.

Here is my pipeline so far.

#import data to qiime
##this will be dataset specific
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.txt \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2

#check out the quality graphs of the samples
#need to drag and drop the output file to https://view.qiime2.org
qiime demux summarize \
  --i-data paired-end-demux.qza \
  --o-visualization demux.qzv

Trim amplicon primers

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GGACTACHVGGGTATCTAATCC \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --o-trimmed-sequences trimmed.paired-end-demux.qza

Summarise the reads

qiime demux summarize \
  --i-data trimmed.paired-end-demux.qza \
  --o-visualization summary.trimmed.paired-end-demux.qzv

##merge paired end sequences
#min merge length is 250 and max merge length is 430
qiime vsearch join-pairs \
  --i-demultiplexed-seqs trimmed.paired-end-demux.qza \
  --p-minmergelen 250 \
  --p-maxmergelen 430 \
  --o-joined-sequences merged.qza

##check out the quality graphs of the merged samples
#need to drag and drop the output file to https://view.qiime2.org
qiime demux summarize \
  --i-data merged.qza \
  --o-visualization summary.merged.qzv

##quality filter sequences
qiime quality-filter q-score \
  --i-demux merged.qza \
  --o-filtered-sequences filt.merged.qza \
  --o-filter-stats stats.filt.merged.qza

##dereplicate
qiime vsearch dereplicate-sequences \
  --i-sequences filt.merged.qza \
  --o-dereplicated-table filt.merged.derep.table.qza \
  --o-dereplicated-sequences filt.merged.derep.seq.qza

I’m pretty sure that from the dereplication step I should go to qiime vsearch cluster-features-closed-reference, but I’m not sure whether I need to do any preprocessing of the reference dataset?

Thanks very much team!

Hi @brett,

Welcome to the :qiime2: forum!

There's a simple answer and a more complicated one. The direct answer to your question is to import the aligned sequences and use RESCRIPt to degap and filter.

My (entirely unsolicited) advice would be to denoise before you do clustering, since it's better than standard quality filtering. If you're combining multiple regions in different samples (one region / sample), closed reference picking is the way to go. If you're combining multiple regions within a single sample, some work I did recently suggests that sidle is the most accurate option and plays pretty nicely with most databases. (With multiple regions in a single sample, and a specialized database, you're likely to get trustworthy species level resolution.)

Best,
Justine


Hi @jwdebelius - :wave:t4:

Thanks very much for the warm welcome- happy to be here learning qiime2 - and the info.
Sorry it has taken some time to respond; I didn’t want to post a half answer, but I think I am there thanks to your help :+1:t4: :slight_smile: To clarify, we are combining multiple regions in different samples for a meta-analysis.

Here’s how I got HOMD in, posted here so others may find it useful:
#clean up those headers in the fasta file. remove everything from the fasta header lines except for the seq ID
sed 's/|.*//' HOMD_16S_rRNA_RefSeq_V15.22.aligned.fasta > cleanedHOMD.fasta
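For anyone who prefers to do this in a script, here is a small Python sketch of the same header cleanup the sed one-liner performs (the `>ID|description|...` header layout is my assumption about the HOMD FASTA format; the example ID is made up):

```python
# Strip everything after the first "|" on FASTA header lines, keeping
# only the sequence ID -- the same edit as: sed 's/|.*//' in.fasta
def clean_headers(fasta_lines):
    cleaned = []
    for line in fasta_lines:
        if line.startswith(">"):
            # keep only the sequence ID before the first pipe
            line = line.split("|", 1)[0]
        cleaned.append(line)
    return cleaned

example = [">SEQ001|Some species|extra fields", "ACGTACGT"]
print(clean_headers(example))  # ['>SEQ001', 'ACGTACGT']
```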

#sequence file with cleaned headers; this is an aligned sequence type
qiime tools import \
  --type 'FeatureData[AlignedSequence]' \
  --input-path cleanedHOMD.fasta \
  --output-path HOMD15.22.seqs.qza

#the qiime formatted taxonomy file - change the first two header values to be ‘Feature ID’ and ‘Taxon’
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path HOMD_16S_rRNA_RefSeq_V15.22.qiime.taxonomy.reformat \
  --output-path taxranks-HOMD15.22.qza
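The header tweak described in the comment above can be done by hand, but here is a Python sketch of it for reference (the original HOMD column names and the demo taxonomy string are assumptions; only the "Feature ID" / "Taxon" targets come from the QIIME 2 taxonomy format):

```python
# Rename the first two column headers of a taxonomy TSV to the
# "Feature ID" / "Taxon" names that QIIME 2's Taxonomy format expects.
import csv
import io

def reformat_taxonomy(tsv_text):
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    rows[0][0], rows[0][1] = "Feature ID", "Taxon"
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

demo = "seq_id\ttaxonomy\nSEQ001\tk__Bacteria;g__Corynebacterium;s__accolens\n"
print(reformat_taxonomy(demo).splitlines()[0])  # Feature ID	Taxon
```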

#unalign to remove gaps and lowercase characters; apply a minimum length after degapping
qiime rescript degap-seqs \
  --i-aligned-sequences HOMD15.22.seqs.qza \
  --p-min-length 1200 \
  --o-degapped-sequences HOMD.degappedseqs.qza

#complete dereplication for closed ref approach
qiime rescript dereplicate \
  --i-sequences HOMD.degappedseqs.qza \
  --i-taxa taxranks-HOMD15.22.qza \
  --p-rank-handles 'silva' \
  --p-mode 'lca' \
  --o-dereplicated-sequences HOMD-seqs-degap-derep-lca.qza \
  --o-dereplicated-taxa HOMD-tax-derep-lca.qza

I updated the pipeline and was able to get HOMD (mostly) working with a test dataset of a few samples. Briefly:
-import files
-remove primers
-merge

#denoise
qiime deblur denoise-16S \
  --i-demultiplexed-seqs merged.qza \
  --p-trim-length 250 \
  --p-sample-stats \
  --o-table denoise.table.qza \
  --o-representative-sequences denoise.rep.seqs.qza \
  --o-stats denoise.stats.qza

#closed ref OTU picking
qiime vsearch cluster-features-closed-reference \
  --i-sequences denoise.rep.seqs.qza \
  --i-table denoise.table.qza \
  --i-reference-sequences HOMD-seqs-degap-derep-lca.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table HOMDtest_table99.qza \
  --o-clustered-sequences HOMDtest_repseqs99.qza \
  --o-unmatched-sequences HOMDtest_unmatchedseqs99.qza

#taxonomic classification of OTUs
qiime feature-classifier classify-consensus-vsearch \
  --i-query HOMDtest_repseqs99.qza \
  --i-reference-reads HOMD-seqs-degap-derep-lca.qza \
  --i-reference-taxonomy HOMD-tax-derep-lca.qza \
  --o-classification HOMDtaxonomy.qza

I have a few questions/clarifications:

  1. I did not include a --p-perc-identity threshold when dereplicating HOMD as the database is quite small and specific. However, when I cluster my closed reference OTUs I do apply a --p-perc-identity 0.99 at that point for my sequence data - is this okay?

  2. When I inspect the taxonomy table, the species level classification is missing for many of the classified sequences on my test dataset. I think this resolution is lost in the dereplication stage but I am not sure. When I take the Feature ID in the taxonomy table and match it back to the original downloaded HOMD taxonomy file I can see a species identification there - but not in my imported modified qiime taxa file (most only go down to genus level).
    Do you have any ideas where I can do a sanity check to see what’s happening here? I guess the whole point of wanting to use this taxonomic database is that it’s specific to my sample origin. Could it also be the 250 bp trimmed sequence length?

  3. Going forward, will I be able to use this HOMD QIIME database for the other datasets that sequenced different hypervariable regions, then merge the tables together before additional downstream filtering, i.e., rarefaction etc.?

Thanks for all your help and valuable suggestions!
Cheers,
Brett


Hi @brett,

Great questions.

Yes, that's totally fine, because these cluster steps are performed for different reasons/have different boundary conditions, so the % ids do not need to match. In your case, you dereplicate the database (but do not cluster) to remove redundancy. For closed-ref OTU clustering of the query seqs, this %id is rather a lower boundary used for finding a match in the reference database. Sequences with < 99% match will be discarded. So for closed-ref it is normal to set a more permissive %id than you would for the reference database (and keep an eye on how many sequences are being lost).
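To illustrate the "lower boundary" point above with a toy example (this is a deliberate simplification: identity here is just matches over alignment length between equal-length strings, not what VSEARCH actually computes):

```python
# The closed-ref --p-perc-identity is an acceptance threshold for a
# query/reference match, not a clustering radius on the database.
def percent_identity(a, b):
    # simplified: fraction of matching positions in equal-length strings
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

query = "ACGTACGTAC"
ref   = "ACGTACGTAA"   # one mismatch out of 10 positions
pid = percent_identity(query, ref)
print(pid)             # 0.9
print(pid >= 0.99)     # False: this query would be discarded at 0.99
```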

Yes: you are dereplicating in "lca" mode, which means that the reference taxonomy is being trimmed to the least common ancestor of all reference sequences that have 100% identical sequences in the HOMD. Whether that's a good thing or a bad thing is for you to decide... on the one hand it means that you are seeing the true resolution of the HOMD sequences (i.e., because the species that you saw before are actually 100% identical and cannot be resolved), but on the other hand it can be quite disappointing to see how poorly short 16S reads can really resolve different species!

If you want to experiment with this, RESCRIPt has other dereplication "modes", e.g., to do modified LCA that is weighted by the majority (e.g., if 2/3 sequences in a derep cluster have the same taxonomy, that label will be used for the dereplicated sequence's taxonomy).
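For intuition, here is a minimal sketch of the LCA trimming idea on semicolon-delimited taxonomy strings (this is not RESCRIPt's implementation, just the concept; the example taxa are illustrative):

```python
# Keep only the leading ranks shared by every taxonomy in a dereplicated
# cluster; everything below the first disagreement is dropped.
def lca(taxonomies):
    split = [t.split(";") for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) == 1:
            shared.append(ranks[0])
        else:
            break
    return ";".join(shared)

cluster = [
    "g__Corynebacterium;s__accolens",
    "g__Corynebacterium;s__macginleyi",
]
print(lca(cluster))  # g__Corynebacterium  (species level is lost)
```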

Yes! With closed-ref you are aligning reads from each hypervariable region against the same full-length 16S reference sequences, so the feature IDs are comparable across datasets.

Good luck!


Hi @Nicholas_Bokulich :wave:t4:

Thanks heaps for such a quick reply.

Apologies, I need to update question 2 for clarity. :upside_down_face:

Some of the OTUs from the closed reference step are poorly classified.
For example, Feature ID 019N026A in my OTU table gets classified as g__Corynebacterium. In the reference HOMD taxonomy file Feature ID 019N026A is g__Corynebacterium;s__accolens.

I'm curious as to why that species classification doesn't append to the taxonomy string in the taxonomy table if the feature IDs are exactly the same?

Here is the script I used:
qiime vsearch cluster-features-closed-reference \
  --i-sequences denoise.rep.seqs.qza \
  --i-table denoise.table.qza \
  --i-reference-sequences HOMD-seqs-degap-derep-lca.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table HOMDtest_table99.qza \
  --o-clustered-sequences HOMDtest_repseqs99.qza \
  --o-unmatched-sequences HOMDtest_unmatchedseqs99.qza

#taxonomic classification of OTUs; this method searches the entire reference database before choosing the top N hits, not the first N hits
qiime feature-classifier classify-consensus-vsearch \
  --i-query HOMDtest_repseqs99.qza \
  --i-reference-reads HOMD-seqs-degap-derep-lca.qza \
  --i-reference-taxonomy HOMD-tax-derep-lca.qza \
  --o-classification HOMDtaxonomy.qza

Files are attached if you need them.
HOMD-tax-derep-lca.qza (35.4 KB) HOMD-seqs-degap-derep-lca.qza (253.0 KB) denoise.rep.seqs.qza (26.7 KB) denoise.table.qza (26.8 KB)

Doing ‘lca’ or ‘majority’ didn’t change the outcome of the taxonomy file for the HOMD taxonomy reference table. I dereplicated using both of those and get the same number of taxonomy entries in the dereplicated output (1,013 entries). For reference, the original HOMD taxonomy file has 1,015 feature IDs/taxa. So here is my best guess: the HOMD is already dereplicated, and the 2 feature IDs/taxa are lost at the degap step with the minimum sequence length supplied (1200 bp).

Many thanks for all your help,
Brett

Because you are doing closed-ref OTU clustering, and then performing a taxonomic classification with VSEARCH + LCA as implemented in q2-feature-classifier.

You are right: HOMD is probably already dereplicated, but this is dereplication of (I assume) full-length 16S seqs, so the species differences are preserved.

So closed-ref OTU clustering is finding the closest match for each of your query sequences (which are only partial 16S hypervariable regions). The returned sequences adopt the feature ID of the closest HOMD match, but the original ASV sequence of the query.

So even though the feature ID matches the HOMD reference, the sequence is shorter (much shorter) and the VSEARCH+LCA reclassification shows that these partial sequences actually map to multiple HOMD sequences and cannot truly resolve to species level, as we would expect.

In that regard, you are doing the right thing — reclassifying the short sequences after clustering to see what the true taxonomic resolution is.

You could do another thing — skip the VSEARCH+LCA classification and just use the HOMD reference taxonomy after performing closed-ref OTU clustering. This would be assuming that the closest HOMD match to each query is the only match, so would ignore the fact that the short reads actually map to multiple reference sequences... :see_no_evil:

One thing that might help the taxonomic classification... classify-consensus-vsearch has a "top hits only" mode to choose only the top N equal best hits, instead of the top N hits. E.g., if there are 5 ref sequences with >97% id to the query, but two of these are 99% similar and the rest are 98%, then it will only take the first two for performing consensus classification. Might help!
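The idea can be sketched like so (hypothetical hit data, not q2-feature-classifier internals; hits are (reference ID, identity) pairs and mirror the 5-reference example above):

```python
# "Top hits only": among hits passing the identity threshold, keep only
# those tied for the best identity before building the consensus taxonomy.
def top_hits_only(hits, threshold=0.97):
    passing = [h for h in hits if h[1] >= threshold]
    if not passing:
        return []
    best = max(pid for _, pid in passing)
    return [h for h in passing if h[1] == best]

hits = [("ref1", 0.99), ("ref2", 0.99), ("ref3", 0.98),
        ("ref4", 0.98), ("ref5", 0.98)]
print(top_hits_only(hits))  # [('ref1', 0.99), ('ref2', 0.99)]
```

With only the two 99% hits retained, a species label shared by both can survive the consensus step instead of being diluted by the weaker 98% hits.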

Good luck!


Awesome! Thanks heaps.
I have used
qiime feature-classifier classify-consensus-vsearch \
  --i-query HOMDtest_repseqs99.qza \
  --i-reference-reads HOMD-seqs-degap-derep-lca.qza \
  --i-reference-taxonomy HOMD-tax-derep-lca.qza \
  --p-top-hits-only \
  --o-classification HOMDtaxonomytophit.qza
which seems to be a happy medium and as close as we are going to get :slight_smile:
Thanks for all your help! Cheers!
