Determining Which File a Feature Originated In

alexkrohn · August 25, 2021, 7:14pm

The goal of my study is to determine which vertebrate or mussel species are present in various eDNA samples. We extracted DNA, sequenced four genes (separately), and are analyzing the results in Qiime2. For 48 samples, each sequenced at one gene, I ran the data through demultiplexing -> dada2 denoising (after adjusting the parameters so as not to lose all the data!) -> downloading reference data from NCBI (bacteria, fungi, and vertebrates or mussels for each gene) -> filtering the reference data by length and by my primers -> fitting a Bayesian classifier to the reference sequences -> and FINALLY running the classifier on my real data.

In case it wasn't clear, the output of the denoising step is a rep-seqs.qza that contains sequences of one gene from 48 different environmental localities.

The results are attached. How can I figure out which sample these Feature IDs came from? I assume each Feature ID is one (or a few) read(s), but I really need to know which sample those reads came from so I can determine where that species might have been detected.

If it helps, here is the code I ran:

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest-file.txt \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path demux-co1-r1.qza

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-co1-r2.qza \
  --p-trim-left 20 \
  --p-trunc-len 115 \
  --o-representative-sequences rep-seqs-r2-115.qza \
  --o-table table-r2-115.qza \
  --o-denoising-stats denoise-stats-r2-115.qza

qiime rescript evaluate-fit-classifier \
  --i-sequences coi-mussels-filtered-seqs.qza \
  --i-taxonomy coi-mussels-taxonomy-unfiltered.qza \
  --o-classifier coi-mussels-classifier.qza \
  --o-evaluation coi-mussels-classifier-evaluation.qzv \
  --o-observed-taxonomy coi-mussels-classifier-predicted-taxonomy.qza \
  --verbose

qiime feature-classifier classify-sklearn \
  --i-classifier ../coi-mussels-classifier.qza \
  --i-reads rep-seqs-r1-115.qza \
  --o-classification coi-r1-mussels-classified-taxonomy.qza

qiime metadata tabulate \
  --m-input-file coi-r1-mussels-classified-taxonomy.qza \
  --o-visualization coi-r1-mussels-classified-taxonomy.qzv

Is there a way to search individual samples (e.g. fastq files) to see if they contain a feature ID? Or do I have to run each of those above commands on one sample (i.e. one fastq file) to get taxonomic identification for that sample?

Thanks for your help!

colinbrislawn · August 27, 2021, 2:08pm

Hello again Alex,

That data is in your table-r2-115.qza file, which is a table / DataFrame of counts by feature and sample. (The rep-seqs-r2-115.qza file just has the feature IDs and sequences.)
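To make the layout of that table concrete, here is a minimal sketch in Python. It assumes the table has already been exported to TSV (for example with qiime tools export followed by biom convert --to-tsv), and the feature IDs, sample names, and counts are made up for illustration. Each row is a feature and each column is a sample, so the samples a feature came from are simply the columns with a non-zero count.

```python
import csv
import io

# Hypothetical feature table in the layout written by `biom convert --to-tsv`:
# one comment line, then a header row whose first cell is "#OTU ID".
TABLE_TSV = (
    "# Constructed from biom file\n"
    "#OTU ID\tsite-A\tsite-B\tsite-C\n"
    "feat1\t12.0\t0.0\t3.0\n"
    "feat2\t0.0\t0.0\t7.0\n"
)


def samples_per_feature(tsv_text):
    """Map each feature ID to {sample: count}, keeping only counts > 0."""
    samples = None
    result = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if row[0] == "#OTU ID":
            samples = row[1:]  # header row: the sample names
        elif row[0].startswith("#"):
            continue  # any other comment line
        elif samples is None:
            raise ValueError("data row found before the '#OTU ID' header")
        else:
            counts = zip(samples, (float(c) for c in row[1:]))
            result[row[0]] = {s: c for s, c in counts if c > 0}
    return result


if __name__ == "__main__":
    for feature_id, hits in samples_per_feature(TABLE_TSV).items():
        print(feature_id, hits)
```

The function name and the toy data are illustrative only; with a real export you would read the TSV from disk instead of from a string.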

There's a couple of ways to do this within Qiime 2. Check out these threads:

alexkrohn · August 27, 2021, 6:58pm

Hi again @colinbrislawn! This is close, but I'm not quite there yet. I essentially want to append sample names to the taxonomy output from the feature-classifier.

qiime tools export --input-path coi-r1-mussels-classified-taxonomy.qzv outputs a non-BIOM visualization, and qiime tools export --input-path coi-r1-mussels-classified-taxonomy.qza exports a non-BIOM taxonomy tsv, which still doesn't have the sample names in it.

Somehow I need to match the sample names from the table-r2-115.qza to the feature IDs in the coi-r1-mussels-classified-taxonomy.qza. I'm sure this is possible in Qiime, but I don't even know how to phrase that well enough to search for it...

In the end I've been running really long R scripts that just zgrep each taxonomy-associated sequence across all of the raw data files. Definitely not efficient!

Is there a way to do this in Qiime?

colinbrislawn · August 28, 2021, 3:50pm

OK. So your taxa table has feature names and taxonomy labels. Your table has feature names and sample names. You can join these by feature names, to get the table you describe.
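A rough sketch of that join, written in Python with only the standard library. It assumes both artifacts have been exported to TSV: the taxonomy as the taxonomy.tsv that qiime tools export writes (columns Feature ID, Taxon, Confidence), and the feature table converted with biom convert --to-tsv. The feature IDs, taxon strings, sample names, and counts below are invented, and join_by_feature is a hypothetical helper name.

```python
import csv
import io

# Hypothetical taxonomy export (invented feature IDs and taxon labels).
TAXONOMY_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    "feat1\tk__Animalia; g__Lampsilis\t0.98\n"
    "feat2\tk__Animalia; g__Elliptio\t0.91\n"
)

# Hypothetical feature table export (features as rows, samples as columns).
TABLE_TSV = (
    "# Constructed from biom file\n"
    "#OTU ID\tsite-A\tsite-B\tsite-C\n"
    "feat1\t12.0\t0.0\t3.0\n"
    "feat2\t0.0\t0.0\t7.0\n"
)


def join_by_feature(taxonomy_text, table_text):
    """Return one record per (feature, sample) pair with a non-zero count."""
    # Look-up from feature ID to taxonomy label.
    taxonomy = {
        row["Feature ID"]: row["Taxon"]
        for row in csv.DictReader(io.StringIO(taxonomy_text), delimiter="\t")
    }
    # Keep the "#OTU ID" header row but drop any other comment lines.
    table_lines = [
        line for line in table_text.splitlines()
        if line.startswith("#OTU ID") or not line.startswith("#")
    ]
    joined = []
    for row in csv.DictReader(table_lines, delimiter="\t"):
        feature_id = row.pop("#OTU ID")
        for sample, count in row.items():
            if float(count) > 0:
                joined.append({
                    "feature_id": feature_id,
                    # Features missing from the taxonomy file are labeled as such.
                    "taxon": taxonomy.get(feature_id, "unclassified"),
                    "sample": sample,
                    "count": float(count),
                })
    return joined


if __name__ == "__main__":
    for record in join_by_feature(TAXONOMY_TSV, TABLE_TSV):
        print(record)
```

The output is in long format (one line per feature and sample), which makes it easy to filter for a single species and see which localities it turned up in.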

If you would like to post the R script you are running now, I can help craft that merge command.

alexkrohn · August 30, 2021, 1:13pm

Ah, I see. Rather than an export or merge in Qiime2, you're suggesting to output them both to TSVs, then use a matching function in R based on the shared feature name column. Thanks for your help!
