I wasn’t sure how to add to a closed topic, so I made a part 2.
My goal in Part 1 was to use SourceTracker to identify likely sources of the organisms in my samples. To do so, I wanted to download data from different projects and merge them with my own data, to end up with one frequency table that has all of my samples plus samples from the different environments I want to compare mine to.
I thought I had solved this problem in the “Merging seqs.fna from multiple projects” topic. However, Qiita was trimming the sequences too short (150 max for some projects), and a majority of the sequences in my own samples weren’t being identified to a specific enough level to be compared with other data. As a result, I was getting a lot of ‘unknowns’ in my SourceTracker analysis. Did I say that in a confusing manner? Probably.
I’m trying something different now, so I just wanted to share it with people who might also want to use SourceTracker.
Step 1: Download the split library results (seqs.fna) of each project separately from Qiita and import them into QIIME 2
qiime tools import
Step 2: Dereplicate sequences into 100% OTUs
qiime vsearch dereplicate-sequences
Step 3a: Filter tables to only have samples I want (some projects come with 1000+ samples)
qiime feature-table filter-samples
Step 3b: Filter out features that don’t show up in any samples (3a removed samples, which may mean some features no longer show up in any of the remaining samples)
qiime feature-table filter-features
Step 4: Filter seqs to match the samples I want. I couldn’t do this with a metadata input, but using the filtered table from Step 3 works great
qiime feature-table filter-seqs
Step 5: Assign taxonomy
qiime feature-classifier classify-sklearn
Repeat steps 1-5 for all datasets you are interested in
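Since steps 1-5 get repeated per project, they could be wrapped in a small shell function. This is only a sketch: the function name, the output file names, the metadata file (listing just the samples to keep) and the classifier artifact are all placeholders I made up, and the `--p-min-samples 1` setting is one assumed way to drop features that no longer appear anywhere.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-5 for ONE project. Every file name below is a placeholder.
process_project() {
  local name="$1"        # hypothetical project label, e.g. projA
  local seqs_fna="$2"    # seqs.fna downloaded from Qiita
  local metadata="$3"    # metadata TSV listing only the samples to keep (assumed)
  local classifier="$4"  # pre-trained sklearn classifier artifact (assumed)

  # Step 1: import the split library sequences
  qiime tools import \
    --type 'SampleData[Sequences]' \
    --input-path "$seqs_fna" \
    --output-path "${name}-seqs.qza"

  # Step 2: dereplicate into a table plus representative sequences
  qiime vsearch dereplicate-sequences \
    --i-sequences "${name}-seqs.qza" \
    --o-dereplicated-table "${name}-table.qza" \
    --o-dereplicated-sequences "${name}-rep-seqs.qza"

  # Step 3a: keep only the samples whose IDs are in the metadata file
  qiime feature-table filter-samples \
    --i-table "${name}-table.qza" \
    --m-metadata-file "$metadata" \
    --o-filtered-table "${name}-table-samples.qza"

  # Step 3b: drop features that are not present in at least one sample
  qiime feature-table filter-features \
    --i-table "${name}-table-samples.qza" \
    --p-min-samples 1 \
    --o-filtered-table "${name}-table-filtered.qza"

  # Step 4: keep only the sequences still present in the filtered table
  qiime feature-table filter-seqs \
    --i-data "${name}-rep-seqs.qza" \
    --i-table "${name}-table-filtered.qza" \
    --o-filtered-data "${name}-rep-seqs-filtered.qza"

  # Step 5: assign taxonomy to the remaining sequences
  qiime feature-classifier classify-sklearn \
    --i-classifier "$classifier" \
    --i-reads "${name}-rep-seqs-filtered.qza" \
    --o-classification "${name}-taxonomy.qza"
}
```

It would then be called once per dataset, for example `process_project projA projA/seqs.fna projA-samples.tsv classifier.qza`.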
Step 6: Merge taxonomies
Step 7: Merge tables
Step 8: Format for SourceTracker
I’m still assigning taxonomy right now, and will try Steps 6 and 7 soon! I will add commands for them when I’m done and post an update on how the SourceTracker outputs look!
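In the meantime, here is a rough, untested sketch of what Steps 6-8 might look like. It assumes each project already has a filtered table and a taxonomy artifact named `<label>-table-filtered.qza` and `<label>-taxonomy.qza` (made-up names), and that exporting to BIOM and then to TSV is an acceptable input format for your SourceTracker version. The function name and output names are placeholders too.

```shell
#!/usr/bin/env bash
# Sketch of steps 6-8. File names are placeholders, not confirmed outputs.
merge_for_sourcetracker() {
  local out_dir="$1"; shift   # export directory; remaining args are project labels
  local table_args=() taxa_args=()
  local name
  for name in "$@"; do
    table_args+=(--i-tables "${name}-table-filtered.qza")
    taxa_args+=(--i-data "${name}-taxonomy.qza")
  done

  # Step 6: merge the per-project taxonomies
  qiime feature-table merge-taxa \
    "${taxa_args[@]}" \
    --o-merged-data merged-taxonomy.qza

  # Step 7: merge the per-project feature tables
  qiime feature-table merge \
    "${table_args[@]}" \
    --o-merged-table merged-table.qza

  # Step 8: export the merged table to BIOM, then convert it to a text table
  qiime tools export \
    --input-path merged-table.qza \
    --output-path "$out_dir"
  biom convert \
    -i "$out_dir/feature-table.biom" \
    -o "$out_dir/feature-table.tsv" \
    --to-tsv
}
```

Example call: `merge_for_sourcetracker exported projA projB mydata`. The TSV conversion can be skipped if the BIOM file is enough.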