Analyzing random sequences

Brigitta1 · August 2, 2022, 8:50am

Is there a way to randomly pick 1000 denoised sequences from a sample and get their taxonomic classification?

Thank you in advance,
Brigitta

llenzi · August 2, 2022, 12:47pm

Hi @Brigitta1,

What I do in this case is use seqtk (https://github.com/lh3/seqtk), a toolkit for processing sequences in FASTA/Q formats. You can install it into your QIIME 2 environment with: conda install -c bioconda seqtk (run this with the environment active!)

You can run something like the following:
seqtk sample read1.fq 1000 > sub1.fq
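
For anyone curious what "pick exactly N records at random" looks like in code, here is a minimal Python sketch using reservoir sampling, which keeps only N records in memory while reading the file once. This is an illustration under my own assumptions, not seqtk's actual implementation; the function names read_fasta and reservoir_sample are hypothetical, and the demo records are made up.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int,
                     seed: Optional[int] = None) -> List[Record]:
    """Return k records chosen uniformly at random (all of them if fewer than k)."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)  # fill the reservoir with the first k records
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec  # replace a kept record with probability k/(i+1)
    return reservoir


if __name__ == "__main__":
    # Hypothetical in-memory FASTA; with a real file, pass open("seqs.fasta").
    demo = [">s1", "ACGT", ">s2", "GGCC", ">s3", "TTAA"]
    for header, seq in reservoir_sample(read_fasta(demo), k=2, seed=1234):
        print(f">{header}\n{seq}")
```

Fixing the seed makes the pick reproducible, which helps if you need to subsample paired files the same way.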

For the taxonomic assignment of the subsampled reads, you may try:
https://library.qiime2.org/plugins/q2-metaphlan2/12/

Hope it helps
Luca

SoilRotifer · August 2, 2022, 3:41pm

Hi @Brigitta1,

In addition to @llenzi's suggestion you can also make use of: qiime rescript subsample-fasta ...

If you'd like to subsample about 5% of the sequences, you'd use the following command:

qiime rescript subsample-fasta \
    --i-sequences seqs.qza \
    --p-subsample-size 0.05 \
    --p-random-seed 1234 \
    --o-sample-sequences sub-sampled-seqs.qza

Note that in this example you may not always get exactly 5% of the sequences as your output. That is because we worked to make this fast and memory efficient. The command iterates through each sequence and picks a random value between 0 and 1. If that value is less than --p-subsample-size, the sequence is written to file. Thus, if you had 100 sequences and wanted to subsample ~5% of them, you might end up with 4, 5, or 6 sequences in your output.
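
To make the per-sequence coin flip concrete, here is a small Python sketch of the idea described above. It is only an illustration of the technique, not RESCRIPt's actual code; the function name bernoulli_subsample and the fake sequence IDs are assumptions made for the example.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def bernoulli_subsample(items: Iterable[T], fraction: float,
                        seed: int = 1) -> Iterator[T]:
    """Keep each item independently with probability `fraction`.

    Only one item is held in memory at a time, but the number of items
    kept varies from run to run unless the seed is fixed.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    rng = random.Random(seed)
    for item in items:
        # rng.random() returns a float in [0, 1)
        if rng.random() < fraction:
            yield item


if __name__ == "__main__":
    # Hypothetical sequence IDs standing in for real FASTA records.
    ids = [f"seq{i}" for i in range(100)]
    for seed in (1, 2, 3):
        kept = list(bernoulli_subsample(ids, 0.05, seed=seed))
        print(f"seed={seed}: kept {len(kept)} of {len(ids)}")
```

Because every sequence is decided independently, the count kept is only 5% on average. If you need an exact number such as 1000, a fixed-count sampler is the better fit.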

-Cheers!
-Mike

llenzi · August 2, 2022, 4:48pm

I always forget how magic RESCRIPt is, lol

Brigitta1 · August 5, 2022, 11:18am

Thank you so much @llenzi @SoilRotifer
