Filter representative sequences by sequence length

sixvable · October 16, 2019, 2:23pm

Hello qiime2 team!

Recently I have been working with raw sequences that were sequenced in MiSeq PE300 mode.
The sequence quality is not very good. After denoising, I found some representative sequences that are shorter than my expected length. I BLASTed them and I am sure they are not my target sequences.
So I want to filter my rep-seqs by length, but I could not find a plugin that provides this function directly (something like --p-min-length and --p-max-length).
I could export them to an Excel sheet and filter them in Microsoft Office, but I would still prefer an easier way to do this within QIIME 2!
Could you implement that function in the next version?

yanxianl · October 16, 2019, 2:42pm

Hi,

I'm not aware of any QIIME 2 commands for filtering sequences by length. Depending on the primers used, the amplicon sizes may actually vary quite a bit. If the goal is to remove non-target sequences, you can try the q2-quality-control plugin, which allows you to filter sequences by alignment, say against a reference database to exclude non-bacterial sequences.

Hope that helps.

-Yanxian

sixvable · October 16, 2019, 3:34pm

Thank you @yanxianl
My sequences actually target a functional gene, so there is no reference database.
Some sequence lengths are far from my target amplicon size, so I am sure they are not my target sequences. I also checked them by BLAST against the NT reference database.
For now I am just trying to filter them using vsearch.

Nicholas_Bokulich · October 17, 2019, 2:16am

You can make one — it sounds like you already found one! This one:

It would just take a few representative sequences to do the trick, but the more the merrier.

That would do the same exact thing as what @yanxianl is recommending. With vsearch you will still presumably need some reference sequences to align against.

sixvable · October 17, 2019, 3:28am

Thanks for the advice @Nicholas_Bokulich

My target amplicon is bacterial amoA (ammonia monooxygenase subunit A). There is actually no well-performing reference dataset for it.

Here is my rep-seqs file, with only the singletons filtered out: rep-seqs-nosinglton.qzv (282.3 KB)
My target amplicon size is 452 bp, but as you can see there are still some rep-seqs shorter than 400 bp (some are only 180 bp).
I have assembled my own reference database, but I am not sure it will perform well. I will have to test and adjust it later. I don't want to take the risk of classifying my features and filtering by the result, because some real targets may be discarded due to incorrect annotation.
Filtering by rep-seqs length first seems lower risk to me.

Now I use vsearch to filter with this command:
vsearch --fastx_filter rep-seqs-nonsingleton.fasta --fastq_minlen 400 --fastaout file

Is there a better way to deal with this issue?
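
For readers who prefer a script to the vsearch call above, here is a minimal sketch of the same length filter in plain Python. It assumes the representative sequences have already been exported to a FASTA file. The names `read_fasta`, `filter_by_length`, and `filter_fasta_file` are made up for this example and are not part of QIIME 2 or vsearch. The `min_len` and `max_len` arguments mirror the --p-min-length / --p-max-length idea from the first post.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Record], min_len: int = 0, max_len: Optional[int] = None
) -> Iterator[Record]:
    """Keep records whose length is >= min_len and, if given, <= max_len."""
    for header, seq in records:
        if len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        yield header, seq


def filter_fasta_file(
    in_path: str, out_path: str, min_len: int = 0, max_len: Optional[int] = None
) -> List[str]:
    """Write length-filtered records to out_path and return the kept IDs."""
    kept: List[str] = []
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_by_length(read_fasta(src), min_len, max_len):
            dst.write(f">{header}\n{seq}\n")
            # The ID is the first whitespace-delimited token of the header.
            kept.append((header.split() or [""])[0])
    return kept
```

A call such as `filter_fasta_file("rep-seqs-nonsingleton.fasta", "filtered.fasta", min_len=400)` would keep only sequences of at least 400 bp (the output filename is a placeholder). The function returns the kept IDs because the feature table still contains the removed features. This sketch does not touch the feature table, so it would need to be filtered separately to match.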

Nicholas_Bokulich · October 17, 2019, 3:30am

Sure that works — I had assumed you were using vsearch to filter by alignment

I was not recommending that you classify taxonomically, nor is @yanxianl. The q2-quality-control plugin would BLAST your sequences against a set of reference sequences and discard them based on alignment quality.

Sounds like you have sorted things out with vsearch, though, and that's fine

sixvable · October 17, 2019, 4:03am

Thank you, Nick

Now I understand what you mean about using q2-quality-control to filter my sequences. I will try that later! Great solution!
I would still like length filtering to be implemented in a QIIME 2 plugin.
Also, I was banned a few months ago because of a crosslink. May I be unbanned now?

Nicholas_Bokulich · October 17, 2019, 4:09am

I agree that would be useful — any interest in contributing to the q2-vsearch plugin?

sixvable · October 17, 2019, 4:19am

Honestly, I do not have the ability to do this work. I barely know how to use Python.

system · November 17, 2019, 10:26am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.