How to filter sequences that have more than 8 identical consecutive nucleotides in a FASTQ file?

I want to filter out sequences that have more than 8 identical consecutive nucleotides (“GGGGGGGG”, “CCCCCCCC”, etc.) from my FASTQ files. How should I do that?
Thanks
Dawud

Hi Dawud,

I ask because I ran into something similar: are these sequences the barcodes that were read during the run (that is, what the Illumina MiSeq run actually produced, not the barcodes that were assigned)?

In my case it turned out to be an issue with the sequencing core and the run. I believe many of the reads came back with repeating G and C as barcodes, which suggests the run failed because of problems with the amplicon libraries.

Ben

Hello Dawud,

Cutadapt supports trimming repeated bases, which sounds like your goal.

Try using the cutadapt plugin and passing --p-adapter "G{8}" or --p-adapter "C{8}" and see if that does what you want.

(Not all cutadapt features are supported in the QIIME 2 plugin, but I think this one might work!)

Let me know what you find!
Colin

Thank you all for the comments. I ended up using the R code below, which solved my problem.

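library(ShortRead)  # provides FastqFile(), readFastq(), sread(), writeFastq()

fq <- FastqFile("/Users/path/to/file")

# Read all records into memory
reads_fq <- readFastq(fq)

# Keep only reads whose sequence does not contain a homopolymer run
# (as written, G matches at 8 bases; T, A, and C match at 9)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC", as.character(sread(reads_fq)), invert = TRUE)]

writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)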
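For anyone who would rather not load the whole file into memory, the same filter can be sketched in plain Python with no extra packages. This is only an illustration under a few assumptions: "more than 8" is read as a run of 9 or more identical bases, the FASTQ is uncompressed with exactly four lines per read, and the function names (`read_fastq`, `drop_homopolymer_reads`, `filter_fastq_file`) are made up for this example. Like the R code above, it drops the whole read rather than trimming it.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: "more than 8" means 9 or more identical bases in a row.
# ([ACGT]) captures one base; \1{8,} requires at least 8 more copies of it.
HOMOPOLYMER = re.compile(r"([ACGT])\1{8,}")

# One FASTQ record: header, sequence, separator ("+") line, qualities
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into 4-line records (assumes unwrapped FASTQ)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Zipping one iterator with itself four times yields consecutive
    # groups of four lines; an incomplete trailing group is dropped.
    yield from zip(stripped, stripped, stripped, stripped)


def drop_homopolymer_reads(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only records whose sequence has no run of 9+ identical bases."""
    for record in records:
        if HOMOPOLYMER.search(record[1]) is None:
            yield record


def filter_fastq_file(in_path: str, out_path: str) -> int:
    """Stream in_path to out_path, returning the number of reads kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for record in drop_homopolymer_reads(read_fastq(src)):
            dst.write("\n".join(record) + "\n")
            kept += 1
    return kept
```

It could be called as `filter_fastq_file("input.fastq", "filtered.fastq")`, where both file names are placeholders. To drop reads with runs of 8 or more instead, change `{8,}` to `{7,}`.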

Half of my samples’ FASTQ files were already processed by a company; the other half were not, because the company filed for bankruptcy and its service is no longer available.
I just want to follow their instructions and apply the same quality control to the remaining FASTQ files. They said that any sequence with more than 8 consecutive identical nucleotides was trimmed.
As for why they did that and where it comes from, I have no idea.
Thanks
Dawud
