extracting reference reads for aligned sequences

Hello,

I am pretty new to QIIME, and to microbiome analysis as well.

I was following the ‘Training Feature Classifiers’ tutorial with an in-house reference database in aligned sequence format when I got this error message:

Invalid value for "--i-sequences": Expected an artifact of at least
type FeatureData[Sequence]. An artifact of type FeatureData[AlignedSequence]
was provided.

I looked into it and learned that 'qiime feature-classifier extract-reads' does not work on aligned sequences. Could you please guide me on how to proceed?

Thank you

Good morning @eraysahin,

Welcome to the forums! :qiime2:

You could rerun feature-classifier extract-reads on an unaligned version of your database.
Or you could 'un-align' your reads and pass those into the plugin. (Removing all the - from the reads should accomplish this. Let me know if you would like help crafting a sed command to remove all the dashes!)

Colin

Dear @colinbrislawn,

Thank you for your reply.

Unfortunately, I only have the aligned FASTA version of the reference (developed by a colleague), which is approx. 93 GB in size and therefore hard to manipulate. It contains both dots and dashes that I need to get rid of (I am attaching a screenshot of it), and I have been struggling to convert it to an unaligned version. I would be very happy if you could help me with the necessary ‘sed’ command.

Best regards,

Good morning!

OK, let’s dive in!

Let’s start with this detailed discussion about using sed to process fasta files:


With that in mind, let’s craft this command:

My initial thought was to replace all dashes with nothing:
sed 's/-//g' < input_aligned.fasta > output_unaligned.fasta

But… 1) that would also remove dashes in the headers, and 2) it would not remove the . characters in the file.

We could follow up with
sed 's/\.//g' < output_unaligned.fasta > output_unaligned_no_dots.fasta
… but that would also remove periods in the headers!


It’s time to get serious and upgrade to awk
:sunglasses: :scream_cat:

Here’s what I came up with:

awk '/^>/ {print;next} {gsub(/\.|-/,"")}1' < test.fasta > test.unaligned.fasta

awk '...'      processes your file one line at a time
/^>/           matches lines that start with >
{print;next}   prints lines that match, then skips to the next line
{gsub()}       globally substitutes characters like this:
               /pattern/,"replacement"
/\.|-/,""      in your case, . or - with nothing
1              an always-true pattern, so every remaining line is printed
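For comparison, the header problem from the sed attempts above can also be avoided in sed itself by putting an address in front of the substitution, so it only runs on lines that do not start with >. Here is a minimal sketch that runs both commands on a tiny made-up alignment (the file names, IDs, and sequences are all placeholders, not taken from the real database) and checks that they agree:

```shell
# Made-up wrapped alignment; the IDs and sequences are purely illustrative.
printf '>seq-1.a demo\nAC-G..T\n--TT.A\n>seq-2.b\n..GG--CC\n' > demo_aligned.fasta

# The awk command from this post: copy headers, strip . and - everywhere else.
awk '/^>/ {print;next} {gsub(/\.|-/,"")}1' < demo_aligned.fasta > demo_awk.fasta

# sed equivalent: on lines NOT starting with ">", delete every . and -
sed '/^>/!s/[.-]//g' < demo_aligned.fasta > demo_sed.fasta

# cmp exits non-zero if the two outputs differ; print the result if they match.
cmp demo_awk.fasta demo_sed.fasta && cat demo_awk.fasta
```

Either way the headers keep their dots and dashes, and the sequence lines stay wrapped exactly as they were in the input.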

Try it and let me know how it works!

Colin

@colinbrislawn

It worked, but the sequence lines were left unmerged; most probably the '\n's were kept. With the addition of some parameters:

awk 'NR==1 {print;next} /^>/ {ORS="\n"; print "\n"$0;next} {ORS=""; gsub(/\.|-/,"")}1' < alignedsequence.fasta > unaligned.fasta

It will take some time, but when I checked the first few sequences it seemed to be working.
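Since the full run is slow, it may be worth trying the command on a tiny file first. The sketch below uses a made-up alignment (file names, IDs, and sequences are placeholders) and adds one thing that is not in the command above: an END rule, so that the last sequence also ends with a newline. It then does two rough sanity checks, comparing header counts and looking for leftover gap characters.

```shell
# Made-up wrapped alignment; the IDs and sequences are purely illustrative.
printf '>seq-1.a demo\nAC-G..T\n--TT.A\n>seq-2.b\n..GG--CC\n' > demo_wrapped.fasta

# Same logic as the command above, plus an END rule (an addition, not part
# of the original command) that terminates the final sequence line.
awk 'NR==1 {print;next}
     /^>/  {ORS="\n"; print "\n"$0; next}
           {ORS=""; gsub(/\.|-/,"")}1
     END   {printf "\n"}' < demo_wrapped.fasta > demo_joined.fasta

# Check 1: the number of headers should be unchanged.
in_n=$(grep -c '^>' demo_wrapped.fasta)
out_n=$(grep -c '^>' demo_joined.fasta)

# Check 2: no sequence line should still contain . or -
# (grep -c exits non-zero when the count is 0, hence the "|| true").
gaps=$(grep -v '^>' demo_joined.fasta | grep -c '[.-]' || true)

echo "headers in: $in_n, headers out: $out_n, sequence lines with gaps: $gaps"
cat demo_joined.fasta
```

The same two grep checks can be pointed at the real input and output files once the long run has finished.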

Thank you so much for helping me with such details and clear explanations.

Best regards,

Great work Eray! :1st_place_medal:

Glad you got that awk script up and running. You can post again if you run into any issues or have more questions.

Colin
