extracting reference reads for aligned sequences

eraysahin · March 7, 2020, 8:11pm

Hello,

I am fairly new to QIIME 2, and to microbiome analysis in general.

I was following the 'Training Feature Classifiers' tutorial with an in-house reference database that is in aligned sequence format, and I got this error message:

Invalid value for "--i-sequences": Expected an artifact of at least
type FeatureData[Sequence]. An artifact of type FeatureData[AlignedSequence]
was provided.

I looked into it and learned that 'qiime feature-classifier extract-reads' does not work on aligned sequences. Could you please guide me on how to proceed?

Thank you

colinbrislawn · March 9, 2020, 1:33pm

Good morning @eraysahin,

Welcome to the forums! :qiime2:

You could rerun feature-classifier extract-reads on an unaligned version of your database.
Or you could 'un-align' your reads and pass those into the plugin. (Removing all the - from the reads should accomplish this. Let me know if you would like help crafting a sed command to remove all the dashes!)

Colin

eraysahin · March 9, 2020, 2:20pm

Dear @colinbrislawn,

Thank you for your reply.

Unfortunately, I only have the aligned FASTA version of the reference (developed by a colleague), which is approx. 93 GB in size and therefore hard to manipulate. There are both dots and dashes that I need to get rid of (I am attaching a screenshot), and I have been struggling to convert it to an unaligned version. I would be very happy if you could help me run the necessary ‘sed’ command.

Best regards,

colinbrislawn · March 9, 2020, 3:04pm

Good morning!

OK, let's dive in!

Let's start with this detailed discussion about using sed to process fasta files:

github.com/McMahonLab/TaxAss/blob/master/tax-scripts/TaxAss_Directions.Rmd#L301-L319

### Format your file names

If you are using filenames other than the default `otus.fasta` and `otus.abund` the filenames:

- **must contain exact extensions `.fasta` and `.abund`**
  This is necessary for running the batch script, otherwise it will not be able to parse the names.
- **cannot contain any white space**
  No spaces in file names, use CamelCase, underscores, or periods.

___________

### Format your sequence ID's

The seqIDs in your otus.fasta and otus.abund files must follow these requirements:

- **cannot contain any whitespace**

With that in mind, let's craft this command:

My initial thought was to replace all dashes with nothing:
sed 's/-//g' < input_aligned.fasta > output_unaligned.fasta

But... 1) that would also remove dashes in the headers, and 2) that would not remove the . characters in the file.

We could follow up with
sed 's/\.//g' < output_unaligned.fasta > output_unaligned_no_dots.fasta
... but that would also remove periods in the headers!
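
For reference, sed can also be told to skip header lines by putting an address in front of the substitution, which avoids both problems in a single pass. The following is only a minimal sketch on a tiny made-up alignment (the scratch directory, file names, and sequences are illustrative assumptions, not the real database). The `/^>/!` prefix restricts the `s` command to lines that do not start with `>`.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Tiny hypothetical aligned FASTA; the headers deliberately contain '-' and '.'.
printf '%s\n' '>ref-1.a' 'AC-G..T' '>ref-2.b' '..GG-C-' > "$workdir/aligned.fasta"

# '/^>/!' applies the substitution only to lines that do NOT start with '>',
# and '[.-]' matches either gap character.
sed '/^>/!s/[.-]//g' < "$workdir/aligned.fasta" > "$workdir/unaligned.fasta"

# Show the result: headers are unchanged, sequence lines have no gaps.
sed_result=$(cat "$workdir/unaligned.fasta")
printf '%s\n' "$sed_result"
```

Like the plain sed commands above, this works line by line, so a sequence wrapped over several lines stays wrapped.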

It's time to get serious and upgrade to awk

Here's what I came up with:

awk '/^>/ {print;next} {gsub(/\.|-/,"")}1' < test.fasta > test.unaligned.fasta

awk ''         will process your file one line at a time
/^>/           matches lines that start with >
{print;next}   prints lines that match, then goes to the next line
{gsub()}       globally substitutes characters like this:
               /pattern/,"replacement"
/\.|-/,""      in your case, . or - with nothing ""
1              an always-true pattern, so each remaining (sequence) line is printed after the substitution
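
To see what the command does before running it on the full file, it can be tried on a few lines first. The sketch below uses a tiny made-up alignment (the scratch directory and the sequences are assumptions for illustration only) and runs the same awk program on it.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Hypothetical aligned FASTA; the first record is wrapped over two lines.
printf '%s\n' '>seq-1.a' 'AC-G..T' '..TT-A' '>seq-2' 'G.G-C' > "$workdir/test.fasta"

# Same awk program as above: header lines pass through untouched,
# every other line has '.' and '-' removed before being printed.
awk '/^>/ {print;next} {gsub(/\.|-/,"")}1' < "$workdir/test.fasta" > "$workdir/test.unaligned.fasta"

awk_result=$(cat "$workdir/test.unaligned.fasta")
printf '%s\n' "$awk_result"
# Prints:
# >seq-1.a
# ACGT
# TTA
# >seq-2
# GGC
```

The dash and dot in the `>seq-1.a` header survive, and the gap characters are gone. The wrapped record still occupies two lines, because the command removes characters but does not join lines.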

Try it and let me know how it works!

Colin

eraysahin · March 9, 2020, 5:51pm

@colinbrislawn

It worked, but the sequence lines of each record were left unmerged, most probably because the '\n's were kept. With the addition of some parameters:

awk 'NR==1 {print;next} /^>/ {ORS="\n"; print "\n"$0;next} {ORS=""; gsub(/\.|-/,"")}1' < alignedsequence.fasta > unaligned.fasta

It will take some time, but when I checked the first few sequences it seemed to be working.
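
A quick sanity check along the following lines can confirm the result beyond eyeballing the first few sequences. This is only a sketch on a tiny made-up file (the scratch directory and sequences are assumptions). It compares the number of header lines before and after, and counts sequence lines that still contain a gap character. The one-liner above does not print a newline after the final sequence, so the sketch adds an END rule for that. The END rule and the multi-line layout are additions, not part of the original command.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Hypothetical aligned FASTA with a wrapped record, standing in for the real file.
printf '%s\n' '>seq-1.a' 'AC-G..T' '..TT-A' '>seq-2' 'G.G-C' > "$workdir/alignedsequence.fasta"

# Same logic as the one-liner above, written over several lines, plus an
# END rule (an addition) so the last sequence is terminated by a newline.
awk '
  NR==1 {print; next}
  /^>/  {ORS="\n"; print "\n" $0; next}
        {ORS=""; gsub(/\.|-/,""); print}
  END   {printf "\n"}
' < "$workdir/alignedsequence.fasta" > "$workdir/unaligned.fasta"

# Check 1: the number of records (header lines) is unchanged.
headers_in=$(grep -c '^>' "$workdir/alignedsequence.fasta")
headers_out=$(grep -c '^>' "$workdir/unaligned.fasta")

# Check 2: no sequence line still contains a gap character.
# grep -c exits non-zero when the count is 0, hence the '|| true'.
gap_lines=$(grep -v '^>' "$workdir/unaligned.fasta" | grep -c '[.-]' || true)

echo "records in: $headers_in, records out: $headers_out, lines with gaps left: $gap_lines"
merged_result=$(cat "$workdir/unaligned.fasta")
```

On the real 93 GB file each of these checks reads the whole file again, so they will also take a while.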

Thank you so much for helping me with such details and clear explanations.

Best regards,

colinbrislawn · March 9, 2020, 5:54pm

Great work Eray!

Glad you got that awk script up and running. You can post again if you run into any issues or have more questions.

Colin

system · April 10, 2020, 1:41am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.