rescript: how to handle gapped sequence data

devonorourke · August 10, 2020, 1:14pm

The get-ncbi-data function returns a FeatureData[Sequence] formatted object, correct? If the NCBI data contains any gap (-) characters, I'm wondering how to get rid of those with RESCRIPt/QIIME functions.

One option might be to use qiime rescript degap-seqs, but that action takes a different input type (FeatureData[AlignedSequence]). Perhaps @SoilRotifer could find a way to let that function accept both AlignedSequence and Sequence types?
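As a stopgap outside of QIIME 2, gap characters can be stripped from a plain FASTA file before importing it. The following is only a minimal sketch, not part of RESCRIPt: the function name `degap_fasta` is hypothetical, and it assumes that `-` and `.` are the only gap characters and that the sequences are already available as FASTA text.

```python
def degap_fasta(lines, gap_chars="-."):
    """Yield FASTA lines with gap characters removed from sequence lines.

    Header lines (starting with '>') are passed through unchanged, since
    identifiers and descriptions may legitimately contain '-' or '.'.
    Assumption: '-' and '.' are the only gap characters in the data.
    """
    # Translation table that deletes every character in gap_chars.
    table = str.maketrans("", "", gap_chars)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        elif line:
            degapped = line.translate(table)
            # Skip sequence lines that consisted only of gap characters.
            if degapped:
                yield degapped


# Small in-memory example; a real file could be passed as an open file handle.
example = [">seq-1 some description", "AC-GT--A", "..TT"]
cleaned = list(degap_fasta(example))
```

Here `cleaned` keeps the header as-is and holds the sequence lines without gaps. The same generator accepts an open file object, so a large FASTA file can be processed line by line without loading it all into memory.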

Thanks for the assistance!

Nicholas_Bokulich · August 10, 2020, 3:13pm

The sequences should not have any gaps, and QIIME 2 should prevent that action from saving a FeatureData[Sequence] artifact if there are any. Are you finding otherwise? If so, this is something we can fix in get-ncbi-data.

devonorourke · August 10, 2020, 3:36pm

I haven't received any data yet, and I'm hoping for the best.

Nevertheless, I'm raising this concern only because when I pulled COI data from BOLD, there were plenty of instances of gap characters in those sequences. I'll let you know about the sequence composition of NCBI data once I get it all downloaded.
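To check the sequence composition once a download finishes, one option is to scan the sequences as FASTA text and report which records contain gap characters. This is a rough sketch under assumptions: the function name `find_gapped_records` is hypothetical, `-` and `.` are taken to be the gap characters, and the record ID is taken to be the first whitespace-delimited token of the header line.

```python
def find_gapped_records(lines, gap_chars="-."):
    """Return {record_id: gap_count} for FASTA records that contain gaps.

    Assumptions: '-' and '.' are the gap characters, and the record ID is
    the first whitespace-delimited token after '>' on the header line.
    """
    gapped = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
        elif line and current_id is not None:
            # Count gap characters on this sequence line only.
            n = sum(line.count(c) for c in gap_chars)
            if n:
                gapped[current_id] = gapped.get(current_id, 0) + n
    return gapped


# Small in-memory example with one gapped and one clean record.
example = [">a first record", "AC-G", "T--", ">b", "ACGT"]
report = find_gapped_records(example)
```

An empty result means no gap characters were found. Otherwise the keys identify the records to inspect or clean before import.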

I'll also watch for any gap-related errors when downloading with get-ncbi-data. Thanks!