picrust2 input sequences

Good afternoon,
I am supposed to run a PICRUSt2 analysis (note that my analysis is based on OTUs, not on ASVs). Since I have already run into one issue, I was wondering whether I am doing this correctly.
I am starting from a FASTA file (.fasta) produced by dereplicating the sequences:

``joined_import_filter_derep:
	export HDF5_USE_FILE_LOCKING='FALSE'; \
	$(CONDA_ACTIVATE) Miqiime2-2021.8; \
	qiime vsearch dereplicate-sequences \
		--i-sequences fil_joined.qza \
		--o-dereplicated-table table.qza \
		--o-dereplicated-sequences rep-seqs.qza

joined_import_filter_derep_seq_unzip:
	unzip rep-seqs.qza -d rep-seqs``

Those rep-seqs look something like this:

>6a3eea7fb1f9e169b134fc27428e50177e2e9c5f A.join.fastq.gz_12564

CCTACGGGTGGCAGCAGTAGGGAATCTTCCACAATGGGCGAAAGCCTGATGGAGCAACGCCGCGTGGGTGAAGAAGGTCTTCGGATCGTAAAACCCTGTTGTTAGAGAAGAAAGTGCGTGAGAGTAACTGTTCACGTTTCGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGTAGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAACAGGATTAGATACCCCAGTAGTC

>24ce437aa9000774104de794963f862c598c7a42 B.fastq.gz_7568

The spaces in the headers cause PICRUSt2 to fail with an error.
However, I am not sure whether doing something like

awk '{print $1 }' FASTA_IN > FASTA_OUT

as suggested here:
picrust_input

is correct in my case. My doubt is the following: my FASTA file contains data for all the samples, so if I exclude the sample name from the header, will the analysis still be correct?

I thank you very much,

Michela

Hello,

Replace the spaces with underscores (_) to preserve the information. Everything after the first space in a FASTA header is traditionally treated as a comment, not as part of the sequence ID itself.
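A minimal sketch of that replacement, assuming a standard FASTA layout in which header lines start with ">". The file names and the shortened sequences are made up for illustration; only the two header lines are taken from the question. Note that the whole underscore-joined string then becomes the sequence ID, so it would only match a table whose feature IDs carry the same names.

```shell
# Build a tiny example FASTA (sequences are shortened placeholders).
cat > example.fasta <<'EOF'
>6a3eea7fb1f9e169b134fc27428e50177e2e9c5f A.join.fastq.gz_12564
CCTACGGGTGGCAGCAG
>24ce437aa9000774104de794963f862c598c7a42 B.fastq.gz_7568
TACGTAGGTGGCAAGCG
EOF

# On header lines only (those starting with ">"), turn every space into "_".
# Sequence lines are printed unchanged.
awk '/^>/ { gsub(/ /, "_") } { print }' example.fasta > example_underscore.fasta

# Show the result.
cat example_underscore.fasta
```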

Cheers,
V


Hi, I do not think this would work, since the sequences would no longer have the same names as in the BIOM file.
Could you please explain in more detail?

My point is: the sequence names in the FASTA file should match those in the BIOM file. However, since I have a multi-FASTA file covering multiple samples, I should be able to trace back both the sequence (without changing its name) and the sample, right?

Thanks a lot

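In that case you can remove the comments (everything after the first space) entirely. The BIOM table stores only the hashes from the FASTA headers as feature IDs, not the comments, so the per-sample counts remain available in the table.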
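A rough sketch of stripping the comments and then checking that the remaining IDs match the table, assuming header lines start with ">". All file names here are hypothetical, the sequences are shortened placeholders, and table_ids.txt stands in for a list of feature IDs taken from the BIOM table by whatever means is convenient.

```shell
# Build a tiny example FASTA (sequences are shortened placeholders).
cat > example.fasta <<'EOF'
>6a3eea7fb1f9e169b134fc27428e50177e2e9c5f A.join.fastq.gz_12564
CCTACGGGTGGCAGCAG
>24ce437aa9000774104de794963f862c598c7a42 B.fastq.gz_7568
TACGTAGGTGGCAAGCG
EOF

# Keep only the first whitespace-delimited field of each header line,
# which drops the comment; sequence lines are printed unchanged.
awk '/^>/ { print $1; next } { print }' example.fasta > example_ids_only.fasta

# Collect the resulting sequence IDs (without the leading ">"), sorted.
grep '^>' example_ids_only.fasta | sed 's/^>//' | sort > fasta_ids.txt

# Hypothetical list of feature IDs as they would appear in the table.
printf '%s\n' \
  24ce437aa9000774104de794963f862c598c7a42 \
  6a3eea7fb1f9e169b134fc27428e50177e2e9c5f | sort > table_ids.txt

# comm -3 prints IDs found in only one of the two lists;
# no output means the FASTA IDs and the table IDs agree.
comm -3 fasta_ids.txt table_ids.txt
```

The same awk command also works without the header test, as in the question, because sequence lines contain no spaces; restricting it to header lines just makes the intent explicit.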

Cheers
V


Hi @MichelaRiba,

I just wanted to add that you can make use of q2-picrust2, which has just been updated to work with qiime2-2023.2. The tutorial is here.

It does not offer as much flexibility as the native PICRUSt2 tool, but you can start with the plugin and then export the outputs for any further in-depth analyses.


Thanks a lot for the suggestion!

Michela
