Obtaining sequences per biological sample - version 2

rnasrah · March 25, 2020, 12:08am


Hello,
You helped me with this same question back in 2017 with the older QIIME, so apologies in advance for asking again, but I cannot seem to do it with qiime2-2020.2.

How can I get the following: a matrix of sample x feature frequencies, like the one shown in the attached photo?
So just to recap, my dream file would look something like the picture below (with the actual sequence in place of the feature ID, if possible).

Thanks so much once again,

Rima

jwdebelius · March 25, 2020, 12:19am

Hi @rnasrah,

There should be a flag, --p-no-hashed-feature-ids, that will let you keep your original sequences in the feature table.

Best,
Justine
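For context, a minimal sketch of what "hashed" means here. It assumes that the default feature IDs are MD5 hashes of each sequence, which is my understanding of how the denoising plugins (for example qiime dada2 denoise-paired) build them, not something confirmed in this thread. The helper name hashed_feature_id and the example sequence are made up for illustration.

```python
import hashlib


def hashed_feature_id(sequence):
    """Hypothetical helper: MD5 hex digest of a sequence string.

    Assumption: this mirrors how default feature IDs are derived.
    """
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Made-up sequence, only to show the shape of the resulting ID
example_sequence = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAG"
feature_id = hashed_feature_id(example_sequence)

# The ID is a fixed-length 32-character string, whatever the sequence length
print(feature_id, len(feature_id))
```

A hash cannot be turned back into the sequence, which is why the sequence has to come either from the flag above or from the representative sequences file.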

rnasrah · March 25, 2020, 1:19am

Hi Justine!
Where would I find this? In which command?

Basically I want to export this “table” and work on it in Excel outside of QIIME.

Thanks!

rnasrah · March 25, 2020, 6:35pm

Hello again,

So basically, I ran the following command:

qiime tools export \
  --input-path table.qza \
  --output-path exported-paired-table

Then I used biom convert to turn the exported table into a TSV file and opened it in Excel.

So now my Excel file gives me the frequency of each feature ID in each biological sample. However, instead of the feature ID, I would like to get the actual sequence. How can I do that?

Thanks so much!

jwdebelius · March 25, 2020, 6:53pm

Hi @rnasrah,

You’re on the right track, I think!

I haven’t tested this yet (but it’s on my list of things to do today), but basically, I think you need to convert the representative sequences into a table and then add that as metadata. Let me play around with it and I’ll get back to you soon.

Best,
Justine

jwdebelius · March 25, 2020, 9:44pm

The thing I thought was going to work didn’t work. So, I think I have a solution, but it’s somewhat inelegant and requires you to do some work in Python. There are three parameters in this script you need to change:

table_fp should be the path to your actual table
seq_fp should be the path to your actual sequences
out_fp should be the place you want to save the table with the sequences. This will be a tab-separated file for Excel.

import pandas as pd
from qiime2 import Artifact

table_fp = 'table.qza'  # the actual path to your table
seq_fp = 'seq.qza'  # the actual path to your sequences
out_fp = 'new_table.tsv'  # where you want to save the table with the sequences

# Load the feature table and transpose it so features are rows and samples are columns
table = Artifact.load(table_fp).view(pd.DataFrame).T
# Load the representative sequences as plain strings, indexed by feature ID
repseq = Artifact.load(seq_fp).view(pd.Series).apply(str)

# Append each feature's sequence as a final column, matched on feature ID
combined = pd.concat(axis=1, sort=False, objs=[table, repseq.loc[table.index]])
combined.rename(columns={0: "representative_sequence"}, inplace=True)

combined.to_csv(out_fp, sep='\t')

You can run it by opening an IPython interpreter in your terminal (type ipython) and then running each line of code, updating the paths for the three fp values.

Let us know how it works!

Best,
Justine
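To see what the concat and rename steps in the script above do without needing any QIIME 2 artifacts, here is a small sketch on toy data. The feature IDs, sample names, counts and sequences are all invented; the two objects only stand in for what the Artifact.load(...).view(...) calls would return.

```python
import pandas as pd

# Toy stand-in for the transposed feature table (features as rows)
table = pd.DataFrame(
    {"sampleA": [10, 3], "sampleB": [0, 7]},
    index=["feat1", "feat2"],
)

# Toy stand-in for the representative sequences: an unnamed Series keyed
# by feature ID, in a different order and with one extra feature
repseq = pd.Series({"feat2": "GGCC", "feat1": "ACGT", "feat3": "TTTT"})

# Same two lines as in the script: align on feature ID, then name the column
combined = pd.concat(axis=1, sort=False, objs=[table, repseq.loc[table.index]])
combined.rename(columns={0: "representative_sequence"}, inplace=True)

print(combined)
```

Note that the sequence ends up as the last column, after all the sample columns, and the feature ID stays in the first column.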

rnasrah · March 26, 2020, 12:49am

Thanks Justine.
So I tried to do what you said: I ran ipython in my terminal window and ran the first command, "import pandas as pd", and I got an error message (please see attached).

rnasrah · March 26, 2020, 12:50am

Hi again,

I was thinking my other option would be to get the rep-seqs.qza file into Excel and then match the sequences/feature IDs from that file with the frequencies in the other file (e.g. in R). Is there a way I can do that, i.e. convert my rep-seqs.qza file? Please see the attached picture of the part of the rep-seqs.qza file I would like to extract.

Thanks so much!
Rima
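The matching idea described here can be sketched in plain Python, without QIIME 2 installed, once both artifacts have been exported. This is only an illustration under some assumptions: that the sequences are exported as a FASTA file, that the table TSV from biom convert starts with a "# Constructed from biom file" comment line followed by a "#OTU ID" header, and the function names read_fasta and swap_ids_for_sequences are made up. The toy strings below stand in for the real files.

```python
import csv
import io


def read_fasta(handle):
    """Return {feature_id: sequence} from an open FASTA file handle."""
    seqs, current_id, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                seqs[current_id] = "".join(chunks)
            current_id = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        seqs[current_id] = "".join(chunks)
    return seqs


def swap_ids_for_sequences(table_handle, seqs):
    """Replace the feature ID in the first column of a TSV with its sequence."""
    rows = []
    for row in csv.reader(table_handle, delimiter="\t"):
        if not row or row[0].startswith("# Constructed"):
            continue  # skip blank lines and the leading comment line
        if row[0].startswith("#OTU ID"):
            row[0] = "sequence"  # rename the header cell
        else:
            row[0] = seqs[row[0]]  # raises KeyError if an ID has no sequence
        rows.append(row)
    return rows


# Toy stand-ins for the exported FASTA and the converted TSV
fasta_text = ">feat1\nACGT\nTTGA\n>feat2\nGGCC\n"
table_text = (
    "# Constructed from biom file\n"
    "#OTU ID\tsampleA\tsampleB\n"
    "feat1\t10.0\t0.0\n"
    "feat2\t3.0\t7.0\n"
)

seqs = read_fasta(io.StringIO(fasta_text))
rows = swap_ids_for_sequences(io.StringIO(table_text), seqs)
for row in rows:
    print("\t".join(row))
```

With real files, the two io.StringIO objects would be replaced by open(...) calls on the exported FASTA and TSV paths.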

jwdebelius · March 26, 2020, 2:10am

Hi @rnasrah,

You need to run the code in your qiime2 environment. So, conda activate qiime2-… and then open ipython and run the code.

I looked at exporting the rep seqs to a table, and they come out as a FASTA file in the current setup, which unfortunately messed up my original idea.

Best,
Justine

rnasrah · March 26, 2020, 2:42am

Hi again Justine,

So I ran the code in ipython (see attached) and I still get the same frequency matrix: feature ID x biological samples.

But instead of feature ID, I would like the actual sequence.

Attached is a copy of my code and of the output file. Did I miss anything?

Thanks so much!

jwdebelius · March 26, 2020, 4:08pm

Hi @rnasrah,

So, it ran correctly, but this adds the sequence at the end as metadata. You could take the last column (the sequence) in Excel and move it over, or add one line to the Python script, the set_index call, between the rename and the to_csv:

combined.rename(columns={0:"representative_sequence"}, inplace=True)
combined.set_index("representative_sequence", inplace=True)
combined.to_csv(out_fp, sep='\t')

@Nicholas_Bokulich also suggested this thread, because he thinks more qiime-o-matically than I do. So, I may have sent you on a wild goose chase (sorry).

Best,
Justine
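A small sketch of the effect of that extra set_index line, again on invented toy data rather than real artifacts. It writes to an in-memory buffer instead of out_fp so it can be run anywhere.

```python
import io

import pandas as pd

# Toy stand-in for the combined table after the rename step
combined = pd.DataFrame(
    {
        "sampleA": [10, 3],
        "sampleB": [0, 7],
        "representative_sequence": ["ACGT", "GGCC"],
    },
    index=["feat1", "feat2"],
)

# The sequence column becomes the row labels, replacing the feature IDs
by_sequence = combined.set_index("representative_sequence")

# Write as tab-separated text, as the script does with out_fp
buffer = io.StringIO()
by_sequence.to_csv(buffer, sep="\t")
tsv_text = buffer.getvalue()
print(tsv_text)
```

In the written file the sequence is now the first column and the feature IDs are gone. If the IDs should be kept as well, one option is to leave the index alone and reorder the columns instead.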

rnasrah · March 26, 2020, 5:14pm

Thanks so much Justine and Nicholas!

Sorry, my bad, I did not notice the last column! It worked like you suggested, so thanks once again!
