Merging seqs.fna from multiple projects

emescioglu · June 4, 2018, 10:11pm

Hi all,

My ultimate goal is to use SourceTracker to identify likely sources of the organisms in my samples.

I'm currently downloading raw data from both Qiita and MG-RAST, and I'm wondering when the best time to merge all of the data is. Should I process all of the different datasets separately up to a certain step or should I merge all seq.fna files in the beginning and run them all through QIIME at once?

Another question, some datasets have a separate file for each sample, and these look like:

NDMS001_0 M00949:38:000000000-A49F4:1:1101:17896:2538 1:N:0:GAATACCAAGTC orig_bc=GAATACCAAGTC new_bc=GAATACCAAGTC bc_diffs=0
TACGAAAGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAetc

Is there a command I can use to combine files of the different samples into one file?

Please let me know if this is already being discussed elsewhere!

Thank you!

Esra

emescioglu · June 5, 2018, 10:33pm

I see this is queued, thanks!

After some time away from my data, I realized that the separate files just mean they are already demultiplexed, which is probably where I want to be anyway!

Next/replacement question:

Data from some projects are fna files and data from others are fastq files. Since I can't convert the fna into fastq (no quality files), I assume I should convert the fastq files to fna files, put them all into one directory, and import them as demultiplexed sequences? If yes, can you please remind me what the right command would be? If no, please let me know what would be better!

Also, since I don't have the quality scores... how will I know where to trim the sequences later on?

Thanks!
Esra

ebolyen · June 6, 2018, 11:15pm

Hi @emescioglu!

Perhaps. You mention these files are called seqs.fna, in which case they are probably in the post-split-libraries format from QIIME 1. That means that while the reads are "demultiplexed", they are all still contained in the same file. What you probably have are demultiplexed reads from many different studies.
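
As an illustration of that format, here is a minimal Python sketch that pulls the sample ID out of a QIIME 1 style sequence label of the form SampleID_SeqNumber. The function names are hypothetical, and it assumes header lines start with ">" as in standard FASTA (the pasted example in this thread shows the label without it).

```python
from collections import Counter


def sample_id_from_label(label):
    """Return the sample ID from a 'SampleID_SeqNumber ...' sequence label."""
    # The first whitespace-delimited token is 'SampleID_SeqNumber'.
    first_token = label.lstrip(">").split()[0]
    # Split on the last underscore so the trailing read counter is dropped.
    sample_id, _, _ = first_token.rpartition("_")
    return sample_id


def reads_per_sample(lines):
    """Count reads per sample, treating lines that start with '>' as headers."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            counts[sample_id_from_label(line)] += 1
    return counts


# Label copied from the example earlier in the thread.
label = ("NDMS001_0 M00949:38:000000000-A49F4:1:1101:17896:2538 "
         "1:N:0:GAATACCAAGTC orig_bc=GAATACCAAGTC new_bc=GAATACCAAGTC bc_diffs=0")
print(sample_id_from_label(label))
```

Running reads_per_sample over an open file handle gives a quick check of which samples a single combined file actually contains.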

Here's what I would recommend doing. Since you are dealing with multiple studies, work on merging at the feature-table level rather than the raw sequences.

This will let you perform a standard analysis on each study, and then merge the results later.

However, you must make sure that the end result is that each table represents the same "amplicon". Otherwise you will need to merge something more abstract like GG OTU or Taxonomy depending on your goal.
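
To make "merging at the feature-table level" concrete, here is a minimal Python sketch. It assumes each table is a plain nested dictionary of feature ID to per-sample counts, which is a simplification for illustration and not the BIOM format or QIIME's own implementation. The study and sample names are made up.

```python
def merge_tables(*tables):
    """Merge feature tables, summing counts when a feature/sample pair repeats."""
    merged = {}
    for table in tables:
        for feature_id, sample_counts in table.items():
            # A feature seen in several studies ends up as a single row.
            row = merged.setdefault(feature_id, {})
            for sample_id, count in sample_counts.items():
                row[sample_id] = row.get(sample_id, 0) + count
    return merged


# Hypothetical tables from two studies that share the feature "ACGT".
study_a = {"ACGT": {"air1": 5}, "TTGA": {"air1": 2}}
study_b = {"ACGT": {"soil1": 7}, "GGCC": {"soil1": 1}}

merged = merge_tables(study_a, study_b)
print(merged)
```

Because the rows are keyed by feature ID, a feature shared between studies appears once in the result, which only makes sense if the IDs mean the same thing in every table (same amplicon, same region).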

Could you describe your goal a bit and what amplicons you are dealing with? Merging data together can be a bit touchy. Also, any particular reason you aren't using Qiita for the meta-analysis? It was designed to handle this (and it should be easy if your data is already mostly from Qiita).

emescioglu · June 6, 2018, 11:29pm

Thanks for the response.

There is no good reason that I'm not using Qiita. I became overwhelmed by all of the options and ways to download and process data, and so I decided to get all of my data from the European Nucleotide Archive and MG-RAST. However, I'm not opposed to scrapping everything I've done and moving over to Qiita if that is what you recommend.

I'm working with 16S, and my ultimate goal is to have an OTU table containing all of my own data (air samples) and data I've taken from other projects to use with SourceTracker. I want to know what proportion of my samples come from different sources (marine, human, desert soils, etc.).

So far I have been able to import some of the data to qiime using:

qiime tools import --type 'SampleData[SequencesWithQuality]' --input-path [filename] --source-format CasavaOneEightSingleLanePerSampleDirFmt --output-path [path]

Would this work on the fasta files (no .quals for fastq conversion) too?

emescioglu · June 6, 2018, 11:53pm

I should add that Qiita doesn't have all of the data that I need, and I would still like to use some data from MG-RAST. If I were to get all of the data from MG-RAST into a biom table, and then pull some of the Qiita data as well (it seems they are already in biom tables), what would be the easiest way for me to merge my feature tables at this step to ensure none of the features are repeated?

emescioglu · June 7, 2018, 6:28pm

I forgot to clarify, the files are not called 'seqs.fna'. I misspoke. They are called "sampleID.fna" and there is one file for each sample. These are the non-Qiita files that I would like to process with Qiime and then merge with other feature tables.

These are the files that look like
NDMS001_0 M00949:38:000000000-A49F4:1:1101:17896:2538 1:N:0:GAATACCAAGTC orig_bc=GAATACCAAGTC new_bc=GAATACCAAGTC bc_diffs=0
TACGAAAGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAetc

Since these are fna, and not fastq, is there a specific way to import them to Qiime since they have no quality files?

emescioglu · June 7, 2018, 10:04pm

I have completely scrapped all things not-Qiita.

This is where I am now.

I have several biom tables from different studies (16S V4, trimmed to 100bp), and I would like to combine them all into 1 biom table to feed into SourceTracker.

Questions:

1. Can I just use merge_otu_tables.py (qiime1), or the qiime 2 version of this command, to combine all the tables from various studies?
2. There are multiple Deblur output biom files in Qiita, but I assume I should use the reference-hit.biom files as the input for merging (question 1).

Sorry about sending 4 consecutive messages with different questions, which have been changing as I have been poking around.

Thanks!

ebolyen · June 8, 2018, 11:52pm

Hi @emescioglu,

We're clearly both fans of rapid iteration.

Let me know if I'm answering the relevant questions:

Yes it sounds like your data is all of the same region of the same amplicon, so any comparisons between studies will be "fair" w.r.t. the sequence itself.

That sounds right, but I'm not a Qiita expert. Would @antgonza or @wasade be able to confirm? Is it possible to use SourceTracker in Qiita (or at least download the merged table)?

wasade · June 9, 2018, 7:40pm

Hi @emescioglu,

Recommend the reference-hit.biom files. These have been passed through the positive filter.

On an aside, you may be interested in the command line interface to query redbiom, which is a caching layer on top of Qiita. It allows for rapid search and fetch of BIOM table data and metadata from Qiita. (@BenKaehler and I have a draft of a QIIME2 community tutorial on its use but I haven't had a window yet to finish it off...)

Best,
Daniel

emescioglu · June 10, 2018, 5:58am

Thank you, Daniel! I will check out redbiom and look forward to the tutorial!

Esra

emescioglu · June 11, 2018, 4:25pm

Hi Daniel,

I figured I would reply directly to you since you are more familiar with Qiita and the output files.

The reference-hit.biom files have 100 basepair sequences (because I trimmed to 100) as their OTU IDs. After I merge these files from different studies together, how do I assign taxonomy to the sequences so that all of the sequences from the same organism are combined and I don't have the same organism in there more than once? To assign taxonomy in Qiime2, I need a biom table (generated) and a "rep-seq" file (not generated by Qiita)? Perhaps assigning taxonomy isn't even the best way to accomplish this goal. What do you think?

The reason I'm not just merging the results from the "Pick close-reference OTU" biom tables is that I lose a lot of my own data with this method. As an alternative to the taxonomy assignment, can I combine my data with other studies before the OTU picking step, download it onto my computer, and pick OTUs with my preferred method (open-reference)? If yes, which files? I suspect seqs.fna (preprocessed) produced by the demultiplex step...

Thank you!

Esra

wasade · June 12, 2018, 12:57am

Hi @emescioglu,

You should be able to either export the OTU IDs from the BIOM table and create a FASTA file, or download the reference-hit.seqs.fa from Qiita, and assign taxonomy off that. If you really want to collapse your OTUs so only a single taxon is represented, I believe you can summarize/collapse by taxonomy but I don't know offhand how to do this in QIIME2 -- @ebolyen, do you know? However, I suspect you most likely want to operate on your table at the highest level of specificity, which would be the actual sequences.
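
Since the feature IDs in these tables are the sequences themselves, turning them into a FASTA file is mostly a formatting step. A minimal Python sketch, assuming you already have the IDs as a list of strings (reading them out of the BIOM file is not shown, and the function name is hypothetical):

```python
def feature_ids_to_fasta(feature_ids):
    """Build FASTA text where each record's label and sequence are the same string."""
    records = []
    for sequence in feature_ids:
        # Deblur-style feature IDs are the sequences, so the ID doubles as the label.
        records.append(f">{sequence}\n{sequence}\n")
    return "".join(records)


# Hypothetical, shortened feature IDs (real ones would be 100 bases long here).
example_ids = ["TACGAAAGG", "TACGGAGGG"]
fasta_text = feature_ids_to_fasta(example_ids)
print(fasta_text, end="")
```

Writing fasta_text to a file gives a representative-sequences file whose labels match the rows of the table exactly.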

I don't understand the comment about merging pick close-reference results. Are you also using data somewhere that are based off a closed reference pipeline? You can merge Deblur output files but you do want to make sure the data are of the same sequence length and primer. You cannot directly merge Deblur and closed reference results. I'm not sure I understand why open-reference would be preferred here, could you elaborate?
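
The sequence-length part of that check is easy to automate when the feature IDs are the sequences. A minimal Python sketch, again assuming tables are plain dictionaries keyed by sequence (a simplification, not the BIOM format). It says nothing about primers, which have to be confirmed from each study's preparation information.

```python
def feature_lengths(*tables):
    """Return the set of feature-ID lengths found across all tables."""
    return {len(feature_id) for table in tables for feature_id in table}


def same_length(*tables):
    """True when every feature ID in every table has one common length."""
    return len(feature_lengths(*tables)) == 1


# Hypothetical tables: study_c was trimmed to a different length than the others.
study_a = {"ACGTAC": {"air1": 5}}
study_b = {"TTGACA": {"soil1": 3}}
study_c = {"ACGT": {"sea1": 9}}

print(same_length(study_a, study_b))
print(feature_lengths(study_a, study_b, study_c))
```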

Best,
Daniel

emescioglu · June 13, 2018, 3:28pm

Thanks Daniel!

All of the data I'm using is from Qiita. I wasn't sure if I should use the Deblur output files OR the closed reference outputs, which is why I went on a tangent about the closed reference otu picking. A large portion of my data is lost when I use closed-reference results, as opposed to open-reference, which is why I mentioned earlier that I prefer open-reference.

Again, my ultimate goal is to have a merged biom table including several studies in which I only have 1 sequence for the same taxon.

It sounds like I should

1. merge Deblur results (making sure they are the same sequence length and primer)
2. import to Qiime as FrequencyTable and assign taxonomy using biom table from 1. and reference-hit.seqs.fa
3. consider collapsing taxonomy using QIIME
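
For the collapsing step, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table is a plain nested dictionary, the taxonomy is a dictionary of semicolon-separated lineage strings, and the feature IDs and lineages below are made up. It is not QIIME's implementation.

```python
def collapse_by_taxonomy(table, taxonomy, level):
    """Sum the counts of all features that share the first `level` ranks."""
    collapsed = {}
    for feature_id, sample_counts in table.items():
        ranks = [rank.strip() for rank in taxonomy[feature_id].split(";")]
        # Truncate the lineage to the requested depth and use it as the new row ID.
        taxon = ";".join(ranks[:level])
        row = collapsed.setdefault(taxon, {})
        for sample_id, count in sample_counts.items():
            row[sample_id] = row.get(sample_id, 0) + count
    return collapsed


# Hypothetical merged table and taxonomy assignments.
table = {"ACGT": {"air1": 5}, "ACGA": {"air1": 2, "soil1": 4}, "TTGA": {"soil1": 1}}
taxonomy = {
    "ACGT": "k__Bacteria; p__Proteobacteria",
    "ACGA": "k__Bacteria; p__Proteobacteria",
    "TTGA": "k__Bacteria; p__Firmicutes",
}

collapsed = collapse_by_taxonomy(table, taxonomy, level=2)
print(collapsed)
```

Collapsing discards the sequence-level resolution, so it is worth keeping the uncollapsed table as well.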

Thank you!

Esra