Hello,
I need to generate a mapping file for submitting my community sequence data to NCBI. The mapping file should have each sequence ID followed by the samples in which that sequence appears (tab delimited format). Would you please provide the right command for that? I couldn't find anything in the forum or in any of the main tutorials for this.
This is what I'm grappling with now. You have to submit your samples to have NIH generate a PRJNA# for a BioProject, then a SAMN# for each BioSample.
The work flow looks like this:
Bioproject -> Biosamples -> SRA samples
The sequence ID is ultimately generated by NIH. Ask me questions; I'm about the only person among several labs who knows the process (and so I end up having to submit everything to SRA). Ben
Hi Ben.
I have 20+ environmental samples from an experiment where each feature was found in multiple samples. In my correspondence with an NCBI rep, they didn't mention anything about SRA. This is what they wrote:
[1] You should indicate that this submission is part of a large scale
study within the wizard so that you can include and register your BioProject
and BioSamples within a single submission.
[2] You can submit your 20 samples as part of this single BioProject.
Within the submission wizard, you will need to upload a mapping file that
lists each sequence ID within your fasta file along with the corresponding
BioSample that should be included with that sequence. You can only include
a single BioSample per sequence.
If some of these sequences were derived from or found in multiple environmental
sources, you will need to provide a tab-delimited mapping file that lists
each sequence ID along with the corresponding BioSamples in a comma-separated
list, for example:
Yes, ok this is the NIH SRA. I think the admin is asking you for everything so you can submit everything and generate the bio project # and bio sample # all together.
Again the flow goes like this:
Bioproject -> Biosamples -> SRA samples
Most microbiome data is 16S => metagenome
Do not submit any BioSamples at this time. Submit the BioProject and wait for a PRJNA number (it will be PRJNAXXXXXXX)
XXXXXXX = series of numbers
Once you have a PRJNA number, this number should go into your publication.
Biosample:
So the BioSample submission is annoying
You need to use their metagenome file to submit your 20 samples
The submission requires specific wording for it to go through
For example, for human microbiome, you need to say the host is "Homo sapiens" and cannot deviate from this
Another weird thing is that they require you to submit the GPS coordinates of where the samples were collected. You actually don't need to do this; the field can be filled in with "missing", which is what I usually do.
Submit these BioSamples. NIH will generate a SAMNXXXXX for each sample (XXXXX being a series of numbers). This is going to be important because you are going to want to submit forward/reverse files per sample.
Sequence Read Archives (SRA)
Here is where you submit ANOTHER file, which associates your SAMN# with the samples you are submitting to the NIH SRA. This connects each sample's SAMN# to that sample's forward/reverse .fastq files.
The file contains specific details about your sequence files, such as amplicon, 16S rRNA, Illumina MiSeq/HiSeq, etc. You will likely need all of this information from your sequencing core.
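To make the SAMN# to fastq link concrete, here is a rough Python sketch of how the rows of that file could be put together. Treat it as an illustration only: the column names are my recollection of the SRA metadata template (the real template has more required columns, such as platform and instrument model, so check the current one), and the sample names, SAMN accessions, and `_R1`/`_R2` file naming are all hypothetical.

```python
import csv
import io


def sra_metadata_rows(sample_to_biosample):
    """Build one row per sample, pairing its SAMN# with its two read files."""
    rows = []
    for sample, accession in sorted(sample_to_biosample.items()):
        rows.append({
            "biosample_accession": accession,      # SAMN# returned by the BioSample step
            "library_ID": sample,                  # assumed: reuse the sample name
            "library_layout": "paired",
            "filename": f"{sample}_R1.fastq.gz",   # hypothetical forward file name
            "filename2": f"{sample}_R2.fastq.gz",  # hypothetical reverse file name
        })
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder accessions, not real BioSamples.
samples = {"S1": "SAMN00000001", "S2": "SAMN00000002"}
rows = sra_metadata_rows(samples)
tsv_text = to_tsv(rows)
```

Each row ties one SAMN# to both read files, which is the "2 files for one SAMN#" pattern.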
Actual submission of files
You will upload the files to a specific FTP location that the SRA generates for you. They will only generate this location after you finish the SRA submission.
If you have someone from NIH working on their end to help you submit these files I would recommend it. I am otherwise here to help! Ben
Thank you Ben... this sounds like a real pain. I am confused though because the NCBI admin said that I am supposed to submit ONE fasta file of all my samples combined and one mapping file but your description sounds like I would be submitting different fasta files for each biosample?
Yes, multiple files can go with one SAMN# (BioSample) so what I end up doing is I submit 2 files for one SAMN# (which are forward and reverse files). Ben
edit: the admin person may be demultiplexing and splitting the files for you (which is why they're asking for your sequenceID/sampleID w/ barcodes). I can only assume that is their intention.
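For the mapping file asked about at the top of the thread (one line per sequence ID, a tab, then a comma-separated list of the BioSamples it appears in), here is a minimal Python sketch. It assumes you have already exported your feature table into a dictionary of per-sample counts and that you have a sample name to BioSample lookup; the names `feature_counts` and `sample_to_biosample`, the sequence IDs, and the SAMN accessions are all made up for illustration. Whether NCBI wants SAMN accessions or your own sample names in the second column is something to confirm with the rep.

```python
import io


def build_mapping_rows(feature_counts, sample_to_biosample):
    """Return (sequence ID, comma-separated BioSamples) for each sequence."""
    rows = []
    for seq_id in sorted(feature_counts):
        counts = feature_counts[seq_id]
        # A sequence "appears" in a sample when its count there is above zero.
        present = sorted(s for s, n in counts.items() if n > 0)
        biosamples = ",".join(sample_to_biosample[s] for s in present)
        rows.append((seq_id, biosamples))
    return rows


def write_mapping(rows, handle):
    """Write one tab-delimited line per sequence."""
    for seq_id, biosamples in rows:
        handle.write(f"{seq_id}\t{biosamples}\n")


# Hypothetical inputs: counts per sequence per sample, and placeholder accessions.
feature_counts = {
    "seq1": {"S1": 12, "S2": 0, "S3": 4},
    "seq2": {"S2": 7},
}
sample_to_biosample = {
    "S1": "SAMN00000001",
    "S2": "SAMN00000002",
    "S3": "SAMN00000003",
}

rows = build_mapping_rows(feature_counts, sample_to_biosample)
out = io.StringIO()
write_mapping(rows, out)
mapping_text = out.getvalue()
```

To write to disk instead, pass an open file handle to `write_mapping` in place of the `StringIO` buffer.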