generating a mapping file

I need to generate a mapping file for submitting my community sequence data to NCBI. The mapping file should have each sequence ID followed by the samples in which that sequence appears (tab delimited format). Would you please provide the right command for that? I couldn’t find anything in the forum or in any of the main tutorials for this.

Thank you!


Is this the NIH Sequence Read Archive (SRA)?

This is what I’m grappling with now. You have to submit your samples to have NIH generate a PRJNA# for a BioProject, then a SAMN# for each BioSample.

The workflow looks like this:

Bioproject -> Biosamples -> SRA samples

The sequence ID is ultimately generated by NIH. Ask me questions, I’m like the only person among several labs who knows the process (and so I end up having to submit everything to SRA) :frowning: Ben

Hi Ben.
I have 20+ environmental samples from an experiment where each feature was found in multiple samples. In my correspondence with an NCBI rep, they didn’t mention anything about SRA. This is what they wrote:

[1] You should indicate that this submission is part of a large scale
study within the wizard so that you can include and register your BioProject
and BioSamples within a single submission.

[2] You can submit your 20 samples as part of this single BioProject.
Within the submission wizard, you will need to upload a mapping file that
lists each sequence ID within your fasta file along with the corresponding
BioSample that should be included with that sequence. You can only include
a single BioSample per sequence.

If some of these sequences were derived from or found in multiple environmental
sources, you will need to provide a tab-delimited mapping file that lists
each sequence ID along with the corresponding BioSamples in a comma-separated
list, for example:

Seq1 BioSampleID1,BioSampleID2
Seq2 BioSampleID1,BioSampleID3

Send this mapping file along with any questions to [email protected].

Thanks for any help that you can provide!

Yes, OK, this is the NIH SRA. I think the admin is asking for everything up front so you can generate the BioProject # and BioSample # together in one submission.

Again the flow goes like this:
Bioproject -> Biosamples -> SRA samples

I submit them separately. So, go to Submissions | Submission Portal. Then select new bio project.


Most microbiome data is 16S, which you file under metagenome
Do not submit any BioSamples at this time
Submit it and wait for a PRJNA number (PRJNAXXXXXXX, where XXXXXXX is a series of digits)
Once you have a PRJNA number, this number should go into your publication.


So the BioSample submission is annoying
You need to use their metagenome file to submit your 20 samples
The submission requires specific wording for it to go through
For example, for human microbiome, you need to say the host is "Homo sapiens" and cannot deviate from this
Another weird thing is that they ask for the GPS coordinates of where the samples were collected. You don't actually need to provide them; the field can be filled in with "missing", which is what I usually do.
Submit these BioSamples. NIH will generate a SAMNXXXXX for each sample (XXXXX being a series of digits). This is going to be important because you will want to submit forward/reverse files for each sample.

Sequence Read Archives (SRA)

Here is where you submit ANOTHER file, which lists the SAMN# for each sample you are submitting to the NIH SRA. It connects each SAMN# to that sample's forward/reverse .fastq files.
The file also describes your sequence files: amplicon, 16S rRNA, Illumina MiSeq/HiSeq, etc. You will likely need to get this information from your sequencing core.
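As a rough sketch of what that file can look like, here is a small Python example that builds one tab-delimited row per sample. The column names and values are assumptions modeled on the SRA metadata spreadsheet, so check them against the template the submission wizard gives you. The function names (`sra_rows`, `to_tsv`), the `_R1`/`_R2` file naming, and the accessions are all hypothetical placeholders.

```python
import csv
import io

# Column names assumed from the SRA metadata spreadsheet; verify them against
# the template the submission wizard gives you.
COLUMNS = [
    "biosample_accession", "library_ID", "library_strategy", "library_source",
    "library_layout", "platform", "instrument_model", "filetype",
    "filename", "filename2",
]


def sra_rows(accessions, instrument_model="Illumina MiSeq"):
    """Build one metadata row per sample from {sample_name: SAMN accession}."""
    rows = []
    for sample_name, accession in accessions.items():
        rows.append({
            "biosample_accession": accession,
            "library_ID": sample_name,
            "library_strategy": "AMPLICON",
            "library_source": "METAGENOMIC",
            "library_layout": "paired",
            "platform": "ILLUMINA",
            "instrument_model": instrument_model,
            "filetype": "fastq",
            # Hypothetical naming scheme for the forward/reverse reads.
            "filename": f"{sample_name}_R1.fastq",
            "filename2": f"{sample_name}_R2.fastq",
        })
    return rows


def to_tsv(rows):
    """Render the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder accessions, not real BioSamples.
example_accessions = {"sampleA": "SAMN00000001", "sampleB": "SAMN00000002"}
sra_table = to_tsv(sra_rows(example_accessions))
print(sra_table, end="")
```

Each row ties one SAMN# to its two read files, which is the link described above. The real template has more columns than this sketch.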

Actual submission of files

You will upload the files to a specific FTP location that the SRA will generate for you. They will only generate this location after you finish the SRA submission.

If someone at NIH can help you submit these files from their end, I would recommend taking them up on it. I am otherwise here to help! Ben


Thank you Ben… this sounds like a real pain. I am confused though because the NCBI admin said that I am supposed to submit ONE fasta file of all my samples combined and one mapping file but your description sounds like I would be submitting different fasta files for each biosample?


Yes, multiple files can go with one SAMN# (BioSample) so what I end up doing is I submit 2 files for one SAMN# (which are forward and reverse files). Ben

edit: the admin person may be demultiplexing and splitting the files for you (which is why they’re asking for your sequenceID/sampleID w/ barcodes). I can only assume that is their intention.
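
On the mapping file itself: if you have a feature table exported as tab-delimited text (sequence IDs as rows, samples as columns, counts in the cells), the format in the NCBI email can be produced with a few lines of Python. This is only a sketch under those assumptions. `build_mapping`, `format_mapping`, the toy table, and the sample-to-BioSample lookup are made up for illustration.

```python
import csv
import io


def build_mapping(table_text, biosample_ids):
    """Map each sequence ID to the BioSamples where its count is above zero.

    table_text    -- feature table as tab-delimited text (assumed layout:
                     header row of sample names, then one row per sequence)
    biosample_ids -- dict from sample name to BioSample ID (hypothetical)
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    sample_names = next(reader)[1:]  # first header cell labels the ID column
    mapping = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        seq_id, counts = row[0], row[1:]
        present = [
            biosample_ids[name]
            for name, count in zip(sample_names, counts)
            if float(count) > 0
        ]
        if present:
            mapping[seq_id] = present
    return mapping


def format_mapping(mapping):
    """Render one line per sequence: ID, a tab, then comma-separated BioSamples."""
    return "".join(
        f"{seq_id}\t{','.join(samples)}\n" for seq_id, samples in mapping.items()
    )


# Toy data for illustration only.
toy_table = (
    "feature_id\tS1\tS2\tS3\n"
    "Seq1\t5\t3\t0\n"
    "Seq2\t2\t0\t7\n"
)
toy_biosamples = {"S1": "BioSampleID1", "S2": "BioSampleID2", "S3": "BioSampleID3"}

mapping_text = format_mapping(build_mapping(toy_table, toy_biosamples))
print(mapping_text, end="")
```

With the toy table this prints the same two lines as the example in the NCBI email. To use it for real, read your exported table into `table_text`, fill the lookup with your actual BioSample IDs, and write `mapping_text` to a file. If your export has a comment line before the header, skip it first.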
