From QIIME fna to QIIME2 diversity and taxonomy data

mshelomi · May 28, 2018, 3:06am

Hello,

I have clean tags, with chimeras removed, as .fna files from the sequencer that look like this:
>Brain_1
TGGGGAATATTGG...
>Brain_2
TGCGGAATTTTA...

I also have them as clean .fastq files:

@Brain_1 HISEQ:790:HCM3JBCX2:1:1101:20115:35645 1:N:0:ACACAGAA orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TGCGGAATT...
+
HIHFH...
@Brain_3 HISEQ:790:HCM3JBCX2:1:1101:11140:35953 1:N:0:ACACAGAA orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TGGGGAATATTG...
+
HHHH....

According to the sequencer, the following pre-processing and quality control was performed:

2.1 Data split: Paired-end reads were assigned to samples based on their unique barcode and truncated by cutting off the barcode and primer sequence.
2.2 Sequence assembly: Paired-end reads were merged using FLASH (V1.2.7), a fast and accurate tool designed to merge paired-end reads when at least some of the reads overlap the read generated from the opposite end of the same DNA fragment; the merged sequences were called raw tags.
2.3 Data Filtration: Quality filtering of the raw tags was performed under specific filtering conditions to obtain high-quality clean tags, following the QIIME (V1.7.0, split_libraries_fastq.py) quality control process.
2.4 Chimera removal: The tags were compared with the reference database (Gold database) using the UCHIME algorithm to detect chimeric sequences, which were then removed to obtain the final Effective Tags.

The Metadata is here: metadata.tsv (452 Bytes)

Each .fna and .fastq file is a separate sample (different tissue type). I would like to determine the alpha and beta diversities, do PCoA, taxonomic analysis, and differential abundance testing of the samples as one can do with the Moving Pictures dataset.

I concatenated all of my .fna files into one, big .fna file. Then I followed the instructions of this tutorial (Clustering sequences into OTUs using q2-vsearch — QIIME 2 2018.4.0 documentation) but the final product no longer has the Names of the samples that I would need for any later analysis, and I do not see where I can enter the products from this pipeline into the Moving Pictures pipeline.

How do I go from .fna or .fastq files to something that I can analyze as in Moving Pictures?

Nicholas_Bokulich · May 28, 2018, 1:59pm

Sounds like your data are in QIIME1DemuxFormat. Use the following command to import:

qiime tools import \
  --input-path seqs.fna \
  --output-path seqs.qza \
  --type 'SampleData[Sequences]'

Then proceed starting with denoising/otu picking steps.
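
For reference, a QIIME 1 demultiplexed file labels every read as sample ID, underscore, read number, so the sample name travels in the header itself. The sketch below is plain Python and not part of QIIME 2. It assumes the sample ID is everything before the last underscore of the first header token, and the helper name count_reads_per_sample and the "Gut" sample are invented for illustration. Tallying reads per sample this way is a quick check that sample names are still present before importing.

```python
from collections import Counter
import io


def count_reads_per_sample(handle):
    """Tally reads per sample from QIIME 1-style FASTA headers.

    Assumes each header looks like '>SampleID_N optional description',
    with the sample ID being everything before the last underscore.
    """
    counts = Counter()
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # empty header, nothing to parse
        label = tokens[0]                    # e.g. 'Brain_1'
        sample_id = label.rsplit("_", 1)[0]  # drop the trailing read number
        counts[sample_id] += 1
    return counts


# Toy input modeled on the headers in this thread (sequences are made up).
example = io.StringIO(
    ">Brain_1\nTGGGGAATATTGG\n"
    ">Brain_2\nTGCGGAATTTTA\n"
    ">Gut_1\nTGGGGAATATTGG\n"
)
counts = count_reads_per_sample(example)
print(dict(counts))
```

If every read ends up under a single sample ID, the headers were probably altered somewhere upstream of the import.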

I hope that helps!

mshelomi · May 29, 2018, 6:12am

Thanks for the reply, but it does not work.

Argument to parameter 'demultiplexed_seqs' is not a subtype of SampleData[JoinedSequencesWithQuality | PairedEndSequencesWithQuality | SequencesWithQuality].

I want to go from .fna to something that can be run through qiime diversity alpha-group-significance, qiime diversity alpha-correlation, qiime diversity beta-group-significance, qiime emperor plot, qiime taxa barplot, etc.

What do I need to do?
I cannot attach .fna files to this forum, but I have attached a small test .qza made with your input method here: can this be used anywhere in the Moving Pictures tutorial pipeline? test.qza (4.2 KB)

Nicholas_Bokulich · May 29, 2018, 7:30pm

Let's start here. So it sounds like you were able to import and process your sequences using that tutorial (sorry, I was not aware that the advice I gave on importing was covered in that tutorial).

At the end of that tutorial, you will have a feature table and sequence data, which you can analyze following the steps in the moving pictures tutorial starting at this stage.

If you want to use dada2 or deblur for denoising, instead of OTU picking methods, you will need to start with the raw fastq information and process per the steps in e.g., the moving pictures or other tutorials (which cover all of the steps that you are describing). If you are using dada2 instead of OTU picking, you do not need to do read joining, qiime1-style filtering, or chimera removal.

Note that QIIME2 has methods for demultiplexing (with q2-demux), paired-end read joining (with q2-vsearch), qiime1-style filtering (with q2-quality-filter) and chimera removal (with q2-vsearch). The advantage of doing all of this in QIIME2 is that all of your analysis is traceable in provenance.

I hope that helps!

mshelomi · May 31, 2018, 10:44am

That helps, and I was able to go through the entire MP tutorial, mostly, but I still have a question regarding the importing.

Was I correct in concatenating the initial .fna files into one file?

I am seeing a problem in my alpha diversity: the samples with names lower in the alphabet show lower diversity, which does not fit the expected data. I suspect this is an artefact of the dereplication, which kept one name for the sequence even though the same sequence may have appeared in a different sample. Or am I misinterpreting things?

Let's say I have 5 .fna files, each corresponding to one tissue. Do I concatenate them all? I tried importing via a Manifest, but the program didn't work. Concatenation works, but is that the right thing to do?

Nicholas_Bokulich · May 31, 2018, 12:03pm

No. It looks like you already demultiplexed these files:

So rejoining does not make much sense.

That is suspicious.

You are misinterpreting. The dereplicated sequences contain one copy of each unique sequence. The feature table contains the counts of each feature (sequence) in each sample for calculating diversity estimates.
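
To make the distinction concrete, here is a toy sketch in plain Python. It is not the vsearch implementation, and the sample names and sequences are invented. Dereplication keeps one copy of each unique sequence, while the feature table separately records how often each sequence occurred in each sample, so per-sample information is not lost.

```python
from collections import Counter, defaultdict

# (sample ID, sequence) pairs; invented toy data.
reads = [
    ("Brain", "ACGT"),
    ("Brain", "ACGT"),
    ("Brain", "TTGA"),
    ("Gut", "ACGT"),
    ("Gut", "CCGG"),
]

# Dereplicated sequences: one copy of each unique sequence, in first-seen order.
dereplicated = list(dict.fromkeys(seq for _, seq in reads))

# Feature table: counts of each sequence (feature) within each sample.
feature_table = defaultdict(Counter)
for sample_id, seq in reads:
    feature_table[sample_id][seq] += 1

# Observed features per sample, the simplest alpha diversity measure.
observed = {sample: len(counts) for sample, counts in feature_table.items()}

print(dereplicated)
print({sample: dict(counts) for sample, counts in feature_table.items()})
print(observed)
```

Here "ACGT" appears once in the dereplicated list but is still counted in both samples, so which sample name the dereplicated copy happens to carry does not affect diversity estimates.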

Do not concatenate. We can assist with importing errors you have been having (please open a new topic for that).

I would really recommend getting the rawest fastq data possible (do not follow the processing steps recommended by the sequencing center, and do not use the clean fasta as I'm not sure exactly what those are).

It would be much easier for downstream processing, and also take advantage of superior methods in QIIME2 vs. qiime1, to follow this advice:

I neglected to mention that doing so would also allow you to use denoising methods like dada2 or deblur in QIIME2, which are much more sensitive than OTU picking (which is what you are stuck doing if you demultiplex with qiime1, as you have already done).

Let us know if you want to go that route! The suspicious diversity results make me suspect that something is going wrong with your current workflow... the earlier you get it into QIIME2, the easier and more transparent the process becomes.

I hope that helps!

mshelomi · June 1, 2018, 3:04am

Perfect! Ok, I found my rawest files (Paired fastq.gz files), made an absolute filepath manifest for them, and imported them as follows:

qiime tools import \
	--type 'SampleData[PairedEndSequencesWithQuality]' \
	--input-path MANIFEST.csv \
	--output-path demux.qza \
	--source-format PairedEndFastqManifestPhred33
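
For anyone building the manifest passed to --input-path above, here is a minimal Python sketch. It assumes the CSV layout documented for the fastq manifest formats in QIIME 2 releases of that period: a header of sample-id, absolute-filepath, direction, and one forward and one reverse row per sample. The helper name write_manifest, the sample IDs and the file paths are hypothetical.

```python
import csv
import io


def write_manifest(samples, handle):
    """Write a paired-end FASTQ manifest: one forward and one reverse row per sample."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample-id", "absolute-filepath", "direction"])
    for sample_id, (forward, reverse) in samples.items():
        writer.writerow([sample_id, forward, "forward"])
        writer.writerow([sample_id, reverse, "reverse"])


# Hypothetical sample IDs and paths; replace with your own absolute paths.
samples = {
    "Brain": ("/data/raw/Brain_R1.fastq.gz", "/data/raw/Brain_R2.fastq.gz"),
    "Gut": ("/data/raw/Gut_R1.fastq.gz", "/data/raw/Gut_R2.fastq.gz"),
}

# Written to an in-memory buffer here; use open("MANIFEST.csv", "w", newline="")
# to write a real file instead.
buffer = io.StringIO()
write_manifest(samples, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

Later QIIME 2 releases changed the manifest layout and import options, so check the importing tutorial for the version you have installed.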

I then joined the paired end reads with 'qiime vsearch join-pairs' , viewed a summary with 'qiime demux summarize', did quality control with 'qiime quality-filter q-score-joined' and continued with Deblur. Problem solved!
