16s pacbio anaylsis


Shuxian_LI

Hi,
I want to use the qiime2 to analyse the 16s pacbio data;
but I didn't find how to input the unpaired pacbio in the qiime2.
I checked the discussion in 2021, it can only use in R.
I would like to ask is there any update on the qiime2-2022.2 in this area?
If there is a tutorial doc on the website would be best.
Cheers!

Hi @Shuxian_LI,
Welcome in the forum!

You can import the sequences as single end sequences, please follow the importing tutorial for that.
(Importing data — QIIME 2 2022.2.0 documentation).
I would suggest using a manifest file, something like:

qiime tools import
--type 'SampleData[SequencesWithQuality]'
--input-path se-33-manifest
--output-path single-end-demux.qza
--input-format SingleEndFastqManifestPhred33V2

Then you can denoise PacBio by using:
https://docs.qiime2.org/2022.2/plugins/available/dada2/denoise-ccs/

Hope it helps.
Cheers
Luca

1 Like

Thank you Luca;

I already imported the data, did the denoise, removed the chimeras and did the taxonomy.
And I find that the taxonomy of my data is only 20 lines, this is a soil sample, so it should contain more than this. and in mothur it has 300+ at genis level, I don't know where I did wrong.
I list my codes below:
cat manifest.txt

sample-id absolute-filepath
2 01.CleanData/B1S2/B1S2/B1S2.ccs.fastq.gz

qiime tools import
--type 'SampleData[SequencesWithQuality]'
--input-path manifest.txt
--output-path B1S2.qza
--input-format SingleEndFastqManifestPhred33V2
Imported B1S2.ccs.fasta as DNASequencesDirectoryFormat to B1S2.ccs.qza
qiime dada2 denoise-ccs --i-demultiplexed-seqs B1S2.qza --p-front AGAGTTTGATCCTGGCTCAG --p-adapter GNTACCTTGTTACGACTT --p-max-mismatch 2 --p-trunc-len 0 --p-trim-left 0 --p-trunc-q 20 --p-min-len 1000 --p-max-len 1600 --p-pooling-method 'independent' --p-n-threads 35 --p-n-reads-learn 1000000 --p-hashed-feature-ids TRUE --o-table Frequency --o-representative-sequences Sequence --o-denoising-stats DADA2Stats
Saved FeatureTable[Frequency] to: Frequency.qza
Saved FeatureData[Sequence] to: Sequence.qza
Saved SampleData[DADA2Stats] to: DADA2Stats.qza
q2-vsearch
qiime vsearch uchime-denovo
--i-table Frequency.qza
--i-sequences Sequence.qza
--output-dir uchime-dn-out
Saved FeatureData[Sequence] to: uchime-dn-out/chimeras.qza
Saved FeatureData[Sequence] to: uchime-dn-out/nonchimeras.qza
Saved UchimeStats to: uchime-dn-out/stats.qza
qiime metadata tabulate
--m-input-file uchime-dn-out/stats.qza
--o-visualization uchime-dn-out/stats.qzv
Saved Visualization to: uchime-dn-out/stats.qzv

Exclude chimeras and “borderline chimeras”
qiime feature-table filter-features
--i-table Frequency.qza
--m-metadata-file uchime-dn-out/nonchimeras.qza
--o-filtered-table uchime-dn-out/table-nonchimeric-wo-borderline.qza
Saved FeatureTable[Frequency] to: uchime-dn-out/table-nonchimeric-wo-borderline.qza
qiime feature-table filter-seqs
--i-data Sequence.qza
--m-metadata-file uchime-dn-out/nonchimeras.qza
--o-filtered-data uchime-dn-out/rep-seqs-nonchimeric-wo-borderline.qza
Saved FeatureData[Sequence] to: uchime-dn-out/rep-seqs-nonchimeric-wo-borderline.qza
qiime feature-table summarize
--i-table uchime-dn-out/table-nonchimeric-wo-borderline.qza
--o-visualization uchime-dn-out/table-nonchimeric-wo-borderline.qzv
Saved Visualization to: uchime-dn-out/table-nonchimeric-wo-borderline.qzv

qiime feature-classifier classify-sklearn
--i-classifier silva-138-99-515-806-nb-classifier.qza
--i-reads rep-seqs-nonchimeric-wo-borderline.qza
--o-classification taxonomy.qza
Saved FeatureData[Taxonomy] to: taxonomy.qza

qiime metadata tabulate
--m-input-file taxonomy.qza
--o-visualization taxonomy.qzv
Saved Visualization to: taxonomy.qzv

qiime feature-classifier classify-sklearn
--i-classifier silva-138-99-515-806-nb-classifier.qza
--i-reads rep-seqs-nonchimeric-wo-borderline.qza
--o-classification taxonomy.qza
qiime tools export
--input-path feature-table.qza
--output-path exported-feature-table

Massive thanks!

Hi @Shuxian_LI ,

Can you have a look in the 'DADA2Stats.qza', hoe many sequences are passing the filter setting you are using?
How many sequences are passing your 'uchime-denovo' chimera filtering?
Did you try to exclude the chimera filtering step?

On the taxonomic assignment step, did you create the classifier you are using? It seems created by selecting position between 515 and 806 in the 16S database. Given you are working with much longer sequences I would try to use a classifier obtained without any position selection.
Cheers
Luca

Hi Luca;
I checked DADA2Stats.qza in this code, and I find it is
173(0.48%) after I do the dada2...
And I change the primer to the universal primer used in dada2 handbook and loose the index, the reads rose to 232...
I check the link below to see the same problem happened to other people, but it wouldn't solve my problem.
high chimera rate in dada2 - User Support - QIIME 2 Forum

hi @Shuxian_LI,
I guess we need to focus on the dada2 step for now to get most out of this.
Could you share the DADA2Stats.qza here or send me in private so I can take a look what step is loosing all the reads?

Cheers
Luca

Thank you so much, can you share your email with me?

Hi, Luca;
I check the quality of my data and did the trim, the data quality rise after I did the trim, so could I do the dada2 after I processed the trim?

Cheers!

Hi,
my understanding is that you should be able to use dada2 after that.
For your information, I am following closely this thread:

Which may be related to the same issue you are seeing, although is not offering a solution yet
there are many moderators on the case beyond the screen!

Cheers
Luca

Hi Luca, I checked by using this sop below:

PacBio CCS Amplicon SOP v1 (qiime2) · LangilleLab/microbiome_helper Wiki (github.com)
qiime dada2 denoise-single \

--i-demultiplexed-seqs reads_qza/trimmed_reads.qza
--p-trunc-len 0
--p-max-ee 3
--p-n-threads 4
--p-n-reads-learn 100000
--output-dir dada2_output

And I find that dada2 is the key part for all the data lost
sample-id input filtered percentage of input passed filter
denoised non-chimeric percentage of input non-chimeric
#q2:types numeric numeric numeric numeric numeric numeric
2 4447 3867 86.96 49 49 1.1

So is there any ways I can increase the filtered remain?

Cheers

Hi @Shuxian_LI

This is actually normal because many sequencing providers do not provide the pacbio ccs reads with satisfactory quality so the dada2 will drop those "bad ccs reads" :woozy_face:.

Ask your sequencing provider what parameters they had used in ccs, especially minfullpass and minPredictedAccuracy ? I bet the minPredictedAccuracy is below 0.99.

I guess the reason why those sequencing providers do not deliver better ccs reads is because of the sequecing costs. More raw reads, better quality but higher costs which result in higher service price.

Check out @benjjneb 's repo , you will find a set of high-quality ccs reads which was delivered by Pacbio itself and work just fine using dada2.

Right now I suggest using otu clustering with --p-perc-identity 0.99 instead of dada2 denoising.

Do not forget to remove the primer and re-orient the ccs reads using dada2::removePrimers in R first.

Kind regards,
Sixvable

1 Like

Hi Luca and Sixvable;
Since you mentioned this in this:

the minPredictedAccuracy is below 0.99

I check that minPredictedAccuracy = 0.995
I check the data and find that,
since this is a ccs data, and the adapter was removed by the commercial lab we used, so the dada2 denoise ccs can not process;
so I used this based on [PacBio CCS Amplicon SOP v1 (qiime2) · LangilleLab/microbiome_helper Wiki (github.com)] and try to solve this problem:

seqtk seq -r $file.fastq > $file.rc.fastq
cutadapt -a AGAGTTTGATCCTGGCTCAG -g GNTACCTTGTTACGACTT
--discard-untrimmed --no-indels -j 1 -m 1000 -M 1600
-o $file.trim.fastq $file.fastq

cutadapt -a AGAGTTTGATCCTGGCTCAG -g GNTACCTTGTTACGACTT
--discard-untrimmed --no-indels -j 1 -m 1000 -M 1600
-o $file.rc.trim.fastq $file.rc.fastq

cat $file.trim.fastq $file.rc.trim.fastq > $file.cat.fastq
qiime tools import
--type "SampleData[SequencesWithQuality]"
--input-path manifest.txt
--output-path all.qza
--input-format SingleEndFastqManifestPhred33V2

qiime dada2 denoise-single
--i-demultiplexed-seqs all.qza
--p-trunc-len 0
--p-max-ee 3
--p-n-threads 8
--p-n-reads-learn 100000
--output-dir output2

65% Of my data remained.
So I was wondering if the dada2 denoise-ccs was only thinking the reads forward not the back reverse?

Cheers!

Hi @Shuxian_LI ,
I think it could be a promising starting point. I would double check that in your final concatenated fastq you don't have replicates duplicated sequences, just in case any was found in both directions, although I agree is not likely I am not sure could not happen.
For the denoising, I would not use dada2 single, because it applies an error model specific for Illumina sequences. So, if dada2 css does not work with your data, as it seems, you could try to use deblur, which relies only on the bases in the sequences to denoise, or switching to a denovo clustering approach as @sixvable was suggesting.
Luca

2 Likes

Hi @Shuxian_LI

Use qiime dada2 denoise-ccs instead of denoise-single as @llenzi suggest because it use the Pacbio error model.

I think you misunderstood the adapters and primers. Primers is the oligonucleotides sequence you used to amplify the 16S rRNA amplicon such as 27F/1492R. During dada2::removePrimers processing the reads will be re-oriented based on the primers' locations. So if you can remove your primers using cutadapt than you can and should use dada2's removePrimers instead (in QIIME2 it should be --p-front and --p-adapter in qiime dada2 denoise-ccs, do not forget to give the right primer direction).

dada2 has its own chimera removing protocol so there is no need to use vsearch.

Which providers can deliver this type of reads in China mainland? :thinking: Could you share with me the company name and their price?

如果不方便可以使用中文交流,[email protected]

1 Like