Hi, all! I’m a beginner learning metagenome analysis using qiime2.
It’s quite tough to learn it by myself.
I have 3 questions in total about specific stages of the analysis process in qiime2.
I read the manual on the qiime2 homepage (docs.qiime2.org) about training a feature classifier, and there is one thing I don’t get. It says, ‘Taxonomic classifiers perform best when they are trained based on your specific sample preparation and sequencing parameters. You should follow the instructions in Training feature classifiers with q2-feature-classifier to train your own taxonomic classifiers’. But as I understand it, the classifier is trained in order to classify the OTUs I picked from my samples, which means they should be classified by a classifier trained on a reference database, SILVA (which I use) or Greengenes. Then what does ‘your specific sample preparation and sequencing parameters’ mean? For training the classifier, I used ‘silva_132_99_16S.fna’ (imported with qiime tools import) from the rep_set directory of SILVA_132_QIIME_release as the reference reads input, and ‘taxonomy_all_levels.txt’ as the reference taxonomy input.
Using the classifier (from my previous question), I ran a taxonomic analysis of my samples. I tried 9 samples and got open-reference OTU clusters by following the clustering process in ‘docs.qiime2.org’. I made a bar plot afterwards, and it shows data for only 6 samples; the other 3 samples are missing. Is this a problem with my classifier, with the sample metadata, or something else? Or does it mean that no reference sequences matched the bacteria in those 3 samples?
After I denoised these samples, I tried to run ‘qiime diversity core-metrics-phylogenetic’. These are my per-sample sequence counts.
I tried it with --p-sampling-depth 3100 at first, and it says ‘Plugin error from denoise(name of directory I did this function).The rarefied table contains no samples or features. Verify your table is valid and that you provided a shallow enough sampling depth.’ So I raised the depth to 5000, 8000, and 18000, but it returns the same error every time. Is this a problem with the metadata? And which value should I choose for this function? As I understand it, the sampling depth is the number of sequences subsampled from each sample, so if I input 5000 for the sampling depth, qiime picks 5000 sequences per sample for the diversity analysis: it takes all sequences in samples 7, 8, and 9, but only 5000 out of the total in each of the other samples. In the case of sample 1, that means only a quarter of its sequences are subsampled, which could be less precise for that sample.
You did not provide a valid URL above. It looks like you may be reading a very outdated version of the documentation. The meaning of that sentence is clarified in more recent versions of that document:
Taxonomic classifiers perform best when they are trained based on your specific sample preparation and sequencing parameters, including the primers that were used for amplification and the length of your sequence reads.
You are correct. The classifier is trained on the reference sequences.
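To make the "primers and read length" point concrete, here is a minimal Python sketch of the idea: cut each reference sequence down to the region your primers amplify, then truncate it to your read length, before training. In QIIME 2 this step is done with qiime feature-classifier extract-reads; the code below is not its implementation. The function names, the exact-match-only search, and the toy reference are assumptions for illustration, and the primer strings are the commonly cited 515F/806R sequences, so check them against your own protocol.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of each IUPAC code (for example, M = A/C pairs with K = T/G).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex of character classes."""
    return "".join("[" + IUPAC[base] + "]" for base in primer)


def extract_amplicon(reference, fwd_primer, rev_primer, trunc_len=0):
    """Return the region between the primers (primers excluded).

    Only exact matches are considered (no mismatches allowed), which is
    a simplification. Returns None if either primer site is not found.
    """
    fwd = re.search(primer_regex(fwd_primer), reference)
    if fwd is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    rev_pattern = re.compile(primer_regex(reverse_complement(rev_primer)))
    rev = rev_pattern.search(reference, fwd.end())
    if rev is None:
        return None
    region = reference[fwd.end():rev.start()]
    return region[:trunc_len] if trunc_len > 0 else region


# Commonly cited 515F/806R primers (assumption: verify for your data).
FWD_515F = "GTGCCAGCMGCCGCGGTAA"
REV_806R = "GGACTACHVGGGTWTCTAAT"

# Hypothetical toy reference: flank + 515F site + insert + 806R site + flank.
toy_reference = (
    "AAAA" + "GTGCCAGCAGCCGCGGTAA" + "ACGTACGTAC"
    + "ATTAGATACCCTGGTAGTCC" + "TTTT"
)
print(extract_amplicon(toy_reference, FWD_515F, REV_806R, trunc_len=6))
```

A classifier trained on regions like this only ever sees the part of the gene that your reads cover, which is why the primers and read length matter.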
Those other 3 samples probably had insufficient sequencing depth and were dropped at some point. You should use qiime feature-table summarize to determine if/how many reads are retained in each sample.
No, this has nothing to do with the classifier.
No. If that were the case, you would see those 3 samples but they would be composed 100% of “Unclassified” taxa.
That should work with the sequencing counts that you listed. So either those sequence counts are from an earlier table or you are using the wrong file.
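As a rough illustration of what the sampling depth does, here is a minimal Python sketch of rarefaction. It is not the QIIME 2 implementation; the function name, the dictionary layout, and the counts are made up for the example. The key point is that every retained sample is subsampled to exactly the sampling depth, and any sample with fewer sequences than the depth is dropped entirely rather than kept whole.

```python
import random


def rarefy(table, sampling_depth, seed=0):
    """Subsample each sample to exactly `sampling_depth` sequences.

    `table` maps sample ID -> {feature ID: count}. Sampling is done
    without replacement. Samples whose total count is below the
    sampling depth are dropped from the result.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample_id, feature_counts in table.items():
        total = sum(feature_counts.values())
        if total < sampling_depth:
            continue  # too shallow: the whole sample is discarded
        # Expand counts into one entry per sequence, then draw a subset.
        reads = [f for f, n in feature_counts.items() for _ in range(n)]
        picked = rng.sample(reads, sampling_depth)
        counts = {}
        for feature_id in picked:
            counts[feature_id] = counts.get(feature_id, 0) + 1
        rarefied[sample_id] = counts
    return rarefied


# Hypothetical per-sample counts, not the ones from this thread.
toy_table = {
    "sample1": {"otu_a": 12000, "otu_b": 8000},  # 20000 sequences
    "sample7": {"otu_a": 3000, "otu_c": 2500},   # 5500 sequences
    "sample9": {"otu_b": 40},                    # 40 sequences
}

for depth in (3100, 5000, 18000, 50000):
    kept = rarefy(toy_table, depth)
    print(depth, sorted(kept))
```

With these made-up counts, a depth of 5000 keeps sample1 and sample7 and drops sample9, and a depth of 50000 drops every sample. An empty result like that is the situation the "rarefied table contains no samples or features" message describes, so seeing it at a depth of 3100 suggests the table being passed in has far fewer sequences per sample than expected.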
Would you please open up a second post regarding this second question? It is (probably) unrelated to the classifier questions, and opening a new post will make it easier for other users to read and respond to this question.
In this new post, please provide the QZV from qiime feature-table summarize and the full error message(s) you are receiving. That will help us troubleshoot.
I trained my classifier using the 515F/806R primer sequences, my sequence length, and the 99% and 90% OTU data provided by SILVA (I made 2 classifiers since I’d like to see the difference between them).
And I noticed that the error I posted yesterday was caused by my not using the dereplicated table and sequences; instead, I used the table and sequences produced by the denoising procedure when doing OTU clustering. This means those 3 samples were dropped during the OTU clustering process. But shouldn’t I be able to use them? The manual (https://docs.qiime2.org/2018.6/tutorials/otu-clustering/) says,
‘The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is the feature table indicating the number of times each amplicon sequence variant (ASV) is observed in each of your samples. The FeatureData[Sequence] contains the mapping of each feature identifier to the sequence variant that defines that feature. These files are analogous to those generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no denoising, chimera removal, or other quality control has been applied in the dereplication process.’
So I thought there would be no problem with using the table and sequences from dada2.
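A minimal Python sketch of what dereplication produces, assuming reads are already grouped by sample. The function name, the dictionary layout, and the use of SHA-1 digests as feature identifiers are illustrative choices, not a description of how dereplicate-sequences is implemented. It shows the two outputs the quoted passage describes: a table of counts per sample and a mapping from feature identifier to sequence, with no error correction applied.

```python
import hashlib


def dereplicate(reads_by_sample):
    """Collapse identical reads into features.

    `reads_by_sample` maps sample ID -> list of read sequences.
    Returns (feature_table, feature_sequences), where feature_table maps
    sample ID -> {feature ID: count} and feature_sequences maps
    feature ID -> sequence.
    """
    feature_table = {}
    feature_sequences = {}
    for sample_id, reads in reads_by_sample.items():
        counts = feature_table.setdefault(sample_id, {})
        for read in reads:
            # Identical sequences always hash to the same feature ID.
            feature_id = hashlib.sha1(read.encode("ascii")).hexdigest()
            feature_sequences[feature_id] = read
            counts[feature_id] = counts.get(feature_id, 0) + 1
    return feature_table, feature_sequences


# Hypothetical reads: "ACGA" differs from "ACGT" by a single base.
toy_reads = {
    "sample1": ["ACGT", "ACGT", "ACGA"],
    "sample2": ["ACGT"],
}
table, sequences = dereplicate(toy_reads)
print(len(sequences), "features")
for sample_id, counts in table.items():
    print(sample_id, sorted(counts.values()))
```

Because nothing is denoised, the read that differs by one base becomes its own feature instead of being merged into its likely parent sequence.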
Anyhow, I got a reasonable result by using the classifier I trained with the ‘right’ input data (maybe… I’m not sure).
I made a bar plot with it, and some of the taxa show only D_0__Bacteria;;;;;;;;;;;; or mostly stop being classified at level D_2 or D_3.
As you said, if they can’t be classified because the reference data is insufficient, shouldn’t they be shown as ‘unassigned’ or ‘unclassified’?
It will only be “unassigned” if there are NO good matches in the database.
If a reliable species-level classification cannot be found for a query sequence, the process is repeated at the genus level, then the family level, etc., until a reliable classification can be made. So the taxonomy classifications are often truncated because that is the deepest level at which a confident classification could be made. You can adjust this behavior with the confidence parameter, but I recommend using the defaults, which are based on benchmarks of 16S and other data (that paper also has a good description of this method that you can read for more details, and has alternative parameter settings you could try).
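The truncation can be sketched in a few lines of Python. This is only an illustration of the fallback logic described here, not the classifier's actual code. The function name, the input layout, and the per-level confidence values are hypothetical, and 0.7 is used as the threshold on the assumption that it matches the classify-sklearn default in your version.

```python
def truncate_taxonomy(levels, confidence=0.7):
    """Return the deepest taxonomy string that meets the threshold.

    `levels` is a list of (label, confidence) pairs ordered from the
    shallowest rank to the deepest, where each confidence refers to the
    classification down to that rank. Starting at the deepest rank, step
    up one rank at a time until the confidence is high enough.
    """
    for depth in range(len(levels), 0, -1):
        if levels[depth - 1][1] >= confidence:
            return ";".join(label for label, _ in levels[:depth])
    return "Unassigned"  # not even the top rank was confident


# Hypothetical confidences for one query sequence.
query = [
    ("D_0__Bacteria", 0.99),
    ("D_1__Firmicutes", 0.95),
    ("D_2__Clostridia", 0.88),
    ("D_3__Clostridiales", 0.74),
    ("D_4__Lachnospiraceae", 0.55),
    ("D_5__Blautia", 0.31),
]

print(truncate_taxonomy(query))                  # stops at D_3
print(truncate_taxonomy(query, confidence=0.5))  # reaches D_4
```

In this sketch a sequence is reported as "Unassigned" only when even the top rank falls below the threshold, which is why a partly classified sequence shows a shortened lineage instead. Lowering the threshold gives deeper labels at the cost of more uncertain assignments.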