Feature counts by sample? And how to rarefy correctly?

1> Hello, I use qiime feature-table summarize (default settings) to check my feature table (ASVs). I don't think the output is what I need. Is it possible to get a summary like the QIIME 1 OTU table? For example, if I have 3 samples and 100 unique ASVs, I want something like this:
|           | sample 1 | sample 2 | sample 3 |
|-----------|----------|----------|----------|
| feature 1 | 10       | 20       | 4        |
| feature 2 | 3        | 4        | 5        |

2> Also, how do people normally rarefy a feature table in QIIME 2 (I just switched from QIIME 1)? For example, I have 3 samples, with 10,000, 20,000, and 30,000 reads respectively.

I know about qiime feature-table rarefy, but this seems to do a different thing? I am really confused here. I don't think the frequency in the feature table means the number of reads? It's just the feature counts.

3> Where can I find my reads-per-sample information after I build the feature table? The only place I can see reads per sample is after I demultiplex. After that step, I can't find anything about reads per sample, even after I run the DADA2 workflow. If I want to rarefy to an equal number of reads across samples, should I do it before the DADA2 workflow? Or has rarefying to an equal sequencing depth been updated?

4> I guess the point here is: how can I track the sequences per sample after building a feature table? In my experience, I run the DADA2 workflow and build a feature table, then do multiple rounds of filtering (e.g., filtering out ASVs that I don't want or low-abundance ASVs). Eventually, I have a feature table that I am going to use for some serious analyses, and I would like to subsample it to equal depth, so I need to know the number of reads per sample in the final feature table. Also, according to the tutorials I don't think I need to do any filtering on the rep-seqs or raw seqs; the rep-seqs and raw seq files are almost never used.

Does frequency in the feature table (ASV) = number of reads, or not?

Thanks

Hi @moonlight,
The feature-table.qza you have is exactly like the QIIME 1 OTU table. When you use feature-table summarize, this doesn't give you the exact contents of the table but rather a summary of some useful information about the feature table.

The qiime feature-table rarefy action does exactly this: it rarefies your table to a given depth. Any samples that do not have a minimum total count at that level will be discarded.
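For example, a minimal sketch of rarefying to an even depth of 10,000 (the file names and the depth here are just placeholders for your own values):

```
# Randomly subsample each sample down to 10,000 counts (without replacement);
# samples with fewer than 10,000 total counts are dropped from the output.
qiime feature-table rarefy \
  --i-table table.qza \
  --p-sampling-depth 10000 \
  --o-rarefied-table rarefied-table.qza
```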

If you are using the diversity core-metrics pipelines, they require a sampling depth as input, which will also rarefy your table to that depth. Note that rarefying is not required for all downstream analyses. For example, for differential abundance tools like q2-ANCOM, q2-songbird, q2-ALDEX2, etc., it is recommended that you do not rarefy. The same goes for q2-DEICODE, which is used for building beta diversity distance matrices and PCA ordination.

The reads-per-sample information can be obtained with feature-table summarize, as demonstrated in the Moving Pictures tutorial I have linked for you in a couple of other places on the forum. Again, I would strongly recommend going through that carefully; I believe a lot of your questions would be answered there.
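As a minimal sketch, using the file names from that tutorial:

```
# The interactive .qzv shows, among other things, the total frequency
# (i.e. reads) per sample after denoising and any filtering.
qiime feature-table summarize \
  --i-table table.qza \
  --m-sample-metadata-file sample-metadata.tsv \
  --o-visualization table.qzv
```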

If you would like the raw .biom table (like QIIME 1) to use outside of QIIME 2 for some reason, the .biom table under the hood can be exported using this exporting tutorial.
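A rough sketch of that export step (the output paths are just examples):

```
# Pull the underlying BIOM file out of the artifact...
qiime tools export \
  --input-path table.qza \
  --output-path exported-feature-table

# ...and optionally convert it to TSV with the biom CLI.
biom convert \
  -i exported-feature-table/feature-table.biom \
  -o exported-feature-table/feature-table.tsv \
  --to-tsv
```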

Once you have run your DADA2 pipeline and obtained your feature-table.qza and rep-seqs.qza artifacts, you don't need your raw .fastq files. Any filtering you do will happen after DADA2. Your rep-seqs.qza file will be used to build a phylogenetic tree and to assign taxonomy, and it is used by a variety of other plugins as well. You don't need to filter it to match your filtered feature-table per se, but if you were dealing with a big dataset with lots of unique ASVs, doing so certainly would help reduce the computational time for tree building and taxonomy assignment. So it's not a bad habit to get used to.
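If you do want to keep rep-seqs.qza in sync with a filtered table, a minimal sketch might look like this (file names are placeholders):

```
# Keep only the representative sequences whose feature IDs are still
# present in the filtered feature table.
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table filtered-table.qza \
  --o-filtered-data filtered-rep-seqs.qza
```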

This table is a count table exactly as you remember from QIIME 1.

|         | feature1 | feature2 | feature3 | feature.. |
|---------|----------|----------|----------|-----------|
| sample1 | 5        | 3        | 2        | 3         |
| sample2 | 2        | 5        | 5        | 4         |
| sample3 | 0        | 99       | 7        | 6         |
| sample… | 3        | 2        | 0        | 0         |


Hello Mehrbod,

I appreciate your help. I will review the Moving Pictures tutorial, and if I have further questions, I will ask you.

My workflow is not exactly the same as the tutorials.

1> The tutorials calculate alpha and beta diversity before assigning taxonomy. Well, I assign taxonomy first and check my feature table. If there are some taxa that I don't want, I filter them out directly. Later, I rarefy the filtered feature table (this is what I want to do). Do you think my order matters or not?

So, if the workflow order doesn't matter, I suppose I don't have to filter the rep-seqs file, as you explained.

2> Since I am doing paired-end Illumina sequencing, I use the Atacama soils tutorial (https://docs.qiime2.org/2019.10/tutorials/atacama-soils/). After denoising, I have stats like this (https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdocs.qiime2.org%2F2019.10%2Fdata%2Ftutorials%2Fatacama-soils%2Fdenoising-stats.qzv)

Here is my question. There are some stats about merging in the above results. By merging, do you mean joining the overlapping region of the paired reads? For example, if you have 2x300 bp data and there is a 50 bp overlap, you could join them into ~550 bp reads? If so, how does the workflow deal with non-joinable reads? Say you do fungal ITS, where the region size varies. If the workflow only keeps the joinable reads, it will discard a lot of reads? Is it possible to control this? For example: A> join whatever can be joined; B> for those reads that can't be joined, still add them to seqs.qza/rep-seqs.qza for downstream analyses; C> can I choose not to join the F and R reads at all and just simply concatenate them? (I found I have a lot of reads that I can't join.)

Thanks

This table is a count table exactly as you remember from QIIME 1.

|         | feature1 | feature2 | feature3 | feature… |
|---------|----------|----------|----------|----------|
| sample1 | 5        | 3        | 2        | 3        |
| sample2 | 2        | 5        | 5        | 4        |
| sample3 | 0        | 99       | 7        | 6        |
| sample… | 3        | 2        | 0        | 0        |

Hi Mehrbod,

1> How did you generate a table like this? I looked at the Moving Pictures tutorial. It can only give a table like this
(https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdocs.qiime2.org%2F2019.10%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftable.qzv)

| Sample ID | Feature Count |
|-----------|---------------|
| L4S137    | 9820          |
| L4S63     | 9744          |
| L1S57     | 8716          |
| L4S112    | 8340          |
| L1S208    | 8162          |
| L1S76     | 7871          |

There is no breakdown table like the one you gave me.

2> Also, I don't think your table is the same as the tutorial table.
The tutorial table shows feature counts, which I take to mean how many features are found in each sample. In QIIME 1 these were called OTUs, right?

However, your table shows how many counts fall in each feature. In QIIME 1 we would say how many reads/seqs are assigned to each OTU.

When I rarefy, I would like to rarefy to an equal number of frequencies (reads) across samples rather than an equal number of features (OTUs) across samples. As you suggested, I don't have to filter my rep-seqs file each time I filter my feature table. Let's say I only have the one, oldest rep-seqs file, but I filter my feature table three times and filter out 3,000 features. These 3,000 features should correspond to 30,000 reads. Now I have my final feature table, but my rep-seqs file is still the oldest one. I don't know how to check the number of reads per sample in my final feature table (after filtering three times). The final feature table only logs the number of features across samples, not the number of reads across samples. In order to rarefy, I have to know the number of reads across samples.

If I want to do an alpha diversity analysis such as core diversity, I have to give a --p-sampling-depth. In my case, I don't know where to find the sampling depth in my feature table summary. Unless I misunderstood, I don't think the "Feature Count" in the table I pasted means the number of reads; it means the count of features? As far as I can see, the total number of reads is always higher than the total number of features, since multiple reads can be assigned to one feature. (In QIIME 1 you could have multiple reads assigned to an OTU, so I suppose you can have multiple reads assigned to a feature in QIIME 2.)

Hi @moonlight,

Unless you are using the taxa collapse action, when you assign taxonomy by the typical methods in feature-classifier you are simply creating a new taxonomy artifact, not changing your feature-table. This is an important concept to be aware of in QIIME 2: these objects are kept separate.

Generally speaking, removing taxa from your feature-table without a very specific reason is not a good idea. I typically do all my analyses at the ASV level and call their taxonomy when needed, and I avoid collapsing my ASVs down to taxonomy since you lose some resolution there.
Which taxa do you not want? Is there a specific reason why you need to remove some? I ask because this can bias your results and is not a typical approach, unless you are removing something like chloroplast or mitochondria.
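For the common chloroplast/mitochondria case, a minimal sketch would be something like this (file names are placeholders, and the matching terms depend on your reference taxonomy):

```
# Drop any feature whose taxonomy string contains "mitochondria"
# or "chloroplast".
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-no-mito-no-chloro.qza
```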

Yes, but you don't get to control that overlap region; rather, it is dictated by the primer set you used. If the primers flank a region that is ~550 bp in length and you have a 2x300 bp run, then yes, the reads would theoretically merge with about 50 bp of overlap.

The justConcatenate option available in native DADA2 in R is not exposed in QIIME 2, and I would advise against doing something like that in general; see this recent thread on the topic. ITS analysis has some specific considerations that you need to be aware of; see this ITS tutorial for more details. If your reads are failing to merge, you may consider using your forward reads only.
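A forward-reads-only run would be a sketch along these lines (the trim/truncation values are placeholders you would pick from your own quality plots):

```
# Denoise only the forward reads; no read merging is involved.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats denoising-stats.qza
```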

Please see my initial answer from above:

In other words, you already have this table in this format within feature-table.qza.

Again, answered previously:

So, this is a summary of your feature-table, and in fact it holds some of the information you were asking for. For example, L4S137 has a total frequency of 9820 across however many features. You can think of this as the total number of reads in that sample, not the total number of unique features.

No, the Feature Count value in that table corresponds to the sum of all counts across all features in that sample. And yes, you can think of features, ASVs, and OTUs as the same kind of unit, though they are produced differently.

Sure, like I said, the feature-table in QIIME 2 and the OTU table in QIIME 1 are the same.

Yes, this is in fact exactly the definition of rarefying, and the plugin does exactly that: it randomly throws away reads (regardless of their feature designation) from each sample to reach the set threshold.

Why? I'm not following this logic.

By using the feature-table summarize action as you have before. This is exactly the info you need.

I think this is again stemming from your misconception of what Feature Count is referring to in that table. See above answers.

A couple of additional comments. A lot of the answers I've provided here are easily found in the tutorials you have already seen, so again I would advise you to read through those carefully. Also, it is always a good idea to search this forum first for the questions you have, since it is very likely that the same or a related question has already been answered. This brings me to my last point: in the future, please limit your questions to one per thread, or at least keep all questions related to that topic. If you have additional, unrelated questions, please start a new topic for those. This helps us keep the forum organized and easy to search.

Hi Mehrbod,

Thanks for your patience; I really appreciate it. I have figured out most of my questions.
Some follow-ups:

1> “Which taxa do you not want? Is there a specific reason why you need to remove some?” – This is a good question. We are trying to target Archaea, and there are no perfect archaeal primers so far. If you do 16S sequencing and assign taxonomy, a lot of reads get assigned to Bacteria or remain unassigned. So we want to filter those out and keep only the reads assigned to Archaea.
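Something like the following sketch is what I have in mind – the --p-include string is just my guess and would depend on the reference taxonomy (e.g. it might need to be k__Archaea for Greengenes or D_0__Archaea for SILVA):

```
# Keep only features whose taxonomy string matches Archaea;
# everything assigned to Bacteria or left unassigned is dropped.
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include Archaea \
  --o-filtered-table archaea-only-table.qza
```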

2> “The justConcatenate option available in native DADA2 in R is not exposed in QIIME 2”

Just to confirm – if I have 100 F reads and 100 R reads, and their quality is perfect, should it show as “200” reads in the merged column (https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdocs.qiime2.org%2F2019.10%2Fdata%2Ftutorials%2Fatacama-soils%2Fdenoising-stats.qzv)?

Am I correct? Since we don't do any concatenation, we just simply pair them.

3> I know you don't advise concatenating into long reads. – I am wondering if longer reads would have some advantage for taxonomy assignment (more accuracy) in downstream analysis?

4> Thanks for the suggestion on the fungal analyses. I will read that link. Hmm, my fungal primers' barcode is on the reverse reads (EMP primers), so I'm not sure whether I should use the reverse reads in this case. Basically, I have three fastq files: F.fastq, R.fastq, barcodes.fastq. Since F.fastq has no barcode, I might not be able to demultiplex if I just use the single-read denoising workflow on the forward reads. Normally the forward read quality is better, so it doesn't make sense to me to use the reverse reads. I will see; if I have any questions, I will ask. – Any suggestions in advance?

5> The last question is about the qiime diversity core-metrics command and its output. I follow the Moving Pictures tutorial (https://docs.qiime2.org/2019.10/tutorials/moving-pictures/#featuretable-and-featuredata-summaries).

A>
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 1103 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-results
The default metrics are shannon, chao1, observed_otus, etc.? As far as I know, there are more than 20 metrics in QIIME 2. If I want to add one more metric, such as Good's coverage, how can I add it to the core diversity command?

Or do I have to do it manually as suggested here (Alpha and Beta Diversity Explanations and Commands)?
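For example, would a manual step like this sketch be the right way to add it (assuming goods_coverage is among the metrics offered by qiime diversity alpha, and using a table already rarefied to my chosen depth, e.g. from qiime feature-table rarefy)?

```
# Compute one extra alpha metric on an already-rarefied table
# (rarefied-table.qza is a placeholder name).
qiime diversity alpha \
  --i-table rarefied-table.qza \
  --p-metric goods_coverage \
  --o-alpha-diversity goods-coverage-vector.qza
```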

B> In the core alpha diversity analysis above, it rarefies the master feature table to an equal depth of 1103, right? I don't think the output of the core diversity analysis saves the subsampled feature table? Would it be possible for me to save it using this command? If I don't save this subsampled table, I can't use it for downstream analysis. I know I can rarefy the feature table manually and then calculate alpha diversity from the subsampled table, but the tutorial doesn't seem to do this. Mostly, I would use the subsampled feature table for taxonomic plots? Is this a good idea? When you make taxonomic plots for your research, do you use the full table or the subsampled (equal-depth) table?

C> About the “observed_otus” output. – I can generate an observed_otus_vector.qza file, and I use qiime tools export to export it to a CSV file.

It looks something like this:

|          | observed_OTUs |
|----------|---------------|
| sample 1 | 1000          |
| sample 2 | 200           |

I am confused about the header “observed_OTUs”. Does this mean observed ASVs? I used the DADA2 workflow and did not do OTU clustering.

6> This is a general question. Do you normally filter out rare OTUs/ASVs in your research? If you do, are there any general rules for this? The tutorial removes low-abundance features with a total frequency of less than 10. I am not sure if this is a general rule? I used 50 for my dataset; is this too high? I think this is a trade-off: if you remove too many, you get a nicer rarefaction curve, but you lose diversity.

What do you normally do?

Thanks in advance

Hi @moonlight,
As I mentioned before, please open new topics for these unrelated questions and we can answer them there. We can't just keep answering everything in one thread. Also, please do your due diligence in searching the forum for these questions, as I can guarantee most of them have already been asked and answered. Thanks


Sorry, I forgot. I have opened new topics.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.