Working with ITS data on a supercomputer/server

Dear all, I am new to QIIME and have decided to learn QIIME 2.
I’ve gone through several tutorials on the QIIME website, and I am now trying to work on my own sequencing results: 48 fastq.gz files for 24 ITS samples.

At the moment, I am running the DADA2 denoising step using the command “qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 250 --p-trunc-len-r 250 --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza”

Based on my experience with the official tutorial, I expect this step to take days, assuming my command is correct.
I have also heard that using a computer server or supercomputer may significantly reduce the running time. So my questions are: (1) are my paired-end command and parameter choices good? (2) More importantly, how can I use or connect to a computer server, for example the one at my university?

Cheers

Hi @hongwei2017,

The parameters you choose should be based on where your quality score starts dropping off, so without summary visualization of your sequences, I can't really give any specific advice.

That being said, for ITS, you'll want to make sure that there are no reverse primers on your forward reads. We don't have anything in QIIME 2 to help with that (yet). But after trimming primers, you may want to disable truncation entirely by setting the trunc-len parameters to 0, as some of your forward reads may end up shorter than others (because they ran into the reverse primer). That part is kind of up to you, but as it stands, any read shorter than 250 will be dropped.

As for connecting to a server or cluster, we can't help you with that here; you'll want to contact your university's IT or HPC department. They may have user training or other resources for using their cluster.

Hi Ebolyen,
Thanks for replying! In my case, the reverse reads (30~300 bp) dropped in quality more quickly than the forward reads. The quality score dropped below 20 after 200 bp for the reverse reads and after 260 bp for the forward reads. So my choice of truncating both forward and reverse reads at 250 bp was not wise!
FastQC also shows overrepresented sequences of about 50 bp (see the attached picture). I have checked: the first 19 and 20 bp are the forward and reverse primer sequences, respectively, but I have no idea what the remaining 30 bp are. Maybe they are adapters? I will double check.

(1) Do you think my command is OK now?

$ qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 250 --p-trunc-len-r 200 --p-trim-left-f 50 --p-trim-left-r 50 --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza

(2) Is there anything I need to watch out for when using QIIME 2 to analyse ITS sequencing data? I have heard that fungal sequencing analysis is more complex. The primers I used are gITS7 and ITS4.

Cheers

One more question: do QIIME users need to extract the ITS region with ITSx first and then import the extracted sequences into QIIME?
I am also wondering whether you have a detailed guideline for ITS analysis. I would really love to have one.

Hi @hongwei2017, the proposed parameters for dada2 denoise-paired look okay, although I think @ebolyen suggested setting the trunc-len parameters to 0 so that your shorter sequences aren't removed. Ultimately I think you are going to need to experiment with these parameters a bit: run denoise-paired with a few different settings and see how the results are impacted (and let us know what you find!).
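
A minimal sketch of that kind of experiment, in case it helps. The truncation pairs below are only illustrative guesses (the 250:200 pair and the trim-left value of 50 come from the command earlier in this thread; 230:180 is made up), and the file name is the one used above. It writes each run to its own --output-dir so the results can be compared side by side, and it only prints a message if QIIME 2 or the input file is not available.

```shell
# Input artifact from earlier in this thread.
DEMUX=paired-end-demux.qza
# forward:reverse truncation lengths to try (illustrative values; 0 disables truncation).
SETTINGS="250:200 230:180 0:0"
PLANNED=""

for s in $SETTINGS; do
  f=${s%%:*}   # part before the colon
  r=${s##*:}   # part after the colon
  out="dada2-f${f}-r${r}"
  PLANNED="$PLANNED $out"
  # QIIME 2 refuses to write into an existing output directory, so skip those.
  if command -v qiime >/dev/null 2>&1 && [ -f "$DEMUX" ] && [ ! -e "$out" ]; then
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs "$DEMUX" \
      --p-trunc-len-f "$f" --p-trunc-len-r "$r" \
      --p-trim-left-f 50 --p-trim-left-r 50 \
      --output-dir "$out"
  else
    echo "skipping $out (QIIME 2 or $DEMUX missing, or $out already exists)"
  fi
done
```

Comparing how many reads and features survive in each output directory should show which setting loses the least data.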

Generally speaking, the downstream analyses are pretty similar, so you shouldn't have to do much (or anything) differently. @gregcaporaso wrote up a draft ITS analysis tutorial; perhaps that is worth taking a look at!
Thanks!

Another caveat of ITS data is that it’s hard to produce a meaningful phylogenetic tree that works well with phylogenetic diversity metrics such as Faith’s PD or UniFrac. When performing diversity analyses with ITS data you’ll need to use non-phylogenetic metrics for alpha and beta diversity (e.g. Observed OTUs/Species, Shannon, Bray-Curtis, Jaccard, etc.).

Unfortunately the qiime diversity core-metrics workflow requires a phylogenetic tree, so you won’t be able to use that command with your data. We have an open issue to make core-metrics work without a phylogenetic tree – it’ll likely be in the upcoming 2017.8 release which will go live next week. We’ll follow up here when it’s available!

In the meantime, you can run each of the steps individually instead of using core-metrics – see @ebolyen’s suggestions in this topic.

@jairideout @thermokarst I read in the topic “Import --i-reference-taxonomy taxonomy.tsv to .qza” that BLAST+ generally performs better with ITS data than vsearch. With one of my data sets it also performed better than training a feature classifier. Is there a specific taxonomic assignment method that you would suggest for all ITS2 data, or does it seem to vary depending on the data set?

Hi Jairideout,

Your reply to my questions is much appreciated! I will see if I can find a way to complete the ITS analysis correctly.

Cheers

Sounds like you are able to corroborate the results!

The linked post by @BenKaehler is probably still the way to go, but pinging him and @Nicholas_Bokulich as they might have fresh information on this.

That is interesting but not necessarily surprising: classify-consensus-blast and classify-sklearn, once optimized, perform similarly on some ITS datasets that we have tested. How are you defining "better" in this case, though? Are you testing on data for which you know the "correct" classifications? Or does better mean more specific taxonomic information (e.g., classify-sklearn only classifies to genus while BLAST+ predicts species)? If the latter, you may want to be cautious in how you choose: the more ambiguous classifications may be more reliable, e.g., if the classifier cannot resolve similar species, or if a species is not present in the reference dataset. We find BLAST, in particular, to be more prone than classify-sklearn to these types of overclassification errors, but if you have well-characterized sample types with fairly predictable compositions, then it may even be preferable to have such a bold classifier.

classify-sklearn is my recommendation, essentially for the reasons mentioned above. We have tested ITS1, not ITS2, so our current recommendations may not be 100% faithful to the peculiarities of ITS2 but are probably reliable enough for rule of thumb. @BenKaehler , have any other thoughts?

As usual, @Nicholas_Bokulich, I have little to add.

I too would be interested in how performance is being measured.

Correct me if I’m wrong, @Nicholas_Bokulich, but particularly for ITS I think users should avoid trimming the reference sequences before training the sklearn classifier or using BLAST+.

@Nicholas_Bokulich I measure performance by manually running individual sequences through the UNITE database. That way I can see multiple ‘best matches’ for each sequence and use this to determine the most preferred taxonomic assignment. @BenKaehler I have also found that trimming reduces accuracy in BLAST+.

Hi @jairideout,

I am confused here. I am analysing ITS sequencing data, and I now have the table.qza and rep-seqs.qza files, so I assume I have finished the denoising step.

What steps should I do next? According to the 16S tutorial, I should calculate diversity, but as you mentioned, core-metrics doesn’t work for ITS, so the metrics need to be calculated separately. I could not find a relevant tutorial on the QIIME 2 website for calculating Shannon, Simpson and observed OTUs other than by using core-metrics. Could you please guide me through this by providing some links?

Then, for the taxonomy analysis, the 16S tutorial uses

“qiime feature-classifier classify-sklearn \
--i-classifier gg-13-8-99-515-806-nb-classifier.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza”

In the case of ITS, what should I use as the classifier? Should I download one from somewhere or train my own?

I wish there were a detailed tutorial for ITS sequencing analysis.

THANKS

Hi @ebolyen,

I would like to generate an OTU table from QIIME that contains both frequencies and taxonomy, but I do not know which commands can do this. Do you have any idea how to export something like a BIOM file or OTU table from QIIME 2? Also, for ITS analysis, how do I calculate diversity with separate commands rather than with core-metrics? I know there should be methods for calculating Shannon, Simpson and observed OTUs diversity, but I don’t know how to run them.

Thanks in advance,

Yup!

Core metrics mostly runs a couple different invocations of the following commands:

qiime feature-table rarefy

and then with the resulting table, it runs these commands:

qiime diversity alpha
qiime diversity beta
qiime diversity pcoa

You can run those with --help to learn more!
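
As a rough sketch of how those commands chain together for the non-phylogenetic case: the table name is the one from the denoising command earlier in this thread, and the sampling depth of 1000 is a placeholder that should be replaced with a value chosen from your own table summary. The block only prints a message if QIIME 2 or the table is not found.

```shell
TABLE=table-dada2.qza
DEPTH=1000                        # placeholder: choose a depth from your own table summary
BETA_METRICS="braycurtis jaccard" # non-phylogenetic beta diversity metrics

if command -v qiime >/dev/null 2>&1 && [ -f "$TABLE" ]; then
  # Rarefy once, then reuse the rarefied table for every metric.
  qiime feature-table rarefy \
    --i-table "$TABLE" \
    --p-sampling-depth "$DEPTH" \
    --o-rarefied-table rarefied-table.qza

  for metric in $BETA_METRICS; do
    qiime diversity beta \
      --i-table rarefied-table.qza \
      --p-metric "$metric" \
      --o-distance-matrix "${metric}-distance-matrix.qza"
    qiime diversity pcoa \
      --i-distance-matrix "${metric}-distance-matrix.qza" \
      --o-pcoa "${metric}-pcoa.qza"
  done
  STATUS=ran
else
  echo "QIIME 2 or $TABLE not found; nothing was run."
  STATUS=skipped
fi
```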

Shannon, Simpson and observed OTUs in particular are covered by the qiime diversity alpha command above.
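
For example, something along these lines should produce one alpha diversity vector per metric. This is only a sketch: rarefied-table.qza is an assumed name for a table you have already rarefied, and the metric list is limited to shannon and simpson because the name of the observed-richness metric differs between releases (qiime diversity alpha --help lists the valid choices for your version).

```shell
RAREFIED=rarefied-table.qza       # assumed name of an already rarefied table
ALPHA_METRICS="shannon simpson"   # add the observed-richness metric listed by --help
OUTPUTS=""

for metric in $ALPHA_METRICS; do
  out="${metric}-vector.qza"
  OUTPUTS="$OUTPUTS $out"
  if command -v qiime >/dev/null 2>&1 && [ -f "$RAREFIED" ]; then
    qiime diversity alpha \
      --i-table "$RAREFIED" \
      --p-metric "$metric" \
      --o-alpha-diversity "$out"
  else
    echo "skipping $out (QIIME 2 or $RAREFIED not found)"
  fi
done
```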

You could either train your own classifier for ITS data or as mentioned above in this thread, try classify-consensus-blast.
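
A sketch of both routes, under several assumptions: unite-ref-seqs.qza and unite-ref-taxonomy.qza are hypothetical names for reference sequences and taxonomy that you have already imported as QIIME 2 artifacts (left untrimmed, as suggested earlier in this thread), and rep-seqs-dada2.qza is the output of the denoising step. The METHOD variable is just a convenience for this example, and the BLAST route uses --output-dir because the set of output artifacts may differ between releases.

```shell
REP_SEQS=rep-seqs-dada2.qza
REF_SEQS=unite-ref-seqs.qza        # hypothetical: imported reference sequences
REF_TAX=unite-ref-taxonomy.qza     # hypothetical: imported reference taxonomy
METHOD=${METHOD:-sklearn}          # "sklearn" or "blast"

if command -v qiime >/dev/null 2>&1 &&
   [ -f "$REP_SEQS" ] && [ -f "$REF_SEQS" ] && [ -f "$REF_TAX" ]; then
  case "$METHOD" in
    sklearn)
      # Train a naive Bayes classifier on the reference, then classify.
      qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads "$REF_SEQS" \
        --i-reference-taxonomy "$REF_TAX" \
        --o-classifier its-nb-classifier.qza
      qiime feature-classifier classify-sklearn \
        --i-classifier its-nb-classifier.qza \
        --i-reads "$REP_SEQS" \
        --o-classification taxonomy-sklearn.qza
      ;;
    blast)
      # No training step; classification is a consensus of BLAST+ hits.
      qiime feature-classifier classify-consensus-blast \
        --i-query "$REP_SEQS" \
        --i-reference-reads "$REF_SEQS" \
        --i-reference-taxonomy "$REF_TAX" \
        --output-dir taxonomy-blast
      ;;
    *)
      echo "unknown METHOD: $METHOD"
      ;;
  esac
  STATUS=ran
else
  echo "QIIME 2 or one of the input artifacts not found; nothing was run."
  STATUS=skipped
fi
```

Running it once with METHOD=sklearn and once with METHOD=blast makes it possible to compare the two sets of assignments on the same representative sequences.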

A detailed ITS tutorial is definitely on our to-do list. We're still missing a few things for upstream processing, but it's mostly feasible right now. Maybe a particularly motivated member of this forum could consider adding a Community Tutorial for ITS in the meantime :wink:

QIIME 2 2017.9 now runs only non-phylogenetic metrics when executing core-metrics (which is pertinent to ITS analyses). To include phylogenetic metrics you can run core-metrics-phylogenetic! Check out the release announcement to learn more! :tada:

QIIME 2 2017.12 has a new cutadapt plugin, which provides trim-paired and trim-single. These can be used to remove the reverse primer from your forward reads!
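
A minimal sketch of how that could look for the gITS7/ITS4 primer pair mentioned earlier in this thread. The primer sequences below are the commonly published ones and are an assumption here, so please check them against your own primer order. When a read runs through a short amplicon, the opposite primer appears at its 3' end as a reverse complement, which is why the helper function (hypothetical, written just for this example) reverse-complements each primer first. This only handles the read-through primers at the 3' ends; the primers at the start of each read are left to whatever you already do for them (e.g. the trim-left parameters in DADA2).

```shell
# Reverse-complement a primer, including IUPAC degenerate bases.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVN' 'TGCAYRMKVHDBN'
}

# Assumed primer sequences; verify against your own primer order.
GITS7=GTGARTCATCGARTCTTTG    # forward primer (19 bp)
ITS4=TCCTCCGCTTATTGATATGC    # reverse primer (20 bp)
GITS7_RC=$(revcomp "$GITS7")
ITS4_RC=$(revcomp "$ITS4")

DEMUX=paired-end-demux.qza
if command -v qiime >/dev/null 2>&1 && [ -f "$DEMUX" ]; then
  # Forward reads can run into reverse-complemented ITS4,
  # reverse reads into reverse-complemented gITS7.
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$DEMUX" \
    --p-adapter-f "$ITS4_RC" \
    --p-adapter-r "$GITS7_RC" \
    --o-trimmed-sequences paired-end-demux-trimmed.qza
else
  echo "QIIME 2 or $DEMUX not found; nothing was run."
fi
```

The trimmed artifact can then be passed to dada2 denoise-paired in place of the original demultiplexed sequences.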