Is mafft issue solved?

yeojuny · October 12, 2017, 10:32am

Continuing the discussion from Alignment Problem:

I had exactly same error massage with this. and I search the answer but I couldn't find the solution.
actually my rep-sea.qza file is quite big 32M. so even though I used 224Gb memory computer, and during running the alignment process used almost 60% of memory. But in the end I could't get any output. only got that Plugin error message. It was no problem when I used smaller size of rep-sea sequence files which I just ran before this file. Is this related with the size of file?

jairideout · October 12, 2017, 5:00pm

Hi @yeojuny! Unfortunately the MAFFT out-of-memory issue has not been solved yet. We're waiting on the upstream Deblur package, which is one of qiime2's dependencies, to update its MAFFT to a newer version where this memory error is handled in a better way.

I created an issue on Aug 25 to have Deblur update its MAFFT version but I'm not sure if there's been movement on that yet. I'm hoping this will get fixed for the upcoming qiime2 2017.10 release. We'll follow up here when this fix is available in a release!

@wasade can you provide an ETA for when Deblur will have an updated MAFFT?

Yes, I'd suspect you're running out of memory while trying to align that many sequences. Since you're able to align smaller numbers of sequences, it doesn't seem like an installation/deployment issue with MAFFT or qiime2. You'll probably need a computer with more memory to align this many sequences, but I also wonder why you have so many representative sequences in the first place. How did you generate these representative sequences?

yeojuny · October 13, 2017, 2:28am

Thank you @jairideout for kind info.

Regarding your question about my representative sequences,
I denoised merged-fasta qza file (53GB, total 174 million reads) by DADA2.
After DADA2, I've got around 45% sequences after filtering and de-noising, which is around 81 million reads.
so from this procedure, two files, rep-sea.qza (32MB) and table.qza (11MB) were produced.
We also concerned about too many features (~300,000) actually. But we thought which is normal with this big data. Although I wished to upload my original log file here, it was not allowed because it's 7.5MB txt file. So I will upload the part of the whole log file (abstracted by me). Alignment_log.txt (4.6 KB)

wasade · October 13, 2017, 7:11pm

Hey @jairideout,

The PR for this has been open in both Deblur and q2-deblur for a month but nearly closed out. I think it is feasible to cut the Deblur micro release in time for 2017.10. Thanks for checking in on this.

Best,
Daniel

jairideout · October 13, 2017, 8:24pm

Wow, 300,000 is a lot of features! I suspect something isn't right here with the denoising of your data. Can you provide me with the following info?

What version of qiime2 are you running? You can get that info by running qiime info.
Did you denoise your data with q2-dada2 (i.e. the qiime2 DADA2 wrapper) or with the R DADA2 library directly? What were the command(s) you ran, and what output did they produce?
Is this Illumina data, 454, or something else?
Is this 16S amplicon sequence data, ITS, or something else?
Have all sequencing artifacts been removed from your sequences before running DADA2? Things like adapters, primers, and barcodes need to be trimmed off before denoising with DADA2.
How did you produce the FASTA file of merged sequences? DADA2 requires quality scores to denoise so I'm not sure how it'd be possible to denoise a FASTA file (at least with qiime2). If you're working with paired-end data, DADA2 will work better if you leave the reads unjoined and let DADA2 do the joining for you.

Thanks!

yeojuny · October 14, 2017, 3:48am

So I will go through your questions one by one.

--> Here output of qiime info
System versions
Python version: 3.5.3
QIIME 2 release: 2017.6
QIIME 2 version: 2017.6.0
q2cli version: 2017.6.0

Installed plugins
alignment 2017.6.0
composition 2017.6.0
dada2 2017.6.0
deblur 2017.6.0
demux 2017.6.0
diversity 2017.6.0
emperor 2017.6.0
feature-classifier 2017.6.0
feature-table 2017.6.0
metadata 2017.6.0
phylogeny 2017.6.0
quality-filter 2017.6.0
taxa 2017.6.0
types 2017.6.0

--> I did denoise in qiime 2.

$ qiime dada2 denoise-paired --i-demultiplexed-seqs merged-sequences.qza --p-trim-left-f 17 --p-trim-left-r 21 --p- trunc-len-f 280 --p-trunc-len-r 240 --p-max-ee 2 --p-trunc-q 2 --p-chimera-method consensus --verbose --o-representative-sequences repseq.qza --o-table table.qza --p-n-threads 36

--> output were two files, repseq.qza and table.qza

    --> Illumina dada

      --> 16S amplicon, V3-V4

--> yes I removed all artifacts. actually we realized this effects really hugely on the DADA2 result. so when I compared the results with or without trimming out the primer sequence, 'without primer' sequence produced almost ten times more than 'with primer' sequences. The other influence was the chimera method, 'pooled' option made sequences discarded a lot. but to be honest, I don't know which one is better for filtering the real chimeras.

--> I followed the paired-end merging protocol with manifest file in QIIME2 tutorial page (Importing data — QIIME 2 2017.9.0 documentation).

Thanks for diagnosing my problems here in advance!

jairideout · October 16, 2017, 11:10pm

Thanks for following up with those details @yeojuny! I'll respond inline:

I noticed you're running qiime2 2017.6. That's a pretty old release and we can only support the latest release for the time being. Can you update to 2017.9 before running the analyses described below?

You'll definitely want to remove any primers, barcodes, adapters, etc. from your sequences before denoising with DADA2.

How many features (i.e. ASVs) are produced when you use the consensus chimera detection method vs. pooled? Can you also check the difference in the number of reads between the two methods?

Have you joined your paired-end reads in any way before importing them into qiime2? I'm guessing the answer is no because you were able to import as paired-end data, but I wanted to double-check. DADA2 works best with unjoined paired-end sequences as input, because it will join the sequences for you after denoising.

Can you also try running dada2 denoise-single on your sequences? You can use the merged-sequences.qza with denoise-single, it'll ignore the reverse reads and only denoise the forward reads. How many features and reads do you get with denoise-single vs denoise-paired?

ebolyen · October 27, 2017, 6:43pm

We were able to update MAFFT in QIIME 2 2017.10, so the memory issue should be fixed!

yeojuny · October 31, 2017, 2:42am

Thanks a lot for noticing the update.

yeojuny · October 31, 2017, 3:34am

Sorry for late reply though. I finally did what you asked.

So I updated to 2017.9, then two days ago, 2017.10

So here I can't compare directly between consensus vs pooled, because when I did consensus method, I removed primer sequences. Before consensus method, I did pooled method without removing primer sequences, and it took over 1 week to finish this dada2 script. I only can compare the number of features between 'consensus without primer seqs' vs 'pooled with primer seqs', which is 292,323 vs 165,110. Read count between ‘Consensus without primer’ vs ‘Pooled with primer’ was 80,921,008 vs 30,966,556.

I imported the sequences through "qiime2 tools import" with paired-end manifest file, to make "demux.qza", which is then the input file of DADA2 denoising.

denoise-single produced 357,579 features, more than 292,323 features by denoise-paired. Can I ask what do you want to look from this result?

Best,
Yeojun

jairideout · October 31, 2017, 11:32pm

Thanks for following up @yeojuny!

Great! Did you rerun your analyses with the new release? Do you see any differences in the results?

That's unfortunate that it takes so long to run! To decrease runtime, try using the --p-n-threads option to run DADA2 in parallel.

I'll reiterate that you will not obtain reasonable results if your sequence data contains sequencing artifacts such as primers. Can you try rerunning pooled chimera checking on your sequence data that has the primers removed? Otherwise there isn't a comparison we can make here between the chimera checking methods.

Great, that sounds like the right workflow! Thanks for confirming.

When you ran denoise-single, did you use the sequencing data that has primers removed?

Can you also provide the .qzv file produced by qiime demux summarize? That'll let me take a look at the quality scores in your forward and reverse reads in case there's anything going on there.

Thanks @yeojuny!

system · December 2, 2017, 5:33am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.