Unsure what kind of data I have and how to import.

bpscherer · March 9, 2020, 6:55pm

So I recently got involved in a collaboration with some people on my campus. Essentially the researcher had contracted with Ubiome to do their sequencing, but then the company went bankrupt and didn't fulfill their end of things. They ended up with a bunch of raw data, and I was asked to join in. I am excited, but a little unsure of the importing of the data.

I have 35 samples, but the number of files per sample is confusing me. I have eight fastq/GZ files per sample, 4 of them are R1 and 4 of them are R2. The number of files doesn't seem to match up with anything I've found on Qiime docs.

Now, I am used to having one forward and one reverse file, and am confused by this number of samples. I didn't do the sequencing myself, but I was told they used the following from Almonacid et al., 2017.

"Consolidated libraries were quantified by quan- titative real-time PCR using the Kapa Bio-Rad iCycler qPCR kit on a BioRad MyiQ before loading into the sequencer. Sequencing was performed in a pair-end modality on the Illumina NextSeq 500 platform rendering 2 x 150 bp pair-end sequences."

Any help at all would be greatly appreciated! Thanks in advance!

Almonacid, D. E., Kraal, L., Ossandon, F. J., Budovskaya, Y. V., Cardenas, J. P., Bik, E. M., … Apte, Z. S. (2017). 16S rRNA gene sequencing and healthy reference ranges for 28 clinically relevant microbial taxa from the human gut microbiome. PLoS ONE, 12(5), 1–15. 16S rRNA gene sequencing and healthy reference ranges for 28 clinically relevant microbial taxa from the human gut microbiome

colinbrislawn · March 10, 2020, 1:04pm

Good morning Brendan,

Welcome to the Qiime forums! :qiime2: It's good of you to help out your colleague, and I hope we can help you out too.

As you mentioned, R1 is the forward read and R2 is the reverse read for each sample.

The _L00[1,2,3,4] suffix is the lane of the Illumina sequencer. The Illumina NextSeq has 4 lanes, so this makes sense

Most labs put different samples on different lanes, but it looks like Ubiome used lanes as replicates! You should be safe to merge these technical reps after importing, or concatenate them now and continue to import. Both ways work

Looks like you are using Windows. Qiime 2 runs on unix-like systems, so you might need to install Windows Subsystem for Linux or get your hands on a linux server or Macbook.

I quickly checked the Ubiome paper you linked, and it looks like this is their process:
swarm d=1 fastidious -> uchime vsearch -> proportaty tax assignment to Silva 123 database

Their process of denoising, chimera removal, and taxonomy assignment sounds good!
How might you do a similar process in Qiime 2?
Just follow the Parkinson’s Mouse Tutorial!

Let us know if you have any questions!
Colin

bpscherer · March 11, 2020, 6:52pm

Thanks for the in-depth reply Collin!

I'm running QIIME through a server provided by my institution. I've done this successfully for some of my own data, and am just adapting as I go.

I ended up concatenating and then importing into QIIME2 successfully.

However, once I go through DADA2 I am losing almost all of my reads. I'll try to share what I have below.

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path catfiles
--input-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path output/demux.qza

qiime demux summarize
--i-data output/demux.qza
--o-visualization output/demux.qzv

Below is from my demux.qzv.

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux.qza
--p-trim-left-f 5
--p-trim-left-r 5
--p-trunc-len-f 150
--p-trunc-len-r 150
--o-table 1table.qza
--o-representative-sequences 1rep-seqs.qza
--o-denoising-stats 1denoising-stats.qza &

Below is from my DADA2 table

I'm rerunning my DADA2 step without any truncation, but I'm not sure that it will make a difference.

Do you have any insight as to why I'm losing nearly all my data?

Thanks again!

colinbrislawn · March 11, 2020, 8:43pm

Thanks for the update Brendan,

There should be some plots of read quality you get after importing. Can you post those?

Our goal here is to pick a combination of --p-trim and --p-trunc settings that will keep the maximum number of reads, and those quality plots will let us make an educated guess.

Thanks!
Colin

bpscherer · March 11, 2020, 8:51pm

HI Colin,

Here you go. I basically kept everything I think, but I'm not sure I did it right. Each forward and reverse read should be ~150 bp. Quality didn't seem to drop too much so I just kept everything aside from those first 5 which I thought might be the barcodes?

I'm currently rerunning it without taking anything off the front.

Thanks for any help!

Cheers,

Brendan

colinbrislawn · March 11, 2020, 9:43pm

OK, your quality looks good. And I feel like your settings should work...

Looks like they sequence the V4 region...

PCR amplification of the 16S rRNA genes was performed with primers containing universal primers amplifying the V4 variable region (515F: GTGCCAGCMGCCGCGGTAA and 806R: GGACTACHVGGGTWTCTAAT) [24].

Those primers imply that the amplicons should be ~300 bp long (806-515 = 291), and you only have 150 bp in each direction... so your reads might not have enough overlap to pair. That would explain why

There has got to be a workaround! Let's see what @thermokarst recommends.

Colin

bpscherer · March 13, 2020, 3:35pm

Interesting.

There must be some way of getting somewhere, I recently learned that my colleague got the raw fastq files from Ubiome, then sent them to Thryve which sent back excel files. So for each sample I also have an excel file that lists the bacterial taxa and the number of reads associated with that taxa. I think I could combine all these files into one table of reads/relative abundances, but I thought it was best to actually go through the bioinformatics myself.

I've tried running DADA2 twice now with no truncating. I'm running it on our local server, but it just seems to run for hours and then never gives any output. I don't see an error or anything, but the process definitely comes to an end, but no files show up in the output path. I'm not sure what's up with that.

Here's what I'm running, in case it is helpful.

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 150
--p-trunc-len-r 150
--o-table 1_table.qza
--o-representative-sequences 1_rep-seqs.qza
--o-denoising-stats 1_denoising-stats.qza &

colinbrislawn · March 13, 2020, 3:43pm

totally agree

I'm glad you got those raw files. You could try running dada2 using only the forward reads, which might be faster and will avoid any overlap problems.

For dada2, check out all the options here:
https://docs.qiime2.org/2020.2/plugins/available/dada2/denoise-single/

Specifically, make sure you set --p-n-threads as high as possible, as it's 1 by default. You can also reduce --p-n-reads-learn from 1 million (the default), to maybe 10k or so. This doesn't reduce quality much, but should increase speed.

Keep in touch, Brendan,
Colin

bpscherer · March 13, 2020, 3:58pm

I'm currently giving deblur a go, following the Moving Pictures tutorial.

I haven't used deblur before, but I thought I'd see where it gets me.

I will revisit DADA2 with your advice next.

Thanks for all the help and insight!

Best,

Brendan

bpscherer · March 14, 2020, 3:51pm

So I went through deblur, but the results are the exact same. Maybe that shouldn't be a surprise, but I wasn't sure what to expect.

qiime quality-filter q-score
--i-demux demux.qza
--o-filtered-sequences demux-filtered.qza
--o-filter-stats demux-filter-stats.qza

qiime deblur denoise-16S
--i-demultiplexed-seqs demux-filtered.qza
--p-trim-length 150
--o-representative-sequences rep-seqs-deblur.qza
--o-table table-deblur.qza
--p-sample-stats
--o-stats deblur-stats.qza

Any ideas from @thermokarst?

Thank you for all your help so far! I hope you're staying safe amidst the COVID pandemic.

colinbrislawn · March 14, 2020, 8:00pm

Afternoon Brendan,

I think I just made a rookie mistake:
That's a summary of Features Per Sample, which means different ASVs per sample. So you might still have 1000s of reads in each sample, but if this is a low diversity community, you might expect to have a low number of feature. For example, if you had an axenic culture, you would hope to see only 1 feature in that sample.

Denoising doesn't suffer from OTU inflation, so low counts is good. Low sparsity also make stats easier.

What's the expected richness of these samples? Maybe 10~20 features is perfect!

Instead of features per sample, could you get a list of reads per sample? Once we know our read counts are high, then we are good to go.

Colin

bpscherer · March 15, 2020, 4:59pm

Hi Colin,

So I think that table was a little misleading. There are a number of samples with 0 features, which is really the alarming bit. These are human fecal samples, so I was hoping for somewhat more diversity than what is making it through DADA2.

In the DADA2 stats qzv, it looks like very few reads are getting merged. Maybe this is due to the lack of overlap?

Still not sure what to do, though at least I have lots of time with everything shutting down.

Thanks again for any insight!

Brendan

colinbrislawn · March 15, 2020, 7:46pm

Hello Brendan,

There's the table we are looking for! And finally our problem is obvious:

Correct.

Given the read length compared to the amplicon length, I don't think you will be able to merge these reads. To keep this simple, can you try reprocessing using only the forward reads and see how many reads make it through and how many features you get?

Thanks,
Colin

bpscherer · March 17, 2020, 8:47pm

I ran it with just the forwards and it worked!

I'm futzing around with the data and trying out some alpha and beta diversity stuff, we'll see how it goes!

Do you think there is any work around or way to get the reads to merge? I wonder if Thryve used both directions or if they just did the forwards too.

Thanks again for all your help!

colinbrislawn · March 17, 2020, 9:52pm

Nice!

Great. Feel free to open a new thread if you have questions about the diversity metrics or whatnot.

Probably not. If the reads are shorter than the region, it's a biological problem we can't with software.
You could processing just the reverse reads, and see how that compares.
You could also revisit the 'justConcatenate` option by running dada2 directly, but that has its own issues.

Keep in touch!
Colin