Opened 9 years ago
Closed 9 years ago
#809 closed enhancement (fixed)
Improve performance of demux+merge step
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | critical | Milestone: | Reggie v3.6 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description
Sometimes this step takes too long. Typical run times are between 9 and 23 hours. We have investigated where the time is spent. There are two main steps:
- Demux
- Merge+Trimmomatic
The demux step is usually relatively quick, but it still varies between 1.5 and 6 hours. The second step varies between 6 and 18 hours. It can be divided into the following substeps (performed sequentially for each library):
- Merge
- Bowtie 2
- Trimmomatic 1
- Trimmomatic 2
- copy fragmentsize.txt > project-archive
- gzip R1.fastq.gz > project-archive
- gzip R2.fastq.gz > project-archive
The last 3 steps are important since we can use the file timestamps to check how much time is spent between them. For the faster jobs, a typical iteration takes about 30 minutes and most of the time is spent in the gzip steps (10 minutes/gzip). Jobs that take longer have a lot more variation. A single iteration can take more than 1 hour, but it is still the gzip steps that use most of the time.
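The timestamp comparison described above can be sketched roughly as follows. This is only an illustration in Python, not part of the Reggie job script; the function name `step_durations` and the convention of passing the files in the order they were written are assumptions.

```python
import os
from typing import List, Tuple


def step_durations(paths: List[str]) -> List[Tuple[str, float]]:
    """Return (path, seconds elapsed since the previous file was modified).

    `paths` must be listed in the order the files were written, for example
    fragmentsize.txt, R1.fastq.gz, R2.fastq.gz for a single library.
    """
    mtimes = [os.path.getmtime(p) for p in paths]
    return [
        (paths[i], mtimes[i] - mtimes[i - 1])
        for i in range(1, len(paths))
    ]
```

Each returned value approximates how long the step that produced that file took, under the assumption that nothing else ran between the two writes.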
The merge+trimmomatic step seems to have the biggest potential for improvement. We have a couple of ideas.
- Split the job into multiple jobs. The demux step is kept as it is, but temporary (merged) FASTQ files are stored back to the project archive. Then, the Trimmomatic step can be submitted as independent jobs for each library.
- Try to improve gzip performance by using lower compression and running both gzip commands in the background.
Since the second option requires only a few minor changes, we are going to try this first.
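The second idea (lowest compression level, both files compressed at the same time) could look something like the sketch below. It uses Python's standard `gzip` module instead of the `gzip` command that the job script runs, and the function names and file layout are made up for the illustration.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List


def gzip_file(src: Path, dest_dir: Path, level: int = 1) -> Path:
    """Compress `src` into dest_dir/<name>.gz; level 1 is the fastest setting."""
    dest = dest_dir / (src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dest, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def gzip_concurrently(sources: Iterable[Path], dest_dir: Path, level: int = 1) -> List[Path]:
    """Compress all sources at the same time, one worker thread per file."""
    sources = list(sources)
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # pool.map keeps the results in the same order as `sources`
        return list(pool.map(lambda s: gzip_file(s, dest_dir, level), sources))
```

Threads are used here only for brevity; a process pool, or simply two background commands in the shell script, would be alternatives.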
Change History (9)
comment:1 by , 9 years ago
Status: | new → assigned |
---|
comment:2 by , 9 years ago
comment:3 by , 9 years ago
I have now made some initial tests with running gzip -1 in the background. It is a debug run of 2 lanes from a NextSeq flow cell with 21 samples, resulting in 15.8M reads in total.
The results look promising.
Times | Before | After |
---|---|---|
Total | 36m | 17m |
Demux | 6m | 6m |
Total/library | 60-150s | 30-50s |
Trimmomatic | 10-20s | 10-20s |
Gzip time | 40-120s | 4-11s |
Gzip throughput | 5MB/s | 50MB/s |
The range in times more or less correlates with the size of the FASTQ files. Most are around 100MB (uncompressed) and are at the faster end. The largest one is around 250MB.
The cost of the speed increase is that the compressed FASTQ files are around 20% larger.
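The size/speed trade-off can be checked on any data set with a small sketch like the one below. It is written in Python with the standard `gzip` module and uses synthetic FASTQ-like data, so the numbers it produces will not match the measurements above; the function names are hypothetical and level 6 is used because it is the default level of the gzip command.

```python
import gzip
import random
import time
from typing import Dict, Iterable


def synthetic_fastq(n_reads: int, read_len: int = 75, seed: int = 0) -> bytes:
    """Build FASTQ-like test data (random bases and quality characters)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def compare_levels(data: bytes, levels: Iterable[int] = (1, 6)) -> Dict[int, Dict[str, float]]:
    """Compress `data` at each level and record the output size and wall time."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = gzip.compress(data, compresslevel=level)
        elapsed = time.perf_counter() - start
        results[level] = {"bytes": len(compressed), "seconds": elapsed}
    return results
```

Calling `compare_levels(synthetic_fastq(100000))` returns one entry per level, which makes it easy to see how much larger the level 1 output is and how much time it saves on a given machine.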
comment:4 by , 9 years ago
comment:5 by , 9 years ago
Just found out about Pigz: http://zlib.net/pigz/
I reverted to the original script (no background task, no -1 parameter) and replaced gzip with pigz. It seems to be almost as fast, and the FASTQ file sizes remain the same!
comment:6 by , 9 years ago
Results comparing gzip, gzip -1 and pigz.
Times | Gzip | Gzip -1 | Pigz |
---|---|---|---|
Total | 36m | 17m | 18m |
Demux | 6m | 6m | 6m |
Total/library | 60-150s | 30-50s | 30-60s |
Trimmomatic | 10-20s | 10-20s | 10-20s |
Compress time | 40-120s | 4-11s | 5-15s |
Compress throughput | 5MB/s | 50MB/s | 35MB/s |
comment:7 by , 9 years ago
Due to some file system problems on compute-2-0 I had to force my last test onto one of the compute-3 nodes by requiring 20 slots (=20 threads). It seems that pigz really benefits from this. The compression time is more or less not measurable. Most FASTQ files compress in less than 1 second. Total run time for the job was 10 minutes. The demux part used about 5 minutes and the total time for each library was typically 15-20 seconds, with Trimmomatic down to ~5-8 seconds and Bowtie ~10 seconds.
It might be worth increasing the number of slots given to the demux job. Even if the Picard steps don't benefit that much, the Trimmomatic and compression steps do.
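One way to tie the number of pigz threads to the slots given to the job is sketched below. It assumes that the scheduler exports the slot count in an `NSLOTS` environment variable (as Sun Grid Engine does); the ticket does not say which variable the cluster uses, so treat that name and the helper functions as hypothetical. The `-p` option is pigz's own option for limiting the number of compression threads.

```python
import os
import shutil
import subprocess
from typing import List, Mapping, Optional


def pigz_command(path: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build a pigz command line that uses one thread per allocated slot.

    Falls back to pigz's default thread count when NSLOTS (an assumed
    variable name) is missing or not a positive integer.
    """
    env = os.environ if env is None else env
    cmd = ["pigz"]
    slots = env.get("NSLOTS", "")
    if slots.isdigit() and int(slots) > 0:
        cmd += ["-p", slots]
    cmd.append(path)
    return cmd


def compress_with_pigz(path: str) -> None:
    """Run pigz on `path`, replacing it with `path`.gz."""
    if shutil.which("pigz") is None:
        raise RuntimeError("pigz is not installed on this node")
    subprocess.run(pigz_command(path), check=True)
```

Without `-p`, pigz uses all online processors by default, which may be more than the job was actually given on a shared node.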
comment:8 by , 9 years ago
comment:9 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I think it is time for some real tests now. What has been done so far will be included in the next Reggie release. If we find that more improvements are needed, a new ticket should be created for a later release.
(In [3485]) References #809: Improve performance of demux+merge step
Added time measurements to the generated job script. It will output the current time after most steps and save that to the 'time.txt' file in the job results folder. The file is not imported back to BASE.
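The kind of time logging described above can be illustrated with a few lines of Python. The real measurements are written by the generated job script, so this is only a sketch of the idea; the helper name `log_step` and the tab-separated line format are assumptions, not the actual format of 'time.txt'.

```python
import time
from pathlib import Path


def log_step(time_file: Path, step: str) -> None:
    """Append the step name and the current local time to `time_file`."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(time_file, "a", encoding="utf-8") as fh:
        fh.write(f"{step}\t{stamp}\n")
```

Calling `log_step(Path("time.txt"), "demux")` after each step gives one line per step, and the difference between consecutive lines shows where the time went.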