Improve performance of demux+merge step
|Reported by:||Nicklas Nordborg||Owned by:||Nicklas Nordborg|
Sometimes this step takes too long. Typical run times are between 9 and 23 hours. We have investigated where the time is spent. There are two main steps:
The demux step is usually relatively quick, but it still varies between 1.5 and 6 hours. The second step takes between 6 and 18 hours. It can be divided into the following substeps (performed sequentially for each library):
- Bowtie 2
- Trimmomatic 1
- Trimmomatic 2
- copy fragmentsize.txt > project-archive
- gzip R1.fastq.gz > project-archive
- gzip R2.fastq.gz > project-archive
The last three steps are important because we can use the file timestamps to check how much time is spent between them. For the faster jobs, a typical iteration takes about 30 minutes, and most of the time is spent in the gzip steps (about 10 minutes per gzip). Jobs that take longer show a lot more variation. A single iteration can take more than 1 hour, but it is still the gzip steps that use most of the time.
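The timestamp check described above can be sketched roughly as follows. This is only an illustration, not the script we actually used: the function name `step_durations` and the assumption that each library has its own folder containing exactly these three file names are hypothetical.

```python
import os
from datetime import timedelta

# Assumed file names, listed in the order the job writes them.
STEP_FILES = ["fragmentsize.txt", "R1.fastq.gz", "R2.fastq.gz"]

def step_durations(library_dir, names=STEP_FILES):
    """Return (file name, time elapsed since the previous file was written)."""
    # The modification time marks the moment a file was last written to,
    # so the gap between two consecutive files approximates one step.
    mtimes = [os.path.getmtime(os.path.join(library_dir, n)) for n in names]
    return [
        (name, timedelta(seconds=later - earlier))
        for name, earlier, later in zip(names[1:], mtimes, mtimes[1:])
    ]
```

Calling `step_durations` on one library folder in the project archive would return one entry for `R1.fastq.gz` and one for `R2.fastq.gz`, each giving roughly the duration of the corresponding gzip step. This only works if the copy and gzip steps do not preserve the original modification times.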
The merge+Trimmomatic step seems to have the biggest potential for improvement. We have a couple of ideas:
- Split the job into multiple jobs. The demux step is kept as it is, but temporary (merged) FASTQ files are stored back to the project archive. Then, the Trimmomatic step can be submitted as independent jobs for each library.
- Try to improve gzip performance by using lower compression and running both gzip commands in the background.
Since the second option requires only a few minor changes, we are going to try it first.
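As a rough sketch of the second option, the two compressions can be started together with a low compression level. The example uses Python's standard `gzip` module and a thread pool instead of the `gzip` command run in the background, so it shows the idea rather than the actual change to the job script. The function names `gzip_file` and `gzip_all` and the default level 1 are assumptions made for illustration.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor

def gzip_file(src, dst, level=1):
    """Compress src into dst. Level 1 is fastest, level 9 gives the smallest file."""
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        # Copy in 1 MiB chunks so large FASTQ files are never held in memory.
        shutil.copyfileobj(fin, fout, 1024 * 1024)
    return dst

def gzip_all(jobs, level=1):
    """Compress several (src, dst) pairs at the same time and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(gzip_file, src, dst, level) for src, dst in jobs]
        # result() re-raises any error from a worker, so failures are not lost.
        return [f.result() for f in futures]
```

With two `(src, dst)` pairs, one for R1 and one for R2, both files are compressed concurrently instead of one after the other. Whether this actually halves the time depends on free CPU cores and on the disk throughput of the project archive, and the lower compression level will produce somewhat larger files.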