Opened 5 years ago
Closed 5 years ago
#1245 closed task (fixed)
Picard demux should output compressed fastq files
| Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
|---|---|---|---|
| Priority: | major | Milestone: | Reggie v4.27 |
| Component: | net.sf.basedb.reggie | Keywords: | |
| Cc: | | | |
Description
Demuxing the new NovaSeq sequencing runs with the S1 flow cell, which has 2 billion reads, is causing the temporary work disk to fill up. This is partly because Picard also needs to use temporary files, since the number of reads per tile is higher than for NextSeq or HiSeq. We can control this with the MAX_READS_IN_RAM_PER_TILE parameter to Picard, which is currently set to 5,000,000. In an attempt to avoid filling the disk we increased it to 20,000,000, but that caused Picard to run out of memory.
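For concreteness, a minimal sketch (not the actual Reggie demux code) of how this parameter might be passed when building the Picard IlluminaBasecallsToFastq command line; only the options relevant here are shown, and paths/values are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only -- not the actual Reggie code. Builds the part of the
// IlluminaBasecallsToFastq command line that controls memory vs. temp-file usage.
public class PicardDemuxOptions
{
    public static List<String> demuxCommand(String basecallsDir, int lane, int maxReadsInRamPerTile)
    {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx16g");            // heap size; placeholder value
        cmd.add("-jar");
        cmd.add("picard.jar");
        cmd.add("IlluminaBasecallsToFastq");
        cmd.add("BASECALLS_DIR=" + basecallsDir);
        cmd.add("LANE=" + lane);
        // Higher values keep more reads per tile in RAM and reduce spilling to
        // temporary files, but 20,000,000 exhausted the heap; 5,000,000 is the
        // current setting.
        cmd.add("MAX_READS_IN_RAM_PER_TILE=" + maxReadsInRamPerTile);
        // ... remaining required options (run barcode, read structure,
        // multiplex params, etc.) omitted here.
        return cmd;
    }

    public static void main(String[] args)
    {
        System.out.println(String.join(" ",
            demuxCommand("/runs/NOVASEQ/Data/Intensities/BaseCalls", 1, 5_000_000)));
    }
}
```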
The easiest and most future-proof solution (there are flow cells with a lot more reads!) is to let Picard save compressed fastq files. There was an old comment (see comment:38:ticket:547) that Trimmomatic didn't like concatenated gz files, but that seems to work now, so we can use cat *.fastq.gz in the merge step just as before.
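The merge step then amounts to byte-level concatenation: gzip files can be appended to each other and the result is a valid multi-member gzip stream, which is exactly what cat *.fastq.gz produces. (If I recall the option name correctly, IlluminaBasecallsToFastq can write .fastq.gz directly via COMPRESS_OUTPUTS.) Below is a sketch of that merge in Java; the class name and paths are hypothetical, not the actual Reggie code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: byte-level concatenation of gzip files, equivalent to
// 'cat *.fastq.gz > merged.fastq.gz'.
public class MergeFastqGz
{
    public static void merge(Path dir, String glob, Path merged) throws IOException
    {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, glob))
        {
            for (Path p : ds) parts.add(p);
        }
        // Use a consistent order so that R1 and R2 files are merged in the
        // same order and read pairs stay in sync.
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(merged))
        {
            for (Path part : parts)
            {
                Files.copy(part, out); // append each compressed part unchanged
            }
        }
    }

    public static void main(String[] args) throws IOException
    {
        // Hypothetical paths for illustration
        merge(Paths.get("demux/lane1"), "*_R1.fastq.gz", Paths.get("merged/sample_R1.fastq.gz"));
    }
}
```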
It would be nice if we could also avoid the temporary files created by Picard, since that would cut the demux time by about 50%. In theory they should not be needed, since we typically only use 4 threads (= 4 tiles) at the same time. However, after investigating the Picard code we realized that the 4 threads are only used for reading data and putting it on queues: there is one queue and one writer thread per sample, and those writer threads produce the fastq files. Test runs show that writing is slightly slower than reading, so the queues keep growing until all memory is consumed. How quickly this happens probably depends a lot on hardware and other external factors, and it is probably not easy to get around in the general case.

A possible workaround is to split the demux into more parts: just as we split it into one demux per lane, we could split it further by tiles or libraries. But that is probably not worth the extra complication. The most elegant solution would be to change Picard so that, for example, the reading threads are aware of how far behind the writing threads are. If there is a lot of data in a queue, reading could be paused for some time...
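As a sketch of that idea (not Picard's actual classes): if each per-sample writer is fed through a bounded queue, the reading side blocks automatically whenever the writer falls behind by more than the queue capacity, so memory use stays bounded.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative producer/consumer with backpressure; not Picard code.
// A bounded queue makes the reader pause (block on put) whenever the writer
// is more than CAPACITY records behind.
public class BoundedDemuxQueue
{
    private static final int CAPACITY = 100_000;   // max records buffered per sample
    private static final String POISON = "EOF";    // marks end of input

    public static void main(String[] args) throws InterruptedException
    {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(CAPACITY);

        // Writer thread: drains the queue, e.g. writes records to a fastq.gz file (slow side)
        Thread writer = new Thread(() -> {
            try
            {
                String record;
                while ((record = queue.take()) != POISON)
                {
                    // write record to compressed fastq here
                }
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // Reader side: put() blocks when the queue is full -- this is the
        // "pause the reading for some time" behaviour described above.
        for (int i = 0; i < 1_000_000; i++)
        {
            queue.put("record-" + i);
        }
        queue.put(POISON);
        writer.join();
    }
}
```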
In 5936: