Opened 5 years ago

Closed 5 years ago

#1160 closed task (fixed)

Investigate changes in Picard since our last modified version

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: critical Milestone: Reggie v4.23
Component: net.sf.basedb.reggie Keywords:

Description (last modified by Nicklas Nordborg)

We are using a modified version of Picard for demuxing. It is base on Picard 1.128 and changes are listed on our BASE-hacks site:

The MIPs pipeline is using UMI (unique molecular indices) which require that 'M' is used in the READ_STRUCTURE parameter to Picard. This is not supported in our modified version, but works in a later version (for example: 2.20.0).

However, we may not be able to just switch to the newer Picard due to our modifications. This ticket is about investigating potential issues.

To summarize our findings from below:

It should be possible to use the newer Picard version. All critical modification have also been fixed in Picard. The exception is Item 1 in Fork release 2, but this only affects 31 old sequencing runs and can be fixed by manually copying the TileMetrics.bin file.

The remaining annoyance with all messages going to stderr should not prevent us from updating. If we can't get acceptable results by fixing code in Reggie, there is always the possibility to fork Picard again.

Change History (9)

comment:1 by Nicklas Nordborg, 5 years ago

Fork release 5 - Item 1

We do not use AddOrReplaceReadGroups any longer. It was removed from the legacy pipeline since we no longer keep BAM files there and in the new pipeline, Hisat has built-in support for this. See #983 and #997.

comment:2 by Nicklas Nordborg, 5 years ago

Fork relase 3 - Item 1

We do not use this (in the RNA-seq pipeline), but Picard now has support for this via a parameter IGNORE_UNEXPECTED_BARCODES. Which is good since it is useful when demuxing in the MIPs pipeline.

Fork release 3 - Item 2 and 3

There is a parameter in Picard for this also, but we prefer to use uncompressed files as long as possible and then compress with pigz. See #809 and #663.

comment:3 by Nicklas Nordborg, 5 years ago

Fork release 4 - Item 1

For background see #742. Picard has made some changes related to file size checks but it is not certain that it fixes our issues. This need to be tested. A quick check on the production server finds 25 sequencing runs (all of them on the NextSeq) with this problem and they seem to happen every now and then. The last one was just a week ago. So, this item is a real problem that we need to verify if it has been fixed in Picard or if we need some other workaround.

I have a made test run with a demux and it seems like Picard is able to handle this without crashing. On the other hand our code crashed when trying import data from the demultiplex_metrics.txt file. The reason was that there are more columns that causes our regular expressions to not find the data.

Last edited 5 years ago by Nicklas Nordborg (previous) (diff)

comment:4 by Nicklas Nordborg, 5 years ago

Fork release 2 - Item 1

See #631 for background information. A quick check on the file server finds 31 sequencing runs with a 12-byte TileMetricsOut.bin file. They are all in the late 2013 or first half of 2014 so my guess is that this has happened due to some "temporary" hiccup in file syncing. In any case, we can probably live without this fix and if we ever need to demux any of the 31 problematic sequencing runs again, it should be fixable by first renaming of copying TileMetrics.bin to TileMetricsOut.bin.

comment:5 by Nicklas Nordborg, 5 years ago

Fork release 1 - Item 5

This seems to be fixed in picard.

comment:6 by Nicklas Nordborg, 5 years ago

Fork release 1 - Item 4

The support for FIRST_TILE and MAX_TILES was implemented to speed up debugging and reduce used disk space. It was removed in [2464] with a comment that it doesn't save much time but then added again in [2497]. We can probably live without this.

comment:7 by Nicklas Nordborg, 5 years ago

Fork release 1 - Item 1, 2 and 3
Fork release 2 - Item 2

The changes are all related to separating error messages from other messages. The benefit is that it makes it easier for error handling code to catch and display the actual error message for the user instead of mixing this up with other messages. We can live without this, but may have to change some code so that we do not loose all messages. Will take some time test a go through everything.

comment:8 by Nicklas Nordborg, 5 years ago

Description: modified (diff)

comment:9 by Nicklas Nordborg, 5 years ago

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.