Opened 7 months ago

Closed 7 months ago

Last modified 7 months ago

#1592 closed defect (fixed)

Improve performance for indexing the reference VCF files

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: critical Milestone: Variant Search v1.12
Component: net.sf.basedb.varsearch Keywords:
Cc:

Description

This is very slow with the imputed genotypes data set. It takes almost 20 hours and then crashes because the database connection has been closed after not being used for a long time.

net.sf.basedb.core.ConnectionClosedException: The connection has been closed.
  at net.sf.basedb.core.DbControl.commit(DbControl.java:442)
  at net.sf.basedb.varsearch.index.SplitIndex.rebuildReferenceIndex(SplitIndex.java:352)
  at net.sf.basedb.varsearch.index.SplitIndex.doCustomAction(SplitIndex.java:243)
  at net.sf.basedb.varsearch.service.VarSearchService.autoUpdateIndexes(VarSearchService.java:366)
  at net.sf.basedb.varsearch.service.VarSearchService$IndexUpdateTimerTask.run(VarSearchService.java:476)
  at net.sf.basedb.util.timer.ThreadTimerTask$1.run(ThreadTimerTask.java:94)
  at java.base/java.lang.Thread.run(Thread.java:833)

There are multiple things that need to be fixed. First of all, we need to speed up the actual indexing. But it will take a long time no matter what, so we also need to fix the database connection timeout issue.

I also noted that it is possible to abort the indexing process by stopping the service, but that will replace the existing index (if there is one) with the new index, which is only partly completed. The log contains a message that the build is about to be aborted, but the next message says that the index was built successfully.

Change History (5)

comment:1 by Nicklas Nordborg, 7 months ago

In 7699:

References #1592: Improve performance for indexing the reference VCF files

Starting custom actions in a separate thread. This should take care of some connection/session timeouts.

comment:2 by Nicklas Nordborg, 7 months ago

In 7700:

References #1592: Improve performance for indexing the reference VCF files

Better handling when re-indexing is aborted. It will keep the current index and report a proper message to the log.
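The idea of keeping the current index when a rebuild is aborted can be sketched as follows. This is only an illustration and not the code from changeset 7700: the class SafeRebuild, the method rebuild(), the ".tmp" suffix, the abort flag and the log messages are all hypothetical, and a single file stands in for the real index. The new index is written next to the current one and swapped in only if the build ran to completion.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a single file stands in for the real index.
final class SafeRebuild {

    /**
     * Builds a new index next to the current one and replaces the current
     * index only if the build was not aborted.
     * Returns true if the current index was replaced.
     */
    static boolean rebuild(Path current, Iterable<String> records, AtomicBoolean abort) {
        // Temporary location for the index under construction (assumed naming)
        Path tmp = current.resolveSibling(current.getFileName() + ".tmp");
        try {
            try (BufferedWriter out = Files.newBufferedWriter(tmp)) {
                for (String record : records) {
                    if (abort.get()) {
                        break; // stop early, the partial result is discarded below
                    }
                    out.write(record);
                    out.newLine();
                }
            }
            if (abort.get()) {
                // Aborted: throw away the partial index and keep the current one
                Files.deleteIfExists(tmp);
                System.out.println("Re-indexing aborted; keeping the current index");
                return false;
            }
            // Completed: only now does the new index replace the old one
            Files.move(tmp, current, StandardCopyOption.REPLACE_EXISTING);
            System.out.println("Index rebuilt successfully");
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout the "built successfully" message can only be logged on the path that actually swaps in a complete index.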

comment:3 by Nicklas Nordborg, 7 months ago

In 7701:

References #1592: Improve performance for indexing the reference VCF files

Improved VCF parsing performance a lot by getting rid of many regexp-based string splitting methods. Most of the time we will only split on a single character which can be implemented a lot faster. Overall performance seems to have increased by a factor of 2 or so.
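As an illustration of splitting on a single character without regular expressions, here is a minimal sketch. It is not the code from changeset 7701; the class CharSplit and the method split() are hypothetical names. It scans the line with String.indexOf(int, int) and cuts out the fields with substring().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the actual extension.
final class CharSplit {

    /**
     * Splits a line on a single delimiter character without using a
     * regular expression. Empty fields are kept, including trailing ones.
     */
    static List<String> split(String line, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int end;
        // Walk from delimiter to delimiter and copy out each field
        while ((end = line.indexOf(delimiter, start)) >= 0) {
            parts.add(line.substring(start, end));
            start = end + 1;
        }
        // The remainder after the last delimiter is the final field
        parts.add(line.substring(start));
        return parts;
    }
}
```

For a tab-separated VCF data line, CharSplit.split(line, '\t') returns one entry per column. Unlike String.split(String), which drops trailing empty strings, this sketch keeps them, so the number of fields always equals the number of delimiters plus one.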

comment:4 by Nicklas Nordborg, 7 months ago

Resolution: fixed
Status: new → closed

In 7702:

Fixes #1592: Improve performance for indexing the reference VCF files

Implemented support for using multiple threads when a single item has more than one associated VCF file. There is also a restriction that the total file size should be over 1GB, which means that the multi-threaded case is only used when indexing the imputed reference panel.
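The rule described above (more than one VCF file and a total size over 1GB) can be sketched like this, together with one possible way to run one task per file on a fixed thread pool. This is not the code from changeset 7702: the class ParallelIndexing and its methods are hypothetical, the threshold is assumed to be 2^30 bytes, and each task is assumed to return the number of lines it indexed.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the multi-threaded case.
final class ParallelIndexing {

    // Assumption: "1GB" is taken as 2^30 bytes
    static final long MIN_TOTAL_BYTES = 1L << 30;

    /** True only if there is more than one file and the total size is over the threshold. */
    static boolean useMultipleThreads(long[] fileSizes) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return fileSizes.length > 1 && total > MIN_TOTAL_BYTES;
    }

    /** Runs one task per VCF file on a fixed pool and returns the total number of lines indexed. */
    static long indexAll(List<Callable<Long>> perFileTasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long totalLines = 0;
            // invokeAll() blocks until every task has finished
            for (Future<Long> result : pool.invokeAll(perFileTasks)) {
                totalLines += result.get();
            }
            return totalLines;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Indexing was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Indexing of a file failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would check useMultipleThreads() first and fall back to indexing the files one after another in the current thread when it returns false.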

comment:5 by Nicklas Nordborg, 7 months ago

Seems to work well now on the production server. Total speedup with 8 indexing threads is from ~1000 lines/second to ~20000 lines/second, roughly a factor of 20.
