Opened 4 years ago

Closed 4 years ago

#1290 closed task (fixed)

Implement a variant search extension

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Variant Search v1.0
Component: net.sf.basedb.varsearch
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

The variant calling pipeline in Reggie (see #1199) produces VCF files with lots of variants. In theory it is "easy" to search for variants with grep or SnpSift, but this takes a really long time (several hours) due to the large number of files in the system. Some kind of indexing is needed before searching is realistic.

I have made some initial tests with Apache Lucene (https://lucene.apache.org/) and indexed a few of the annotations in the annotated and filtered VCF files. The results are very promising. Searching for variants in a gene or at a specific location typically takes only a few milliseconds. The results can be returned as a list of raw bioassay IDs, which means that it should be possible to include this functionality in the regular table listing.

This functionality could of course be integrated into Reggie, but since we also copy the VCF files to Relax, it would be nice if we could implement this as a separate extension that works in both the Reggie and Relax environments.

Change History (46)

comment:1 by Nicklas Nordborg, 4 years ago

Component: not classified → net.sf.basedb.varsearch
Description: modified (diff)
Milestone: Variant Search v1.0
Owner: changed from Jari Häkkinen to Nicklas Nordborg
Status: new → accepted

comment:2 by Nicklas Nordborg, 4 years ago

In 6107:

References #1290: Implement a variant search extension

Setting up the development environment with common files and folders for a typical extension.

comment:3 by Nicklas Nordborg, 4 years ago

In 6108:

References #1290: Implement a variant search extension

Added JAR files from Lucene that we need.

comment:4 by Nicklas Nordborg, 4 years ago

In 6109:

References #1290: Implement a variant search extension

Implemented a proof-of-concept demo that adds a "Lucene" column to the RawBioassays table. An indexed database must already exist in the "lucence.db/filtered" subdirectory of the BASE userfiles directory.

comment:5 by Nicklas Nordborg, 4 years ago

In 6111:

References #1290: Implement a variant search extension

Started to implement a "service" that forms the core of this extension. It will provide access to Lucene databases in an optimal fashion and it will be used for indexing new data at regular intervals.

comment:6 by Nicklas Nordborg, 4 years ago

In 6112:

References #1290: Implement a variant search extension

Implemented indexing of VCF files. Indexing is triggered by adding raw bioassays to the "Variant index" item list.

The part that selects fields and data to index needs a lot more development (it is currently using the proof-of-concept code). Error handling also needs to be improved, as do full re-indexing and removal of raw bioassays that have been deleted.

comment:7 by Nicklas Nordborg, 4 years ago

In 6113:

References #1290: Implement a variant search extension

Implemented indexing of VCF files. Indexing is triggered by adding raw bioassays to the "Variant index" item list.

The part that selects fields and data to index needs a lot more development (it is currently using the proof-of-concept code). Error handling also needs to be improved, as do full re-indexing and removal of raw bioassays that have been deleted.

comment:8 by Nicklas Nordborg, 4 years ago

In 6114:

References #1290: Implement a variant search extension

Started to improve the way data is indexed by the RawBioAssayIndexer. Some of the simple fields have been added. The gene name can be found in two annotations, ncbiRefSeq and ANN.Gene_Name, so they are merged into a single field in the index. This field now also ignores case when searching.

comment:9 by Nicklas Nordborg, 4 years ago

In 6115:

References #1290: Implement a variant search extension

Started to improve the way data is indexed by the RawBioAssayIndexer. Some of the simple fields have been added. The gene name can be found in two annotations, ncbiRefSeq and ANN.Gene_Name, so they are merged into a single field in the index. This field now also ignores case when searching.

Adding new files in a separate commit. For some reason SVN complains about files being out of date if mixing new files with modified files:

svn: E155011: File '....\varsearch\src\net\sf\basedb\varsearch\analyze\AlphaNumericIgnoreCaseAnalyzer.java' is out of date

comment:10 by Nicklas Nordborg, 4 years ago

In 6118:

References #1290: Implement a variant search extension

Improved indexing of CosmicID which can be in two different annotations: cosmic_ID and/or cosmic_nc_ID. If both are present they are always the same.

comment:11 by Nicklas Nordborg, 4 years ago

In 6119:

References #1290: Implement a variant search extension

Implemented support for searching in numeric fields (e.g. the 'pos' field). The regular query parser provided in the Lucene core doesn't have information about data types and assumes that all fields are text fields, so we need to override some methods. See FieldAwareQueryParser.

comment:12 by Nicklas Nordborg, 4 years ago

In 6120:

References #1290: Implement a variant search extension

Implemented support for searching in numeric fields (e.g. the 'pos' field). The regular query parser provided in the Lucene core doesn't have information about data types and assumes that all fields are text fields, so we need to override some methods. See FieldAwareQueryParser.

comment:13 by Nicklas Nordborg, 4 years ago

In 6123:

References #1290: Implement a variant search extension

Implemented support for HGVS.p fields.

comment:14 by Nicklas Nordborg, 4 years ago

In 6129:

References #1290: Implement a variant search extension

Improvements to the display of search results.

comment:15 by Nicklas Nordborg, 4 years ago

In 6131:

References #1290: Implement a variant search extension

Automatically re-write query strings based on regular expressions so that users don't have to specify the field prefix in many cases. Examples:

  • rsNNNN -> rsid:rsNNNN (where NNNN is numeric)
  • COSVnnnn -> cosmic:COSVnnnn (where nnnn is numeric)
  • p.XXXX -> p:XXXX (where XXXX is non-whitespace)
  • ANNNB -> p:ANNNB (where A and B are [A-Z*=] and NNN is numeric)
  • c.XXXX -> c:XXXX (where XXXX is non-whitespace)
  • NNNNA>B -> c:NNNNA>B (where NNNN is numeric or -+ and A, B are [ACGT])
  • chrXY -> chrom:chrXY (where XY is number or X or Y)
  • NNNN -> pos:NNNN (where NNNN is numeric)
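
The rules above can be sketched as an ordered table of regular expressions. This is only an illustration, not the code from [6131]: the class name QueryPrefixer and its methods are hypothetical, the regular expressions are one reading of the rule descriptions, and terms are split on whitespace only (so parentheses are not handled here).

```java
/** Illustrative sketch of regex-based auto-prefixing (hypothetical class, not the extension's code). */
final class QueryPrefixer {
    // Each row: { regex that must match the whole term, replacement }.
    // Rules are tried in order and the first match wins.
    private static final String[][] RULES = {
        { "rs\\d+",                "rsid:$0"   },
        { "COSV\\d+",              "cosmic:$0" },
        { "p\\.(\\S+)",            "p:$1"      },
        { "[A-Z*=]\\d+[A-Z*=]",    "p:$0"      },
        { "c\\.(\\S+)",            "c:$1"      },
        { "[-+\\d]+[ACGT]>[ACGT]", "c:$0"      },
        { "chr(\\d+|X|Y)",         "chrom:$0"  },
        { "\\d+",                  "pos:$0"    },
    };

    /** Adds a field prefix to a single term, or returns it unchanged if no rule applies. */
    static String rewriteTerm(String term) {
        // Assumption: a term that already contains ':' has an explicit field prefix.
        if (term.contains(":")) return term;
        for (String[] rule : RULES) {
            if (term.matches(rule[0])) return term.replaceFirst(rule[0], rule[1]);
        }
        return term; // operators such as AND/OR and unknown terms pass through
    }

    /** Rewrites every whitespace-separated term of the query. */
    static String rewrite(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(rewriteTerm(term));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("chr17 AND 524G>A")); // prints "chrom:chr17 AND c:524G>A"
    }
}
```

The order of the rows matters: the purely numeric rule comes last so that it cannot shadow the more specific patterns.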

comment:16 by Nicklas Nordborg, 4 years ago

In 6136:

References #1290: Implement a variant search extension

Added support for searching with only '=' and '<>' which translates to a special filter for raw bioassays that have been indexed/not indexed.

The table column will now also display either 'Indexed - N variants' or 'Not indexed'.

comment:17 by Nicklas Nordborg, 4 years ago

In 6137:

References #1290: Implement a variant search extension

Implemented a query cache that is used to keep results for queries that take a long time (>250 ms). This is particularly useful for the '=' and '<>' queries that were implemented in [6136].
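
A minimal sketch of such a cache, assuming results are keyed on the query string and kept only when the search took longer than the threshold. The class SlowQueryCache, the injected clock, and the clear() hook are hypothetical and only serve the illustration. The actual implementation may differ, for example in how entries are invalidated after indexing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative cache that only remembers results of slow queries (hypothetical class). */
final class SlowQueryCache<V> {
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final long thresholdMillis;
    private final LongSupplier clockMillis;

    SlowQueryCache(long thresholdMillis, LongSupplier clockMillis) {
        this.thresholdMillis = thresholdMillis;
        this.clockMillis = clockMillis;
    }

    /** Returns the cached result, or runs the search and caches it if it was slow. */
    V get(String query, Supplier<V> search) {
        V cached = cache.get(query);
        if (cached != null) return cached;
        long start = clockMillis.getAsLong();
        V result = search.get();
        long elapsed = clockMillis.getAsLong() - start;
        // Fast queries are cheap to repeat, so they are not worth the memory.
        if (result != null && elapsed > thresholdMillis) cache.put(query, result);
        return result;
    }

    /** Assumed hook: drop everything, e.g. after the index has changed. */
    void clear() { cache.clear(); }

    int size() { return cache.size(); }

    public static void main(String[] args) {
        SlowQueryCache<String> cache = new SlowQueryCache<>(250, System::currentTimeMillis);
        System.out.println(cache.get("<>", () -> "result of an expensive search"));
    }
}
```

The clock is passed in so that the threshold logic can be exercised without real delays.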

comment:18 by Nicklas Nordborg, 4 years ago

In [6142]

References #1290: Implement a variant search extension

Check length of REF and ALT values and clip if necessary.

comment:19 by Nicklas Nordborg, 4 years ago

In [6143]

References #1290: Implement a variant search extension

Added 'TYPE' field to the index.

comment:20 by Nicklas Nordborg, 4 years ago

In [6138]

References #1290: Implement a variant search extension

Add some more logging.

comment:21 by Nicklas Nordborg, 4 years ago

In [6139]

References #1290: Implement a variant search extension

Add link to Genome Browser for "chr:pos" values.

comment:22 by Nicklas Nordborg, 4 years ago

In [6140]

References #1290: Implement a variant search extension

Use separate item lists for the 'filtered' and 'all' indexes.

Also implemented a feature that waits a bit longer (up to 4 hours) before adding raw bioassays to the index if the list contains very few of them (<24), since it is more efficient (storage-wise) to index multiple documents at the same time. Since we are sequencing about 24 samples at the same time, it is better to wait until all 24 have been called for variants.
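
The waiting rule can be sketched as a small decision function. This is an illustration under assumptions: the class IndexBatchPolicy is hypothetical, and the wait is assumed to be measured from the time the oldest pending raw bioassay was added to the list, which the commit message does not specify.

```java
import java.time.Duration;

/** Illustrative batching rule for automatic indexing (hypothetical class). */
final class IndexBatchPolicy {
    static final int FULL_BATCH = 24;                     // samples sequenced together
    static final Duration MAX_WAIT = Duration.ofHours(4); // upper limit for waiting

    /**
     * @param pending    number of raw bioassays waiting in the item list
     * @param oldestWait how long the oldest pending item has waited (assumed measure)
     * @return true if indexing should start now
     */
    static boolean shouldIndexNow(int pending, Duration oldestWait) {
        if (pending <= 0) return false;           // nothing to do
        if (pending >= FULL_BATCH) return true;   // a full batch is efficient to index
        return oldestWait.compareTo(MAX_WAIT) >= 0; // small batch: index only after the wait
    }

    public static void main(String[] args) {
        System.out.println(shouldIndexNow(3, Duration.ofMinutes(30))); // prints "false"
        System.out.println(shouldIndexNow(24, Duration.ZERO));         // prints "true"
    }
}
```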

comment:23 by Nicklas Nordborg, 4 years ago

In [6141]

References #1290: Implement a variant search extension

Added a summary document for each raw bioassay to the index. This makes it a lot faster to search for the ID of indexed raw bioassays since we don't have to check all variants. The summary also contains a 'numVariants' field which can be used to filter on the number of variants as a special filter. This is automatically triggered if the query string is purely numeric.

Examples:

  • =20: find rawbioassays with exactly 20 variants
  • >100: find rawbioassays with more than 100 variants
  • =0: find rawbioassays with 0 variants (that have a VCF file)
  • =, <>%: find rawbioassays that have not been indexed
  • %, <>: find rawbioassays that have been indexed (>=0 will also work!)

comment:24 by Nicklas Nordborg, 4 years ago

In [6144]

References #1290: Implement a variant search extension

The Lucene query extension should only be enabled on the Raw bioassay list page.

comment:25 by Nicklas Nordborg, 4 years ago

In [6145] [6146]

References #1290: Implement a variant search extension

It is now possible to pause automatic indexing by setting AutoProcessing=Disable on the item lists.

comment:26 by Nicklas Nordborg, 4 years ago

In 6147:

References #1290: Implement a variant search extension

Improving some regular expressions for auto-prefixing so that they work with parentheses. For example: TP53 AND (423C>G OR 742C>T) will now translate to gene:TP53 AND (c:423C>G OR c:742C>T) instead of gene:TP53 AND (gene:423C>G OR gene:742C>T)

comment:27 by Nicklas Nordborg, 4 years ago

In 6148:

References #1290: Implement a variant search extension

Added a "Help" icon to the table.

comment:28 by Nicklas Nordborg, 4 years ago

In 6171:

References #1290: Implement a variant search extension

Added some more fields to the index that are related to the genotype and number of reads of a variant for a sample:

  • GT: Genotype
  • DP: Depth
  • VD: Variant depth
  • AF: Allele frequency

comment:29 by Nicklas Nordborg, 4 years ago

In 6172:

References #1290: Implement a variant search extension

Added some more fields to the index that are related to the genotype and number of reads of a variant for a sample:

  • GT: Genotype
  • DP: Depth
  • VD: Variant depth
  • AF: Allele frequency

comment:30 by Nicklas Nordborg, 4 years ago

In 6175 and [6174]:

References #1290: Implement a variant search extension

Added some more fields to the index:

comment:31 by Nicklas Nordborg, 4 years ago

In 6233:

References #1290: Implement a variant search extension

Actually implemented multi-threaded searching. It required the use of a CollectorManager implementation and not only a Collector as before.

comment:32 by Nicklas Nordborg, 4 years ago

In 6234:

References #1290: Implement a variant search extension

Added code to keep the login session alive if indexing takes too long.

comment:33 by Nicklas Nordborg, 4 years ago

In 6235:

References #1290: Implement a variant search extension

Searching on the "rbaId" field should be handled as an integer field.

comment:34 by Nicklas Nordborg, 4 years ago

In 6236:

References #1290: Implement a variant search extension

Implemented limitation on the number of variants that are displayed in search results. There is a total limit for the entire table and also a limit per raw bioassay. The main reason for the limit is that browsers have a hard time displaying too many hits at the same time.

comment:35 by Nicklas Nordborg, 4 years ago

In 6241:

References #1290: Implement a variant search extension

Several improvements related to performance of indexing and searching.

  • Setting number of threads for indexing and merging
  • Merge policy will use smaller segments
  • It is possible to abort indexing by stopping the Variant Search Service
  • Searching is now time limited to 60 seconds. This was possible since the actual search for "documents" is typically fast. Most time is spent converting internal document id values to raw bioassay id values. This is done in RawBioAssayIdCollector which now has a timeout and it will simply ignore calls after the timeout. The user is notified if this happens.
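
The "ignore calls after the timeout" behaviour can be illustrated without Lucene. The sketch below is not the RawBioAssayIdCollector itself: the class TimeLimitedIdCollector and its methods are hypothetical, it is not thread-safe, and the clock is injected only to keep the example deterministic.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;

/** Illustrative id collector that stops accepting hits after a deadline (hypothetical class). */
final class TimeLimitedIdCollector {
    private final Set<Integer> ids = new HashSet<>();
    private final LongSupplier clockMillis;
    private final long deadlineMillis;
    private boolean timedOut;

    TimeLimitedIdCollector(long timeoutMillis, LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        this.deadlineMillis = clockMillis.getAsLong() + timeoutMillis;
    }

    /** Records a raw bioassay id unless the deadline has passed. */
    void collect(int rawBioAssayId) {
        if (timedOut) return; // once timed out, all further calls are ignored
        if (clockMillis.getAsLong() >= deadlineMillis) {
            timedOut = true;
            return;
        }
        ids.add(rawBioAssayId);
    }

    /** True if some hits may be missing, so that the user can be notified. */
    boolean timedOut() { return timedOut; }

    Set<Integer> ids() { return ids; }

    public static void main(String[] args) {
        TimeLimitedIdCollector c = new TimeLimitedIdCollector(60_000, System::currentTimeMillis);
        c.collect(42);
        System.out.println(c.ids() + " timedOut=" + c.timedOut());
    }
}
```

The result is then a partial but valid set of ids, together with a flag that tells the caller the search was cut short.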

comment:36 by Nicklas Nordborg, 4 years ago

In 6242:

References #1290: Implement a variant search extension

Several improvements related to performance of indexing and searching.

  • Setting number of threads for indexing and merging
  • Merge policy will use smaller segments
  • It is possible to abort indexing by stopping the Variant Search Service
  • Searching is now time limited to 60 seconds. This was possible since the actual search for "documents" is typically fast. Most time is spent converting internal document id values to raw bioassay id values. This is done in RawBioAssayIdCollector which now has a timeout and it will simply ignore calls after the timeout. The user is notified if this happens.

comment:37 by Nicklas Nordborg, 4 years ago

In 6245:

References #1290: Implement a variant search extension

Implemented functionality for setting a timeout in the query string. The default timeout is changed to 30 seconds. A user may force a longer (or shorter) timeout by ending the query with /xx where xx is the timeout in seconds. The maximum allowed timeout is 600 seconds.

comment:38 by Nicklas Nordborg, 4 years ago

In 6246:

References #1290: Implement a variant search extension

Use single-threaded queries for some special queries since they should always be quick and the multi-threaded version may cause them to be put in a queue if some other user is doing a heavy query at the same time.

comment:39 by Nicklas Nordborg, 4 years ago

In 6248:

References #1290: Implement a variant search extension

Index the raw bioassay id field as a DocValues field instead. This means that values are stored column-wise instead of row-wise in the index, which improves performance a lot when a query needs to return many hits. Performance is improved up to 100 or 1000 times. For example, a query for all variants on chr1 that returns 80M+ hits took 10-20 minutes before this change and now takes only about 5 seconds.

comment:40 by Nicklas Nordborg, 4 years ago

In 6251:

References #1290: Implement a variant search extension

Remove cosmic_nc_ID from indexing since it seems like the ID values are not found on the COSMIC website.

comment:41 by Nicklas Nordborg, 4 years ago

In 6252:

References #1290: Implement a variant search extension

Minor changes to message about number of hits.

comment:42 by Nicklas Nordborg, 4 years ago

In 6253:

References #1290: Implement a variant search extension

Display an animation in the Variant details dialog in case loading the information takes some time.

comment:43 by Nicklas Nordborg, 4 years ago

In 6260:

References #1290: Implement a variant search extension

Change the marker for setting a timeout to # since / interferes with queries against the gt field. For example, gt:1/1 would be interpreted as a query gt:1 with a 1 second timeout.

Also fixes a typo and a debug message.
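
A sketch of how the # marker could be parsed, using the 30 second default and 600 second maximum mentioned in [6245]. The class QueryTimeout is hypothetical, and so are the details the commit messages leave open: allowing whitespace before the marker and raising a requested timeout of 0 to 1 second.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for a trailing "#seconds" timeout marker (hypothetical class). */
final class QueryTimeout {
    static final int DEFAULT_SECONDS = 30;
    static final int MAX_SECONDS = 600;

    // Query text, optional whitespace, '#', then 1-9 digits at the very end.
    private static final Pattern SUFFIX = Pattern.compile("(.*?)\\s*#(\\d{1,9})\\s*");

    final String query;
    final int seconds;

    private QueryTimeout(String query, int seconds) {
        this.query = query;
        this.seconds = seconds;
    }

    static QueryTimeout parse(String raw) {
        Matcher m = SUFFIX.matcher(raw);
        if (!m.matches()) return new QueryTimeout(raw.trim(), DEFAULT_SECONDS);
        int requested = Integer.parseInt(m.group(2));
        // Assumption: clamp to the range 1..MAX_SECONDS.
        int seconds = Math.max(1, Math.min(requested, MAX_SECONDS));
        return new QueryTimeout(m.group(1).trim(), seconds);
    }

    public static void main(String[] args) {
        QueryTimeout t = parse("gt:1/1 #120");
        System.out.println(t.query + " -> " + t.seconds + "s"); // prints "gt:1/1 -> 120s"
    }
}
```

Because the marker is only recognised at the end of the string, a query such as gt:1/1 without a marker is left intact and gets the default timeout.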

comment:44 by Nicklas Nordborg, 4 years ago

In 6262:

References #1290: Implement a variant search extension

Updated to Lucene 8.8.2. Existing databases need to be re-indexed.

comment:45 by Nicklas Nordborg, 4 years ago

In 6263:

References #1290: Implement a variant search extension

Implement sorting of results by chromosome+position when there are multiple variant hits per raw bioassay.
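
One way to express such an ordering is a two-level comparator. The sketch below is an assumption about what "chromosome+position" means in practice: the class VariantHit is hypothetical, and chromosomes are ordered numerically (chr1..chr22), followed by chrX and chrY, with unknown names last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative variant hit with a chromosome+position ordering (hypothetical class). */
final class VariantHit {
    final String chrom;
    final long pos;

    VariantHit(String chrom, long pos) {
        this.chrom = chrom;
        this.pos = pos;
    }

    /** Maps chr1..chr22 to 1..22, chrX to 23, chrY to 24 and anything else to the end. */
    static int chromRank(String chrom) {
        String name = chrom.startsWith("chr") ? chrom.substring(3) : chrom;
        switch (name) {
            case "X": return 23;
            case "Y": return 24;
            default:
                try {
                    return Integer.parseInt(name);
                } catch (NumberFormatException e) {
                    return Integer.MAX_VALUE;
                }
        }
    }

    /** Sorts by chromosome rank first, then by position within the chromosome. */
    static final Comparator<VariantHit> BY_LOCATION =
        Comparator.<VariantHit>comparingInt(h -> chromRank(h.chrom))
                  .thenComparingLong(h -> h.pos);

    @Override
    public String toString() { return chrom + ":" + pos; }

    public static void main(String[] args) {
        List<VariantHit> hits = new ArrayList<>(Arrays.asList(
            new VariantHit("chr10", 5), new VariantHit("chr2", 100),
            new VariantHit("chrX", 1), new VariantHit("chr2", 50)));
        hits.sort(BY_LOCATION);
        System.out.println(hits); // prints "[chr2:50, chr2:100, chr10:5, chrX:1]"
    }
}
```

A plain string sort would place chr10 before chr2, which is why the chromosome name is mapped to a numeric rank first.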

comment:46 by Nicklas Nordborg, 4 years ago

Resolution: fixed
Status: accepted → closed