Opened 3 years ago

Closed 3 years ago

#1354 closed enhancement (fixed)

Search functionality for the OncoArray-500K SNP chip

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Variant Search v1.5
Component: net.sf.basedb.varsearch
Keywords:
Cc:

Description

Reggie will add VCF files with genotype data for the OncoArray 500K SNP chip (#1353). We should implement search functionality for this data set.

This search will have to work a bit differently than the other variant searches since it is not possible to index 500K variants for each item. This is due to limitations in the Lucene engine (max number of documents) and also due to performance.
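As a rough back-of-the-envelope check of the document limit: Lucene caps a single index at IndexWriter.MAX_DOCS documents (2,147,483,519, just below 2^31). Assuming one document per variant and exactly 500,000 variants per item (both simplifications, not taken from the implementation), this sketch estimates how many items would fit in one index:

```python
# Lucene's hard per-index document limit (IndexWriter.MAX_DOCS).
LUCENE_MAX_DOCS = 2**31 - 1 - 128  # 2,147,483,519

# Assumption: one Lucene document per variant, 500,000 variants per item.
VARIANTS_PER_ITEM = 500_000


def max_items(docs_per_item, limit=LUCENE_MAX_DOCS):
    """Return the number of whole items that fit below the document limit."""
    return limit // docs_per_item


print(max_items(VARIANTS_PER_ITEM))  # 4294
```

Under these assumptions the index would be full after a few thousand items, before performance is even considered.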

Instead, the initial idea is to have a two-step search procedure. The first step searches a fixed database with gene and annotation information for the 500K variants on the chip and returns a list of SNP ID values. The second database indexes only the SNP ID values for each item, in a way that makes it possible to search for REF (0/0), HET (0/1) or ALT (1/1) genotypes.

Change History (21)

comment:1 by Nicklas Nordborg, 3 years ago

Milestone: Variant Search v1.4 → Variant Search v1.5

Ticket retargeted after milestone closed

comment:2 by Nicklas Nordborg, 3 years ago

In 6539:

References #1354: Search functionality for the OncoArray-500K SNP chip

Added an item list (Genotype index (OncoArray500K)) that is the queue for raw bioassays that should be indexed.

comment:3 by Nicklas Nordborg, 3 years ago

In 6540:

References #1354: Search functionality for the OncoArray-500K SNP chip

Major refactoring of the indexing functionality. The OncoArray data is now indexed, but differently from the "normal" indexing since we can't index each variant separately. For each raw bioassay we create three "documents": one each for gt:0/0, gt:0/1 and gt:1/1, with a field snps that is a list of SNP-ID values. We also create a fourth document with summary information that is identical to the normal indexing.
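The per-bioassay document layout could be sketched roughly as follows. This is plain Python rather than the Lucene API, and the function name, the dictionary keys other than gt and snps, and the contents of the summary document are all assumptions made for illustration:

```python
from collections import defaultdict

GENOTYPES = ("0/0", "0/1", "1/1")


def build_documents(bioassay_id, calls):
    """Build four index 'documents' for one raw bioassay.

    calls is an iterable of (snp_id, genotype) pairs, e.g. read from a VCF.
    """
    snps_by_gt = defaultdict(list)
    for snp_id, gt in calls:
        if gt in GENOTYPES:  # this sketch simply skips no-calls such as "./."
            snps_by_gt[gt].append(snp_id)

    # One document per genotype, each holding the list of SNP-ID values.
    docs = [
        {"bioassay": bioassay_id, "gt": gt, "snps": snps_by_gt[gt]}
        for gt in GENOTYPES
    ]
    # Fourth document: summary information (fields here are illustrative).
    total = sum(len(d["snps"]) for d in docs)
    docs.append({"bioassay": bioassay_id, "type": "summary", "numSnps": total})
    return docs
```

With this layout the number of documents grows with the number of raw bioassays (four each) rather than with the number of variants.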

Some functions for getting metadata (counts, etc.) have been moved to the subclasses since the queries that are needed are different. There are still some functions that need to be fixed.

comment:4 by Nicklas Nordborg, 3 years ago

In 6541:

References #1354: Search functionality for the OncoArray-500K SNP chip

Major refactoring of the indexing functionality. The OncoArray data is now indexed, but differently from the "normal" indexing since we can't index each variant separately. For each raw bioassay we create three "documents": one each for gt:0/0, gt:0/1 and gt:1/1, with a field snps that is a list of SNP-ID values. We also create a fourth document with summary information that is identical to the normal indexing.

Some functions for getting metadata (counts, etc.) have been moved to the subclasses since the queries that are needed are different. There are still some functions that need to be fixed.

comment:5 by Nicklas Nordborg, 3 years ago

In 6543:

References #1354: Search functionality for the OncoArray-500K SNP chip

Added a file type for VCF files that are used to define an array design.

comment:6 by Nicklas Nordborg, 3 years ago

In 6544:

References #1354: Search functionality for the OncoArray-500K SNP chip

Added functionality for implementing "custom actions" on an index. The idea is to use this for indexing the reference VCF file (not yet implemented).

Updated some functions for metadata and also implemented a separate "query status" for an index to make it possible to disable it for querying until everything has been set up. This feature is used for the OncoArray index until the reference VCF has been indexed.

comment:7 by Nicklas Nordborg, 3 years ago

In 6545:

References #1354: Search functionality for the OncoArray-500K SNP chip

Implemented first version of indexing the reference VCF file.

comment:8 by Nicklas Nordborg, 3 years ago

In 6547:

References #1354: Search functionality for the OncoArray-500K SNP chip

Implemented query functionality for the OncoArray chip. This is a two-step process which first searches the reference index for a list of matching SNPs and then the variant database for raw bioassays that have a 0/1 or 1/1 genotype.

The implementation for displaying the results is just a temporary solution to prove that it is working. It needs to be re-factored.
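A minimal sketch of the two-step query, using in-memory dictionaries in place of the two Lucene indexes. The function names, the gene field and the data shapes are hypothetical and only illustrate the flow:

```python
def search_reference(reference_index, gene):
    """Step 1: return the SNP IDs annotated with the given gene."""
    return {snp for snp, info in reference_index.items() if info["gene"] == gene}


def search_genotypes(genotype_docs, snp_ids, genotypes=("0/1", "1/1")):
    """Step 2: map each raw bioassay to the matching SNPs it carries
    with one of the requested genotypes."""
    hits = {}
    for doc in genotype_docs:
        if doc["gt"] in genotypes:
            matched = snp_ids.intersection(doc["snps"])
            if matched:
                hits.setdefault(doc["bioassay"], set()).update(matched)
    return hits


reference = {
    "rs1": {"gene": "BRCA1"},
    "rs2": {"gene": "BRCA1"},
    "rs3": {"gene": "TP53"},
}
docs = [
    {"bioassay": 10, "gt": "0/0", "snps": ["rs1"]},
    {"bioassay": 10, "gt": "0/1", "snps": ["rs2", "rs3"]},
    {"bioassay": 11, "gt": "1/1", "snps": ["rs3"]},
]
snps = search_reference(reference, "BRCA1")
print(search_genotypes(docs, snps))  # {10: {'rs2'}}
```

Bioassay 10 is reported only for rs2, because its rs1 genotype is 0/0 and rs3 is not a BRCA1 SNP in this toy data.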

comment:9 by Nicklas Nordborg, 3 years ago

In 6548:

References #1354: Search functionality for the OncoArray-500K SNP chip

Fixes issues with query results when the user has entered multiple filter rows.

comment:10 by Nicklas Nordborg, 3 years ago

In 6549:

References #1354: Search functionality for the OncoArray-500K SNP chip

Do not load full document information if there are more SNPs than the max limit. It will improve speed in cases that only result in an error or a message that there are more hits.
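The early exit might look something like this sketch. The limit value, the function names and the shape of the result are assumptions, not taken from the changeset:

```python
MAX_SNPS = 3  # hypothetical limit; the real value is configured elsewhere


def load_hits(snp_ids, load_document, max_snps=MAX_SNPS):
    """Load full document information only when the hit count is within the limit."""
    if len(snp_ids) > max_snps:
        # Too many hits: report the count but skip the expensive loading.
        return {"tooMany": True, "count": len(snp_ids), "documents": []}
    return {
        "tooMany": False,
        "count": len(snp_ids),
        "documents": [load_document(snp) for snp in snp_ids],
    }
```

The point is that load_document is never called when the result would only be an error or a "more hits" message.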

comment:11 by Nicklas Nordborg, 3 years ago

In 6550:

References #1354: Search functionality for the OncoArray-500K SNP chip

Re-factored the loading of Document hits and moved the implementation to different subclasses for the normal variant calling indexes and the OncoArray index.

comment:12 by Nicklas Nordborg, 3 years ago

In 6551:

References #1354: Search functionality for the OncoArray-500K SNP chip

The "Variant details" dialog now supports loading information from two VCF files where the second file is just genotype information (the first file is all gene annotations).

comment:13 by Nicklas Nordborg, 3 years ago

In 6552:

References #1354: Search functionality for the OncoArray-500K SNP chip

Implemented support for subclassing the summary message about the number of hits and for setting default and allowed subcolumns.

The OncoArray index has no possibility to display or search on fields that have specific values per raw bioassay.

comment:14 by Nicklas Nordborg, 3 years ago

In 6553:

References #1354: Search functionality for the OncoArray-500K SNP chip

Re-factored the implementation to improve performance. Instead of loading each genotype per raw bioassay and SNP, the implementation will now bulk load all raw bioassay IDs for a SNP/genotype combination. This reduces the number of queries that are needed, and we can also cache the result as a set of raw bioassay IDs, which gives even better performance. The drawback is that we do not get the file ID, but we only need that information in the "View details" dialog so we add a special case there instead.
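The bulk-load-and-cache idea can be sketched as below. The class name and the bulk_loader callback are hypothetical stand-ins for the actual index query:

```python
class GenotypeCache:
    """Cache the set of raw bioassay IDs for each SNP/genotype combination."""

    def __init__(self, bulk_loader):
        # bulk_loader(snp_id, gt) returns an iterable of raw bioassay IDs.
        self._bulk_loader = bulk_loader
        self._cache = {}
        self.queries = 0  # number of times the loader was actually invoked

    def bioassays(self, snp_id, gt):
        key = (snp_id, gt)
        if key not in self._cache:
            self.queries += 1
            self._cache[key] = frozenset(self._bulk_loader(snp_id, gt))
        return self._cache[key]


table = {("rs1", "0/1"): [3, 1, 3]}
cache = GenotypeCache(lambda snp, gt: table.get((snp, gt), []))
cache.bioassays("rs1", "0/1")
cache.bioassays("rs1", "0/1")
print(cache.queries)  # 1
```

Only IDs are stored, which mirrors the trade-off described above: the cached sets are cheap to keep and intersect, but carry no file ID.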

comment:15 by Nicklas Nordborg, 3 years ago

In 6554:

References #1354: Search functionality for the OncoArray-500K SNP chip

Changed some text elements in the "View details" dialog since not everything is a "variant call".

comment:16 by Nicklas Nordborg, 3 years ago

In 6555:

References #1354: Search functionality for the OncoArray-500K SNP chip

Implemented support for queries against the gt: field. As a result it also improves the support for using multiple filter rows.

comment:17 by Nicklas Nordborg, 3 years ago

In 6556:

References #1354: Search functionality for the OncoArray-500K SNP chip

Improved progress reporting and error handling when indexing the reference VCF file.

Implemented functionality for displaying errors to the admin on the web page.

comment:18 by Nicklas Nordborg, 3 years ago

In 6558:

References #1354: Search functionality for the OncoArray-500K SNP chip

Added functionality for hiding and showing details for the indexes on the admin page.

comment:19 by Nicklas Nordborg, 3 years ago

In 6559:

References #1354: Search functionality for the OncoArray-500K SNP chip

Added support for using a '-' modifier when searching on the 'gt' field. For example, -gt:0/1 can now be used to find only homozygous genotypes (0/0 or 1/1).
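One way to read the '-' modifier is as the complement within the three indexed genotypes. The parser below is a sketch under that assumption; the function name is hypothetical, and the real query handling may differ:

```python
ALL_GENOTYPES = frozenset({"0/0", "0/1", "1/1"})


def parse_gt_term(term):
    """Translate 'gt:X' or '-gt:X' into the set of genotypes to search for."""
    negated = term.startswith("-")
    body = term[1:] if negated else term
    field, sep, value = body.partition(":")
    if field != "gt" or not sep or value not in ALL_GENOTYPES:
        raise ValueError(f"not a valid gt term: {term!r}")
    # A negated term selects every indexed genotype except the one given.
    return ALL_GENOTYPES - {value} if negated else frozenset({value})


print(sorted(parse_gt_term("-gt:0/1")))  # ['0/0', '1/1']
```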

comment:20 by Nicklas Nordborg, 3 years ago

In 6560:

References #1354: Search functionality for the OncoArray-500K SNP chip

Do not display the "There are NNN other variant calls..." message in the "View details" dialog when the genotype is not a variant.

comment:21 by Nicklas Nordborg, 3 years ago

Resolution: fixed
Status: new → closed