Opened 4 years ago
Closed 3 years ago
#1290 closed task (fixed)
Implement a variant search extension
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | major | Milestone: | Variant Search v1.0 |
Component: | net.sf.basedb.varsearch | Keywords: | |
Cc: |
Description (last modified by )
The variant calling pipeline in Reggie (see #1199) produces VCF files with lots of variants. In theory it is "easy" to search for variants with `grep` or `SnpSift`, but this takes a really long time (several hours) due to the large number of files in the system. Some kind of indexing is needed before searching is realistic.
I have run some initial tests with Apache Lucene (https://lucene.apache.org/) and indexed a few of the annotations in the annotated and filtered VCF files. The results are very promising: searching for variants in a gene or at a specific location typically takes only a few milliseconds. The results can be returned as a list of raw bioassay IDs, which means that it should be possible to include this functionality in the regular table listing.
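To illustrate why an index helps, here is a minimal Python sketch of the idea. It uses a plain in-memory dictionary rather than Lucene, and the record layout, sample values and function names are all made up for illustration; it is not how the extension is implemented.

```python
from collections import defaultdict

# Hypothetical, simplified variant records:
# (raw_bioassay_id, gene, chromosome, position). The values are invented.
VARIANTS = [
    (101, "GENE_A", "chr1", 1000),
    (101, "GENE_B", "chr2", 2000),
    (102, "GENE_A", "chr1", 1500),
    (103, "GENE_C", "chr3", 3000),
]


def build_index(variants):
    """Map each gene and each (chromosome, position) to raw bioassay ids."""
    by_gene = defaultdict(set)
    by_location = defaultdict(set)
    for bioassay_id, gene, chrom, pos in variants:
        by_gene[gene].add(bioassay_id)
        by_location[(chrom, pos)].add(bioassay_id)
    return by_gene, by_location


def search_gene(by_gene, gene):
    """Return the sorted raw bioassay ids with at least one variant in the gene."""
    return sorted(by_gene.get(gene, set()))


by_gene, by_location = build_index(VARIANTS)
print(search_gene(by_gene, "GENE_A"))
```

The files are scanned once when the index is built; after that a search is a single lookup instead of a pass over every VCF file, and the result is already a list of raw bioassay ids.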
This functionality could of course be integrated into Reggie, but since we also copy the VCF files to Relax, it would be nice if we could implement this as a separate extension that works in both the Reggie and Relax environments.
Change History (46)
comment:1 by , 4 years ago
Component: | not classified → net.sf.basedb.varsearch |
---|---|
Description: | modified (diff) |
Milestone: | → Variant Search v1.0 |
Owner: | changed from | to
Status: | new → accepted |
comment:22 by , 4 years ago
In [6140]
References #1290: Implement a variant search extension
Use separate item lists for the 'filtered' and 'all' indexes.
Also implemented a feature that waits a bit longer (up to 4 hours) before adding raw bioassays to the index if the list contains very few (<24), since it is more efficient (storage-wise) to index multiple documents at the same time. Since we sequence about 24 samples at a time, it is better to wait until all 24 have been called for variants.
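The waiting rule can be sketched roughly as below. This is a Python illustration only: the function name and the way the waiting time is measured (from the oldest pending entry) are assumptions, and only the thresholds (24 items, 4 hours) come from the commit message.

```python
from datetime import datetime, timedelta

MIN_BATCH_SIZE = 24            # "very few (<24)" in the commit message
MAX_WAIT = timedelta(hours=4)  # "up to 4 hours" in the commit message


def should_index_now(num_pending, oldest_added, now):
    """Decide whether the pending raw bioassays should be indexed now.

    num_pending  -- number of raw bioassays waiting in the item list
    oldest_added -- when the oldest pending entry was added (assumed to be tracked)
    now          -- current time
    """
    if num_pending == 0:
        return False  # nothing to index
    if num_pending >= MIN_BATCH_SIZE:
        return True   # a full batch is ready, no reason to wait
    # Small batch: wait, but never longer than MAX_WAIT.
    return now - oldest_added >= MAX_WAIT
```

With this rule, 5 pending items that have waited one hour are left alone, while the same 5 items are indexed once four hours have passed.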
comment:23 by , 4 years ago
In [6141]
References #1290: Implement a variant search extension
Added a summary document for each raw bioassay to the index. This makes it a lot faster to search for the IDs of indexed raw bioassays since we don't have to check all variants. The summary also contains a 'numVariants' field, which can be used as a special filter on the number of variants. This is automatically triggered if the query string is purely numeric.
Examples:
- `=20`: find rawbioassays with exactly 20 variants
- `>100`: find rawbioassays with more than 100 variants
- `=0`: find rawbioassays with 0 variants (that have a VCF file)
- `=`, `<>%`: find rawbioassays that have not been indexed
- `%`, `<>`: find rawbioassays that have been indexed (`>=0` will also work!)
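As a rough illustration of how such filter strings could be evaluated against the per-bioassay summaries, here is a small Python sketch. The function name, the dictionary layout and the operators beyond those in the examples (`<`, `<=`, `<>` with a number) are assumptions, not the extension's actual code.

```python
import operator
import re

# Operators accepted by this sketch; only =, > and >= appear in the examples.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "<>": operator.ne,
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Optional operator followed by digits; two-character operators are tried first.
_NUMERIC = re.compile(r"^(>=|<=|<>|=|>|<)?\s*(\d+)$")


def match_bioassays(query, num_variants):
    """Return the sorted raw bioassay ids matching a filter string.

    num_variants maps raw bioassay id -> variant count, or None when the
    raw bioassay has no summary document (hypothetical structure).
    """
    query = query.strip()
    if query in ("=", "<>%"):  # not indexed
        return sorted(i for i, n in num_variants.items() if n is None)
    if query in ("%", "<>"):   # indexed
        return sorted(i for i, n in num_variants.items() if n is not None)
    m = _NUMERIC.match(query)
    if m is None:
        raise ValueError(f"not a numeric filter: {query!r}")
    compare = _OPS[m.group(1) or "="]  # a bare number means "equals"
    value = int(m.group(2))
    return sorted(
        i for i, n in num_variants.items()
        if n is not None and compare(n, value)
    )
```

For example, with `counts = {1: 20, 2: 150, 3: 0, 4: None}`, `match_bioassays(">100", counts)` selects only id 2, while `match_bioassays("=", counts)` selects only the unindexed id 4.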
comment:25 by , 4 years ago
References #1290: Implement a variant search extension
It is now possible to pause automatic indexing by setting `AutoProcessing=Disable` on the item lists.
comment:46 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
In 6107: