Changes between Initial Version and Version 1 of net.sf.basedb.varsearch/using


Timestamp:
Feb 23, 2021, 1:54:47 PM (22 months ago)
Author:
Nicklas Nordborg
Comment:

Started to write documentation about searching for variants

  • net.sf.basedb.varsearch/using

    v1 v1  
     1= How to use the Variant Search extension =
     2
     3== Introduction ==
     4The current search functionality is integrated directly in BASE. Go to the raw bioassays list page and add the **Variant (filtered)** column to the table using the **Columns...** dialog, or right-click on the title row and enable it from the popup menu.
     5
     6By default, the column displays **Yes** or **No** depending on whether the VCF file has been indexed. If **Yes**, the number of variants is also displayed.
     7
     8== Searching ==
     9
     10The variant database uses Apache Lucene instead of PostgreSQL. Thus, the
     11search field for this column behaves differently from other search fields
     12in BASE.
     13
     14It is possible to search for raw bioassays by whether they have been indexed
     15and by the number of variants:
     16
     17 * `%`: Find all indexed raw bioassays
     18 * `<>%` or `=`:  Find all raw bioassays that are not indexed
     19 * `=0`: Find indexed raw bioassays with 0 variants
     20 * `>=500`: Find indexed raw bioassays with at least 500 variants
     21
     22**Note! ** This syntax is similar to the regular BASE syntax, but it is, for example, not possible to use a list of values: `=1|2` (will not work)
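
To make the rules above concrete, here is a minimal sketch of how such a filter string could be interpreted. It is not the extension's actual code: the function name `parse_variant_filter` and the returned tuple are hypothetical, and support for operators other than `=` and `>=` is an assumption, since only those two appear in the examples.

```python
import re

def parse_variant_filter(text):
    """Interpret a filter typed into the Variant (filtered) column.

    Returns (indexed, operator, count). operator and count are None
    when the filter only asks whether the raw bioassay is indexed.
    Hypothetical helper for illustration only.
    """
    text = text.strip()
    if text == "%":
        return (True, None, None)        # all indexed raw bioassays
    if text in ("<>%", "="):
        return (False, None, None)       # raw bioassays that are not indexed
    # A comparison against the number of variants, e.g. "=0" or ">=500".
    # Operators other than "=" and ">=" are assumed, not documented.
    m = re.fullmatch(r"(<=|>=|<>|=|<|>)\s*(\d+)", text)
    if m:
        return (True, m.group(1), int(m.group(2)))
    # Lists of values such as "=1|2" are not supported.
    raise ValueError(f"Unsupported filter: {text!r}")
```

With this sketch, `parse_variant_filter(">=500")` gives `(True, ">=", 500)`, while `parse_variant_filter("=1|2")` raises a `ValueError`.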
     23
     24== Searching for variants ==
     25
     26A typical search in a Lucene database consists of a **field name** and **query string** separated by a **colon** (:). Here are a few examples:
     27
     28 * `gene:TP53`: Find variants in the TP53 gene
     29 * `chr17:7675189`: Find variants at the specified genomic location
     30 * `c:423C>G`: Find variants at the given transcript position
     31 * `p:C141W`: Find variants causing the specified protein change
     32 * `cosmic:COSV52706449`: Find variants for the COSMIC ID
     33 * `rsid:rs1057519977`: Find variants for the dbSNP ID
     34
     35**Tip! ** The prefix can often be skipped, since the field can usually be guessed by pattern matching on the query string. The **gene:** field is the default and is used if no other field can be guessed.
     36Here are the same examples as above:
     37
     38 * `TP53`: Since **gene:** is the default field
     39 * `423C>G`: Numbers followed by [ACGT] and >
     40 * `C141W`: Single letters with numbers in between
     41 * `COSV52706449`: 'COSV' followed by numbers
     42 * `rs1057519977`: 'rs' followed by numbers
     43
     44**Note! ** If a query doesn't return the expected result, always try again
     45with explicit field prefixes.
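
The guessing described above can be pictured as a short list of patterns tried in order, with **gene:** as the fallback. The sketch below is only an illustration: the regular expressions and the name `add_field_prefix` are assumptions based on the descriptions in the list, not the patterns the extension actually uses.

```python
import re

# (field, pattern) pairs tried in order; assumed from the descriptions above.
FIELD_PATTERNS = [
    ("cosmic", re.compile(r"COSV\d+")),         # 'COSV' followed by numbers
    ("rsid", re.compile(r"rs\d+")),             # 'rs' followed by numbers
    ("c", re.compile(r"\d+[ACGT]>[ACGT]")),     # numbers, a base, '>' and a base
    ("p", re.compile(r"[A-Z]\d+[A-Z]")),        # letter, numbers, letter
]

def add_field_prefix(term):
    """Return the term with an explicit field prefix (hypothetical helper)."""
    if ":" in term:
        return term                  # already has a prefix, e.g. chr17:7675189
    for field, pattern in FIELD_PATTERNS:
        if pattern.fullmatch(term):
            return f"{field}:{term}"
    return f"gene:{term}"            # gene: is the default field
```

A gene symbol that happens to have the shape letter-numbers-letter would be guessed as `p:` by a matcher like this, which is one reason to retry with an explicit prefix when a search returns nothing.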
     46
     47It is possible to specify multiple conditions with **AND** and **OR**.
     48
     49 * `TP53 AND 423C>G`
     50 * `TP53 AND (423C>G OR 742C>T)`
     51
     52The **AND** and **OR** keywords are case-sensitive (they must be written in upper case), and **OR** is implicit if no operator is specified. Thus, the query string `TP53 and 423C>G` will not return the expected results since it is interpreted as: `gene:TP53 OR gene:and OR c:423C>G`.
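
The effect of the implicit **OR** can be illustrated with a small sketch that writes the operator out wherever two terms stand next to each other. It is a simplification and not the Lucene query parser: the function name `make_or_explicit` is hypothetical, and only **AND**, **OR** and parentheses are handled.

```python
import re

OPERATORS = {"AND", "OR"}   # only the upper-case spellings are operators

def make_or_explicit(query):
    """Insert 'OR' between adjacent terms that have no operator between them."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    out = []
    for tok in tokens:
        prev = out[-1] if out else None
        # An operand starts with a term or '(' and ends with a term or ')'.
        starts_operand = tok not in OPERATORS and tok != ")"
        prev_ends_operand = (
            prev is not None and prev not in OPERATORS and prev != "("
        )
        if starts_operand and prev_ends_operand:
            out.append("OR")
        out.append(tok)
    return " ".join(out).replace("( ", "(").replace(" )", ")")
```

Here `make_or_explicit("TP53 and 423C>G")` yields `TP53 OR and OR 423C>G`: the lower-case `and` is treated as just another search term, which is why the query matches far more than intended.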
     53
     54For more information about Lucene query syntax see https://lucene.apache.org/core/8_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description
     55
     56== Indexed fields ==
     57
     58The variant database currently contains the following indexed fields:
     59
     60 * `gene:` Gene name from the **ncbiRefSeq** annotation and the !SnpEff **Gene_Name** column. Case-insensitive.
     61 * `type:` Type of variant. Can be one of **SNV**, **Insertion**, **Deletion**, **Complex**.
     62 * `cosmic:` ID of the variant in the [https://cancer.sanger.ac.uk/cosmic COSMIC database].
     63 * `rsid:` ID of variant in the [https://www.ncbi.nlm.nih.gov/snp/ dbSNP database].
     64 * `c:` Transcript change. Taken from the **cosmic_CDS** annotation and !SnpEff **HGVS.c** column.
     65 * `p:` Expected amino-acid change in the protein. Taken from **cosmic_AA** annotation and !SnpEff **HGVS.p** column. **Note! ** COSMIC uses single-letter notation for amino acids, while !SnpEff uses three-letter notation. To make things simple, all values have been converted to single-letter notation in the index. See https://www.hgvs.org/mutnomen/codon.html for a table with translations.
     66 * `ref:` Reference allele
     67 * `alt:` Alternate allele
     68 * `chrom:` Name of chromosome (e.g. ''chr1'', ''chr2'', etc.)
     69 * `pos:` Location within the chromosome. The location is also indexed using the chromosome name as a field name. Thus, the query string `chrom:chr17 AND pos:7675189` can be written as `chr17:7675189` (recommended).
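
Since the `p:` field is indexed in single-letter notation, a protein change copied from !SnpEff output in three-letter form has to be converted before it is used in a search. The sketch below shows one way to do that by hand. The function name `to_single_letter` is hypothetical, and both the removal of the leading `p.` and the mapping of `Ter` to `*` are assumptions about how values are stored in the index.

```python
import re

# Standard three-letter to single-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",   # stop codon; how the index stores it is an assumption
}

_CODE = re.compile("|".join(THREE_TO_ONE))

def to_single_letter(hgvs_p):
    """Convert e.g. 'p.Cys141Trp' to 'C141W' (hypothetical helper)."""
    if hgvs_p.startswith("p."):
        hgvs_p = hgvs_p[2:]          # assumed: the index omits the 'p.' prefix
    return _CODE.sub(lambda m: THREE_TO_ONE[m.group(0)], hgvs_p)
```

For example, `to_single_letter("p.Cys141Trp")` returns `C141W`, which can then be searched for as `p:C141W`.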
     70