| 1 | = How to use the Variant Search extension = |
| 2 | |
| 3 | == Introduction == |
| 4 | The current search functionality is integrated directly into BASE. Go to the raw bioassays list page and add the **Variant (filtered)** column to the table using the **Columns...** dialog, or right-click on the title row and enable it from the popup menu. |
| 5 | |
| 6 | By default, the column displays **Yes** or **No** depending on whether the VCF file has been indexed. If **Yes**, the number of variants is also displayed. |
| 7 | |
| 8 | == Searching == |
| 9 | |
| 10 | The variant database uses Apache Lucene instead of PostgreSQL. Thus, the |
| 11 | search field for this column behaves differently from other search fields |
| 12 | in BASE. |
| 13 | |
| 14 | It is possible to filter raw bioassays by whether they have been indexed |
| 15 | and by the number of variants: |
| 16 | |
| 17 | * `%`: Find all indexed raw bioassays |
| 18 | * `<>%` or `=`: Find all raw bioassays that are not indexed |
| 19 | * `=0`: Find indexed raw bioassays with 0 variants |
| 20 | * `>=500`: Find indexed raw bioassays with at least 500 variants |
| 21 | |
| 22 | **Note!** This syntax is similar to the regular BASE syntax, but not everything is supported. For example, it is not possible to use a list of values: `=1|2` will not work. |
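
The filter expressions listed above can be read as a small rule set. The following is a minimal sketch of those rules in Python, not the extension's actual implementation. The function name `matches` is hypothetical, and support for `>`, `<` and `<=` is an assumption, since only `=` and `>=` appear in the examples.

```python
import operator
import re

# Only "=" and ">=" are shown on this page; the other operators are assumed.
_OPERATORS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def matches(filter_expr, variant_count):
    """Return True if a raw bioassay passes the filter expression.

    variant_count is None when the raw bioassay has not been indexed.
    """
    expr = filter_expr.strip()
    if expr == "%":
        return variant_count is not None      # all indexed raw bioassays
    if expr in ("<>%", "="):
        return variant_count is None          # raw bioassays that are not indexed
    m = re.fullmatch(r"(>=|<=|=|>|<)\s*(\d+)", expr)
    if m is None:
        # Lists of values such as "=1|2" end up here.
        raise ValueError(f"unsupported filter expression: {filter_expr!r}")
    if variant_count is None:
        return False                          # numeric filters need an indexed file
    return _OPERATORS[m.group(1)](variant_count, int(m.group(2)))
```

For example, `matches(">=500", 499)` is false, while `matches("=1|2", 1)` raises an error, mirroring the note above.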
| 23 | |
| 24 | == Searching for variants == |
| 25 | |
| 26 | A typical search in a Lucene database consists of a **field name** and a **query string** separated by a **colon** (:). Here are a few examples: |
| 27 | |
| 28 | * `gene:TP53`: Find variants in the TP53 gene |
| 29 | * `chr17:7675189`: Find variants at the specified genomic location |
| 30 | * `c:423C>G`: Find variants at the given transcript position |
| 31 | * `p:C141W`: Find variants causing the specified protein change |
| 32 | * `cosmic:COSV52706449`: Find variants for the COSMIC ID |
| 33 | * `rsid:rs1057519977`: Find variants for the dbSNP ID |
| 34 | |
| 35 | **Tip!** The field prefix can often be omitted, since the field can usually be guessed by pattern matching on the query string. The **gene:** field is the default and is used if no other field can be guessed. |
| 36 | Here are examples from above without the prefix: |
| 37 | |
| 38 | * `TP53`: Since **gene:** is the default field |
| 39 | * `423C>G`: Numbers followed by [ACGT] and > |
| 40 | * `C141W`: Single letters with numbers in between |
| 41 | * `COSV52706449`: 'COSV' followed by numbers |
| 42 | * `rs1057519977`: 'rs' followed by numbers |
| 43 | |
| 44 | **Note!** If a query doesn't return the expected result, try again |
| 45 | with explicit field prefixes. |
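
The field guessing described above can be illustrated with a few regular expressions. This is a sketch only: the patterns are paraphrased from the examples on this page, the name `add_field_prefix` is hypothetical, and the extension's real rules may differ.

```python
import re

# Patterns paraphrased from the examples above (assumptions, not the real rules).
_FIELD_PATTERNS = [
    ("cosmic", re.compile(r"COSV\d+")),      # 'COSV' followed by numbers
    ("rsid", re.compile(r"rs\d+")),          # 'rs' followed by numbers
    ("c", re.compile(r"\d+[ACGT]>.*")),      # numbers followed by [ACGT] and >
    ("p", re.compile(r"[A-Z]\d+[A-Z]")),     # single letters with numbers in between
]


def add_field_prefix(term):
    """Return the term with a guessed field prefix; gene: is the fallback."""
    if ":" in term:
        return term                          # explicit prefix, leave untouched
    for field, pattern in _FIELD_PATTERNS:
        if pattern.fullmatch(term):
            return f"{field}:{term}"
    return f"gene:{term}"
```

With these patterns `add_field_prefix("C141W")` gives `p:C141W`, while `TP53` matches no pattern and falls back to `gene:TP53`. A term that happens to match a pattern by accident would be assigned the wrong field, which is why explicit prefixes are the safer choice when results look wrong.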
| 46 | |
| 47 | It is possible to specify multiple conditions with **AND** and **OR**. |
| 48 | |
| 49 | * `TP53 AND 423C>G` |
| 50 | * `TP53 AND (423C>G OR 742C>T)` |
| 51 | |
| 52 | The **AND** and **OR** keywords are case-sensitive and must be written in upper case. **OR** is implied between terms when no keyword is given. Thus, the query string `TP53 and 423C>G` will not return the expected results since it is interpreted as: `gene:TP53 OR gene:and OR c:423C>G`. |
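
Because a lower-case keyword is silently treated as a search term, it can help to check a query string before submitting it. The helper below is a hypothetical pre-check, not part of the extension. It reports words that look like **AND** or **OR** but are not written in upper case.

```python
def lowercase_operators(query):
    """Return words that look like AND/OR but are not written in upper case."""
    # Treat parentheses as separators so "(423C>G or ...)" is split correctly.
    words = query.replace("(", " ").replace(")", " ").split()
    return [w for w in words if w.upper() in ("AND", "OR") and w != w.upper()]
```

For the query `TP53 and 423C>G` the helper returns `['and']`, and for `TP53 AND (423C>G OR 742C>T)` it returns an empty list.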
| 53 | |
| 54 | For more information about Lucene query syntax see https://lucene.apache.org/core/8_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description |
| 55 | |
| 56 | == Indexed fields == |
| 57 | |
| 58 | The variant database currently contains the following indexed fields: |
| 59 | |
| 60 | * `gene:` Gene name. Taken from the **ncbiRefSeq** annotation and the !SnpEff **Gene_Name** column. Case-insensitive. |
| 61 | * `type:` Type of variant. Can be one of **SNV**, **Insertion**, **Deletion**, **Complex**. |
| 62 | * `cosmic:` ID of the variant in the [https://cancer.sanger.ac.uk/cosmic COSMIC database]. |
| 63 | * `rsid:` ID of the variant in the [https://www.ncbi.nlm.nih.gov/snp/ dbSNP database]. |
| 64 | * `c:` Transcript change. Taken from the **cosmic_CDS** annotation and !SnpEff **HGVS.c** column. |
| 65 | * `p:` Expected amino-acid change in the protein. Taken from the **cosmic_AA** annotation and !SnpEff **HGVS.p** column. **Note!** COSMIC uses single-letter notation for amino acids, while !SnpEff uses three-letter notation. For consistency, all values have been converted to single-letter notation in the index. See https://www.hgvs.org/mutnomen/codon.html for a table with translations. |
| 66 | * `ref:` Reference allele |
| 67 | * `alt:` Alternate allele |
| 68 | * `chrom:` Name of the chromosome (e.g. ''chr1'', ''chr2'', etc.) |
| 69 | * `pos:` Location within the chromosome. The location is also indexed using the chromosome name as a field name. Thus, the query string `chrom:chr17 AND pos:7675189` can be written as `chr17:7675189` (recommended). |
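
Since the `p:` field stores single-letter notation, a three-letter value copied from a !SnpEff annotation has to be translated before it is used in a query. The sketch below shows one way to do that with the 20 standard amino-acid codes. The function name `to_single_letter` is hypothetical, it assumes input of the form `p.Cys141Trp`, and it is not necessarily how the extension performs the conversion.

```python
import re

# Standard three-letter to single-letter amino-acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_single_letter(hgvs_p):
    """Convert e.g. 'p.Cys141Trp' to 'C141W'; unknown codes are left as-is."""
    text = hgvs_p[2:] if hgvs_p.startswith("p.") else hgvs_p
    return re.sub(
        r"[A-Z][a-z]{2}",
        lambda m: THREE_TO_ONE.get(m.group(0), m.group(0)),
        text,
    )
```

The result can be used directly in a query, for example `p:C141W` for the input `p.Cys141Trp`.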
| 70 | |