How to use the Variant Search extension

Introduction

The current search functionality is integrated directly in BASE. Go to the raw bioassays list page and add the Variants (filtered), Variants (all) and/or Variants (targeted) columns to the table using the Columns... dialog, or right-click on the title row and enable them from the popup menu.

By default, the columns display Yes or No depending on whether the VCF file has been indexed. If Yes, the number of variants is also displayed.

  • Variants (all): This index contains all variants found in the variant calling pipeline. There are typically several thousand for each raw bioassay.
  • Variants (filtered): This index contains the variants that survived the filtering step in the variant calling pipeline. There are typically fewer than 100 variants for each raw bioassay.
  • Variants (targeted): This index contains the results from the targeted genotyping and includes results both for when a variant is found and for when it is not found. There are currently 3 sets of targeted genotyping:
    • ESR1 (13 resistance mutations related to standard hormone therapy)
    • PIK3CA (11 variants related to Alpelisib treatment)
    • DPYD (6 variants related to fluoropyrimidine-associated toxicity)

Searching

The variant database uses Apache Lucene instead of PostgreSQL. Thus, the search field for this column behaves differently from other search fields in BASE.

It is possible to search for raw bioassays based on whether they have been indexed and on the number of variants:

  • %: Find all indexed raw bioassays
  • <>% or =: Find all raw bioassays that are not indexed
  • =0: Find indexed raw bioassays with 0 variants
  • >=500: Find indexed raw bioassays with at least 500 variants

Note! This syntax is similar to the regular BASE syntax, but it is not possible, for example, to use a list of values: =1|2 (will not work)
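
The filter forms listed above can be summarized in a small Python sketch. This is only an illustration of the documented behavior, not the extension's actual code: the function name matches_index_filter is hypothetical, a variant count of None stands for "not indexed", and support for <, <= and > is an assumption (only =N and >=N are shown above).

```python
import operator
import re

# Hypothetical helper for illustration; not part of BASE or the extension.
# Only "=N" and ">=N" are documented; the other comparisons are assumed.
_COMPARISON = re.compile(r"^(>=|<=|=|>|<)(\d+)$")
_OPERATORS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def matches_index_filter(expr, variant_count):
    """Return True if a raw bioassay passes the column filter.

    variant_count is None when the raw bioassay has not been indexed.
    """
    expr = expr.strip()
    if expr == "%":
        # All indexed raw bioassays
        return variant_count is not None
    if expr in ("<>%", "="):
        # All raw bioassays that are not indexed
        return variant_count is None
    match = _COMPARISON.match(expr)
    if match is None:
        # For example a list of values such as "=1|2"
        raise ValueError(f"unsupported filter: {expr!r}")
    if variant_count is None:
        # Numeric comparisons only apply to indexed raw bioassays
        return False
    compare = _OPERATORS[match.group(1)]
    return compare(variant_count, int(match.group(2)))
```

In this sketch a list of values such as =1|2 raises a ValueError, mirroring the note that lists are not supported.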

Searching for variants

A typical search in a Lucene database consists of a field name and query string separated by a colon (:). Here are a few examples:

  • gene:TP53: Find variants in the TP53 gene
  • chr17:7675189: Find variants at the specified genomic location
  • c:423C>G: Find variants at the given transcript position
  • p:C141W: Find variants causing the specified protein change
  • cosmic:COSV52706449: Find variants for the COSMIC ID
  • rsid:rs1057519977: Find variants for the dbSNP ID
  • gt:1/1: Find variants that are homozygous
  • dp:>=100: Find variants that have been sequenced to a depth of 100 or more
  • vd:>=100: Find variants where at least 100 alternate alleles have been found
  • af:>0.8: Find variants where the alternate allele frequency is larger than 0.8
  • effect:synonymous: Find variants where the predicted effect is synonymous

Tip! The prefix can in many cases be skipped since it is almost always possible to automatically make an educated guess about the field with some pattern matching on the query string. The gene: field is the default and will be used if a field can't be automatically guessed. Here are the same examples as above:

  • TP53: Since gene: is the default field
  • 423C>G: Numbers followed by [ACGT] and >
  • C141W: Single-letter with numbers in-between
  • COSV52706449: 'COSV' followed by numbers
  • rs1057519977: 'rs' followed by numbers
  • 1/1: 0 or 1 with forward slash in-between

Note! If a query doesn't return the expected result, always try again with explicit field prefixes. The following fields are never auto-detected:

  • dp
  • vd
  • af
  • effect
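
The pattern matching described above could look roughly like the following Python sketch. The regular expressions are assumptions derived from the short descriptions in the list (the extension's real patterns may be stricter or more lenient), and guess_field is a hypothetical name.

```python
import re

# Assumed patterns, reconstructed from the descriptions in the text.
# They are checked in order; the first match decides the field.
_PATTERNS = [
    ("cosmic", re.compile(r"^COSV\d+$")),      # 'COSV' followed by numbers
    ("rsid", re.compile(r"^rs\d+$")),          # 'rs' followed by numbers
    ("gt", re.compile(r"^[01]/[01]$")),        # 0 or 1 with a slash in between
    ("c", re.compile(r"^\d+[ACGT]>")),         # numbers followed by [ACGT] and >
    ("p", re.compile(r"^[A-Z]\d+[A-Z]$")),     # single letters around numbers
]


def guess_field(term):
    """Return the term with a field prefix, guessing the field if needed."""
    if ":" in term:
        # An explicit prefix is always kept as-is
        return term
    for field, pattern in _PATTERNS:
        if pattern.search(term):
            return f"{field}:{term}"
    # gene: is the default when nothing else matches
    return f"gene:{term}"


for example in ["TP53", "423C>G", "C141W", "COSV52706449", "rs1057519977", "1/1"]:
    print(example, "->", guess_field(example))
```

Note that with these assumed patterns a bare >=100 falls through to gene:, which is why dp, vd, af and effect always need an explicit prefix. A gene symbol that happens to look like a protein change (for example C1S) would also be misclassified by this sketch, which is another reason to retry with explicit prefixes.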

It is possible to specify multiple conditions with AND and OR.

  • TP53 AND 423C>G
  • TP53 AND (423C>G OR 742C>T)

The AND and OR keywords are case-sensitive and OR is implicit if not specified. Thus, the query string TP53 and 423C>G will not return the expected results since it is interpreted as: gene:TP53 OR gene:and OR c:423C>G.
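
Since a lower-case operator silently turns into a search term, it can be useful to check a query string before submitting it. The following Python sketch is a hypothetical helper (lowercase_operators is not part of the extension) that reports AND/OR keywords that are not written in upper case.

```python
def lowercase_operators(query):
    """Return tokens that look like AND/OR but are not in upper case."""
    # Treat parentheses as separators so "(a or b)" is split correctly
    tokens = query.replace("(", " ").replace(")", " ").split()
    return [t for t in tokens if t.upper() in ("AND", "OR") and t != t.upper()]


print(lowercase_operators("TP53 and 423C>G"))              # ['and']
print(lowercase_operators("TP53 AND (423C>G OR 742C>T)"))  # []
```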

For more information about Lucene query syntax see https://lucene.apache.org/core/8_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description

Indexed fields

The variant database currently contains the following indexed fields:

  • gene: Gene name from the ncbiRefSeq annotation and the SnpEff Gene_Name column. Case-insensitive.
  • type: Type of variant. Can be one of SNV, Insertion, Deletion, Complex.
  • cosmic: ID of the variant in the COSMIC database.
  • rsid: ID of variant in the dbSNP database.
  • c: Transcript change. Taken from the cosmic_CDS annotation and SnpEff HGVS.c column.
  • p: Expected amino-acid change in the protein. Taken from cosmic_AA annotation and SnpEff HGVS.p column. Note! COSMIC uses single-letter notation for amino acids, while SnpEff uses three-letter notation. To make things simple, all values have been converted to single-letter notation in the index. See https://www.hgvs.org/mutnomen/codon.html for a table with translations.
  • ref: Reference allele
  • alt: Alternate allele
  • chrom: Name of chromosome (eg. chr1, chr2, ..., chr22, chrX, chrY and chrM)
  • pos: Location within the chromosome. The location is also indexed using the chromosome name as a field name. Thus, the query string chrom:chr17 AND pos:7675189 can be written as chr17:7675189 (recommended).
  • effect: Expected effect of the variant as determined by SnpEff. There is a table with a list of possible values. The _variant suffix is ignored for those values that have it.
  • gt: Genotype of the variant. Can be one of 0/1, 1/0 or 1/1.
  • dp: Depth at the variant location.
  • vd: Depth of the alternate allele at the variant location.
  • af: Allele frequency of the alternate allele.
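
When building a p: query from a SnpEff-style value, the three-letter amino acid codes need to be converted to single-letter codes as described for the p field above. A minimal Python sketch of such a conversion is shown below. The function name to_single_letter is hypothetical, and stripping the leading "p." is an assumption based on the p:C141W example; how the extension itself performs the conversion is not documented here.

```python
import re

# Standard three-letter to single-letter amino acid codes, plus Ter for stop
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def to_single_letter(hgvs_p):
    """Convert e.g. 'p.Cys141Trp' to 'C141W'.

    Unknown three-letter groups are left unchanged.
    """
    # Assumption: the index stores values without the 'p.' prefix
    text = hgvs_p[2:] if hgvs_p.startswith("p.") else hgvs_p
    return re.sub(
        r"[A-Z][a-z]{2}",
        lambda m: THREE_TO_ONE.get(m.group(0), m.group(0)),
        text,
    )


print(to_single_letter("p.Cys141Trp"))  # C141W
```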

Notes about the targeted genotyping variants

The targeted genotyping index behaves a bit differently than the other indexes since it also contains entries where the genotype is 0/0. When searching this index, the default is to automatically exclude entries where the genotype is 0/0. Otherwise, for example, a search for a gene would return results for all raw bioassays.

The exception is when the query string is searching on fields that are related to the genotype: gt, dp, vd and af. In this case the 0/0 entries are not excluded.
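
The rule can be expressed as a small Python sketch. It is an illustration only: the function name is hypothetical, and the sketch only looks for explicit gt:, dp:, vd: and af: prefixes (whether an auto-detected genotype such as a bare 1/1 also counts is not covered here).

```python
import re

GENOTYPE_FIELDS = ("gt", "dp", "vd", "af")

# Matches an explicit genotype-related field prefix such as "gt:" or "af:".
# The lookbehind avoids matching inside longer field names.
_GENOTYPE_PREFIX = re.compile(r"(?<!\w)(" + "|".join(GENOTYPE_FIELDS) + r"):")


def excludes_reference_genotype(query):
    """Return True if 0/0 entries would be excluded for this query."""
    return _GENOTYPE_PREFIX.search(query) is None


print(excludes_reference_genotype("gene:ESR1"))              # True
print(excludes_reference_genotype("gene:ESR1 AND gt:0/0"))   # False
```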

Notes about the OncoArray genotyping index

This index also behaves differently than the other indexes. Since almost 500 thousand variants have been genotyped for each raw bioassay, it is not possible to create a single index with all information. It has to be split into one index for the "design" that contains information and gene annotations about the 500 thousand variants, and one index that enumerates the variants for each of the three possible genotypes. The index doesn't contain any information about allele frequency or other information that is per raw bioassay and variant.

Thus, gt is the only genotype-related field that is searchable and it is only possible to use it once in the query.

Timeouts

The query against the Lucene database has a default timeout of 30 seconds. If complete results and information about the variants can't be loaded within that time, the query will stop and only the results that have been gathered so far are used when matching against raw bioassays. A message is displayed in the column header:

The query did not finish within 30 seconds. Showing results based on 4.3M hits
out of 56.1M total hits.

It is possible to change the timeout by appending #TT to the query string, where TT is the desired timeout. For example: chr1#120 will increase the timeout to 120 seconds when searching for all variants in chromosome 1. The timeout is not allowed to be longer than 600 seconds.
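
As an illustration, the #TT suffix could be separated from the query string as in the following Python sketch. The function name split_timeout is hypothetical, and clamping values above 600 down to 600 is an assumption; the text only states that longer timeouts are not allowed, not how the extension reacts to them.

```python
DEFAULT_TIMEOUT = 30   # seconds, from the text above
MAX_TIMEOUT = 600      # seconds, from the text above


def split_timeout(query):
    """Split 'chr1#120' into ('chr1', 120).

    Returns the default timeout when no numeric #TT suffix is present.
    Assumption: values above the maximum are clamped, not rejected.
    """
    base, sep, suffix = query.rpartition("#")
    suffix = suffix.strip()
    if sep and suffix.isdecimal():
        return base.strip(), min(int(suffix), MAX_TIMEOUT)
    return query.strip(), DEFAULT_TIMEOUT


print(split_timeout("chr1#120"))  # ('chr1', 120)
print(split_timeout("chr1"))      # ('chr1', 30)
```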

Last modified on Aug 28, 2023, 11:25:05 AM