Opened 3 weeks ago

Last modified 5 days ago

#1582 new task

Add support for searching among imputed genotypes

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Variant Search v1.11
Component: net.sf.basedb.varsearch
Keywords:
Cc:

Description

Reggie is implementing an imputation pipeline that will provide genotypes for ~85M variants (see #1575). It would be useful to be able to search among these genotypes to find samples with a given variant.

It should be possible to implement something similar to the current support for OncoArray, where the 85M variants are mapped as an array design. We probably cannot index each sample individually, but it might work to use the imputed VCF files, where the genotypes are stored in a big matrix. HTSJDK (https://github.com/samtools/htsjdk) can read those files very efficiently.

Change History (16)

comment:1 by Nicklas Nordborg, 3 weeks ago

In 7651:

References #1582: Add support for searching among imputed genotypes

Started to work with this. Added an item list and an array design and a new Index implementation that builds on the OncoArrayIndex. So far, the only change is that it allows multiple VCF files to be attached to the array design since we probably want one VCF per chromosome.

comment:2 by Nicklas Nordborg, 2 weeks ago

In 7652:

References #1582: Add support for searching among imputed genotypes

Started a bigger refactoring. A new SplitIndex class will hold functionality for indexes that are split between a reference-index and a sample-index.

comment:3 by Nicklas Nordborg, 2 weeks ago

In 7653:

References #1582: Add support for searching among imputed genotypes

Implemented indexing of imputed raw bioassays. Lots of other changes to support VCF files that are not in the BASE file system.

comment:4 by Nicklas Nordborg, 12 days ago

In 7655:

References #1582: Add support for searching among imputed genotypes

Started to implement the part where the results from the reference query are translated to a query for samples. We use HTSJDK to read from the imputed result files at the locations we get from the reference query. We can then get the sample names from the VCF column headers for the requested genotype and create a query that is executed against the sample index. The implementation works, but there are more details to sort out.
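To illustrate the lookup described above, here is a minimal sketch in plain Python. It is not the plugin code (which is Java and reads the files via HTSJDK). It assumes a tab-separated VCF header line and one data line, and the function name and sample names (S1..S3) are made up for the example.

```python
def samples_with_genotype(header_line, data_line, genotype):
    """Return the sample names whose GT value equals the requested genotype."""
    # The first nine VCF columns are fixed (CHROM .. FORMAT); samples follow.
    sample_names = header_line.rstrip("\n").split("\t")[9:]
    fields = data_line.rstrip("\n").split("\t")
    # FORMAT lists the per-sample subfields, e.g. "GT:DS".
    gt_index = fields[8].split(":").index("GT")
    matches = []
    for name, call in zip(sample_names, fields[9:]):
        if call.split(":")[gt_index] == genotype:
            matches.append(name)
    return matches


# Hypothetical header and record with three samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
record = "1\t1000\t.\tA\tC\t.\tPASS\tAF=0.33;DR2=0.95\tGT:DS\t0|1:1\t0|0:0\t0|1:1"
print(samples_with_genotype(header, record, "0|1"))  # ['S1', 'S3']
```

The returned names would then be the input to a query against the sample index.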

comment:5 by Nicklas Nordborg, 11 days ago

In 7656:

References #1582: Add support for searching among imputed genotypes

The imputed index can now find results for each sample to display in the table.

comment:6 by Nicklas Nordborg, 11 days ago

In 7657:

References #1582: Add support for searching among imputed genotypes

Implemented support for AF and DR2 fields. It works, but the solution needs to be more flexible since there are now several differences between the different databases.

comment:7 by Nicklas Nordborg, 10 days ago

In 7658:

References #1582: Add support for searching among imputed genotypes

Implemented a somewhat more flexible solution for indexing custom fields.

comment:8 by Nicklas Nordborg, 10 days ago

In 7659:

References #1582: Add support for searching among imputed genotypes

Implemented a more flexible solution for subcolumns that can be displayed in the table.

comment:9 by Nicklas Nordborg, 6 days ago

In 7660:

References #1582: Add support for searching among imputed genotypes

Implemented support for displaying details about an imputed SNP. This required a lot of changes to underlying code for parsing VCF files so that we can get to the correct line for both VCF files stored in the BASE file system and VCF files stored on the local file system. Hopefully nothing is badly broken...

comment:10 by Nicklas Nordborg, 6 days ago

In 7661:

References #1582: Add support for searching among imputed genotypes

Cleaning up code and fixing some methods that returned incorrect counts (or no counts).

comment:11 by Nicklas Nordborg, 6 days ago

In 7662:

References #1582: Add support for searching among imputed genotypes

Changed "sampleName" to "externalName".

comment:12 by Nicklas Nordborg, 5 days ago

In 7663:

References #1582: Add support for searching among imputed genotypes

Switched from using VCFFileReader to using TabixReader (via our own VcfParser) when parsing the reference file for samples matching the requested genotype. The reason is that we then get the genotypes exactly as they are written in the VCF file (e.g. 0|1 instead of A|C).
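For reference, the difference between the two representations can be sketched as follows. This is only an illustration in Python of how allele indexes in a GT value relate to the REF and ALT columns. The function name is hypothetical and nothing here is taken from VcfParser.

```python
import re


def gt_to_alleles(gt, ref, alt):
    """Translate an index-based GT value (e.g. 0|1) to allele bases (e.g. A|C)."""
    # Index 0 is REF, index 1..n are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")
    # Splitting with a capture group keeps the phasing separators ('|' or '/').
    parts = re.split(r"([|/])", gt)
    return "".join(p if p in ("|", "/", ".") else alleles[int(p)] for p in parts)


print(gt_to_alleles("0|1", "A", "C"))    # A|C
print(gt_to_alleles("1/2", "A", "C,T"))  # C/T
```

Matching on the raw value (0|1) avoids this translation step when comparing against the requested genotype.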

Also fixed an issue with the TabixReader, which may return INDEL lines that start before the requested position if they overlap with that position.
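A rough sketch of one way to handle such overlapping lines is shown below. The changeset does not say how the fix was made, so this assumes the unwanted lines are simply skipped by comparing the POS column with the requested position. The function name and the example records are hypothetical.

```python
def drop_records_before(lines, position):
    """Keep only VCF data lines whose POS is at or after the requested position."""
    kept = []
    for line in lines:
        pos = int(line.split("\t")[1])  # POS is the second VCF column
        if pos >= position:
            kept.append(line)
    return kept


# A deletion at POS 998 with a 4-base REF spans 998-1001, so a region
# query for position 1000 can return it together with the SNP at 1000.
query_result = [
    "1\t998\t.\tACGT\tA\t.\tPASS\t.",
    "1\t1000\t.\tG\tT\t.\tPASS\t.",
]
print(drop_records_before(query_result, 1000))
```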

comment:13 by Nicklas Nordborg, 5 days ago

In 7664:

References #1582: Add support for searching among imputed genotypes

Some INFO annotations (such as AF and DR2) have multiple values for multi-allelic sites. To be able to search and display the values we need to add multiple entries to the database.
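As an illustration, the expansion could look roughly like this. It is a sketch that assumes one entry per ALT allele. How the entries are actually stored in the index is not described here, and the function name and dictionary layout are made up.

```python
def expand_per_allele(alt, info, keys=("AF", "DR2")):
    """Create one entry per ALT allele with that allele's INFO values."""
    alt_alleles = alt.split(",")
    values = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            values[key] = value.split(",")
    entries = []
    for i, allele in enumerate(alt_alleles):
        entry = {"ALT": allele}
        for key in keys:
            vals = values.get(key)
            if vals is None:
                continue
            # One value per allele if the counts match; otherwise
            # (assumption) the first value is used for every allele.
            picked = vals[i] if len(vals) == len(alt_alleles) else vals[0]
            entry[key] = float(picked)
        entries.append(entry)
    return entries


print(expand_per_allele("C,T", "AF=0.12,0.03;DR2=0.98,0.71"))
```

With two ALT alleles, the example produces two entries, each carrying its own AF and DR2 value.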

comment:14 by Nicklas Nordborg, 5 days ago

In 7665:

References #1582: Add support for searching among imputed genotypes

Fixes an issue with displaying REF and ALT values that contain HTML special characters.
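A small example of the kind of escaping involved, using Python's standard html module. That symbolic alleles such as <DEL> triggered the problem is an assumption, and the helper name is hypothetical.

```python
import html


def display_allele(allele):
    """Escape an allele string so it can be shown as text in an HTML table."""
    return html.escape(allele)


print(display_allele("<DEL>"))  # &lt;DEL&gt;
print(display_allele("ACGT"))   # ACGT
```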

comment:15 by Nicklas Nordborg, 5 days ago

In 7666:

References #1582: Add support for searching among imputed genotypes

Added SVTYPE to the index.

comment:16 by Nicklas Nordborg, 5 days ago

In 7667:

References #1582: Add support for searching among imputed genotypes

If the TYPE and VT annotations are both missing, we generate a type from the REF and ALT information.
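One possible way to derive a type from REF and ALT is sketched below. The rules and the labels (SNP, MNP, INDEL, SV) are assumptions chosen for the example. The changeset does not say which rules or names the plugin uses.

```python
def variant_type(ref, alt):
    """Guess a variant type from REF and ALT (hypothetical rules and labels)."""
    types = set()
    for allele in alt.split(","):
        if allele in ("*", "."):
            continue  # spanning deletion or missing allele: no type derived
        if allele.startswith("<") or "[" in allele or "]" in allele:
            types.add("SV")  # symbolic allele or breakend notation
        elif len(ref) == 1 and len(allele) == 1:
            types.add("SNP")
        elif len(ref) == len(allele):
            types.add("MNP")
        else:
            types.add("INDEL")
    # Multi-allelic sites may yield more than one type.
    return ",".join(sorted(types))


print(variant_type("A", "C"))     # SNP
print(variant_type("AT", "A"))    # INDEL
print(variant_type("A", "C,AT"))  # INDEL,SNP
```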
