id	summary	reporter	owner	description	type	status	priority	milestone	component	resolution	keywords	cc
1346	Implement support for OncoArray SNP data	Nicklas Nordborg	Nicklas Nordborg	"This ticket is about importing the !OncoArray SNP data that we already have for 1995 blood samples.

We already have BloodDNA items in BASE that represent the aliquots used for this. A typical name is `1234567.b.d.x1.x1`. The SNP data only use `1234567` as the identifier, so we need to check that they match up.
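
The matching check could be sketched roughly as below. This is only an illustration: the function names `sample_id` and `match_snp_ids` are hypothetical, and the name pattern is an assumption based on the single example name given above.

```python
import re

# Assumed pattern: a numeric sample ID followed by .b.d and optional
# dot-separated suffixes, as in the example name 1234567.b.d.x1.x1.
NAME_PATTERN = re.compile(r'^(\d+)\.b\.d(\.\w+)*$')


def sample_id(blood_dna_name):
    # Return the numeric ID part of a BloodDNA name, or None if the
    # name does not follow the assumed pattern.
    m = NAME_PATTERN.match(blood_dna_name)
    return m.group(1) if m else None


def match_snp_ids(blood_dna_names, snp_ids):
    # Group BloodDNA names by their numeric ID.
    by_id = {}
    for name in blood_dna_names:
        sid = sample_id(name)
        if sid is not None:
            by_id.setdefault(sid, []).append(name)
    # SNP IDs with at least one BloodDNA item, and SNP IDs without any.
    matched = {sid: by_id[sid] for sid in snp_ids if sid in by_id}
    missing = [sid for sid in snp_ids if sid not in by_id]
    return matched, missing
```

An ID that maps to more than one BloodDNA item shows up as a list with several names in `matched`, which is probably something to review manually.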

The plan is to also represent the lab work in the data structure in BASE. The BloodDNA items should be linked to !PhysicalBioAssay items representing the actual SNP chip used. A new subtype `BeadChip` should be created with the naming convention `BeadChipNNNN`, where NNNN is a counter (similar to how `FlowCell`s are named). Annotations on a `BeadChip`:

 * !ChipType: (e.g. OncoArray500K)
 * Barcode: (numerical, e.g. 10001187003)
 * more... ?

The Barcode will make it possible to find the data files related to the chip, since it is expected to be part of a directory name in a given folder structure.
Each !BeadChip has 24 locations and should be linked to the 24 BloodDNA items used on the chip. To find the correct data files for each sample we need information about the location. Locations are named with row+column coordinates (`R01C01`... `R12C02`). Theoretically we can construct a location string from the index (1-24), but it may be better to store this as an annotation on the `BloodDNA` items.
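
As an illustration of constructing the location string from the index, here is a minimal sketch. It assumes that the index runs down the 12 rows of column 1 first and then continues with column 2. That ordering is an assumption and must be verified against the real chip layout. The function name `location_from_index` is hypothetical.

```python
def location_from_index(index, rows=12, columns=2):
    # ASSUMPTION: index 1-12 fills R01C01..R12C01 and index 13-24
    # fills R01C02..R12C02. Verify against the actual chip layout.
    if not 1 <= index <= rows * columns:
        raise ValueError(f'index must be between 1 and {rows * columns}')
    column, row = divmod(index - 1, rows)
    return f'R{row + 1:02d}C{column + 1:02d}'
```

Under this assumption index 1 gives `R01C01`, index 13 gives `R01C02` and index 24 gives `R12C02`. Storing the location as an annotation avoids depending on the ordering at all.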

The scanning of a !BeadChip is represented by a !DerivedBioAssay `Scan` item. Dates, Scanner ID, etc. can be extracted from the data we have and should be imported as annotations or linked Hardware items. The `DataFilesFolder` annotation will point to a folder with the scan data. We will need the *.idat files in the next step.

The scanned data (*.idat) will be analyzed by `iaap-cli` (https://support.illumina.com/downloads/iaap-genotyping-cli.html) to produce a set of 24 GTC files. The result from this step will be represented by !DerivedBioAssay `GenotypeCall` items (one for each sample). We use the regular naming convention by adding a `.gt` suffix (but the `x1` parts are removed). Example: `1234567.b.d.gt`. The GTC files will be stored in the `<project-archive>` using a convention similar to the one we already have (e.g. `../12/1234567.b/d.gt`). Other useful metadata from the genotype calling can be stored as annotations:

 * Call rate
 * GC10, GC50
 * etc...
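
The naming and storage convention for `GenotypeCall` items could be sketched like this. The function names are hypothetical. The rule that the first directory level is the first two characters of the sample ID is an assumption read from the single example path above, and the sketch only builds the part of the path that follows the elided prefix.

```python
import re


def genotype_call_name(blood_dna_name):
    # Drop trailing x<N> parts and add the .gt suffix:
    # 1234567.b.d.x1.x1 -> 1234567.b.d.gt
    parts = blood_dna_name.split('.')
    while parts and re.fullmatch(r'x\d+', parts[-1]):
        parts.pop()
    return '.'.join(parts) + '.gt'


def archive_subpath(gt_name):
    # ASSUMPTION: the top-level directory is the first two characters
    # of the sample ID, as in the example 12/1234567.b/d.gt.
    sample, rest = gt_name.split('.', 1)        # 1234567 and b.d.gt
    specimen, remainder = rest.split('.', 1)    # b and d.gt
    return f'{sample[:2]}/{sample}.{specimen}/{remainder}'
```

With the example name from this ticket, `genotype_call_name` returns `1234567.b.d.gt` and `archive_subpath` returns `12/1234567.b/d.gt`.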

**Downstream analysis**

The GTC files can be used for extracting information to other formats. For example, it is possible to export tab-separated files or convert to VCF files. This will be addressed in another ticket.

**Wizards**

We do not plan to implement wizards for this. Reggie will simply create/define item types, annotation types, etc. that are needed. Batch importers will be used to import data and batch exporters will be used to get data into scripts that are manually created.

"	task	closed	major	Reggie v4.34	net.sf.basedb.reggie	fixed