Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#155 closed enhancement (wontfix)

New error handling option for duplicated features when creating array design from a bgx-file

Reported by: johan Owned by: Nicklas Nordborg
Priority: minor Milestone:
Component: net.sf.basedb.illumina Keywords:
Cc:

Description

A bgx file contains array design information for Illumina arrays. The file is a gzip-compressed text file. The file also contains reporter annotations.

  • The Probe_Id column should be used as the reporter id.
  • The Array_Address_Id column is the Feature ID. This is the ID that is present in the raw data files. Thus, the array designs must use the FEATURE_ID as the feature identification method.


The bgx file has two sections: [Probes] and [Controls]. Features from both sections are imported to the array design. See ticket:90 for the completed task of creating an array design from a BGX file.
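To make the file layout concrete, here is a minimal Python sketch of reading the two sections. It is not the BASE importer. It assumes that the decompressed file is tab-separated, that each section starts with a bracketed name such as [Probes] on its own line, and that the first row after that line holds the column names. The function names `parse_bgx_sections` and `read_bgx` are made up for this illustration.

```python
import gzip
from collections import defaultdict


def parse_bgx_sections(lines, wanted=("Probes", "Controls")):
    """Group tab-separated rows by the [Section] line they appear under.

    Assumption: the first row after a section line is the column header.
    Sections not listed in `wanted` are ignored.
    """
    sections = defaultdict(list)
    current = None  # name of the section currently being read
    header = None   # column names of that section
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            header = None
            continue
        if current not in wanted:
            continue  # rows before the first section or in other sections
        fields = line.split("\t")
        if header is None:
            header = fields
            continue
        sections[current].append(dict(zip(header, fields)))
    return dict(sections)


def read_bgx(path):
    """Open a gzip-compressed bgx file as text and parse its sections."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_bgx_sections(handle)
```

With this sketch, `read_bgx(path)["Probes"]` would be a list of dictionaries keyed by column name, so the Feature ID of a row is `row["Array_Address_Id"]` and the reporter id is `row["Probe_Id"]`.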


A new error handling option for duplicated features would be useful (below):

When creating an array design for the Illumina Human HT12 bead chip, the following error was received:
"net.sf.basedb.core.BaseException: Item already exists: Another feature with the same identifier already exists: Feature ID=0002100273 on line 48876: ILMN_2038774 0002..."

The reason for the duplicated feature ID is that the same feature (Feature ID=0002100273) is used in both the [Probes] and the [Controls] section.

Investigating the duplicated feature shows that the Probe_Id and Probe_Sequence are identical in both entries. Thus, the duplication is probably just a case where Illumina uses a general gene probe as one of its control probes. In the given example the gene probe is for EEF1A1 (gene symbol). This gene is used as a control probe (as a housekeeping gene) because it is involved in translation and is frequently expressed at high levels.

This kind of duplication, where one copy of the feature is present in the [Probes] section and the other in the [Controls] section, should possibly be handled as a special case. One could import the features from the BGX file by selecting the error handling "Duplicate feature = skip". An additional error handling option would be to skip the feature only if there are exactly 2 duplicates and they are separated in the sense that one is in the [Probes] section and the other is in the [Controls] section.

In this way the user will have control over the specific type of duplication of features that is allowed.
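The proposed rule can be sketched in a few lines of Python. This only illustrates the suggestion and is not existing BASE functionality. The function name `classify_duplicates` and the input shape, a sequence of (section, feature ID) pairs, are assumptions made for the example.

```python
from collections import defaultdict


def classify_duplicates(features):
    """Split duplicated feature IDs into two groups.

    `features` is an iterable of (section, feature_id) pairs.
    Returns (skippable, conflicts):
      skippable: IDs seen exactly twice, once in Probes and once in Controls
      conflicts: every other duplicated ID
    """
    seen = defaultdict(list)
    for section, feature_id in features:
        seen[feature_id].append(section)

    skippable, conflicts = [], []
    for feature_id, sections in seen.items():
        if len(sections) < 2:
            continue  # not duplicated
        if sorted(sections) == ["Controls", "Probes"]:
            skippable.append(feature_id)
        else:
            conflicts.append(feature_id)
    return skippable, conflicts
```

An importer following this rule would silently skip the [Controls] copy of each ID in the first list and report the IDs in the second list as errors.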

When running the import with "Duplicate feature = skip", the following message is received: "Done: 49576 feature(s) inserted; 1 block(s) inserted; 11 spot(s) skipped due to errors". As a user, you naturally want some control over which errors have occurred. With enhanced error handling for duplicated features, the user would not have to receive these errors.

Attachments (1)

HumanHT-12_V3_0_R1_11283641_A.bgx (6.2 MB ) - added by johan 15 years ago.
BGX file for Illumina Human HT12 bead chip

Change History (5)

by johan, 15 years ago

BGX file for Illumina Human HT12 bead chip

comment:1 by johan, 15 years ago

For clarity: the additional error handling option would be to skip the duplicated feature only if there are exactly 2 duplicates and they are separated in the sense that one is in the [Probes] section and the other (the duplicate that is to be skipped) is in the [Controls] section.

comment:2 by Nicklas Nordborg, 15 years ago

Component: net.sf.basedb.examples.extensions → net.sf.basedb.illumina
Resolution: wontfix
Status: new → closed

The duplication control/skipping happens in the core, and it can only be ON (skip duplicates) or OFF (throw an error). It doesn't know anything about sections, and it doesn't keep any count of the number of times it has seen a specific feature.
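As an illustration only (this is not the BASE core code), the ON/OFF behaviour described above could look roughly like the Python sketch below. The names `import_features` and `DuplicateFeatureError` are hypothetical, and the rows are assumed to be (feature ID, data) pairs.

```python
class DuplicateFeatureError(Exception):
    """Raised when a feature ID is seen twice and skipping is OFF."""


def import_features(rows, skip_duplicates):
    """Insert (feature_id, data) pairs into a design keyed by feature ID.

    skip_duplicates=True  -> later duplicates are counted and ignored
    skip_duplicates=False -> the first duplicate raises an error
    The check has no notion of file sections or of how often an ID occurs.
    """
    design = {}
    skipped = 0
    for feature_id, data in rows:
        if feature_id in design:
            if skip_duplicates:
                skipped += 1
                continue
            raise DuplicateFeatureError(
                f"Another feature with the same identifier already exists: "
                f"Feature ID={feature_id}"
            )
        design[feature_id] = data
    return design, skipped
```

Because the only state is the set of IDs already inserted, a rule such as "exactly 2 duplicates, one per section" has nothing to hook into in a check of this shape.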

The BGX feature importer is only a wrapper that on the fly changes the parsing of the file in a way that makes it possible for the generic feature importer in the core plug-in package to read the complete file in one go.

In the end this doesn't affect what is imported.

comment:3 by Jari Häkkinen, 15 years ago

Is it possible to call the core importer twice, first with the [Probes] section and then with the [Controls] section? In this way the two section imports would be independent, and probably no duplicates would be detected. If duplicates occur within a section, the bgx importer would know in which section the duplicate was found.

comment:4 by Nicklas Nordborg, 15 years ago

No, it is not possible. And it doesn't matter. There can only be one feature with a given FEATURE_ID on the array design. Duplicates must be skipped or generate an error. It doesn't matter if there is one in each section or 200. In any case, all of this is going on in the core and the ReporterMapImporter core plug-in. The BGX wrapper can't have parameters or options that are not in the ReporterMapImporter.
