Opened 7 years ago

Closed 7 years ago

#1022 closed enhancement (fixed)

INCA import/statistics wizards changes due to INCA 2.0

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Reggie v4.16
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description

INCA has been updated to version 2.0. Most variables have been renamed, which affects the INCA import and INCA statistics wizards. A quick check reveals that at least the following variables are referenced by name:

  • A030DiaDat
  • A080OpDat
  • A090InvCa_Värde
  • PATID
  • A030Sida_Beskrivning
  • A100ER_Värde
  • A100HER2_Värde
  • A100NHG_Värde
  • A100PR_Värde
  • A000Alder
  • A090HistoTumStl

We need to update the code to be able to work with the new variable names.

Change History (14)

comment:1 by Nicklas Nordborg, 7 years ago

The following table should map the old names to the new names.

INCA 1                  INCA 2
A030DiaDat              a_diag_dat
A080OpDat               op_kir_dat
A090InvCa_Värde         op_pad_invasiv_Värde
PATID                   PAT_ID
A030Sida_Beskrivning    a_pat_sida_Beskrivning
A100ER_Värde            op_pad_er_Värde
A100HER2_Värde          op_pad_her2ish_Värde
A100NHG_Värde           op_pad_nhg_Värde
A100PR_Värde            op_pad_pr_Värde
A000Alder               a_pat_alder
A090HistoTumStl         op_pad_invstl

Some of the A100 variables also have a variant with a_pad_ prefix, but from the example data it seems like the information has been migrated to the op_pad_ variables.
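As an illustration only, the mapping above could be captured in a lookup table like the following. The class and method names are hypothetical and not taken from the Reggie code; only the variable names come from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Reggie. It only encodes the table above.
final class IncaVariableNames
{
  // INCA 1 name -> INCA 2 name. Unicode escapes stand for the Swedish
  // a-umlaut in "Varde" so that the source file stays plain ASCII.
  private static final Map<String, String> OLD_TO_NEW = new LinkedHashMap<>();
  static
  {
    OLD_TO_NEW.put("A030DiaDat", "a_diag_dat");
    OLD_TO_NEW.put("A080OpDat", "op_kir_dat");
    OLD_TO_NEW.put("A090InvCa_V\u00e4rde", "op_pad_invasiv_V\u00e4rde");
    OLD_TO_NEW.put("PATID", "PAT_ID");
    OLD_TO_NEW.put("A030Sida_Beskrivning", "a_pat_sida_Beskrivning");
    OLD_TO_NEW.put("A100ER_V\u00e4rde", "op_pad_er_V\u00e4rde");
    OLD_TO_NEW.put("A100HER2_V\u00e4rde", "op_pad_her2ish_V\u00e4rde");
    OLD_TO_NEW.put("A100NHG_V\u00e4rde", "op_pad_nhg_V\u00e4rde");
    OLD_TO_NEW.put("A100PR_V\u00e4rde", "op_pad_pr_V\u00e4rde");
    OLD_TO_NEW.put("A000Alder", "a_pat_alder");
    OLD_TO_NEW.put("A090HistoTumStl", "op_pad_invstl");
  }

  // Returns the INCA 2 name, or the input unchanged if it is not in the table.
  static String toInca2(String inca1Name)
  {
    return OLD_TO_NEW.getOrDefault(inca1Name, inca1Name);
  }
}
```

Returning unknown names unchanged is a choice made for this sketch; an importer might prefer to report them instead.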

Last edited 7 years ago by Nicklas Nordborg

comment:2 by Nicklas Nordborg, 7 years ago

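The A030DiaDat annotation was defined as: "Earliest date on which the diagnosis was established clinically and/or by morphological examination." (Swedish original: Tidigaste datum då diagnos fastställdes kliniskt och/eller genom morfologisk undersökning.)

The a_diag_dat annotation in the new version is defined as: "As of 2.0.0 the sampling date, previously the diagnosis date. Enter the date of the first puncture/biopsy." (Swedish original: F.o.m 2.0.0 provtagningsdatum tidigare diagnosdatum. Ange datum för första punktion/biopsi.)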

This may affect the meaning of the ReferenceDate and ReferenceDateSource annotations that we use in Reggie for calculating relative dates. ReferenceDateSource is an enum with the following options:

  • IncaDiagnosisDate
  • SamplingDate
  • ConsentDate
  • RegistrationDate

The first option is used whenever we get a date from INCA. Should we rename that option to something else, for example IncaSamplingDate? Or would that be too easily confused with the second option?

comment:3 by Nicklas Nordborg, 7 years ago

Status: new → assigned

comment:4 by Nicklas Nordborg, 7 years ago

(In [4702]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

The import wizard now uses INCA2_ as the expected prefix for INCA annotations.

a_diag_dat is used instead of A030DiaDat for setting the reference date.

a_pat_sida_Beskrivning is used instead of A030Sida_Beskrivning for matching the case with laterality.

The import wizard seems to be working, but has only been tested with a simple file (3 entries).

The statistics wizard crashes with an ArrayIndexOutOfBoundsException, which is not surprising since it still uses the old variable names.

comment:5 by Nicklas Nordborg, 7 years ago

(In [4703]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

Updated the statistics wizard to use the new variables as specified earlier in this ticket. It doesn't crash and seems to produce a result. A more detailed check is required since some of the variables now have a list of options that is different from the old list.

comment:6 by Nicklas Nordborg, 7 years ago

(In [4704]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

The "Cancer type" filter used to have three options:

  • 1 = Only invasive cancer
  • 2 = Only in situ cancer
  • 3 = Both

It now has only two options:

  • 1 = Invasive cancer with or without in situ cancer
  • 2 = Only in situ cancer

The filter has been updated to reflect the changes. The options for the other variables seem to be the same as before (for the options that we care about).
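A minimal sketch of what the two remaining options look like when decoded. The class name, the method name and the "Unknown" fallback are assumptions made for illustration; only the two option values and their meanings come from this ticket.

```java
// Hypothetical helper, not part of Reggie.
final class CancerTypeOptions
{
  // Translates an INCA 2 "cancer type" option value into its description.
  // Anything other than "1" or "2" (including the old option "3") is
  // reported as "Unknown", which is a fallback invented for this sketch.
  static String describe(String value)
  {
    if (value == null) return "Unknown";
    switch (value.trim())
    {
      case "1": return "Invasive cancer with or without in situ cancer";
      case "2": return "Only in situ cancer";
      default: return "Unknown";
    }
  }
}
```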

comment:7 by Nicklas Nordborg, 7 years ago

(In [4707]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

Added support for importing annotations from the "uppföljning" (follow-up) file. Basically, there are two things that are different from the regular file:

  • Laterality is taken from the u_pat_sida_Beskrivning variable.
  • If there are multiple lines with the same personal number and laterality, only the line with the latest value in u_dat is kept.
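The "keep only the latest line" rule can be sketched as follows. This is not the Reggie implementation: the Line class, its fields and the keepLatest method are hypothetical, and the sketch assumes that u_dat has already been parsed into a non-null date.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not taken from the Reggie source code.
final class FollowUpDedup
{
  // One parsed line from the follow-up file (only the fields needed here).
  static final class Line
  {
    final String personalNumber;
    final String laterality; // value of u_pat_sida_Beskrivning
    final LocalDate uDat;    // value of u_dat, assumed to be non-null

    Line(String personalNumber, String laterality, LocalDate uDat)
    {
      this.personalNumber = personalNumber;
      this.laterality = laterality;
      this.uDat = uDat;
    }
  }

  // For each (personal number, laterality) pair, keeps only the line
  // with the latest u_dat. On equal dates the first line seen is kept.
  static List<Line> keepLatest(List<Line> lines)
  {
    Map<String, Line> latest = new LinkedHashMap<>();
    for (Line line : lines)
    {
      String key = line.personalNumber + "|" + line.laterality;
      Line current = latest.get(key);
      if (current == null || line.uDat.isAfter(current.uDat))
      {
        latest.put(key, line);
      }
    }
    return new ArrayList<>(latest.values());
  }
}
```

A LinkedHashMap is used so that the surviving lines come out in the order in which each personal number and laterality pair first appeared in the file.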

comment:8 by Nicklas Nordborg, 7 years ago

(In [4708]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

Sorting header names in output files alphabetically and a few other minor changes.

comment:9 by Nicklas Nordborg, 7 years ago

(In [4718]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

Fixes a NullPointerException in the importer.

comment:10 by Nicklas Nordborg, 7 years ago

(In [4719]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

Ignore lines from the "uppföljning" file without a u_dat value. The filtering happens at an early stage in the parsing process in order to not disturb downstream functionality. In principle, it is as if the line was not in the file to begin with. A possible side effect is that line numbers may not be reported correctly.
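A rough sketch of such an early filter, assuming each parsed line is available as a map from column name to value. That representation, the class name and the method name are assumptions for illustration; Reggie's parser may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not taken from the Reggie source code.
final class MissingDateFilter
{
  // Returns only the rows that have a non-blank u_dat value. Rows without
  // one are dropped before any further processing, so later steps never
  // see them.
  static List<Map<String, String>> dropRowsWithoutDate(List<Map<String, String>> rows)
  {
    List<Map<String, String>> kept = new ArrayList<>();
    for (Map<String, String> row : rows)
    {
      String uDat = row.get("u_dat");
      if (uDat != null && !uDat.trim().isEmpty())
      {
        kept.add(row);
      }
    }
    return kept;
  }
}
```

The positions in the returned list no longer match the line numbers in the file, which is the side effect mentioned above.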

comment:11 by Nicklas Nordborg, 7 years ago

(In [4720]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

Replaced some List variables with Set since the only interesting method used was contains(), which should be faster for sets.
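For illustration, the difference looks like this. The class and the example names in the collection are hypothetical; the ticket does not say which variables were changed. ArrayList.contains() scans the elements one by one, while HashSet.contains() is a hash lookup with expected constant time.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not taken from the Reggie source code.
final class ContainsLookup
{
  // Before: membership test is a linear scan over the list.
  static final List<String> NAMES_AS_LIST =
    Arrays.asList("a_diag_dat", "op_kir_dat", "PAT_ID");

  // After: same elements in a HashSet, so contains() is a hash lookup.
  static final Set<String> NAMES_AS_SET = new HashSet<>(NAMES_AS_LIST);

  static boolean isKnown(String name)
  {
    return NAMES_AS_SET.contains(name);
  }
}
```

The switch is only safe when neither element order nor duplicates matter, which is the case when contains() is the only method in use.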

comment:12 by Nicklas Nordborg, 7 years ago

(In [4746]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

The code for filtering the follow-up file on date didn't work as expected due to interference with the filter that removed all but two data lines for the same patient. The filtering code has been reorganized and the ">2 lines" filter has been replaced with a "missing laterality" filter.

There was also a problem with the output CSV file which included duplicate entries of all imported lines (but without the PAT* value).

comment:13 by Nicklas Nordborg, 7 years ago

(In [4747]) References #1022: INCA import/statistics wizards changes due to INCA 2.0

We should not care about the follow-up date on lines where we have no personal number. They should simply be included (in the output CSV file) as is.

comment:14 by Nicklas Nordborg, 7 years ago

Resolution: fixed
Status: assigned → closed