Opened 10 years ago

Closed 8 years ago

#525 closed task (fixed)

Import data from INCA

Reported by: Nicklas Nordborg
Owned by: olle
Priority: major
Milestone: Reggie v4.4
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description

We get data from INCA at regular intervals in the form of a tab-separated file. Some of that data should be imported and attached to various items as annotations. Most of them should be related to patient, case, specimen or blood.

Specific details about what data should be attached to what item need to be defined.

Change History (71)

comment:1 by Nicklas Nordborg, 8 years ago

Milestone: changed from Reggie v3.x to Reggie v4.x

Milestone renamed

comment:2 by olle, 8 years ago

Owner: changed from Nicklas Nordborg to olle

Ticket reassigned to "olle".

comment:3 by olle, 8 years ago

Status: changed from new to assigned

Ticket accepted.

comment:4 by olle, 8 years ago

Traceability note:

  • Creation of a csv file to be used when requesting information from the INCA database was introduced in Ticket #487 (Export information intended for INCA).

comment:5 by olle, 8 years ago

Background info:

  • INCA is an abbreviation of the Swedish expression "Informationsnätverk för cancervården", loosely translated as "Information Network for Cancer Care".
  • INCA database extracts for the SCAN-B project are normally obtained from "RCC syd" (an abbreviation of "Regionalt Cancercentrum syd", "Regional Cancer Center - south"), which manages information on cancer patients in southern Sweden.

comment:6 by olle, 8 years ago

Background info on INCA export files:

  • Traditionally, INCA export data has been retrieved in two files in spreadsheet format, that were then saved in tab-separated format. Inspection of two such example files, here called "INCA_file_a" and "INCA_file_b", revealed the following information:
  1. The first line in each file contained column header names, and was followed by lines with values.
  2. All column header names were unique in each file.
  3. 5 column header names were identical in both files.
  4. Column header names and column values might contain Swedish national characters 'å', 'ä', and 'ö'.
  5. Most column header names consist of Swedish words (or abbreviations of such).
  6. Some text column values might include tab characters (file "INCA_file_a" contained 337 such lines). These tab characters must be removed before the spreadsheet file is stored in tab-separated format; otherwise some lines will appear to contain too many columns, and there is no simple way to determine which columns should be joined. Replacing the internal tab characters with empty strings is recommended, unless the tab character separates two strings, in which case a space character should be used.
  7. Some text column values might include line feed characters (<LF>). Unlike lines with internal tab characters, such lines can be repaired by combining lines that are too short until a line of the correct size results, but it is still recommended that the internal line feed characters are removed before the spreadsheet file is stored in tab-separated format.
  8. The files together contained 75 (45 + 30) column pairs, where the column header names in each pair consisted of a common unique identifier plus suffix "_Beskrivning" and "_Värde", respectively, corresponding to "_Description" and "_Value" in English. The value column contained an integer value, that was presumably stored in the database, while the description column contained a Swedish description of the property encoded by that particular value. The columns in a pair did not always come in the same order. File "INCA_file_b" contained two "_Description" columns, without a corresponding "_Value" column, where the contents in both cases consisted of either "Höger", "Vänster", or no value, corresponding to English "Right", "Left", or no value.
  9. Each line contained data for a single patient, but data for one patient might appear in more than one line.
  10. The files contained data for all patients in the requested time interval, but only personal numbers for patients, for which data had been requested in the csv file sent to INCA. However, all data lines had a temporary unique patient ID, which did not correspond to a value in the INCA database, but was added to the export file in order to identify entries related to the same patient.
  11. The sizes of the two INCA import example files in tab-separated format were 7.33 MB and 2.31 MB, respectively, indicating that there should be no problem holding the import data in memory on the server.
INCA example file | # Column headers | # Value lines | # Value lines for requested patients
INCA_file_a | 145 | 9425 | 6522
INCA_file_b | 146 | 9425 | 6522
Both files (all columns) | 291 | 9425 | 6522
Both files (unique columns) | 286 | 9425 | 6522
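
Item 7 above notes that lines broken by internal line feeds can be recombined until a line of the correct size results. A minimal Java sketch of that idea follows. The class and method names are hypothetical and not taken from Reggie, and the fragments are glued together without a separator, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Reggie): repairs data lines split by internal line feeds. */
class IncaLineJoiner {

    /** Counts the tab characters in a string. */
    static int countTabs(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\t') n++;
        }
        return n;
    }

    /**
     * Joins consecutive fragments until each joined line has exactly
     * expectedColumns columns (taken from the header line).
     * Throws if a joined line ends up with too many columns.
     */
    static List<String> joinShortLines(List<String> rawLines, int expectedColumns) {
        int expectedTabs = expectedColumns - 1;
        List<String> result = new ArrayList<>();
        StringBuilder pending = null;
        for (String line : rawLines) {
            if (pending == null) {
                pending = new StringBuilder(line);
            } else {
                // The line feed sat inside a cell, so the fragments are
                // glued together without adding a column separator.
                pending.append(line);
            }
            int tabs = countTabs(pending.toString());
            if (tabs == expectedTabs) {
                result.add(pending.toString());
                pending = null;
            } else if (tabs > expectedTabs) {
                throw new IllegalArgumentException(
                    "Too many columns in data line " + (result.size() + 1));
            }
        }
        if (pending != null) {
            throw new IllegalArgumentException("Last line has too few columns");
        }
        return result;
    }
}
```

A line feed inside the last column of a line cannot be detected this way, since the first fragment already has the full number of columns. This is one more reason to remove the line feeds in the spreadsheet program, as recommended above.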

The first version of INCA data import should only import data for patients, for which data had been requested in the csv file sent to INCA, i.e., those with personal numbers in the INCA export files.

A description of the variables in the INCA database from 2014-01-01 was available. This together with inspection of data in the two example export files gave the following result:

INCA example file | # Column headers | # Date columns | # String columns | # Integer columns | # Boolean columns | # Float columns
INCA_file_a | 145 | 5 | 59 | 65 | 16 | 0
INCA_file_b | 146 | 23 | 37 | 32 | 50 | 4
Both files (all columns) | 291 | 28 | 96 | 97 | 66 | 4
Both files (unique columns) | 286 | 28 | 94 | 94 | 66 | 4

Note regarding Boolean columns: If the INCA variable description described a variable as being of type "Kryssruta" (check box), or described as being set to value "sant" (true) at a specific event, the variable is regarded as Boolean. However, if the type is described as a list of values 0 and 1, corresponding to "Nej" (no), and "Ja" (yes), respectively, the type is regarded as Integer.

Types of the columns represented in both example files:

Column headers in both example files | Value type | Comment
PATID | Integer | Temporary patient id
PersonalNo | String | Personal number (for requested patients only)
A030Sida_Beskrivning | String | Laterality "Höger" (Right), "Vänster" (Left)
A030Sida_Värde | Integer | Laterality 1 = Right, 2 = Left
KON_VALUE | Integer | "Kön" (Sex) 1 = Male, 2 = Female
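
The "_Beskrivning"/"_Värde" header pairing described in the list above can be detected mechanically. The following Java sketch is an illustration only: the class and method names are hypothetical, and it simply strips the two suffixes and compares the remaining identifiers.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of Reggie): finds description/value header pairs. */
class IncaHeaderPairs {
    static final String DESCRIPTION_SUFFIX = "_Beskrivning";
    static final String VALUE_SUFFIX = "_V\u00e4rde"; // "_Värde"

    /** Identifiers that have both a description column and a value column. */
    static Set<String> completePairs(List<String> headers) {
        Set<String> ids = identifiers(headers, DESCRIPTION_SUFFIX);
        ids.retainAll(identifiers(headers, VALUE_SUFFIX));
        return ids;
    }

    /** Identifiers that have a description column but no value column. */
    static Set<String> descriptionOnly(List<String> headers) {
        Set<String> ids = identifiers(headers, DESCRIPTION_SUFFIX);
        ids.removeAll(identifiers(headers, VALUE_SUFFIX));
        return ids;
    }

    /** Strips the given suffix from every header that ends with it. */
    private static Set<String> identifiers(List<String> headers, String suffix) {
        Set<String> out = new TreeSet<>();
        for (String header : headers) {
            if (header.endsWith(suffix)) {
                out.add(header.substring(0, header.length() - suffix.length()));
            }
        }
        return out;
    }
}
```

Because the columns in a pair do not always come in the same order, the sketch compares sets of identifiers instead of relying on column positions.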

Inspection of the example files indicated the following:

Variable type | Value in variable description | Value in example files
Boolean variables related to check boxes | checked = true | 1 for checked = true, null (blank) for unchecked = false
Boolean variables not related to check boxes | 1 for true, 0 for false | 1 for true, 0 for false
Date | "YYYYMMDD" format | "YYYY-MM-DD" format
Float | Decimal comma, 2 decimals | Decimal point, 2 decimals

Note: Sweden, like most countries in central Europe, historically used a decimal comma, but as computers came into more regular use, the technical and natural sciences switched to a decimal point in the 1970s.

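
The value formats in the table above could be parsed along the following lines. This is a hedged Java sketch with hypothetical class and method names. Accepting both the format in the variable description and the format seen in the example files is an assumption, not a documented Reggie behaviour.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper (not part of Reggie): converts INCA cell strings to typed values. */
class IncaValueParser {

    /** Check-box Boolean: "1" means checked (true), blank means unchecked (false). */
    static boolean parseCheckBox(String cell) {
        return cell != null && cell.trim().equals("1");
    }

    /** Other Boolean: "1" = true, "0" = false, blank = no value (null). */
    static Boolean parseBoolean(String cell) {
        String s = cell == null ? "" : cell.trim();
        if (s.isEmpty()) return null;
        if (s.equals("1")) return Boolean.TRUE;
        if (s.equals("0")) return Boolean.FALSE;
        throw new IllegalArgumentException("Not a Boolean value: " + s);
    }

    /** Date: accepts "YYYY-MM-DD" (example files) and "YYYYMMDD" (variable description). */
    static LocalDate parseDate(String cell) {
        String s = cell == null ? "" : cell.trim();
        if (s.isEmpty()) return null;
        if (s.length() == 8) {
            return LocalDate.parse(s, DateTimeFormatter.BASIC_ISO_DATE);
        }
        return LocalDate.parse(s); // ISO-8601, e.g. 2014-01-01
    }

    /** Float: accepts both a decimal point and a decimal comma. */
    static Float parseFloat(String cell) {
        String s = cell == null ? "" : cell.trim();
        if (s.isEmpty()) return null;
        return Float.valueOf(s.replace(',', '.'));
    }
}
```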

comment:7 by olle, 8 years ago

Possible discrepancies between SCAN-B and INCA data:

Possible causes for a SCAN-B patient entry not appearing in an INCA export file:

  1. The patient was operated on at a site not belonging to "RCC syd" ("Regionalt Cancercentrum syd", "Regional Cancer Center - south"), which manages information on cancer patients in southern Sweden. Two of nine SCAN-B sites, Uppsala and Jönköping, belong to this category at the time of writing.
  2. SCAN-B site Halmstad belongs to "RCC syd", but some of the patients operated there do not themselves belong to the region covered by INCA, and their records are therefore not sent there.
  3. There may be a delay of several months, before a patient record is sent to be included in INCA, while SCAN-B gets the referral form with specimen a few days after the operation.

Possible causes for an INCA patient entry not appearing in the SCAN-B database:

  1. The patient may have retracted the consent to participate in the SCAN-B study in the time period between records having been requested from INCA and having been received.

comment:8 by olle, 8 years ago

Recommended procedure for creating a tab-separated *.csv file suitable for INCA import into BASE from an *.xlsx INCA export file in spreadsheet format. The instructions are written for Apache OpenOffice Calc 3.4.1, but should be regarded as guidelines for use of other programs:

  1. Open INCA export file *.xlsx in OpenOffice.org Calc.
  2. Replace all tab characters by empty strings:
    a. Menu "Edit" -> "Select All".
    b. Menu "Edit" -> "Find & Replace...".
    c. Click button "More Options", select "Regular expressions" in opened sub-window, in order to allow special characters to be represented with escape character "\".
    d. Search for \t.
    e. Replace with "" (blank field).
    f. Click button "Replace All".
    g. Close "Find & Replace" dialog.
  3. Replace all newline characters by empty strings:
    a. Menu "Edit" -> "Select All".
    b. Menu "Edit" -> "Find & Replace...".
    c. Click button "More Options", select "Regular expressions" in opened sub-window, in order to allow special characters to be represented with escape character "\".
    d. Search for \n.
    e. Replace with "" (blank field).
    f. Click button "Replace All".
    g. Close "Find & Replace" dialog.
  4. Save edited file as *.csv file in tab-delimited format:
    a. Menu "File" -> "Save As...".
    b. In "Save As" dialog, select directory to save created file in.
    c. For "Save as type:" select "Text CSV (.csv) (*.csv)".
    d. For "File name:" change file extension to ".csv", if not already done by "Automatic file name extension".
    e. Click button "Save".
    f. In extra dialog, select "Keep Current Format" (not "Save in ODF Format").
    g. In "Export Text File" dialog, for "Character set" select "Unicode (UTF-8)".
    h. In "Export Text File" dialog, for "Field delimiter" select "{Tab}".
    i. In "Export Text File" dialog, for "Text delimiter" select "" (blank field).
    j. In "Export Text File" dialog, select check box option "Save cell content as shown" (all other check box options unselected).
    k. In "Export Text File" dialog, click button "OK".
  5. Close OpenOffice.org Calc window.
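
For comparison, the cell clean-up performed manually in steps 2 and 3 could also be done in code, given cell strings already read from the spreadsheet by some library. The Java sketch below is an illustration with hypothetical names. It follows the earlier recommendation to use a space where a tab separates two strings, and applying the same rule to line feeds is an assumption (the manual procedure above always replaces with an empty string).

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Reggie): removes tabs and line breaks from a cell value. */
class IncaCellCleaner {
    // A run of tab/CR/LF characters with non-whitespace text on both sides.
    private static final Pattern BETWEEN_STRINGS =
        Pattern.compile("(?<=\\S)[\\t\\r\\n]+(?=\\S)");
    // Any remaining run of tab/CR/LF characters.
    private static final Pattern ANY_BREAK = Pattern.compile("[\\t\\r\\n]+");

    static String clean(String cell) {
        if (cell == null) return null;
        // A break that separates two strings becomes a single space ...
        String spaced = BETWEEN_STRINGS.matcher(cell).replaceAll(" ");
        // ... and every other break is replaced by an empty string.
        return ANY_BREAK.matcher(spaced).replaceAll("");
    }
}
```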

comment:9 by olle, 8 years ago

Design discussion:

Apart from technical considerations, there are some special issues regarding INCA import that affect the software design:

  1. The INCA export files contain sensitive data, that can be traced to a specific patient via the personal number, so it is preferable not to require the files to be uploaded to BASE before import.
  2. The INCA import is special in that it does not affect existing item properties or annotations, except the new dedicated INCA annotations. It was therefore decided to let it be implemented by a specific Java servlet, IncaServlet, instead of the existing ImportServlet.
  3. Full import of the data in the two example export files will require ~285 new annotations in BASE (286 unique columns minus one or two used for mapping the data to existing BASE items). Even though they have to be added to the BASE database, it is preferable to keep the Reggie source code independent of the details of the annotations. This can be done if the INCA annotations are given names, that can be mapped to the column header names.
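
Item 3 can be illustrated with a small Java sketch in which the annotation type name is derived from the column header, so that no individual annotation needs to be named in the source code. The class and method names are hypothetical. The "INCA_" prefix and the three columns excluded from import are taken from the design discussion in this ticket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not part of Reggie): maps INCA column headers to annotation type names. */
class IncaAnnotationNames {
    static final String PREFIX = "INCA_";

    // Columns used only for identifying or mapping entries, not imported as annotations.
    static final Set<String> MAPPING_ONLY =
        Set.of("PATID", "PersonalNo", "A030Sida_Beskrivning");

    /** Name of the annotation type for a header, or empty if the column is not imported. */
    static Optional<String> annotationTypeName(String header) {
        if (MAPPING_ONLY.contains(header)) {
            return Optional.empty();
        }
        return Optional.of(PREFIX + header);
    }

    /** Headers that should be imported but have no matching annotation type. */
    static List<String> headersWithoutAnnotationType(
            List<String> headers, Set<String> existingTypeNames) {
        List<String> missing = new ArrayList<>();
        for (String header : headers) {
            Optional<String> name = annotationTypeName(header);
            if (name.isPresent() && !existingTypeNames.contains(name.get())) {
                missing.add(header);
            }
        }
        return missing;
    }
}
```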

comment:10 by olle, 8 years ago

Design discussion:

  • It was decided to perform the INCA data import in a single session, using the complete set of INCA data files as input, since this allows a check to be made, whether some INCA data annotation types are missing in a specific import session.
  • The INCA import wizard should perform an initial check of each INCA data file, after which the results are presented to the user. It should be possible to initially select a simple file check, that skips an intricate database consistency check (part "C." in the table below), and therefore can be performed much faster.
  • Properties of the INCA file checks:

    a. If critical problems are encountered, import should be blocked.
    b. If problems with individual headers/data lines are encountered, the corresponding data columns/lines might be skipped during import; it is then the user's decision whether to fix the problems in the data file, or proceed with import of the eligible data.
    c. Basic results from the file check should be presented in the web form. In addition, it should be possible to open/download a text file with more detailed information from the file check by clicking on a button. The file should include the information presented in the web form, and optionally also information on which headers or data lines problems were found in.
    d. In the report file, due to the sensitive type of information in the INCA data file, temporary patient ID values should be used instead of personal numbers to identify entries in the INCA file.
  • The INCA data file check should include four parts:
Check (Information) | Comment
A. Basic check
Number of header columns | At least 3 key headers required (checked later).
Number of lines of data |
Number of lines with internal line feeds | Wizard should remove the internal line feeds before import
Number of lines with too many columns | None accepted
Number of lines with too few columns | None accepted
B. Internal data check
Number of duplicate header columns | None accepted
Temporary patient ID column index | Column required
Personal number column index | Column required
Laterality description column index | Column required
Number of unknown headers | Columns skipped at import
Number of data lines with personal no. | Required for import
Number of personal no. with more than 2 lines | Data lines skipped at import
Number of personal no.s with many identical lateralities | Data lines skipped at import
C. Database consistency check (Only lines with personal no. processed)
Number of data lines with personal no. not in database | Data lines skipped at import
Number of patient lateralities without database reference | Data lines skipped at import
D. Database consistency check II (All files together)
Number of missing INCA headers | INCA headers skipped at import
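
Parts of checks A and B in the table above are sketched below in Java. This is a simplified illustration with hypothetical names, not the IncaServlet code. Treating every "None accepted" and "Column required" entry as blocking the import is an assumption about what counts as a critical problem.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of Reggie): basic checks on one tab-separated INCA file. */
class IncaBasicCheck {
    static final List<String> KEY_HEADERS =
        List.of("PATID", "PersonalNo", "A030Sida_Beskrivning");

    int headerColumns;
    int dataLines;
    int linesWithTooManyColumns;
    int linesWithTooFewColumns;
    final Set<String> duplicateHeaders = new TreeSet<>();
    final List<String> missingKeyHeaders = new ArrayList<>();

    /** True if any problem marked "None accepted" or "Column required" was found. */
    boolean importBlocked() {
        return linesWithTooManyColumns > 0 || linesWithTooFewColumns > 0
            || !duplicateHeaders.isEmpty() || !missingKeyHeaders.isEmpty();
    }

    /** Checks a file given as a list of lines, where the first line holds the headers. */
    static IncaBasicCheck check(List<String> lines) {
        IncaBasicCheck result = new IncaBasicCheck();
        if (lines.isEmpty()) {
            result.missingKeyHeaders.addAll(KEY_HEADERS);
            return result;
        }
        // Limit -1 keeps trailing empty cells, so blank last columns are counted.
        String[] headers = lines.get(0).split("\t", -1);
        result.headerColumns = headers.length;
        Set<String> seen = new HashSet<>();
        for (String header : headers) {
            if (!seen.add(header)) result.duplicateHeaders.add(header);
        }
        for (String key : KEY_HEADERS) {
            if (!seen.contains(key)) result.missingKeyHeaders.add(key);
        }
        for (String line : lines.subList(1, lines.size())) {
            result.dataLines++;
            int columns = line.split("\t", -1).length;
            if (columns > result.headerColumns) result.linesWithTooManyColumns++;
            if (columns < result.headerColumns) result.linesWithTooFewColumns++;
        }
        return result;
    }
}
```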
  • INCA import annotation types:

    a. All annotation types are coupled to Case items.
    b. Data in all columns in the two INCA example files except the temporary patient ID and the two mapping columns "PersonalNo" and "A030Sida_Beskrivning" should be imported to annotations. The personal number in the INCA data is used together with the laterality "A030Sida_Beskrivning" value for mapping an INCA entry to a Case entry in BASE, so these two columns do not need to be stored as annotations.
    c. The name of the annotation type corresponding to a data column should equal prefix "INCA_" plus the name of the header for the column.
    d. The value type of an annotation type should be one of Type.DATE, Type.STRING, Type.INT, Type.BOOLEAN, or Type.FLOAT, corresponding to the type of the corresponding column in the INCA data file, according to the description of the variables in the INCA database from 2014-01-01.
    e. INCA example file two contained two headers, "BN20_Sida_Beskrivning" and "BP20_Sida_Beskrivning", without the corresponding "value" headers, "BN20_Sida_Värde" and "BP20_Sida_Värde", respectively. In order to be able to check if some INCA data annotation types are missing in an import session, it was decided not to define annotation types for the latter two "_Värde" columns.
    f. Columns corresponding to list values in the INCA variable description, should be mapped to annotation types with value options set to the available values. However, except for the "A030Sida_Beskrivning" column used for laterality mapping, value options should only be set for integer "_Värde" columns, since the strings corresponding to these values in the INCA data files are not guaranteed to exactly match the descriptions strings in the INCA variable description.
    g. All INCA annotation types should belong to a new "INCA" annotation type category.
    h. Two extra annotation types, not coupled to columns in the INCA data file, should be added, one for the date that the INCA data was exported from the database, and one for the last date an INCA import was made for a Case item. These two annotation types should not have prefix "INCA_", and should not belong to the new "INCA" annotation type category, since they do not correspond to INCA import file headers, and should be excluded, when checking if some INCA data annotation types are missing in an import session. However, they should belong to the "Case" annotation type category.
    i. At import, a data line in the INCA data file should be mapped to the Case item corresponding to a patient with the same personal number as in the line, and where the Case item has a laterality matching the laterality description in the line.
    j. An INCA annotation should only be updated if the value from the INCA data file at import differs from the current annotation value. If the new value is null, corresponding to an empty cell in the INCA export spreadsheet file, the corresponding annotation should be removed, if existing.
    k. Annotations for the two extra annotation types, for the date that the INCA data was exported from the database and the last date an INCA import was made for a Case item, should be updated even if no INCA annotation for the Case item has been updated, to indicate that the INCA annotation values for the Case item equal those of the latest INCA data file.
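
The update rule in item j can be summarised as a small decision function. The Java sketch below uses hypothetical names and plain Java objects in place of BASE annotation values. It only decides what should happen and leaves the actual database update to the caller.

```java
import java.util.Objects;

/** Hypothetical helper (not part of Reggie): decides how an INCA annotation should change. */
class IncaAnnotationUpdate {
    enum Action { KEEP, SET, REMOVE }

    /**
     * currentValue is the existing annotation value (null if none exists);
     * newValue is the value from the INCA file (null for an empty cell).
     */
    static Action decide(Object currentValue, Object newValue) {
        if (newValue == null) {
            // Empty cell: remove the annotation, but only if it exists.
            return currentValue == null ? Action.KEEP : Action.REMOVE;
        }
        // Non-empty cell: update only when the value actually differs.
        return Objects.equals(currentValue, newValue) ? Action.KEEP : Action.SET;
    }
}
```

The two date annotations in item k fall outside this rule, since they are updated at every import.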

comment:11 by olle, 8 years ago

Functional specification update:

  • First version of the INCA data import will be based on the following functionality:
    a. A new "INCA import" entry will be added to section "Personal information wizards", sub-section "Export/import information to/from external registers", and will require a PatientCurator role to be used.
    b. Step 1 of the INCA import wizard will have two input fields, one for the INCA export date, and one for selecting the files containing the INCA data in tab-delimited format. Two buttons should exist; one for performing a (fast) simple file check, and a "Next" button for a more complete check.
    c. Step 2 should present the results after a performed check. Test results specific for a simple file, should be presented for each selected file. It should be possible to download a file with more detailed check results. In order to perform an import, the complete check must be performed. If the complete check does not find any fatal errors, an "Import" button should appear.
    d. After import has been performed, a summary report line should be shown.

Design update:

  1. JSP file index.jsp in resources/ updated with new "INCA import" entry in section "Personal information wizards", sub-section "Export/import information to/from external registers". The INCA import entry is linked to new JSP file import-inca.jsp in resources/personal/, and requires a PatientCurator role to be used.
  2. New JSP file import-inca.jsp in resources/personal/ added. It is linked to new javascript file import-inca.js in resources/personal/.
  3. New javascript file import-inca.js in resources/personal/ added.
    a. Functions for performing file checks or importing data append the selected files to a FormData object, which is sent to command "ImportInca" in java servlet IncaServlet in a POST request using Reggie wizard function Wizard.asyncJsonRequest(url, callback, method, postdata), with new function initializeStep2(response) as callback function.
    b. Function initializeStep2(response) uses the JSON data in the response to dynamically build a table containing the results reported by the servlet. It displays a button linked to function downloadReportFile() for downloading a file with more detailed results. A path to a temporary file with more detailed results is read from the response, and stored in a hidden input field.
    c. Function downloadReportFile() calls command window.open(url) for a URL to command "DownloadIncaImportReportFile" in servlet IncaServlet. A path to a temporary file is added to the URL as parameter with name "tmpFilePath".
  4. Javascript file reggie-2.js in resources updated in function Wizard.asyncJsonRequest(url, callback, method, postdata) by not adding a request header for postdata, if the latter is an instance of FormData. This was needed to work with Firefox web browser, that only adds request header with needed "boundary" info, if the request header is not set explicitly.
  5. Data access object class/file Annotationtype.java in src/net/sf/basedb/reggie/dao/ updated:
    a. New data sample annotation types INCA_EXPORT_DATE and INCA_IMPORT_DATE defined.
  6. Java servlet class/file InstallServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doGet(HttpServletRequest req, HttpServletResponse resp) to include the new sample annotation types INCA_EXPORT_DATE and INCA_IMPORT_DATE, and add them to Subtype.CASE annotation type category.
  7. New java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ added.
    a. Protected method void doGet(HttpServletRequest req, HttpServletResponse resp) supports command "DownloadIncaImportReportFile", which retrieves a path to a temporary file from the request parameter "tmpFilePath", after which it sends the file contents to a PrintWriter object, for download by the user.
    b. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) supports command "ImportInca", that performs a check on retrieved files, and optionally imports the data to the database.
  8. XML configuration file servlets.xml in META-INF updated by adding new java servlet class IncaServlet to the servlet list.
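
The core of the "DownloadIncaImportReportFile" command in item 7a, copying the contents of a temporary report file to a writer, might look roughly like the Java sketch below. It deliberately leaves out the servlet API, so the writer stands in for the response's PrintWriter. The class and method names and the UTF-8 encoding are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Reggie): sends a temporary report file to a writer. */
class IncaReportDownload {

    /** Copies the text contents of tmpFile to out and flushes the writer. */
    static void copyReport(Path tmpFile, Writer out) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(tmpFile, StandardCharsets.UTF_8)) {
            in.transferTo(out); // Reader.transferTo requires Java 10 or later
        }
        out.flush();
    }
}
```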

comment:12 by olle, 8 years ago

(In [3786]) Refs #525. First version of the INCA data import. It is based on the following functionality:
a. A new "INCA import" entry will be added to section "Personal information wizards", sub-section "Export/import information to/from external registers", and will require a PatientCurator role to be used.
b. Step 1 of the INCA import wizard will have two input fields, one for the INCA export date, and one for selecting the files containing the INCA data in tab-delimited format. Two buttons should exist; one for performing a (fast) simple file check, and a "Next" button for a more complete check.
c. Step 2 should present the results after a performed check. Test results specific for a simple file, should be presented for each selected file. It should be possible to download a file with more detailed check results. In order to perform an import, the complete check must be performed. If the complete check does not find any fatal errors, an "Import" button should appear.
d. After import has been performed, a summary report line should be shown.

  1. JSP file index.jsp in resources/ updated with new "INCA import" entry in section "Personal information wizards", sub-section "Export/import information to/from external registers". The INCA import entry is linked to new JSP file import-inca.jsp in resources/personal/, and requires a PatientCurator role to be used.
  2. New JSP file import-inca.jsp in resources/personal/ added. It is linked to new javascript file import-inca.js in resources/personal/.
  3. New javascript file import-inca.js in resources/personal/ added.
    a. Functions for performing file checks or importing data append the selected files to a FormData object, which is sent to command "ImportInca" in java servlet IncaServlet in a POST request using Reggie wizard function Wizard.asyncJsonRequest(url, callback, method, postdata), with new function initializeStep2(response) as callback function.
    b. Function initializeStep2(response) uses the JSON data in the response to dynamically build a table containing the results reported by the servlet. It displays a button linked to function downloadReportFile() for downloading a file with more detailed results. A path to a temporary file with more detailed results is read from the response, and stored in a hidden input field.
    c. Function downloadReportFile() calls command window.open(url) for a URL to command "DownloadIncaImportReportFile" in servlet IncaServlet. A path to a temporary file is added to the URL as parameter with name "tmpFilePath".
  4. Javascript file reggie-2.js in resources updated in function Wizard.asyncJsonRequest(url, callback, method, postdata) by not adding a request header for postdata, if the latter is an instance of FormData. This was needed to work with Firefox web browser, that only adds request header with needed "boundary" info, if the request header is not set explicitly.
  5. Data access object class/file Annotationtype.java in src/net/sf/basedb/reggie/dao/ updated:
    a. New data sample annotation types INCA_EXPORT_DATE and INCA_IMPORT_DATE defined.
  6. Java servlet class/file InstallServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doGet(HttpServletRequest req, HttpServletResponse resp) to include the new sample annotation types INCA_EXPORT_DATE and INCA_IMPORT_DATE, and add them to Subtype.CASE annotation type category.
  7. New java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ added.
    a. Protected method void doGet(HttpServletRequest req, HttpServletResponse resp) supports command "DownloadIncaImportReportFile", which retrieves a path to a temporary file from the request parameter "tmpFilePath", after which it sends the file contents to a PrintWriter object, for download by the user.
    b. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) supports command "ImportInca", that performs a check on retrieved files, and optionally imports the data to the database.
  8. XML configuration file servlets.xml in META-INF updated by adding new java servlet class IncaServlet to the servlet list.

comment:13 by olle, 8 years ago

(In [3801]) Refs #525. Java servlet class IncaServlet refactored to make code more readable. Redundant statements, unnecessary checks, and code residues from test output during development have been removed. Some comments have been added:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca", by removal of redundant statements, unnecessary checks, and code residues from test output during development. Some comments have been added.

comment:14 by olle, 8 years ago

Note on INCA annotations:

  • INCA file columns "A030PrepNr" and "A090VPrepNr" contain PAD values, which are regarded as sensitive data, similar to name and personal number. The corresponding annotation types "INCA_A030PrepNr" and "INCA_A090VPrepNr" should therefore require PatientCurator role to be inspected.
  • Test import of anonymized data on a local system for 75% of the lines intended for import in the two available INCA files, revealed a number of discrepancies between the INCA variable description from 2014-01-01 and the INCA files (numbers quoted in the comment column are for the original INCA files, including items not intended for import):
Column | Odd value in file | Corresponds to | Comment
A040MKlass_Värde | 20 | "MX Fjärrmetastaser kan ej bedömas" | Value '20' is deprecated according to the variable description, but 892 items found in indata files.
A090PadTyp_Värde | 97 | "PAD ej utförd" | According to the variable description, this should be represented by '3', not '97', but 67 items with '97' found, none with '3'.
A080OrsKompLgl2_Värde | 10 | "Axillutrymning efter SN pga tumördata (t ex positiv SN)" | Not included in variable description, but 157 items found.
A030PatKod | D5 | | Only value that is not an integer (9311 integer values). Variable description does not specify that the value should be an integer, so the annotation should be of value type Type.STRING.
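
Discrepancies of the kind listed in the table can be found by comparing the values in a column with the values allowed by the variable description. The Java sketch below shows one way to do this. The class and method names are hypothetical, and the allowed values must be supplied by the caller.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper (not part of Reggie): audits the values found in one INCA column. */
class IncaValueAudit {

    /** Counts non-blank values that are not among the allowed values. */
    static Map<String, Integer> unexpectedValues(List<String> values, Set<String> allowed) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            String s = value.trim();
            if (s.isEmpty() || allowed.contains(s)) continue;
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    /** True if every non-blank value can be parsed as an integer. */
    static boolean allIntegers(List<String> values) {
        for (String value : values) {
            String s = value.trim();
            if (s.isEmpty()) continue;
            try {
                Integer.parseInt(s);
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }
}
```

A column for which allIntegers returns false, as A030PatKod does because of the value "D5", is a candidate for value type Type.STRING.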

Update 2016-04-04:

  • A new INCA variable description from 2016-03-04 is now available. The only difference related to the comments above, is that value 10 for column A080OrsKompLgl2_Värde now is included for "Axillutrymning efter SN pga tumördata (t ex positiv SN)".

comment:15 by olle, 8 years ago

(In [3816]) Refs #525. Java servlet class IncaServlet refactored to make code more readable. Redundant statements, unnecessary checks, and code residues from test output during development have been removed. Some comments have been added:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. ItemQuery for INCA annotation types updated to include types shared to current project and to logged-in user. The latter is needed in order to include INCA annotation types with values representing PAD-numbers, and which therefore are shared to the PatientCurator group.
    b. Exception when trying to open a file is now re-thrown after a log message has been written.

comment:16 by olle, 8 years ago

Tests of initial version of INCA import:

Test setup:

  • Tests were performed on a local system with an anonymized subset of the SCAN-B data. Two test files were prepared, having the same format as the two example INCA files. The two test files contained personal numbers (faked) and lateralities from the anonymized local database, while the other columns contained data from the two example files, with the exception of the two columns representing PAD numbers, which were filled with faked values. Each test file contained 4703 lines with valid input for INCA import on the local system.

Things learned from initial tests:

  1. The first test runs failed due to "java.lang.OutOfMemoryError: GC overhead limit exceeded" ("GC" = Java garbage collection). Apache Tomcat 8 was re-configured by increasing Java max heap memory from 1024MB to 2048MB.
  2. When many annotation changes were made, the final "commit" step, when Hibernate updates the database with the changes made in the corresponding items in the program, took a long time. Often the import crashed during this step, for various reasons.
  3. Processing of the first test file took much longer than the second, as the former contains many more non-blank entries. Sometimes this led to error "java.io.EOFException: Unexpected EOF read on the socket" when trying to read the second test file afterwards.
  4. After a successful initial import, in which a lot of INCA annotations received values, a re-import of the same test files was tested. The expected result was that no INCA annotations would be updated, but that the INCA date annotations would, as the test was performed on a different day than the initial import. The test was also performed in order to check whether the data processing took longer than for the initial import, as the program now had to retrieve values for a lot of annotations in order to check whether they needed to be updated. The tests confirmed that this was indeed the case.

Recommendations for changes in the INCA import, based on initial tests:

  1. In order to make the import more stable, data for all files should be input before the data is processed. This should be possible, since the size of the test files (7.5MB and 2.4MB, respectively) is low enough for the contents to be stored in the memory of a modern computer.

comment:17 by olle, 8 years ago

(In [3817]) Refs #525. INCA import hopefully made more stable, by reading data for all input files before processing it. Also some other changes in order to increase readability of the code:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Data for all input files are now read, before the data is processed. The change requires the introduction of a number of extra list items to store input information for the individual files for later processing.
    b. Name of AnnotationType variable changed to "at", since that name is used in several other programs.

comment:18 by olle, 8 years ago

(In [3818]) Refs #525. INCA import refactored, in order to put code sections in more logical order:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Code for input of request parameters moved to top of code section.
    b. Mapping of biosource ID to personal number moved outside of loop for processing file data, since the map is independent of the input files.

comment:19 by olle, 8 years ago

(In [3819]) Refs #525. First (experimental) attempt at implementing support for an annotation snapshot manager for INCA annotations:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. A new SnapshotManager object manager is created by calling method getSnapshotManager().
    b. Retrieval of annotation values, where the annotation types have an Annotationtype (note lowercase 't' in "type") representation, is prepared for using a SnapshotManager, but not yet activated (calls are commented out).
    c. A HashMap<Integer,AnnotationTypeFilter> object is used to store a snapshot filter for ID values of used AnnotationType items.
    d. Retrieval of INCA annotation values is now performed by calling new private method Object fetchAnnotationValue(DbControl dc, AnnotationType at, AnnotationSet as, HashMap<Integer,AnnotationTypeFilter> atIdSnapshotFilterHM, SnapshotManager manager, Annotatable item).
    e. New private method Object fetchAnnotationValue(DbControl dc, AnnotationType at, AnnotationSet as, HashMap<Integer,AnnotationTypeFilter> atIdSnapshotFilterHM, SnapshotManager manager, Annotatable item) added. If a snapshot manager and snapshot filter exist, it obtains the value by calling method findAnnotations(dc, item, snapshotFilter, false), otherwise new private method Object fetchAnnotationValue(AnnotationType at, AnnotationSet as) is called. If a new snapshot filter needs to be created, it is stored in HashMap atIdSnapshotFilterHM.
    f. New private method Object fetchAnnotationValue(AnnotationType at, AnnotationSet as) added. It retrieves the annotation value in the same manner as previously used.
    g. Output log messages added for tests, in order to check how far the application has run, if terminated prematurely, and to get time estimates for different parts of the code for performance checks.

comment:20 by olle, 8 years ago

Design discussion:

When INCA data is supplied in multiple files, the latter normally contain rows for the same set of personal numbers and laterality. The majority of these correspond to uni-lateral cases, where a personal number uniquely defines a case item in the database. This makes it possible to speed up the case mapping step in the file check stage for INCA files following the first, if unique case ID and laterality values in the SCAN-B database are stored for personal numbers in the first INCA file.

Design update:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Two new hash maps created for storing database case ID and laterality for given personal number. These hash maps are then first checked for case ID and laterality for a given personal number, before accessing the database/snapshot manager for the values. If the values need to be obtained from the database/snapshot manager, the former are stored for future use in the new hash maps with the personal number as key.
    b. Commands for obtaining laterality and INCA date annotations for a case item are updated to use a snapshot manager.
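
The caching idea in item (a) can be sketched as follows. This is only an illustration and not the code in IncaServlet: the class CaseLookupCache, the holder class CaseInfo and the lookup function passed to the constructor are hypothetical stand-ins for the access to the database/snapshot manager, and the lookup is assumed to return null when a personal number does not map to a unique case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class CaseLookupCache
{
  /** Hypothetical holder for the two values fetched for a uniquely mapped case. */
  static final class CaseInfo
  {
    final int caseId;
    final String laterality;

    CaseInfo(int caseId, String laterality)
    {
      this.caseId = caseId;
      this.laterality = laterality;
    }
  }

  // The two hash maps, both keyed by personal number.
  private final Map<String, Integer> caseIdByPersonalNumber = new HashMap<>();
  private final Map<String, String> lateralityByPersonalNumber = new HashMap<>();

  // Stand-in for the (slow) database/snapshot manager access.
  private final Function<String, CaseInfo> databaseLookup;
  private int databaseCalls = 0;

  CaseLookupCache(Function<String, CaseInfo> databaseLookup)
  {
    this.databaseLookup = databaseLookup;
  }

  /** Checks the hash maps first and only falls back to the database on a miss. */
  CaseInfo find(String personalNumber)
  {
    Integer cachedId = caseIdByPersonalNumber.get(personalNumber);
    if (cachedId != null)
    {
      return new CaseInfo(cachedId, lateralityByPersonalNumber.get(personalNumber));
    }
    databaseCalls++;
    CaseInfo info = databaseLookup.apply(personalNumber);
    if (info != null)
    {
      // Store the values for later files, with the personal number as key.
      caseIdByPersonalNumber.put(personalNumber, info.caseId);
      lateralityByPersonalNumber.put(personalNumber, info.laterality);
    }
    return info;
  }

  int getDatabaseCalls()
  {
    return databaseCalls;
  }

  public static void main(String[] args)
  {
    CaseLookupCache cache = new CaseLookupCache(pn -> new CaseInfo(42, "Left"));
    cache.find("PN-0001"); // first file: goes to the "database"
    cache.find("PN-0001"); // second file: answered from the hash maps
    System.out.println("Database calls: " + cache.getDatabaseCalls());
  }
}
```

Personal numbers without a unique case are not cached in this sketch, so they are looked up again for each file.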

comment:21 by olle, 8 years ago

(In [3830]) Refs #525. Attempt to speed up file check/case mapping stage for extra INCA files, by storing case ID and database laterality for personal numbers in the first INCA file, when the former are unique. Also more use of annotation snapshot manager:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Two new hash maps created for storing database case ID and laterality for given personal number. These hash maps are then first checked for case ID and laterality for a given personal number, before accessing the database/snapshot manager for the values. If the values need to be obtained from the database/snapshot manager, the former are stored for future use in the new hash maps with the personal number as key.
    b. Commands for obtaining laterality and INCA date annotations for a case item are updated to use a snapshot manager.

comment:22 by olle, 8 years ago

Design discussion:

  1. The fact that the INCA input files normally contain lines for the same cases, which is used to speed up the mapping step, could also be exploited in the import step, by importing all INCA annotations for a single case in the same sub-step, instead of doing it file by file. While some speed improvement might be possible, the main improvement would be that this makes it possible to divide the single commit step into a number of commits, where each one concerns a sub-set of the cases. The advantage of using several commits would be to avoid program crashes due to the Java heap memory reaching its maximum limit. If something goes wrong during the import step, some cases would have been updated and others not, but for any single case the INCA annotations would be either all up-to-date or all unchanged. Using several commits while processing file by file might instead leave some cases with only a part of their INCA annotations up-to-date.
  2. Allowing all INCA annotations for a single case to be committed in the same sub-step, requires collecting INCA data for a case from all input INCA files. In order to make the code more readable, this should be done using a number of inner help classes of the data access object type.
  3. If some headers of columns to be imported exist in more than one file, it is desirable that the corresponding INCA annotation for a case item is only updated once (or that the update is counted once, if the column values are identical in all files).

Design update:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. A new hash map is created for storing annotation type for given annotation type ID.
    b. Code is re-written to collect INCA annotations for a given case from all input INCA files, before import is performed. The import is then performed, one case at a time. In order to make the code more readable, use is made of new inner help classes UnprocessedIncaFile, IncaFile, RawIncaCase, IncaCase, and IncaAnnoItem.
    c. New inner private class UnprocessedIncaFile added. It stores filename, list of header strings, and list of data lines for an INCA input file. Note that not all data lines might contain mapping information allowing import to the SCAN-B database.
    d. New inner private class IncaFile added. It stores filename, list of header strings, list of indexes for columns to be imported, and list of RawIncaCase items for an INCA input file.
    e. New inner private class RawIncaCase added. It stores database ID and INCA import line for a case item.
    f. New inner private class IncaCase added. It stores database ID, list of IncaAnnoItem objects, and list of database ID values for used annotation types for a case item.
    g. New inner private class IncaAnnoItem added. It stores the database ID of the annotation type and the value string to be imported.
    h. Some variable names have been updated, in order to make them more consistent.
    i. Test version writes a log message with time stamp for every 100th case item that is processed for import, in order to check performance/stability.
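
The collection step in item (b) can be sketched as below. This is a simplified illustration and not the inner classes listed above: the class PerCaseCollector, its Row class and the helper row() are hypothetical, column headers are used directly as keys instead of annotation type IDs, and letting the first file win when a header occurs in several files is an assumption made for the sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerCaseCollector
{
  /** One mapped data line: database ID of the case plus header -> value pairs. */
  static final class Row
  {
    final int caseId;
    final Map<String, String> values;

    Row(int caseId, Map<String, String> values)
    {
      this.caseId = caseId;
      this.values = values;
    }
  }

  /** Convenience helper: row(1, "Header1", "value1", "Header2", "value2"). */
  static Row row(int caseId, String... headerValuePairs)
  {
    Map<String, String> values = new LinkedHashMap<>();
    for (int i = 0; i + 1 < headerValuePairs.length; i += 2)
    {
      values.put(headerValuePairs[i], headerValuePairs[i + 1]);
    }
    return new Row(caseId, values);
  }

  /**
   * Merges the rows from all files into one value map per case.
   * A header that occurs in several files is stored once per case
   * (assumption in this sketch: the value from the first file is kept).
   */
  static Map<Integer, Map<String, String>> collect(List<List<Row>> files)
  {
    Map<Integer, Map<String, String>> perCase = new LinkedHashMap<>();
    for (List<Row> file : files)
    {
      for (Row row : file)
      {
        Map<String, String> merged =
          perCase.computeIfAbsent(row.caseId, id -> new LinkedHashMap<>());
        for (Map.Entry<String, String> entry : row.values.entrySet())
        {
          merged.putIfAbsent(entry.getKey(), entry.getValue());
        }
      }
    }
    return perCase;
  }

  public static void main(String[] args)
  {
    List<List<Row>> files = Arrays.asList(
      Arrays.asList(row(1, "A", "x"), row(2, "A", "y")),
      Arrays.asList(row(1, "A", "x", "B", "z")));
    // Each entry now holds everything needed to import one case in a single sub-step.
    collect(files).forEach((caseId, values) -> System.out.println(caseId + " -> " + values));
  }
}
```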
Last edited 8 years ago by olle

comment:23 by olle, 8 years ago

(In [3836]) Refs #525. INCA import re-written to collect INCA data for each case from all input INCA files before import, and then perform the import one case at a time:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. A new hash map is created for storing annotation type for given annotation type ID.
    b. Code is re-written to collect INCA annotations for a given case from all input INCA files, before import is performed. The import is then performed, one case at a time. In order to make the code more readable, use is made of new inner help classes UnprocessedIncaFile, IncaFile, RawIncaCase, IncaCase, and IncaAnnoItem.
    c. New inner private class UnprocessedIncaFile added. It stores filename, list of header strings, and list of data lines for an INCA input file. Note that not all data lines might contain mapping information allowing import to the SCAN-B database.
    d. New inner private class IncaFile added. It stores filename, list of header strings, list of indexes for columns to be imported, and list of RawIncaCase items for an INCA input file.
    e. New inner private class RawIncaCase added. It stores database ID and INCA import line for a case item.
    f. New inner private class IncaCase added. It stores database ID, list of IncaAnnoItem objects, and list of database ID values for used annotation types for a case item.
    g. New inner private class IncaAnnoItem added. It stores the database ID of the annotation type and the value string to be imported.
    h. Some variable names have been updated, in order to make them more consistent.
    i. Test version writes a log message with time stamp for every 100th case item that is processed for import, in order to check performance/stability.

comment:24 by olle, 8 years ago

Functional specification update:

  • It is desirable that the steps needed when preparing an INCA spreadsheet file for import, i.e. creating a file in *.csv format with tab column separators, are as few and simple as possible. The operation described above 2016-02-09 contains 5 steps, where steps 2 and 3 concern replacing internal line feed and tab characters with empty strings. The text parsing procedure can already handle cells with internal line feed characters, as the number of columns in a line is known from the header line. If step 4 is modified to saving the *.csv file with "Text delimiter" set to a double quote character '"', instead of an empty string (blank), it should be possible for the program to identify the internal tab characters and replace them with spaces in the input data. (In the old instruction, internal tab characters were replaced by empty strings, but if the former are used to separate two words, it is safer to replace them with spaces.)

Design update:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca" in the section reading an input INCA *.csv file:
    a. A single helper TrimmedLineItem data access object is created. It will be reused, and allows new private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) to return both a trimmed line and current status of a flag indicating if the text processing is inside a section enclosed by double quotes.
    b. Boolean flag insideDoubleQuotes indicates if the text processing is inside a section enclosed by double quotes, and is initialized to false at the start of each new file.
    c. Each raw line read from the input file is processed by new private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem), before being split into columns, using tab characters as column separators.
    d. New private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) added. It replaces tabs with spaces in sections enclosed by double quotes, and returns a TrimmedLineItem object containing the possibly modified line, together with current status of the flag indicating if the text processing is inside a section enclosed by double quotes.
    e. New inner private helper class TrimmedLineItem of data access object type added. It contains a boolean flag indicating if the text processing is inside a section enclosed by double quotes, and a string containing the line for processing.
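
The text processing described in items (a) to (e) can be sketched as follows. This is a reconstruction from the description above and not the code in IncaServlet: the class names QuoteAwareTabTrim and TrimmedLine are hypothetical, and the sketch assumes that the double quote characters themselves are left in the line.

```java
class QuoteAwareTabTrim
{
  /** Reusable holder for the line and the "inside double quotes" flag. */
  static final class TrimmedLine
  {
    boolean insideDoubleQuotes;
    String line;

    TrimmedLine(boolean insideDoubleQuotes, String line)
    {
      this.insideDoubleQuotes = insideDoubleQuotes;
      this.line = line;
    }
  }

  /**
   * Replaces tabs with spaces in sections enclosed by double quotes.
   * The flag is carried over from the previous line, so a quoted cell
   * that contains a line feed is handled across several raw lines.
   */
  static TrimmedLine tabDoubleQuoteTrim(TrimmedLine item)
  {
    StringBuilder sb = new StringBuilder(item.line.length());
    boolean inside = item.insideDoubleQuotes;
    for (int i = 0; i < item.line.length(); i++)
    {
      char c = item.line.charAt(i);
      if (c == '"')
      {
        // Every double quote toggles the state; a doubled quote toggles twice.
        inside = !inside;
      }
      sb.append(c == '\t' && inside ? ' ' : c);
    }
    item.insideDoubleQuotes = inside;
    item.line = sb.toString();
    return item;
  }

  public static void main(String[] args)
  {
    // The flag is set to false at the start of each new file.
    TrimmedLine item = new TrimmedLine(false, "");
    String[] rawLines = { "1\t\"two\twords\"\t3", "4\t\"cell with", "line feed\tand tab\"\t6" };
    for (String raw : rawLines)
    {
      item.line = raw;
      tabDoubleQuoteTrim(item);
      // Only tabs outside double quotes remain as column separators.
      System.out.println(item.line.split("\t", -1).length + " column part(s): " + item.line);
    }
  }
}
```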

comment:25 by olle, 8 years ago

(In [3840]) Refs #525. INCA import updated to allow the program to handle internal line feed and tab characters, provided the *.csv file is saved with "Text delimiter" set to a double quote character '"', instead of an empty string (blank):

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca" in the section reading an input INCA *.csv file:
    a. A single helper TrimmedLineItem data access object is created. It will be reused, and allows new private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) to return both a trimmed line and current status of a flag indicating if the text processing is inside a section enclosed by double quotes.
    b. Boolean flag insideDoubleQuotes indicates if the text processing is inside a section enclosed by double quotes, and is initialized to false at the start of each new file.
    c. Each raw line read from the input file is processed by new private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem), before being split into columns, using tab characters as column separators.
    d. New private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) added. It replaces tabs with spaces in sections enclosed by double quotes, and returns a TrimmedLineItem object containing the possibly modified line, together with current status of the flag indicating if the text processing is inside a section enclosed by double quotes.
    e. New inner private helper class TrimmedLineItem of data access object type added. It contains a boolean flag indicating if the text processing is inside a section enclosed by double quotes, and a string containing the line for processing.

comment:26 by olle, 8 years ago

(In [3841]) Refs #525. INCA import updated by removing debug output of lines modified by private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem), when reading input files:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca" in the section reading an input INCA *.csv file:
    a. Debug output of lines modified by private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) removed.

comment:27 by olle, 8 years ago

Updated recommended procedure for creating a tab-separated *.csv file suitable for INCA import into BASE from an *.xlsx INCA export file in spreadsheet format. The instructions are written for Apache OpenOffice Calc 3.4.1 or LibreOffice 5.1.1.3, but should be regarded as guidelines for use of other programs:

  1. Open INCA export file *.xlsx in OpenOffice.org Calc.
  2. Save edited file as *.csv file in tab-delimited format:
    a. Menu "File" -> "Save As...".
    b. In "Save As" dialog, select directory to save created file in.
    c. For "Save as type:" select "Text CSV (.csv) (*.csv)".
    d. For "File name:" change file extension to ".csv", if not already done by "Automatic file name extension".
    e. Click button "Save".
    f. In extra dialog, select "Keep Current Format" (not "Save in ODF Format").
    g. In "Export Text File" dialog, for "Character set" select "Unicode (UTF-8)".
    h. In "Export Text File" dialog, for "Field delimiter" select "{Tab}".
    i. In "Export Text File" dialog, for "Text delimiter" select '"' (double quote).
    j. In "Export Text File" dialog, select check box option "Save cell content as shown" (all other check box options unselected).
    k. In "Export Text File" dialog, click button "OK".
  3. Close OpenOffice.org Calc window.

Note: The previous instructions from 2016-02-09 should still produce valid input files, but require more work.

comment:28 by olle, 8 years ago

(In [3842]) Refs #525. INCA import updated to disable step-1 input fields under step-2:

  1. Javascript file import-inca.js in resources/personal/ updated in function initializeStep2(response) to disable step-1 input fields.

comment:29 by Nicklas Nordborg, 8 years ago

(In [3846]) References #525: Import data from INCA

Marked the "Inca import" as an experimental feature. It will be disabled by default. To enable it add <inca-import>1</inca-import> inside the <experimental-features> section in reggie-config.xml.

comment:30 by olle, 8 years ago

Design discussion:

  • Tests have shown that imports with a large number of annotation changes (a full INCA import might include more than about 500000) might terminate prematurely during the commit step, as the Java heap memory is exhausted. In order to stabilize the import, a commit can be performed after a fixed number of case items have been processed.
  • Conversion of a value string from an INCA import file to the expected value type of the corresponding INCA annotation type might throw a net.sf.basedb.core.InvalidDataException, if the string content doesn't match the expected type. In order to obtain more information, if this happens, the exception should be caught, and the full contents of the parsed import line for the case in question should be logged.

Design update:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Commit is now performed after each 100th case item has been processed. Debug output is written to the log file, in order to check the time spent in different parts of the program.
    b. If a net.sf.basedb.core.InvalidDataException is thrown when trying to convert a value string from an INCA file column to the value type of the corresponding INCA annotation type, the exception is now caught, and the full contents of the parsed import line for the case in question is written to the log.
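
The control flow of items (a) and (b) can be sketched as below. This is an illustration only and does not use the BASE API: the interfaces Transaction and CaseImporter are hypothetical stand-ins, IllegalArgumentException stands in for net.sf.basedb.core.InvalidDataException, and re-throwing after the log message is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class BatchedCommitSketch
{
  static final int CASES_PER_COMMIT = 100;

  /** Hypothetical stand-in for the database transaction. */
  interface Transaction
  {
    void commit();
  }

  /** Hypothetical stand-in for importing all INCA annotations of one case. */
  interface CaseImporter
  {
    void importCase(String parsedLine);
  }

  /** Imports the cases one at a time and returns the number of commits made. */
  static int importAll(List<String> parsedLines, CaseImporter importer,
    Transaction tx, List<String> log)
  {
    int processed = 0;
    int commits = 0;
    for (String line : parsedLines)
    {
      try
      {
        importer.importCase(line);
      }
      catch (IllegalArgumentException ex)
      {
        // Log the full parsed line for the case, so the bad value can be located.
        log.add("Invalid data in line: " + line + " (" + ex.getMessage() + ")");
        throw ex; // assumption in this sketch: stop the import after logging
      }
      processed++;
      if (processed % CASES_PER_COMMIT == 0)
      {
        tx.commit();
        commits++;
        log.add("Committed after " + processed + " cases");
      }
    }
    if (processed % CASES_PER_COMMIT != 0)
    {
      // Final commit for the remaining cases.
      tx.commit();
      commits++;
    }
    return commits;
  }

  public static void main(String[] args)
  {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < 250; i++)
    {
      lines.add("case " + i);
    }
    List<String> log = new ArrayList<>();
    int commits = importAll(lines, line -> { }, () -> { }, log);
    System.out.println("Commits: " + commits);
  }
}
```

With a fixed batch size, the number of uncommitted changes held in memory is bounded by the changes for 100 cases rather than by the size of the whole import.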

comment:31 by olle, 8 years ago

(In [3854]) Refs #525. INCA import updated in order to try to stabilize the application, and to gain more information if it terminates prematurely due to conversion errors:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Commit is now performed after each 100th case item has been processed. Debug output is written to the log file, in order to check the time spent in different parts of the program.
    b. If a net.sf.basedb.core.InvalidDataException is thrown when trying to convert a value string from an INCA file column to the value type of the corresponding INCA annotation type, the exception is now caught, and the full contents of the parsed import line for the case in question is written to the log.

comment:32 by olle, 8 years ago

Info:

  • BASE ticket #2000 (Batch API for annotation handling) contains information of interest for the current ticket. BASE ticket #2001 (The annotation importer should only replace existing values when the new values are different) may also be of interest.
Last edited 8 years ago by olle

comment:33 by olle, 8 years ago

Bug found:

  • Current implementation of INCA import with multiple commits does not work past the first commit, unless some components of stored annotation type objects are initialized, when the annotation type object is fetched from the database. (Unfortunately, the first test with the new code was performed with input *.csv files with all columns except "PATID" and mapping columns being empty, in an effort to clear all previous INCA annotations in the database. Since all INCA annotations should be removed in this case, the program only needs to check whether previous annotations exist, not their values, in order to determine how many changes have been made, so the bug was not triggered.)

Design update:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. In order for stored annotation types to work past a first commit, values for enumerable annotation types, as well as collections "itemTypes" and "options", are initialized when an annotation type is fetched from the database.

comment:34 by olle, 8 years ago

(In [3855]) Refs #525. Bug fixed in INCA import with multiple commits, to make it work past the first commit. It now initializes some components of stored annotation type objects, when the annotation type object is fetched from the database. The program now also re-throws a caught net.sf.basedb.core.InvalidDataException after a log message has been written:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. In order for stored annotation types to work past a first commit, values for enumerable annotation types, as well as collections "itemTypes" and "options", are initialized when an annotation type is fetched from the database.
    b. A caught net.sf.basedb.core.InvalidDataException is now re-thrown after a log message has been written.

comment:35 by olle, 8 years ago

Functional specification update:

  • INCA import should display a report file download button when entering the application, if a report file exists.
  • INCA report file should contain start and end times for the import/test. Currently one time value is included, which corresponds to the end of the import/test.

Design update:

  • In addition to code changes to implement the functional specification updates above, some code updates have been made in order to make the code more complete and increase clarity.
  1. Outermost Ant build file build.xml in / updated to set BASE version to 3.8.0, since the recommended way of adding INCA annotation types is through use of an annotation type importer, that was introduced in that BASE version. (This also opens the possibility to use planned additions to the annotation API in BASE 3.8.0 in the INCA import code.)
  2. Javascript file import-inca.js in resources/personal/ updated:
    a. Function initPage() updated to not hide and disable the button for downloading a report file, but instead call new function checkForReportFile().
    b. New function checkForReportFile() added. It calls servlet IncaServlet with new command "CheckForIncaImportReportFile" in a "Get" request with callback function reportFileDownloadButtonDisplay(response).
    c. New function reportFileDownloadButtonDisplay(response) added. It retrieves a boolean flag indicating an existing report file from the servlet response, and an optional path to the report file. If a report file exists, the path for the file is stored in hidden input field reportFilePath and the button for downloading the report file is shown, otherwise the button is disabled and hidden.
    d. Function initializeStep2(response) updated to call "Wizard.setCurrentStep(2)" at start of the function.
  3. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doGet(HttpServletRequest req, HttpServletResponse resp) updated with new command "CheckForIncaImportReportFile". It calls new private convenience method String fetchReportFilePath() to obtain the path for an optional report file, and returns a JSON object with the path for key "reportFilePath", and a boolean flag indicating if a report file exists for key "incaReportFileExists".
    b. New private convenience method String fetchReportFilePath() added. It returns the path for an optional report file.
    c. The name to use for the report file is now stored in private String INCA_IMPORT_REPORT_FILENAME.
    d. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca". A time stamp is obtained at the start of the method, and is stored in JSONObject jsonIncaFilePropDetails for key "incaImportStart".
    e. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated to call new private convenience method String fetchReportFilePath() to obtain the path for the report file, and retrieves the value for the import start time stamp from JSONObject jsonIncaFilePropDetails. The start and end times of the import/test are now written at the beginning of the report.
    f. References to the report file in the code have been updated to avoid referring to it as a temporary file, since the file is not removed when the import/test is finished.
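
The response for command "CheckForIncaImportReportFile" in item 3(a) can be sketched as below. This is an illustration only: the class ReportFileCheckSketch is hypothetical, a plain Map is used in place of the JSON object, and the file name in main() is made up. Only the keys "reportFilePath" and "incaReportFileExists" are taken from the description above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class ReportFileCheckSketch
{
  /**
   * Builds the key/value pairs of the response: a boolean flag that tells
   * if a report file exists, and the path to the file if it does.
   */
  static Map<String, Object> checkForReportFile(Path reportFile)
  {
    Map<String, Object> json = new LinkedHashMap<>();
    boolean exists = Files.isRegularFile(reportFile);
    json.put("incaReportFileExists", exists);
    if (exists)
    {
      // The path is optional and only included when there is a file to download.
      json.put("reportFilePath", reportFile.toString());
    }
    return json;
  }

  public static void main(String[] args)
  {
    // Hypothetical file name, used for illustration only.
    Path reportFile = Paths.get("inca-import-report.txt");
    System.out.println(checkForReportFile(reportFile));
  }
}
```

On the Javascript side, the flag alone is enough to decide whether the download button should be shown or hidden.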
Last edited 8 years ago by olle

comment:36 by olle, 8 years ago

(In [3856]) Refs #525. INCA import updated:
a. A report file download button is now displayed, when entering the application, if a report file exists.
b. The INCA report file now contains start and end times for the import/test. (Previously one time value was included, which corresponded to the end of the import/test.)
c. In addition, some code updates have been made in order to make the code more complete and increase clarity.

  1. Outermost Ant build file build.xml in / updated to set BASE version to 3.8.0, since the recommended way of adding INCA annotation types is through use of an annotation type importer, that was introduced in that BASE version. (This also opens the possibility to use planned additions to the annotation API in BASE 3.8.0 in the INCA import code.)
  2. Javascript file import-inca.js in resources/personal/ updated:
    a. Function initPage() updated to not hide and disable the button for downloading a report file, but instead call new function checkForReportFile().
    b. New function checkForReportFile() added. It calls servlet IncaServlet with new command "CheckForIncaImportReportFile" in a "Get" request with callback function reportFileDownloadButtonDisplay(response).
    c. New function reportFileDownloadButtonDisplay(response) added. It retrieves a boolean flag indicating an existing report file from the servlet response, and an optional path to the report file. If a report file exists, the path for the file is stored in hidden input field reportFilePath and the button for downloading the report file is shown, otherwise the button is disabled and hidden.
    d. Function initializeStep2(response) updated to call "Wizard.setCurrentStep(2)" at start of the function.
  3. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doGet(HttpServletRequest req, HttpServletResponse resp) updated with new command "CheckForIncaImportReportFile". It calls new private convenience method String fetchReportFilePath() to obtain the path for an optional report file, and returns a JSON object with the path for key "reportFilePath", and a boolean flag indicating if a report file exists for key "incaReportFileExists".
    b. New private convenience method String fetchReportFilePath() added. It returns the path for an optional report file.
    c. The name to use for the report file is now stored in private String INCA_IMPORT_REPORT_FILENAME.
    d. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca". A time stamp is obtained at the start of the method, and is stored in JSONObject jsonIncaFilePropDetails for key "incaImportStart".
    e. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated to call new private convenience method String fetchReportFilePath() to obtain the path for the report file, and retrieves the value for the import start time stamp from JSONObject jsonIncaFilePropDetails. The start and end times of the import/test are now written at the beginning of the report.
    f. References to the report file in the code have been updated to avoid referring to it as a temporary file, since the file is not removed when the import/test is finished.

comment:37 by olle, 8 years ago

(In [3857]) Refs #525. Bug fix: INCA import updated in javascript, to make it compatible with changes in parameter names in servlet IncaServlet:

  1. Javascript file import-inca.js in resources/personal/ updated in function downloadReportFile() to use parameter name "reportFilePath" instead of "tmpFilePath" for the report file path, when calling servlet IncaServlet with command "DownloadIncaImportReportFile".

comment:38 by olle, 8 years ago

Design update:

  • Inca import should be updated in the commit step to use the AnnotationBatcher API introduced in BASE 3.8.0 in BASE Ticket #2000 (Batch API for annotation handling). This will decrease use of heap memory and improve commit speed.
  • The new AnnotationBatcher cannot be used in the same session as standard database requests for the same item. Since values of a number of case annotations like laterality etc. are needed in the test part preceding the import part, a new DbControl item has to be created for use in the import part.
  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. A list of database ID values for INCA annotation types is stored after the database query.
    b. Before the import step, dc.commit() is called for the DbControl item used in the input file test steps, after which a new DbControl item is created for use with the AnnotationBatcher batcher item. INCA annotation type lists and hash maps, that are to be used in the import step, are re-created using annotation type items created from the stored ID list using the new DbControl item. The INCA annotation types plus the INCA export and import date annotation types are added to the batcher.
    c. For each case mapped to the INCA import, batcher is set to use the current case item, then loaded with the INCA annotations to be updated (using the setValue() method), after which the INCA export and import date annotations are loaded.
    d. A single command dc.commit() is then called.
  2. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated by removing private methods Object fetchAnnotationValue(AnnotationType at, AnnotationSet as) and Object fetchAnnotationValue(DbControl dc, AnnotationType at, AnnotationSet as, HashMap<Integer,AnnotationTypeFilter> atIdSnapshotFilterHM, SnapshotManager manager, Annotatable item), as they are no longer used.
Last edited 8 years ago by olle

comment:39 by olle, 8 years ago

(In [3858]) Refs #525. Inca import updated in the commit step to use the AnnotationBatcher API introduced in BASE 3.8.0 in BASE Ticket #2000 (Batch API for annotation handling). This will decrease use of heap memory and improve commit speed. The new AnnotationBatcher cannot be used in the same session as standard database requests for the same item. Since values of a number of case annotations like laterality etc. are needed in the test part preceding the import part, a new DbControl item has to be created for use in the import part:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. A list of database ID values for INCA annotation types is stored after the database query.
    b. Before the import step, dc.commit() is called for the DbControl item used in the input file test steps, after which a new DbControl item is created for use with the AnnotationBatcher batcher item. INCA annotation type lists and hash maps, that are to be used in the import step, are re-created using annotation type items created from the stored ID list using the new DbControl item. The INCA annotation types plus the INCA export and import date annotation types are added to the batcher.
    c. For each case mapped to the INCA import, the batcher is set to use the current case item and loaded with the INCA annotations to be updated (using the setValue() method), after which the INCA export and import date annotations are loaded.
    d. A single command dc.commit() is then called.
  2. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated by removing private methods Object fetchAnnotationValue(AnnotationType at, AnnotationSet as) and Object fetchAnnotationValue(DbControl dc, AnnotationType at, AnnotationSet as, HashMap<Integer,AnnotationTypeFilter> atIdSnapshotFilterHM, SnapshotManager manager, Annotatable item), as they are no longer used.

comment:40 by Nicklas Nordborg, 8 years ago

Milestone: changed from Reggie v4.x to Reggie v4.4

comment:41 by Nicklas Nordborg, 8 years ago

(In [3864]) References #525: Import data from INCA

Remove unused code. This should not be merged to the trunk since it has already been replaced with other functionality.

comment:42 by olle, 8 years ago

(In [3867]) Refs #525. INCA import updated in report file management. The report file path is only needed on the servlet side, so all references to it in JSP/Javascript are removed:

  1. JSP file import-inca.jsp in resources/personal/ updated by removing unused hidden input field.
  2. Javascript file import-inca.js in resources/personal/ updated by removing unused references to report file path.
  3. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doGet(HttpServletRequest req, HttpServletResponse resp) updated for command "CheckForIncaImportReportFile" to not return report file path.
    b. Protected method void doGet(HttpServletRequest req, HttpServletResponse resp) updated for command "DownloadIncaImportReportFile" to call private method String fetchReportFilePath() to obtain report file path.
    c. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" to not return report file path.

comment:43 by olle, 8 years ago

Design update:

  • Use of snapshot manager when getting annotation values resulted in significantly different import times, depending on whether the snapshot cache had to be updated or not. The snapshot manager is therefore no longer used.
  • The INCA importer should check that each potential import value has a type matching the value type of the annotation type it is to be imported to. If not, data for the case corresponding to the line with the offending value should be skipped in all import files.
  1. Javascript file import-inca.js in resources/personal/ updated in function initializeStep2(response) to include line with number of data lines with bad values in database consistency check tables for import files.
  2. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Snapshot manager removed.
    b. List added for storing case ID values for lines with bad values.
    c. Loop over import files updated if full check is to be performed by storing info on bad values, and storing case ID values for lines with bad values.
    d. Import step updated when collecting data for each case ID, to skip cases, corresponding to lines with bad data in any of the import files.
    e. Log output during case mapping step reduced to every 1000 cases, instead of every 100 cases.
    f. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated to include the number of data lines with bad values found for each import file, and details about any found bad value.

comment:44 by olle, 8 years ago

(In [3875]) Refs #525. INCA import updated:
a. Use of snapshot manager when getting annotation values resulted in significantly different import times, depending on whether the snapshot cache had to be updated or not. The snapshot manager is therefore no longer used.
b. The INCA importer now checks that each potential import value has a type matching the value type of the annotation type it is to be imported to. If not, data for the case corresponding to the line with the offending value is skipped in all import files.

  1. Javascript file import-inca.js in resources/personal/ updated in function initializeStep2(response) to include line with number of data lines with bad values in database consistency check tables for import files.
  2. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Snapshot manager removed.
    b. List added for storing case ID values for lines with bad values.
    c. Loop over import files updated if full check is to be performed by storing info on bad values, and storing case ID values for lines with bad values.
    d. Import step updated when collecting data for each case ID, to skip cases, corresponding to lines with bad data in any of the import files.
    e. Log output during case mapping step reduced to every 1000 cases, instead of every 100 cases.
    f. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated to include the number of data lines with bad values found for each import file, and details about any found bad value.
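
Illustration only (not the committed code): a minimal Java sketch of the value type check described above. The class ValueTypeCheckSketch, its nested enum and the ISO date format are assumptions made for this example; the real servlet works with BASE annotation types, which are not reproduced here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class ValueTypeCheckSketch {
    // Hypothetical stand-in for the value type of an annotation type.
    enum ValueType { INT, FLOAT, DATE, STRING }

    // Returns true if the raw text from the import file can be converted to
    // the given value type. Empty values are treated as "no value" and accepted
    // (an assumption of this sketch).
    static boolean isValidValue(String raw, ValueType type) {
        if (raw == null || raw.isEmpty()) {
            return true;
        }
        try {
            switch (type) {
                case INT:
                    Integer.parseInt(raw.trim());
                    return true;
                case FLOAT:
                    Float.parseFloat(raw.trim());
                    return true;
                case DATE:
                    // Assumed date format: yyyy-MM-dd
                    LocalDate.parse(raw.trim(), DateTimeFormatter.ISO_LOCAL_DATE);
                    return true;
                default:
                    return true; // any text is a valid string value
            }
        } catch (NumberFormatException | DateTimeParseException e) {
            return false; // the line holding this value marks its case as "bad"
        }
    }
}
```

A caller would add the case ID of any line for which isValidValue(...) returns false to the list of case IDs to skip in the import step.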

comment:45 by olle, 8 years ago

(In [3882]) Refs #525. INCA importer updated in the bad value check for INCA annotation types of integer value type, to check if an enumeration is specified, and if so, check if the supplied value is included in the enumeration. If not, data for the case corresponding to the line with the offending value is skipped in all import files, and a note on the offending value is included in the report file.

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" to check if an INCA annotation type of integer value type specifies an enumeration, and if so, check if the supplied value is included in the enumeration. If not, data for the case corresponding to the line with the offending value is skipped in all import files. The offending value together with the enumeration list is added to the data stored for use by the report file.
    b. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated to include the optional enumeration list in the data reported for found bad values.
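
Illustration only: a minimal Java sketch of the enumeration check for integer values. The class name, the method name and the message texts are hypothetical; a null list stands for "no enumeration specified", and values of the wrong type are reported without an enumeration list in this sketch.

```java
import java.util.List;

class EnumerationCheckSketch {
    // Returns null if the value is acceptable, otherwise a note for the report file.
    // allowedValues == null means that the annotation type has no enumeration.
    static String checkIntegerValue(String raw, List<Integer> allowedValues) {
        int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Wrong value type: no enumeration list is included in the note.
            return "Bad value '" + raw + "' (not an integer)";
        }
        if (allowedValues != null && !allowedValues.contains(value)) {
            // Right type, but not one of the enumerated values.
            return "Bad value '" + raw + "' (allowed values: " + allowedValues + ")";
        }
        return null;
    }
}
```

A non-null return value means that the case for the current line is skipped and the note is stored for the report file.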

comment:46 by olle, 8 years ago

(In [3886]) Refs #525. INCA importer updated in the bad value check for INCA annotation types, to not specify an optional enumeration in the report file, if the offending value is of the wrong value type, or if no enumeration exists (previously value null was reported in these cases):

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" in the bad value check, to set enumeration to null, if the offending value is of the wrong value type.
    b. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated to only include an optional enumeration list in the data reported for a found bad value, if the enumeration list differs from null.

comment:47 by olle, 8 years ago

Design discussion:

  • It has been decided that future INCA input files should be converted to a tab-separated *.csv file directly at INCA, using a specially designed program, taking the "raw" INCA output and the SCAN-B request file (referred to as "INCA export") as input. See wiki page INCA XML to CSV converter and Ticket #881 (Implement INCA XML to CSV converter) for more information.
  • The new INCA export procedure will affect the INCA importer, as several properties of the INCA import file will change:
  1. The number of columns will change (normally it will increase).
  2. The headers of some columns that existed in previous INCA files have changed. Specifically, some columns used to map lines in the INCA file to SCAN-B case items are affected.
  3. Internal newline and tab characters are now handled by the conversion program, and the INCA importer has to be adapted to this so that the imported values are consistent with other BASE exporters. The conversion program will encode newline, tab, and backslash characters as \n, \t, and \\, respectively. The INCA importer should decode these sequences back to the original characters when an INCA annotation is updated.

Changed column headers used for case mapping:

Column contents | Old header | New header
Temporary patient ID | PATID | PAT_ID
Personal number | PersonalNo | PERSNR
The new INCA file contains 255 columns (excluding the PAT_ID and PERSNR mapping columns) without a corresponding INCA annotation type.

Missing INCA columns in new file:

Header
U070EndoAnn
U070EndoArom
U070EndoBehPg
U070EndoEjAkt
U070EndoSlutDat
U070EndoTam

comment:48 by olle, 8 years ago

(In [3888]) Refs #525. INCA importer updated to accept alternative headers "PAT_ID" and "PERSNR" for temporary patient ID and personal number, respectively:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Alternative header strings "PAT_ID" and "PERSNR" added to list of headers for unimported columns.
    b. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" when searching for mapping columns, to test alternative header columns "PAT_ID" and "PERSNR", if no temporary patient ID column was found for the original header.

comment:49 by olle, 8 years ago

(In [3889]) Refs #525. INCA importer updated to use an object of the BASE TabCrLfEncodeDecoder class to decode escaped characters "\n", "\t", and "\\" back to the original special characters before updating an INCA annotation:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Unused imports removed.
    b. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" in import section to use an object of the BASE TabCrLfEncodeDecoder class to decode escaped characters "\n", "\t", and "\\" back to the original special characters before updating an INCA annotation.
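
Illustration only: the commit uses the BASE TabCrLfEncodeDecoder class, whose API is not reproduced here. The hand-written Java stand-in below (class name EscapeDecodeSketch is hypothetical) just shows the decoding rule, i.e. "\n", "\t" and "\\" are turned back into newline, tab and backslash in a single left-to-right pass.

```java
class EscapeDecodeSketch {
    // Decodes the escape sequences \n, \t and \\ back to the original characters.
    static String decode(String encoded) {
        StringBuilder sb = new StringBuilder(encoded.length());
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '\\' && i + 1 < encoded.length()) {
                char next = encoded.charAt(i + 1);
                if (next == 'n') { sb.append('\n'); i++; continue; }
                if (next == 't') { sb.append('\t'); i++; continue; }
                if (next == '\\') { sb.append('\\'); i++; continue; }
            }
            // Ordinary character, or an unrecognized escape that is kept as-is.
            sb.append(c);
        }
        return sb.toString();
    }
}
```

A single pass is used instead of three consecutive string replacements, since an encoded backslash followed by the letter n (written \\n in the file) must decode to a backslash plus "n", not to a newline.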

comment:50 by olle, 8 years ago

(In [3895]) Refs #525. INCA importer updated to accept alternative headers for temporary patient ID and personal number independently of each other. "PATID" is tested first for temporary patient ID and "PersonalNo" for personal number; if a column is found for an INCA variable, the other alternative is not tested. An optional hyphen in the personal number is now removed before mapping to case items, since the personal numbers in the SCAN-B database do not contain hyphens.

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca":
    a. Alternative headers for temporary patient ID and personal number are now tested independently of each other. "PATID" is tested first for temporary patient ID and "PersonalNo" for personal number; if a column is found for an INCA variable, the other alternative is not tested.
    b. An optional hyphen in the personal number is now removed before mapping to case items.
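
Illustration only: a minimal Java sketch of the independent header lookup and the hyphen removal. The class and method names are hypothetical; only the header strings come from this ticket.

```java
import java.util.List;

class MappingColumnSketch {
    // Returns the index of the first candidate header that is present,
    // or -1 if none of the candidates is found. Later candidates are not
    // tested once a column has been found.
    static int findColumnIndex(List<String> headers, String... candidates) {
        for (String candidate : candidates) {
            int index = headers.indexOf(candidate);
            if (index >= 0) {
                return index;
            }
        }
        return -1;
    }

    // Removes an optional hyphen, since personal numbers in the database have none.
    static String normalizePersonalNumber(String personalNo) {
        return personalNo.replace("-", "");
    }
}
```

With this, the two mapping columns are looked up separately, e.g. findColumnIndex(headers, "PATID", "PAT_ID") and findColumnIndex(headers, "PersonalNo", "PERSNR"), so a file may mix an old header for one variable with a new header for the other.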

comment:51 by olle, 8 years ago

(In [3900]) Refs #525. INCA importer updated by refactoring file upload:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by obtaining list of unprocessed INCA files by calling new private method List<UnprocessedIncaFile> fetchUnprocessedIncaFiles(HttpServletRequest req). The numbers of lines with line feeds, too many columns, and too few columns in each file are now obtained from the corresponding UnprocessedIncaFile object, instead of from local lists.
    b. New private method List<UnprocessedIncaFile> fetchUnprocessedIncaFiles(HttpServletRequest req) added. It uploads files posted in an HttpServletRequest and returns a list of UnprocessedIncaFile objects (one per uploaded file).
    c. Inner private class UnprocessedIncaFile updated with integer attributes for number of lines with line feeds, too many columns, and too few columns, respectively, together with public accessor methods.

comment:52 by olle, 8 years ago

(In [3901]) Refs #525. INCA importer updated by refactoring mapping of personal numbers to biosource id:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by obtaining hash map of personal number to biosource id by calling new private method HashMap<String,Integer> fetchPersonalNumberBioSourceIdHashMap(DbControl dc).
    b. New private method HashMap<String,Integer> fetchPersonalNumberBioSourceIdHashMap(DbControl dc) added. It returns a hash map mapping personal number to biosource id.

comment:53 by olle, 8 years ago

(In [3902]) Refs #525. INCA importer updated by removing unused variable List<String> excludePnoList:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca" by removing variable List<String> excludePnoList, that is not used (variable List<Integer> excludeLineList is used instead).

comment:54 by olle, 8 years ago

(In [3903]) Refs #525. INCA importer updated by refactoring finding key column indexes:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by obtaining key column indexes by calling new private method JSONObject fetchKeyColumnIndexes(List<String> headerList, JSONObject jsonIncaFileProp).
    b. New private method JSONObject fetchKeyColumnIndexes(List<String> headerList, JSONObject jsonIncaFileProp) added. It updates an input INCA file property JSONObject with information on key column indexes, obtained from a list of column headers.

comment:55 by olle, 8 years ago

(In [3908]) Refs #525. INCA importer updated by refactoring collection of potential INCA import lines for an unprocessed INCA file:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by obtaining list of potential INCA import lines for an unprocessed INCA file by calling new private method List<PotentialIncaImportLine> fetchPotentialIncaImportLines(int tempPatIdClmIndex, int personalNoClmIndex, int lateralityDescriptionClmIndex, List<String> lines). Using a single list of PotentialIncaImportLine objects should also be safer than relying on a number of synchronized lists.
    b. New private method List<PotentialIncaImportLine> fetchPotentialIncaImportLines(int tempPatIdClmIndex, int personalNoClmIndex, int lateralityDescriptionClmIndex, List<String> lines) added. It returns a list of PotentialIncaImportLine objects for lines with personal number.
    c. New inner private class PotentialIncaImportLine added. It is a data access object class with string attributes for personal number, laterality, temporary patient ID, and data line, respectively, together with public accessor methods.
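
Illustration only: a minimal Java sketch of the idea behind PotentialIncaImportLine, i.e. one object per data line instead of several lists that must be kept in step. The names ImportLineSketch, Line and collect are hypothetical simplifications; the real class uses accessor methods rather than public fields.

```java
import java.util.ArrayList;
import java.util.List;

class ImportLineSketch {
    // Simple data holder for the values needed from one data line.
    static final class Line {
        final String personalNo;
        final String laterality;
        final String tempPatId;
        final String dataLine;

        Line(String personalNo, String laterality, String tempPatId, String dataLine) {
            this.personalNo = personalNo;
            this.laterality = laterality;
            this.tempPatId = tempPatId;
            this.dataLine = dataLine;
        }
    }

    // Returns one Line object for each tab-separated line that has a personal number.
    static List<Line> collect(int tempPatIdCol, int personalNoCol, int lateralityCol, List<String> lines) {
        List<Line> result = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t", -1); // limit -1 keeps trailing empty columns
            String personalNo = column(cols, personalNoCol);
            if (personalNo.isEmpty()) {
                continue; // lines without personal number are not import candidates
            }
            result.add(new Line(personalNo, column(cols, lateralityCol), column(cols, tempPatIdCol), line));
        }
        return result;
    }

    // Returns the column value, or an empty string if the index is out of range.
    private static String column(String[] cols, int index) {
        return index >= 0 && index < cols.length ? cols[index] : "";
    }
}
```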

comment:56 by olle, 8 years ago

(In [3909]) Refs #525. INCA importer updated by removing unused or redundant variables numPatientIdWithMoreThanTwoLines and numPatientIdWithManyIdenticalLateralityLines:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by removing unused or redundant variables numPatientIdWithMoreThanTwoLines and numPatientIdWithManyIdenticalLateralityLines.
    b. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated by getting number of personal numbers with identical laterality lines from correct JSON key "numPersonalNoWithManyIdenticalLateralityLines", instead of previously used "numPatientIdWithManyIdenticalLateralityLines" (the numbers were equal).

comment:57 by olle, 8 years ago

(In [3910]) Refs #525. INCA importer updated by refactoring internal laterality check of potential INCA import lines for an unprocessed INCA file:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by performing internal laterality check of potential INCA import lines for an unprocessed INCA file by calling new private method InternalLateralityCheckResult internalLateralityCheck(List<PotentialIncaImportLine> potentialIncaImportLines).
    b. New private method InternalLateralityCheckResult internalLateralityCheck(List<PotentialIncaImportLine> potentialIncaImportLines) added. It performs an internal laterality check on a list of potential INCA import lines, and returns an InternalLateralityCheckResult object with results of the check.
    c. New inner private class InternalLateralityCheckResult added. It is a data access object class with attributes for JSONArray jsonPatientIdWithMoreThanTwoLines, JSONArray jsonPatientIdWithManyIdenticalLateralityLines, and List<Integer> excludeLineList, respectively, together with public accessor methods.
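
Illustration only: a minimal Java sketch of an internal laterality check. The exclusion rule used here is an assumption inferred from the attribute names above (more than two lines for a personal number, or several lines with identical laterality, lead to exclusion of all lines for that personal number); the committed method may differ in details. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LateralityCheckSketch {
    // Input: one {personalNo, laterality} pair per data line.
    // Returns the indexes of the lines to exclude from the import.
    static List<Integer> findLinesToExclude(List<String[]> lines) {
        // Group line indexes by personal number, keeping first-seen order.
        Map<String, List<Integer>> byPersonalNo = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            byPersonalNo.computeIfAbsent(lines.get(i)[0], k -> new ArrayList<>()).add(i);
        }
        List<Integer> exclude = new ArrayList<>();
        for (List<Integer> indexes : byPersonalNo.values()) {
            Set<String> lateralities = new HashSet<>();
            boolean duplicateLaterality = false;
            for (int i : indexes) {
                if (!lateralities.add(lines.get(i)[1])) {
                    duplicateLaterality = true; // same laterality seen twice
                }
            }
            if (indexes.size() > 2 || duplicateLaterality) {
                exclude.addAll(indexes);
            }
        }
        return exclude;
    }
}
```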

comment:58 by olle, 8 years ago

(In [3915]) Refs #525. INCA importer updated by refactoring database mapping and data value check of potential INCA import lines for an unprocessed INCA file:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by performing database mapping and data value check of potential INCA import lines for an unprocessed INCA file by calling new private method LineDatabaseMappingResult lineDatabaseMapping(DbControl dc, List<PotentialIncaImportLine> potentialIncaImportLines, ...).
    b. New private method LineDatabaseMappingResult lineDatabaseMapping(DbControl dc, List<PotentialIncaImportLine> potentialIncaImportLines, HashMap<String,Integer> pnoBioSourceIdHM, HashMap<String,String> incaLateralityHM, List<Integer> excludeLineList, List<Integer> importHeaderIndexList, List<String> headerList, HashMap<String,AnnotationType> incaAnnoNameAnnoTypeHM, int fileNo) added. It performs a database mapping and data value check on a list of potential INCA import lines, and returns a LineDatabaseMappingResult object with results of the check.
    c. New inner private class LineDatabaseMappingResult added. It is a data access object class with attributes for HashMap<Integer,Integer> rawLineNumberCaseIdHM, List<Integer> excludeLineList, List<Integer> excludeCaseIdList, int numPersonalNoWithoutDatabaseReference, int numPatientLateralitiesWithoutDatabaseReference, JSONArray jsonPatientIdForPersonalNoWithoutDatabaseReference, JSONArray jsonPatientLateralitiesWithoutDatabaseReference, and JSONArray jsonBadValueLines, together with public accessor methods.

comment:59 by olle, 8 years ago

(In [3917]) Refs #525. INCA importer updated by using a sample query to obtain sample item[s] for a patient item, in order to gain some speed (mapping cases to patients now takes ~70-80% of the original time):

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Private method LineDatabaseMappingResult lineDatabaseMapping(DbControl dc, List<PotentialIncaImportLine> potentialIncaImportLines, ...) updated to use a sample query to obtain sample item[s] for a patient item, instead of using Case.findByPatient(dc, patient).

comment:60 by olle, 8 years ago

Design update:

  • Reggie API has been updated with support for a progress reporter in Ticket #883 (Add support for progress reporting to the Reggie wizard API). The INCA importer should use this, as many operations take a long time to finish.
  1. JSP file import-inca.jsp in resources/personal/ updated by adding a <div id="wizard-progess"></div> div tag below the <div id="wizard-status"></div> tag.
  2. Javascript file import-inca.js in resources/personal/ updated in calls of Wizard.showLoadingAnimation(...) by adding a second argument 'inca-import-progress' with the name of the progress reporter.
  3. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by creating a SimpleProgressReporter item progress and storing it with the chosen name in the current session control. Calls to progress.display(...) are made at regular intervals. In order to make the progress percent values as accurate as possible, the fraction of time spent on mapping and value checking was estimated from test runs, and is adjusted depending on whether just a data check is performed, or if it is to be followed by an import.
    b. Private method LineDatabaseMappingResult lineDatabaseMapping(DbControl dc, List<PotentialIncaImportLine> potentialIncaImportLines, ...) updated with new arguments int numFiles, SimpleProgressReporter progress, float progressTestFraction, and int progressOffset, which are used to calculate and report progress percentage values.

comment:61 by olle, 8 years ago

(In [3920]) Refs #525. INCA importer updated to use progress reporter:

  1. JSP file import-inca.jsp in resources/personal/ updated by adding a <div id="wizard-progess"></div> div tag below the <div id="wizard-status"></div> tag.
  2. Javascript file import-inca.js in resources/personal/ updated in calls of Wizard.showLoadingAnimation(...) by adding a second argument 'inca-import-progress' with the name of the progress reporter.
  3. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by creating a SimpleProgressReporter item progress and storing it with the chosen name in the current session control. Calls to progress.display(...) are made at regular intervals. In order to make the progress percent values as accurate as possible, the fraction of time spent on mapping and value checking was estimated from test runs, and is adjusted depending on whether just a data check is performed, or if it is to be followed by an import.
    b. Private method LineDatabaseMappingResult lineDatabaseMapping(DbControl dc, List<PotentialIncaImportLine> potentialIncaImportLines, ...) updated with new arguments int numFiles, SimpleProgressReporter progress, float progressTestFraction, and int progressOffset, which are used to calculate and report progress percentage values.
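
Illustration only: a minimal Java sketch of how an offset and a time fraction can be combined into an overall progress percentage. The formula and the names ProgressSketch and percent are assumptions for this example; the committed calculation is not shown in this ticket.

```java
class ProgressSketch {
    // Maps "done of total" within one phase to an overall percentage.
    //   offset:   percent already completed by earlier phases
    //   fraction: estimated share (0..1) of the total time used by this phase
    static int percent(int offset, float fraction, int done, int total) {
        if (total <= 0) {
            return offset; // nothing to do in this phase
        }
        float phasePercent = 100f * fraction * done / total;
        return Math.min(100, offset + Math.round(phasePercent));
    }
}
```

For example, a phase that starts at 10 % and is estimated to take half of the total time gives percent(10, 0.5f, 500, 1000) = 35 when half of its lines have been processed. The resulting value would be passed to the progress reporter together with a message.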

comment:62 by olle, 8 years ago

(In [3921]) Refs #525. INCA importer updated in progress reporter to report progress during patient mapping, and to make progress reporting more continuous:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" for progress reporter to report progress during patient mapping, and to make progress reporting more continuous. Variable float progressBiosourceMappingFraction is used to store an estimate of the fraction of time spent mapping database patient items to personal numbers.
    b. Private method HashMap<String,Integer> fetchPersonalNumberBioSourceIdHashMap(DbControl dc) updated with new arguments SimpleProgressReporter progress, float progressBiosourceMappingFraction, and int progressOffset, which are used to calculate and report progress percentage values.

comment:63 by olle, 8 years ago

(In [3922]) Refs #525. INCA importer updated in progress reporter to report progress at start and end of actual import phase:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca" for progress reporter to report progress at start and end of actual import phase.

comment:64 by olle, 8 years ago

Design discussion:

  • Change set [3840] (2016-04-13) updated the INCA importer to handle internal line feed and tab characters, provided the *.csv file is saved with "Text delimiter" set to a double quote character '"' instead of an empty string (blank). This functionality was kept when the importer was adapted in change set [3889] etc. (2016-04-28) to work with a single INCA input file, obtained by feeding a special program with a raw INCA XML file and a SCAN-B *.csv request file. The idea was to have an INCA importer that could be used with both kinds of files, as replacing tabs with spaces inside sections within double quotes should not change anything if internal tabs had already been replaced by "\t" by the special conversion program.

    Unfortunately, inspection of new INCA files has shown the (pretty non-standard) custom of prefixing an entry value in a comment with a single double quote '"', leading to all tabs separating columns being replaced by spaces until the next double quote is found, resulting in chaos when lines are concatenated to obtain the number of columns in the header line. As this custom would not have worked with the original *.csv file creation procedure, unless original double quotes were escaped, the functionality added in change set [3840] will be removed, as new INCA import files will be created using the special program at INCA.

Design update:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" when reading an input INCA *.csv file to no longer replace tabs with spaces in sections enclosed by double quotes.
    b. Private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) removed, since it is no longer needed.
    c. Inner private helper class TrimmedLineItem of data access object type removed, since it is no longer needed.

comment:65 by olle, 8 years ago

(In [3923]) Refs #525. INCA importer updated to no longer replace tabs with spaces in sections enclosed by double quotes, since it is not needed when the new special program is used to produce INCA *.csv input files, and can cause problems if an odd number of double quotes is used inside a column entry:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" when reading an input INCA *.csv file to no longer replace tabs with spaces in sections enclosed by double quotes.
    b. Private method TrimmedLineItem tabDoubleQuoteTrim(TrimmedLineItem trimmedLineItem) removed, since it is no longer needed.
    c. Inner private helper class TrimmedLineItem of data access object type removed, since it is no longer needed.
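
Illustration only: a minimal Java reconstruction of the removed idea, to show why an unmatched double quote is harmful. The class QuoteTabSketch and its methods are hypothetical and work on a single line; the removed code is not reproduced here.

```java
class QuoteTabSketch {
    // Replaces tabs by spaces between a pair of double quotes. After an
    // unmatched quote the "inside" state stays on for the rest of the line.
    static String replaceTabsInsideQuotes(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        boolean inside = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inside = !inside;
            }
            sb.append(inside && c == '\t' ? ' ' : c);
        }
        return sb.toString();
    }

    // Number of tab-separated columns in a line (trailing empty columns are kept).
    static int columnCount(String line) {
        return line.split("\t", -1).length;
    }
}
```

A line with four columns where the second entry starts with a single double quote is reduced to two columns by replaceTabsInsideQuotes(...), since the tabs after the quote are treated as internal. This is the kind of column count mismatch described in the design discussion.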

comment:66 by olle, 8 years ago

(In [3924]) Refs #525. INCA importer updated in JSP file to only allow a single *.csv file to be selected for import:

  1. JSP file import-inca.jsp in resources/personal/ updated by no longer allowing multiple files to be selected in the "importfile" file selection field. Various text strings modified to be consistent with selection of a single import file.

comment:67 by olle, 8 years ago

(In [3925]) Refs #525. INCA importer updated to only use a single *.csv file for import. Unused functionality is removed, in order to simplify the code and user interface:

  1. Javascript file import-inca.js in resources/personal/ updated:
    a. Function initializeStep2(response) updated to obtain a JSONObject fileProp from JSON key "incaFileProperties", instead of a JSONArray filePropArray from JSON key "incaFilePropertiesArray".
    b. Function createTableHeader() updated to have no argument.
    c. Functions fetchTableRowStatus(...), createTableRow(...), and createTableRow2(...) updated to take first argument JSONObject instead of a JSONArray.
  2. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by using the first (= the only) input file for import. Data are now stored for later use in JSONObjects instead of JSONArrays.
    b. Private method LineDatabaseMappingResult lineDatabaseMapping(DbControl dc, List<PotentialIncaImportLine> potentialIncaImportLines, ...) updated by removing arguments int fileNo and int numFiles, since they are no longer needed for the progress report calculation.
    c. Private method String createIncaImportReportFile(JSONArray jsonIncaFilePropDetailsArr, List<String> missingIncaHeadersList, String message) updated by exchanging first argument JSONArray jsonIncaFilePropDetailsArr for JSONObject jsonIncaFilePropDetails.

comment:68 by olle, 8 years ago

(In [3926]) Refs #525. INCA importer updated by removing or commenting out debug output to server log file, unless the information concerns a caught exception or other error, not reported otherwise:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated by removing or commenting out debug output to server log file, unless the information concerns a caught exception or other error, not reported otherwise.

comment:69 by olle, 8 years ago

(In [3927]) Refs #525. INCA importer updated in method creating INCA import report file:

  1. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated:
    a. Protected method void doPost(HttpServletRequest req, HttpServletResponse resp) updated for command "ImportInca" by not expecting private method createIncaImportReportFile(JSONObject jsonIncaFilePropDetails, List<String> missingIncaHeadersList, String message) to return a string with the report file path (the information was never used).
    b. Private method createIncaImportReportFile(JSONObject jsonIncaFilePropDetails, List<String> missingIncaHeadersList, String message) updated to no longer return a string with the report file path, but instead have return type void. The file contents were also updated to refer to a single INCA file only.

comment:70 by olle, 8 years ago

(In [3928]) Refs #525. INCA importer updated in preparation for release of first version:

  1. JSP file index.jsp in resources/ updated by removing entries "experimental not-implemented" from class description for "inca-import" <span> tag.
  2. Java servlet class/file IncaServlet.java in src/net/sf/basedb/reggie/servlet/ updated in protected method void doPost(HttpServletRequest req, HttpServletResponse resp) for command "ImportInca" by moving code for creating a DbControl item and checking role permissions to the top, in order to increase similarity with other servlets.

comment:71 by olle, 8 years ago

Resolution: fixed
Status: changed from assigned to closed

Ticket closed as first version of INCA importer has been implemented.
