Opened 10 months ago

Last modified 10 months ago

#1562 closed task

Implement a wizard or procedure for registering FASTQ files for WGS — at Version 18

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Reggie v4.52
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

We are sending samples to an external lab for WGS. When we receive the FASTQ files with the results, we need a simple procedure or wizard that does NOT involve several manual steps and batch imports from Excel files.

Right now we have aliquots from DNA (normal and tumor) and we need to create items and annotations:

  • Library items
    • Pipeline
    • ExternalOperator
  • MergedSequences items
    • Pipeline
    • DataFilesFolder
    • READS
  • FASTQ file items
    • FlowCellID
    • FlowCellType
    • LaneNumber
    • READS
    • ReadLength
    • SequencingRunNumber
    • SerialNumber

Most information can be extracted from the FASTQ files. We are not sure if the FASTQ files have merged data from different sequencing runs. In that case we may want to split them per flow cell and lane so that they are compatible with existing data.
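As a rough sketch of how this extraction could look, the Python snippet below reads the header lines of a gzipped FASTQ file and collects the distinct instrument, run, flow cell and lane combinations. It assumes the external lab delivers Illumina-style headers (`@instrument:run:flowcell:lane:tile:x:y ...`), which this ticket does not confirm. The names `ReadOrigin`, `parse_header` and `origins_in_file` are hypothetical, and the mapping of the instrument field to SerialNumber is a guess. More than one entry in the result would indicate that the file has merged data from several flow cells or lanes.

```python
import gzip
from typing import NamedTuple


class ReadOrigin(NamedTuple):
    """Where a read came from (field-to-annotation mapping is an assumption)."""
    serial_number: str           # instrument field, assumed to map to SerialNumber
    sequencing_run_number: int   # assumed to map to SequencingRunNumber
    flow_cell_id: str            # assumed to map to FlowCellID
    lane_number: int             # assumed to map to LaneNumber


def parse_header(header: str) -> ReadOrigin:
    """Parse an Illumina-style FASTQ header line (assumed layout)."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")
    # Drop the leading '@' and the optional comment after the first space.
    fields = header[1:].split(" ", 1)[0].split(":")
    if len(fields) < 4:
        raise ValueError(f"unexpected header layout: {header!r}")
    instrument, run, flow_cell, lane = fields[:4]
    return ReadOrigin(instrument, int(run), flow_cell, int(lane))


def origins_in_file(path: str) -> set[ReadOrigin]:
    """Collect the distinct origins found in a gzipped FASTQ file."""
    found = set()
    with gzip.open(path, "rt") as fastq:
        for line_no, line in enumerate(fastq):
            # FASTQ records are four lines long; the first one is the header.
            if line_no % 4 == 0:
                found.add(parse_header(line.rstrip("\n")))
    return found
```

Scanning a complete WGS file this way is slow, so a real implementation would probably do it as part of the job that copies or recompresses the files.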

UPDATE: It is better to make this procedure more like the FASTQ import on the RNAseq side. This means that a first wizard reads information from a CSV file (provided by the external operator) that we use to create child Library and DemuxedSequences items. A second wizard then processes the FASTQ files, puts them in the right place and creates the MergedSequences item. Automatic processing and failures can be handled just as we are used to from the RNAseq pipeline (e.g. delete the MergedSequences item). So we also need:

  • DemuxedSequences items
    • Pipeline
    • DataFilesFolder
    • RawFASTQ

Change History (18)

comment:1 by Nicklas Nordborg, 10 months ago

In 7533:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Started to implement a wizard. It will ask for a file that should contain information about the DNA aliquots we should register FASTQ files for. There is no real functionality behind it yet. This may change in the future if we can somehow perform a file scan for FASTQ files and determine aliquot names automatically.

Once we have the aliquot names it should be possible to create child Library and MergedSequences items and hopefully also the related FASTQ file items.

comment:2 by Nicklas Nordborg, 10 months ago

In 7534:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Started to implement the FASTQ registration. The general outline is done, but a lot of details remain and depend on the actual delivery of the FASTQ files.

comment:3 by Nicklas Nordborg, 10 months ago

In 7535:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Added some checks that aliquots are from the expected pipelines (DNA/Tumor/WGS and DNA/Normal/WGS).

The script seems to work, but so far the path to the FASTQ files is hardcoded. The results are not yet handled.

comment:4 by Nicklas Nordborg, 10 months ago

In 7536:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Added a container for the import script that includes the pgzip Python package, which supports multi-threaded compression to improve speed.

comment:5 by Nicklas Nordborg, 10 months ago

In 7537:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Creating and linking resulting FASTQ files.

comment:6 by Nicklas Nordborg, 10 months ago

In 7538:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Started a refactoring where the register and import wizard is split into two separate wizards. The first wizard will register the delivered FASTQ files and create child items (library, merged, etc). The second wizard will start the job for importing the FASTQ files to their correct location. The split is needed because the first wizard needs to work on a batch of items and will probably use a sample sheet as input. The import step works on one sample at a time and should support restarting in case of failure. This would not have been possible with only a batch import.

comment:7 by Nicklas Nordborg, 10 months ago

In 7539:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Major changes in the first registration wizard. This will now scan a file server (WgsImportGateway) for existing manifest files. It will only be used for creating child items and adding the merged sequences items to the item list for actual import processing of the FASTQ files.

Code has been moved to a new servlet (WgsImportServlet) but has not yet been removed from the FastqServlet.

comment:8 by Nicklas Nordborg, 10 months ago

In 7540:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Added checks for MD5 values. The MD5 is now stored as part of the filename in the RawFASTQ annotation.

comment:9 by Nicklas Nordborg, 10 months ago

In 7541:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Added the possibility to view the manifest file in a popup.

comment:10 by Nicklas Nordborg, 10 months ago

In 7542:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Started with the import wizard. It will load merged items in the queue but creating jobs is not yet implemented.

comment:11 by Nicklas Nordborg, 10 months ago

In 7543:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Implemented a check for existing FASTQ files. It should not be possible to start the import until the FASTQ files have been delivered.

comment:12 by Nicklas Nordborg, 10 months ago

In 7544:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Added a size check for the FASTQ files. If the sizes differ a lot, it may be due to a download that has not yet completed or was interrupted.
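A minimal sketch of such a size check is shown below. It assumes that the comparison is between the paired R1 and R2 files of one sample and that a relative difference above 20% counts as suspicious. Neither the pairing nor the threshold is stated in this ticket, and the function names are hypothetical.

```python
import os


def sizes_look_consistent(size_r1: int, size_r2: int,
                          max_relative_diff: float = 0.2) -> bool:
    """Return True if the two sizes differ by at most max_relative_diff
    of the larger one (the 20% default is an assumed threshold)."""
    larger = max(size_r1, size_r2)
    if larger == 0:
        return False  # an empty file is never an acceptable delivery
    return abs(size_r1 - size_r2) / larger <= max_relative_diff


def check_pair(path_r1: str, path_r2: str) -> bool:
    """Compare the on-disk sizes of a paired R1/R2 FASTQ delivery."""
    return sizes_look_consistent(os.path.getsize(path_r1),
                                 os.path.getsize(path_r2))
```

A size check like this is only a cheap early warning. The MD5 check is what actually confirms that a file arrived intact.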

comment:13 by Nicklas Nordborg, 10 months ago

In 7545:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

The import wizard has been updated to create jobs for the selected merged items. It seems to work as it should.

comment:14 by Nicklas Nordborg, 10 months ago

In 7546:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

The MD5 check should be in binary mode (but I don't think it matters, since the md5sum we use always runs in binary mode). Improved error handling in case the MD5 doesn't match.
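For illustration only, this is how a binary-mode MD5 check could be written with Python's standard `hashlib` module. The actual import script calls `md5sum`, so this is not the code from the changeset, and `md5_of_file` and `verify_md5` are hypothetical names.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in binary mode."""
    digest = hashlib.md5()
    # "rb" makes sure no newline translation changes the bytes being hashed.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> None:
    """Raise ValueError with a descriptive message if the MD5 doesn't match."""
    actual = md5_of_file(path)
    if actual != expected.strip().lower():
        raise ValueError(
            f"MD5 mismatch for {path}: expected {expected.strip()}, got {actual}"
        )
```

Reading in chunks keeps memory use constant, which matters for FASTQ files that can be tens of gigabytes.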

comment:15 by Nicklas Nordborg, 10 months ago

In 7547:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

The manifest registration wizard has been modified to create DemuxedSequences instead of MergedSequences. This will make it easier to handle failed imports that need to be re-executed, since we can use the same pattern of deleting items and restarting from the demux. This is similar to how the FASTQ import works in the RNAseq pipeline.

comment:16 by Nicklas Nordborg, 10 months ago

In 7548:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

The import wizard has now been updated to use demuxed sequences instead of merged.

comment:17 by Nicklas Nordborg, 10 months ago

In 7549:

References #1562: Implement a wizard or procedure for registering FASTQ files for WGS

Extract the project reference number by taking the name of the top-level directory. This is saved as the ExternalRef annotation on the library items so that we can keep track of each batch.

The FASTQ file names are now stored with just the name and MD5. The path to the directory is stored in the DataFilesFolder annotation.
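The path handling described above could be sketched as follows. The snippet assumes that delivered files are addressed by a path relative to the import root, with the project reference as the first path component. The example path, the function name and the dictionary keys are made up for illustration, and whether DataFilesFolder holds exactly this directory is also an assumption.

```python
from pathlib import PurePosixPath


def split_delivery_path(relative_path: str) -> dict:
    """Split a delivered file path into project reference, folder and file name.

    Assumes a layout like '<project ref>/<sub dirs...>/<file>' relative to
    the import root (hypothetical layout).
    """
    path = PurePosixPath(relative_path)
    if path.is_absolute() or len(path.parts) < 2:
        raise ValueError(f"expected '<project>/.../<file>', got {relative_path!r}")
    return {
        "ExternalRef": path.parts[0],         # name of the top-level directory
        "DataFilesFolder": str(path.parent),  # directory holding the file
        "file_name": path.name,               # stored without the directory
    }
```

Keeping the directory in one annotation and only the bare file name (plus MD5) in the other avoids repeating the same long path for every FASTQ file in a batch.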

comment:18 by Nicklas Nordborg, 10 months ago

Description: modified (diff)