#1473 closed task (fixed)

Implement alignment for the DNA WGS pipeline

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v4.46
Component: net.sf.basedb.reggie Keywords:
Cc:

Description

See #1464 for background.

The first alignment step is done separately for each sample. bwa-mem2 should be used for alignment from FASTQ files. After the alignment, several samtools steps are used for sorting, merging, etc. The end result should be a single BAM file with aligned and unaligned sequences. We will probably also produce some statistics, and some values will be imported as annotations into BASE.

Change History (38)

comment:1 by Nicklas Nordborg, 20 months ago

In 7081:

References #1473: Implement alignment for the DNA WGS pipeline

Added container definition with bwa-mem2.

comment:2 by Nicklas Nordborg, 20 months ago

In 7082:

References #1473: Implement alignment for the DNA WGS pipeline

Added a new section "DNA-seq secondary analysis" to the start page.

comment:3 by Nicklas Nordborg, 20 months ago

In 7083:

References #1473: Implement alignment for the DNA WGS pipeline

Added an item list for merged sequences that should be aligned with bwa-mem2. The counter on the start page is linked to the number of members in the list.

comment:4 by Nicklas Nordborg, 20 months ago

In 7084:

References #1473: Implement alignment for the DNA WGS pipeline

Started to implement the wizard for starting bwa-mem2 alignment. The interface should work but it is not submitting any jobs or creating any script or child items.

comment:5 by Nicklas Nordborg, 20 months ago

In 7085:

References #1473: Implement alignment for the DNA WGS pipeline

Added some configuration options to reggie-config.xml and created a JOB subtype for BWA-mem.

comment:6 by Nicklas Nordborg, 20 months ago

In 7086:

References #1473: Implement alignment for the DNA WGS pipeline

Started to implement job submission for BWA-mem. A script is generated but it doesn't do anything yet except check some parameters.

comment:7 by Nicklas Nordborg, 20 months ago

In 7087:

References #1473: Implement alignment for the DNA WGS pipeline

The job submission code now creates a text file (fastq_info.txt) with names of existing FASTQ files linked to the merged item. The text file has two columns with paired R1 and R2 files.
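
For illustration only, a minimal Python sketch of how such a two-column file could be produced. This is not the Reggie code: the function names, the '_R1'/'_R2' naming convention and the tab separator are all assumptions made for the example.

```python
from pathlib import Path


def pair_fastq_files(names):
    """Pair R1/R2 FASTQ names; returns a sorted list of (r1, r2) tuples.

    Assumption: mates differ only by an '_R1'/'_R2' token in the file name.
    """
    available = set(names)
    pairs = []
    for r1 in sorted(available):
        if "_R1" not in r1:
            continue
        r2 = r1.replace("_R1", "_R2", 1)
        if r2 in available:  # keep only pairs where both files exist
            pairs.append((r1, r2))
    return pairs


def write_fastq_info(pairs, path):
    """Write one 'R1<TAB>R2' line per pair (the separator is an assumption)."""
    lines = [f"{r1}\t{r2}\n" for r1, r2 in pairs]
    Path(path).write_text("".join(lines))
```

An R1 file without a matching R2 file is skipped in this sketch; the ticket does not say how the real code handles that case.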

comment:8 by Nicklas Nordborg, 20 months ago

In 7088:

References #1473: Implement alignment for the DNA WGS pipeline

Started to implement the actual script for aligning with bwa-mem2.
The major parts of the pipeline should be working and produce a single alignment.bam file. There are no statistics (except what markdup creates) and no result files are copied back to the project archive.

comment:9 by Nicklas Nordborg, 20 months ago

In 7089:

References #1473: Implement alignment for the DNA WGS pipeline

Copy result files back to the project archive. File links are created in BASE.

Added some statistics that are also imported as annotations.

comment:10 by Nicklas Nordborg, 20 months ago

In 7090:

References #1473: Implement alignment for the DNA WGS pipeline

Added Picard and GATK to the container since we will need them for genotyping and maybe also for other statistics.

comment:11 by Nicklas Nordborg, 20 months ago

In 7091:

References #1473: Implement alignment for the DNA WGS pipeline

Added genotype QC step to the script.

comment:12 by Nicklas Nordborg, 20 months ago

In 7092:

References #1473: Implement alignment for the DNA WGS pipeline

First version of the manual confirmation wizard. It is very similar to the confirmation for Hisat except that things that don't apply have been removed.

comment:13 by Nicklas Nordborg, 20 months ago

In 7093:

References #1473: Implement alignment for the DNA WGS pipeline

Added a read group parameter that is similar to what we use in the RNA-seq pipeline.

comment:14 by Nicklas Nordborg, 20 months ago

In 7098:

References #1473: Implement alignment for the DNA WGS pipeline

Updated the configuration file to use the selected reference index: GRCh38_full_analysis_set_plus_decoy_hla.fa

comment:15 by Nicklas Nordborg, 20 months ago

In 7099:

References #1473: Implement alignment for the DNA WGS pipeline

Added two new annotations LaneNumber and ReadNumber. They should be used on FASTQ files. Also enabled some already existing annotations to be used on FASTQ files: READS, SerialNumber, SequencingRunNumber and FlowCellID.

This is information that we typically can extract from the header in FASTQ files and we intend to use the flow cell and lane information for the PU read-group tag.
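
As a rough sketch of the extraction, assuming the standard Illumina (Casava 1.8+) header layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`. The function name is hypothetical, and mapping the instrument id to SerialNumber is an assumption, not something stated in this ticket.

```python
def parse_fastq_header(header):
    """Extract annotation values from an Illumina (Casava 1.8+) read header."""
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line")
    # The part before the first space locates the read on the flow cell;
    # the part after it starts with the read number (1 or 2).
    location, _, description = header[1:].strip().partition(" ")
    fields = location.split(":")
    if len(fields) < 4:
        raise ValueError("unexpected header layout")
    info = {
        "SerialNumber": fields[0],  # assumed to be the instrument id
        "SequencingRunNumber": int(fields[1]),
        "FlowCellID": fields[2],
        "LaneNumber": int(fields[3]),
    }
    if description:
        info["ReadNumber"] = int(description.split(":")[0])
    return info
```

Only the first header in a file would need to be inspected, since all reads in a per-lane FASTQ file normally share these values.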

comment:16 by Nicklas Nordborg, 20 months ago

In 7100:

References #1473: Implement alignment for the DNA WGS pipeline

Better implementation for read group values now. The script will set the ID: tag to a unique number (1, 2, 3...) for each pair. The sample and library names are used for the SM: and LB: tags. If there is information about flow cell and lane on the FASTQ files, this goes into the PU: tag. Other tags, for example CN: and PL:, can be specified in the ReadGroupStatic environment variable in reggie-config.xml.
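
The tag assembly described above could look roughly like this in Python. It is a sketch, not the actual script: the function name is made up, the 'flowcell.lane' layout of the PU: value is an assumption (the ticket only says that flow cell and lane go into PU:), and ReadGroupStatic is assumed to hold whitespace-separated TAG:value entries.

```python
def build_read_group(index, sample, library, flow_cell=None, lane=None, static_tags=""):
    """Return a tab-separated @RG header line for one FASTQ pair."""
    tags = [f"ID:{index}", f"SM:{sample}", f"LB:{library}"]
    # PU is only added when both flow cell and lane are known.
    if flow_cell is not None and lane is not None:
        tags.append(f"PU:{flow_cell}.{lane}")  # assumed layout
    # Static tags such as CN: and PL: are appended unchanged.
    tags.extend(static_tags.split())
    return "\t".join(["@RG"] + tags)
```

For example, `build_read_group(1, "S1", "L1", "HXXX", 3, "CN:lab PL:ILLUMINA")` yields a line with the ID, SM, LB, PU, CN and PL tags in that order.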

comment:17 by Nicklas Nordborg, 20 months ago

In 7101:

References #1473: Implement alignment for the DNA WGS pipeline

LaneNumber and ReadNumber updated to be enumerations 1..8 and 1..2.

comment:18 by Nicklas Nordborg, 20 months ago

In 7102:

References #1473: Implement alignment for the DNA WGS pipeline

Remove '##contig' entries from the qc_genotype.vcf file since they add a lot of lines to the file and they are not needed for our QC functionality.
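
A minimal Python sketch of the same filtering, for illustration only; the function name is hypothetical and the ticket does not say how the actual script removes the lines.

```python
def strip_contig_lines(lines):
    """Drop '##contig' header lines from VCF content, keeping everything else."""
    return [line for line in lines if not line.startswith("##contig")]
```

All other header lines (including '##fileformat' and the '#CHROM' column line) and all variant records pass through unchanged.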

comment:19 by Nicklas Nordborg, 19 months ago

In 7103:

References #1473: Implement alignment for the DNA WGS pipeline

Replaced 'picard-slim' with the full 'picard' since the CollectInsertSizeMetrics tool requires R. It turned out that R also needs 'which', so that was added as well, also to some other definition file templates.

comment:20 by Nicklas Nordborg, 19 months ago

In 7104:

References #1473: Implement alignment for the DNA WGS pipeline

Added two new annotation types for storing information about optical duplicates.

comment:21 by Nicklas Nordborg, 19 months ago

In 7105:

References #1473: Implement alignment for the DNA WGS pipeline

Added picard CollectInsertSizeMetrics to the statistics step in the bwa-mem2.sh script.

Re-arranged output folders so that statistics that we want to save are created in the 'stats' directory.

Import information about optical duplicates and fragment sizes.

comment:22 by Nicklas Nordborg, 19 months ago

In 7106:

References #1473: Implement alignment for the DNA WGS pipeline

Added some more picard steps to the analysis:

  • CollectWgsMetrics
  • CollectAlignmentSummaryMetrics
  • CollectQualityYieldMetrics


Some values are imported as annotations but we may want some more.

comment:23 by Nicklas Nordborg, 19 months ago

In 7107:

References #1473: Implement alignment for the DNA WGS pipeline

Implemented support for setting parameters that depend on the sequencer platform. We currently need different values for the -d parameter to samtools markdup. 2500 for NovaSeq and 100 for HiSeq.

Since we don't have flow cell items we once again expect this annotation to be set on the FASTQ file. If there are multiple FASTQ files we use the value from the first one. If the annotation is missing on all FASTQ files we assume that they are from NovaSeq.
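
The selection logic can be sketched as follows in Python. The names are hypothetical, and "the first one" is interpreted here as the first FASTQ file that has a value, which is an assumption.

```python
# Values for the -d parameter to samtools markdup, as given in this ticket.
MARKDUP_DISTANCE = {"NovaSeq": 2500, "HiSeq": 100}
DEFAULT_PLATFORM = "NovaSeq"  # used when no FASTQ file has the annotation


def markdup_distance(platforms):
    """Pick the -d value from per-FASTQ platform annotations (None if missing)."""
    platform = next((p for p in platforms if p), DEFAULT_PLATFORM)
    try:
        return MARKDUP_DISTANCE[platform]
    except KeyError:
        raise ValueError(f"no markdup distance configured for {platform!r}") from None
```

Raising an error for an unknown platform is a choice made for this sketch; the real implementation may fall back to a default instead.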

comment:24 by Nicklas Nordborg, 19 months ago

In 7108:

References #1473: Implement alignment for the DNA WGS pipeline

Added ReadLength annotation that should be used on FASTQ files. We need the read length in the picard CollectWgsMetrics step. If no read length exists we use 150 as the default.

comment:25 by Nicklas Nordborg, 19 months ago

In 7109:

References #1473: Implement alignment for the DNA WGS pipeline

Changed number of threads for most samtools steps to 4.

comment:26 by Nicklas Nordborg, 19 months ago

In 7110:

References #1473: Implement alignment for the DNA WGS pipeline

All statistics steps are now started in the background so that they can run at the same time.
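
The pattern is the usual "start everything, then wait for all of it". A Python sketch of that pattern, with a hypothetical function name; the real script is a shell script and presumably uses background jobs instead.

```python
import subprocess


def run_in_background(commands):
    """Start every command at once, then wait; returns {name: exit code}.

    `commands` maps a step name to its argument list.
    """
    # Popen returns immediately, so all steps are running before we wait.
    running = {name: subprocess.Popen(argv) for name, argv in commands.items()}
    return {name: proc.wait() for name, proc in running.items()}
```

Collecting the exit code of each step makes it possible to fail the whole job if any single statistics step fails.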

comment:27 by Nicklas Nordborg, 19 months ago

In 7111:

References #1473: Implement alignment for the DNA WGS pipeline

Use the fast algorithm when collecting wgs metrics. It saves about 1/3 of the time (or ~45min). Counts and numbers are not exactly the same. There are minor differences in the last decimals.

comment:28 by Nicklas Nordborg, 19 months ago

In 7112:

References #1473: Implement alignment for the DNA WGS pipeline

Making a copy instead of a link to the source FASTQ files seems to make the alignment faster. It should also be less sensitive to network instability.

Adjusting progress percentages.

comment:29 by Nicklas Nordborg, 19 months ago

In 7113:

References #1473: Implement alignment for the DNA WGS pipeline

Added PF_BASES and PF_Q30_BASES annotations.

comment:30 by Nicklas Nordborg, 19 months ago

In 7114:

References #1473: Implement alignment for the DNA WGS pipeline

Updated the confirmation wizard with some more information.

comment:31 by Nicklas Nordborg, 19 months ago

In 7119:

References #1473: Implement alignment for the DNA WGS pipeline

Added auto-confirmation functionality.

comment:32 by Nicklas Nordborg, 19 months ago

In 7120:

References #1473: Implement alignment for the DNA WGS pipeline

Display one decimal in most of the values in the confirmation wizard.

comment:33 by Nicklas Nordborg, 19 months ago

In 7124:

References #1473: Implement alignment for the DNA WGS pipeline

Fixed an issue that was causing the generated path in the BASE file system to be incorrect for Blood items that have a '.b' instead of a digit after the main SCAN-B id.

comment:34 by Nicklas Nordborg, 19 months ago

In 7131:

References #1473: Implement alignment for the DNA WGS pipeline

Added 'Bwa-mem2' to the AlignmentType annotation and use that for filtering software.

comment:35 by Nicklas Nordborg, 19 months ago

In 7132:

References #1473: Implement alignment for the DNA WGS pipeline

Added a description for the auto-confirm option about limits. Changed the mean coverage limit to 25.
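
A minimal Python sketch of such a limit check. Only the value 25 comes from this ticket; the names are hypothetical, and treating the limit as a minimum that the mean coverage must reach is an assumption.

```python
MEAN_COVERAGE_LIMIT = 25.0  # limit value from this ticket


def can_auto_confirm(mean_coverage, limit=MEAN_COVERAGE_LIMIT):
    """True if the alignment may be confirmed without manual review.

    Assumption: the limit is a lower bound, and a missing value
    always requires manual confirmation.
    """
    return mean_coverage is not None and mean_coverage >= limit
```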

comment:36 by Nicklas Nordborg, 19 months ago

In 7133:

References #1473: Implement alignment for the DNA WGS pipeline

Removed 'beta' from the container name.

comment:37 by Nicklas Nordborg, 19 months ago

In 7134:

References #1473: Implement alignment for the DNA WGS pipeline

Do not display warnings for debug jobs.

comment:38 by Nicklas Nordborg, 19 months ago

Resolution: fixed
Status: new → closed