Opened 9 years ago

Last modified 8 years ago

#887 closed task

Release export wizard — at Version 6

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: critical Milestone: Reggie v4.5
Component: net.sf.basedb.reggie Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Implement a wizard for creating all files that should be included in a release. The wizard should take an item list of raw bioassays as input and produce the files listed below. Unless noted otherwise, all files are tab-separated text files.

  • Transcript data (in folder dataTables/transcriptDataTable):
    • tidmatrix.features.txt: Array design features with some annotations. The first line is a header line:
      • id, geneSymbol, refSeq, protAcc, description, chr, entrez.
      • Rows are sorted by ID.
      • All raw bioassays in the input list must use the same array design.
    • tidmatrix_data.txt: FPKM values for all raw bioassays. Each row represents a feature and each column a raw bioassay.
      • The first line is a header line with raw bioassay names.
      • The first column contains the feature ID.
      • Same order of rows as the tidmatrix.features.txt.
    • tidmatrix_FPKM_conf_hi.txt, tidmatrix_FPKM_conf_lo.txt, tidmatrix_FPKM_status.txt: More data files similar to the tidmatrix_data.txt file but with the FPKM_conf_hi, FPKM_conf_lo and FPKM_status values.
  • Gene data (in folder dataTables/geneDataTable):
    • genematrix_data.txt: Sum of FPKM values per gene symbol.
      • The first line is a header line with raw bioassay names.
      • The first column is the gene symbol (in no particular alphabetical order).
    • is.NM.gene.txt: TRUE/FALSE flag for each gene indicating if the refSeq ID starts with NM_ or not.
      • No header line.
      • First column is the line number in this file (add 1 to get the corresponding line number in genematrix_data.txt).
      • Second column is TRUE or FALSE.
  • Cohort data (in folder cohortTables): A set of tab-separated files with data for each raw bioassay and the parent items it is derived from. Each file starts with a header line. Each row contains data for one raw bioassay. The first column (rba) is always the name of the raw bioassay. Columns ending with .A. are annotation columns. Date values are formatted as YYYY-MM-DD unless otherwise noted.
    • cohortRawbioassay.txt: Data from the raw bioassay level. Columns:
      • ID: Internal ID in BASE
      • Name: Name of raw bioassay
      • Platform: Name of platform (Sequencing)
      • Raw.data.type: Name of raw data type (cufflinks)
      • Has.data: Flag indicating if there is raw data for this raw bioassay or not (TRUE/FALSE)
      • Db.spots: Number of raw data entries
      • Array.design: Name of the array design
      • Software: Name of the software used to generate the raw data
      • Import.date: Date the raw data was created
      • AnalysisResult..A.
      • DataFilesFolder..A.
      • FPKM.tracking.file..F.: Path to the isoforms.fpkm_tracking file in the BASE file system
    • cohortAligned.txt: Data from the AlignedSequences parent item. Columns:
      • ID: Internal ID in BASE
      • Name: Name of item
      • Type: Type of item (AlignedSequences)
      • Software: Name of the software used for alignment
      • Registered: Date the item was registered in BASE
      • AnalysisResult..A.
      • DataFilesFolder..A.
      • ALIGNED_PAIRS..A.
      • READ_PAIRS_EXAMINED..A.
      • READ_PAIR_DUPLICATES..A.
    • cohortMasked.txt: Data from the MaskedSequences parent item. Columns:
      • ID: Internal ID in BASE
      • Name: Name of item
      • Type: Type of item (MaskedSequences)
      • Software: Name of the software used for masking
      • Registered: Date the item was registered in BASE
      • PM_READS..A.
    • cohortMerged.txt: Data from the MergedSequences parent item. Columns:
      • ID: Internal ID in BASE
      • Name: Name of item
      • Type: Type of item (MergedSequences)
      • Physical.bioassays: Name of the physical bioassay (flow cell) used for sequencing. Comma-separated list if there is more than one.
      • Software: Name of the software used for merging
      • Registered: Date the item was registered in BASE
      • AnalysisResult..A.
      • DataFilesFolder..A.
      • READS..A.
      • PF_READS..A.
      • ADAPTER_READS..A.
      • PT_READS..A.
      • FragmentSizeAvg..A.
      • FragmentSizeStdev..A.
    • cohortSequencing.txt: Data from the SequencingRun parent item. Columns:
      • TODO
    • cohortLibrary.txt: Data from the Library parent item. Columns:
      • ID: Internal ID in BASE
      • Name: Name of item
      • Type: Type of item (Library)
      • Protocol: Name of the library preparation protocol
      • Created: Date the library was created
      • Tag: Name of the barcode used by the library
      • Bioplate: Name of the library plate
      • Biowell.row: Row coordinate on the library plate (A-H)
      • Biowell.column: Column coordinate on the library plate (1-12)
      • QubitConc..A.
    • cohortRNA.txt: Data from the RNA parent item. Columns:
      • TODO
    • cohortLysate.txt: Data from the Lysate parent item. Columns:
      • TODO
    • cohortSample.txt: Data from the Specimen parent item. Columns:
      • TODO
    • cohortCase.txt: Data from the Case parent item (except INCA data). Columns:
      • TODO
    • cohortPatient.txt: Data from the Patient parent item. Columns:
      • TODO
    • cohortStained.txt: Data from the Stained parent item. Columns:
      • TODO
    • cohortINCA.txt: Data from parent items (e.g. Case) that have been imported from the INCA registry. Columns:
      • TODO
    • cohortSummaryTable.txt: A single table collecting some of the most useful information from the other tables.
  • Subtype data (in folder cohortTables/subtypeTables): Information generated by the R report scripts. We do not currently store this information in BASE, so we need to discuss how to handle it. The report plug-in could, for example, import the data from the R scripts as annotations.
  • README files
    • TODO
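
To make the is.NM.gene.txt layout described under "Gene data" concrete, here is a minimal Python sketch. It is not the Reggie implementation. The function names are hypothetical, and it assumes one representative refSeq ID per gene, given in the same order as the data rows of genematrix_data.txt, with line numbers starting at 1.

```python
def is_nm_rows(refseq_ids):
    """Build the lines of a file laid out like is.NM.gene.txt.

    Assumption: `refseq_ids` holds one refSeq ID per gene, in the same
    order as the data rows of genematrix_data.txt.
    """
    lines = []
    # No header line; line numbers are assumed to start at 1.
    for line_no, refseq in enumerate(refseq_ids, start=1):
        flag = "TRUE" if refseq.startswith("NM_") else "FALSE"
        lines.append(f"{line_no}\t{flag}")
    return lines


def genematrix_line(line_no):
    """Map a line number in is.NM.gene.txt to genematrix_data.txt.

    genematrix_data.txt has a header line, so its rows are shifted by one.
    """
    return line_no + 1


# Example IDs chosen only for illustration.
example = is_nm_rows(["NM_000546", "NR_046018", "NM_007294"])
```

The offset of one exists only because genematrix_data.txt starts with a header line while is.NM.gene.txt does not.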

Change History (6)

comment:1 by Nicklas Nordborg, 9 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 9 years ago

Description: modified (diff)

For efficient calculations it is desirable to process the data gene symbol by gene symbol. Thus, the data must come sorted in gene symbol order.
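
As an illustration of why sorted input helps, the following Python sketch sums FPKM values one gene symbol at a time. It is only a sketch under stated assumptions, not the wizard's actual code: the function name and the row shape (gene symbol plus a list of FPKM values, one per raw bioassay) are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sum_fpkm_per_gene(rows):
    """Yield (gene_symbol, summed FPKM values per raw bioassay).

    Assumption: `rows` is an iterable of (gene_symbol, [fpkm, ...]) tuples
    already sorted by gene symbol. groupby only merges adjacent rows, so
    unsorted input would yield the same symbol more than once.
    """
    for symbol, group in groupby(rows, key=itemgetter(0)):
        totals = None
        for _, values in group:
            if totals is None:
                totals = list(values)
            else:
                totals = [t + v for t, v in zip(totals, values)]
        yield symbol, totals


# Two transcripts for BRCA1 and one for TP53, two raw bioassays each.
rows = [
    ("BRCA1", [1.5, 0.0]),
    ("BRCA1", [2.5, 1.0]),
    ("TP53", [3.0, 4.0]),
]
result = list(sum_fpkm_per_gene(rows))
```

Because each gene's rows are adjacent, only the running totals for the current gene symbol need to be held in memory at any time.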

comment:3 by Nicklas Nordborg, 9 years ago

Description: modified (diff)

comment:4 by Nicklas Nordborg, 9 years ago

Status: new → assigned

comment:5 by Nicklas Nordborg, 9 years ago

Description: modified (diff)

comment:6 by Nicklas Nordborg, 9 years ago

Description: modified (diff)