Opened 8 years ago

Last modified 8 years ago

#887 closed task

Release export wizard — at Version 3

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: critical
Milestone: Reggie v4.5
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Implement a wizard for creating all files that should be included in a release. The wizard should take an item list with raw bioassays as input and produce the files listed below. Unless noted, all files are tab-separated text files.

  • Transcript data (in folder dataTables/transcriptDataTable):
    • tidmatrix.features.txt: Array design features with some annotations. The first line is a header line:
      • id, geneSymbol, refSeq, protAcc, description, chr, entrez.
      • Rows are sorted by ID.
      • All raw bioassays in the input list must use the same array design.
    • tidmatrix_data.txt: FPKM values for all raw bioassays. Each row represents a feature and each column a raw bioassay.
      • The first line is a header line with raw bioassay names.
      • The first column contains the feature ID.
      • Same order of rows as the tidmatrix.features.txt.
    • tidmatrix_FPKM_conf_hi.txt, tidmatrix_FPKM_conf_lo.txt, tidmatrix_FPKM_status.txt: More data files similar to the tidmatrix_data.txt file but with the FPKM_conf_hi, FPKM_conf_lo and FPKM_status values.
  • Gene data (in folder dataTables/geneDataTable):
    • genematrix_data.txt: Sum of FPKM values per gene symbol.
      • The first line is a header line with raw bioassay names.
      • The first column is the gene symbol (not sorted alphabetically).
    • is.NM.gene.txt: TRUE/FALSE flag for each gene indicating if the refSeq ID starts with NM_ or not.
      • No header line.
      • First column is the line number in this file. Add 1 to get the corresponding line number in genematrix_data.txt, which has a header line.
      • Second column is TRUE or FALSE.
  • Cohort data (in folder cohortTables): A set of tab-separated files with data for each raw bioassay and the parent items it is derived from. Each file starts with a header line. Each row contains data for one raw bioassay. The first column (rba) is always the name of the raw bioassay.
    • cohortRawbioassay.txt: Data from the raw bioassay level. Columns:
      • ID: Internal ID in BASE
      • Name: Name of raw bioassay
      • Platform: Name of platform (Sequencing)
      • Raw.data.type: Name of raw data type (cufflinks)
      • Has.data: Flag indicating if there is raw data for this raw bioassay or not (TRUE/FALSE)
      • Db.spots: Number of raw data entries
      • Array.design: Name of the array design
      • Software: Name of the software used to generate the raw data
      • Import.date: Date the raw data was created (in YYYY-MM-DD format)
      • AnalysisResult..A.: Successful/Failed
      • DataFilesFolder..A.: Path to folder in project archive file server where data files are located
      • FPKM.tracking.file..F.: Path to the isoforms.fpkm_tracking file in the BASE file system
    • cohortAligned.txt: Data from the AlignedSequences parent item. Columns:
      • TODO
    • cohortMasked.txt: Data from the MaskedSequences parent item. Columns:
      • TODO
    • cohortMerged.txt: Data from the MergedSequences parent item. Columns:
      • TODO
    • cohortSequencing.txt: Data from the SequencingRun parent item. Columns:
      • TODO
    • cohortLibrary.txt: Data from the Library parent item. Columns:
      • TODO
    • cohortRNA.txt: Data from the RNA parent item. Columns:
      • TODO
    • cohortLysate.txt: Data from the Lysate parent item. Columns:
      • TODO
    • cohortSample.txt: Data from the Specimen parent item. Columns:
      • TODO
    • cohortCase.txt: Data from the Case parent item (except INCA data). Columns:
      • TODO
    • cohortPatient.txt: Data from the Patient parent item. Columns:
      • TODO
    • cohortStained.txt: Data from the Stained parent item. Columns:
      • TODO
    • cohortINCA.txt: Data from parent items (e.g. Case) that have been imported from the INCA registry. Columns:
      • TODO
    • cohortSummaryTable.txt: A single table collecting some of the most useful information from the other tables.
  • Subtype data (in folder cohortTables/subtypeTables): Information generated by the R report scripts. We do not currently store this information in BASE, so how to do that needs to be discussed. The report plug-in could, for example, import the data from the R scripts as annotations.
  • README files
    • TODO
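
As an illustration of the transcript data files described above, here is a minimal Python sketch of how the features file and a data matrix could be written so that both share the same row order. It is not the Reggie implementation. The function names, the in-memory dictionaries used as input, the `id` label in the first header cell of the matrix, and the empty cell for missing values are all assumptions made for the example.

```python
import csv
import io

# Header of tidmatrix.features.txt as listed in the description.
FEATURE_COLUMNS = ["id", "geneSymbol", "refSeq", "protAcc", "description", "chr", "entrez"]


def write_features(features, out):
    """Write the features table sorted by ID and return the IDs in written order.

    `features` is a hypothetical dict: feature ID -> dict of annotations.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(FEATURE_COLUMNS)
    ordered_ids = sorted(features)  # rows are sorted by ID
    for feature_id in ordered_ids:
        annotations = features[feature_id]
        # Missing annotations become empty cells (assumption).
        writer.writerow([feature_id] + [annotations.get(c, "") for c in FEATURE_COLUMNS[1:]])
    return ordered_ids


def write_matrix(ordered_ids, values_by_rba, out):
    """Write one data matrix (for example FPKM) using the row order of the features file.

    `values_by_rba` is a hypothetical dict: raw bioassay name -> {feature ID: value}.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    rba_names = list(values_by_rba)
    # Header line with raw bioassay names; the leading "id" label is an assumption.
    writer.writerow(["id"] + rba_names)
    for feature_id in ordered_ids:
        writer.writerow([feature_id] + [values_by_rba[name].get(feature_id, "") for name in rba_names])


if __name__ == "__main__":
    features_out, matrix_out = io.StringIO(), io.StringIO()
    ids = write_features({"T2": {"geneSymbol": "B"}, "T1": {"geneSymbol": "A"}}, features_out)
    write_matrix(ids, {"rba1": {"T1": 1.5, "T2": 0.0}}, matrix_out)
    print(features_out.getvalue())
    print(matrix_out.getvalue())
```

Returning the ordered IDs from the features writer and reusing them for every matrix is one simple way to guarantee that tidmatrix_data.txt and the FPKM_conf_hi, FPKM_conf_lo and FPKM_status files all follow the same row order as the features file.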

Change History (3)

comment:1 by Nicklas Nordborg, 8 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 8 years ago

Description: modified (diff)

For efficient calculations it is desirable to process the data gene symbol by gene symbol. Thus, the data must come sorted in gene symbol order.
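
A minimal Python sketch of this gene-by-gene processing is shown here, only to illustrate why the sort order matters. `itertools.groupby` groups adjacent rows only, so input sorted by gene symbol can be summed in a single pass without holding all data in memory. The function names, the tuple layout of the input rows, the `geneSymbol` label in the header, and the rule that a gene is flagged TRUE if any of its refSeq IDs starts with NM_ are assumptions, not something the ticket specifies.

```python
from itertools import groupby
from operator import itemgetter


def sum_by_gene(rows):
    """Sum FPKM values per gene symbol in one streaming pass.

    `rows` is a hypothetical iterable of (gene_symbol, refseq_id, fpkm_values)
    tuples that MUST already be sorted by gene symbol.
    Yields (gene_symbol, is_nm, summed_values).
    """
    for symbol, group in groupby(rows, key=itemgetter(0)):
        totals = None
        is_nm = False
        for _, refseq_id, values in group:
            # Assumption: TRUE if any refSeq ID of the gene starts with NM_.
            is_nm = is_nm or refseq_id.startswith("NM_")
            if totals is None:
                totals = list(values)
            else:
                totals = [t + v for t, v in zip(totals, values)]
        yield symbol, is_nm, totals


def write_gene_tables(rows, rba_names, data_out, flag_out):
    """Write genematrix_data.txt (with header) and is.NM.gene.txt (no header)."""
    data_out.write("\t".join(["geneSymbol"] + list(rba_names)) + "\n")
    for line_no, (symbol, is_nm, totals) in enumerate(sum_by_gene(rows), start=1):
        data_out.write("\t".join([symbol] + [str(t) for t in totals]) + "\n")
        # Line n here corresponds to line n + 1 in genematrix_data.txt (header line).
        flag_out.write(f"{line_no}\t{'TRUE' if is_nm else 'FALSE'}\n")
```

If the input is not sorted, the same gene symbol would show up in several groups and produce several partial sums, which is why the sort order is a hard requirement for this approach.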

comment:3 by Nicklas Nordborg, 8 years ago

Description: modified (diff)