#887 closed task

Release export wizard — at Version 3

Reported by:	Nicklas Nordborg	Owned by:	Nicklas Nordborg
Priority:	critical	Milestone:	Reggie v4.5
Component:	net.sf.basedb.reggie	Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Implement a wizard for creating all files that should be included in a release. The wizard should take an item list with raw bioassays as input and produce the files described below. Unless noted otherwise, all files are tab-separated text files.

Transcript data (in folder dataTables/transcriptDataTable):
- tidmatrix.features.txt: Array design features with some annotations. The first line is a header line:
  - id, geneSymbol, refSeq, protAcc, description, chr, entrez.
  - Rows are sorted by ID.
  - All raw bioassays in the input list must use the same array design.
- tidmatrix_data.txt: FPKM values for all raw bioassays. Each row represents a feature and each column a raw bioassay.
  - The first line is a header line with raw bioassay names.
  - The first column contains the feature ID.
  - Rows are in the same order as in tidmatrix.features.txt.
- tidmatrix_FPKM_conf_hi.txt, tidmatrix_FPKM_conf_lo.txt, tidmatrix_FPKM_status.txt: More data files similar to the tidmatrix_data.txt file but with the FPKM_conf_hi, FPKM_conf_lo and FPKM_status values.
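
The row-order requirement for the transcript files can be illustrated with a minimal Python sketch. It is not the Reggie implementation: the function name `transcript_tables` and the in-memory dictionaries are hypothetical, and the leading `id` cell in the data header is an assumption, since the description does not say whether the header includes a cell for the ID column.

```python
def transcript_tables(features, fpkm, bioassays):
    """Build the lines of the features file and the FPKM data file.

    features:  {feature_id: {annotation_name: value}}   (hypothetical input shape)
    fpkm:      {bioassay_name: {feature_id: fpkm_value}} (hypothetical input shape)
    bioassays: list of raw bioassay names, in column order
    """
    columns = ["geneSymbol", "refSeq", "protAcc", "description", "chr", "entrez"]
    ids = sorted(features)  # one sorted ID list drives both files

    feature_lines = ["\t".join(["id"] + columns)]
    # Assumption: the data header starts with an "id" cell for the first column.
    data_lines = ["\t".join(["id"] + list(bioassays))]

    for fid in ids:
        annotations = features[fid]
        feature_lines.append(
            "\t".join([fid] + [str(annotations.get(c, "")) for c in columns])
        )
        data_lines.append(
            "\t".join([fid] + [str(fpkm[b][fid]) for b in bioassays])
        )
    return feature_lines, data_lines
```

Iterating over one sorted ID list for both outputs keeps the rows aligned by construction. The FPKM_conf_hi, FPKM_conf_lo and FPKM_status files could presumably be produced the same way from their own value tables.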

Gene data (in folder dataTables/geneDataTable):
- genematrix_data.txt: Sum of FPKM values per gene symbol.
  - The first line is a header line with raw bioassay names.
  - The first column is the gene symbol (in alphabetical order).
- is.NM.gene.txt: TRUE/FALSE flag for each gene indicating if the refSeq ID starts with NM_ or not.
  - No header line.
  - First column is the line number in this file. Add 1 to get the corresponding line number in genematrix_data.txt, since that file starts with a header line.
  - Second column is TRUE or FALSE.
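
As a rough illustration of the is.NM.gene.txt layout, the Python sketch below numbers the genes in alphabetical order and writes a TRUE/FALSE flag per gene. The function names and the input dictionary are hypothetical. The description does not say how to flag a gene that has several refSeq IDs, so treating a gene as TRUE when any of its IDs starts with NM_ is an assumption.

```python
def is_nm_gene_lines(gene_refseqs):
    """Return the lines of is.NM.gene.txt (no header line).

    gene_refseqs: {gene_symbol: [refSeq IDs]}  (hypothetical input shape)
    """
    lines = []
    # Alphabetical order, matching the row order of genematrix_data.txt.
    for line_no, symbol in enumerate(sorted(gene_refseqs), start=1):
        # Assumption: one NM_ transcript is enough to flag the gene as TRUE.
        flag = any(r.startswith("NM_") for r in gene_refseqs[symbol])
        lines.append(f"{line_no}\t{'TRUE' if flag else 'FALSE'}")
    return lines


def genematrix_line_number(is_nm_line_no):
    """Map a line number in is.NM.gene.txt to genematrix_data.txt (+1 for its header)."""
    return is_nm_line_no + 1
```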

Cohort data (in folder cohortTables): A set of tab-separated files with data for each raw bioassay and the parent items it is derived from. Each file starts with a header line. Each row contains data for one raw bioassay. The first column (rba) is always the name of the raw bioassay.
- cohortRawbioassay.txt: Data from the raw bioassay level. Columns:
  - ID: Internal ID in BASE
  - Name: Name of raw bioassay
  - Platform: Name of platform (Sequencing)
  - Raw.data.type: Name of raw data type (cufflinks)
  - Has.data: Flag indicating if there is raw data for this raw bioassay or not (TRUE/FALSE)
  - Db.spots: Number of raw data entries
  - Array.design: Name of the array design
  - Software: Name of the software used to generate the raw data
  - Import.date: Date the raw data was created (in YYYY-MM-DD format)
  - AnalysisResult..A.: Successful/Failed
  - DataFilesFolder..A.: Path to folder in project archive file server where data files are located
  - FPKM.tracking.file..F.: Path to the isoforms.fpkm_tracking file in the BASE file system
- cohortAligned.txt: Data from the AlignedSequences parent item. Columns:
  - TODO
- cohortMasked.txt: Data from the MaskedSequences parent item. Columns:
  - TODO
- cohortMerged.txt: Data from the MergedSequences parent item. Columns:
  - TODO
- cohortSequencing.txt: Data from the SequencingRun parent item. Columns:
  - TODO
- cohortLibrary.txt: Data from the Library parent item. Columns:
  - TODO
- cohortRNA.txt: Data from the RNA parent item. Columns:
  - TODO
- cohortLysate.txt: Data from the Lysate parent item. Columns:
  - TODO
- cohortSample.txt: Data from the Specimen parent item. Columns:
  - TODO
- cohortCase.txt: Data from the Case parent item (except INCA data). Columns:
  - TODO
- cohortPatient.txt: Data from the Patient parent item. Columns:
  - TODO
- cohortStained.txt: Data from the Stained parent item. Columns:
  - TODO
- cohortINCA.txt: Data from parent items (e.g. Case) that have been imported from the INCA registry. Columns:
  - TODO
- cohortSummaryTable.txt: A single table collecting some of the most useful information from the other tables.
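
The cohortRawbioassay.txt layout listed above could be written along the following lines. This is a sketch in Python rather than the wizard's actual code: the function name and the record dictionaries are hypothetical, and placing the `rba` column ahead of the listed columns follows the statement that the first column is always the raw bioassay name.

```python
import csv
import io
from datetime import date

# Column order: "rba" first, then the columns listed for cohortRawbioassay.txt.
COLUMNS = [
    "rba", "ID", "Name", "Platform", "Raw.data.type", "Has.data", "Db.spots",
    "Array.design", "Software", "Import.date", "AnalysisResult..A.",
    "DataFilesFolder..A.", "FPKM.tracking.file..F.",
]


def cohort_rawbioassay_text(records):
    """Return the file content as a string; one row per raw bioassay.

    records: list of dicts keyed by column name (hypothetical input shape),
    with a bool for Has.data and a datetime.date for Import.date.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for rec in records:
        row = dict(rec)
        row["Has.data"] = "TRUE" if rec["Has.data"] else "FALSE"
        row["Import.date"] = rec["Import.date"].isoformat()  # YYYY-MM-DD
        # Missing values are written as empty cells.
        writer.writerow([row.get(c, "") for c in COLUMNS])
    return out.getvalue()
```

The other cohort tables would presumably follow the same pattern once their columns are defined.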

Subtype data (in folder cohortTables/subtypeTables): Information generated by the R report scripts. We do not currently store this information in BASE, so how to make it available needs to be discussed. The report plug-in could, for example, import the data from the R scripts as annotations.

README files
- TODO

Change History (3)

comment:1 by Nicklas Nordborg, 9 years ago

Description:	modified (diff)

comment:2 by Nicklas Nordborg, 9 years ago

Description:	modified (diff)

For efficient calculations it is desirable to process the data gene symbol by gene symbol. Thus, the data must come sorted in gene symbol order.
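
One way to picture the gene-by-gene processing is the Python sketch below, which sums FPKM values per gene symbol while holding only one gene's rows in memory at a time. The function name and the row shape are hypothetical, and the use of `itertools.groupby` is an illustration of why sorted input matters, not a description of how Reggie does it.

```python
from itertools import groupby
from operator import itemgetter


def sum_fpkm_per_gene(rows):
    """Yield (gene_symbol, summed FPKM values) for each gene.

    rows: iterable of (gene_symbol, [fpkm per raw bioassay]) tuples
    (hypothetical shape) that is already sorted by gene symbol.
    """
    # groupby only merges adjacent rows, which is why sorted input is required.
    for symbol, group in groupby(rows, key=itemgetter(0)):
        totals = None
        for _, values in group:
            if totals is None:
                totals = list(values)
            else:
                totals = [t + v for t, v in zip(totals, values)]
        yield symbol, totals
```

If the input were not sorted, the same gene symbol would be yielded more than once, each time with a partial sum.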

comment:3 by Nicklas Nordborg, 9 years ago

Description:	modified (diff)
