Release export wizard
— at Version 5
Implement a wizard for creating all files that should be included in a release. The wizard should take an item list with raw bioassays as input and produce the files described below. Unless noted, all files are tab-separated text files.
- Transcript data (in folder dataTables/transcriptDataTable):
tidmatrix.features.txt: Array design features with some annotations. The first line is a header line: id, geneSymbol, refSeq, protAcc, description, chr, entrez.
- Rows are sorted by ID.
- All raw bioassays in the input list must use the same array design.
tidmatrix_data.txt: FPKM values for all raw bioassays. Each row represents a feature and each column a raw bioassay.
- The first line is a header line with raw bioassay names.
- The first column contains the feature ID.
- Same order of rows as in tidmatrix.features.txt.
tidmatrix_FPKM_conf_hi.txt, tidmatrix_FPKM_conf_lo.txt, tidmatrix_FPKM_status.txt: More data files similar to the tidmatrix_data.txt file but with the FPKM_conf_hi, FPKM_conf_lo and FPKM_status values.
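The row-order requirement for the matrix files can be illustrated with a minimal Python sketch. It is not the wizard's actual implementation: the function name `write_matrix` and the example data are hypothetical, and it assumes that the header line holds only the raw bioassay names (no label above the feature ID column) and that the caller supplies the feature IDs in the same order as tidmatrix.features.txt.

```python
import csv
import io


def write_matrix(out, bioassay_names, feature_ids, values_by_feature):
    """Write one tab-separated matrix file (e.g. tidmatrix_data.txt).

    feature_ids must already be in the row order used by
    tidmatrix.features.txt, so that both files line up row by row.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Assumption: the header holds only the raw bioassay names,
    # with no label above the feature ID column.
    writer.writerow(bioassay_names)
    for feature_id in feature_ids:
        # First column is the feature ID, then one value per raw bioassay.
        writer.writerow([feature_id, *values_by_feature[feature_id]])


# Hypothetical example data: two raw bioassays, two features.
buf = io.StringIO()
write_matrix(buf, ["rba1", "rba2"], ["f1", "f2"],
             {"f2": [3.0, 4.0], "f1": [1.0, 2.0]})
print(buf.getvalue())
```

The same function could be reused for the FPKM_conf_hi, FPKM_conf_lo and FPKM_status files by passing a different value mapping, which keeps the row order identical across all matrix files.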
- Gene data (in folder dataTables/geneDataTable):
genematrix_data.txt: Sum of FPKM values per gene symbol.
- The first line is a header line with raw bioassay names.
- The first column is the gene symbol (in alphabetical order).
is.NM.gene.txt: TRUE/FALSE flag for each gene indicating whether the refSeq ID starts with NM_ or not.
- No header line.
- First column is the line number in this file. Add 1 to get the corresponding line number in genematrix_data.txt, which has a header line.
- Second column is TRUE or FALSE.
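The layout of is.NM.gene.txt could be produced along the following lines. This is only a sketch: the function name `write_is_nm_gene` and the example identifiers are hypothetical, and the rule for genes with several refSeq IDs (TRUE if any of them starts with NM_) is an assumption, since the specification does not say how multiple IDs are combined.

```python
import io


def write_is_nm_gene(out, gene_symbols, refseq_by_gene):
    """Write the is.NM.gene.txt rows: line number, then TRUE or FALSE.

    gene_symbols must be in the same order as the rows of
    genematrix_data.txt. Because that file has a header line and this
    one does not, line N here corresponds to line N + 1 there.
    """
    for line_no, symbol in enumerate(gene_symbols, start=1):
        refseq_ids = refseq_by_gene.get(symbol, ())
        # Assumption: a gene counts as NM if ANY of its refSeq IDs
        # starts with "NM_".
        is_nm = any(r.startswith("NM_") for r in refseq_ids)
        out.write(f"{line_no}\t{'TRUE' if is_nm else 'FALSE'}\n")


# Hypothetical example data.
buf = io.StringIO()
write_is_nm_gene(buf, ["GENE_A", "GENE_B"],
                 {"GENE_A": ["NM_000001"], "GENE_B": ["NR_000002"]})
print(buf.getvalue())
```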
- Cohort data (in folder cohortTables): A set of tab-separated files with data for each raw bioassay and the parent items it is derived from. Each file starts with a header line. Each row contains data for one raw bioassay. The first column (rba) is always the name of the raw bioassay. Columns ending with .A. are annotation columns. Date values are formatted as YYYY-MM-DD unless otherwise noted.
cohortRawbioassay.txt: Data from the raw bioassay level. Columns:
- ID: Internal ID in BASE
- Name: Name of raw bioassay
- Platform: Name of platform (Sequencing)
- Raw.data.type: Name of raw data type (cufflinks)
- Has.data: Flag indicating if there is raw data for this raw bioassay or not (TRUE/FALSE)
- Db.spots: Number of raw data entries
- Array.design: Name of the array design
- Software: Name of the software used to generate the raw data
- Import.date: Date the raw data was created
- AnalysisResult..A.
- DataFilesFolder..A.
- FPKM.tracking.file..F.: Path to the isoforms.fpkm_tracking file in the BASE file system
cohortAligned.txt: Data from the AlignedSequences parent item. Columns:
- ID: Internal ID in BASE
- Name: Name of item
- Type: Type of item (AlignedSequences)
- Software: Name of the software used for alignment
- Registered: Date the item was registered in BASE
- AnalysisResult..A.
- DataFilesFolder..A.
- ALIGNED_PAIRS..A.
- READ_PAIRS_EXAMINED..A.
- READ_PAIR_DUPLICATES..A.
cohortMasked.txt: Data from the MaskedSequences parent item. Columns:
- ID: Internal ID in BASE
- Name: Name of item
- Type: Type of item (MaskedSequences)
- Software: Name of the software used for masking
- Registered: Date the item was registered in BASE
- PM_READS..A.
cohortMerged.txt: Data from the MergedSequences parent item. Columns:
cohortSequencing.txt: Data from the SequencingRun parent item. Columns:
cohortLibrary.txt: Data from the Library parent item. Columns:
cohortRNA.txt: Data from the RNA parent item. Columns:
cohortLysate.txt: Data from the Lysate parent item. Columns:
cohortSample.txt: Data from the Specimen parent item. Columns:
cohortCase.txt: Data from the Case parent item (except INCA data). Columns:
cohortPatient.txt: Data from the Patient parent item. Columns:
cohortStained.txt: Data from the Stained parent item. Columns:
cohortINCA.txt: Data from parent items (e.g. Case) that have been imported from the INCA registry. Columns:
cohortSummaryTable.txt: A single table collecting some of the most useful information from the other tables.
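The shared conventions for the cohort tables (header line, rba as first column, TRUE/FALSE flags, YYYY-MM-DD dates) might be handled by one common writer. The sketch below is illustrative only: `format_cell` and `write_cohort_table` are hypothetical names, rows are assumed to be plain dictionaries, and writing missing values as empty cells is an assumption not stated in the specification.

```python
import csv
import io
from datetime import date


def format_cell(value):
    """Convert one value to its text form in a cohort table."""
    if value is None:
        return ""  # Assumption: missing values become empty cells.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, date):
        # Also covers datetime objects; the time part is dropped.
        return value.strftime("%Y-%m-%d")
    return str(value)


def write_cohort_table(out, columns, rows):
    """Write a header line and then one row per raw bioassay.

    The first column (rba) is always the raw bioassay name; the
    remaining columns follow in the order given by `columns`.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["rba", *columns])
    for row in rows:
        cells = [format_cell(row.get(col)) for col in columns]
        writer.writerow([row["rba"], *cells])


# Hypothetical example row using column names from cohortRawbioassay.txt.
buf = io.StringIO()
write_cohort_table(
    buf,
    ["Has.data", "Db.spots", "Import.date"],
    [{"rba": "R1", "Has.data": True, "Db.spots": 42,
      "Import.date": date(2015, 3, 9)}],
)
print(buf.getvalue())
```

Checking for bool before any numeric handling matters in Python, because bool is a subclass of int and would otherwise be written as True or False rather than TRUE or FALSE.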
- Subtype data (in folder cohortTables/subtypeTables): Information generated by the R report scripts. We do not currently store this information in BASE, so how to store it still needs to be discussed. The report plug-in could, for example, import the data from the R scripts as annotations.
For efficient calculations it is desirable to process the data gene symbol by gene symbol. Thus, the data must come sorted in gene symbol order.
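Why the sort order matters can be seen in a small Python sketch of the per-gene summation. With input sorted by gene symbol, all rows for one gene are adjacent, so each gene can be summed and emitted before the next one is read, without holding the whole matrix in memory. The function name `sum_fpkm_per_gene` and the example values are hypothetical; this is not the wizard's actual implementation.

```python
from itertools import groupby
from operator import itemgetter


def sum_fpkm_per_gene(rows):
    """Yield (gene_symbol, summed FPKM values) one gene at a time.

    rows is an iterable of (gene_symbol, [FPKM per raw bioassay])
    that MUST already be sorted by gene symbol; groupby only merges
    adjacent rows with the same key.
    """
    for symbol, group in groupby(rows, key=itemgetter(0)):
        totals = None
        for _, values in group:
            if totals is None:
                totals = list(values)
            else:
                totals = [t + v for t, v in zip(totals, values)]
        yield symbol, totals


# Hypothetical example: two transcripts for gene A, one for gene B.
rows = [
    ("A", [1.0, 2.0]),
    ("A", [0.5, 0.5]),
    ("B", [3.0, 4.0]),
]
result = list(sum_fpkm_per_gene(rows))
print(result)
```

If the input were not sorted, the same gene symbol would be emitted more than once, which is the failure mode the sorting requirement guards against.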