Release export wizard
— at Version 29
Implement a wizard for creating all files that should be included in a release. The wizard should take an item list with raw bioassays as input and produce the files described below. Unless noted, all files are tab-separated text files.
- Transcript data (in folder dataTables/transcriptDataTable):
tidmatrix.features.txt: Array design features with some annotations. The first line is a header line: id, geneSymbol, refSeq, protAcc, description, chr, entrez.
- Rows are sorted by ID.
- All raw bioassays in the input list must use the same array design.
tidmatrix_data.txt: FPKM values for all raw bioassays. Each row represents a feature and each column a raw bioassay.
- The first line is a header line with raw bioassay names.
- The first column contains the feature ID.
- Rows are in the same order as in tidmatrix.features.txt.
tidmatrix_FPKM_conf_hi.txt, tidmatrix_FPKM_conf_lo.txt, tidmatrix_FPKM_status.txt: More data files similar to the tidmatrix_data.txt file but with the FPKM_conf_hi, FPKM_conf_lo and FPKM_status values.
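The layout of the transcript tables can be illustrated with a minimal Python sketch. It is not the wizard's implementation: the function names, the dict-based feature records and the `values` lookup are hypothetical, and it assumes that the header of a data matrix holds only the raw bioassay names (no label above the feature ID column), which the text above does not state explicitly.

```python
FEATURE_COLUMNS = ["id", "geneSymbol", "refSeq", "protAcc", "description", "chr", "entrez"]


def features_lines(features):
    """Return the lines of the features file, with rows sorted by feature ID."""
    ordered = sorted(features, key=lambda f: f["id"])
    lines = ["\t".join(FEATURE_COLUMNS)]
    for f in ordered:
        # Missing annotations are written as empty cells (an assumption).
        lines.append("\t".join(str(f.get(col, "")) for col in FEATURE_COLUMNS))
    return lines


def matrix_lines(features, bioassay_names, values):
    """Return the lines of one data matrix (for example the FPKM values).

    values maps (feature_id, bioassay_name) to the value to export.
    Rows use the same ID order as features_lines().
    """
    ordered_ids = sorted(f["id"] for f in features)
    lines = ["\t".join(bioassay_names)]  # assumption: header holds names only
    for fid in ordered_ids:
        cells = [str(values[(fid, name)]) for name in bioassay_names]
        lines.append("\t".join([str(fid)] + cells))
    return lines
```

Calling `matrix_lines` once per value type (FPKM, FPKM_conf_hi, FPKM_conf_lo, FPKM_status) with the same feature list would give all four matrices the same row order as the features file.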
- Gene data (in folder dataTables/geneDataTable):
genematrix_data.txt: Sum of FPKM values per gene symbol.
- The first line is a header line with raw bioassay names.
- The first column is the gene symbol (in no particular alphabetical order).
is.NM.gene.txt: TRUE/FALSE flag for each gene indicating whether the refSeq ID starts with NM_ or not.
- No header line.
- First column is the line number in this file (add 1 to get the line number in genematrix_data.txt).
- Second column is TRUE or FALSE.
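The relation between the two gene files can be sketched as follows. This is an illustration only: the function names and the tuple layout are hypothetical, and it assumes that a gene is flagged TRUE when at least one of its refSeq IDs starts with NM_, which the text does not spell out for genes with several transcripts.

```python
def genematrix_lines(bioassay_names, genes):
    """genes: list of (gene_symbol, refseq_ids, summed_fpkm) in output order."""
    lines = ["\t".join(bioassay_names)]  # header line with raw bioassay names
    for symbol, _refseq_ids, sums in genes:
        lines.append("\t".join([symbol] + [str(v) for v in sums]))
    return lines


def is_nm_lines(genes):
    """One line per gene: line number in this file, then TRUE or FALSE."""
    lines = []
    for line_no, (_symbol, refseq_ids, _sums) in enumerate(genes, start=1):
        # Assumption: any NM_ refSeq ID is enough to flag the gene as TRUE.
        flag = any(r.startswith("NM_") for r in refseq_ids)
        lines.append(f"{line_no}\t{'TRUE' if flag else 'FALSE'}")
    return lines
```

Because genematrix_data.txt has a header line and is.NM.gene.txt does not, line n of the flag file describes the gene on line n + 1 of the matrix file.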
- Cohort data (in folder cohortTables): A set of tab-separated files with data for each raw bioassay and the parent items it is derived from. Each file starts with a header line. Each row contains data for one raw bioassay. The first column (rba) is always the name of the raw bioassay. Columns ending with .A. are annotation columns. Date values are formatted as YYYY-MM-DD unless otherwise noted.
cohortRawbioassay.txt: Data from the raw bioassay level. Columns:
ID: Internal ID in BASE
Name: Name of raw bioassay
Platform: Name of platform (Sequencing)
Raw.data.type: Name of raw data type (cufflinks)
Has.data: Flag indicating if there is raw data for this raw bioassay or not (TRUE/FALSE)
Db.spots: Number of raw data entries
Array.design: Name of the array design
Software: Name of the software used to generate the raw data
Import.date: Date the raw data was created
AnalysisResult..A.
DataFilesFolder..A.
FPKM.tracking.file..F.: Path to the isoforms.fpkm_tracking file in the BASE file system
cohortAligned.txt: Data from the AlignedSequences parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (AlignedSequences)
Software: Name of the software used for alignment
Registered: Date the item was registered in BASE
AnalysisResult..A.
DataFilesFolder..A.
ALIGNED_PAIRS..A.
READ_PAIRS_EXAMINED..A.
READ_PAIR_DUPLICATES..A.
FRACTION_DUPLICATION..A.
FragmentSizeAvg..A.
FragmentSizeStdev..A.
cohortMasked.txt: Data from the MaskedSequences parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (MaskedSequences)
Software: Name of the software used for masking
Registered: Date the item was registered in BASE
PM_READS..A.
cohortMerged.txt: Data from the MergedSequences parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (MergedSequences)
Physical.bioassays: Name of the physical bioassay (flow cell) used for sequencing. Comma-separated list if there is more than one.
Software: Name of the software used for merging
Registered: Date the item was registered in BASE
AnalysisResult..A.
DataFilesFolder..A.
READS..A.
PF_READS..A.
ADAPTER_READS..A.
PT_READS..A.
FragmentSizeAvg..A.
FragmentSizeStdev..A.
cohortSequencing.txt: Data from the SequencingRun parent item. Columns:
cohortLibrary.txt: Data from the Library parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (Library)
Protocol: Name of the library preparation protocol
Created: Date the library was created
Tag: Name of the barcode used by the library
Bioplate: Name of the library plate
Biowell.row: Row coordinate on the library plate (A-H)
Biowell.column: Column coordinate on the library plate (1-12)
QubitConc..A.
cohortRNA.txt: Data from the RNA/RNAQC parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (RNA)
Original.quantity..µg.: µg RNA that was extracted
NDConc..A.
ND260by230..A.
ND260by280..A.
QiacubeDate..A.
QiacubeRunNo..A.
QiacubePosition..A.
RNAQC_last: RIN/RQS value from the latest quality control
cohortLysate.txt: Data from the Lysate parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (Lysate)
Created: Date the lysate was created
Original.quantity..µg.: µg Lysate that was extracted
Parent.items: Name and used quantity from the specimen
MultPieces..A.
PartitionDate..A.
cohortSample.txt: Data from the Specimen parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (Specimen)
Original.quantity..µg.: Weight of sample received
Created: Date the sample was created (operation date)
ArrivalDate..A.
BiopsyType..A.
SpecimenType..A.
Laterality..A.
NofDeliveredTubes..A.
NofPieces..A.
OperatorDeliveryComment..A.
OperatorPartitionComment..A.
SamplingDateTime..A.
RNALaterDateTime..A.
LinkedSpecimen..A.
cohortCase.txt: Data from the Case parent item (except INCA data). Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (Case)
Consent..A.
ConsentDate..A.
Laterality..A.
cohortPatient.txt: Data from the Patient parent item. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (Patient)
Gender..A.
Samples: Comma-separated list with all child item names (cases and blood)
cohortStained.txt: Data from the Stained parent item. To find the correct item we need to descend from the specimen level (Specimen -> Histology -> Stained) and select the one which has GoodStain=TRUE. Columns:
ID: Internal ID in BASE
Name: Name of item
Type: Type of item (Stained)
Created
Registered
Bioplate
Biowell.row
Biowell.column
GoodStain..A.
ScoreComplete..A.
ScoreInvasiveCancer..A.
ScoreInsituCancer..A.
ScoreLymphocytes..A.
ScoreStroma..A.
ScoreFat..A.
ScoreNormal..A.
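The lookup of the Stained item (Specimen -> Histology -> Stained, keeping the one with GoodStain=TRUE) can be sketched like this. The nested-dict item model and the function name are hypothetical stand-ins for the real BASE items, and returning the first match is an assumption, since the text does not say what should happen if several Stained items are flagged as good.

```python
def find_good_stained(specimen):
    """Descend Specimen -> Histology -> Stained and return the good stain.

    Items are modelled as dicts with "type", "annotations" and "children"
    keys (a hypothetical stand-in for the BASE item model).
    Returns None if no Stained item has GoodStain=TRUE.
    """
    for histology in specimen.get("children", []):
        if histology.get("type") != "Histology":
            continue
        for stained in histology.get("children", []):
            if stained.get("type") != "Stained":
                continue
            if stained.get("annotations", {}).get("GoodStain") is True:
                return stained  # assumption: the first good stain wins
    return None
```

A raw bioassay whose specimen has no good stain would then get empty cells in cohortStained.txt, if the file should still contain one row per raw bioassay.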
cohortINCA.txt: Data from parent items (e.g. Case) that have been imported from the INCA registry. Columns:
IncaExportDate..A.
- All other annotation types in the INCA category. The INCA_ prefix is removed. No ..A. is added in the header.
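The header rule for cohortINCA.txt can be sketched as below. The function name is hypothetical, and the sketch assumes that the caller passes the names of the INCA-category annotation types without IncaExportDate, which is listed separately above.

```python
INCA_PREFIX = "INCA_"


def inca_header(annotation_type_names):
    """Build the header for cohortINCA.txt from INCA annotation type names."""
    # rba is always the first column; IncaExportDate keeps its ..A. suffix.
    header = ["rba", "IncaExportDate..A."]
    for name in annotation_type_names:
        if name.startswith(INCA_PREFIX):
            name = name[len(INCA_PREFIX):]  # drop the INCA_ prefix
        header.append(name)  # no ..A. suffix for these columns
    return header
```

For example, hypothetical annotation types `INCA_ExampleA` and `INCA_ExampleB` would produce the header columns `rba`, `IncaExportDate..A.`, `ExampleA` and `ExampleB`.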
cohortSummaryTable.txt: A single table collecting some of the most useful information from the other tables. See the summary columns text file. The order of the columns in the file is not correct. They should have the same order as in the main data files.
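The conventions shared by the cohort tables (header line, rba as first column, ..A. suffix on annotation columns, YYYY-MM-DD dates) can be sketched as follows. The row structure and function names are hypothetical, and writing missing values as empty cells is an assumption that the text does not confirm.

```python
import datetime


def format_value(value):
    """Format one cell: dates as YYYY-MM-DD, flags as TRUE/FALSE."""
    if value is None:
        return ""  # assumption: missing values become empty cells
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, datetime.date):  # also matches datetime.datetime
        return value.strftime("%Y-%m-%d")
    return str(value)


def cohort_table_lines(rows, columns, annotations):
    """rows: dicts with "rba", "fields" and "annotations" entries."""
    header = ["rba"] + columns + [name + "..A." for name in annotations]
    lines = ["\t".join(header)]
    for row in rows:
        cells = [row["rba"]]
        cells += [format_value(row["fields"].get(c)) for c in columns]
        cells += [format_value(row["annotations"].get(a)) for a in annotations]
        lines.append("\t".join(cells))
    return lines
```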
- Subtype data (in folder cohortTables/subtypeTables): Information generated by the R report scripts. We do not currently store this information in BASE, so it needs to be discussed how this should be done. The report plug-in could for example import the data from the R scripts as annotations.
For efficient calculations it is desirable to process the data gene symbol by gene symbol. Thus, the data must come sorted in gene symbol order.
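Why sorted input matters can be shown with a small sketch: when rows arrive in gene symbol order, the per-gene sums can be computed in a single pass that only keeps one gene in memory at a time. The function name and the `(gene_symbol, values)` row shape are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sum_by_gene(sorted_rows):
    """Yield (gene_symbol, summed_values) from rows sorted by gene symbol.

    sorted_rows yields (gene_symbol, values) tuples, where values holds one
    FPKM value per raw bioassay. groupby only merges adjacent rows, so
    unsorted input would yield the same gene symbol more than once.
    """
    for symbol, group in groupby(sorted_rows, key=itemgetter(0)):
        total = None
        for _symbol, values in group:
            if total is None:
                total = list(values)
            else:
                total = [a + b for a, b in zip(total, values)]
        yield symbol, total
```

With unsorted input the same result would require holding the running sums for every gene in memory until the last row has been read.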