id summary reporter owner description type status priority milestone component resolution keywords cc
887 Release export wizard Nicklas Nordborg Nicklas Nordborg "Implement a wizard for creating all files that should be included in a release. The wizard should take an item list with raw bioassays as input and produce a lot of files. Unless noted, all files are tab-separated text files.
 * Transcript data (in folder `dataTables/transcriptDataTable`):
   - `tidmatrix.features.txt`: Array design features with some annotations. The first line is a header line:
     * `id`, `geneSymbol`, `refSeq`, `protAcc`, `description`, `chr`, `entrez`.
     * Rows are sorted by internal ID (see [comment:33 comment below] for more information)
     * All raw bioassays in the input list must use the same array design.
   - `tidmatrix_data.txt`: FPKM values for all raw bioassays. Each row represents a feature and each column a raw bioassay.
     * The first line is a header line with raw bioassay names.
     * The first column contains the feature ID.
     * Same row order as in `tidmatrix.features.txt`.
   - `tidmatrix_FPKM_conf_hi.txt`, `tidmatrix_FPKM_conf_lo.txt`, `tidmatrix_FPKM_status.txt`: More data files similar to the `tidmatrix_data.txt` file but with the `FPKM_conf_hi`, `FPKM_conf_lo` and `FPKM_status` values.
 * Gene data (in folder `dataTables/geneDataTable`):
   - `genematrix_data.txt`: Sum of FPKM values per gene symbol.
     * The first line is a header line with raw bioassay names.
     * The first column is the gene symbol (in no particular order).
   - `is.NM.gene.txt`: TRUE/FALSE flag for each gene indicating if the refSeq ID starts with `NM_` or not.
     * No header line.
     * Rows must be in the same order as in `genematrix_data.txt`.
     * First column is the line number (in this file; add 1 to get the line number in `genematrix_data.txt`).
     * Second column is `TRUE` or `FALSE`.
     * Third column is the gene symbol.
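The gene data files described above can be sketched roughly as follows. This is only an illustration in Python, not the wizard's actual implementation: the function name `build_gene_tables` and its input shapes are hypothetical, and the rule that a gene is flagged `TRUE` when at least one of its transcripts has a refSeq ID starting with `NM_` is an assumption, since the ticket does not say how genes with several transcripts are handled.

```python
def build_gene_tables(features, fpkm_by_feature):
    """Sum FPKM values per gene symbol and derive the NM_ flag rows.

    features: iterable of (feature_id, gene_symbol, ref_seq) tuples
        (hypothetical input shape).
    fpkm_by_feature: dict mapping feature_id -> list of FPKM values,
        one per raw bioassay, in the same column order for every feature.
    Returns (gene_rows, nm_rows) as lists of row values.
    """
    sums = {}    # gene symbol -> summed FPKM per raw bioassay
    is_nm = {}   # gene symbol -> True if any refSeq ID starts with NM_
    for feature_id, symbol, ref_seq in features:
        values = fpkm_by_feature[feature_id]
        if symbol not in sums:
            sums[symbol] = [0.0] * len(values)
            is_nm[symbol] = False
        sums[symbol] = [a + b for a, b in zip(sums[symbol], values)]
        # Assumption: one NM_ transcript is enough to flag the gene as TRUE.
        is_nm[symbol] = is_nm[symbol] or ref_seq.startswith("NM_")

    # Rows for genematrix_data.txt (without the header line).
    gene_rows = [[symbol] + totals for symbol, totals in sums.items()]
    # Rows for is.NM.gene.txt: line number, flag, gene symbol.
    # Line numbers start at 1 and follow the same gene order as gene_rows.
    nm_rows = [
        [line_no, "TRUE" if is_nm[symbol] else "FALSE", symbol]
        for line_no, symbol in enumerate(sums, start=1)
    ]
    return gene_rows, nm_rows
```

Both row lists share the same gene order, so line `n` in `is.NM.gene.txt` corresponds to line `n + 1` in `genematrix_data.txt` once the header line is written.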
 * Cohort data (in folder `cohortTables`): A set of tab-separated files with data for each raw bioassay and the parent items it is derived from. Each file starts with a header line. Each row contains data for one raw bioassay. The first column (`rba`) is always the name of the raw bioassay. Columns ending with `.A.` are annotation columns. Date values are formatted as `YYYY-MM-DD` unless otherwise noted.
   - `cohortRawbioassay.txt`: Data from the raw bioassay level. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of raw bioassay
     * `Platform`: Name of platform (Sequencing)
     * `Raw.data.type`: Name of raw data type (cufflinks)
     * `Has.data`: Flag indicating if there is raw data for this raw bioassay or not (TRUE/FALSE)
     * `Db.spots`: Number of raw data entries
     * `Array.design`: Name of the array design
     * `Software`: Name of the software used to generate the raw data
     * `Import.date`: Date the raw data was created
     * `AnalysisResult..A.`
     * `DataFilesFolder..A.`
     * `FPKM.tracking.file..F.`: Path to the `isoforms.fpkm_tracking` file in the BASE file system
   - `cohortAligned.txt`: Data from the `AlignedSequences` parent item. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (!AlignedSequences)
     * `Software`: Name of the software used for alignment
     * `Registered`: Date the item was registered in BASE
     * `AnalysisResult..A.`
     * `DataFilesFolder..A.`
     * `ALIGNED_PAIRS..A.`
     * `READ_PAIRS_EXAMINED..A.`
     * `READ_PAIR_DUPLICATES..A.`
     * `FRACTION_DUPLICATION..A.`
     * `FragmentSizeAvg..A.`
     * `FragmentSizeStdev..A.`
   - `cohortMasked.txt`: Data from the `MaskedSequences` parent item. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (!MaskedSequences)
     * `Software`: Name of the software used for masking
     * `Registered`: Date the item was registered in BASE
     * `PM_READS..A.`
   - `cohortMerged.txt`: Data from the `MergedSequences` parent item.
     Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (!MergedSequences)
     * `Physical.bioassays`: Name of the physical bioassay (flow cell) used for sequencing. Comma-separated list if there is more than one.
     * `Software`: Name of the software used for merging
     * `Registered`: Date the item was registered in BASE
     * `AnalysisResult..A.`
     * `DataFilesFolder..A.`
     * `READS..A.`
     * `PF_READS..A.`
     * `ADAPTER_READS..A.`
     * `PT_READS..A.`
     * `FragmentSizeAvg..A.`
     * `FragmentSizeStdev..A.`
   - `cohortSequencing.txt`: Data from the `SequencingRun` parent item. Columns:
     * TODO
   - `cohortLibrary.txt`: Data from the `Library` parent item. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (Library)
     * `Protocol`: Name of the library preparation protocol
     * `Created`: Date the library was created
     * `Tag`: Name of the barcode used by the library
     * `Bioplate`: Name of the library plate
     * `Biowell.row`: Row coordinate on the library plate (A-H)
     * `Biowell.column`: Column coordinate on the library plate (1-12)
     * `QubitConc..A.`
   - `cohortRNA.txt`: Data from the `RNA`/`RNAQC` parent item. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (RNA)
     * `Original.quantity..µg.`: µg RNA that was extracted
     * `NDConc..A.`
     * `ND260by230..A.`
     * `ND260by280..A.`
     * `QiacubeDate..A.`
     * `QiacubeRunNo..A.`
     * `QiacubePosition..A.`
     * `RNAQC_last`: RIN/RQS value from the latest quality control
   - `cohortLysate.txt`: Data from the `Lysate` parent item. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (Lysate)
     * `Created`: Date the lysate was created
     * `Original.quantity..µg.`: µg Lysate that was extracted
     * `Parent.items`: Name and used quantity from the specimen
     * `MultPieces..A.`
     * `PartitionDate..A.`
   - `cohortSample.txt`: Data from the `Specimen` parent item.
     Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (Specimen)
     * `Original.quantity..µg.`: Weight of sample received
     * `Created`: Date the sample was created (operation date)
     * `ArrivalDate..A.`
     * `BiopsyType..A.`
     * `SpecimenType..A.`
     * `Laterality..A.`
     * `NofDeliveredTubes..A.`
     * `NofPieces..A.`
     * `OperatorDeliveryComment..A.`
     * `OperatorPartitionComment..A.`
     * `SamplingDateTime..A.`
     * `RNALaterDateTime..A.`
     * `LinkedSpecimen..A.`
   - `cohortCase.txt`: Data from the `Case` parent item (except INCA data). Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (Case)
     * `Consent..A.`
     * `ConsentDate..A.`
     * `Laterality..A.`
   - `cohortPatient.txt`: Data from the `Patient` parent item. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (Patient)
     * `Gender..A.`
     * `Samples`: Comma-separated list with all child item names (cases and blood)
   - `cohortStained.txt`: Data from the `Stained` parent item. To find the correct item we need to descend from the specimen level (`Specimen -> Histology -> Stained`) and select the one which has `GoodStain=TRUE`. Columns:
     * `ID`: Internal ID in BASE
     * `Name`: Name of item
     * `Type`: Type of item (Case)
     * `Created`
     * `Registered`
     * `Bioplate`
     * `Biowell.row`
     * `Biowell.column`
     * `GoodStain..A.`
     * `ScoreComplete..A.`
     * `ScoreInvasiveCancer..A.`
     * `ScoreInsituCancer..A.`
     * `ScoreLymphocytes..A.`
     * `ScoreStroma..A.`
     * `ScoreFat..A.`
     * `ScoreNormal..A.`
   - `cohortINCA.txt`: Data from parent items (e.g. Case) that have been imported from the INCA registry. Columns:
     * `IncaExportDate..A.`
     * All other annotation types in the `INCA` category. The `INCA_` prefix is removed. No `..A.` is added in the header.
   - `cohortSummaryTable.txt`: A single table collecting some of the most useful information from the other tables. See the [attachment:ticket:887:summary_columns.txt summary columns] text file. The order of the columns in the file is not correct.
     They should have the same order as in the main data files.
 * Subtype data (in folder `cohortTables/subtypeTables`): Information generated by the R report scripts. We do not currently store this information in BASE, so it needs to be discussed how this should be done. The report plug-in could for example import the data from the R scripts as annotations.
 * README files
   - TODO" task closed critical Reggie v4.5 net.sf.basedb.reggie fixed
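The cohort table conventions in this ticket (the `rba` first column, the `..A.` suffix on annotation columns, the `INCA_` prefix rule, `YYYY-MM-DD` dates and `TRUE`/`FALSE` flags) can be sketched in Python as below. The helper names `annotation_header`, `format_value` and `cohort_table` are hypothetical and do not come from Reggie, and writing missing values as empty strings is an assumption, since the ticket does not specify it.

```python
from datetime import date


def annotation_header(name, in_inca_category=False):
    """Return the column header for an annotation type."""
    if in_inca_category:
        # INCA category: drop the INCA_ prefix and add no ..A. suffix.
        return name[len("INCA_"):] if name.startswith("INCA_") else name
    return name + "..A."


def format_value(value):
    """Format a single cell value for a cohort table."""
    if value is None:
        return ""  # Assumption: missing values become empty cells.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, date):
        return value.strftime("%Y-%m-%d")
    return str(value)


def cohort_table(rows, columns):
    """Render rows (dicts keyed by column name) as tab-separated text.

    The first column is always `rba`, the raw bioassay name.
    """
    header = ["rba"] + list(columns)
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(format_value(row.get(col)) for col in header))
    return "\n".join(lines) + "\n"
```

For example, `annotation_header("PM_READS")` gives `PM_READS..A.`, matching the column name listed for `cohortMasked.txt`.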