Load Data - Genometric/GeMSE GitHub Wiki

The following video explains how to load data in GeMSE.

Interface

Load Data

GeMSE operates on genomic features in an interval-based format, such as ChIP-seq peaks. It takes two kinds of input: data (e.g., see this file) and metadata (e.g., see this file). The data and metadata files reside in the same directory, and the metadata file has the same name as the data file with a .meta postfix. For instance, the data file DEX_005nM_01h_GR.bed has the metadata file DEX_005nM_01h_GR.bed.meta. Both kinds of files are discussed below.
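The naming convention can be illustrated with a short Python sketch. This is not GeMSE's own code; the function name `metadata_path_for` and the `samples/` directory are hypothetical and serve only to show how the metadata filename is derived from the data filename.

```python
from pathlib import Path


def metadata_path_for(data_path):
    """Return the companion metadata path: same directory, same name plus '.meta'."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".meta")


# Hypothetical location of the example data file named above.
meta = metadata_path_for("samples/DEX_005nM_01h_GR.bed")
print(meta.name)
```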

Metadata

The metadata files provide explanatory information about the data. They are optional and are loaded automatically if they are present and follow the naming convention above. If they are not present, the data is still loaded, but without explanatory information.

The metadata is a plain text file, where each line in the file is an attribute-value pair. For instance:

Treatment	DEX_5nM 1h
Antibody or target protein	GR (sc-1003)

where Treatment and Antibody or target protein are the attributes, and DEX_5nM 1h and GR (sc-1003) are their respective values. See this file for a complete example.
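As a rough illustration of the attribute-value layout, the following Python sketch parses such lines into a dictionary. It assumes that a single tab separates the attribute from its value; the function name `parse_metadata` is hypothetical, and this is not how GeMSE itself is implemented.

```python
def parse_metadata(text):
    """Parse attribute-value lines, assuming a tab separates attribute and value."""
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        attribute, _, value = line.partition("\t")
        metadata[attribute.strip()] = value.strip()
    return metadata


example = "Treatment\tDEX_5nM 1h\nAntibody or target protein\tGR (sc-1003)\n"
print(parse_metadata(example))
# {'Treatment': 'DEX_5nM 1h', 'Antibody or target protein': 'GR (sc-1003)'}
```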

A user does not need to specify a metadata file or load it separately. If a metadata file is available and follows the naming convention above, GeMSE loads it automatically when loading the corresponding data file.

Data

The data is a set of intervals, each abstracting a characteristic of a region on the genome (e.g., ChIP-seq peaks in BED files, or mutations in VCF files). GeMSE loads data from tab-delimited files such as BED, ENCODE broadPeak and narrowPeak, GTF, or an arbitrary CSV file. For instance, see this file. The data files can generally be grouped into two categories: BED-like and GTF. GeMSE reads three required columns: chromosome, start, and stop. The additional values it reads depend on the file type, as discussed below.

BED-like; e.g., BED, broadPeak, narrowPeak, and CSV.

For such file types, GeMSE requires at least one numerical column. When such file types are selected, GeMSE shows the following window.

Load CSV file
Figure 1.
  • Section 1 shows the first few lines (15 lines) of the selected sample.
  • Section 2 provides an option for specifying the number of header lines that should not be read.
  • Section 3 lets the user configure how the parser reads each numerical column. The parser requires the chromosome, start, and stop to be the first, second, and third columns, respectively. In the example shown, the fourth column contains names of the p-values, which is not a numeric attribute. To skip this column, the checkbox Value is numeric, and read the column should be unchecked. The buttons Previous Column and Next Column switch between columns. In this example, Next Column allows configuring the parser for the fifth column, which is the X-squared. This column holds a numeric attribute of the regions, so it should be read by checking the Value is numeric, and read the column checkbox. Assigning a label to each column is optional. However, since GeMSE uses this label to refer to the data read from the column, providing a short explanatory label is recommended.
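The settings in Sections 2 and 3 can be pictured as parameters of a small parser. The Python sketch below is only an illustration under stated assumptions, not GeMSE's implementation: the function `read_bed_like`, its parameters, and the sample lines are hypothetical, and `numeric_columns` maps a zero-based column index to the label chosen by the user.

```python
def read_bed_like(lines, header_lines=0, numeric_columns=None):
    """Read intervals from tab-delimited lines.

    header_lines    -- number of leading lines to skip (Section 2).
    numeric_columns -- {zero-based column index: label} for columns to read (Section 3).
    """
    numeric_columns = numeric_columns or {}
    features = []
    for line in lines[header_lines:]:
        fields = line.rstrip("\n").split("\t")
        # Chromosome, start, and stop must be the first three columns.
        feature = {
            "chromosome": fields[0],
            "start": int(fields[1]),
            "stop": int(fields[2]),
        }
        for index, label in numeric_columns.items():
            feature[label] = float(fields[index])
        features.append(feature)
    return features


# Hypothetical sample: the fourth column is a name (skipped), the fifth is numeric.
sample = [
    "chr\tstart\tstop\tname\tX-squared\n",
    "chr1\t100\t200\tpeak_1\t3.5\n",
    "chr1\t300\t450\tpeak_2\t7.25\n",
]
features = read_bed_like(sample, header_lines=1, numeric_columns={4: "x_squared"})
print(features[0])
# {'chromosome': 'chr1', 'start': 100, 'stop': 200, 'x_squared': 3.5}
```

Leaving the fourth column out of `numeric_columns` corresponds to unchecking the checkbox for that column.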

These settings are used for loading all the selected samples. Samples in different formats are loaded separately. Once the settings are configured, clicking the OK button makes GeMSE cache the selected samples. It then reports the results in a window like the following.

Load CSV file
Figure 2.

This window provides an overview of Loaded Samples, and any error messages in the Details tab. The Load Samples tab shows the number of lines in each sample and the number of loaded lines (Feature count). If the two numbers differ, the line number of each dropped line and the reason for dropping it are available in the Details tab. See the following figure.

Load CSV file
Figure 3.

For instance, line #1 of sample 1.bed is dropped for two reasons: invalid start/stop column values, and an invalid Column 5, which is expected to hold the x-squared value (numeric) but instead holds X-squared (a string). Line #1 is a header line. Since the number of header lines in Section 2 was set to 0, GeMSE attempted to parse it. However, since line #1 does not adhere to the defined requirements, GeMSE dropped it. Accordingly, any line anywhere in the sample that does not adhere to the requirements (e.g., the required chromosome, start, and stop columns) and the user-defined parameters will not be read. The ignored lines and the reasons for ignoring them are reported.
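The behaviour described above, where invalid lines are dropped and the reasons are reported, can be sketched in Python as follows. This is a minimal illustration, not GeMSE's code: the function names, the exact validity checks, and the sample lines are assumptions, and the reason strings merely imitate the messages quoted above.

```python
def check_line(fields, numeric_columns):
    """Return the reasons a line is invalid (an empty list if it is valid)."""
    if len(fields) < 3:
        return ["missing chromosome, start, or stop column"]
    reasons = []
    try:
        start, stop = int(fields[1]), int(fields[2])
        if start < 0 or stop < start:
            reasons.append("invalid start/stop column values")
    except ValueError:
        reasons.append("invalid start/stop column values")
    for index in numeric_columns:
        try:
            float(fields[index])
        except (IndexError, ValueError):
            reasons.append(f"invalid Column {index + 1}")
    return reasons


def load_with_report(lines, header_lines=0, numeric_columns=(4,)):
    """Return (kept lines, {line number: reasons}) for one sample."""
    kept, dropped = [], {}
    for number, line in enumerate(lines[header_lines:], start=header_lines + 1):
        fields = line.rstrip("\n").split("\t")
        reasons = check_line(fields, numeric_columns)
        if reasons:
            dropped[number] = reasons
        else:
            kept.append(fields)
    return kept, dropped


sample = [
    "chr\tstart\tstop\tname\tX-squared\n",  # header line, parsed because header_lines=0
    "chr1\t100\t200\tpeak_1\t3.5\n",
    "chr1\t300\t450\tpeak_2\t7.25\n",
]
kept, dropped = load_with_report(sample, header_lines=0)
print(len(sample), len(kept))  # 3 2
print(dropped)
# {1: ['invalid start/stop column values', 'invalid Column 5']}
```

Setting `header_lines=1` in this sketch skips the header, so the line count and the feature count no longer disagree because of it.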

Clicking the YES button loads the temporarily cached samples into GeMSE.

GTF

GeMSE reads GTF files in the standard format, which is:

| Column number | Attribute | Read | Description |
|---|---|---|---|
| 1 | chromosome | | with or without chr prefix |
| 2 | source | | data source |
| 3 | feature | | feature type name, e.g. Gene |
| 4 | start | | start position of the feature |
| 5 | stop | | end position of the feature |
| 6 | score | | a floating point value |
| 7 | strand | | defined as + (forward) or - (reverse) |
| 8 | frame | | 0, 1 or 2 indicates the base of the feature which is the first base of a codon |
| 9 | attribute | | a semicolon-separated list of attribute-value pairs |

(see the Ensembl page for the GTF format details)

Attributes with numeric values in column #9 are used as values for GenoMetric Space cells, and attributes with string values are used for labeling and descriptive information in the pattern extraction section.
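The split between numeric and string attributes in column #9 can be illustrated with the Python sketch below. It is not GeMSE's parser: the function name is hypothetical, the `expression` attribute in the example is made up for illustration, and the sketch assumes that no quoted value contains a semicolon.

```python
def split_gtf_attributes(attribute_column):
    """Split GTF column 9 into (numeric attributes, string attributes)."""
    numeric, text = {}, {}
    for pair in attribute_column.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue  # trailing semicolon leaves an empty piece
        key, _, raw = pair.partition(" ")
        value = raw.strip().strip('"')
        try:
            numeric[key] = float(value)
        except ValueError:
            text[key] = value
    return numeric, text


# 'expression' is a hypothetical numeric attribute used only for this example.
column_9 = 'gene_id "ENSG00000223972"; gene_name "DDX11L1"; expression "12.5";'
numeric, text = split_gtf_attributes(column_9)
print(numeric)  # {'expression': 12.5}
print(text)     # {'gene_id': 'ENSG00000223972', 'gene_name': 'DDX11L1'}
```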

Once the selected GTF files are loaded, the results are displayed in a window as shown in Figure 2.



Homogeneous and heterogeneous files

A prerequisite in GeMSE is that all the loaded files refer to common positions on the genome, with different values. Therefore, all the files must have the same number of features, with the same chromosome, start, and stop attributes, in the same order. GeMSE refers to such files as homogeneous files; files that do not meet this requirement are considered heterogeneous files.
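The homogeneity requirement can be expressed as a simple check. The Python sketch below is an illustration only, with hypothetical function names and made-up samples; it treats each sample as a list of `(chromosome, start, stop, value)` tuples and compares the intervals position by position.

```python
def interval_keys(sample):
    """Extract the (chromosome, start, stop) part of each feature, keeping the order."""
    return [(chrom, start, stop) for chrom, start, stop, *_ in sample]


def are_homogeneous(samples):
    """True if every sample lists the same intervals in the same order."""
    if not samples:
        return True
    reference = interval_keys(samples[0])
    return all(interval_keys(sample) == reference for sample in samples[1:])


sample_a = [("chr1", 100, 200, 3.5), ("chr1", 300, 450, 7.25)]
sample_b = [("chr1", 100, 200, 1.0), ("chr1", 300, 450, 2.0)]   # same intervals, other values
sample_c = [("chr1", 300, 450, 2.0), ("chr1", 100, 200, 1.0)]   # same intervals, other order

print(are_homogeneous([sample_a, sample_b]))  # True
print(are_homogeneous([sample_a, sample_c]))  # False
```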

When loading samples, GeMSE checks whether the files are heterogeneous. If they are homogeneous, it loads the cached files. If they are heterogeneous, it shows the following window.

Load CSV file
Figure 4.

Using a reference sample (e.g., Ensembl reference genes for the hg19 assembly), GeMSE can convert heterogeneous files to homogeneous files. After clicking OK in the above window, GeMSE shows the following window.

Load CSV file
Figure 5.

Select a reference sample using the Browse button. Then select an attribute of the heterogeneous files from the Attribute to aggregate dropdown, and specify an Aggregation function. After clicking the Load button, GeMSE first loads the reference sample, then converts the heterogeneous samples to homogeneous samples, and finally shows the load results in a window as shown in Figure 2.
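One way to picture this conversion is sketched below in Python. The text above does not state how GeMSE matches sample intervals to reference intervals, so this sketch assumes that every sample interval overlapping a reference interval (half-open coordinates) contributes its value, and that a reference interval with no overlap gets `None`. The function name, the set of aggregation names, and the data are all hypothetical.

```python
from statistics import mean

# Hypothetical aggregation functions; the options offered by GeMSE may differ.
AGGREGATIONS = {"sum": sum, "max": max, "min": min, "mean": mean, "count": len}


def homogenize(reference, sample, aggregation="mean"):
    """Project a sample onto the reference intervals.

    reference -- list of (chromosome, start, stop)
    sample    -- list of (chromosome, start, stop, value)
    """
    aggregate = AGGREGATIONS[aggregation]
    result = []
    for chrom, start, stop in reference:
        # Assumed rule: collect values of sample intervals overlapping this reference interval.
        values = [
            value
            for s_chrom, s_start, s_stop, value in sample
            if s_chrom == chrom and s_start < stop and s_stop > start
        ]
        result.append((chrom, start, stop, aggregate(values) if values else None))
    return result


reference = [("chr1", 100, 200), ("chr1", 300, 400)]
sample = [("chr1", 120, 150, 2.0), ("chr1", 180, 220, 4.0), ("chr2", 100, 200, 9.0)]
print(homogenize(reference, sample, aggregation="mean"))
# [('chr1', 100, 200, 3.0), ('chr1', 300, 400, None)]
```

Because every converted sample then carries exactly the reference intervals in the reference order, the samples satisfy the homogeneity requirement by construction.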
