Data Vocabulary - QuarkNet-HEP/cima-wzh GitHub Wiki

CIMA Data Vocabulary

The following conventions for identifying data and variables within CIMA are aspirational, in the sense that the actual code often ignores or contradicts them. Identifying these instances and bringing them into conformity with the conventions below should be considered TODO items.

Data organization

The smallest unit of data CIMA handles is an event, meaning a single CMS particle collision event. The event data represents the tracks of particles produced during a single proton-proton collision in the CMS detector of CERN's Large Hadron Collider.

The maintainer of iSpy has organized events into datafiles that each contain data for 100 events. The CIMA administrator will assign one or more datafiles to a Masterclass location for participants to analyze their events.

The datafiles are further organized into datagroups. There are five datagroups, labeled by the integers 5, 10, 25, 50, and 100. The datagroup label indicates how many datafiles are assigned to that datagroup. This is done to adjust event statistics among Masterclasses of different sizes: smaller Masterclasses can be assigned data from a small datagroup, which is designed to have a higher proportion of interesting events. If it didn't, a smaller Masterclass might end up seeing nothing but background, which is neither interesting nor educational.

Summary: There are 5 datagroups: (5,10,25,50,100). Datagroup 'N' contains N datafiles. Each datafile contains 100 events.
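
As an illustration of the summary above, the sizes can be tabulated in a few lines of Python. This is only a sketch and not taken from the CIMA code; the names DATAGROUP_INDICES, EVENTS_PER_DATAFILE, and events_in_datagroup are hypothetical.

```python
# Hypothetical constants mirroring the summary: five datagroups,
# datagroup N holds N datafiles, and each datafile holds 100 events.
DATAGROUP_INDICES = (5, 10, 25, 50, 100)
EVENTS_PER_DATAFILE = 100


def events_in_datagroup(datagroup: int) -> int:
    """Return the total number of events in datagroup N (N datafiles x 100 events)."""
    if datagroup not in DATAGROUP_INDICES:
        raise ValueError(f"unknown datagroup index: {datagroup}")
    return datagroup * EVENTS_PER_DATAFILE


# Print one line per datagroup: label, number of datafiles, number of events.
for group in DATAGROUP_INDICES:
    print(group, group, events_in_datagroup(group))
```

Under these assumptions the smallest datagroup holds 500 events and the largest holds 10000.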

Data labels

datagroups

The integer label of each of the five datagroups (5, 10, 25, 50, 100) is known as the datagroup index. The datagroup index uniquely identifies the datagroup.

datafiles

Within a datagroup N, the N datafiles are assigned an integer datafile index (1, 2, ..., N). By itself, the datafile index is not sufficient to uniquely identify the datafile. For example, datagroup 5 contains datafiles with datafile index values (1, 2, 3, 4, 5), while datagroup 10 contains datafiles with datafile index values (1, 2, 3, 4, 5, ..., 10). Even though the numbers 1-5 are repeated, these are not the same datafiles.

To uniquely identify a datafile, we use the notation "(datagroup index).(datafile index)". For example, datagroup 5 contains the five datafiles (5.1, 5.2, 5.3, 5.4, 5.5), while datagroup 10 contains the ten datafiles (10.1, 10.2, 10.3, 10.4, 10.5, ..., 10.10), and we can clearly distinguish datafile 5.1 from datafile 10.1.
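
The datafile notation can be sketched in Python as follows. This is a minimal illustration under the conventions stated here, not the representation used by the CIMA code; DatafileId, make_datafile_id, and parse_datafile_id are hypothetical names.

```python
import re
from typing import NamedTuple

# The five datagroup index values named in this document.
DATAGROUP_INDICES = (5, 10, 25, 50, 100)


class DatafileId(NamedTuple):
    """Hypothetical container for a '(datagroup index).(datafile index)' label."""

    datagroup: int
    datafile: int

    def __str__(self) -> str:
        return f"{self.datagroup}.{self.datafile}"


def make_datafile_id(datagroup: int, datafile: int) -> DatafileId:
    """Validate both indices and return the combined identifier."""
    if datagroup not in DATAGROUP_INDICES:
        raise ValueError(f"unknown datagroup index: {datagroup}")
    if not 1 <= datafile <= datagroup:
        raise ValueError(f"datafile index {datafile} is outside 1..{datagroup}")
    return DatafileId(datagroup, datafile)


def parse_datafile_id(label: str) -> DatafileId:
    """Parse a label such as '10.10' as two integers, never as a float."""
    match = re.fullmatch(r"([0-9]+)\.([0-9]+)", label)
    if match is None:
        raise ValueError(f"not a datafile label: {label!r}")
    return make_datafile_id(int(match.group(1)), int(match.group(2)))


print(make_datafile_id(5, 1))
print(parse_datafile_id("10.10"))
```

The sketch keeps the two indices as separate integers on purpose. Although the label looks like a decimal number, reading it as one would make datafile 10.10 indistinguishable from datafile 10.1.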

events

Within each datafile, each of the 100 events is assigned an event index between 1 and 100. Again, these labels do not uniquely identify events; the (1, ..., 100) events in datafile 5.1 are not the same data as the (1, ..., 100) events in datafile 10.1.

To uniquely identify an event, we use the notation "(datagroup index).(datafile index)-(event index)". For example, datafile 10.1 contains the events (10.1-001, 10.1-002, ..., 10.1-100); datafile 10.2 contains the events (10.2-001, 10.2-002, ..., 10.2-100), and so on. In these examples the event index is written zero-padded to three digits.
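
A minimal Python sketch of the event notation follows. The function names format_event_id and parse_event_id are hypothetical, and the three-digit zero padding is an assumption inferred from the examples rather than a documented rule of the CIMA code.

```python
import re
from typing import Tuple

DATAGROUP_INDICES = (5, 10, 25, 50, 100)
EVENTS_PER_DATAFILE = 100


def format_event_id(datagroup: int, datafile: int, event: int) -> str:
    """Build '(datagroup index).(datafile index)-(event index)' after range checks."""
    if datagroup not in DATAGROUP_INDICES:
        raise ValueError(f"unknown datagroup index: {datagroup}")
    if not 1 <= datafile <= datagroup:
        raise ValueError(f"datafile index {datafile} is outside 1..{datagroup}")
    if not 1 <= event <= EVENTS_PER_DATAFILE:
        raise ValueError(f"event index {event} is outside 1..{EVENTS_PER_DATAFILE}")
    # Assumption: the event index is zero-padded to three digits, as in 10.1-001.
    return f"{datagroup}.{datafile}-{event:03d}"


def parse_event_id(label: str) -> Tuple[int, int, int]:
    """Split an event label into (datagroup index, datafile index, event index)."""
    match = re.fullmatch(r"([0-9]+)\.([0-9]+)-([0-9]{3})", label)
    if match is None:
        raise ValueError(f"not an event label: {label!r}")
    datagroup, datafile, event = (int(part) for part in match.groups())
    # Reuse the range checks from format_event_id; the returned string is discarded.
    format_event_id(datagroup, datafile, event)
    return datagroup, datafile, event


print(format_event_id(10, 1, 1))
print(parse_event_id("10.2-100"))
```

Because the datagroup index and datafile index are part of the label, event 001 of datafile 5.1 and event 001 of datafile 10.1 produce different identifiers.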

NB: In the current code, the event index is in at least some cases called the event number.