
Running via xRootD and DAS

xrootd is a piece of software used by CMS to access files located at remote sites or storage locations from anywhere on the compute grid. DAS is a service for querying published datasets and looking up metadata about them, including the sites where they are stored and the full list of files that comprise the dataset. Together, the two make it possible, knowing only the name of a dataset, to run on any CMS dataset stored anywhere on the grid, from anywhere else on the grid.

This page documents how Treemaker can interact with both services, as of Treemaker v1.2.

Running via xRootD

If you have xrootd installed, or you are somewhere with the CMSSW framework available (e.g. the cmslpc cluster), you probably have access to xrootd.

Thus, in order to run via xrootd, all you need to do is modify the configuration accordingly. Instead of giving a local path for the directory option, give a path of the following form:

directory = root://server//path/
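
For instance, pointing at a user area through the FNAL regional redirector might look like the following (the path here is made up for illustration):

directory = root://cmsxrootd.fnal.gov//store/user/someuser/ntuples/mysample/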

Note that the main use of xrootd support in Treemaker is to specify a directory that can only be accessed over xrootd. In other words, Treemaker assumes that all your ROOT files are still stored together in a directory somewhere; just a directory that isn't locally available and must be accessed using the XRD protocol instead. If your input is a published dataset rather than a plain directory, this may not be what you want (take a look at the DAS support documented below instead).

Accessing cmslpc's EOS via xrootd

Recently, the EOS instance running at Fermilab was unmounted from the condor compute nodes, due to heavy usage. It's still mounted on the login nodes, but this means that if you wanted to use, say, treemaker-condor to process some Ntuples stored in EOS, you would have no way of accessing them. At least, until now.

You can still access the EOS files over the xrootd service running on cmseos.fnal.gov. Have a look at the following example, which refers to some 8 TeV ntuples that I made back in 2015. The following two forms are equivalent ways to access the directory:

directory = root://cmseos.fnal.gov///eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/

and:

directory = /eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/

The latter will work perfectly fine if EOS is mounted (try running ls on it from one of the cmslpc login nodes). But to run on the compute nodes, you need to use the former.
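
If you want to double-check what is actually visible over xrootd, you can list the directory both ways. This is a sketch; it assumes the xrdfs client (shipped with xrootd and with CMSSW environments) and that cmseos.fnal.gov exports the path in the same form as above; depending on the server configuration, you may need to drop the /eos/uscms prefix:

# On a cmslpc login node, through the FUSE mount:
ls /eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/

# From anywhere with xrootd, through the cmseos service:
xrdfs root://cmseos.fnal.gov ls /eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/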

DAS Query Integration

Unless you are trying to run over data stored in the cmslpc EOS instance via condor (the batch system at cmslpc), this is probably what you want to do. Treemaker now uses a fork of the Python CMS DAS client to take the name of a dataset and the DBS instance it's stored in, and to derive the list of files to run over. While the files themselves get run over using the xrootd support discussed in the section above, you, the user, only need to know the canonical name of the dataset.
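
Under the hood, this boils down to asking DAS for the dataset's logical file names (LFNs) and opening each one over xrootd. Here is a rough sketch of the equivalent manual steps; the query syntax is standard DAS, but the dataset name and the choice of redirector are illustrative:

# Ask DAS for the files in a dataset published to, e.g., prod/phys03:
das_client.py --query="file dataset=/Name/Of/Dataset instance=prod/phys03" --limit=0

# Each result is an LFN like /store/user/someuser/somefile.root; prepending
# a redirector gives a URL that ROOT can open directly, for example:
# root://cmsxrootd.fnal.gov//store/user/someuser/somefile.root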

Syntax

The following syntax is used to define a DAS job in a Treemaker config file:

directory = das://prod/global:/Name/Of/Dataset

The prod/global is the name of the DBS instance. There are several such instances; besides prod/global, your files are most likely to be in one of prod/phys{01,02,03} (usually prod/phys03, these days). There is also a prod/caf instance.

The /Name/Of/Dataset is the full name of the published dataset. Conventionally, the first field ("Name" here) is a more general description of what the dataset is; the second ("Of" here) has details like who created it, when, and with which versions of which software; and the third ("Dataset" here) is the data tier, something like "MINIAOD", "AOD", "LHE", or "USER" ("USER" usually means user-created ntuples).

The das:// at the beginning simply tells Treemaker that this is a dataset, rather than a directory.
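
Putting the pieces together, an entry for privately produced ntuples published to prod/phys03 might look like the following (the dataset name is made up for illustration, in the usual form crab gives published user datasets):

directory = das://prod/phys03:/MyPrimarySample/someuser-MyProcessedName-0123456789abcdef/USER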

Usage

You can use the DAS web interface or the command-line das_client.py client to make DAS queries and figure out which datasets you want to run over. Or, if you are making ntuples yourself, you can find out from crab status what the published name of a dataset ended up being.
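
For example, a wildcard query for matching datasets might look like this (the dataset name is again made up, and the exact flags can vary between versions of the DAS client):

das_client.py --query="dataset dataset=/MyPrimarySample*/*/USER instance=prod/phys03" --limit=0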

Then, once you have the name of a dataset, you can create Treemaker config files (using treemaker-config, if you like) and run them with treemaker. Everything should just work!