# Running via xRootD and DAS
xrootd is a piece of software used by CMS to access files located at other remote sites or storage locations from anywhere on the compute grid. DAS is a service for querying published datasets and looking up metadata about them, including the remote sites they are located on and the full list of files that comprise the dataset. Between the two, knowing only the "name" of a dataset, it's possible to run on any CMS dataset stored anywhere on the grid, from anywhere else on the grid.
This page documents how Treemaker can interact with both services, as of Treemaker v1.2.
## Running via xRootD
If you have xrootd installed, or are on e.g. the cmslpc cluster or somewhere else with the CMSSW framework available, you probably have access to xrootd. In that case, all you need to do to run via xrootd is modify the configuration accordingly: instead of giving a local path for the `directory` option, give one of the following form:
```
directory = root://server//path/
```
Note that the main use of xrootd support in Treemaker is to specify a directory that can only be accessed over xrootd. The current state of Treemaker xrootd support assumes that all your ROOT files are still stored in a directory somewhere-- just somewhere that isn't locally available and must be accessed using the XRD protocol instead. This may not be what you want (you should probably take a look at the DAS support documented below instead).
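As a quick sanity check before launching jobs, you can confirm that such a directory is actually reachable over xrootd. The `xrdfs` utility ships with standard xrootd installations; the server and path below are placeholders for your own:

```
xrdfs root://server ls /path/
```

If that lists your ROOT files, the same `root://server//path/` form should work as the `directory` option.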
## Accessing cmslpc's EOS via xrootd
Recently, the EOS instance running at Fermilab was unmounted from the condor compute nodes, due to heavy usage. It's still mounted on the login nodes, but this means that if you wanted to use, say, `treemaker-condor` to process some ntuples stored in EOS, you would have no way of accessing them. At least, until now.
You can still access the EOS files over the xrootd service running on cmseos.fnal.gov. Have a look at the following example, which refers to some 8 TeV ntuples that I made back in 2015. The following are equivalent ways to access the directory:
```
directory = root://cmseos.fnal.gov///eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/
```
and:
```
directory = /eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/
```
The latter will work perfectly fine, if EOS is mounted (try `ls`-ing it from one of the cmslpc login nodes). But to run on the compute nodes, you need to use the former.
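If you want to peek at that directory from somewhere without an EOS mount (a condor node, say), `xrdfs` can list it through the same cmseos.fnal.gov service used above, assuming `xrdfs` is available in your environment:

```
xrdfs root://cmseos.fnal.gov ls /eos/uscms/store/user/bjr/ntuples/gstar/Gstar_Semilep_1500GeV/normal/Gstar_Semilep_1500GeV_WMLM/
```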
## DAS Query Integration
Unless you are trying to run over data stored in the cmslpc EOS instance via condor (the batch system at cmslpc), this is probably what you want to do. Treemaker now uses a fork of the Python CMS DAS client to take the name of a dataset and the DBS instance it's stored in, and derive the list of files to run over. The files themselves are run over using the xrootd support discussed in the section above, but you, the user, only need to know the canonical name of the dataset.
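Under the hood, this amounts to the kind of file-list query you could also run by hand with the DAS client; a sketch, using the placeholder dataset name from the syntax below:

```
das_client.py --query="file dataset=/Name/Of/Dataset instance=prod/global" --limit=0
```

Treemaker does the equivalent of this for you and feeds the resulting file list to its xrootd machinery.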
### Syntax
The following syntax is used to define a DAS job in a Treemaker config file:
```
directory = das://prod/global:/Name/Of/Dataset
```
The `prod/global` is the name of the DBS instance. There are several such instances; in addition to `prod/global`, your files are most likely to be in one of `prod/phys{01,02,03}` (usually `prod/phys03`, these days, it seems). There is also a `prod/caf`.
The `/Name/Of/Dataset` is the full name of the published dataset. Usually, the first string ("Name", here) is a general description of what the dataset is; the second ("Of", here) has details like who created it, when, and with what versions of what software; and the third ("Dataset", here) is the type of dataset-- something like "MINIAOD" or "AOD" or "LHE" or "USER". "USER" usually means that these are user-created ntuples.
The `das://` at the beginning simply tells Treemaker that this is a dataset, rather than a directory.
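Putting the pieces together, a complete `directory` line for user-published ntuples might look like this (the dataset name is hypothetical, just to show the shape):

```
directory = das://prod/phys03:/Gstar_Semilep_1500GeV/bjr-ntuples-v1/USER
```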
### Usage
You can use the DAS web interface or the command-line `das_client.py` client to make DAS queries and figure out what datasets you want to run on. Or, if you are making the ntuples as well, you can find out what the published name of a dataset ended up being by running `crab status`.
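For instance, a wildcard dataset query along these lines (the search string is purely illustrative) lists matching published datasets in a given instance:

```
das_client.py --query="dataset=/Gstar*/*/USER instance=prod/phys03" --limit=0
```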
Then, once you have the name of a dataset, you can create Treemaker config files (using `treemaker-config`, if you like) and run them with `treemaker`.
Everything should just work!
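For the record, the end-to-end flow might look something like this in a shell; this is a sketch, assuming `treemaker` takes the config file as its sole argument (check each tool's help if your version differs):

```
# write a config by hand, or with treemaker-config, containing e.g.:
#   directory = das://prod/phys03:/Name/Of/Dataset
treemaker my_das_job.config
```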