DIRAC and GridPP: user metadata - gridpp/dirac-getting-started GitHub Wiki

##Introduction

The user metadata functionality of the DIRAC system lets you attach information about your data to the files that you upload to the grid. To be precise, metadata values are associated with the Logical File Name (LFN) registered in the DIRAC File Catalog (DFC). You can then query the DFC to find the LFNs that match search criteria based on these metadata values.

The basic DIRAC metadata functionality is covered in this DIRAC tutorial. Much of the metadata functionality can be accessed via the command line using the File Catalog Client:

$ dirac-dms-filecatalog-cli
Starting FileCatalog client

File Catalog Client $Revision: 1.17 $Date: 
            
FC:/>

However, when dealing with large numbers of files and sophisticated metadata schemas, you'll probably want to use some sort of scripting language to manage your data. Fortunately, the DIRAC Python API lets us do this. This guide will take you through the basics of using the API to upload files, add metadata, and find files with metadata queries, using a real-world example from the CERN@school programme.

##Setup DIRAC

You'll need a grid certificate, VO membership and a DIRAC UI to run these examples. You can find further instructions on the home page. Make sure you know which sites (e.g. LCG.Glasgow.uk) you can run jobs on, and which Storage Elements (SEs) you have access to. You can find out the latter with the dirac-dms-show-se-status command:

$ dirac-dms-show-se-status
SE                       ReadAccess  WriteAccess  RemoveAccess  CheckAccess 
===============================================================================
ProductionSandboxSE      Active      Active       Active        Active      
BIRMINGHAM-disk          Active      Active       Active        Active      
GLASGOW-disk             Active      Active       Active        Active      
QMUL-disk                Active      Active       Active        Active      
LIVERPOOL-disk           Active      Active       Active        Active

You'll also need to create a proxy with dirac-proxy-init -g <VO name>_user -M, for example:

$ dirac-proxy-init -g cernatschool_user -M
Generating proxy... 
Enter Certificate password:
Uploading proxy for cernatschool_user... 
Proxy generated: 
subject      : /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie/CN=proxy
issuer       : /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie
identity     : /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie
timeleft     : 23:59:59
DIRAC group  : cernatschool_user
path         : /tmp/x509up_u500
username     : t.whyntie
properties   : NormalUser 

Proxies uploaded: 
 DN                                                           | Group             | Until (GMT) 
 /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie | cernatschool_user | 2015/03/23 12:01

##Setup the CERN@school example code

Rather than use a trivial example to demonstrate the DIRAC user metadata functionality, we have used code from the CERN@school Collaboration to show how one might implement a metadata schema using a real-world example. Specifically, we will be using frames of data from the Timepix hybrid silicon pixel detector used with CERN@school.

A frame of Timepix data contains charge measurements from a 256 x 256 grid of pixels, stored in a simple ASCII text file in the format:

X\tY\tC\n # pixel 1
X\tY\tC\n # pixel 2
#etc.
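
As a minimal sketch of how such a file might be read (this is not the CERN@school wrapper code), the snippet below parses the tab-separated format into a dictionary and counts the hit pixels. It assumes that only hit pixels are listed and that X, Y and C are integers; `parse_frame` and `occupancy` are hypothetical names used for illustration only.

```python
def parse_frame(text):
    """Parse Timepix frame text into a {(x, y): count} dictionary.

    Assumes one "X<TAB>Y<TAB>C" record per line with integer values.
    """
    pixels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, c = line.split("\t")
        pixels[(int(x), int(y))] = int(c)
    return pixels


def occupancy(pixels, n_rows=256, n_cols=256):
    """Return (number of hit pixels, fraction of the pixel grid that was hit)."""
    n_hit = len(pixels)
    return n_hit, n_hit / float(n_rows * n_cols)


# A made-up three-pixel frame, for illustration only.
sample = "10\t20\t35\n11\t20\t112\n200\t3\t7\n"
frame = parse_frame(sample)
n_hit, frac = occupancy(frame)
print(n_hit, frac)
```

Quantities like these are the kind of derived values that can later be stored as metadata alongside the settings read from the .dsc file.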

Each data file has an associated detector settings file (.dsc) that contains much of the metadata we will want to store (frame start time, end time, acquisition time, etc.). However, we will also use a number of Python wrapper classes to process the data files and extract additional metadata (such as the number of hit pixels in a frame).

You can clone the code used to do this with the following command:

$ cd $WORKINGDIR # note that this doesn't have to be your dirac directory.
$ git clone https://github.com/gridpp/dirac-getting-started.git

To check everything is working, you can run the unit tests as follows:

$ python cernatschool/test_dataset.py
.
----------------------------------------------------------------------
Ran 1 test in 0.033s

OK

and likewise for cernatschool/test_frame.py, cernatschool/test_kluster.py and cernatschool/test_pixel.py. Alternatively, nose may be used to perform all of the tests in one go:

$ nosetests 
....
----------------------------------------------------------------------
Ran 4 tests in 1.278s

OK

You may need to install the numpy, scipy and matplotlib Python modules if these are not already on your system. Also note that the DIRAC bashrc script clears the $PYTHONPATH environment variable, so you may need something like the provided setup.sh to make these findable by Python. The logging output, describing what's going on in the code, may be viewed in the log_*.txt files that are generated when the unit tests are run.

##Add the user metadata indices to the DFC

As detailed in the DIRAC tutorial, to be able to query the metadata associated with your data you need to create a metadata index for each item of metadata. These can be assigned to files or directories in the DFC, and are created via the File Catalog Client.

We will add the metadata indices for the frame data files using the File Catalog Client. First, though, check which indices have been added already with the following command:

$ dirac-dms-filecatalog-cli
Starting FileCatalog client

File Catalog Client $Revision: 1.17 $Date: 
            
FC:/>meta show
 FileMetaFields : {}
 DirectoryMetaFields : {}
FC:/>exit

The empty Python dictionaries (curly brackets) indicate that no metadata indices have been added yet. File indices are added with the following command:

FC:/>meta index -f <name> <type>

where <name> is the metadata index name and <type> is the index type - float, int, string or date. If they aren't there already, add the metadata indices for the frame data with the following commands:

FC:/>meta index -f chipid string
FC:/>meta index -f hv float
FC:/>meta index -f ikrum int
FC:/>meta index -f lat float
FC:/>meta index -f lon float
FC:/>meta index -f alt float
FC:/>meta index -f occ int
FC:/>meta index -f occ_pc float
FC:/>meta index -f n_pixel int
FC:/>meta index -f n_kluster int
FC:/>meta index -f n_gamma int
FC:/>meta index -f n_non_gamma int
FC:/>meta index -f ismc int
FC:/>meta index -f start_time int
FC:/>meta index -f end_time int
FC:/>meta index -f acqtime float
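
If you would rather keep the schema in one place, a small script can generate these commands for you to paste at the FC:/> prompt. The sketch below is only an illustration: `SCHEMA` and `index_commands` are hypothetical names, and the script prints text rather than talking to the DFC.

```python
# The frame metadata schema as (name, type) pairs, in the order used above.
SCHEMA = [
    ("chipid", "string"), ("hv", "float"), ("ikrum", "int"),
    ("lat", "float"), ("lon", "float"), ("alt", "float"),
    ("occ", "int"), ("occ_pc", "float"), ("n_pixel", "int"),
    ("n_kluster", "int"), ("n_gamma", "int"), ("n_non_gamma", "int"),
    ("ismc", "int"), ("start_time", "int"), ("end_time", "int"),
    ("acqtime", "float"),
]

# Index types accepted by "meta index", as listed in this guide.
VALID_TYPES = {"float", "int", "string", "date"}


def index_commands(schema):
    """Return one 'meta index -f <name> <type>' command per schema entry."""
    commands = []
    for name, kind in schema:
        if kind not in VALID_TYPES:
            raise ValueError("unknown index type %r for %r" % (kind, name))
        commands.append("meta index -f %s %s" % (name, kind))
    return commands


for command in index_commands(SCHEMA):
    print(command)
```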

You can check these have been successfully added to the DFC with the meta show command in the File Catalog Client:

FC:/>meta show
 FileMetaFields : {'n_gamma': 'INT', 'n_pixel': 'INT', 'lat': 'FLOAT', 'lon': 'FLOAT', 'start_time': 'INT', 'acqtime': 'FLOAT', 'occ': 'INT', 'chipid': 'VARCHAR(128)', 'hv': 'FLOAT', 'ikrum': 'INT', 'n_non_gamma': 'INT', 'n_kluster': 'INT', 'end_time': 'INT', 'ismc': 'INT', 'occ_pc': 'FLOAT', 'alt': 'FLOAT'}
 DirectoryMetaFields : {}
FC:/>exit

##Upload the sample data

We now need some data to play with. Rather than do this manually with the DIRAC Data Management System (DMS) commands (as described here), we will use a Python script that uses the CERN@school wrapper classes and DIRAC Python API to submit a job that puts files supplied in the InputSandbox to your Storage Element (SE) of choice.

A script has been supplied with the repo you cloned earlier for doing this. You will need to make the following changes to get it working with your setup:

  • Set the jobnum to a previously unused value to prevent multiple copies if you run the script more than once (i.e. if you make a mistake... it happens!).
  • Change the outputSE parameter of the j.setOutputData call to a Storage Element you have access to.
  • Change the outputPath parameter to set which subdirectory in your DFC user directory the files will be uploaded to. The base directory for this will always be /<vo name>/user/<first letter of username>/<username>/.
  • Change the j.setDestination argument to send your job to a different site, or remove to pick one according to the usual WMS rules.
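
To show how these settings fit together, here is a minimal sketch of the job configuration. It is not the upload.py from the repo: `user_base_dir` and `configure_upload_job` are hypothetical helpers, the diractestNNN directory name is an assumption based on the listing further down, and the `/bin/ls` payload is just a placeholder. The Job methods used (setName, setInputSandbox, setExecutable, setOutputData, setDestination) are the ones provided by the DIRAC Job API.

```python
import os


def user_base_dir(vo, username):
    """Return the DFC user directory, e.g. /cernatschool.org/user/t/t.whyntie/."""
    return "/%s/user/%s/%s/" % (vo, username[0], username)


def configure_upload_job(job, local_files, jobnum, output_se, site=None):
    """Configure a DIRAC Job that stores local_files on output_se.

    Returns the output subdirectory, relative to the DFC user directory.
    """
    output_path = "diractest%03d" % jobnum  # assumed naming scheme
    job.setName("upload-%03d" % jobnum)
    job.setInputSandbox(local_files)              # files travel with the job
    job.setExecutable("/bin/ls", arguments="-l")  # placeholder payload
    job.setOutputData([os.path.basename(f) for f in local_files],
                      outputSE=output_se, outputPath=output_path)
    if site is not None:
        job.setDestination(site)  # omit to let the WMS choose a site
    return output_path


# Typical use inside a DIRAC UI session (not executed here):
#   from DIRAC.Core.Base import Script
#   Script.parseCommandLine()
#   from DIRAC.Interfaces.API.Job import Job
#   from DIRAC.Interfaces.API.Dirac import Dirac
#   job = Job()
#   configure_upload_job(job, files, 3, "GLASGOW-disk", site="LCG.Glasgow.uk")
#   print(Dirac().submitJob(job))  # named submit() in older DIRAC releases
```

Passing the job object in as an argument keeps the helper free of DIRAC imports, so the configuration logic can be checked without a grid proxy.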

Once you have configured the upload.py script appropriately for your setup, you can submit the job with:

$ python upload.py

and monitor the progress via the web portal. You may wish to make a note of the Job ID if you want to retrieve the job output with the dirac-wms-job-get-output command.

Once the job has finished running, you can check if the files have been uploaded correctly with the File Catalog Client:

FC:/>cd cernatschool.org/user/t/t.whyntie/diractest003/
FC:/cernatschool.org/user/t/t.whyntie/diractest003>ls
B06-W0212_1371575424-293207.txt
B06-W0212_1371575425-337648.txt
B06-W0212_1371575426-414549.txt
B06-W0212_1371575427-489662.txt
B06-W0212_1371575428-551945.txt
FC:/cernatschool.org/user/t/t.whyntie/diractest003>meta get B06-W0212_1371575424-293207.txt
No metadata found
FC:/>exit

As that last command should show, there is no user metadata associated with these LFNs. Let's do something about that now.

##Add the metadata

The add_metadata.py script will recreate the user metadata from the wrapper classes and add it, via the FileCatalogClient class, to the LFNs you've just uploaded. (As far as I'm aware, there is no way to add the metadata when uploading the file via a job - the files need to exist in the DFC first, and the job will take a little time to run.) Depending on what you did in the previous step, you will need to change:

  • The jobnum value - to match the value used in the earlier job submission;
  • The lfn_base value - to match your VO, user name, and upload directory.
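
The core of such a script might look like the sketch below. It is not the add_metadata.py from the repo: `add_metadata` is a hypothetical helper that takes the catalogue client as an argument and relies on the FileCatalogClient setMetadata(path, metadata_dict) call, which returns the usual DIRAC result dictionary with an 'OK' flag.

```python
def add_metadata(client, lfn_base, metadata_by_file):
    """Attach a metadata dictionary to each LFN under lfn_base.

    Returns {lfn: error message} for every call that failed.
    """
    failures = {}
    for filename, metadata in sorted(metadata_by_file.items()):
        lfn = lfn_base.rstrip("/") + "/" + filename
        result = client.setMetadata(lfn, metadata)
        if not result["OK"]:
            failures[lfn] = result.get("Message", "unknown error")
    return failures


# Typical use inside a DIRAC UI session (not executed here):
#   from DIRAC.Core.Base import Script
#   Script.parseCommandLine()
#   from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient
#   failed = add_metadata(
#       FileCatalogClient(),
#       "/cernatschool.org/user/t/t.whyntie/diractest003",
#       {"B06-W0212_1371575424-293207.txt": {"n_pixel": 735, "n_gamma": 12}})
```

Collecting the failures rather than stopping at the first one makes it easier to re-run the script for just the files that were missed.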

The script will output the result of each call that sets the metadata. Once it has run, you can check it has worked by launching the File Catalog Client and retrieving the metadata values for each file:

$ dirac-dms-filecatalog-cli 
Starting FileCatalog client

File Catalog Client $Revision: 1.17 $Date: 
            
FC:/>cd /cernatschool.org/user/t/t.whyntie/diractest003/
FC:/>meta get B06-W0212_1371575424-293207.txt
             n_pixel : 735
             n_gamma : 12
                 lon : -0.142515
             acqtime : 1.0
                 occ : 735
              chipid : B06-W0212
                  hv : 18.0
               ikrum : 1
         n_non_gamma : 22
           n_kluster : 34
            end_time : 1371575425
                 lat : 51.5099
                 alt : 34.02
              occ_pc : 0.0112152
          start_time : 1371575424
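
The same check can be scripted. The sketch below assumes the FileCatalogClient getFileUserMetadata(path) call returns a result dictionary whose 'Value' is the metadata dictionary for that LFN; `check_metadata` is a hypothetical helper, and exact comparison is only sensible for integer and string values (floats may come back with different precision).

```python
def check_metadata(client, lfn, expected):
    """Compare stored user metadata against expected values.

    Returns {key: (expected, stored)} for each missing or mismatched key.
    """
    result = client.getFileUserMetadata(lfn)
    if not result["OK"]:
        raise RuntimeError(result.get("Message", "metadata lookup failed"))
    stored = result["Value"]
    return {key: (value, stored.get(key))
            for key, value in expected.items()
            if stored.get(key) != value}


# Typical use inside a DIRAC UI session (not executed here):
#   from DIRAC.Core.Base import Script
#   Script.parseCommandLine()
#   from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient
#   lfn = ("/cernatschool.org/user/t/t.whyntie/diractest003/"
#          "B06-W0212_1371575424-293207.txt")
#   print(check_metadata(FileCatalogClient(), lfn, {"n_pixel": 735}))
```

An empty dictionary means every expected value was found.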

##Perform a metadata query

Now that you've got some data uploaded to the grid with some metadata, you can easily find it. Manual queries can be performed via the File Catalog Client. For example, to find the LFNs of frames with a start time (UNIX timestamp) greater than a certain value, use:

FC:/>cd /
FC:/>find . start_time>1371575425
Query: {'start_time': {'>': 1371575425}}
/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575426-414549.txt
/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575428-551945.txt
/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575427-489662.txt
QueryTime 0.00 sec
FC:/>
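
The Query line printed by the client shows the dictionary form of the search term. As a rough illustration of that mapping (not the client's own parser), the sketch below converts a single numeric comparison into the same kind of dictionary. `parse_term` is a hypothetical helper; of the operators it accepts, only `>` appears in this guide, and the others are assumptions.

```python
import re

# name, comparison operator, value; two-character operators are tried first
_TERM = re.compile(r"^(\w+)(>=|<=|>|<)(.+)$")


def parse_term(term):
    """Turn e.g. 'start_time>1371575425' into {'start_time': {'>': 1371575425}}."""
    match = _TERM.match(term)
    if match is None:
        raise ValueError("unsupported query term: %r" % term)
    name, operator, raw = match.groups()
    try:
        value = int(raw)
    except ValueError:
        value = float(raw)  # raises ValueError for non-numeric values
    return {name: {operator: value}}


print(parse_term("start_time>1371575425"))
```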

However, it's far more useful to be able to perform these queries within a Python script - for example, if you're retrieving a list of LFNs for the job's InputData based on certain metadata. Again, we've created a script with some example queries you can use. This can be run with:

$ python perform_query.py 

#############################################
* GridPP and DIRAC: user metadata - queries *
#############################################

* Metadata query: {'n_pixel': {'>': 700}}
Found: '/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575424-293207.txt'.
Found: '/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575428-551945.txt'.
Found: '/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575427-489662.txt'.

These are the three frames that have more than 700 hit pixels in them. The next step is to use this list of LFNs as the input to another grid job via the setInputData method.
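
A minimal sketch of that workflow is given below. It is not the perform_query.py from the repo: `find_lfns` and `add_input_data` are hypothetical helpers, and the code assumes that the FileCatalogClient findFilesByMetadata(query, path=...) call returns a result dictionary whose 'Value' is a list of LFNs. The LFNs are then handed to the job with setInputData, the Job method for input files that already live on the grid.

```python
def find_lfns(client, query, path="/"):
    """Return the sorted list of LFNs under path that match the metadata query."""
    result = client.findFilesByMetadata(query, path=path)
    if not result["OK"]:
        raise RuntimeError(result.get("Message", "metadata query failed"))
    return sorted(result["Value"])


def add_input_data(job, lfns):
    """Register the LFNs as the job's input data; return how many were added."""
    if not lfns:
        raise ValueError("the metadata query matched no files")
    job.setInputData(lfns)
    return len(lfns)


# Typical use inside a DIRAC UI session (not executed here):
#   from DIRAC.Core.Base import Script
#   Script.parseCommandLine()
#   from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient
#   from DIRAC.Interfaces.API.Job import Job
#   lfns = find_lfns(FileCatalogClient(), {"n_pixel": {">": 700}})
#   job = Job()
#   add_input_data(job, lfns)
```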

