DIRAC and GridPP: user metadata - gridpp/dirac-getting-started GitHub Wiki
##Introduction
The user metadata functionality of the DIRAC system allows you to add information about the data to the files that you upload to the grid. To be precise, metadata values can be associated with the Logical File Name (LFN) registered in the DIRAC File Catalog (DFC). One may then query the DFC to find LFNs matching search criteria based on these metadata values.
The basic DIRAC metadata functionality is covered in this DIRAC tutorial. Much of the metadata functionality can be accessed via the command line using the File Catalog Client:
$ dirac-dms-filecatalog-cli
Starting FileCatalog client
File Catalog Client $Revision: 1.17 $Date:
FC:/>
However, when dealing with large numbers of files and sophisticated metadata schemas, you'll probably want to use a scripting language to manage your data. Fortunately, the DIRAC Python API lets us do this. This guide takes you through the basics of using it to upload files, add metadata, and find files with metadata queries, using a real-world example from the CERN@school programme.
##Setup DIRAC
You'll need a grid certificate, VO membership and a DIRAC UI to run these examples. You can find further instructions on the home page. Make sure you know which sites (e.g. `LCG.Glasgow.uk`) you can run jobs on, and which Storage Elements (SEs) you have access to. You can find out the latter with the `dirac-wms-show-se-status` command:
$ dirac-wms-show-se-status
SE ReadAccess WriteAccess RemoveAccess CheckAccess
===============================================================================
ProductionSandboxSE Active Active Active Active
BIRMINGHAM-disk Active Active Active Active
GLASGOW-disk Active Active Active Active
QMUL-disk Active Active Active Active
LIVERPOOL-disk Active Active Active Active
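If you want to pick a Storage Element from a script rather than by eye, the table above is simple enough to parse. The following is only a sketch: it assumes the output is whitespace-separated with exactly the five columns shown in the listing, and the function names `parse_se_status` and `writable_ses` are made up for this illustration.

```python
def parse_se_status(text):
    """Parse the SE status table into {se_name: {column: state}}.

    Assumes whitespace-separated columns, as in the listing above.
    """
    columns = ["ReadAccess", "WriteAccess", "RemoveAccess", "CheckAccess"]
    status = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row, the '=====' rule and anything malformed.
        if len(fields) != 5 or fields[0] == "SE":
            continue
        status[fields[0]] = dict(zip(columns, fields[1:]))
    return status


def writable_ses(status):
    """Return the sorted names of SEs whose WriteAccess is 'Active'."""
    return sorted(name for name, access in status.items()
                  if access["WriteAccess"] == "Active")


# A shortened copy of the listing above, used as sample input.
sample = """\
SE ReadAccess WriteAccess RemoveAccess CheckAccess
===============================================================================
ProductionSandboxSE Active Active Active Active
GLASGOW-disk Active Active Active Active
QMUL-disk Active Active Active Active
"""
print(writable_ses(parse_se_status(sample)))
```

In practice you would feed the function the captured output of the command instead of the `sample` string.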
You'll also need to create a proxy with `dirac-proxy-init -g <VO name>_user -M`, for example:
$ dirac-proxy-init -g cernatschool_user -M
Generating proxy...
Enter Certificate password:
Uploading proxy for cernatschool_user...
Proxy generated:
subject : /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie/CN=proxy
issuer : /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie
identity : /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie
timeleft : 23:59:59
DIRAC group : cernatschool_user
path : /tmp/x509up_u500
username : t.whyntie
properties : NormalUser
Proxies uploaded:
DN | Group | Until (GMT)
/C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=tom whyntie | cernatschool_user | 2015/03/23 12:01
##Setup the CERN@school example code
Rather than use a trivial example to demonstrate the DIRAC user metadata functionality, we have used code from the CERN@school Collaboration to show how one might implement a metadata schema using a real-world example. Specifically, we will be using frames of data from the Timepix hybrid silicon pixel detector used with CERN@school.
A frame of Timepix data contains charge measurements from a 256 x 256 grid of pixels, stored in a simple ASCII text file in the format:
X\tY\tC\n # pixel 1
X\tY\tC\n # pixel 2
#etc.
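To make the format concrete, here is a minimal sketch of reading such a file and deriving two of the metadata values used later in this guide. It is not the CERN@school wrapper code: the function names are hypothetical, and it assumes that `X`, `Y` and `C` are integers and that a pixel counts as "hit" when `C` is greater than zero. The occupancy calculation (hit pixels divided by 256 x 256) is also an assumption, although it is consistent with the `n_pixel` and `occ_pc` values shown in the `meta get` listing further down (735 / 65536 ≈ 0.0112152).

```python
FRAME_WIDTH = 256
FRAME_HEIGHT = 256


def parse_frame(text):
    """Return {(x, y): count} for every pixel listed in a frame file."""
    pixels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        x, y, c = line.split("\t")
        pixels[(int(x), int(y))] = int(c)
    return pixels


def frame_summary(pixels):
    """Compute the number of hit pixels and the fractional occupancy."""
    n_hit = sum(1 for count in pixels.values() if count > 0)
    return {
        "n_pixel": n_hit,
        "occ_pc": n_hit / float(FRAME_WIDTH * FRAME_HEIGHT),
    }


# Three made-up pixels in the X<TAB>Y<TAB>C format described above.
sample_frame = "0\t0\t12\n10\t200\t3\n255\t255\t41\n"
summary = frame_summary(parse_frame(sample_frame))
print(summary["n_pixel"], summary["occ_pc"])
```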
Each data file has an associated detector settings file (`.dsc`) that contains much of the metadata we will want to store (frame start time, end time, acquisition time, etc.). However, we will also use a number of Python wrapper classes to process the data files and extract additional metadata (such as the number of hit pixels in a frame).
You can clone the code used to do this with the following command:
$ cd $WORKINGDIR # note that this doesn't have to be your dirac directory.
$ git clone https://github.com/gridpp/dirac-getting-started.git
To check everything is working, you can run the unit tests as follows:
$ python cernatschool/test_dataset.py
.
----------------------------------------------------------------------
Ran 1 test in 0.033s
OK
and likewise for `cernatschool/test_frame.py`, `cernatschool/test_kluster.py` and `cernatschool/test_pixel.py`. Alternatively, nose may be used to perform all of the tests in one go:
$ nosetests
....
----------------------------------------------------------------------
Ran 4 tests in 1.278s
OK
You may need to install the `numpy`, `scipy` and `matplotlib` Python modules if these are not already on your system. Also note that the DIRAC `bashrc` script clears the `$PYTHONPATH` environment variable, so you may need something like the provided `setup.sh` to make these findable by Python. The `logging` output, describing what's going on in the code, may be viewed in the `log_*.txt` files that are generated when the unit tests are run.
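A quick way to see whether the `$PYTHONPATH` reset has hidden any of these modules is to ask Python whether it can locate them, without importing them. This is a small sketch that assumes Python 3.4 or later (for `importlib.util.find_spec`); the helper name `missing_modules` is invented for this example.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["numpy", "scipy", "matplotlib"])
if missing:
    print("Not importable (check $PYTHONPATH): " + ", ".join(missing))
else:
    print("All required modules found.")
```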
##Add the user metadata indices to the DFC
As detailed in the DIRAC tutorial, to be able to query the metadata associated with your data you need to create a metadata index for each item of metadata. These can be assigned to files or directories in the DFC, and are created via the File Catalog Client.
We will add the metadata indices for the frame data files using the File Catalog Client. First, though, check which indices have been added already with the following command:
$ dirac-dms-filecatalog-cli
Starting FileCatalog client
File Catalog Client $Revision: 1.17 $Date:
FC:/>meta show
FileMetaFields : {}
DirectoryMetaFields : {}
FC:/>exit
The empty Python dictionaries (curly brackets) indicate that no metadata indices have been added yet. File indices are added with the following command:
FC:/>meta index -f <name> <type>
where `<name>` is the metadata index name and `<type>` is the index type: `float`, `int`, `string` or `date`. If they aren't there already, add the following metadata indices:
FC:/>meta index -f chipid string
FC:/>meta index -f hv float
FC:/>meta index -f ikrum int
FC:/>meta index -f lat float
FC:/>meta index -f lon float
FC:/>meta index -f alt float
FC:/>meta index -f occ int
FC:/>meta index -f occ_pc float
FC:/>meta index -f n_pixel int
FC:/>meta index -f n_kluster int
FC:/>meta index -f n_gamma int
FC:/>meta index -f n_non_gamma int
FC:/>meta index -f ismc int
FC:/>meta index -f start_time int
FC:/>meta index -f end_time int
FC:/>meta index -f acqtime float
You can check these have been successfully added to the DFC with the `meta show` command in the File Catalog Client:
FC:/>meta show
FileMetaFields : {'n_gamma': 'INT', 'n_pixel': 'INT', 'lat': 'FLOAT', 'lon': 'FLOAT', 'start_time': 'INT', 'acqtime': 'FLOAT', 'occ': 'INT', 'chipid': 'VARCHAR(128)', 'hv': 'FLOAT', 'ikrum': 'INT', 'n_non_gamma': 'INT', 'n_kluster': 'INT', 'end_time': 'INT', 'ismc': 'INT', 'occ_pc': 'FLOAT', 'alt': 'FLOAT'}
DirectoryMetaFields : {}
FC:/>exit
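Keeping the schema in one place in your own scripts can help avoid typos when there are this many indices. The sketch below stores the index names and types from the commands above in a dictionary, regenerates the `meta index` lines from it, and checks a metadata record against it before you try to upload anything. The names `FRAME_METADATA_SCHEMA`, `index_commands` and `schema_errors` are hypothetical and are not part of DIRAC or the CERN@school code.

```python
# Index name -> index type, copied from the `meta index` commands above.
FRAME_METADATA_SCHEMA = {
    "chipid": "string", "hv": "float", "ikrum": "int",
    "lat": "float", "lon": "float", "alt": "float",
    "occ": "int", "occ_pc": "float",
    "n_pixel": "int", "n_kluster": "int",
    "n_gamma": "int", "n_non_gamma": "int",
    "ismc": "int", "start_time": "int", "end_time": "int",
    "acqtime": "float",
}

# Python types accepted for each index type (an int is fine for a float).
PYTHON_TYPES = {"string": str, "int": int, "float": (int, float)}


def index_commands(schema):
    """Build the File Catalog Client commands that create the indices."""
    return ["meta index -f %s %s" % (name, kind)
            for name, kind in sorted(schema.items())]


def schema_errors(record, schema=FRAME_METADATA_SCHEMA):
    """Return a list of problems found in one file's metadata record."""
    errors = []
    for name, value in sorted(record.items()):
        if name not in schema:
            errors.append("%s: no such index" % name)
        elif not isinstance(value, PYTHON_TYPES[schema[name]]):
            errors.append("%s: expected %s" % (name, schema[name]))
    return errors


for command in index_commands(FRAME_METADATA_SCHEMA):
    print(command)
print(schema_errors({"chipid": "B06-W0212", "hv": "18.0"}))
```

The last line reports a problem with `hv`, because the value is a string rather than a number.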
##Upload the sample data
We now need some data to play with. Rather than do this manually with the DIRAC Data Management System (DMS) commands (as described here), we will use a Python script that uses the CERN@school wrapper classes and DIRAC Python API to submit a job that puts files supplied in the `InputSandbox` to your Storage Element (SE) of choice.
A script has been supplied with the repo you cloned earlier for doing this. You will need to make the following changes to get it working with your setup:
- Set the `jobnum` to a previously unused value to prevent multiple copies if you run the script more than once (i.e. if you make a mistake... it happens!).
- Change the `outputSE` parameter of the `j.setOutputData` call to a Storage Element you have access to.
- Change the `outputPath` parameter to set which subdirectory in your DFC user directory the files will be uploaded to. The base directory for this will always be `/<vo name>/user/<first letter of username>/<username>/`.
- Change the `j.setDestination` argument to send your job to a different site, or remove it to pick one according to the usual WMS rules.
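The directory rule in the third point is easy to get wrong by a slash, so it can be worth computing the target directory once and reusing it in later scripts. This is a minimal sketch of that rule only; `user_lfn_dir` is a hypothetical helper, not a DIRAC function, and it simply follows the `/<vo name>/user/<first letter of username>/<username>/` pattern described above.

```python
import posixpath


def user_lfn_dir(vo_name, username, output_path=""):
    """Return the DFC directory that files end up in for a given outputPath."""
    base = posixpath.join("/", vo_name, "user", username[0], username)
    sub = output_path.strip("/")
    return posixpath.join(base, sub) if sub else base


print(user_lfn_dir("cernatschool.org", "t.whyntie", "diractest003"))
```

With the values used in this guide, this prints the directory that is listed with the File Catalog Client in the next step.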
Once you have configured the `upload.py` script appropriately for your setup, you can submit the job with:
$ python upload.py
and monitor the progress via the web portal. You may wish to make a note of the Job ID if you want to retrieve the job output with the `dirac-wms-job-get-output` command.
Once the job has finished running, you can check if the files have been uploaded correctly with the File Catalog Client:
FC:/>cd cernatschool.org/user/t/t.whyntie/diractest003/
FC:/cernatschool.org/user/t/t.whyntie/diractest003>ls
B06-W0212_1371575424-293207.txt
B06-W0212_1371575425-337648.txt
B06-W0212_1371575426-414549.txt
B06-W0212_1371575427-489662.txt
B06-W0212_1371575428-551945.txt
FC:/cernatschool.org/user/t/t.whyntie/diractest003>meta get B06-W0212_1371575424-293207.txt
No metadata found
FC:/>exit
As that last command should show, there is no user metadata associated with these LFNs. Let's do something about that now.
##Add the metadata
The add_metadata.py
script will recreate the user metadata from the wrapper classes and add it to the LFNs you've just uploaded via the FileCatalogClient class. (As far as I'm aware, there is no way to add the metadata when uploading the file via a job - the files need to exist in the DFC and the job will take a little time to run.) Depending on what you did in the previous step, you will need to change:
- The `jobnum` value, to match the value used in the job submission before;
- The `lfn_base` value, to match your VO, user name, and upload directory.
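The core of such a script is a loop that calls the catalog client once per LFN and checks the result. The sketch below is not the contents of `add_metadata.py`; it is an illustration that assumes the client exposes `setMetadata(path, metadata_dict)` and returns DIRAC-style result dictionaries with an `OK` key (and a `Message` key on failure). So that it can be run without a grid connection, it uses a hypothetical `DryRunCatalog` stand-in that only records the calls.

```python
def add_metadata(catalog, lfn_base, metadata_by_file):
    """Attach a metadata dict to each file under lfn_base.

    Returns {lfn: error message} for any calls that failed.
    """
    failed = {}
    for filename, metadata in sorted(metadata_by_file.items()):
        lfn = lfn_base.rstrip("/") + "/" + filename
        result = catalog.setMetadata(lfn, metadata)
        if not result["OK"]:
            failed[lfn] = result.get("Message", "unknown error")
    return failed


class DryRunCatalog(object):
    """Hypothetical stand-in for the real client: records calls only."""

    def __init__(self):
        self.calls = []

    def setMetadata(self, path, metadata):
        self.calls.append((path, dict(metadata)))
        return {"OK": True, "Value": None}


catalog = DryRunCatalog()
failed = add_metadata(
    catalog,
    "/cernatschool.org/user/t/t.whyntie/diractest003",
    {"B06-W0212_1371575424-293207.txt":
         {"chipid": "B06-W0212", "n_pixel": 735, "hv": 18.0}},
)
print(failed)
print(catalog.calls[0][0])
```

To run this against the real DFC you would pass in a `FileCatalogClient` instance instead of the stand-in, after the usual DIRAC script initialisation; the exact import path and set-up depend on your DIRAC version.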
The script will output the results of the queries (setting the metadata). Once run, you can check it has worked by launching the File Catalog Client and retrieving the metadata indices for each file:
$ dirac-dms-filecatalog-cli
Starting FileCatalog client
File Catalog Client $Revision: 1.17 $Date:
FC:/>cd /cernatschool.org/user/t/t.whyntie/diractest003/
FC:/>meta get B06-W0212_1371575424-293207.txt
n_pixel : 735
n_gamma : 12
lon : -0.142515
acqtime : 1.0
occ : 735
chipid : B06-W0212
hv : 18.0
ikrum : 1
n_non_gamma : 22
n_kluster : 34
end_time : 1371575425
lat : 51.5099
alt : 34.02
occ_pc : 0.0112152
start_time : 1371575424
##Perform a metadata query
Now that you've uploaded some data to the grid and attached metadata to it, you can easily find it again. Manual queries can be performed via the File Catalog Client. For example, to find the LFNs of frames with a start time (UNIX timestamp) greater than a certain value, use:
FC:/>cd /
FC:/>find . start_time>1371575425
Query: {'start_time': {'>': 1371575425}}
/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575426-414549.txt
/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575428-551945.txt
/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575427-489662.txt
QueryTime 0.00 sec
FC:/>
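The `Query:` line in that output shows how the search is represented: a dictionary mapping each metadata name to a dictionary of operator and reference value. To illustrate what such a query means, the sketch below evaluates the same dictionary against a local table of values. It is an emulation for understanding only, not how the DFC implements queries; the operator table is an assumption covering simple comparisons, and the start times are assumed from the timestamps embedded in the file names.

```python
import operator

# Comparison operators supported by this local emulation (an assumption).
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "!=": operator.ne,
}


def matches(metadata, query):
    """Return True if one file's metadata satisfies every query condition."""
    for field, condition in query.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            for op, reference in condition.items():
                if not OPERATORS[op](value, reference):
                    return False
        elif value != condition:  # a bare value means equality
            return False
    return True


# Start times assumed from the file names listed earlier.
frames = {
    "B06-W0212_1371575424-293207.txt": {"start_time": 1371575424},
    "B06-W0212_1371575425-337648.txt": {"start_time": 1371575425},
    "B06-W0212_1371575426-414549.txt": {"start_time": 1371575426},
    "B06-W0212_1371575427-489662.txt": {"start_time": 1371575427},
    "B06-W0212_1371575428-551945.txt": {"start_time": 1371575428},
}

query = {"start_time": {">": 1371575425}}
selected = sorted(name for name, meta in frames.items() if matches(meta, query))
for name in selected:
    print(name)
```

Under these assumptions the same three files are selected as in the File Catalog Client output above.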
However, it's far more useful to be able to perform these queries within a Python script - for example, if you're retrieving a list of LFNs for the job's `InputData` based on certain metadata. Again, we've created a script with some example queries you can use. This can be run with:
$ python perform_query.py
#############################################
* GridPP and DIRAC: user metadata - queries *
#############################################
* Metadata query: {'n_pixel': {'>': 700}}
Found: '/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575424-293207.txt'.
Found: '/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575428-551945.txt'.
Found: '/cernatschool.org/user/t/t.whyntie/diractest003/B06-W0212_1371575427-489662.txt'.
These are the three frames that have more than 700 hit pixels in them. The next step is to use this list of LFNs as the input to another grid job via the `setInputSandbox` method.
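In your own scripts, the query step usually boils down to one client call plus a check of the returned result dictionary. The following sketch assumes the catalog client provides `findFilesByMetadata(query, path=...)` returning `{'OK': True, 'Value': [list of LFNs]}` on success; it is not the contents of `perform_query.py`. The `StubCatalog` class is a hypothetical stand-in that ignores the query and returns canned LFNs, so the example can be run offline.

```python
def find_lfns(catalog, query, path="/"):
    """Run a metadata query and return the matching LFNs, sorted."""
    result = catalog.findFilesByMetadata(query, path=path)
    if not result["OK"]:
        raise RuntimeError("Metadata query failed: %s"
                           % result.get("Message", "unknown error"))
    return sorted(result["Value"])


class StubCatalog(object):
    """Hypothetical stand-in: ignores the query, returns canned LFNs."""

    def __init__(self, lfns):
        self._lfns = list(lfns)

    def findFilesByMetadata(self, query, path="/"):
        hits = [lfn for lfn in self._lfns if lfn.startswith(path)]
        return {"OK": True, "Value": hits}


base = "/cernatschool.org/user/t/t.whyntie/diractest003/"
catalog = StubCatalog([
    base + "B06-W0212_1371575428-551945.txt",
    base + "B06-W0212_1371575424-293207.txt",
    base + "B06-W0212_1371575427-489662.txt",
])

input_lfns = find_lfns(catalog, {"n_pixel": {">": 700}})
for lfn in input_lfns:
    print("Found: '%s'." % lfn)
```

The resulting `input_lfns` list is what you would then hand to the follow-on job.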
##Useful links