access_wml548sms20120926 - ACCESS-NRI/accessdev-Trac-archive GitHub Wiki

Running SMS on the NCI/BOM supercomputers

Author: Wenming LU, E-mail: [email protected], Tel: 03 96694528

Updates

lwenming 2012 OCT 02 Initial creation of the page

lwenming 2013 JUL 18 SMS available on raijin

lwenming 2013 OCT 08 SMS available on ngamai

SMS Background Information

SMS is a scheduling system that lets users run a series of jobs with predefined timing and dependencies. For detailed information on SMS, please refer to

http://www.ecmwf.int/publications/manuals/sms/

This wiki assumes that readers already have some knowledge of SMS and focuses on how to run an SMS server and test an experimental suite on NCI/BOM:

  • sms module
  • Starting and stopping the SMS server on the NCI/BOM supercomputers (raijin at NCI and ngamai at BOM)
  • Using the GUI client interface xcdp
  • Playing a test suite into SMS
  • Managing the test suite through xcdp

sms module

A module, sms, is available on raijin; it provides a user-friendly environment for running the SMS server and suites.

To use the sms module on raijin, run the following commands,

module use ~access/modules

module load sms

On ngamai, BOM's operational centre NMOC runs a different version of SMS, so the module there is named smsre, standing for research SMS,

module use ~access/modules

module load smsre
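
The choice between the two module names can be scripted. The following is only a minimal sketch, assuming the login hosts are named raijin* and ngamai* as in the examples on this page; the function name sms_module_for_host is hypothetical and is not part of the sms module.

```bash
# Hypothetical helper: print the SMS module name for a given host name.
# Assumes host names start with "raijin" or "ngamai".
sms_module_for_host() {
    case "$1" in
        raijin*) echo "sms" ;;
        ngamai*) echo "smsre" ;;
        *) echo "unknown host: $1" >&2; return 1 ;;
    esac
}

# Typical use on a login node (commented out because the `module`
# command only exists where Environment Modules is installed):
#   module use ~access/modules
#   module load "$(sms_module_for_host "$(hostname)")"
```

For example, sms_module_for_host ngamai02 prints smsre, while an unrecognised host name produces an error message and a non-zero return code.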

Running SMS server

After loading the sms module, type

sms_start.sh

to start your own SMS server. You should be able to run SMS server on any raijin/ngamai main node.

Note: Please be consistent: the main node on which the SMS server is running must match the variable SMSHOST/SMS_HOST in your suite definition file and SMS scripts.

To stop the SMS server, type

sms_stop.sh

Note that it is very important to use the script sms_stop.sh to terminate the SMS server. sms_stop.sh terminates the server and properly releases the port numbers it used back to the system. If you stop the server by another method, such as kill, you will have problems restarting it: the port numbers will not have been released, so sms_start.sh cannot attach the SMS server to the default port number specified (900000+$UID).

SMS client interface xcdp

Please note that all screenshots were taken on vayu, a decommissioned NCI supercomputer; they should look the same on any SMS host machine.

xcdp is an X Window based GUI tool that is very easy to work with. Type

xcdp

to start xcdp. Once it has started, you will see a window like the one below. Go to the menu Edit->Preferences...,

xcdp0.JPG, 100% xcdp1.JPG, 100%

In the pop-up window, select the Servers tab and enter your SMS server details there,

xcdp2.JPG, 90% xcdp3.JPG, 90%

Here are details of how to set up those items,

Item | Value | Comment
Name | ngamai02_lwenming or raijin2_wml548 | Anything, but preferably meaningful; my convention is $hostname_$USER
Host | ngamai02 or raijin2 | SMSHOST
Number | 906674 | SMS_PROG (900000+$UID, specified in sms_start.sh); in my case, e.g. 906674
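
The three values can also be derived on the command line. The following is a minimal sketch, assuming the naming convention $hostname_$USER and the default number 900000+$UID described on this page; the function name xcdp_server_entry is hypothetical.

```bash
# Hypothetical helper: print Name, Host and Number for the xcdp Servers tab.
# Optional arguments: host name, user name, numeric UID.
# Missing arguments default to the current host, user and UID.
xcdp_server_entry() {
    local host="${1:-$(hostname -s)}"
    local user="${2:-$(id -un)}"
    local uid="${3:-$(id -u)}"
    echo "Name:   ${host}_${user}"
    echo "Host:   ${host}"
    echo "Number: $((900000 + uid))"
}
```

Called without arguments on the node where the SMS server runs, it reports the values for the current user. With explicit illustrative arguments, xcdp_server_entry raijin2 wml548 6674 reports the name raijin2_wml548 and the number 906674.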

Close the pop-up window and go back to the Servers menu of the main window, where you can see all SMS servers defined in the previous step. Click on the server you specified and it will appear in the main body of xcdp,

xcdp4.JPG, 100% xcdp5.JPG,100%

First SMS suite: mytest

Make sure the sms module has been loaded. Then type,

sms_setup.sh

This command does the following things,

  • Create a folder sms at $HOME
  • Copy include files to $HOME/sms: access_sms_include for tasks on $SMSHOST (here raijin/ngamai); access_nci_include for jobs on the NCI supercomputer (here also raijin/ngamai)
  • Copy the test suite mytest to $HOME/sms/suite; the definition file and SMS scripts are tailored to your own environment, such as $USER, $HOST, $PROJECT etc.

We are now ready to send the test suite mytest to the SMS server,

play_suite.sh mytest

Right-click the server box in xcdp, choose suites in the pop-up window and select the suite mytest; finally, click the green button to refresh the server status. You should now see the suite mytest within the SMS server. Type,

begin_suite.sh mytest

mytest is then ready to send tasks to the supercomputers (the colour of mytest should change from dark gray to blue). If you need to replay the suite, run

cancel_suite.sh mytest

followed by play_suite.sh mytest and begin_suite.sh mytest to replay the suite.
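
The replay sequence can be wrapped in a small shell function. This is only a sketch, assuming the sms module is loaded so that cancel_suite.sh, play_suite.sh and begin_suite.sh are on the PATH; the function name replay_suite is hypothetical and is not provided by the module.

```bash
# Hypothetical wrapper: cancel, play and begin a suite in one step.
# The && chain stops at the first command that fails.
replay_suite() {
    local suite="$1"
    if [ -z "$suite" ]; then
        echo "usage: replay_suite <suite>" >&2
        return 2
    fi
    cancel_suite.sh "$suite" &&
        play_suite.sh "$suite" &&
        begin_suite.sh "$suite"
}
```

With the function defined in the current shell, replay_suite mytest runs the three scripts in order.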

Managing suite mytest on xcdp

The suite mytest has the structure as follows,

  • mytest #SUITE
  • -> test1 #FAMILY
  • --> test_local #local TASK; running on SMSHOST
  • --> test_nci #remote TASK; running on PBSHOST/SGEHOST, i.e., raijin/ngamai
  • -> admin #FAMILY
  • --> clean #local TASK

In our SMS structure, there are two machines:

  • SMSHOST: machine running SMS server
  • PBSHOST/SGEHOST: machine running jobs in PBS/SGE queuing systems

In practice, SMSHOST can be the same machine as PBSHOST/SGEHOST. However, a remote run is still executed as if SMSHOST and PBSHOST/SGEHOST were two different machines. In the NCI environment, you may choose accessdev, accessprod or even raijin/ngamai as SMSHOST, but raijin/ngamai is always PBSHOST/SGEHOST.

All tasks in mytest should run successfully if SMS is set up properly,

  • test_local: local run on SMSHOST; touches a file from_local_to_local.$$ in $HOME/smsout/mytest
  • test_nci: remote run on PBS/SGE; touches a file from_nci_to_local.$$ in $HOME/smsout/mytest
  • clean: local run on SMSHOST; cleans up outputs in $HOME/smsout/mytest
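
Whether the two test tasks have left their marker files can be checked from the shell, provided this is done before the clean task removes them. A minimal sketch, assuming the file names and the directory $HOME/smsout/mytest given above; the function name check_mytest_outputs is hypothetical.

```bash
# Hypothetical check: report which marker files exist for the test tasks.
# The directory defaults to $HOME/smsout/mytest; the suffix after the dot
# (written as $$ above) is matched with a wildcard.
check_mytest_outputs() {
    local dir="${1:-$HOME/smsout/mytest}"
    local prefix f found status=0
    for prefix in from_local_to_local from_nci_to_local; do
        found=no
        for f in "$dir/$prefix".*; do
            if [ -f "$f" ]; then found=yes; break; fi
        done
        if [ "$found" = yes ]; then
            echo "ok: $prefix"
        else
            echo "missing: $prefix"
            status=1
        fi
    done
    return "$status"
}
```

The function returns a non-zero status if either marker file is missing, so it can be used in a conditional such as check_mytest_outputs && echo "both tasks ran".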

