Rose_Rose and Cylc FAQ - ACCESS-NRI/accessdev-Trac-archive GitHub Wiki

Frequently Asked Questions about Rose and Cylc

Committing Suites into Rosie

Q: How do I commit a suite "A" into rosie on accessdev?

A: 1. Run rosie create. An editor will open for you to enter the suite information. After exiting the editor, type y to confirm. You will see a message like

[INFO] au-aa153: created at 
       svn+ssh://accessdev.nci.org.au/home/access-svn/roses_au_svn/a/a/1/5/3
[INFO] au-aa153: 
       local copy created at /home/548/wml548/roses/au-aa153
  2. Copy everything in suite "A" to suite au-aa153.
  3. Run fcm status to check the status of all files, and run fcm add FILE for each file you are committing to the repository.
  4. Run fcm commit to put the suite into svn.
  5. Other users can then run rosie checkout au-aa153 to check out the suite.

Q: I get a "gvim -f" error when running rosie copy or rosie create

rose.popen.RosePopenError: gvim\ -f /local/dp9/zxs548/tmp/tmpDOlOS_
# return-code=1, stderr= [Errno 2] No such file or directory: 'gvim -f'

A: In your $HOME/.metomi/rose.conf, change the lines

[external] 
editor="gvim -f" 
geditor="gvim -f" 

to

[external] 
editor=gvim -f 
geditor=gvim -f 

or just delete your $HOME/.metomi/rose.conf to use the system default /usr/local/rose/etc/rose.conf.
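The quoted values appear to be treated as a single command name, which is why the error reports that 'gvim -f' does not exist. As a minimal sketch, the quotes could be stripped with sed rather than by hand. The function name fix_rose_conf is hypothetical, and a backup copy is kept so the result can be checked.

```shell
#!/bin/sh
# Hypothetical helper: strip surrounding double quotes from the editor and
# geditor values in a Rose configuration file.
fix_rose_conf() {
    conf="$1"
    # Keep a backup copy before rewriting the file.
    cp "$conf" "$conf.bak"
    # Turn editor="gvim -f" into editor=gvim -f (same for geditor),
    # dropping any trailing whitespace; other lines pass through unchanged.
    sed -e 's/^\(g\{0,1\}editor\)="\(.*\)"[[:space:]]*$/\1=\2/' "$conf.bak" > "$conf"
}
```

It would be called as fix_rose_conf "$HOME/.metomi/rose.conf".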

Viewing Suites

Q: How do I view a list of Rose suites on accessdev?

A: Run the command rosie go. By default this shows the suites you have checked out. You can also search by username, or search for the string "au" to show everything in the local repository or "u" for the MOSRS repository.

Q: How do I view a list of my running Cylc suites on accessdev?

A: Run the command cylc scan for a text list or cylc gsummary for a GUI. By default this shows the suites you are currently running. Double-clicking on one will pop up a separate Cylc window for that suite.

Running Suites

Q: How do I validate a suite.rc?

A: When designing or testing a new suite, you will often want to validate the definition file suite.rc to remove any errors before testing jobs on the HPC. The best way to validate a suite.rc that is still in progress is to run the suite in simulation mode, which does not actually send jobs to the supercomputer:

rose suite-run -- --mode=simulation

Q: How do I view log files from running a suite?

A: Log files are located on accessdev in $HOME/cylc-run/SUITE/log/ where SUITE is the name of the suite. Alternatively log files can be viewed using the Rose Bush web interface https://accessdev.nci.org.au/rose-bush/

Q: I made changes to a suite that is running. How do I load these changes without stopping the suite?

A: Run the command rose suite-run --reload

Q: How do I allow other users to monitor my suite that is running?

A: While the suite is running, run the command rose monitor --allow SUITE, where SUITE is the name of the suite. Other users can then monitor this suite by running rose monitor --user USER SUITE, where USER is the login id of the user who is running the suite.

Q: How do I run the model under the totalview debugger?

A: See access/TotalviewCylc

Q: How do I reduce my suite's disk usage?

A: See access/MinimisingDiskUse

Rose and cylc version updates

See access/RoseCylcVersions for information on what happens when versions are updated and how to run with specific versions.

Restarting after an accessdev reboot

accessdev may be rebooted as part of the periodic NCI maintenance. This will be announced in advance but will interrupt any running rose/cylc suites.

The simplest solution is to stop your suites in advance with cylc stop SUITEID and restart them afterwards.

If you don't stop your suite beforehand the reboot will kill the controlling cylc process of any suites you have running. When raijin restarts, held jobs will complete but suites will then stop because they can’t communicate with the cylc process on accessdev.

To continue the run, use cylc restart SUITEID. This first checks the status of jobs on raijin, so it won’t rerun anything unnecessarily. With cylc6 suites you may also need to remove the file ~/.cylc/ports/SUITEID on accessdev.

If you have a long running suite that was started with a version of rose/cylc that is no longer the current default, you should specify the original versions, e.g.

CYLC_VERSION=6.7.2 ROSE_VERSION=2015.11.0 cylc restart SUITEID

If you’re not sure which versions were used, you can check CYLC_VERSION and ROSE_VERSION in the processed suite.rc file, ~/cylc-run/SUITEID/suite.rc. Note this is the processed version in cylc-run, not the original one in ~/roses.
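As a rough sketch, the relevant lines could be pulled out of the processed file with grep. The function name show_suite_versions is hypothetical, and the sketch only assumes that the two variable names appear somewhere in the file; the exact layout of those lines is not assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a processed suite.rc that
# mention CYLC_VERSION or ROSE_VERSION.
# Assumption: the variables appear by name in the file; nothing is
# assumed about how the surrounding line is laid out.
show_suite_versions() {
    grep -E '(CYLC|ROSE)_VERSION' "$1"
}
```

For example, show_suite_versions ~/cylc-run/SUITEID/suite.rc, with SUITEID replaced by the name of your suite.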

Error Messages

Q: I get an authentication failure at a pre-build task in Cylc on a machine $HPC (which could be raijin or accessdev)

> reason: Username: svn: OPTIONS of 
> ‘https://access-svn.nci.org.au/svn/cmip5/trunk/bin': authorisation 
> failed:…

A: On $HPC, try running

svn ls https://access-svn.nci.org.au/svn/cmip5

Enter your password when prompted, then run the following to make sure the stored password is readable only by you:

chmod 600 ~/.subversion/auth/svn.simple/*

Please change the svn location to suit your case.
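To confirm the chmod took effect, a small check along these lines could be used. The function name list_exposed_files is hypothetical; it simply lists any regular file under the given directory whose mode is not exactly 600.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory whose
# permissions are not exactly 600 (read/write for the owner only).
# No output means every file is restricted to the owner.
list_exposed_files() {
    find "$1" -type f ! -perm 600
}
```

For the case above it would be run as list_exposed_files ~/.subversion/auth/svn.simple.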

Q: I get the error message ...Killed rose $* when running a suite

> [FAIL] ssh -oBatchMode=yes raijin.nci.org.au bash --login -c 
> \'ROSE_VERSION=2014-05\ /projects/access/bin/rose\ suite-run\ -v\ -v\  
> --name=au-aa147\ --run=run\  
> --remote=uuid=909dcfca-27af-45a7-8e4f-b838ab69fff8,root-dir-share=/sho  
> rt/$PROJECT/$USER,root-dir-work=/short/$PROJECT/$USER\' #  
> return-code=137, stderr= [FAIL] /projects/access/bin/rose: line 10:  
> 18963 Killed rose $*  

A: Check your quota on raijin with lquota to make sure you have not exceeded your limit on raijin $HOME.

Q: Cylc is able to submit jobs to raijin but gets an error message like

Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ERROR: remote command failed 255
Received signal ERR
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ERROR: remote command failed 255

A: Check that ssh works in each direction: raijin to raijin, raijin to accessdev, accessdev to accessdev, and accessdev to raijin. For example, to check accessdev to raijin, run on accessdev

ssh raijin ls

The connection is fine if the contents of your raijin $HOME are listed. Then try running

ssh raijin cylc

If you are asked for your password, cancel the test and run on accessdev

remote-job-submission

This sets up Cylc to be able to send jobs to raijin compute nodes.
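The four directions can also be checked in one pass. The following is a sketch only: the function name check_ssh_pairs is hypothetical, and it assumes the short host names accessdev and raijin resolve from both machines. The standard OpenSSH option BatchMode=yes makes ssh fail instead of prompting for a password, so a prompt shows up as a FAILED line.

```shell
#!/bin/sh
# Hypothetical helper: report whether passwordless ssh works between
# every pair of the named hosts. Returns non-zero if any pair fails.
check_ssh_pairs() {
    failed=0
    for src in "$@"; do
        for dst in "$@"; do
            # Log in to $src first, then try to reach $dst from there.
            # BatchMode=yes turns a password prompt into a failure.
            if ssh -oBatchMode=yes "$src" ssh -oBatchMode=yes "$dst" true </dev/null; then
                echo "ok: $src -> $dst"
            else
                echo "FAILED: $src -> $dst"
                failed=1
            fi
        done
    done
    return "$failed"
}
```

Run on accessdev as check_ssh_pairs accessdev raijin. Any FAILED line points to the direction that needs attention.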

Q: Rose gives an error message about a suite still running when I start a suite

E.g.,

[FAIL] Suite "access_x_vn7.6_4km" may still be running.
[FAIL] Host "localhost" has port-file:
[FAIL] ~wml548/.cylc/ports/access_x_vn7.6_4km
[FAIL] Try "rose suite-shutdown access_x_vn7.6_4km" first?

A: On accessdev, delete the port file access_x_vn7.6_4km in ~wml548/.cylc/ports/ (substituting your own username and suite name), then rerun rose suite-run in /home/548/wml548/roses/access_x_vn7.6_4km. If this still does not work, try the solutions in the next answer.

Q: Rose gives the error message "[FAIL] [Errno 16] Device or resource busy: 'log.20140415T053044Z/suite/.nfs0000000000e458190000057e'" when running rose suite-run

Note: The solutions here may also apply more generally when a suite cannot be run or restarted.

A: This message usually appears when a process such as an editor has one of the log files or job scripts from the suite open. Try the following in order:

  • Close any such processes and re-run rose suite-run.
  • If that does not work, run rose suite-shutdown or cylc control stop $suite.
  • Run ps -ef | grep $suite and kill any remaining process related to $suite with kill -9 $JOBNO, where $JOBNO is the process ID shown by ps.

Q: My log files from raijin don't appear on accessdev

A: There may be a delay in the generation of STDOUT and STDERR log files created from a PBS job. Cylc is configured to retry several times after a delay. Note that files larger than 2 MB will not be automatically retrieved, though this can be overridden in your suite. If you're using a version of cylc earlier than 6.9.1, event hooks should call the wrapper script rose task-hook2 instead of rose task-hook to ensure log files are pulled back from raijin to accessdev.

Q: I'm receiving an error about not finding the rose or cylc command

A: Check whether your suite.rc file contains the following lines in the initial scripting section

module use ~access/modules
module load rose
module load cylc

You may also need to check your .bashrc file for conflicts with the modules loaded there or with your $PATH environment variable.
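One way to narrow this down is to test whether the commands are visible on the current PATH after the module lines have run. The sketch below uses the POSIX command -v builtin; the function name check_commands is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named commands can be found
# on the current PATH. Returns non-zero if any are missing.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_commands rose cylc in a login shell, or after the module load lines, shows whether either command is missing from that environment.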