

<h1  style="text-align: center; color: green"> CAWCR-BoM ACCESS NWP Ngamai Migration Working Group</h1>

CAWCR-BoM ACCESS-NWP Ngamai Porting Working Group Meeting Notes

Meeting 10: Wednesday 18th September 2013, 9E Meeting Room

Present: Joerg Henrichs, Ed Habjan, Ilia Bermous, Jim Fraser, Zhihong Li, Wenming Lu, Joan Fernon, Chris Tingwell, Michael Naughton, Robin Bowen, Asri Sulaiman, Yi Xiao

Apologies: Martin Dix


Agenda

  • List from previous meeting notes
  • Task List
  • AOB

AG1

AR1

  • NMOC suite has now run to mid-August.

  • Rapidly filling up disk space.

  • Strategies being developed to handle disk space

    • Use Frames
    • Delete some pi files
    • Switch off saving staging files
    • rsync to backup disks.
  • Generally running smoothly.

    • A new "Frame" executable is required to handle hourly data.
    • Martin to provide new executable.
  • MARS archiving not yet done - waiting for MARS server to be available

  • Script tidy-ups on-going.

  • Runs of Xiao's suite have been discontinued.

    • Verification has been done
    • Disk space issues if runs continue
    • NMOC suite has started.
  • Verification of results from NMOC's suite is not yet done

  • MARS issues will impact plotting tasks.

MARS / SAM

  • Robin reported that there are serious bottlenecks in handling data generated on ngamai
    • MARS7 <-> sam connection may take weeks/months to resolve
    • Problem affects MARS7; MARS1/2 are OK
  • SAM issues may impact operational requirements
    • Operational needs are greater than current capacity.
  • There is also a tape shortage which may impact general SAM usage at some stage.

AC1

  • NMOC's ACCESS-C will start next week

  • Wenming's ACCESS-C has been stopped

    • Email on status sent out.
    • Results look quite close.
      • Minor systematic difference
      • The difference was in SE Queensland: a small systematic difference of 0.6 in one direction.
      • Comparison suite was nested in the solar suite; the new one was nested in Xiao's ngamai suite.
      • Difference is puzzling, but acceptable. No need to investigate further for now.
      • Review further when NMOC's suite starts running.
  • Robin reported that Chris Bridge has got Verify working on ngamai.

    • Suggest verifying ACCESS-C July results when (namelist settings are identical and) the only difference is the machine.

ATC1

  • ACCESS-TC suite on ngamai is an updated version of ATC1 suite on solar
  • Currently working properly on ngamai
  • Verification (Pewa, Kong-Rey plots) showed significant improvement over ATC10 version
  • No need to port ATC10 version.

NGAMAI ISSUES

  • Xiao encountered issues with obs processing task after suspend/resume was introduced on ngamai
    • Multiple of 12 cores with "node exclusive" was also introduced around the same time
  • Another change on ngamai is the change in node allocation
    • CAWCR and NMOC jobs are initiated from different ends of available nodes
    • Chance of getting the same node with re-submission is now much higher.
    • A node monitoring script has been started by James.
  • Re-submission solves the problem
  • Other suites are no longer running, so the problem has only been noticed with ACCESS-TC
  • NMOC suite does not encounter a similar issue?
  • There have been issues which went unreported since they are usually solved through re-submission
  • Everyone is asked to report all issues to ngamai_help, since this may help identify bad nodes
  • Wenming to report similar issues if encountered by ACCESS-C
  • Nagios and Ganglia applications are running on ngamai. Displays are available.
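
The point about reporting every failure to help identify bad nodes could be supported by a simple tally of failures per node. The following is a minimal sketch only: the record format (job name, node name), the threshold and the node names are illustrative assumptions, not taken from ngamai_help or from James's monitoring script.

```python
from collections import Counter


def suspect_nodes(failure_reports, min_failures=2):
    """Return nodes that appear in at least `min_failures` failure reports.

    `failure_reports` is an iterable of (job_name, node_name) pairs.
    Both the record format and the threshold are illustrative assumptions.
    """
    counts = Counter(node for _job, node in failure_reports)
    return sorted(node for node, n in counts.items() if n >= min_failures)


# Hypothetical reports: job and node names are made up for illustration.
reports = [
    ("obs_processing", "node017"),
    ("obs_processing", "node017"),
    ("um_forecast", "node102"),
]

if __name__ == "__main__":
    print(suspect_nodes(reports))
```

A tally like this only becomes useful if failures that were fixed by re-submission are also reported, which is the reason for the request above.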

Run time variation

  • Joerg has taken a close look at system parameters
    • There are issues associated with the Linux kernel's handling of "Huge table"
      • This is a new feature introduced in Linux 6
      • Introduces system overhead to perform defragmentation when large- and small-memory jobs are mixed.
    • A set of 45 nodes has been reset with a different kernel parameter
  • Ilia's tests of the special queue using the test nodes have achieved very consistent elapsed times (0.5% variation), but (second run?) 30% slower.
    • To test further with new executable.
    • Ilia has been using a UM7.5 R12 job
    • Joerg has been using an N320 job - will move to using the same test job as Ilia's.
  • Ed said Oracle has turned off "transparent huge pages" in the special nodes
    • Oracle test runs have given consistently good results
    • Difference between 1st and 2nd run in a job with 2 runs is unexpected
    • Performance is also dependent on job-mix.
    • Further tuning possible.
    • Further tuning possible.
  • ACTION: increase "special nodes" from 45 to 90 to allow more testing/larger jobs.
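
As a way of confirming which transparent huge pages setting a given node is actually running with, the kernel's sysfs entry can be read. This is a minimal sketch, not the procedure Oracle or Joerg used; the notes do not say which kernel parameter was changed. The first path is the mainline kernel location; the second is the location used by some RHEL 6 era kernels. Whether either exists on ngamai is an assumption.

```python
from pathlib import Path

# Candidate sysfs locations (assumption: one of these exists on the node).
THP_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)


def parse_thp_setting(text):
    """Return the active option from a line such as 'always madvise [never]'.

    The kernel marks the active option with square brackets.
    Returns None if no bracketed option is found.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(paths=THP_PATHS):
    """Return the active THP option from the first readable path, else None."""
    for candidate in paths:
        try:
            return parse_thp_setting(Path(candidate).read_text())
        except OSError:
            continue  # path missing or unreadable on this system
    return None


if __name__ == "__main__":
    print(read_thp_setting())
```

Running this on a node in the special queue and on a standard node would show whether the two groups really differ in this setting before elapsed times are compared.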

Re-Configuration executable

  • Joerg working on test script.

Executable build procedures and documentation

  • A meeting has been held between sub-group Xiao, Martin, Asri and Ilia to finalise approach

  • Documentation page is under construction

  • Issue with job for building Global executable labelled "Regional" by UMUI has been successfully investigated.

  • PRG_ENV to be used to specify programming environments such as compiler version, OpenMPI version, etc.

    • To be used for both build and run jobs
    • Requires an update of UMUI to ensure that environment settings are sourced at the beginning of both types of job.
  • VAR build encountered a ksh problem

    • Some differences in ksh behaviour between solar and ngamai
    • Peter Steinle also encountered a ksh problem
  • SCS CGI monitor still not working due to Perl issues.

    • ACTION: rab to follow up.

UMUI, SVN and TRAC

  • Work on UIs on ngamai has progressed further.

  • UMUI, OPSUI, VARUI now active

  • SCSUI to be added

  • Jobs database migration strategy being considered to minimise confusion

  • Planning for SVN and TRAC system progressing

  • UM mirror of repository on access-svn now active on ngamai

  • Initial work on SVN and TRAC migration underway. Actual migration planned for early October.

  • Users can use svn on ngamai now

    • Initially, need to do once-only userid/password setup for accessing svn servers such as access-svn
    • This should be documented and users advised.

AOB

  • Setup of rose/cylc on ngamai is required for new suites.

    • Requires additional Python libraries
    • Xiao to look at installation
  • "at" has been installed on ngamai to cater for the UMUI build job requirement.

  • Compilation on ngamai computing nodes is currently not possible, primarily because the Intel compiler license server runs on a different machine. Also, some tools such as gcc have not been installed on the computing nodes.

  • The need for compilation on computing nodes is not high priority, but:

    • Effort to allow compilation on computing node should continue in the background
      • Allow similar work environment as on raijin
      • Allow combining build and run job.
      • Simplify job setups.
  • Robin to evaluate

TASK LIST

***** NEXT MEETING: Wed 2nd October, 11am, 9E Meeting Room. *****

[ 23-24/9/2013 ] azs, first cut. [ 24/9/2013 ] mjn, minor wording change. [ 11/10/2013 ] azs, minor update from Ilia's feedback.
