Paper meeting, 01/17/2017 (ATLAS-Titan/misc GitHub Wiki)

People present:

  1. Shantenu
  2. Sergey
  3. Matteo
  4. Ming
  5. Alessio

Discussion:

The discussion focused on three different topics:

  1. The performance of the Next Generation Executor (NGE): we discussed some experiments. The experiments showed the current limits of NGE in terms of the number of cores and compute units (CUs) it can handle. A brief description of the experiment is given at the end of the page.
  2. Agenda for the meeting at Brookhaven: an outline of the discussion that will take place at Brookhaven.
  3. TITAN's batch queue: Sergey, Matteo and Alessio discussed some plots about users' behavior in TITAN's batch queue.

Conclusions:

  1. At the moment, it is not possible to determine the causes of NGE's limits. A presentation about NGE's main features is required.

  2. Brookhaven meeting should focus on:

    • Paper (what we should discuss and what the aims of the paper are);
    • Description of PanDA as a process (from the moment a PanDA task arrives in the system to the moment its execution ends);
    • General features of NGE;
    • Interfacing NGE with PanDA;
    • TITAN's batch queue;
    • Development of new submission strategies;
    • PanDA analytics (kick-start for a discussion about a single platform in which PanDA information can be stored and analyzed);
  3. The data presented are interesting and counterintuitive when compared with TITAN's mission statement ("jobs should be as large as possible").


TO DO:

  1. NGE requires further experiments. These should focus on finding the causes of the observed limits.
  2. Matteo should give a brief presentation about Radical Analytics.
  3. TITAN's batch queue data might be affected by backfilling and other "non-standard" user privileges. An email has been sent to Don Maxwell asking for clarification.

Experiment

Alessio performed some tests focused on finding the current limits of NGE in terms of the number of cores and the number of CUs it can handle. Two different workloads were used: `/bin/date` and a Gromacs simulation. The first is a bag of tasks, where each task consists of a single CU. Gromacs tasks, instead, consist of two different types of CU: the first prepares the environment and the second performs the molecular dynamics simulation. Accordingly, the first must complete before the second can start. As a further constraint, no simulation can start until all the pre-execution CUs have finished. All workloads use one core per CU. As a consequence, the Gromacs workload consists of 2*N CUs, where N is the number of cores required. The plot below depicts the percentage of completed CUs as a function of job size. Although the plot is based on a single run, similar results have been observed in many similar tests.

[Plot: percentage of completed jobs as a function of job size]
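The two workload structures described above can be sketched in a few lines of Python. This is a minimal illustration only: the names (`bag_of_tasks`, `gromacs_workload`, the executable strings) are hypothetical and do not reflect the actual NGE or RADICAL-Pilot API; the point is the 1*N vs. 2*N CU structure and the barrier between the Gromacs stages.

```python
def bag_of_tasks(n_cores):
    # /bin/date workload: N independent single-core CUs, no dependencies.
    return [{"executable": "/bin/date", "cores": 1} for _ in range(n_cores)]

def gromacs_workload(n_cores):
    # Gromacs workload: 2*N single-core CUs in two stages. Stage 1 prepares
    # the environment; stage 2 runs the simulations. All stage-1 CUs must
    # finish before any stage-2 CU starts (a global barrier between stages).
    # "prep_env" and "run_md" are placeholder executable names.
    prep = [{"executable": "prep_env", "cores": 1} for _ in range(n_cores)]
    sims = [{"executable": "run_md", "cores": 1} for _ in range(n_cores)]
    return [prep, sims]  # stages execute in order, barrier in between

# For N requested cores the Gromacs workload therefore contains 2*N CUs:
stages = gromacs_workload(8)
print(len(bag_of_tasks(8)), sum(len(s) for s in stages))
```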