Meeting 01 27 2017 (Brookhaven) - ATLAS-Titan/misc GitHub Wiki
Meeting in Brookhaven 01/27/2017
Agenda:
- Last test with NGE: status and current problems.
- NGE: how to integrate it with PanDA.
- PanDA paper: scope of the paper; what tests should we do? Discussion about the structure of the paper.
- Queue time: Presentation about TITAN's batch queue and user behaviour.
Discussion:
- Scalability problems have been solved, but a new problem has emerged: at the moment, NGE increases the execution time of CUs. We opened a discussion with Andre to understand the problem. We guessed that CUs might not be placed on different cores and would therefore compete for the same resources. Although this might be an explanation, we did not find evidence for it.
- We imagined an interaction in which NGE submits pilot jobs on TITAN in isolation and then notifies PanDA when the job is running. At that point PanDA sends jobs to the pilot as if it were using backfill. The communication we imagined is very basic: a DB or a common area where PanDA writes which CUs must be executed and NGE reads them. If NGE does not receive jobs for a while, it terminates its execution on the nodes.
- The paper must focus on the current achievements of PanDA on TITAN. Additionally, it must provide a logical scheme of PanDA, the ATLAS workflow, and their interaction with TITAN. Finally, it should present NGE as a possible enhancement and show how it can be used on the batch queue. Tests on the batch queue will consider the maximum number of nodes and wall-time available in each bin of TITAN. In this way we try to mimic the greedy backfill approach on the batch queue.
- Plots of user behaviour show counterintuitive results. In particular, the majority of users ask for a small number of nodes for no more than 120 minutes. This goes against TITAN's statement "User should ask for the maximum possible". Queue times were presented both as a function of the bin and as a function of the wall-time. They show promising results for the second bin with small wall-time (< 120 minutes).
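The basic PanDA/NGE exchange discussed above (PanDA writes CUs to a shared DB, NGE reads them and shuts down after an idle period) could look roughly like the following. This is only a minimal sketch, not an agreed design: SQLite stands in for the unspecified DB, and the table layout, the function names (`open_queue`, `panda_submit`, `nge_claim`, `nge_poll_loop`), and the timeout values are all hypothetical. It also assumes a single NGE reader per database.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Create (or open) the shared table. The schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS cus ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "command TEXT NOT NULL, "
        "state TEXT NOT NULL DEFAULT 'new')"
    )
    db.commit()
    return db


def panda_submit(db, command):
    """PanDA side: record one CU that the pilot should execute."""
    cur = db.execute("INSERT INTO cus (command) VALUES (?)", (command,))
    db.commit()
    return cur.lastrowid


def nge_claim(db):
    """NGE side: fetch every CU still marked 'new' and mark it 'claimed'."""
    rows = db.execute(
        "SELECT id, command FROM cus WHERE state = 'new' ORDER BY id"
    ).fetchall()
    with db:  # commits the state change on success, rolls back on error
        db.executemany(
            "UPDATE cus SET state = 'claimed' WHERE id = ?",
            [(cu_id,) for cu_id, _ in rows],
        )
    return rows


def nge_poll_loop(db, run_cu, idle_timeout=300.0, poll_interval=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll for CUs; return once none has arrived for idle_timeout seconds.

    run_cu(cu_id, command) is a caller-supplied callback that executes a CU.
    clock and sleep are injectable so the loop can be tested without waiting.
    Returns the number of CUs handed to run_cu.
    """
    executed = 0
    last_work = clock()
    while clock() - last_work < idle_timeout:
        claimed = nge_claim(db)
        for cu_id, command in claimed:
            run_cu(cu_id, command)
            executed += 1
        if claimed:
            last_work = clock()  # work arrived: restart the idle timer
        else:
            sleep(poll_interval)
    return executed
```

In this sketch the PanDA side would call `panda_submit(db, "...")` for each CU, while the pilot runs `nge_poll_loop(db, run_cu)` and releases the nodes when the loop returns. A file in a common area could replace the DB without changing the polling logic.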
Conclusions:
- NGE still has problems.
- Interaction between PanDA and NGE must be as simple as possible. A small daemon that lets NGE read information from a DB would be more than enough to enable communication with PanDA.
- We reached an agreement on the experiments and the structure of the paper. Time for the experiments is very tight.
- Plots of queue time will be very useful for experiments on the batch queue. These plots might be included in the paper.
TO DO:
- Solve the problem as soon as possible.
- Start developing a daemon to enable the communication (as soon as NGE works properly and experiments are done).
- Blocked until NGE works.
- Investigate further.