DAQ Troubleshooting - PADME-Experiment/padme-fw GitHub Wiki

This page describes a few possible troubleshooting procedures which can be used if the usual Clean-up Procedure does not work. New procedures will be added when defined.

Checking the log files for a run

In the runs directory of the DAQ area one can find all configurations and log files produced during a run. E.g. runs/run_0000000_20181218_202212/cfg will contain the configuration files for run run_0000000_20181218_202212 while runs/run_0000000_20181218_202212/log will contain the log files for the same run.

Log files are named as <run>_<process>.log, e.g. run_0000000_20181219_080153_trigger.log is the log file for the trigger process. Looking into these files is usually the first thing to do when there is a problem.
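As a minimal sketch of the naming convention above, the following shell helper builds the path of a log file from the run name and the process name. The function name run_log and the RUNS_DIR variable are illustrative assumptions, not part of the DAQ software, and the helper assumes it is called from the DAQ area, where the runs directory lives.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the DAQ software).
# RUNS_DIR is an assumed variable pointing to the runs directory.
RUNS_DIR="${RUNS_DIR:-runs}"

# run_log <run> <process>
# Prints the path <RUNS_DIR>/<run>/log/<run>_<process>.log
run_log() {
    local run="$1" process="$2"
    printf '%s/%s/log/%s_%s.log\n' "$RUNS_DIR" "$run" "$run" "$process"
}
```

For example, tail -n 20 "$(run_log run_0000000_20181219_080153 trigger)" would show the last lines of the trigger log for that run.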

Relevant files:

<run>_trigger.log      Trigger
<run>_merger.log       Merger
<run>_lvl1_<n>.log     Level1 stream <n>. Usually n=00-04
<run>_b<n>_daq.log     ADC board <n> DAQ process. n=00-28
<run>_b<n>_zsup.log    ADC board <n> zero suppression process. n=00-28

Problems with the Merger

Any problem in the DAQ will immediately show up in the event merger log file. Since problems with the merger process are most often caused by problems with the ADC boards, the fastest recovery procedure is to stop the run, reset all VME crates, reset all NIM crates, and then do the Clean-up Procedure.

RunControl stuck while initializing a run or changing setup

On rare occasions (under investigation), the RunControl server gets stuck while executing the init_run or change_setup procedures. In this case the shifter must exit from the RunControl client with CTRL-C, identify the process associated with the RunControl server using the ps command, kill this process with kill -9, and finally restart the RunControl server.

[daq@l0padme1 DAQ]$ ps -fu daq | grep RunControl
daq       14622      1  0 11:27 ?        00:00:00 /usr/bin/python ./RunControl --server
daq       15288  61101  0 12:30 pts/2    00:00:00 grep RunControl
[daq@l0padme1 DAQ]$ kill -9 14622
[daq@l0padme1 DAQ]$ ./RunControl --server

N.B. the Clean-up Procedure is not needed here.
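The PID lookup above can also be scripted. The following is only a sketch: the helper name rc_server_pids is hypothetical, and it assumes that the server command line contains RunControl --server, as in the ps output shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps -f` style output on stdin and print the PID
# (second column) of every line whose command contains "RunControl --server".
# The [R] bracket trick keeps the pattern from matching awk's own command
# line when the helper is used in a pipeline with ps.
rc_server_pids() {
    awk '/[R]unControl --server/ { print $2 }'
}

# Possible usage on the DAQ node (commented out on purpose):
#   pids="$(ps -fu daq | rc_server_pids)"
#   [ -n "$pids" ] && kill -9 $pids
#   ./RunControl --server
```

The kill and restart steps are left commented out so that the sketch can be tried safely. Check the printed PID by eye before killing anything.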

Trigger initialization problem

If the run initialization procedure fails or times out before showing the trigger ready message, the problem is usually due to the trigger board getting stuck. In this case the shifter should follow the reset procedure described in the No triggers paragraph below.

Board initialization problem

If one of the boards gets stuck, it will not respond to the initialization procedure. In this case the list of adc NN ready messages will be incomplete and the system will not give back control to the operator for a long time (several minutes). If this happens, the best strategy is to follow the procedure described in Exiting from the system below, execute the usual Clean-up Procedure, then reset the VME crates and finally restart the whole system starting with the RunControl server. This whole procedure should be tried a couple of times before giving up. If the initialization still fails after that, it is time to call an expert.

If several boards fail at the initialization stage (message adc <n> fail on the RunControl client), this could be a problem with the clock signal propagation. A clear indication of this is that all boards up to a given one succeed and all (or most of) the others fail, e.g. 0->4 ok and 5->28 fail.
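To spot this pattern quickly, the client messages can be filtered for the lowest failing board. This is a sketch under stated assumptions: the helper name first_failed_board is hypothetical, and the input is assumed to contain one message per line in the form adc <n> ready or adc <n> fail, which may differ from the exact text printed by the RunControl client.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read messages on stdin (assumed format: "adc <n> ready"
# or "adc <n> fail", one per line) and print the lowest board number that
# reported "fail". Prints nothing if no board failed.
first_failed_board() {
    # "$2 + 0" turns a zero-padded number such as "05" into 5
    awk '$1 == "adc" && $3 == "fail" { print $2 + 0 }' | sort -n | head -n 1
}
```

If the helper prints 5 and boards 0 to 4 are ready, the clock chain is probably interrupted between board 4 and board 5.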

To verify whether this is the problem (and to fix it), first ask for access to the experimental area, then go to the VME crates and look at the PLLLCK light of each ADC board: if this light is ON, the clock signal is correctly synchronized; if it is OFF or blinking, there is a problem with the clock signal.

[Picture: PLLLCK light]

You should start this check from the VME crate on the right (close to the entrance door), looking at the left-most board (this is board 0) and then proceeding to the right. Then move to the VME crate on the left (close to the wall) and start again from the left-most board. Finally go to the target area and check the ADC board in the small red crate (board 28). You should find that the PLLLCK light is ON up to a given board and OFF or blinking for all the others: if this is the case, the problem is with the clock cable between the last ON board and the first OFF board.

[Picture: clock cable]

Reconnect the cable as shown in the picture above and verify that now the PLLLCK light of all boards is ON. Repeat the procedure if there are further interruptions along the chain.

When all PLLLCK lights are ON (including that of board 28 in the target area!), you are done: close the experimental area, go back to the control room, execute the Clean-up Procedure (init failures usually leave the DAQ in an unhealthy state), ask the DAFNE shifters for beam, and restart the DAQ, checking that all boards now complete initialization correctly.

No triggers

It is possible that the trigger log file stops showing incoming triggers:

[daq@l0padme1 runs]$ tail -f run_0000000_20181219_153625/log/run_0000000_20181219_153625_trigger.log 
- Trigger 11400 0x24901108df3a820   71101032480 0x1  292 2040.125122ms 2s
- Trigger 11500 0x2ad011097ae1ea5   71264247461 0x1  692 2040.187256ms 3s
- Trigger 11600 0x2110110a16870e5   71427453157 0x1   68 2040.071167ms 2s
- Trigger 11700 0x2750110ab22e449   71590667337 0x1  468 2040.177246ms 2s
[no more triggers appear]
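A stalled trigger log can also be detected by checking how long ago the file was last written. The following is a rough sketch, not part of the DAQ software: the function names and the 30-second default threshold are assumptions, GNU stat is assumed to be available, and the modification time is only a heuristic since log output may be buffered.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Assumes GNU coreutils (stat -c %Y).

# log_age_seconds <file>: seconds elapsed since the file was last modified
log_age_seconds() {
    local now mtime
    now="$(date +%s)"
    mtime="$(stat -c %Y "$1")" || return 1
    echo $(( now - mtime ))
}

# trigger_log_stalled <file> [max_age]
# Succeeds if the file has not been modified for more than max_age seconds
# (default 30, an arbitrary value chosen for this sketch).
trigger_log_stalled() {
    local age
    age="$(log_age_seconds "$1")" || return 2
    [ "$age" -gt "${2:-30}" ]
}
```

For example, trigger_log_stalled run_0000000_20181219_153625/log/run_0000000_20181219_153625_trigger.log && echo "no recent triggers" prints the warning only when the log has been idle for longer than the threshold.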

The first thing to do is to check with the DAFNE shifter if the beam is ON and the triggers are correctly being sent. If the problem is with the beam, just stop the DAQ, do the Clean-up Procedure, and wait for the beam to be back to normal before restarting the DAQ.

If the beam is stable and the triggers are being sent, then the problem might be that the Trigger board got stuck. To restart it you must:

stop the DAQ
reset the NIM crate following the procedure described in the Reset NIM crates and vetos page
execute the Clean-up Procedure
restart the DAQ

If this does not work, then you may try to reset the crate manually following this procedure:

stop the DAQ
ask DAFNE operators for access
go to the experimental area
locate the NIM crate in the rack on the right (close to the entrance door)
turn the crate OFF with the switch on the bottom right of the front panel
wait a few seconds
turn the crate ON with the same switch
close the experimental area
ask DAFNE operators for beam
execute the Clean-up Procedure
restart the DAQ

PLEASE NOTE that after resetting the NIM crate, either via network or manually, you MUST re-enable the Veto controller boards following the procedure described in the Setup Vetos page.

RunControl crashes during initialization

When using RunControl from a remote location, there is a small possibility that the network link has a glitch while messages are being sent from the server to the client. The effect of this is that the RunControl server crashes without completing the initialization.

If this happens the shifter should:

verify that the RunControl server is not running on the l0padme1 node
execute the Clean-up Procedure
restart the DAQ
execute the new_run procedure again