DAQ Troubleshooting - PADME-Experiment/padme-fw GitHub Wiki
This page describes a few possible troubleshooting procedures which can be used if the usual Clean-up Procedure does not work. New procedures will be added when defined.
In the runs directory of the DAQ area one can find all configuration and log files produced during a run. E.g. runs/run_0000000_20181218_202212/cfg will contain the configuration files for run run_0000000_20181218_202212, while runs/run_0000000_20181218_202212/log will contain the log files for the same run.
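As an illustration of the directory layout described above, here is a minimal Python sketch that builds the cfg and log paths for a run and splits a run name into its parts. The function names and the `runs_root` parameter are hypothetical, and the reading of the last two fields as start date and time is an assumption based on the example names on this page.

```python
from datetime import datetime
from pathlib import Path


def run_dirs(run_name, runs_root="runs"):
    """Return the (cfg, log) directories of a run, following the layout above."""
    base = Path(runs_root) / run_name
    return base / "cfg", base / "log"


def parse_run_name(run_name):
    """Split 'run_<number>_<YYYYMMDD>_<HHMMSS>' into (number, datetime).

    Assumption: the last two fields are the date and time the run was started.
    """
    prefix, number, date, time = run_name.split("_")
    if prefix != "run":
        raise ValueError(f"unexpected run name: {run_name!r}")
    return int(number), datetime.strptime(date + time, "%Y%m%d%H%M%S")
```

For example, `run_dirs("run_0000000_20181218_202212")` gives the two directories quoted in the paragraph above, relative to the DAQ area.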
Log files are named as <run>_<process>.log, e.g. run_0000000_20181219_080153_trigger.log is the log file for the trigger process. Looking into these files is usually the first thing to do when there is a problem.
Relevant files:
<run>_trigger.log      Trigger
<run>_merger.log       Merger
<run>_lvl1_<n>.log     Level1 stream <n> (usually n=00-04)
<run>_b<n>_daq.log     ADC board <n> DAQ process (n=00-28)
<run>_b<n>_zsup.log    ADC board <n> zero suppression process (n=00-28)
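The naming scheme above can be turned into a quick completeness check. The following Python sketch lists the log file names one would expect for a run. The function name is hypothetical, and the default ranges (Level1 streams 00-04, boards 00-28, two-digit zero padding) are taken from the list above and may differ for a given setup.

```python
def expected_log_files(run, n_lvl1=5, n_boards=29):
    """List the log file names expected for a run.

    The defaults mirror the ranges quoted above (Level1 streams 00-04,
    ADC boards 00-28); they are assumptions, not values read from the DAQ.
    """
    names = [f"{run}_trigger.log", f"{run}_merger.log"]
    names += [f"{run}_lvl1_{n:02d}.log" for n in range(n_lvl1)]
    for n in range(n_boards):
        names.append(f"{run}_b{n:02d}_daq.log")
        names.append(f"{run}_b{n:02d}_zsup.log")
    return names
```

Comparing this list with the actual content of the run's log directory shows at a glance which process never started writing its log.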
Any problem in the DAQ will immediately show up in the event merger log file. Since problems with the merger process are most often caused by problems with the ADC boards, the fastest recovery procedure is to stop the run, reset all VME crates, reset all NIM crates, and then do the Clean-up Procedure.
On rare occasions (under investigation), the RunControl server gets stuck when executing the init_run or change_setup procedures. In this case the shifter must exit from the RunControl client with CTRL-C, identify the process associated with the RunControl server using the ps command, kill -9 this process, and finally restart the RunControl server.
[daq@l0padme1 DAQ]$ ps -fu daq | grep RunControl
daq 14622 1 0 11:27 ? 00:00:00 /usr/bin/python ./RunControl --server
daq 15288 61101 0 12:30 pts/2 00:00:00 grep RunControl
[daq@l0padme1 DAQ]$ kill -9 14622
[daq@l0padme1 DAQ]$ ./RunControl --server
N.B. the Clean-up Procedure is not needed here.
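To avoid picking the wrong PID by eye, the ps output can be filtered programmatically. This is a minimal Python sketch, assuming the column layout of `ps -fu daq` shown in the session above (UID PID PPID C STIME TTY TIME CMD). The function name is hypothetical and the sketch is not part of the DAQ software.

```python
def runcontrol_server_pids(ps_output):
    """Return the PIDs of RunControl server processes found in `ps -f` output.

    Assumes the 8-column `ps -f` layout: UID PID PPID C STIME TTY TIME CMD.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header, shell prompt, or malformed line
        cmd = fields[7].split()
        is_server = "--server" in cmd and any(a.endswith("RunControl") for a in cmd)
        if is_server and cmd[0] != "grep":
            pids.append(int(fields[1]))
    return pids
```

The input text can be obtained with `subprocess.run(["ps", "-fu", "daq"], capture_output=True, text=True).stdout`. An empty result means no RunControl server is running.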
If the run initialization procedure fails or times out before showing the trigger ready message, the problem is usually due to the trigger board getting stuck. In this case the shifter should follow the reset procedure described in the No triggers paragraph below.
If one of the boards gets stuck, it will not respond to the initialization procedure. In this case the list of adc NN ready messages will be incomplete and the system will not give back control to the operator for a long time (several minutes). If this happens, the best strategy is to follow the procedure described in Exiting from the system below, execute the usual Clean-up Procedure, then reset the VME crates, and finally restart the whole system starting with the RunControl server. This whole procedure should be tried a couple of times. If the initialization still keeps failing, it is time to call an expert.
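When the list of ready messages is long, a small script can tell which boards never answered. The sketch below is illustrative only: it assumes the client prints lines containing `adc NN ready` with a numeric board index, and that there are 29 boards (00-28) as listed in the log file table. The function name and the regular expression are hypothetical.

```python
import re

# Assumed message format: "adc NN ready", NN being the board number.
READY_RE = re.compile(r"adc\s+(\d+)\s+ready")


def missing_boards(client_output, n_boards=29):
    """Return the board numbers for which no 'adc NN ready' message was seen."""
    ready = {int(m.group(1)) for m in READY_RE.finditer(client_output)}
    return [n for n in range(n_boards) if n not in ready]
```

A single missing board points to a stuck board as described above, while a long run of consecutive missing boards is more typical of a clock distribution problem.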
If several boards fail at the initialization stage (message adc <n> fail on the RunControl client), this could be a problem with the clock signal propagation. A clear indication of this is that all boards up to a given one succeed and all (or most) of the others fail, e.g. 0->4 ok and 5->28 fail.
To verify whether this is the problem (and to fix it), first ask for access to the experimental area, then go to the VME crates and look at the PLLLCK light of each ADC board: if this light is ON, the clock signal is correctly synchronized; if it is OFF or blinking, there is a problem with the clock signal.
You should start this check from the VME crate on the right (close to the entrance door), looking at the left-most board (this is board 0) and then proceeding to the right. Then move to the VME crate on the left (close to the wall) and start again from the left-most board. Finally go to the target area and check the ADC board in the small red crate (board 28). You should find that the PLLLCK light is ON up to a given board and OFF or blinking for all the others: if this is the case, the problem is with the clock cable between the last ON board and the first OFF board.
Reconnect the cable as shown in the picture above and verify that now the PLLLCK light of all boards is ON. Repeat the procedure if there are further interruptions along the chain.
When all PLLLCK lights are ON (including that of board 28 in the target area!), you are done: close the experimental area, go back to the control room, execute the Clean-up Procedure (init failures usually leave the DAQ in an unhealthy state), ask the DAFNE shifters for beam, and restart the DAQ, checking that all boards now complete initialization correctly.
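The rule used in the clock check above (the faulty cable sits between the last board with PLLLCK ON and the first board with PLLLCK OFF) can be written down as a short Python sketch. It assumes the clock chain follows the board numbering, starting at board 0, and that a blinking light is recorded as not locked. The function name is hypothetical.

```python
def find_clock_break(pll_locked):
    """Locate the first interruption in the clock chain.

    pll_locked: list of booleans in chain order (board 0 first), True if the
    PLLLCK light is ON, False if it is OFF or blinking.
    Returns (last_on_board, first_off_board), or None if all lights are ON.
    last_on_board is None when board 0 itself is not locked.
    """
    for board, locked in enumerate(pll_locked):
        if not locked:
            return (board - 1 if board > 0 else None, board)
    return None
```

For the 0->4 ok, 5->28 fail example, `find_clock_break([True] * 5 + [False] * 24)` returns `(4, 5)`, i.e. the cable between boards 4 and 5. Only the first interruption is reported, since every board downstream of it loses the clock anyway; the check has to be repeated after each cable is fixed.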
It is possible that the trigger log file stops showing incoming triggers:
[daq@l0padme1 runs]$ tail -f run_0000000_20181219_153625/log/run_0000000_20181219_153625_trigger.log
- Trigger 11400 0x24901108df3a820 71101032480 0x1 292 2040.125122ms 2s
- Trigger 11500 0x2ad011097ae1ea5 71264247461 0x1 692 2040.187256ms 3s
- Trigger 11600 0x2110110a16870e5 71427453157 0x1 68 2040.071167ms 2s
- Trigger 11700 0x2750110ab22e449 71590667337 0x1 468 2040.177246ms 2s
[no more triggers appear]
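A stall like the one shown above can also be detected by comparing two snapshots of the trigger log taken some seconds apart. The Python sketch below assumes that the first number after the word Trigger is a running trigger counter, which is inferred from the sample lines and not confirmed by the DAQ documentation. The function names are hypothetical.

```python
import re

# Assumption: the first number after "Trigger" is a running trigger counter.
TRIGGER_RE = re.compile(r"^- Trigger\s+(\d+)\b", re.MULTILINE)


def last_trigger_count(log_text):
    """Return the counter of the last trigger line in the log, or None."""
    counts = TRIGGER_RE.findall(log_text)
    return int(counts[-1]) if counts else None


def triggers_stalled(earlier_text, later_text):
    """True if no new trigger line appeared between two snapshots of the log."""
    return last_trigger_count(later_text) == last_trigger_count(earlier_text)
```

The time between the two snapshots should be well above the interval at which the trigger process writes a new line, otherwise a healthy run may be reported as stalled.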
The first thing to do is to check with the DAFNE shifter if the beam is ON and the triggers are correctly being sent. If the problem is with the beam, just stop the DAQ, do the Clean-up Procedure, and wait for the beam to be back to normal before restarting the DAQ.
If the beam is in stable conditions and the triggers are being sent, then the problem might be that the Trigger board got stuck. To restart it you must:
- stop the DAQ
- reset the NIM crate following the procedure described in the Reset NIM crates and vetos page
- execute the Clean-up Procedure
- restart the DAQ
If this does not work, then you may try to reset the crate manually following this procedure:
- stop the DAQ
- ask DAFNE operators for access
- go to the experimental area
- locate the NIM crate in the rack on the right (close to the entrance door)
- turn the crate OFF with the switch on the bottom right of the front panel
- wait a few seconds
- turn the crate ON with the same switch
- close the experimental area
- ask DAFNE operators for beam
- execute the Clean-up Procedure
- restart the DAQ
PLEASE NOTE that after resetting the NIM crate, either via network or manually, you MUST re-enable the Veto controller boards following the procedure described in the Setup Vetos page.
When using RunControl from a remote location, there is a small possibility that the network link has a glitch while messages are being sent from the server to the client. The effect of this is that the RunControl server crashes without completing the initialization.
If this happens the shifter should:
- verify that the RunControl server is not running on the l0padme1 node
- execute the Clean-up Procedure
- restart the DAQ
- execute the new_run procedure again