# Troubleshooting
## Contents

- Can't start the LDM
- The LDM server terminates after executing its configuration-file
- The LDM isn't logging
- The LDM isn't receiving data
- The LDM suddenly disappears on an RHEL/CentOS system
- `ldmadmin scour` takes too long on an XFS file-system
## Can't start the LDM
If the LDM won't start, then it should log the reason why, so check the LDM log file first. If that is inconclusive, then start the LDM in interactive mode by explicitly telling it to log to the standard error stream via the `-l` option:

```
ldmd -l- [-v]
```

This prevents the LDM from daemonizing itself, and it logs directly to the terminal. The running LDM can be stopped by typing `^C` (control-C).
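Concretely, a first pass might look like this (a sketch, assuming the default log-file location used elsewhere on this page):

```
# Look at the most recent log entries for the reason the LDM won't start
tail -n 40 ~/var/logs/ldmd.log

# If that's inconclusive, run the LDM in the foreground with verbose
# logging to the terminal; stop it with control-C when done
ldmd -l- -v
```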
## The LDM server terminates after executing its configuration-file
This could be caused by:

- An effectively empty configuration-file
- A commented-out `ALLOW` entry for `localhost` and `127.0.0.1` (see the example below)
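For reference, the default configuration-file shipped with the LDM contains an `ALLOW` entry along these lines (the exact regular expression may differ between LDM versions):

```
# Allow the local host to obtain any data-product
ALLOW ANY ^((localhost|loopback)|(127\.0\.0\.1\.?$))
```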
## The LDM isn't logging
First, verify that LDM logging actually isn't working:

```
ulogger This is a test 2>/dev/null
tail -1 ~/var/logs/ldmd.log
```

If the above prints `This is a test`, then LDM logging works.
Verify that the LDM log file is owned and writable by the LDM user:

```
ls -l ~/var/logs/ldmd.log
```

If it isn't, then make it so:

```
sudo chown ldm ~/var/logs/ldmd.log
chmod u+w ~/var/logs/ldmd.log
```
Verify that the disk partition that contains the LDM log file isn't full:

```
df ~/var/logs/ldmd.log
```

If it is full, then free up space by deleting or compressing unneeded files.
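To see what is consuming space on that partition, something like the following may help (a generic sketch; adjust the directory to suit):

```
# Summarize disk usage under the log directory, largest entries last
du -xh ~/var/logs | sort -h | tail
```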
If the script `~/bin/refresh_logging` exists and doesn't simply execute the utility `hupsyslog(1)`, then logging should now work; otherwise, continue.
Verify that the system logging daemon is running:

```
ps -ef | grep syslog
```

If it's not running, then start it.
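How to start it depends on the platform; on a systemd-based RHEL/CentOS 7 system the daemon is typically rsyslog, so something like this may apply:

```
# Start rsyslog now and have it start at every boot (systemd systems)
sudo systemctl start rsyslog
sudo systemctl enable rsyslog
```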
Determine the veracity of the following:

- The configuration-file for the system logging daemon has an entry for the LDM (see the hypothetical example below):

  ```
  grep local /etc/*syslog.conf | grep ldm
  ```

- The utility `hupsyslog(1)` is owned by root and setuid:

  ```
  ls -l ~/bin/hupsyslog
  ```

If any of the above are false, then execute the command `make root-actions`, as root, in the top-level LDM source-directory.
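For reference, the LDM's entry in the system logging daemon's configuration-file typically routes a `local` facility to the LDM log file. A hypothetical line (the facility number and path are site-specific):

```
# Hypothetical *syslog.conf entry routing LDM messages to its log file
local0.debug    /home/ldm/var/logs/ldmd.log
```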
Stop and restart the system logging daemon and retry the `ulogger` command.
If the operating system has SELinux, verify that it is disabled or in permissive mode:

```
getenforce
```

To change from enforcing mode to permissive mode, execute, as root, the command

```
setenforce permissive
```

To disable SELinux, edit the file `/etc/selinux/config`, set the variable `SELINUX` to `disabled`, and reboot the system.
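The relevant line in `/etc/selinux/config` is:

```
# Disable SELinux at the next boot
SELINUX=disabled
```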
Verify that the disk partition containing the `~/bin` directory does not have the `nosuid` attribute:

```
dev=`df ~/bin | tail -1 | awk '{print $1}'`
mount | grep $dev | grep nosuid
```

If the `nosuid` attribute is enabled, then `hupsyslog(1)` will not work. Either that attribute must be disabled or the LDM package must be re-installed on a disk partition that has that attribute disabled.
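If the partition can safely be remounted, the attribute can be cleared without a reboot along these lines (the mount point here is hypothetical; make the change permanent in `/etc/fstab`):

```
# Remount the filesystem with set-uid execution re-enabled
sudo mount -o remount,suid /home
```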
## The LDM isn't receiving data
Are you sure? Verify that the LDM hasn't received the data-products in question by executing, as the LDM user, the following command on the same system as the LDM:

```
notifyme -v [-f feedtype] [-p pattern] -o 9999999
```

where *feedtype* and *pattern* are a feed specification and extended regular expression, respectively, that match the missing data-products.

If this command indicates that the LDM is unavailable, then start it; otherwise, continue.
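As a concrete sketch (the feedtype and pattern here are hypothetical; substitute ones that match your missing data-products):

```
# Watch for NEXRAD Level III products whose identifiers contain "N0R"
notifyme -v -f NEXRAD3 -p N0R -o 9999999
```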
Verify that your system clock is correct. If your LDM asks for data-products from the future, then it won't receive anything until that time.
Verify that each upstream LDM that should be sending the data-products is, indeed, receiving those data-products by executing, as the LDM user, the following command on the downstream system:

```
notifyme -v [-f feedtype] [-p pattern] -o 9999999 -h host
```

where *feedtype* and *pattern* are as before and *host* is the name of the upstream LDM host system (you can get this from the relevant `REQUEST` entries in the LDM configuration-file).
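Again as a hypothetical sketch (the host name is made up):

```
# Ask the upstream LDM what NEXRAD Level III products it is receiving
notifyme -v -f NEXRAD3 -p N0R -o 9999999 -h ldm.upstream.example.edu
```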
If the notifyme(1) command indicates that:

- The upstream LDM isn't running (i.e., the connection is refused), then it must be started.

- The upstream LDM is unreachable, then you have a networking issue. This can be verified by executing the following command on the downstream host:

  ```
  telnet host 388
  ```

  Contact your network administrator and show them the `telnet(1)` command.

- The upstream LDM won't honor the request, then the upstream LDM doesn't have a relevant `ALLOW` entry for the downstream LDM in its configuration-file. You'll need to contact the upstream LDM user.

- The upstream LDM is, indeed, receiving the data-products, then either 1) there's something wrong with the associated `REQUEST` entry in the downstream LDM's configuration-file (Does it exist? Is it correct? A hypothetical example follows this list.); or 2) the upstream site is using the `NOT` field of the relevant `ALLOW` entry in the upstream LDM's configuration-file to prevent the downstream LDM from receiving the data-products, and you'll need to contact the upstream LDM user.

- The upstream LDM is not receiving the data-products, then either change your `REQUEST` entry to a host whose LDM is receiving the data-products or work through this section on the upstream system.
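For reference, a `REQUEST` entry in the downstream LDM's configuration-file has the form `REQUEST feedtype pattern host`; a hypothetical example:

```
# Request all NEXRAD Level III products from a (made-up) upstream host
REQUEST NEXRAD3 ".*" ldm.upstream.example.edu
```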
## The LDM suddenly disappears on an RHEL/CentOS system
Several LDM users have reported the sudden disappearance of a running LDM on their RHEL/CentOS 6 or 7 systems. There's no warning: nothing in the LDM log file. It just vanishes, as if the superuser had sent it a `SIGKILL` with extreme prejudice.

It turns out, that's exactly what happened. Only, it wasn't the superuser per se, but the out-of-memory (OOM) killer acting on behalf of the superuser. The smoking gun is an entry in the system log file from the OOM killer about terminating the LDM process around the time that it disappears.
The current workaround is to tell the OOM killer that the LDM processes are important by assigning the LDM process-group a particular "score". LDM user Daryl Herzmann explains:

> So there is a means to set a "score" on each Linux process to inform the OOM killer about how it should prioritize the killing. For RHEL/CentOS 6 and 7, this can be done by `echo -1000 > /proc/$PID/oom_score_adj`. For some other Linux flavours, the score should be -17 and the proc file is `oom_adj`. Google is your friend! A simple crontab(1) entry like so will set this value for `ldmd` automatically each hour:
>
> ```
> 1 * * * * root pgrep -f "ldmd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done
> ```
>
> Of course, this solution would have a small window of time between an LDM restart and the top of the next hour whereby the score would not be set. There are likely more robust solutions here I am blissfully ignorant of.
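One more robust option on systemd-based systems: if the LDM is started from a systemd service unit (the unit file here is hypothetical and site-specific), the score can be set at launch with the `OOMScoreAdjust` directive:

```
# Fragment of a hypothetical /etc/systemd/system/ldm.service
[Service]
OOMScoreAdjust=-1000
```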
The OOM killer can be completely disabled with the following commands. This is not recommended for production environments: if an out-of-memory condition does arise, the result could be anything from a kernel panic to a hang, depending on the resources available to the kernel at the time.

```
sysctl vm.overcommit_memory=2
echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
```
## `ldmadmin scour` takes too long on an XFS file-system
LDM user Daryl Herzmann encountered this problem. As he explained it:

> Please, I don't wish to start a war regarding which filesystem is the best here... If you have used XFS (now the default filesystem in RHEL7) in the past, you may have suffered from very poor performance with IO related to small files. For me and LDM, this would rear its very ugly head when I wished to `ldmadmin scour` the `/data/` folder. It would take 4+ hours to scour out a day's worth of NEXRAD III files. If you looked at output like `sysstat`, you would see the process at 100% iowait. I created a thread about this on the Red Hat community forums[1] and was kindly responded to by one of the XFS developers, Eric Sandeen. He wrote the following:
>
> > This is because your xfs filesystem does not store the filetype in the directory, and so every inode in the tree must be stat'd (read) to determine the filetype when you use the `-type f` qualifier. This is much slower than just reading directory information. In RHEL7.3, mkfs.xfs will enable filetypes by default. You can do so today with `mkfs.xfs -n ftype=1`.
>
> So what he is saying is that you have to reformat your filesystem to take advantage of this setting. So I did some testing and now `ldmadmin scour` takes only 4 minutes to traverse the NEXRAD III directory tree!
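To check whether an existing XFS file-system already stores filetypes (assuming `/data` is the mount point in question):

```
# ftype=1 in the "naming" line means filetypes are stored in directories
xfs_info /data | grep ftype
```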