Troubleshooting biocache‐service - AtlasOfLivingAustralia/documentation GitHub Wiki

Intro

biocache-service does not start correctly when its dependencies are not running or not reachable. A typical error message looks like:

{ "message": "Error creating bean with name 'qidCacheDao' defined in file [/var/lib/tomcat7/webapps-records-ws.l-a.site/ROOT/WEB-INF/classes/au/org/ala/biocache/dao/QidCacheDAOImpl.class]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [au.org.ala.biocache.dao.QidCacheDAOImpl]: Constructor threw exception; nested exception is java.lang.NoClassDefFoundError: Could not initialize class au.org.ala.biocache.Config$", "errorType": "Server error" }

This is the ALA equivalent to the blue screen of death :-)

This wiki page walks through a series of checks to help you fix this startup error.

Is solr running?

In the example commands below, solr is installed on ala-install-test-2 and biocache-service on ala-install-test-1.

Check the solr service status. It should look like this:

root@ala-install-test-2:~# service solr status
* solr.service - LSB: Controls Apache Solr as a Service
   Loaded: loaded (/etc/init.d/solr; generated)
   Active: active (exited) since Thu 2021-12-02 10:52:59 UTC; 6h ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/solr.service

Dec 02 11:51:36 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:39 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:39 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:39 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:41 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:41 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:41 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:51:44 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:54:38 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:54:39 ala-install-test-2 systemd[1]: solr.service: Failed to reset devices.list: Operation not permitted

Check if the port is listening:

root@ala-install-test-2:~# lsof -i:8983
COMMAND   PID USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
java    29838 solr  137u  IPv6 757352331      0t0  TCP ala-install-test-2:8983 (LISTEN)

If not, check the VM's memory and the solr logs to find out why solr is not running.

If it is running, check whether it is accessible from the biocache-service VM (in our example, ala-install-test-1):

root@ala-install-test-1:~# grep solr.home /data/biocache/config/biocache-config.properties 
solr.home=http://index-es.l-a.site:8983/solr/biocache

Now try to connect to the same host and port:

root@ala-install-test-1:~# nc index-es.l-a.site 8983 -v
Connection to index-es.l-a.site 8983 port [tcp/*] succeeded!
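
The two steps above (read solr.home, then probe its host and port) can be combined into a small helper. This is only a minimal sketch: the function names `solr_endpoint` and `check_solr` are hypothetical, port 8983 is assumed when the URL has no port, and solr.home is assumed to hold a single HTTP URL rather than a zookeeper connection string. The `-z` and `-w` options are assumed to be supported by your nc variant.

```shell
# Print "host port" for the solr.home URL found in a properties file.
# Assumes a single http(s) URL; falls back to port 8983 if none is given.
solr_endpoint() {
    url=$(grep '^solr\.home=' "$1" | head -n 1 | cut -d '=' -f 2-)
    [ -n "$url" ] || return 1
    # Strip the scheme and everything after the first slash.
    hostport=$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://##; s#/.*$##')
    host=${hostport%%:*}
    case "$hostport" in
        *:*) port=${hostport##*:} ;;
        *)   port=8983 ;;
    esac
    printf '%s %s\n' "$host" "$port"
}

# Probe the endpoint with nc (-z: connect only, -w 5: 5 second timeout).
check_solr() {
    endpoint=$(solr_endpoint "$1") || return 1
    set -- $endpoint
    nc -z -v -w 5 "$1" "$2"
}

# Example (not run here):
# check_solr /data/biocache/config/biocache-config.properties
```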

If it is not reachable, verify things like name resolution:

root@ala-install-test-1:~# ping -c 1 index-es.l-a.site 
PING ala-install-test-2 (10.10.10.152) 56(84) bytes of data.
64 bytes from ala-install-test-2 (10.10.10.152): icmp_seq=1 ttl=64 time=0.027 ms

--- ala-install-test-2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.027/0.027/0.027/0.000 ms

and:

root@ala-install-test-1:~# getent ahosts index-es.l-a.site
10.10.10.152    STREAM ala-install-test-2
10.10.10.152    DGRAM  
10.10.10.152    RAW    

If there is a firewall between the VMs, allow the traffic for solr (8983/tcp) and/or zookeeper (2181/tcp).

Are the solr cores created?

You have to verify that the solr cores were created by ala-install:

root@ala-install-test-2:~# ls -l /data/solr/data/
total 20
drwxr-xr-x 4 solr solr 4096 Dec  2 10:53 bie
drwxr-xr-x 4 solr solr 4096 Dec  2 10:53 bie-offline
drwxr-xr-x 4 solr solr 4096 Dec  2 10:53 biocache
-rw-r----- 1 solr solr 2180 Dec  2 10:51 solr.xml
-rw-r----- 1 solr solr  975 Dec  2 10:51 zoo.cfg

You can also verify the cores in the solr admin interface, which should be access-protected.
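
As a scripted version of the directory check above, the following sketch reports which expected core directories are absent. The function name `missing_cores` is hypothetical, and the core names in the usage comment are simply the ones from the listing above; yours depend on your inventory.

```shell
# Print the name of every expected core that has no directory under the
# solr data directory. Usage: missing_cores DATA_DIR CORE...
missing_cores() {
    data_dir=$1
    shift
    for core in "$@"; do
        [ -d "$data_dir/$core" ] || printf '%s\n' "$core"
    done
}

# Example (not run here); prints nothing when every core directory exists:
# missing_cores /data/solr/data biocache bie bie-offline
```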

If you use biocache-store, is cassandra running?

You should run similar checks for cassandra. In our example it is running on ala-install-test-3:

root@ala-install-test-3:~# service cassandra status
* cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra; generated)
   Active: active (running) since Thu 2021-12-02 10:50:26 UTC; 6h ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 58 (limit: 4915)
   CGroup: /system.slice/cassandra.service
           `-29012 /usr/bin/java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouc

Dec 02 11:28:44 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:54 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:54 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:57 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:57 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:57 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:59 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:59 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:29:59 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted
Dec 02 11:30:02 ala-install-test-3 systemd[1]: cassandra.service: Failed to reset devices.list: Operation not permitted

Check whether port 9042 is listening:

root@ala-install-test-3:~# lsof -i:9042
COMMAND   PID      USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
java    29012 cassandra   85u  IPv4 757303579      0t0  TCP *:9042 (LISTEN)
java    29012 cassandra   89u  IPv4 759281357      0t0  TCP ala-install-test-3:9042->ala-install-test-1:59370 (ESTABLISHED)
java    29012 cassandra   90u  IPv4 759281428      0t0  TCP ala-install-test-3:9042->ala-install-test-1:59380 (ESTABLISHED)

Also check whether it is reachable from the biocache-service VM:

root@ala-install-test-1:~# grep cassandra.host /data/biocache/config/*
/data/biocache/config/biocache-config.properties:# cassandra hosts - this should be comma separated list in the case of a cluster
/data/biocache/config/biocache-config.properties:cassandra.hosts=ala-install-test-3
root@ala-install-test-1:~# nc ala-install-test-3 9042 -v
Connection to ala-install-test-3 9042 port [tcp/*] succeeded!

If not, check DNS resolution and the firewall rules again to make sure this TCP traffic is allowed.
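
Because cassandra.hosts can be a comma-separated list, a cluster needs one probe per host. A minimal sketch, assuming the hypothetical helper names `cassandra_hosts` and `check_cassandra`, port 9042 on every node, and an nc variant that supports `-z` and `-w`:

```shell
# Print one cassandra host per line from the comma-separated
# cassandra.hosts property. Usage: cassandra_hosts PROPERTIES_FILE
cassandra_hosts() {
    grep '^cassandra\.hosts=' "$1" | head -n 1 | cut -d '=' -f 2- |
        tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Probe port 9042 on every configured host; return non-zero if any fails.
check_cassandra() {
    status=0
    for host in $(cassandra_hosts "$1"); do
        nc -z -v -w 5 "$host" 9042 || status=1
    done
    return "$status"
}

# Example (not run here):
# check_cassandra /data/biocache/config/biocache-config.properties
```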

Other services that biocache-service needs

Here is a quick way to check whether the other services that biocache-service depends on, besides solr and cassandra, are up and reachable:

root@ala-install-test-1:~# for i in $(grep https /data/biocache/config/biocache-config.properties  | cut -d "=" -f 2 | grep -v zip | sort | uniq) ; do echo; echo $i ----; curl --write-out '%{http_code}' --silent --output /dev/null $i; done

https://auth.l-a.site/userdetails ----
302
https://auth.ala.org.au/apikey/ws/check?apikey ----
200
https://collections.l-a.site/ws ----
200
https://collections.l-a.site/ws/citations ----
200
https://collections.l-a.site/ws/collection ----
200
https://dataquality.ala.org.au/ ----
000
https://doi.l-a.site ----
200
https://doi.l-a.site/api/ ----
302
https://doi.l-a.site/doi/ ----
200
https://doi.l-a.site/myDownloads ----
302
https://spatial.l-a.site/geoserver ----
302
https://spatial.l-a.site/ws ----
302
https://spatial.l-a.site/ws/fields ----
200
https://species-ws.l-a.site ----
200
https://images.l-a.site ----
200
https://lists.l-a.site ----
302
https://logger.l-a.site/service/logger/ ----
200
https://records.l-a.site ----
200
https://records.l-a.site/download/doi?doi ----
302
https://records-ws.l-a.site ----
200
https://records-ws.l-a.site/biocache-download ----
301
https://records-ws.l-a.site/biocache-media/ ----
404

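If you see any 500 errors, or more 404 errors than in the example above, check those services. A 000 code means curl got no HTTP response at all (for example, the host is unreachable), so check that service too.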
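
To make the output of the loop above easier to scan, each status code can be mapped to a short verdict. This is a minimal sketch with a hypothetical helper name, `classify_status`; the grouping is a judgment call, since a 404 is expected for some endpoints, as in the listing above.

```shell
# Map an HTTP status code, as printed by curl's %{http_code}, to a verdict.
classify_status() {
    case "$1" in
        000)      echo "unreachable" ;;   # curl received no HTTP response
        2??|3??)  echo "ok" ;;            # success or redirect (e.g. to login)
        404)      echo "not-found" ;;
        5??)      echo "server-error" ;;
        *)        echo "check" ;;         # anything else deserves a look
    esac
}

# Example (not run here), with $url taken from the loop above:
# classify_status "$(curl --write-out '%{http_code}' --silent --output /dev/null "$url")"
```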

Start order

If you suffer a power outage and your VMs restart, biocache-service sometimes starts before its dependencies are up and fails to start. This also happens if you run many services on the same VM and biocache-service starts before the others. In that case, restart only biocache-service. A simple touch of the war file should be enough:

root@ala-install-test-1:~# touch /var/lib/tomcat8/webapps-records-ws.l-a.site/ROOT.war 
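
If this happens regularly, the redeploy can wait until a dependency answers before touching the war file. The sketch below is one possible approach and not part of ala-install: `wait_until` is a hypothetical helper that retries any command, and the host, port, retry count and delay in the usage comment are example values only.

```shell
# Retry a command until it succeeds.
# Usage: wait_until RETRIES DELAY_SECONDS COMMAND [ARG...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
    retries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$retries" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example (not run here): try solr up to 30 times, 10 seconds apart,
# and redeploy biocache-service only once it answers.
# wait_until 30 10 nc -z -w 5 index-es.l-a.site 8983 &&
#     touch /var/lib/tomcat8/webapps-records-ws.l-a.site/ROOT.war
```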

Increase the biocache log level

Edit /data/biocache/config/log4j.xml and increase the log levels to get more verbose logs.
