Troubleshooting - AtlasOfLivingAustralia/documentation Wiki

Intro

This document describes some common problems with LA software, along with tools and strategies that help to solve them. It also answers common questions such as the typical "Is cassandra running?".

Basic service checks, is tomcat/solr/cassandra/apache/cas/mysql/whatever running correctly?

A simple check like service tomcat7 status or service cas status gives you minimal information about a service, including whether it is running or not.

Check that the following services are running on your servers:

If you are using CAS, verify also these services are up and running correctly:

The new image-service (> 0.9.2) is also a systemd service, so check it with: service image-service status. Images also depend on the elasticsearch service. Check its status with: service elasticsearch status.

If some of these services are not running (or you want to restart them) you can:

Some service isn't running?

journalctl -f -u SOME_SERVICE can show you additional log messages if something goes wrong. You can also check the service logs.
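
The status checks above can be looped over several services at once. The following is a minimal sketch, assuming a systemd host where systemctl is-active is available. The function name check_services and the STATUS_CMD override are hypothetical helpers introduced here for illustration, not part of the LA tooling.

```shell
#!/usr/bin/env bash
# Print OK/DOWN for every service name given as an argument.
# STATUS_CMD is a hypothetical hook to swap the checker (useful for dry runs);
# by default the sketch relies on "systemctl is-active --quiet".
check_services() {
  local svc rc=0
  for svc in "$@"; do
    if ${STATUS_CMD:-systemctl is-active --quiet} "$svc"; then
      echo "OK   $svc"
    else
      echo "DOWN $svc"
      rc=1
    fi
  done
  # Non-zero when at least one service is down
  return "$rc"
}
```

For example, check_services tomcat7 cas image-service elasticsearch lists the state of the services mentioned above; for any service reported as DOWN, journalctl -u SOME_SERVICE is the next place to look.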

Basic ports checks, is tomcat/solr/cassandra/apache/cas/mysql/whatever listening correctly?

Locally

An easy check is to see whether the service is listening on its port from inside the virtual machine. You can use a tool like lsof locally to detect whether a service is listening on a given port.

Common ports:

Some usage sample:


# lsof -p $(pgrep -f solr) | grep LISTEN
java    23141 solr   45u     IPv6          203903883         0t0       TCP localhost:7983 (LISTEN)
java    23141 solr  118u     IPv6          203930236         0t0       TCP *:8983 (LISTEN)
# lsof -i :9042
COMMAND   PID      USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
java    14626 cassandra   81u  IPv4 909890103      0t0  TCP 172.16.16.52:9042->172.16.16.246:43622 (ESTABLISHED)
java    14626 cassandra   83u  IPv4 909890111      0t0  TCP 172.16.16.52:9042->172.16.16.246:43624 (ESTABLISHED)
java    14626 cassandra  308u  IPv4  22633942      0t0  TCP *:9042 (LISTEN)

Recommendation: Use monitoring software to check periodically and automatically that your services are running and to notify you about service issues. Run these checks both externally and internally. A process supervisor like monit can also be useful to relaunch a dead service for you.

From the outside

If the services are working and listening as expected, you can check that your machines and services communicate correctly with each other and that you don't have, for instance, firewall issues.

Tools like netcat, curl or even links (a text browser) are useful to test whether a port is open from one VM to another, whether a web page is accessible, etc. Samples:

netcat -v -z a.b.c.d 8983
Connection to a.b.c.d 8983 port [tcp/*] succeeded!

If something goes wrong you can use tools like tcptraceroute (apt install tcptraceroute) to find who is filtering your traffic.
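
When several ports have to be tested from one VM, the netcat check can be repeated in a loop. This is a minimal sketch that uses bash's /dev/tcp redirection and the coreutils timeout command instead of netcat, so it also runs where netcat is not installed. The function names port_open and check_endpoints are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds when a TCP connection to host $1, port $2 can be opened within 3 seconds.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Check a list of host:port pairs and report each one.
check_endpoints() {
  local pair rc=0
  for pair in "$@"; do
    if port_open "${pair%:*}" "${pair##*:}"; then
      echo "open   $pair"
    else
      echo "closed $pair"
      rc=1
    fi
  done
  # Non-zero when at least one endpoint could not be reached
  return "$rc"
}
```

A call like check_endpoints a.b.c.d:8983 a.b.c.d:9042 run from the client VM shows at a glance which ports are filtered.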

Some strategies to solve a service problem

Many times, when we have a service problem, we try to test the whole thing at once.

Imagine a chain of elements like BrowserClient->A->B->C->D->Server that fails (partially or totally).

Or a real chain like:

BrowserClient -> Internet -> CDN -> Proxy (nginx) -> collectory (tomcat) ->
  -> biocache-hub (tomcat) -> biocache-service (tomcat) -> solr & cassandra

and something does not work as expected. This problem can be a detail (some field does not return the value you expect), or more severe (like a total biocache search failure).

We can use different strategies to try to find what is going on.

Divide and conquer strategy

First check small parts of the chain, like D->Server, and later, if that works, C->D->Server, and so on, to detect where the problem is. When you think that the problem is solved, test the complete system.

Using a metaphor: when an electrical machine stops working, the first thing to verify is whether there is power in the socket, then the plug, then the cable, and so on. If you try to test the whole machine at once, it is much more difficult to know where the problem is.
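
The divide and conquer idea can be sketched as a small helper that runs one check per link of the chain, innermost link first, and stops at the first one that fails. The function name first_failure is hypothetical, and the checks you pass to it are whatever commands fit your own chain.

```shell
#!/usr/bin/env bash
# Run each check (a shell command given as one string) in order and
# print the first one that fails. Returns 0 when every check passes.
first_failure() {
  local check
  for check in "$@"; do
    if ! bash -c "$check" >/dev/null 2>&1; then
      echo "$check"
      return 1
    fi
  done
  return 0
}
```

A possible call, with placeholder hosts and URLs that you would replace with your own, is first_failure "nc -z -w 3 127.0.0.1 9042" "nc -z -w 3 127.0.0.1 8983" "curl -fsS -o /dev/null https://biocache.somedomain/". The command that gets printed is the first broken link, so everything before it in the list can be ruled out.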

Log checking aka "get your microscope"

Another strategy to find the source of a problem is to check the LA logs carefully, looking for errors, exceptions, etc. If this is not enough, another option is to increase the log level on the servers (look for the log4j configuration files).
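
A quick first pass over a log can be done with grep before reading it in full. This is a minimal sketch: the function name scan_log is hypothetical, and the pattern only covers the common ERROR, FATAL and Exception markers, so adapt it to your own logs.

```shell
#!/usr/bin/env bash
# Show the last N (default 20) lines of a log file that look like errors,
# prefixed with their line numbers so they are easy to locate afterwards.
scan_log() {
  grep -nE 'ERROR|FATAL|Exception' "$1" | tail -n "${2:-20}"
}
```

For instance, scan_log /var/log/tomcat7/biocache-service.log 50 would print the last 50 suspicious lines, assuming that is where your tomcat writes that log.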

Is it just me?

A good question to ask is whether this functionality works in other portals. Try to make similar queries, or try the faulty functionality in another bigger portal like https://ala.org.au, and compare the results.

Did we change something recently?

Sometimes a re-configuration via ansible, an update, a power outage or a reboot can be a source of problems. So keeping track of the changes in your inventories with git can be very useful.

Recommendation: Use etckeeper to see changes in /etc over time. You can check what changed after re-running ansible. As etckeeper only tracks changes in ... bingo... /etc, you can use plain git in your /data/*/config/ directories to track ansible changes in your configurations and detect new variables introduced by ansible. For instance, after configuring CAS via ansible (or another new service in your node), it's easy to see the configuration changes via git diff and also to detect a variable missing from your inventories:

[email protected]:/data/ala-hub/config# git diff
diff --git a/ala-hub-config.properties b/ala-hub-config.properties
index f7e6ca2..d7f0a96 100644
--- a/ala-hub-config.properties
+++ b/ala-hub-config.properties
@@ -6,17 +6,17 @@ grails.resources.work.dir=/data/ala-hub/cache

 # CAS Config
 serverName=https://biocache.somedomain
-security.cas.casServerName=
+security.cas.casServerName=https://auth.somedomain
 security.cas.appServerName=https://biocache.somedomain
-security.cas.casServerUrlPrefix=/cas
-security.cas.casServerLoginUrl=/cas/login
-security.cas.casServerLogoutUrl=/cas/logout
-security.cas.loginUrl=/cas/login
-security.cas.logoutUrl=/cas/logout
-security.cas.uriFilterPattern=
+security.cas.casServerUrlPrefix=https://auth.somedomain/cas
+security.cas.casServerLoginUrl=https://auth.somedomain/cas/login
+security.cas.casServerLogoutUrl=https://auth.somedomain/cas/logout
+security.cas.loginUrl=https://auth.somedomain/cas/login
+security.cas.logoutUrl=https://auth.somedomain/cas/logout
+security.cas.uriFilterPattern=/admin.*,/alaAdmin.*,/download.*
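
To get diffs like the one above, the config directory has to be under git before ansible runs. The following is a minimal sketch of that workflow: the function name snapshot_config and the GIT_NAME / GIT_EMAIL variables are hypothetical, and it assumes git 1.8.5 or later (for the -C option).

```shell
#!/usr/bin/env bash
# Commit the current state of a config directory, creating the repo if needed.
snapshot_config() {
  local dir=$1 msg=${2:-"config snapshot"}
  [ -d "$dir/.git" ] || git -C "$dir" init -q
  git -C "$dir" add -A
  # Commit only when something is actually staged
  if ! git -C "$dir" diff --cached --quiet; then
    git -C "$dir" \
      -c user.name="${GIT_NAME:-ops}" \
      -c user.email="${GIT_EMAIL:-ops@localhost}" \
      commit -q -m "$msg"
  fi
}
```

Run snapshot_config /data/ala-hub/config "before ansible" first, then run ansible, and finally inspect the changes with git diff inside that directory.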

Are our service components and data compatible?

Sometimes an incompatibility between software versions means that, for instance, biocache-hub, biocache-service and the indexes generated by biocache-store are not compatible, so queries don't work, etc.

Search LA github & slack channel for your error

Sometimes it is useful to search the ALA github for an error log or message, looking for a related issue.

A query like https://github.com/search?q=org%3AAtlasOfLivingAustralia+%22Unable+to+retrieve+email+from+User+Principal%22 searches all ALA github repos for this particular quoted message. This is a fast way to find a message in the code, or an error message in some issue, across all the ALA code repos.

A faster way is to add a custom search engine to your chrome/chromium browser like this:

Screenshot of how to configure a ALA github repos searcher in chrome

so you can type ala TAB some query (or some "quoted query") in your chrome address bar to quickly search all ALA repositories.

Look for chrome://settings/searchEngines or right click in the chrome url bar -> Edit Search engines.

Recommendation: You can add a similar search engine for the GBIF repos in github if, for example, your LA node uses IPT and you want to search quickly for some IPT error. It is also useful to add a nexus searcher for:

XY problem?

Sometimes we look for a solution, or ask the community for an answer, while thinking about a particular solution instead of focusing on the problem itself.

Quoting http://xyproblem.info/, "The problem occurs when people get stuck on what they believe is the solution and are unable step back and explain the issue in full".

When you are stuck on some error, step back and consider whether you are approaching the issue correctly.

Tools & Services

Ansible

Initial ssh access

The easiest way to run LA ansible inventories is via a passwordless ssh key and sudo. [Here] you can find a good tutorial on how to configure it. Some basic checks you should test:

You can test your initial VM passwordless setup with some ssh command like:

ssh -i ~/.ssh/MyKey.pem [email protected] sudo ls /root

to see if ansible will run without too much pain.

Default property values

Another common issue when running LA ansible inventories is forgetting to configure some variable. You can look for the default vars in a role:

grep -ri orgName ansible/roles/biocache*
# or look for defaults values
grep -r "default(" ansible/roles/biocache-service

As an example, imagine the bie-hub service has a wrong url in the generated /data/ala-bie/config/ala-bie-config.properties on your bie-hub machine. Imagine that it is the collectory url, which points to the .au url instead of your own collectory url. You can do something like:

grep -r url ansible/roles/bie-hub/templates | grep collectory

that returns:

ansible/roles/bie-hub/templates/bie-hub-config.properties:collectory.baseURL={{ collectory_url | default('https://collections.ala.org.au') }}

This tells you the name of the ansible variable. So you can add the collectory_url var to your inventory with the correct url of your collectory.
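
To review every variable of a role that falls back to a default, and not just one, the grep above can be generalised. This is a minimal sketch that assumes the templates use the `{{ name | default(...) }}` form shown in the output above. The function name list_defaulted_vars is hypothetical.

```shell
#!/usr/bin/env bash
# List, once each, the variable names used with "| default(" under a directory.
list_defaulted_vars() {
  grep -rhoE '\{\{ *[A-Za-z_][A-Za-z0-9_]* *\| *default\(' "$1" \
    | sed -E 's/\{\{ *([A-Za-z_][A-Za-z0-9_]*).*/\1/' \
    | sort -u
}
```

For example, list_defaulted_vars ansible/roles/bie-hub/templates prints one variable name per line. Any name that is missing from your inventory is silently using the role default, which frequently points to an ala.org.au url.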

Debugging ansible

If something goes wrong, running ansible with -vvvv is a good way to get detailed output of the failed command.

Also check other, less verbose ways to debug: https://docs.ansible.com/ansible/latest/user_guide/playbooks_debugger.html

Cassandra

Cassandra does not start?

For instance, you can get an error like this when running ansible:

TASK [biocache-db : ensure cassandra 3.x is running] ***************************
fatal: [demo.livingatlas.org]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 127.0.0.1:9042"}

Or after a machine reboot, your cassandra does not start.

Many times this is because the machine hostname does not match the cassandra configuration. This is typical if you change the hostname of a machine after running the ansible LA inventories, or if your virtualization system does not correctly persist the hostname you selected (this happens often with openstack). Just ensure that /etc/hosts and hostname use the same name as the one in your ansible cassandra inventory.

Sometimes service cassandra start does not work because the service was not shut down correctly (for instance, if the VM ran out of disk space). Try running service cassandra stop first, or use service cassandra restart.
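
The hostname check can be scripted so it is easy to repeat after a reboot. This is a minimal sketch that only verifies that a name appears as a hostname or alias in a hosts file. The function name host_in_hosts_file is hypothetical, and comparing the result with the name used in your inventory is still up to you.

```shell
#!/usr/bin/env bash
# Succeeds when name $1 appears as a hostname or alias in hosts file $2
# (default: /etc/hosts). Lines starting with '#' are ignored.
host_in_hosts_file() {
  local name=$1 file=${2:-/etc/hosts}
  awk -v n="$name" '
    /^[[:space:]]*#/ { next }
    { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
    END { exit !found }
  ' "$file"
}
```

For example: host_in_hosts_file "$(hostname)" || echo "hostname is missing from /etc/hosts".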

Cassandra test queries

A typical occurrence search looking for some occurrence id looks like:

# cqlsh
Connected to ALA at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> use occ;
cqlsh:occ> expand on;
Now Expanded output is enabled
cqlsh:occ> select * from occ where rowkey='4497d99a-21b4-40b0-b709-3f6762c36a53';

@ Row 1
(...)
 day_p                                     | 20
 decimalLatitude                           | 42.83
 decimalLatitude_p                         | 42.83
 decimalLatitudelatitude                   | null
 decimalLongitude                          | -1.72
 decimalLongitude_p                        | -1.72
 defaultValuesUsed                         | false
(...)
cqlsh:occ> exit

There you can check things like whether a field such as collectionUid_p or the institution is not null, which helps to detect wrong data mappings.
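
The same lookup can be run non-interactively, which is handy in scripts. This is a minimal sketch that builds the CQL statement for a given rowkey, using the occ keyspace and occ table from the session above. The function name occ_query is hypothetical, and the rowkey is restricted to letters, digits and dashes to avoid breaking the quoting.

```shell
#!/usr/bin/env bash
# Print the CQL statement that fetches one occurrence by rowkey.
occ_query() {
  local rowkey=$1
  case $rowkey in
    ''|*[!0-9A-Za-z-]*)
      echo "invalid rowkey: $rowkey" >&2
      return 1
      ;;
  esac
  printf "SELECT * FROM occ.occ WHERE rowkey='%s';\n" "$rowkey"
}
```

The statement can then be passed to cqlsh with its -e option, for instance: cqlsh -e "$(occ_query 4497d99a-21b4-40b0-b709-3f6762c36a53)".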

Solr

Accessing solr from the outside

For solr administration tasks, tunnel through the firewall using SSH by running ssh -Nf -L 8983:127.0.0.1:8983 solrnode.yourdomain.org

and then visit http://localhost:8983/solr in a browser.

Solr Suggester error: Suggester was not build

Access to your server with an url like

Solr test queries

The solr web interface lets you run queries via a form. This is a good way to test for biocache or bie indexing problems.

After selecting the core you are interested in, a basic query like q=*:* to search all occurrences is a good, simple test of the health of the index.

Later you can do more advanced queries to check whether some field is NULL, NOT NULL, etc.
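
The same NULL / NOT NULL checks can be run from the command line with curl against the standard solr select handler. This is a minimal sketch: the function names are hypothetical, and the core name used in the usage note (biocache) is an assumption, so use the core you selected in the web interface.

```shell
#!/usr/bin/env bash
# Query matching documents where field $1 has some value (NOT NULL).
solr_has_value() { printf '%s:[* TO *]\n' "$1"; }

# Query matching documents where field $1 has no value (NULL).
solr_missing_value() { printf -- '-%s:[* TO *]\n' "$1"; }

# Run a query and return only the JSON header with the number of matches.
# $1 = solr base url, $2 = core name, $3 = query string
solr_count() {
  curl -sG "$1/$2/select" \
    --data-urlencode "q=$3" \
    --data-urlencode 'rows=0' \
    --data-urlencode 'wt=json'
}
```

For example, solr_count http://localhost:8983/solr biocache "$(solr_missing_value taxon_concept_lsid)" reports in numFound how many records lack a taxon_concept_lsid.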

Example of basic query in solr to check if taxon_concept_lsid has correct values (NOT NULL):

Example of basic query in solr to check if taxon_concept_lsid has correct values (NOT NULL)

Example of solr query a specific taxon_concept_lsid value:

Example of solr query a specific taxon_concept_lsid value

If you want to debug a lucene index (for instance after a biocache-store re-index), a command line tool like clue is useful. The same query as above, from the command line using clue:

Same previous query from the command line using clue

More info: https://github.com/javasoze/clue/

APIs

If you run some queries in cassandra and solr and you don't see anything strange, you can run the same tests via the API to detect whether the API works as expected.

See the API wiki page for more info.

nginx

nginx vhost ansible issues

If you try to deploy several servers in the same VM, see this section describing how to setup nginx vhost correctly.

Biocache

Start error because a circular dependency

If you enable the collectory cache in biocache-service with caches.collections.enabled=true (or caches_collections_enabled=true in your inventory), and you have the collectory and biocache-service on the same server, the start can fail because of a circular dependency when both services try to boot at the same time. You'll see in your tomcat's biocache-service.log:

2020-06-03 18:53:43,237 [biocache-ws.l-a.site-startStop-1] ERROR au.org.ala.biocache.util.CollectionsCache  (CollectionsCache.java:197) - RestTemplate error: 504 Gateway Time-out org.springframework.web.client.HttpServerErrorException: 504 Gateway Time-out

So, disable the collectory cache.

Error creating bean with name qidCacheDao: Unsatisfied dependency expressed through field searchDAO

Despite that weird message, it seems that for some reason biocache-service is not connecting to solr/cassandra at startup. It is, in a way, also a start dependency problem.

Verify also that the solr cores are correct.

After checking that solr and cassandra are working and accessible from your biocache-service VM, you should restart tomcat.

Verify your DNS or /etc/hosts, check what you have configured in /data/biocache/config for cassandra and solr, and try to connect using the same names and ports. For instance:

$ nc biocache-store.l-a.site 9042 -v
Connection to biocache-store.l-a.site 9042 port [tcp/*] succeeded!

$ nc index.l-a.site 8983  -v
Connection to index.l-a.site 8983 port [tcp/*] succeeded!

Take into account also that cassandra can be slower to start than the tomcat biocache-service, so this can happen if you start your servers at the same time and biocache-service starts before cassandra.
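
One way to work around the start-order problem is to wait until cassandra and solr accept connections before restarting tomcat. This is a minimal sketch using bash's /dev/tcp redirection. The function name wait_for_port and the default timeout of 120 seconds are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Wait until host $1 accepts TCP connections on port $2.
# Gives up after $3 seconds (default 120) and returns non-zero.
wait_for_port() {
  local host=$1 port=$2
  local deadline=$(( $(date +%s) + ${3:-120} ))
  until timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}
```

With the hosts from the example above: wait_for_port biocache-store.l-a.site 9042 && wait_for_port index.l-a.site 8983 && service tomcat7 restart.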

Sampling

You get java.util.NoSuchElementException: None.get when sampling

This is because the value of sample.fields was set to none in biocache-config.properties file aka sample_fields in your inventory.

But regions still fails after adding a layer

You need to get that new layer/field into the solr index, and this is not automatic. Sampling only adds that layer info into cassandra. You have to reindex to get that spatial info into solr.

More info in the sample and index page and in the solr admin page.

Ingestion

You get an error similar to Exception in thread "main" java.lang.Exception: Unable to load resourceUid, a primary key value was missing on record 1

Your resource connection parameters are not configured correctly. Check that the "DwC terms that uniquely identify a record" setting is configured correctly:

Wrong number of records in collections after ingestion

This is usually a mapping issue. But sometimes, if the numbers don't match, the cause can be wrong data. For instance, after renaming a faulty occurrences.txt to occurrences.csv and loading it with libreoffice, we see:

Nameindex and BIE issues

"null has been blacklisted" error

If you get the message null has been blacklisted during the generation of a new index, you probably have some empty value in the scientificName column of your DWCA file, and this field is mandatory.
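
Empty values in a column can be located before running the indexer. This is a minimal sketch that assumes a delimited text file with a header row and no quoted fields containing the delimiter (awk does not parse CSV quoting). The function name empty_values is hypothetical.

```shell
#!/usr/bin/env bash
# Print the line numbers of rows whose column $2 is empty.
# $1 = file, $2 = column name in the header row, $3 = delimiter (default: tab)
empty_values() {
  local delim=${3:-$'\t'}
  awk -F"$delim" -v col="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      if (!c) { print "column not found: " col > "/dev/stderr"; exit 2 }
      next
    }
    $c == "" { print NR }
  ' "$1"
}
```

For example, empty_values taxon.txt scientificName lists the lines to fix, where taxon.txt stands for whichever file of your DWCA holds the names.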

Duplicate author names

If the scientificName in your DWCA already includes the scientificNameAuthorship, you can generate wrong species names in your BIE service, with duplicate author names. See this section of the nameindexer page for more details.

CAS Authentication issues

If you get an unauthorized error, the CAS logs include an audit file that can show you details about missing roles or permissions for your user. For instance, imagine that you have some problem accessing the apikey service:

$ grep apikey /var/log/atlas/cas/cas_audit.log
(...)
WHAT: [result=Service Access Granted,service=https://auth.l-a.site/apikey/,principal=SimplePrincipal([email protected], attributes={activated=[1], authority=[ROLE_ADMIN,ROLE_COLLECTION_ADMIN,ROLE_COLLECTION_EDITOR,ROLE_COLLECTORS_ADMIN,ROLE_EDITOR,ROLE_IMAGE_ADMIN,ROLE_SPATIAL_ADMIN,ROLE_SYSTEM_ADMIN,ROLE_USER], city=[Wakanda], country=[WK], created=[2019-04-21 11:04:04], disabled=[0], email=[[email protected]], expired=[0], firstname=[John], givenName=[John], id=[43954], lastLogin=[2019-12-03 07:37:38], lastUpdated=[2019-06-04 19:13:40], lastname=[Ruiz Jurado], legacyPassword=[0], organisation=[GBIF.wk], role=[ROLE_ADMIN, ROLE_COLLECTION_ADMIN, ROLE_COLLECTION_EDITOR, ROLE_COLLECTORS_ADMIN, ROLE_EDITOR, ROLE_IMAGE_ADMIN, ROLE_SPATIAL_ADMIN, ROLE_SYSTEM_ADMIN, ROLE_USER], sn=[Doe], state=[MD], userid=[43954]}),requiredAttributes={}]
(...)
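
Since the audit lines are long, it can help to extract only the roles granted for one service. This is a minimal sketch that assumes the audit lines have the format shown above. The function name cas_authorities is hypothetical.

```shell
#!/usr/bin/env bash
# Print the distinct authority=[...] lists found in audit file $1
# on the lines that mention service url $2.
cas_authorities() {
  grep -F "service=$2" "$1" | grep -oE 'authority=\[[^]]*\]' | sort -u
}
```

For example: cas_authorities /var/log/atlas/cas/cas_audit.log https://auth.l-a.site/apikey/. If the role required by the service is not in the list, the user lacks that role.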

Image service

Throttling

You can experience throttling issues when sending many upload requests per second, for instance from iNaturalist.

You can decrease the rate in the biocache-store config with the media.store.maxrequests.persec setting (in ansible: media_store_maxrequests_persec, 10 by default). Notice that this setting seems to be requests/sec/thread, so use '--number-of-threads 1' to get ~1/sec. You can increase it again later to upload faster from other remote servers.

Spatial service

Error in uploads

When I access https://myspatial.l-a.site/ws/manageLayers/uploads I get a simple Error page. In the tomcat logs I can see a Wrong magic number, expected XXXX, got YYYY. Go to /data/spatial-data/uploads/ and locate the last layer directory you created:

[email protected]:/data/spatial-data/uploads# ls -lrta
total 44
drwxr-xr-x  2 tomcat7 tomcat7 4096 Apr  8 14:25 1586355605808
drwxr-xr-x  8 tomcat7 tomcat7 4096 Apr  8 14:25 ..
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 13:17 1590758233964
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 19:33 1590780637924
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 19:35 1590780757437
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 19:55 1590759518368
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 21:44 10001
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 21:56 1590768725781
drwxr-xr-x  2 tomcat7 tomcat7 4096 May 29 22:03 1590789492141
drwxr-xr-x 11 tomcat7 tomcat7 4096 Aug  3 15:58 .
drwxr-xr-x  2 tomcat7 tomcat7 4096 Aug  3 18:22 1596470330070

Then move it with: mv /data/spatial-data/uploads/1596470330070 /tmp, and refresh the upload page. The error should be gone. Verify that the uploaded layer zip has the correct contents.
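
Instead of reading the ls output by eye, the most recently modified upload directory can be found with find. This is a minimal sketch that relies on the -printf option of GNU find. The function name newest_subdir is hypothetical.

```shell
#!/usr/bin/env bash
# Print the most recently modified direct subdirectory of directory $1.
newest_subdir() {
  find "$1" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %p\n' \
    | sort -rn \
    | head -n 1 \
    | cut -d' ' -f2-
}
```

For example, newest_subdir /data/spatial-data/uploads prints the directory to inspect. Check that it really is the faulty upload before moving it to /tmp.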

Flyway

You can enable flyway debug logs by adding the following to /data/cas/config/log4j2.xml (or the similar file in other modules):

        <AsyncLogger name="org.flywaydb" level="debug" />

Other resources: