Service Monitoring - csap-platform/csap-core GitHub Wiki
To ensure application availability, CS-AP Monitoring integrates with alerting systems to notify teams when issues arise. To identify, route, and resolve incidents effectively, it is critical that all services implement health and performance APIs. While trivial services may define only a few SLA metrics, more complex services may have dozens of elements that together determine whether the service is healthy.
Use the CSAP service editor to:
- Customize resource alerts based on production and performance test observations. This includes: cpu, memory, sockets, ...
- Add all key performance indicators to facilitate both trending and troubleshooting.
- Embed the CSAP black box Java component; refer to Monitoring Java.
To achieve the highest rates of telemetry collection, the CSAP black box component (provided by csap-starter) is available for services using Java. Black Box provides embedded collection and monitoring capabilities. Embedding additional instrumentation into each service enables a 10-100x increase in telemetry over traditional collection techniques. This significantly improves engineering collaboration and opens opportunities for machine learning. Implementation details: Monitoring Java
Both HTTP and Java JMX based heartbeats are supported. Java JMX is preferred for Java applications, as connection and collection are generally more efficient. Guidelines:
- Any downstream availability issues should be included in the health checks, as RPC invocations can fail before leaving the current host, or even the current process (e.g. connection pool timeout, network routing tables, etc.).
- Responses must be very fast. Avoid any RPC; use in-memory structures to store and retrieve results gathered either during regular workflows or by background threads.
- Calls longer than the configured time (typically 500ms) will be marked as non-responsive.
- HTTP monitors can be used to monitor endpoints. In addition to true/false, the response should contain reasonable troubleshooting and routing information.
{
"Healthy": false,
"reasons": [
"Exception accessing Oracle on caf7db",
"Exception accessing Ldap on dsx.cisco.com"
]
}
- Java Monitors: implemented using JMX to invoke an isHealthy() method. The method returns an integer: 0 indicates failure, any other value indicates success. The isHealthy result is graphed, so it can also serve as a reflection of responsiveness. A failure triggers alerts, and performance metrics (see below) are then used to identify the specific failure point.
- Note: by default, CSAP will monitor every Java service. If health is not configured, CSAP will use the retrieved Java thread count to ensure that the JVM is responding to connection requests (while not ideal, this catches OOM exceptions ~80% of the time). Further, CSAP agents will use Java JMX connectivity to determine whether a JVM is responsive. This is a very limited test, as it does not capture any downstream dependencies or logic errors. All services should implement a custom isHealthy method to ensure delivery of services.
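The health guidelines above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the csap-starter implementation: the names `HealthSketch`, `ServiceHealth`, `ServiceHealthMBean`, `refresh`, `toJson`, and `register` are hypothetical, and the JMX object name is whatever the caller passes in. Dependency checks run outside the monitor call and leave their results in memory, so `isHealthy()` and the HTTP body only read cached state.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import javax.management.ObjectName;

// Hypothetical holder class; none of these names come from csap-starter.
class HealthSketch {

    // Standard MBean contract. JMX requires this interface to be public and
    // named after the implementing class plus the "MBean" suffix.
    public interface ServiceHealthMBean {
        int isHealthy(); // 0 = failure, any other value = success
    }

    public static class ServiceHealth implements ServiceHealthMBean {
        // Latest failure messages. The list is swapped as a whole, so
        // readers always see one complete result set. Assumption: a
        // service with no recorded failures is reported as healthy.
        private volatile List<String> reasons = List.of();

        // Intended to run on a background thread or as part of regular
        // workflows, never inside the monitor call. Each check returns
        // null when its dependency is fine, or a failure message.
        public void refresh(List<Supplier<String>> checks) {
            List<String> failures = new ArrayList<>();
            for (Supplier<String> check : checks) {
                String failure = check.get();
                if (failure != null) {
                    failures.add(failure);
                }
            }
            reasons = List.copyOf(failures);
        }

        // Reads cached state only, so the call returns immediately.
        @Override
        public int isHealthy() {
            return reasons.isEmpty() ? 1 : 0;
        }

        // Body for an HTTP monitor, in the shape of the JSON sample above.
        public String toJson() {
            List<String> snapshot = reasons;
            StringBuilder json = new StringBuilder("{\"Healthy\": ")
                    .append(snapshot.isEmpty())
                    .append(", \"reasons\": [");
            for (int i = 0; i < snapshot.size(); i++) {
                if (i > 0) {
                    json.append(", ");
                }
                json.append('"').append(escape(snapshot.get(i))).append('"');
            }
            return json.append("]}").toString();
        }

        // Escapes backslashes and double quotes only; control characters
        // are not handled in this sketch.
        private static String escape(String text) {
            return text.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    // Registers the bean with the platform MBean server under the given name.
    static ObjectName register(ServiceHealth health, String objectName) throws Exception {
        ObjectName name = new ObjectName(objectName);
        ManagementFactory.getPlatformMBeanServer().registerMBean(health, name);
        return name;
    }
}
```

In a real service, `refresh` would typically be scheduled at a fixed delay (for example with `ScheduledExecutorService.scheduleWithFixedDelay`), and the HTTP endpoint would simply return `toJson()`. Keeping the checks off the request path is what allows the monitor call to stay well inside the configured timeout.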
CSAP default collections include:
- Shared OS data - mpstat, load, memory
- Process data - including top, disk reads/writes, network activity
- Java data - including heap, tomcat sessions, etc.
Strongly recommended: each service should implement key performance indicators exposed using REST or Java APIs.
While the above represents significant insight into service performance, standard industry practice includes gathering additional data points that represent workload. Typically, this covers each API invoked: the number of invocations per minute, and the minimum and maximum time to complete. The Code Samples include several examples that leverage the csap-starter APIs provided by Runtime Support. Java Simon is a very simple mechanism for capturing and exporting data via JMX.
Services can add data points using the CSAP Service Editor, and view the graphs in the service portal.
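As an illustration of the workload data points described above (invocation count, minimum and maximum completion time), here is a minimal plain-JDK sketch. It is a stand-in, not the csap-starter or Java Simon API: the class name `ApiTimer` and its methods are hypothetical, and the collector is assumed to call `snapshotAndReset()` once per collection interval (for example once a minute).

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical per-API timer; one instance per monitored API.
class ApiTimer {
    private final LongAdder count = new LongAdder();
    private final LongAccumulator minNanos = new LongAccumulator(Long::min, Long.MAX_VALUE);
    private final LongAccumulator maxNanos = new LongAccumulator(Long::max, 0L);

    // Records one completed invocation and its elapsed time.
    void record(long elapsedNanos) {
        count.increment();
        minNanos.accumulate(elapsedNanos);
        maxNanos.accumulate(elapsedNanos);
    }

    // Times the given call and records it even if the call throws.
    <T> T time(Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    // Returns {count, minNanos, maxNanos} for the interval, then resets.
    // The three values are not read atomically as a group, so a call
    // recorded during the snapshot may be split across two intervals.
    long[] snapshotAndReset() {
        long calls = count.sumThenReset();
        long min = minNanos.getThenReset();
        long max = maxNanos.getThenReset();
        return new long[] { calls, calls == 0 ? 0 : min, max };
    }
}
```

A service would wrap each API call, for example `timer.time(() -> orderService.lookup(id))` with a hypothetical `orderService`, and expose the snapshot values through a REST or JMX endpoint so they can be added as data points.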