Java Performance Tuning - ashishranjandev/developer-wiki GitHub Wiki
CPU speed is measured in:
- Operations per second
- Instructions per second
Hardware Components
CPU - Cache - RAM - IO Devices (Disk or Network Interfaces)
Why do we need caches? Latency numbers:
- Reading from L1 cache takes ~4 cycles, or about 1 ns
- Reading from L2 cache takes ~10 cycles, or about 4 ns
- Reading from L3 cache takes ~40 cycles, or about 20 ns (the L3 cache is shared among CPU cores)
- Referencing main memory (RAM) takes about 100 ns
- Reading 1 million bytes sequentially from main memory takes about 5,000 ns (5 µs)
- Randomly accessing a location on an SSD takes about 16,000 ns (16 µs); reading 1 million bytes sequentially from an SSD takes about 78,000 ns (78 µs)
- A disk seek takes about 3,000,000 ns (3 ms); reading 1 million sequential bytes from disk takes about 1,000,000 ns (1 ms)
- The smaller the program, the higher the likelihood of its instructions and data fitting in the cache. So data structures and access patterns that lay out data sequentially in memory, like the Java ArrayList, are faster than non-sequential data structures like the LinkedList.
- Sequential data access is generally faster than random data access.
How to measure system performance?
USE method by Brendan Gregg (Utilisation, Saturation and Errors for every resource of the system)
Resources of our system -
- CPU cycles
- RAM capacity
- Disk capacity
- Disk I/O
- Network I/O
Utilisation - the proportion of a resource that is used, or the average time that the resource was busy servicing work.
Saturation - the degree to which the resource has extra work that it cannot yet service. Saturation indicators:
- CPU: run queue length
- RAM: swap space usage
Errors - the count of error events for a resource. Effects of errors on performance:
- Error Handling
- Retries
- Fewer pool resources
Software Resources
- A thread pool, for example, can define utilisation as the number of threads currently executing a task, and saturation as the number of items in the thread pool's work queue.
- For locks, utilisation can be defined as the time the lock was held, and saturation as the number of threads queued and waiting on the lock.
Tools to capture the above metrics
vmstat 1 10
The vmstat tool goes beyond reporting virtual memory statistics: it also reports the number of processes in the run queue, swap memory usage, block I/O, the number of interrupts and context switches per second, and processor usage.
Four components of CPU utilisation
- User time
- System/Kernel time
- Idle time
- I/O Waiting time
Although it may sound counterintuitive, it's actually desirable for CPU usage to be as high as possible for as short a period of time as possible.
In essence, the more CPU time an application gets, the faster it can execute. This is the logic behind adding more CPU cores or reducing blocking calls so that applications can get to the CPU faster and execute faster.
A high run queue for a prolonged period of time is a sign of saturation and will result in performance degradation.
A more interesting indicator of memory capacity issues, however, is the saturation metrics. In Linux, when the OS runs out of or gets low on RAM, it moves some memory pages to a special location on disk called the swap space in order to make room for pages that are in higher demand. This process is called swapping and is an indicator of memory capacity saturation and possible performance degradation. Note that even if there is sufficient RAM, the Linux kernel may choose to move memory pages that are hardly ever used to swap space, so when looking at swap activity we should pair those numbers with memory usage numbers as well.

Windows uses the swap file differently. In Windows, the swap file is called the page file, and instead of acting as a spillover space for pages that can't fit in RAM, the page file contains all the memory pages currently needed or committed to, and RAM contains just the active working set of pages. This means that RAM essentially functions as a cache of the page file, and since every page in RAM is already in the page file, Windows can quickly drop pages out of RAM when it needs the space for something else. If a reference to a page that is not in RAM cannot be resolved, this results in a page fault, and the page must then be fetched from disk into RAM. Therefore, one of the key metrics for memory capacity saturation in Windows is pages input per second: the rate at which pages are read from disk to resolve page faults. A high value of this metric is an indication of performance degradation.
iostat

We can run the iostat command and view usage statistics. iostat also has await statistics that show the average time for read requests issued to the device to be served and the average time for write requests to be served. This time includes both how long the request spent in the queue and how long it took to service the request; therefore, it's a good indication of saturation. A high value here could be an indicator of an I/O bottleneck and performance issues. For Windows, there are the Disk Read Bytes/sec and Disk Write Bytes/sec counters, which show utilization, and the Avg. Disk Read Queue Length and Avg. Disk Write Queue Length counters, which show saturation.
- Latency - the amount of time required to complete a unit of work. The best approach is to instrument the server code so that it captures the start time on receipt of the request and the end time on return of the response, to exclude transmission times.
- Elapsed time - similar to latency, but instead of measuring the time taken for each individual operation, elapsed time measures the time taken for a batch of operations to complete. This comes in handy when applications typically perform actions in batches. A second use case is microbenchmarking, where an individual operation is so quick that measuring its latency alone would distort the figure, so we run thousands of operations and take the average. It's recommended to use a tool like the Java Microbenchmark Harness (JMH) instead of writing this yourself (a minimal sketch follows this list).
- Throughput - the amount of work that an application can accomplish per unit of time.
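A minimal JMH sketch, assuming the org.openjdk.jmh dependencies are on the classpath; the benchmark body and class name are placeholders, and the harness handles warm-up and averaging for you:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

public class StringConcatBenchmark {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)          // report average time per operation
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Warmup(iterations = 3)                   // un-recorded iterations so the JIT can warm up
    @Measurement(iterations = 5)              // recorded iterations
    public String concatenate() {
        // returning the result prevents the JIT from eliminating the work as dead code
        return "energy" + System.nanoTime() + "mart";
    }
}
```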
Production Monitoring | Performance Testing
---|---
Operations activity | Development activity
Collect and aggregate metric data | Define application/component under test
Metric data storage | Generate load against application
Analysis, visualization, alerting | Analyze results
Gives aggregated information about users' actual experience | Helps to anticipate and solve performance problems before they reach production
Performance Testing tips:
- Use representative data : Input data should be representative of how the application or code will actually be used.
- A warm-up period is required: a period of time when the test is running but the recording of response times or throughput has not yet started. The warm-up period allows the JVM to complete its classloading and lets any runtime optimisations, cache warming, or code compilation finish.
Phases of Production Monitoring
- instrumentation, collection, and aggregation phase
Metrics libraries such as Dropwizard Metrics, Micrometer, Netflix's Spectator, Google's OpenCensus, and the Prometheus client library can be used; for example, a timer from a metrics library (a sketch follows this list). A better alternative to manual instrumentation is to enable profiling and use a low-overhead production profiler.
- Storage Phase
InfluxDB, Elasticsearch, Graphite, OpenTSDB, the Prometheus time series database, and commercial solutions like Datadog and Dynatrace.
- Analysis, visualization and alerting phase
We have Grafana, Kibana, JMX Tools, Prometheus dashboards and more
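A minimal manual-instrumentation sketch using a Dropwizard Metrics Timer (any of the libraries above would work similarly); the metric name, reporting interval, and simulated work are placeholders:

```java
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class RequestTimerExample {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Timer requests = registry.timer("requests");

    public static void main(String[] args) throws InterruptedException {
        // Periodically report the aggregated rates and latency percentiles
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(10, TimeUnit.SECONDS);

        for (int i = 0; i < 100; i++) {
            try (Timer.Context ignored = requests.time()) { // start timing on "receipt"
                Thread.sleep(20);                           // simulated request handling
            }                                               // closing the context records the duration
        }
    }
}
```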
Java Mission Control is a tool for collecting low-level and detailed runtime information from a JVM.
It integrates information available in all the older standalone apps like jstat, jinfo, jmap, jstack, and JConsole.
JMC comes bundled with the Oracle JDK from JDK 7 update 40 up to JDK 10; from JDK 11 onwards it is distributed separately.
JMC contains a graphical user interface for Java Flight Recorder, a profiler that we'll look at in the next module. JMC can connect to both local and remote JVM processes.
For Remote Connection some command line arguments may be required and JMC would need password or cert-based login.
java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=<desired-port> -Djava.rmi.server.hostname=<your IP Address>
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
# To run java mission control
jmc
Jcmd is the command-line equivalent of Java Mission Control.
Unlike Mission Control, it cannot connect to remote processes.
If remote JMX monitoring is not enabled, we can ssh into the host and run jcmd there.
Good for scripting.

# Get all performance related counters
jcmd 14652 PerfCounter.print | less
# Get all VM Flags
jcmd 14652 VM.flags
## Some flags can be changed at runtime (called manageable)
jcmd 14652 VM.set_flag CMSWaitDuration 1500
## Thread info
jcmd 14562 Thread.print | less
## Heap info
jcmd 14562 GC.heap_info
## Class Histogram
jcmd 14562 GC.class_histogram | less


- JMeter is a tool for load testing and analyzing the performance of web applications.
- It supports highly concurrent tests, allowing you to simulate heavy load on a server.
- Once a test is completed, JMeter will produce a detailed HTML report that allows you to easily view and analyze the service latency, throughput, and errors.
Components
- Thread Groups A group of threads that execute your test. The number of threads in the group determines the number of concurrent calls, or users.
- Samplers Samplers perform the actual work of making the HTTP or Java request.
- Logic Controllers Controllers are used to determine the sequence in which samplers are processed. Controllers provide branching, looping, random order execution, only-once execution, and more.
- Listeners Listeners listen to the results of samplers and can then save the responses, view them, provide graphs and reports, and more.
- Configuration Elements Configuration elements can be used to set up defaults and variables for later use by samplers. e.g. HTTP Cookie Manager, HTTP Request Defaults, user-defined variables, and the random variable configuration element.
- Assertions Assertions can be used to provide checks on the request, or more commonly, the responses of samplers. e.g. response should contain this text or time taken should be less than some time.
- Timers Timers can be used to add pauses between the execution of requests. Timers can be constant, random within a range, or designed to achieve a certain throughput
- Pre processors
- Post Processors
The JMeter GUI is used to create test plans, as they can be complex, but test runs should be executed from the command line.
Running JMeter from console
# threads and duration are user defined parameters
jmeter -n -t energymart.jmx -l energymart.jtl -Jthreads=100 -Jduration=120
#To Convert JTL to HTML
jmeter -g energymart.jtl -o html-report
Java profilers are able to inspect the state of a running JVM and can also modify the JVM execution for the purposes of monitoring, debugging, and analysis.
Java profilers monitor JVM execution at the bytecode level and provide information on:
- thread execution and locks
- heap memory usage
- garbage collection
- hot methods
- exceptions
- class loading
Capabilities
- passively listening for events from the JVM
- actively querying the JVM for its state
- modifying the bytecode of classes to add instrumentation code, like inserting a method-entered or method-exit event at the beginning and end of a method or inserting an object-created event into a constructor.
Events generated by JVM
- instant events: one-time events that have a timestamp and the event data - e.g. exception events, class load events, and object allocation events.
- duration events: Duration events have a start time and an end time and are used to provide timing for some activity e.g. garbage collection, monitor wait, monitor contended.
Profiling Functions of JVM
- getThreadState()
- getAllThreads()
- getStackTrace()
- getAllStackTraces()
This internal state data is queried periodically by the profiler in a process known as sampling, and the sampling period is how often the functions get called to fetch the data.
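To illustrate the idea, here is a naïve sampling sketch built on the JDK's Thread.getAllStackTraces rather than the native profiling functions listed above; real profilers are far more sophisticated, and the interval and sample count here are arbitrary:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NaiveSampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> hits = new ConcurrentHashMap<>();

        // Take a sample every 10 ms for roughly 5 seconds
        for (int i = 0; i < 500; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (stack.length > 0) {
                    // Count the top frame of each thread; frames seen often are "hot"
                    hits.merge(stack[0].toString(), 1, Integer::sum);
                }
            }
            Thread.sleep(10); // the sampling interval
        }

        // Print the ten most frequently sampled frames
        hits.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(10)
            .forEach(entry -> System.out.println(entry.getValue() + "  " + entry.getKey()));
    }
}
```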
Profiling Activities
- CPU Profiling - measures the frequency and duration of method executions, to find out which methods run most frequently and therefore eat up the most CPU time (commonly referred to as hot methods). There are two approaches:
  - Sampling - less overhead, but less accurate. To minimize sampling errors:
    - profile over a long period of time,
    - reduce the interval between samples (may increase overhead!),
    - or both.
  - Instrumentation - modifying the application's bytecode to insert code for counting invocations or timing methods. It is more accurate but has a higher potential for introducing performance differences:
    - the instrumentation code itself has some overhead attached to it,
    - and depending on how the instrumentation is done, some optimisations that could be applied to the non-instrumented code can't be applied to the instrumented code.
- Memory Profiling - concerned with understanding which objects are using up memory and how memory is being freed up by garbage collection. It lets you monitor the memory usage of your class objects over time, find out which objects are growing and shrinking in size, and see where in your code these allocations take place. It also shows the number and types of garbage collections that have occurred, the length of the pause times for each garbage collection, and how much memory each was able to free.
- Thread Profiling Thread profiling is primarily concerned with understanding what states threads are in and why. (Thread profiling is useful when your application is not as performant as it should be, yet its CPU usage is low.)
- see if your threads are able to run in parallel or not
- find out how much time your threads spend sleeping, waiting, blocked, or executing
- find and analyze cases of high lock contention.
- I/O Profiling (Less commonly used)
- JProfiler (Commercial) - Includes Automated Analysis
- YourKit Java Profiler (Commercial)
- Java VisualVM
- NetBeans Profiler
- Java Flight Recorder - low performance impact; from version 6 it provides automated analysis as well
In addition to these profilers listed here, there are also application performance management products that can do Java profiling in addition to other performance- related functions.
YourKit is able to do CPU profiling in both sampling and instrumented modes, memory profiling, and thread profiling. Other features include IDE integration, live profiling capabilities, and the ability to record and graph high-level events such as database queries, web requests, and I/O calls. Its standout feature is its powerful, automated analysis. YourKit can produce a report showing memory waste, potential I/O problems, potential memory leaks, and other hard-to-find problems.
JProfiler is one of the most widely used commercial profilers in the industry as of the time of this recording. It's comprehensive, full-featured, and boasts an easy-to-use interface. Like the YourKit Java profiler, it can do CPU, memory, and thread profiling, database query monitoring, and nicely integrates with all the major IDEs.
Java VisualVM is a lightweight, open-source profiler that's bundled with the JDK. It's not as comprehensive as the previously mentioned commercial offerings, but it does offer CPU, memory, and thread profiling. It's widely used as a first-level profiling tool before moving on to deeper tools if needed. Java VisualVM has plugins for Eclipse and IntelliJ.
The Java Flight Recorder is a profiler that was developed at Oracle to be very low overhead and therefore suitable to run in production. While other profilers use the standard JVM tools interface, Flight Recorder is directly integrated into the Oracle JDK and OpenJDK, allowing it to have a performance impact of less than 1%. Flight Recorder is part of Java Mission Control. For users of Oracle JDK 8, Flight Recorder is free to use in development environments, but recording in a production JVM requires an Oracle license. In OpenJDK 11, Flight Recorder has been published under an open-source license and is now free to use in production as well. Flight Recorder has CPU, memory, and thread profiling capabilities. In addition, starting from version 6, an automated analysis feature was added that will analyze your recording and provide a report with insights on things like methods that need to be optimized, memory usage, garbage collection, JVM internals, and more.
## For JDK 7 or 8 We have to make sure these flags are set
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
## From JDK 8u40 the FlightRecorder flag can also be enabled after startup (e.g. via jcmd)
## For JDK 11 UnlockCommercialFeatures is not required as it is not a commercial feature anymore
## Using UnlockCommercialFeatures will give warning in Oracle JDK and JVM won't start in OpenJDK 11
## To generate metadata for code that is not at a safepoint, we can use
java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints
- Continuous Recording - in continuous recording mode, the Flight Recorder engine collects event data indefinitely. When setting this up, you can specify how much data to keep in memory, in terms of size or the age of the events. Once in continuous recording mode, you can at any time trigger a dump of the in-memory data into a file that the Flight Recorder GUI can render. You can also specify that file dumps take place at regular intervals or when an event occurs, for example when the JVM is exiting or CPU usage is too high. Continuous mode is a great way to have Flight Recorder running constantly in production with very low overhead, and then get profiling data from your application when needed or when an incident is detected.
- Timed Recording - we specify how long to record for. At the end of the recording period, the Flight Recorder engine dumps the event data to the specified file.
- the default configuration The default configuration has very low overhead because it selectively doesn't gather as much data as the profile configuration. This is a configuration used in continuous recording mode by default.
- the profile configuration. The profile configuration gathers more data; therefore, it has higher overhead than the default configuration. This is the configuration typically used in timed recording mode by default.
## Oracle JDK 8
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:StartFlightRecording=settings=profile
## Open JDK 11
java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:StartFlightRecording=settings=default,maxsize=60M,maxage=1d
- jcmd
jcmd 16900 VM.check_commercial_features
jcmd 16900 JFR.start settings=default name=MyRecording maxage=4h
- Java Mission Control
- Trigger
## Timed Recording
java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:StartFlightRecording=settings=profile,delay=30s,duration=10m,name=MyRecording,filename=/tmp/myrecording.jfr
The performance of interpreted bytecode still lags behind the performance of precompiled binary code. So in 1999, the JIT compiler was introduced and added to Sun's HotSpot JVM. The function of the JIT compiler is that during runtime, it identifies the parts of the application where the most time is spent executing bytecode, commonly known as hotspots, and then compiles the bytecode dynamically into machine code so that the program can run faster. The code is compiled just before execution, hence the name just-in-time, and is cached for future runs in the code cache.
There are two types of JIT compilers:
- C1 compiler (client compiler) - intended for client applications, so it's optimized for startup performance. It tries to find hot methods early on, applies some basic, relatively unintrusive optimizations, and then compiles the method's bytecode to machine code.
- C2 compiler (server compiler) - intended for server applications. The C2 compiler waits for a longer period of time before deciding to compile a hot method. This gives the compiler more time to learn about the method's execution pattern so that it can infer and apply more aggressive optimizations. As a result, methods compiled with C2 are generally faster than methods compiled with C1.
However, because C1 starts earlier than C2, an application running with the C1 compiler will be faster earlier in its lifecycle, before C2 catches up and overtakes it. Therefore, if an application is short-lived or sensitive to startup time, the C1 compiler is actually the better choice for it.
- Tiered Compilation Tiered compilation is a compilation mode that was introduced in Java 7, and in tiered compilation, hot methods are first compiled with C1, and then as they get hotter, they are recompiled with C2. Although tiered compilation was introduced in Java 7, it wasn't until Java 8 that it became the default JIT compilation mode in the HotSpot JVM.
In tiered compilation, there are five execution levels:
- Level 0: the code runs in pure interpreted mode.
- Level 1: hot methods are compiled using the C1 compiler, with no further profiling.
- Level 2: hot methods are compiled using the C1 compiler, with some limited profiling.
- Level 3: methods are compiled using the C1 compiler with full profiling.
- Level 4: methods are compiled using the C2 compiler.
Profiling is needed for the C2 compiler to be able to infer the advanced optimizations that help it produce really high-performance code. The normal path for a hot method is that it runs in level 0 interpreted mode, and then, if it meets the threshold for compilation, it's queued up for the C1 compiler and compiled at level 3 with full profiling, so that it can later be moved up to the C2 compiler once it meets the requirements for level 4 execution.
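One way to watch a method move through these tiers is to run a small program with the standard -XX:+PrintCompilation flag; the class below is a hypothetical example, and the exact log format (including the tier column) varies between JDK builds:

```java
public class HotMethodDemo {
    // A method that becomes "hot" once its invocation and backedge counters
    // cross the tier thresholds described later in these notes
    static long mix(long seed) {
        long h = seed;
        for (int i = 0; i < 100; i++) {
            h = h * 31 + i;
        }
        return h;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sink += mix(i);
        }
        System.out.println(sink); // keep the result alive so the loop isn't eliminated
    }
}
// Run with: java -XX:+PrintCompilation HotMethodDemo
// and look for HotMethodDemo::mix appearing at increasing compilation levels.
```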
Before Java 8
#C1 Compiler
java -client
#C2 Compiler
java -server
After Java 8 these flags are ignored
#C1 compilation mode
java -XX:TieredStopAtLevel=1
#Pure C2 compilation mode - Simple Compilation Policy is used
java -XX:-TieredCompilation
The definition of a hot method is quite complex and changes frequently between JDK updates, so here is a simplified version. The definition is primarily based on the values of two counters:
- The method invocation counter. The method invocation counter simply counts the number of times the method has been called.
- The backedge counter, which counts the number of times any loops in the method have branched back. This branch-back count is effectively the number of times a loop has completed an iteration, either because it reached the end of the loop body or because it executed a branching statement like continue.
In tiered compilation mode, for an interpreted method to get selected for compilation by the C1 compiler at level 3, the following conditions must be met: either the invocation count is higher than the Tier3InvocationThreshold, or the invocation count is higher than the minimum invocation threshold and the sum of the invocation count and backedge count is higher than the Tier3CompileThreshold. By default, the Tier3InvocationThreshold is set to 200, and the Tier3CompileThreshold is set to 2,000. For a method running at level 3 to be selected for level 4 (the C2 compiler), the same condition has to be met, but with higher thresholds: the Tier4InvocationThreshold defaults to 5,000, and the Tier4CompileThreshold to 15,000.
## Lowering Thresholds
java -XX:Tier4InvocationThreshold=4000 -XX:Tier4CompileThreshold=10000
java -XX:-TieredCompilation -XX:CompileThreshold=1000
Advantages of lowering Thresholds
- Methods get compiled earlier
- "Warm" methods get compiled as well
The code cache is an area of native memory where compiled code is stored for future execution. If the code cache gets filled up, no further compilations will be done, and hot methods may continue to run in the slower interpreted or C1-compiled mode. This shouldn't really happen, though, as Java 8 typically reserves about 50 MB for the code cache when tiered compilation is off and about 250 MB when tiered compilation is on, which should be more than enough for most applications. However, if a Java application does run out of code cache space, the JVM will print a warning message that you can find in your logs or command prompt.
Increasing Code Cache
java -XX:ReservedCodeCacheSize=<N>
In Java 9 and above, the code cache has been segmented into 3 areas
- an area for JVM internal non-method code - NonNMethodCodeHeapSize
- an area for code that's being profiled - ProfiledCodeHeapSize
- an area for fully optimized non-profiled code - NonProfiledCodeHeapSize
This is helpful because we can reduce ProfiledCodeHeapSize, since profiled code doesn't stay there for long, and we don't need to touch NonNMethodCodeHeapSize. Note: in Java 9 and above, if you use the old ReservedCodeCacheSize flag to manage the size of the code cache instead of one of these newer flags, the JVM will turn off the segmented code cache feature.
Garbage collection is the mechanism by which the JVM reclaims memory on behalf of the application when it's no longer needed. At a high level, it consists of
- finding objects that are no longer in use
- freeing the memory associated with these objects
- occasionally compacting the heap to prevent memory fragmentation.
- GC pauses should be kept to a minimum, as they have a direct adverse effect on application performance.
There are four main garbage collectors available in current JVMs.
- They are the serial collector
- the parallel collector, which is also known as the throughput collector
- the Concurrent Mark Sweep collector, or CMS collector
- the garbage first garbage collector, or G1GC.
Generational garbage collectors divide the heap into two areas, the young generation area and the old generation, or tenured area. The young generation itself is split into two logical areas, eden space, also known as the allocation space, and the survivor space. Finally, the survivor space is also split up into two, survivor space 0, S0, and survivor space 1, S1. Objects are created in the eden space of the young generation area, and when eden fills up, a minor garbage collection takes place.
Minor GCs are optimised with the assumption that objects have a high mortality rate and are typically very fast. During the minor GC, objects are first checked to see if they're still reachable or not and then marked accordingly. Then reachable objects in eden have their age incremented and are then copied to a designated survivor space. Objects in the other survivor space also have their age incremented, and if they reach a certain threshold known as the tenured age, they get promoted to the tenured generation. But if they're under the tenured age, then they also get copied to the currently designated survivor space. Then both eden and the other survivor space are cleared, freeing the memory and compacting the space at the same time. The process is the same on the next minor GC run, but the survivor spaces switch roles, and the empty survivor space becomes the designated survivor space. Referenced objects are copied to this survivor space or tenured to the old generation, while eden and the other survivor space get cleared after the GC. And it goes back and forth from there.
Eventually, the old generation also gets filled up and must be garbage collected. This is called a major, or full, garbage collection, and it usually takes up more time because the search space and number of objects is higher. This is where GC algorithms have their biggest differences. Basic collections will stop all application threads, mark the unreachable objects, free their memory, compact the heap, and then resume the application threads. Whereas more advanced collectors are able to scan for unreachable objects, even when the application threads are still running, and only pause all application threads to free the memory and compact the heap. These collectors are known as concurrent collectors, mostly concurrent collectors, or low-pause collectors e.g. G1GC and CMS.
Factors
- size of the heap
- amount of live data your application uses
- number and speed of available processors
- whether or not your application is an online interactive application with low pause requirements
- Serial Collector
It uses only one thread to process the heap, both for minor and major collections, and it stops all application threads while doing so. To enable the serial collector, use the -XX:+UseSerialGC flag. Use it if your application runs on a system with only one processor or virtual processor and there are no pause-time requirements, or if you're running a lot of small JVMs on a single machine. Another use case for the serial collector is a small live data set, up to approximately 100 MB. (In these cases, using multiple threads to process the heap may not produce much of an advantage, especially considering the inter-thread communication overhead involved.)
- Parallel collector
For machines with multiple cores or 64-bit JVMs, the parallel collector is the default collector if you're running Java 7 or 8. The parallel collector uses multiple threads for minor and major collections and fully stops all application threads for both collections. This means that use of the parallel collector can result in long pause times, sometimes over a second, but overall application throughput should be higher with the parallel collector than with a low-pause collector. Therefore, if you're running a non-interactive batch application where overall application throughput is more important than low pause times, you should go with this collector. If you're running Java 9 or later, enable the parallel collector by setting the -XX:+UseParallelGC flag. For Java 7 or 8, you shouldn't need this flag, as it's the default.
- CMS collector
The CMS collector was the first concurrent collector to be introduced to the HotSpot JVM. As a concurrent collector, it's able to trace reachable objects and clean up unreachable ones while live application threads are still running, allowing it to achieve short GC pauses. However, since the GC threads run concurrently with application threads, they may compete with application threads for CPU resources, so a CPU-bound application may see a reduction in throughput while GC threads are running. To use the CMS collector, pass the -XX:+UseConcMarkSweepGC flag to the JVM. Note, however, that G1GC, which is the newer JVM concurrent collector, has been pegged as the replacement for the CMS collector, and since Java 9 the CMS collector has been marked as deprecated.
- G1GC - default collector in Java 9 and above
It's designed for multi-processor machines with large heaps, and it tries to achieve the best balance between latency and throughput. We can specify a goal for the maximum pause time we want the application to encounter, and the collector will try to meet that requirement by reclaiming as much space as it can within the given constraints. G1GC threads mark unreachable objects concurrently while the application threads are running. It's a good fit for interactive applications. To enable G1GC in Java 7 or 8, set the -XX:+UseG1GC flag.
- Shenandoah Collector
Shenandoah is able to do both the marking and moving concurrently without stopping the application threads, bringing garbage collection closer to being a fully concurrent process. Shenandoah is able to achieve pause times that are on average 10 times smaller than G1GC pause times while only decreasing throughput by 10%.
GC Logging
- when collections were run
- how long they lasted
- what the pause times were
- how much memory was reclaimed
- how many objects were promoted to the old generation
## Java 7 and 8
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file-path> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
## Java 9 and above
java -Xlog:gc:file=<file-path>::filecount=10,filesize=10M
## To get additional info add *
java -Xlog:gc*:file=<file-path>::filecount=10,filesize=10M
## Dynamically start GC Logging
jcmd <pid> VM.log what=gc output=<file-path>
PrintGCDateStamps prints the actual time instead of time relative to JVM start. If a file path is not given, the log is printed to stdout.
From Java 9, unified JVM logging (JEP 158) gives all collectors the same log format; before Java 9, the format depended on the type of garbage collector in use.
GC logs can be analysed with:
- a text editor
- specialised tools such as GCeasy.io or GCViewer
- Heap Size
Too small? Too many GCs, and application throughput will suffer. Too large? Each GC will take a long time, and response times will suffer (the G1GC collector might help here).
The size of the heap is controlled by two flags: an initial value, specified with -Xms, and a maximum value, specified with -Xmx.
Adaptive sizing: if the JVM is experiencing memory pressure and observes that it's doing too much GC, it will continually increase the heap until the memory pressure dissipates or until the heap hits its maximum size. We can turn this off if we are sure of the size we need: -XX:-UseAdaptiveSizePolicy. Using -Xms6g -Xmx6g sets the same initial and maximum heap size; -XX:NewSize=1g -XX:MaxNewSize=1g sets the same initial and maximum size for the young generation.
Rule of thumb: the heap should be about 30% occupied after a full GC.
-XX:MaxGCPauseMillis=350
Sets target for Maximum GC Pause Time
To achieve this
Parallel Collector will
- Adjust size of the young and the old generation
- Adjust size of the heap
in G1GC, the MaxGCPauseMillis flag goes beyond adjusting the heap sizes
- to include starting background processing sooner,
- changing the tenuring threshold,
- and adjusting the number of old regions processed.
G1GC's a concurrent collector. This means that some of the phases of the garbage collection process can be running concurrently with application threads. This can lead to a situation where the old generation runs out of memory while the collector's in the middle of the garbage collection process since the application is still running and is producing garbage faster than it can be cleaned. This situation is known as either a concurrent failure mode, a promotion failure, or an evacuation failure, depending on when the failure occurs.
If you see a lot of these failures in the GC logs, the solution is to either:
- Increase the size of the heap.
- Start G1 background processing earlier. To perform G1 background activities more frequently, reduce the threshold at which a G1 cycle is triggered by lowering the value of the InitiatingHeapOccupancyPercent flag. This flag is set to 45 by default, meaning a GC cycle is triggered when the heap becomes 45% filled. Reducing this value means a GC gets triggered earlier and more often, but take care not to set it so low that GCs happen too frequently.
- Speed up GC processing by using more background threads, via the ConcGCThreads flag. Its default value is (ParallelGCThreads + 2) / 4. As long as you have sufficient CPU available on the machine, you can increase this value without incurring performance penalties.
If tuning the size of the heap and tuning the collector doesn't work for you, then you can try another collector. And if you still aren't getting good results, then you need to look at tuning the application code itself.
When the size of the input is small, the performance difference between a O(log(n)), O(n), and even a O(n log(n)) algorithm is not so big. At those small sizes, if the O(n) algorithm is simpler to implement, it may make more sense to use the simpler algorithm instead of the more complex algorithm. Some of the Java API implementations like Arrays.sort or HashMap nodes take this approach. If the number of items is less than a particular threshold, for example, 8 items, then a simpler algorithm with a higher time complexity is used. But if the number of items is bigger than the threshold, then a more complex algorithm with a lower time complexity is used.
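A sketch in the same spirit (not the actual JDK implementation): pick the simpler algorithm below a size threshold and delegate to the library sort above it. The threshold value here is illustrative:

```java
import java.util.Arrays;

public class ThresholdSort {
    // Below this size, the simpler O(n^2) insertion sort tends to win
    // because of its lower constant factors and good cache behaviour.
    private static final int INSERTION_SORT_THRESHOLD = 8;

    static void sort(int[] a) {
        if (a.length < INSERTION_SORT_THRESHOLD) {
            insertionSort(a);
        } else {
            Arrays.sort(a); // delegate to the library's O(n log n) sort
        }
    }

    private static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }
}
```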
Benefits of Saving Memory
- Reduce GC Activity
- More of the application's instructions and data are able to fit in RAM and in the CPU caches, allowing it to run faster.
- The less data a CPU has to move around in memory, the faster it can perform.
- Having a smaller memory footprint allows you to make better use of your hardware and can result in monetary savings.
In Java, we have primitive types and object types. All eight primitive types have set sizes.
The size of an object is expressed in three ways:
- Shallow size - the size of the object itself. If the object has references to other objects, the sizes of those objects are not included in the calculation, just the size of the references.
- Deep size - includes both the size of the object and the size of all referenced objects.
- Retained size - the same as the deep size, with the notable exception that if a referenced object is shared with other objects, its size is excluded from the calculation.
To get the shallow size of an object, you add the size of the object header to the sizes of the fields. In a 64-bit HotSpot JVM, the object header is 12 bytes, and each field is sized according to its type. If the field is a reference to another object, the reference is 4 bytes if the heap is 32 GB or smaller, or 8 bytes if the heap is larger than 32 GB.

Java objects are aligned to 8-byte boundaries in memory; therefore both ClassA and ClassB in the sketch below take up 16 bytes of memory. Even though ClassA only has an object header of 12 bytes, those 12 bytes still require two 8-byte segments.

The remaining unused 4 bytes of the second segment are filled with padding. In ClassB, the remaining 4 bytes are used by the integer field. If we add a second field to ClassB, say a String reference, its shallow size goes up to 24 bytes, due to the 4-byte object reference plus the padding needed to fill the rest of the 8-byte segment.
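The classes referenced above aren't shown in these notes; the following is a reconstruction consistent with the sizes quoted (64-bit HotSpot, compressed references, heap ≤ 32 GB), with the two-field variant named ClassC here for clarity:

```java
// Shallow sizes on a 64-bit HotSpot JVM with compressed references:

class ClassA {
    // 12-byte object header + 4 bytes of padding = 16 bytes
}

class ClassB {
    int count;   // 12-byte header + 4-byte int = 16 bytes, no padding needed
}

class ClassC {
    int count;   // 12-byte header + 4-byte int
    String name; // + 4-byte compressed reference = 20 bytes, padded up to 24 bytes
                 // (the String object itself counts towards the deep size, not here)
}
```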
- Reduce object size by getting rid of fields you're not using.
- Correctly size each data type to the range of values it will actually hold (a BitSet sketch follows this list):
  - If a field represents a person's age, use a short instead of an int.
  - If a variable encodes the state of a process with 12 possible states, replace the int with a byte.
  - If you need a date with only a year, month, and day, you can go with a Java Date or LocalDate object (24 and 32 bytes respectively), or simply use a long to represent the number of seconds or milliseconds.
  - The primary benefit of the BigDecimal class is that it solves rounding issues with floating-point numbers, but BigDecimal objects are 32 bytes in shallow size alone. If your application does not require absolute precision, use 4-byte floats or 8-byte doubles and the Math.round method when needed.
  - If you have many boolean flags, instead of representing them with booleans, which take up a byte each, represent them with a BitSet, in which each flag takes up only a bit. You can then use the BitSet's get, set, clear, and flip methods to read and manipulate the bits.
- Prefer primitives over objects. When should you use objects over primitives?
  - When a collection class is used (or use an alternative primitive-collection library like FastUtil or Trove).
  - When a variable can have a valid unset state (alternatively, use a sentinel value like -1 to indicate the field is not set).
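A small sketch of the BitSet approach mentioned above; the flag indices are arbitrary:

```java
import java.util.BitSet;

public class FeatureFlags {
    // One bit per flag instead of one boolean field per flag
    private final BitSet flags = new BitSet(64);

    void enable(int flagIndex)       { flags.set(flagIndex); }
    void disable(int flagIndex)      { flags.clear(flagIndex); }
    void toggle(int flagIndex)       { flags.flip(flagIndex); }
    boolean isEnabled(int flagIndex) { return flags.get(flagIndex); }
}
```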
(Keep in mind that there are always tradeoffs when it comes to performance optimization. The tradeoffs could be between memory and CPU utilization or between performance and usability)
When a method returns an empty result, either:
- return null, or
- return Collections.emptyList. The added benefit of using Collections.emptyList, emptyMap, or emptySet is that the returned collection is immutable, which prevents the calling class from modifying our static returned value.
- Set initial capacity - set the initial capacity of a list or map via the appropriate constructor. If you use the default constructor instead, the list or map will allocate enough contiguous memory for a backing array of 10, in the case of the list, or 16 buckets, in the case of the map. If you only plan on using one of the slots, you end up wasting 36 or 60 bytes per singleton collection.
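A quick illustration of the default versus pre-sized constructors (the exact behaviour depends on the JDK version; recent JDKs allocate the backing storage lazily on the first insert):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizedCollections {
    // Default constructors grow to a backing array of 10 (ArrayList)
    // or a table of 16 buckets (HashMap) even if only one entry is ever stored.
    List<String> oversized = new ArrayList<>();
    Map<String, String> oversizedMap = new HashMap<>();

    // If the expected size is known (here: a single element), say so up front.
    List<String> rightSized = new ArrayList<>(1);
    Map<String, String> rightSizedMap = new HashMap<>(2); // 2 buckets comfortably hold one entry
}
```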
Let's say you're reading a million instances of a ClassA object into memory, either from a stream or a database, and one of the fields of the object is a reference to a ClassB object, which is also read from the stream or DB. The naïve method would be to instantiate a ClassB instance for each ClassA instance you create. But if there are only about 30 unique ClassB values in your data source, then you've created a huge number of duplicate ClassB objects when you only needed 30.
One of the ways to combat this duplication is to use an interner. Interning is a method of storing only one copy of a distinct object. With interning, objects are compared using their equals method, and when an object is found to be equal to another, one of the objects is discarded, effectively deduplicating it. Interning should only be used for immutable classes. Guava and triava have thread-safe interner implementations based on concurrent hash maps.
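A minimal sketch using Guava's Interners (the Country value class is hypothetical and must be immutable with proper equals/hashCode):

```java
import java.util.Objects;

import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

public class InternerExample {
    // A hypothetical immutable value class with proper equals/hashCode.
    static final class Country {
        private final String code;
        Country(String code) { this.code = code; }
        @Override public boolean equals(Object o) {
            return o instanceof Country && ((Country) o).code.equals(code);
        }
        @Override public int hashCode() { return Objects.hash(code); }
    }

    // Thread-safe interner backed by a concurrent map; weak references let
    // unused canonical instances be garbage collected.
    private static final Interner<Country> INTERNER = Interners.newWeakInterner();

    static Country canonical(Country candidate) {
        return INTERNER.intern(candidate); // returns the first equal instance seen
    }
}
```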
Lazy initialization is a tactic of delaying the creation of an expensive object until the first time it's needed.
This means that if the object is never needed, you avoid creating it and don't incur the cost that comes with it. Let's use a Session object as an example. It has two methods, one that uses a Connection object and one that doesn't. If the majority of the time the users of the application create a Session object, call the doSomething method, and only call the writeSomething method 10% of the time, then in 90% of cases you're creating an unused object, and an expensive one at that. The solution is to lazily initialize the Connection object.
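A minimal reconstruction of the Session example described above, assuming a JDBC Connection is the expensive object; the JDBC URL is a placeholder:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Session {
    private Connection connection; // expensive to create; most callers never need it

    public void doSomething() {
        // no connection required here
    }

    public void writeSomething(String data) throws SQLException {
        Connection conn = getConnection(); // created only on first use
        // ... write `data` using the connection ...
    }

    private synchronized Connection getConnection() throws SQLException {
        if (connection == null) {
            // lazy initialization: pay the cost only when a write actually happens
            connection = DriverManager.getConnection("jdbc:postgresql://localhost/energymart");
        }
        return connection;
    }
}
```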
String duplication is a situation where multiple string references point to different instances of strings on the heap, even though the contents of those strings are the same. Since strings are immutable, it would be preferable to have only one canonical version of the string object on the heap and just have multiple pointers to that one instance. This can be achieved by interning. To intern strings, we can use the String.intern method provided by the JDK.
To implement this, the JDK maintains a string pool, which is actually a fixed-capacity HashMap with each bucket containing a list of strings with same hashcode. A fixed-capacity HashMap means that unlike regular HashMaps, the number of buckets doesn't grow as the HashMap gets filled. Once the number of buckets is set, it doesn't change.
This means that as more and more strings are interned, the list of strings per bucket grows longer and longer. Therefore, if the number of buckets is set to a small size, the linked list of strings per bucket could potentially become very long, and data access performance could suffer. The larger the table, the better the performance of the String.intern method. To view the default size of the string table:
java -XX:+PrintFlagsFinal -version | grep StringTableSize
## Increase the size by setting the StringTableSize option to a value of your choosing.
java -XX:StringTableSize=1000013
** Java automatically interns and stores in the string pool all string literals and string valued constant expressions. **
As each bucket uses 4 bytes on most JVMs, for every increase of a thousand, you would be using up 4 KB, which may not be a bad price to pay for the performance increase.
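A tiny demonstration of String.intern; the string literal is arbitrary:

```java
public class InternDemo {
    public static void main(String[] args) {
        String literal = "energymart";            // literals are interned automatically
        String built = new String("energymart");  // explicitly creates a second heap instance

        System.out.println(built == literal);          // false: two distinct objects on the heap
        System.out.println(built.intern() == literal); // true: intern() returns the pooled instance
    }
}
```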
In Java 8u20, the string deduplication feature was added to the G1 garbage collector. The string deduplication feature uses a deduplication thread to analyze all strings that have been moved between heap regions and that meet a certain object threshold. The deduplication thread uses the hashcode of the string, the equals method, and a deduplication hash table to see if a particular string needs to be deduplicated. If so, it then replaces the string reference with a reference to the stored version, which frees up the string to be garbage collected. As this process can be expensive, it only tends to run when there are available CPU cycles. To enable string deduplication, ensure that you're using the G1GC collector and then set the UseStringDeduplication option.
java -XX:+UseStringDeduplication
## To view results
-XX:+PrintStringDeduplicationStatistics
-Xlog:stringdedup*=debug ##Java 9
In previous versions of Java, strings stored characters in a backing array that uses 2 bytes for each character. This is to ensure that Java can support UTF-16 characters. However, there are many applications that only deal with Latin-1 characters and can be stored in 1-byte character arrays instead. This is what the Compact Strings feature does. It switches the internal representation of Java strings so that 1-byte arrays are used if the string only contains Latin-1 characters, and 2-byte arrays are used if it contains UTF-16 characters.
Another string feature introduced in Java 9 is the moving of the string pool for interned strings into the class data sharing (CDS) archive. This means that multiple JVMs running on the same machine can now use the same string pool, which could result in some memory savings.
It's important to know how an object will be used and move it to the lowest object scope possible.
Objects that are created and referenced in a method and that are not allowed to escape from the method have a lifespan that only lasts as long as the method is executing. Whereas objects that are referenced from instance variables will continue to exist as long as the instance exists. While objects that are referenced from static variables will exist for as long as the application is running.
Advantages of Local Scope Optimization
- Objects that are completely local to a method can be optimized by the JIT compiler.
- Escape analysis - if all the necessary conditions are met, the compiler will deconstruct such objects into their fields and put the fields on the stack instead of the heap, which has significant performance benefits.
The other guidance, concerning not keeping objects around for longer than needed, applies to situations where you add items to a long-lived collection like a map or a list.
If for every request to your application you add an item to the collection but don't remove it when the request is done, the item will live in the heap for longer than needed and will never get garbage collected, as long as there's a strong reference to it. This is actually a type of memory leak, and it can easily occur if you have a poorly implemented cache (a bounded-cache sketch follows).
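One simple way to avoid this kind of leak is to bound the cache, for example with LinkedHashMap's removeEldestEntry hook; this is a sketch, not a production cache (no expiry, not thread-safe):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A size-bounded LRU cache: the eldest entry is dropped once the limit is reached,
// so entries don't stay strongly referenced (and un-collectable) forever.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU eviction behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```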
One of the key considerations when it comes to configuring a ThreadPoolExecutor is the size of the pool. This is important because if the pool size is too small, then you may not be taking full advantage of your hardware to achieve the best possible performance.
Setting the pool size to a value that's too high can have a negative impact on performance. Reasons:
- The higher the number of threads in a system, the more time the CPU has to devote to scheduling threads instead of just running them.
- switching from one thread to another involves saving the outgoing thread's context and restoring the incoming thread's context, which is an expensive process.
- context switching leads to invalidating caches and data locality that was built up for the previously running thread.
two common approaches for determining the optimal pool size:
- One approach is to follow a formulaic model based on the number of available CPU cores, the target CPU utilization, and the characteristics of the task that's being performed.
Number of threads = Number of Available Cores * Desired CPU Utilisation * (1 + Wait time / Service time)
This means that if you have a compute-intensive task, that is, a task that's CPU bound and has virtually zero wait time, and you'd like to optimize for 100% CPU utilization, you should set your thread pool size to the number of CPU cores.
For mixed or I/O-intensive workloads, you need to determine the average wait-to-compute ratio. If the wait-to-compute ratio is 75 to 25 and you're targeting 100% CPU utilization, then the number of threads should be set to the number of cores multiplied by 4 (see the sizing sketch after this list).
Some of the factors other than CPU that can affect thread pool sizing are memory, file or socket handles, and database connections.
- experimental approach - where the system is put on the load and different thread pool size values are benchmarked until the optimal size is found. With the experimental approach, you would subject the application to the volume and types of traffic you expect to handle and then vary the thread pool sizes while monitoring system performance and thread pool characteristic such as the number of busy and idle threads and the length of the work queue. After enough iterations, the optimal thread pool size should emerge.
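A sketch of the sizing formula above; the wait and service times would come from measurements of the real task:

```java
public class PoolSizeCalculator {
    // Number of threads = cores * target utilisation * (1 + wait time / service time)
    static int poolSize(double targetCpuUtilisation, double waitTimeMs, double serviceTimeMs) {
        int cores = Runtime.getRuntime().availableProcessors();
        return (int) Math.ceil(cores * targetCpuUtilisation * (1 + waitTimeMs / serviceTimeMs));
    }

    public static void main(String[] args) {
        // CPU-bound task (no waiting), 100% target utilisation -> one thread per core
        System.out.println(poolSize(1.0, 0, 100));
        // I/O-heavy task with a 75:25 wait-to-compute ratio -> 4x the number of cores
        System.out.println(poolSize(1.0, 75, 25));
    }
}
```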
Recommendations
- Regardless of which approach you decide to use for thread pool sizing, for important thread pools, it's still beneficial to monitor thread pool behavior. Also, due to the nature of thread pool sizing, a best practice for doing so is not to hardcode the thread sizes into your code, but instead to make it configurable using some configuration method like a config file or to dynamically calculate it based on the Runtime.availableProcessors method.
- Another recommendation is that for a task that has both CPU- and I/O-bound components, if it can be cleanly and efficiently done, it might be beneficial to split the task and use different thread pools that can be individually sized and tweaked.
- if you have multiple thread pools in your application, focus first on tuning your core or critical thread pools like your service or data layer thread pools before tuning auxiliary thread pools for functions like logging.
When creating a ThreadPoolExecutor instance in Java, you configure the executor with a corePoolSize, a maximumPoolSize, a keepAliveTime, and a work queue instance. The corePoolSize is the number of worker threads that will be kept in the pool for executing tasks. When tasks arrive in the pool, they're queued and then picked up by an idle worker thread. If all the threads are busy, the task stays in the queue.
There are three different types of queues:
- Unbounded queues - when an unbounded queue is used in an executor, the pool size never changes. As the queue grows, the latency of your application grows, since tasks at the back of the queue must wait for all preceding tasks to be executed first. And if your clients are sending requests faster than the server can handle, you run the risk of exhausting the resources of your server.
- Bounded queues - when the queue is full, the thread pool adds a new worker for handling tasks, up until a maximum number of worker threads, and afterwards it starts rejecting requests. For many server applications, a best practice is to correctly size your pool, use a bounded queue with a limit that's relatively small but just large enough to handle bursts of requests, and to set the core and max number of workers to the same size (or the core pool size to the bottom of your optimal size range and the max pool size to the top).
- Synchronous queues - a synchronous queue holds no tasks at all; each submission is handed off directly to a waiting worker thread, and if none is available a new thread is created up to the maximum, after which tasks are rejected.
** Note: One last thing to mention is the choice of queue implementation. The typical first-in, first-out blocking queues like LinkedBlockingQueue and ArrayBlockingQueue cause tasks to be started in the order in which they were received. If some tasks are more important than others, you can instead use a PriorityBlockingQueue, which will order tasks by priority. **
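A sketch of the bounded-queue setup described above; the pool and queue sizes are hypothetical and would come from your own sizing exercise:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    public static void main(String[] args) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                8,                                // corePoolSize: bottom of the calculated range
                16,                               // maximumPoolSize: top of the range
                60, TimeUnit.SECONDS,             // keepAliveTime for threads above the core size
                new ArrayBlockingQueue<>(100));   // small bounded queue to absorb bursts

        // When the queue is full and all 16 workers are busy, new tasks are rejected
        // (RejectedExecutionException by default) instead of growing latency without bound.
        executor.execute(() -> System.out.println("task running on " + Thread.currentThread().getName()));
        executor.shutdown();
    }
}
```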
Introduced in Java 7, the ForkJoinPool is an executor service implementation that's designed for tasks that need to be split up into smaller chunks which can then be executed in parallel.
The ForkJoinPool uses the paradigm of recursion to accomplish the splitting and merging.
- Fork phase: if a task is larger than a certain threshold, it gets split over and over again until it's under the threshold, at which point it actually gets executed by one of the threads in the thread pool.
- Join phase: tasks that get split into subtasks are suspended until their subtasks are completed. When the subtasks complete, the results are merged in the join phase to form the final result.
If the task size is above the threshold, we create two new instances of the recursive task and call fork on each. This places the tasks in the queue, where they wait to be picked up by a free worker thread. Next come the join statements, which tell the parent task to wait, if necessary, for the results of the subtasks. When the task is below the threshold, it gets executed directly and the result gets returned to the caller, which is either the waiting parent task or the client itself (a sketch follows).
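A reconstruction along the lines of the task described above, here summing a long array; the threshold value is arbitrary:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;
    private final long[] numbers;
    private final int start, end;

    SumTask(long[] numbers, int start, int end) {
        this.numbers = numbers;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long sum = 0;                            // small enough: do the work directly
            for (int i = start; i < end; i++) sum += numbers[i];
            return sum;
        }
        int mid = (start + end) / 2;                 // fork phase: split the task in two
        SumTask left = new SumTask(numbers, start, mid);
        SumTask right = new SumTask(numbers, mid, end);
        left.fork();                                 // queue each half for a worker thread
        right.fork();
        return left.join() + right.join();           // join phase: wait and merge the results
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total);
    }
}
```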
The advantage of the ForkJoinPool is that with its unique recursive model, it can handle a large amount of tasks with very few threads. The Java 8 parallel streams feature uses the ForkJoinPool under the hood to parallelize operations and collections.
** Locks are not bad. Lock Contention is ** In Java, locks are used to protect shared state from data corruption when being accessed by multiple threads. When a thread obtains an exclusive lock, other threads trying to obtain the same lock must wait until the owning thread releases the lock.
The blocked thread can wait for the lock to be released by either:
- spin-waiting, that is repeatedly trying to acquire the lock until it succeeds. If the wait time is short, spin-waiting might be more efficient than getting suspended, as it avoids the cost of context switching.
- OS Suspension - With longer wait times, suspension by OS is preferable.
Disadvantages
- Low scalability
- Context Switches which hurt performance
Lock implementations in the JVM:
- Thin lock - lightweight, used while a lock is uncontended
- Fat lock - inflated, used once a lock becomes contended
Reducing lock contention:
- Reduce the duration for which the lock is held - simply a matter of reducing the size of the critical section to the point where only the shared data access is protected. Techniques for reducing the demand for a lock:
- Lock splitting - takes a single lock that's guarding several independent shared state variables and splits it into multiple locks, one per variable. This diffuses the demand on the lock and results in each lock being requested less frequently.
- Use of the ReadWriteLock - if a collection or data structure is written to infrequently, for example when it's initialized and when there are updates, but reads happen at a high frequency, then a ReadWriteLock is ideal for reducing lock contention. Instead of a single lock protecting some shared state, the ReadWriteLock gives you two locks, the read lock and the write lock. Multiple reader threads can hold the read lock and run simultaneously as long as no writer thread holds the write lock. As soon as a writer thread obtains the write lock, the reader threads are blocked, and only one writer thread can hold the write lock at a time. (A sketch appears after this list.)
- Lock striping - like lock splitting, reduces the granularity at which a lock is applied. Lock striping is applicable when we have a variable-size collection of independent objects: you partition your data into groups and have a lock for each group instead of one lock for the whole object. This is the approach used by some versions of ConcurrentHashMap to allow concurrent writes to a single map. As long as each write updates a bucket guarded by a different lock, there's no lock contention and the writes can operate concurrently.
- Finally, you can replace locks with other mechanisms that allow for greater concurrency.
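As a sketch of the ReadWriteLock technique from the list above, here is a hypothetical read-mostly configuration holder; the class and field names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical read-mostly store guarded by a ReadWriteLock: many readers can
// hold the read lock concurrently; a writer takes the exclusive write lock.
public class ReadMostlyConfig {
    private final Map<String, String> settings = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public String get(String key) {
        lock.readLock().lock();          // shared: does not block other readers
        try {
            return settings.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(String key, String value) {
        lock.writeLock().lock();         // exclusive: blocks readers and writers
        try {
            settings.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Under a read-heavy workload, many threads can call get concurrently while put briefly blocks everyone, which contends far less than a single exclusive lock.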
If you only need to synchronize access to a single variable, then instead of using a lock you should use an atomic variable. Java provides the AtomicBoolean, AtomicInteger and AtomicIntegerArray, AtomicLong and AtomicLongArray, and AtomicReference and AtomicReferenceArray classes.
- Objects of these classes allow you to set their values atomically, and in the case of AtomicInteger and AtomicLong, perform simple arithmetic operations such as incrementAndGet, decrementAndGet, and getAndAdd that can be done atomically.
- The significance of these operations being atomic is that because they can happen in one step, write interference, which leads to data corruption, cannot occur.
- Therefore, when atomic variables are used to synchronize access to shared state, locking and the cost associated with threads being blocked are eliminated. Atomic variables generally outperform lock-based synchronization, and often by a lot.
- The implementation of atomic variables relies on hardware support. Atomic variables use the atomic compare-and-swap (CAS) instruction provided by most modern CPUs. If a compare-and-swap instruction is not available, the runtime may fall back to a lightweight locking mechanism such as a spin lock.
**It's advisable to replace synchronized access to a single variable with the atomic variant of that variable.**
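A minimal sketch of that advice: a request counter that uses an AtomicLong instead of a synchronized method (the RequestCounter class is hypothetical).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter: the atomic read-modify-write needs no lock.
public class RequestCounter {
    private final AtomicLong count = new AtomicLong();

    public long increment() {
        return count.incrementAndGet();   // atomic increment, no blocking
    }

    public long current() {
        return count.get();
    }
}
```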
The LongAdder is internally composed of cells that each contain a variant of AtomicLong. If a thread tries to update the value of a LongAdder while another thread is currently executing the compareAndSet operation, then instead of waiting, the thread selects another cell to add its value to. The effect of this is that under high contention, writes are spread across multiple cells, which reduces the contention and increases throughput. To get the total value of the LongAdder, the values of all the cells are collected and added up.
All adder and accumulator implementations in Java inherit from an interesting base class called Striped64. Instead of using just one value to maintain the current state, this class uses an array of states to distribute the contention across different memory locations. We would expect dynamic striping to improve overall performance; however, the way the JVM allocates these states may have a counterproductive effect.
To be more specific, the JVM may allocate those states near each other on the heap. This means that several states can reside in the same CPU cache line, so updating one memory location may cause cache misses for its neighboring states. This phenomenon, known as false sharing, hurts performance.
To prevent false sharing, the Striped64 implementation adds enough padding around each state to make sure that each state resides in its own cache line.
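For completeness, here is a small usage sketch of LongAdder; the HitCounter wrapper is hypothetical, but the increment/sum calls are the standard API.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// Hypothetical counter: LongAdder spreads contended increments across internal
// cells, so it typically outperforms a single AtomicLong under heavy writes.
public class HitCounter {
    private final LongAdder hits = new LongAdder();

    public void record() {
        hits.increment();        // cheap even when many threads call it at once
    }

    public long total() {
        return hits.sum();       // sums all cells; call only when a snapshot is needed
    }

    public static void main(String[] args) {
        HitCounter counter = new HitCounter();
        IntStream.range(0, 1_000).parallel().forEach(i -> counter.record());
        System.out.println(counter.total()); // 1000
    }
}
```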
The ConcurrentHashMap is a thread-safe version of the HashMap data structure that supports highly concurrent updates and fully concurrent retrievals. Earlier implementations of ConcurrentHashMap used lock striping to enable high concurrency. As a result, they provided much better performance and scalability than a Collections.synchronizedMap or a Hashtable, which both use a single lock to guard the entire object. In the current OpenJDK implementation of ConcurrentHashMap, these striped locks have been dropped in favor of compare-and-swap operations and volatile puts and gets, with locking limited to just the node level. As a result, current implementations have even better performance and scalability characteristics. ConcurrentHashMap is a drop-in replacement for synchronized maps.
The CopyOnWriteArrayList ensures thread safety by creating a copy of the underlying array any time a mutative operation like add, set, or remove takes place. This guarantees that, for threads currently iterating through the list, the array they are iterating over won't change out from under them. The alternative would be a synchronized list, which places a lock on the whole list in order to ensure stable iteration, but with the CopyOnWriteArrayList no locking is needed. When a new reader thread comes in, it gets an iterator that points to the current state of the underlying array; any subsequent additions, removals, or updates will not be visible to it. The result of this model is that the CopyOnWriteArrayList is highly inefficient if there are lots of updates to the list or if the list is very large. But if there are very few updates and list traversal operations vastly outnumber update operations, the CopyOnWriteArrayList can be more efficient than a synchronized list.
The ConcurrentLinkedQueue uses non-blocking compare-and-swap operations for adding and removing items from the collection. It also implements the Iterable interface, so you can loop over its contents with the enhanced for loop.
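A quick usage sketch of these three collections, with made-up variable names: an atomic per-key update on ConcurrentHashMap, lock-free iteration of a CopyOnWriteArrayList, and non-blocking offer/poll on a ConcurrentLinkedQueue.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class ConcurrentCollectionsDemo {
    public static void main(String[] args) {
        // ConcurrentHashMap: atomic per-key update without external locking
        Map<String, Integer> wordCounts = new ConcurrentHashMap<>();
        wordCounts.merge("hello", 1, Integer::sum);

        // CopyOnWriteArrayList: safe iteration while other threads add listeners
        List<Runnable> listeners = new CopyOnWriteArrayList<>();
        listeners.add(() -> System.out.println("event"));
        for (Runnable listener : listeners) listener.run();

        // ConcurrentLinkedQueue: non-blocking, CAS-based add and poll
        Queue<String> tasks = new ConcurrentLinkedQueue<>();
        tasks.offer("task-1");
        System.out.println(tasks.poll());
    }
}
```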
- Applying the volatile keyword to a field lets the compiler and the Java runtime know that this variable is shared.
- Compiler: it should not reorder operations involving this variable.
- Runtime: volatile variables are not cached in registers or processor caches; instead, they're written directly to and read directly from main memory.
- When a thread writes to a volatile variable, all variables visible to that thread are also flushed from cache to main memory. The effect is that updates to a volatile variable from one thread are predictably propagated to other threads.
Usage:
- The most common usage of volatile variables is as a completion, interruption, or status flag (sketched after this list).
- However, the use of volatile can extend to any situation where updates are done by only one thread and there are one or more reader threads.
- You can also use volatile variables with multiple writer threads, as long as writes to the variable do not depend on its current value, which means the writer threads are just publishing their results.
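A minimal sketch of the status-flag usage mentioned above, assuming a simple hypothetical Worker runnable:

```java
// One thread requests shutdown; the worker reliably sees the update because
// the flag is volatile.
public class Worker implements Runnable {
    private volatile boolean running = true;

    public void shutdown() {
        running = false;                 // write becomes visible to the worker thread
    }

    @Override
    public void run() {
        while (running) {
            // do a unit of work
        }
        System.out.println("worker stopped");
    }
}
```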
Immutable objects are objects whose state cannot be changed once they are constructed. That means data corruption cannot happen, so synchronization is not needed. Therefore, the more you use immutable classes when writing concurrent code, the less synchronization and locking is needed.
The last workaround for synchronization that I'll discuss is not sharing at all. Instead of sharing an object amongst threads, you can give each thread its own copy using the ThreadLocal class.
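A common illustration of ThreadLocal (not from the course material) is giving each thread its own SimpleDateFormat, since that class is not thread-safe:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Instead of sharing one SimpleDateFormat (or locking around it), each thread
// gets its own private instance via ThreadLocal.
public class DateFormatter {
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static String format(Date date) {
        return FORMAT.get().format(date);   // uses this thread's copy only
    }
}
```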
However, there are instances where you've completely parallelized your workload, each worker has its own resources, and nothing is shared, but at the end of the task you need to gather all the results back together to pass back to the user or on to the next stage of the process. Aside from the concurrent collections, one way to do this is to use a concurrent queue implementation.
There are two variations of these: a set of queues that are non-blocking and a set that are blocking. The blocking versions are useful when you want the consuming thread to wait a period of time for a message to arrive in the queue, or when you want to control the growth of the queue.
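A minimal sketch of gathering worker results through a bounded blocking queue; the worker threads and the one-second timeout are illustrative choices, not prescribed values.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Workers push results into a bounded queue; the consumer waits a limited time
// for each result. The bound also limits how fast the queue can grow.
public class ResultGatherer {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> results = new ArrayBlockingQueue<>(100);

        for (int i = 0; i < 4; i++) {
            final int workerId = i;
            new Thread(() -> results.offer(workerId * workerId)).start();
        }

        int collected = 0;
        while (collected < 4) {
            Integer result = results.poll(1, TimeUnit.SECONDS); // wait up to 1 s
            if (result == null) break;                          // timed out
            System.out.println("got result: " + result);
            collected++;
        }
    }
}
```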
I should also remind you that no system is perfect, and achieving a perfectly optimized system, while noble, may not be the right goal to have. There are two reasons for this. One is that, quite often, tradeoffs need to be made. These tradeoffs are often between CPU cycles and memory, but you could also be trading off between design features, such as a more modular design versus a more tightly coupled design with better data locality, or a simpler design versus a more complicated one that's more efficient but less readable. The other reason to aim for less than perfect is the law of diminishing returns. This is not true in all situations, but in many cases the first few optimizations provide the most benefit, and additional optimization effort on top of that only yields marginal performance improvements.

These two factors highlight the importance of having a performance goal and then measuring application performance to see how you're doing in relation to that goal. In the introductory module, we talked about the common application performance metrics (throughput, latency, and elapsed time) and how to measure them, either during performance testing or through production monitoring. Your performance goals should be expressed in terms of these metrics. For example, you could set a goal of an average latency of 50 ms for database queries, or a target for the number of concurrent users your web application should be able to support.

Some of the guidelines for performance testing we discussed: have a warmup period to allow the system to reach a steady state, ensure that the test setup and the traffic being sent are representative of how the code will be used in production, and measure not only application performance but also system metrics. Measuring system performance can help you find hardware bottlenecks and inform how to better optimize your application or system based on the hardware characteristics.

Some of the tools that come into play in performance tuning: load testing tools like JMeter, Gatling, and several commercial options; application instrumentation libraries like Dropwizard Metrics, Micrometer, Spectator, and Prometheus; OS tools for system monitoring like typeperf on Windows and vmstat, iostat, and netstat on Linux; and JVM monitoring tools like Java Mission Control, JConsole, and the jcmd command-line tool.

To get detailed information about the execution of a Java application, you should use a Java profiler. Java profilers allow you to do CPU profiling, which tells you which methods are taking up the most CPU time and should be a target for optimization. They also offer memory profiling, which lets you view the memory usage of your classes' objects over a period of time, find out which objects are growing and shrinking in size, see where in your code memory allocations are taking place, and see how much memory is being freed by garbage collection. The third capability common to profilers is thread profiling. This allows you to view the states of your application threads and find bottlenecks in cases of lock contention. It comes in very handy when you need to perform concurrency optimization, as it tells you which locks, threads, or thread pools to target for optimization.
In addition to the three common capabilities, certain Java profiling products have other capabilities like automated analysis, SQL profiling, I/O profiling, exception analysis, and more. In this course, we used the Java Flight Recorder because of its low overhead, tight integration, and license structure, but there are many other open-source and commercial profilers out there, like JProfiler and the YourKit Java Profiler. Doing some form of profiling and performance analysis is an important step in performance tuning, as it helps determine which parts of your application are used most often and hence will provide the most benefit when optimized. Optimizing a part of your code that's not frequently used, or that doesn't eat up much CPU time or memory, results in an almost negligible contribution to overall application performance.

After identifying your performance targets, setting up monitoring and metrics, and performing some performance analysis to identify bottlenecks and tuning targets, then comes optimization. There are many levels at which optimization can be done; the two we looked at in this course were JVM-level and code-level optimization.

On the JVM level, we have the just-in-time compiler. Just-in-time compilation is a technique used by the JVM to speed up application execution by profiling application behavior, dynamically identifying hot methods and performance tuning opportunities, and then compiling the hot methods into optimized machine code, which it caches for future use. The most important tuning activities related to JIT compilation are choosing a compilation mode, choosing the compilation threshold, and tuning the size of the code cache. The other JVM-level tuning activity we discussed is GC tuning. Depending on the type of application, GC tuning is often about finding a balance between minimizing GC pauses and minimizing background GC activity. As such, the tuning knobs we have to play with are the size of the heap regions, the number of background threads, and the thresholds at which GC activities are triggered.

On the code optimization level, performance tuning is mainly focused on understanding the cost of data structures and algorithms and figuring out ways to pay less for the same functionality. The standard way of measuring this cost is big O notation, which allows us to abstractly define the performance characteristics of algorithms or of operations on data structures. This way, if there is a cheaper alternative, we can use it instead. The usage patterns in your application also play an important role in data structure selection. For example, having more frequent writes than reads (or vice versa), requiring ordered traversals, or the frequency of insertions into the middle of a list can all affect your choice of data structure. In addition to choosing the right data structure, cost reduction is also the goal of techniques that minimize the footprint of Java applications while maintaining functionality, and of techniques that reduce lock contention or thread synchronization while maintaining concurrent execution. In all three cases, minimizing the cost of data and operations improves the performance and scalability of the system.
Typically, object creation in Java isn't expensive. However, certain objects are expensive to create, either because creating the object requires a good amount of computation, the object performs I/O on creation, or the object uses system resources like threads, shared memory, or file handles. Creating these types of objects frequently, for example on every method call or every loop iteration, will have an impact on performance. So to avoid repeatedly incurring the cost of expensive operations, we can turn to caching. Caching is a broad term that describes any mechanism by which data is stored so that future requests for that data can be served faster.

The simplest form of caching is object reuse. For objects that are expensive to create, you can ensure that the object is created just once, either during class initialization or on demand, and then reused every time it's needed afterwards. This strategy works especially well for stateless, thread-safe objects. An example of such an object is the Pattern object from the java.util.regex package. In order to get an instance of a Pattern object that can be used for matching, you must compile the RegEx pattern. This is a relatively expensive process that you don't want to repeat hundreds of times per second, particularly if you're always using the same RegEx string. The solution is to have a compiled Pattern object that's cached and then reused for all the matching operations.

You should also note that there are a number of String methods that use a Pattern object under the hood: String.split, matches, replace, replaceFirst, and replaceAll. The String.split method provides a fast path and only compiles a Pattern object if the RegEx string is longer than a single character or if that single character is a RegEx metacharacter, whereas the other methods always compile and use a new Pattern object on every invocation. So if you have a hot method where these String methods are used extensively, you may be wasting a lot of CPU cycles. To mitigate this, in the case of String.matches, you can directly use a Pattern object that you compiled once and reuse it. For the split and replace methods, you can instead use the StringUtils class from the Apache Commons library; its split and replace methods are implemented without using a Pattern object under the hood.

Another form of caching is object pooling. For non-thread-safe, stateful objects that are expensive to create, we can use an object pool. With an object pool, the application creates a number of the expensive object instances ahead of time and then leases them out when needed. When the requestor is done using the object, the object is returned to the pool so that it can be used for another request. This way, instead of each requestor creating and destroying a new object instance, the requestors share a pool of pre-created objects. The most common examples of such objects are database and socket connection objects, but other expensive objects can be pooled as well. It's important to keep in mind, though, that only objects that are expensive to create should be pooled. If the object's creation uses OS resources like threads and shared memory, or performs I/O during creation, then it's likely a good candidate for pooling. Otherwise, it may be just as performant, or even more performant, to create new objects when needed and let them get garbage collected when they're no longer needed.
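To make the Pattern-reuse strategy above concrete, here is a minimal sketch with a hypothetical EmailValidator; the RegEx itself is illustrative, the point is that the Pattern is compiled once and reused.

```java
import java.util.regex.Pattern;

// Compile the Pattern once and reuse it, instead of paying the compilation
// cost on every call. Pattern objects are stateless and thread-safe.
public class EmailValidator {
    private static final Pattern EMAIL = Pattern.compile("^[\\w.+-]+@[\\w-]+\\.[\\w.]+$");

    public static boolean isValid(String candidate) {
        return EMAIL.matcher(candidate).matches();   // no recompilation here
    }

    public static void main(String[] args) {
        System.out.println(isValid("user@example.com")); // true
        System.out.println(isValid("not-an-email"));     // false
    }
}
```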
To determine whether your application benefits from object pooling, the simplest thing to do is to test an instance of your application with pooling and without pooling and note the performance difference. Many libraries that create expensive objects already implement object pooling for you, but if you need to implement object pooling yourself, instead of writing your own implementation I'd suggest you use the Apache Commons Pool library, as it provides several object pool implementations that are fully featured and optimized for high performance and scalability.

The third form of caching we'll discuss is results caching. This technique involves storing the result of an operation along with the request that generated it, so that the next time an identical request is received, the application can simply retrieve the stored result instead of recalculating it. The most common way of implementing this is the Cache-Aside technique. With Cache-Aside, the cache is checked first; if a cache hit occurs, meaning the result is found, the result is returned to the caller. If there is a cache miss, we perform the computation and put the result, keyed by the request, into the cache before returning the result to the caller.

The cache hit ratio is the ratio of cache hits to total cache requests. If the cache hit ratio is high enough, caching can provide a significant performance boost. However, a very low cache hit rate suggests that, most of the time, checking the cache and writing results back to it is wasted effort, since it yields nothing and we still end up having to perform the computation. In addition to the wasted time querying and saving the data, cache memory space is also wasted when the hit rate is low. Therefore, caching is only an effective performance optimization if we have hot data or if the data has low variance; it can actually be detrimental if the data is evenly distributed or has very high variance. To implement in-memory caching, you should use a cache implementation from the Guava, triava, or Apache Commons libraries. These implementations offer features like configurable eviction policies, time-based expiration, and element event handling, which allow you to produce efficient caching behavior based on your application's usage and data patterns.
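Here is a deliberately minimal Cache-Aside sketch built on ConcurrentHashMap.computeIfAbsent; real cache libraries such as the ones named above add eviction and expiration, which this illustration omits, and the ResultsCache class is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache miss: compute, store, then return. Cache hit: return the stored result.
public class ResultsCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> expensiveComputation;

    public ResultsCache(Function<K, V> expensiveComputation) {
        this.expensiveComputation = expensiveComputation;
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, expensiveComputation);
    }

    public static void main(String[] args) {
        ResultsCache<Integer, Double> sqrtCache = new ResultsCache<>(k -> Math.sqrt(k));
        System.out.println(sqrtCache.get(144)); // computed on the first request
        System.out.println(sqrtCache.get(144)); // served from the cache
    }
}
```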
So far, we've covered performance optimization on the system level, on the JVM level, and on the application level. But that's not the whole story when it comes to building performant applications. There are a number of decisions made at the architectural level that guide how an application is built and greatly influence its performance and scalability characteristics. And while coverage of these architectural-level decisions is mostly out of the scope of this course, they're something to be mindful of.

One of these architectural decisions is how to scale your application. There are two choices available here: scaling up or scaling out. The choice is typically driven by economics or organizational culture, but choosing one or the other affects how the system gets architected. These days, most companies choose to scale out. This means the workload is divided up amongst several server machines; when the workload increases, you simply add more machines, and when it decreases, you can remove machines. There are two things to consider in this scenario. One is that it's still important to correctly size each machine based on the application workload. For example, if the application is compute heavy but, even after tuning, will not use more than 500 MB of RAM per core, then using machine instances that have 8 cores and 16 GB of RAM would be a waste of resources. The other thing to keep in mind is that if the application reads from or writes to a database, which most useful applications do, database access can become a bottleneck, and if that happens, it'll be necessary to scale the database as well. In fact, database access is such a prevalent bottleneck that many performance analysts start there when looking for ways to improve application performance. On the client side, it's possible to take advantage of strategies like connection pooling, which allows you to reuse connections, statement caching, which allows you to use precompiled queries, and batching, which allows you to send multiple queries in a single database roundtrip.

Another current architectural trend is breaking up monoliths into microservices. Although microservices have tremendous advantages in enabling applications to scale more efficiently and in allowing organizations to scale their development teams, this comes at a price in terms of the performance overhead of interservice communication. There are many ways to lessen this overhead, both on the hardware and on the software side. On the software side, one option is to use a binary format for interservice communication, like protobuf, Thrift, Avro, or MessagePack, instead of a text-based format like JSON or XML. These formats result in messages that are significantly smaller, hence faster to transmit and lighter on bandwidth, and serializing and deserializing them into Java objects is usually much faster than for text-based formats. The other tip is to use the circuit breaker pattern in your microservices in order to limit cascading failures. The circuit breaker pattern avoids cascading failures by monitoring a service call for failures and, when failures are detected, shutting off traffic to the service and switching to some failover mechanism. This prevents the struggling service from getting further overwhelmed and prevents the calling service itself from becoming unresponsive, at which point the issues can cascade down the line.
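To illustrate the circuit breaker idea, here is a deliberately simplified, hypothetical sketch (not the API of any specific library such as Resilience4j): it counts consecutive failures, opens after a threshold, fails fast to a fallback while open, and retries after a cool-down.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: closed -> open after repeated failures,
// then a single trial call is allowed once the cool-down has elapsed.
public class CircuitBreaker<T> {
    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public synchronized T call(Supplier<T> serviceCall, Supplier<T> fallback) {
        if (openedAt != null) {
            if (Duration.between(openedAt, Instant.now()).compareTo(coolDown) < 0) {
                return fallback.get();            // circuit open: fail fast
            }
            openedAt = null;                      // cool-down elapsed: allow a trial call
        }
        try {
            T result = serviceCall.get();
            consecutiveFailures = 0;              // success keeps the circuit closed
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();         // too many failures: open the circuit
            }
            return fallback.get();
        }
    }
}
```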
The last tip I'll give is to switch to asynchronous communication between services when possible. The benefit is that a thread in the calling service doesn't have to wait for a response before it can continue working; it can queue up the request and continue doing other things, and once the response is ready, it gets notified and reacts to the notification. So as you can see, architectural factors can also play a huge role in application performance, sometimes more so than JVM-level or application-level optimizations.