TL;DR:
A minimal webserver achieved less than 5k concurrent requests / second, much below expectations. The reason: the logging configuration was invisibly consuming cycles.
Context
Last night I stumbled upon this post which outlined a performance issue related to JVM webservers. The post you're reading contributes nothing, it just serves to document my thoughts.
Reproducing
I cloned the repo and reproduced the results.
- I found running the command twice gave different results. So I let the first run warmup the server, then on the next run I took the measurement.
- The initial run of
-t 12 -c 200
showed no difference in requests / second before/after the fix. That happened only after I bumped the concurrency up to 2000. - The configuration with higher throughput had lower p50 latency,
higher p75 latency, and a lower p99 latency. I'd like to better understand this relationship.
Before the fix:
❯ wrk --latency -t 12 -c 2000 -d30s http://localhost:3003
Running 30s test @ http://localhost:3003
12 threads and 2000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 94.94ms 196.02ms 1.21s 92.88%
Req/Sec 748.27 455.32 2.02k 64.31%
Latency Distribution
50% 42.66ms
75% 47.92ms
90% 64.82ms
99% 1.06s
170610 requests in 30.09s, 26.52MB read
Socket errors: connect 0, read 300559, write 0, timeout 0
Requests/sec: 5669.51
Transfer/sec: 0.88MB
After the fix
❯ wrk --latency -t 12 -c 2000 -d30s http://localhost:3003
Running 30s test @ http://localhost:3003
12 threads and 2000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 99.53ms 124.28ms 1.05s 79.77%
Req/Sec 2.95k 1.08k 9.13k 75.04%
Latency Distribution
50% 31.52ms
75% 190.35ms
90% 292.67ms
99% 409.98ms
992290 requests in 30.09s, 154.25MB read
Socket errors: connect 0, read 66686, write 252, timeout 0
Requests/sec: 32972.35
Transfer/sec: 5.13MB
Future Questions/Reading
- Are there any other java logging pitfalls?
- I should gain a better understanding of the logback configuration.
- Why is JMX referenced in that post? [This article] baeldung-flight-recorder implies it's closely related to flight recorder.
- How is
wrk
implemented? - Tested limits are per-codebase and per-machine. In order to reproduce the issue I had to search for the issue. How should I keep this in mind when using load-testing tools? Is there an identifiable sweet spot?
- How useful is local profiling? My local machine will never exactly reproduce prod.
- Learn how to use
wrk
Lua scripting to implement fancy load-testing - Can I enable profiling/clustered on production servers?
- How can I get an intuition for server speed?