hpincket - Webserver Load Testing: A first exposure

TL;DR:

A minimal webserver achieved less than 5k concurrent requests / second, much below expectations. The reason: the logging configuration was invisibly consuming cycles.

Context

Last night I stumbled upon this post which outlined a performance issue related to JVM webservers. The post you're reading contributes nothing, it just serves to document my thoughts.

Reproducing

I cloned the repo and reproduced the results.

I found running the command twice gave different results. So I let the first run warmup the server, then on the next run I took the measurement.
The initial run of -t 12 -c 200 showed no difference in requests / second before/after the fix. That happened only after I bumped the concurrency up to 2000.
The configuration with higher throughput had lower p50 latency,
higher p75 latency, and a lower p99 latency. I'd like to better understand this relationship.

Before the fix:

❯ wrk --latency -t 12 -c 2000 -d30s http://localhost:3003
Running 30s test @ http://localhost:3003
  12 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.94ms  196.02ms   1.21s    92.88%
    Req/Sec   748.27    455.32     2.02k    64.31%
  Latency Distribution
     50%   42.66ms
     75%   47.92ms
     90%   64.82ms
     99%    1.06s
  170610 requests in 30.09s, 26.52MB read
  Socket errors: connect 0, read 300559, write 0, timeout 0
Requests/sec:   5669.51
Transfer/sec:      0.88MB

After the fix

❯ wrk --latency -t 12 -c 2000 -d30s http://localhost:3003
Running 30s test @ http://localhost:3003
  12 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    99.53ms  124.28ms   1.05s    79.77%
    Req/Sec     2.95k     1.08k    9.13k    75.04%
  Latency Distribution
     50%   31.52ms
     75%  190.35ms
     90%  292.67ms
     99%  409.98ms
  992290 requests in 30.09s, 154.25MB read
  Socket errors: connect 0, read 66686, write 252, timeout 0
Requests/sec:  32972.35
Transfer/sec:      5.13MB

Future Questions/Reading

Are there any other java logging pitfalls?
I should gain a better understanding of the logback configuration.
Why is JMX referenced in that post? [This article] baeldung-flight-recorder implies it's closely related to flight recorder.
How is wrk implemented?
Tested limits are per-codebase and per-machine. In order to reproduce the issue I had to search for the issue. How should I keep this in mind when using load-testing tools? Is there an identifiable sweet spot?
How useful is local profiling? My local machine will never exactly reproduce prod.
Learn how to use wrk Lua scripting to implement fancy load-testing
Can I enable profiling/clustered on production servers?
How can I get an intuition for server speed?