VDD25Q2 - varnishcache/varnish-cache GitHub Wiki
VDD 2025Q2 May 26/27 Oslo
- Voted on topics, ran through topics from most to least votes.
- Did not take notes of people attending.
- These notes are an edit of the meeting etherpad.
- New acronym: VDDV, for Varnish Developer Day Verdict.
VIPC + VEXT
phk sent a patch to -dev a long time ago and did not receive feedback:
The mgr can set up a UDS to multiple worker processes; this should be robust across restarts. Workers need to register CLI commands, maybe just as JSON? It should be possible to send file descriptors (socketpair?). The master will know how to restart processes.
Where/how should VEXTs hook in? If VEXTs are to be called no earlier than just before dropping privileges in the worker, we need to re-do argument parsing and socket maintenance in the worker and/or in coordination with the mgr.
The worker needs to be able to reject argv (i.e. don't restart it). Move all acceptor code into the worker?
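The fd-passing step mentioned above can be sketched with a Unix socketpair; Python's `socket.send_fds`/`recv_fds` (3.9+, Unix only) wrap the same SCM_RIGHTS mechanism varnishd would use in C with `sendmsg()`/`recvmsg()`. A toy model, not Varnish code:

```python
# Minimal sketch (not Varnish code) of sending a file descriptor over a
# Unix socketpair, as the mgr/worker split would need.
import os
import socket

mgr_end, worker_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()                             # any fd works; here a pipe's write end
socket.send_fds(mgr_end, [b"take-fd"], [w])  # "mgr" hands the fd over

msg, fds, _flags, _addr = socket.recv_fds(worker_end, 1024, 1)
os.write(fds[0], b"hello")                   # "worker" writes through the received fd
data = os.read(r, 5)                         # observable on the original pipe
```

The received descriptor is a fresh duplicate in the receiver, so the sender can close its copy without affecting the worker.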
Side topic: platform support
solaris and 32bit support should probably be phased out; we should make an announcement and break 32bit on purpose.
We should support more libcs, in particular musl/Alpine. Guillaume to set up a vtester on Alpine.
Should look at existing alpine failures: https://github.com/varnishcache/varnish-cache/issues?q=is%3Aissue%20alpine%20%20author%3Agquintard
What to do with {net,open}bsd? VDDV: put back the autoconf.
H2 rework
VS has current work to turn the implementation upside-down: switch to an async, non-blocking session handler, where only the session thread handles I/O and worker threads hand work items to the session thread.
nils: sounds like a match with my own plans, would like to build upon it. Please get the PR out.
TLS
Guillaume: "sick of people putting nginx in front"
UPLEX is working on TLSentinel (a keyserver) and Tritium (async I/O with TLS)
side topic: vtest
VDDV: Merge as suggested, keep vtc_varnish, vtc_haproxy etc. in vtest, try to keep it compatible (already done at the time of the publication of this wiki page)
dynamic backends
Went through a concrete proposal by Alve, then agreed to focus on collecting the pain points and requirements
Pain points
- vcl scope, implies counters, probes, connection limits

Problems
- backend.list is not particularly useful
- the new-user experience is not good
- backend discovery and load balancing are tied together and should not be
  - should be done first, because it is fundamental to modularity
  - need to support custom attributes (key/value)
  - version number & pull update (unlocked, some kind of cooldown)
- counters are in the wrong place (and get lost when backends get re-created at a fast pace); they should be where the users need them
- no CLI-configurable backends
- no sharing of directors/backends across VCLs
  - setting a backend sick can be counter-acted by loading a new vcl
- the relation between a backend and its IP addresses might be outdated
  - at which layer does the backend sit?
  - happy eyeballs are non-trivial today
- what does it mean to be std.healthy()?
  - readiness vs. liveness health?
  - make healthy a float instead of a boolean?
  - [0.0-1.0] as the "estimated probability of the next request returning successfully"
- connection limit at director level
  - ties into counters & how we call .resolve() instead of .connect()
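To make the float-health idea concrete, here is a toy sketch (all names invented, not a Varnish API) where health is an exponentially weighted moving average of request success and a director treats it as the estimated probability that the next request succeeds:

```python
# Toy model of "healthy as a float in [0.0, 1.0]" from the notes above.
class Backend:
    def __init__(self, name, alpha=0.2):
        self.name = name
        self.alpha = alpha   # EWMA smoothing factor
        self.health = 1.0    # start optimistic

    def record(self, ok):
        # move health toward 1.0 on success, toward 0.0 on failure
        target = 1.0 if ok else 0.0
        self.health += self.alpha * (target - self.health)

def healthiest(backends, threshold=0.5):
    # a director could treat health >= threshold as "usable"
    usable = [b for b in backends if b.health >= threshold]
    return max(usable, key=lambda b: b.health) if usable else None
```

A boolean `std.healthy()` could then be derived by comparing against a threshold, keeping backwards compatibility while exposing the richer value to directors.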
THE FORBIDDEN SOLUTION SPACE: what we need
(This was discussed despite multiple attempts to focus on the problem space only)
- ref backends across VCLs (limited to an export/import hierarchy)
  - needs recursive vcl switch
  - does this ref backends or vcl objects?
- counters shared across backends
- cross-director notification
- maybe "only probe at idle"?
Discussion points
- global backends: which configuration wins? the last vcl.use?
- multi-tenant: should be namespaced
- phk idea: export/import backends
- asad: separate configuration?
- phk: "top level vcl" using vcl switching to share vcl objects
Original suggestion by Alve
Backend groups can be created and used natively in VCL:
    # a group has a set of counters, not the individual endpoints
    # a group is a "pool of IP addresses"
    # typically just one probe per group
    # max_connections applies to the whole group
    backend_group native_group {
        .host = "https://example.com";
    }

    sub vcl_recv {
        set req.backend_hint = native_group.backend;
    }

    # Backend groups can also be created and used by VMODs:
    import dns;

    sub vcl_init {
        new vmod_group = dns.backend_group("https://example.com");
    }

    sub vcl_recv {
        set req.backend_hint = vmod_group.backend();
    }

    # Both of the groups defined above can be passed to other VMODs:
    sub vcl_init {
        new shard = directors.shard();
        shard.add_group(native_group);
        shard.add_group(vmod_group.group());
    }

    # A backend group can be declared without a .host attribute and passed to a VMOD:
    import k8s;

    backend_group template {
        .first_byte_timeout = 10s;
        .between_bytes_timeout = 5s;
        .max_connections = 500;
    }

    sub vcl_init {
        new k8s_group = k8s.group("https://$K8S_API", template);
    }

    # A backend group can also be controlled through the CLI:
    backend_group cli_group {
        .first_byte_timeout = 10s;
    }

    varnishadm backend_group.set_host cli_group https://example.com
    varnishadm backend_group.set_max_connections cli_group 500
Alternative (Pål):
    vcl 4.2;

    backend_group example_org {
        # The internal DNS thread will keep the group updated
        .type = "dns";
        .host = "example.org";
        .port = "80";
    }

    backend_group my_manual_dynamic {
        # The CLI can be used to add and remove IP addresses to the group,
        # which persists outside this VCL. The group will be garbage
        # collected when no VCL holds a reference (by a declaration of
        # the group) and there are no backends in the group.
        .type = "cli";
    }

    backend_group my_abstract_group {
        # A VMOD/VEXT needs to take control and be responsible for updating the group
    }

    sub vcl_init {
        new exampleorg = directors.fallback();
        exampleorg.add_group(example_org);

        new d = directors.round_robin();
        d.add_group(my_manual_dynamic);

        # Maybe:
        k8s.manage(my_abstract_group, ...);
    }
VAI
Intro to the PR, Q&A, discussion; no additional notes taken
Backend errors reported in VCL
https://docs.varnish-software.com/varnish-enterprise/troubleshooting/#errors <- This is pretty cryptic.
https://github.com/varnishcache/varnish-cache/pull/4097/
See notes in the ticket
New VSL
- The current VSL approach is basically working, but there is feedback about shortcomings:
  - know if logs are complete
  - react if they are not
- Logging is costly
  - processing overhead multiplies for each separate log client instance
  - VSL clients internally process the log twice, by extracting/filtering transactions first and then formatting them
- If the VSM ring gets overwritten, log clients usually crash and we do not know anything about the lost data
- roughly 1:1 ratio between log and traffic volume in many cases
- would it help to segregate the vsm?
  - a similar technique is using std.log() from vcl, otherwise not sure
- would it help to batch differently (submit the entire client+backend transaction)?
  - problem: backend transactions usually take much longer (bgfetch), we want to see the client transaction ASAP
  - need to be able to give less knowledgeable users precise instructions on how to gather the log data required for troubleshooting
- should varnishd open multiple vsm files ("infinite", limited by disk space)?
  - if varnishd passes file descriptors to log clients, the O/S will automatically close them if clients go away
  - does not address the performance issues
  - not reusing files could be more robust
  - backpressure could be implemented via free-space monitoring or link count
  - could use the link count to inform varnishd about the number of clients and avoid overwrites today (needs a writable file system for clients)
  - varnishd would publish the names of the files in shared memory
- could have vcl control over vsl tags and/or records
proposal:
- change shared memory: split it into blocks, which no longer need to be temporally ordered (blocks of n KB)
- log transactions acquire indexes into shm, refcounted in varnishd
- log clients receive references and have a guarantee that referenced buffers will not be overwritten
- varnishd keeps track of references with timeouts (implicit de-reference for a dying client)
  - default: similar to current behavior - if a client is too slow, hang up
  - option for backpressure to slow down varnishd processing
- varnishd executes the query "which client needs this log"
- the varnishlog -d option should remain possible by keeping ref==0 segments around for longer
- clients would not know the amount of records lost if they get hung up on, but if a client cannot keep up, varnishd can inform it about the number of lost records
- vslq is more efficient in varnishd because re-creating tx linkage is spared and filtering records is driven by the records, not by the query (match all queries against each record, improving cache locality)
  - can use bitmaps to mark vsl tag presence in a tx
- log writing is more efficient because vsl_buffer is skipped
- interface: uds named socket; the actual log "attach" should be fd passing (trying containers might be a good idea)
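A rough toy model of the refcounting scheme described above (an assumption about the design, not actual varnishd code): a block is only reused once no client references it, so referenced buffers are never overwritten and a slow client either causes backpressure or gets dropped:

```python
# Toy model of refcounted shared-memory log blocks.
class Block:
    def __init__(self, idx):
        self.idx = idx
        self.refs = 0      # number of clients referencing this block
        self.records = []

class BlockPool:
    def __init__(self, nblocks):
        self.blocks = [Block(i) for i in range(nblocks)]
        self.next = 0

    def acquire(self):
        # find an unreferenced block to write new log records into
        for _ in range(len(self.blocks)):
            blk = self.blocks[self.next]
            self.next = (self.next + 1) % len(self.blocks)
            if blk.refs == 0:
                blk.records.clear()
                return blk
        return None  # all blocks referenced: backpressure, or hang up a client

    def ref(self, blk):
        blk.refs += 1    # client now holds a no-overwrite guarantee

    def unref(self, blk):
        blk.refs -= 1
```

In varnishd the unref would additionally happen implicitly when a client's reference times out or its connection dies.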
Extend VSC metrics to include labels
- metrics have name/value/type
- add (prometheus) labels (unrelated to vcl labels)
- needs an example
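As a possible illustration for the "needs example" point: a hypothetical labeled VSC counter rendered in the Prometheus text exposition format (the metric and label names here are made up):

```python
# Render a labeled counter in Prometheus text exposition format.
def render_metric(name, labels, value, mtype="counter"):
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} {mtype}\n{name}{{{label_str}}} {value}"

line = render_metric("varnish_backend_req",
                     {"backend": "web1", "vcl": "boot"}, 42)
```

The point of labels is that one metric name covers all backends/VCLs, with the dimension carried in the label rather than baked into the counter name as VSC does today.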
better VCL segmentation across domains/addresses/ports
- probably solved by https://varnish-cache.org/tutorials/vmod_sets.html (at least to some extent)
- would need a strawman, write docs
- pondering options to auto-magically load vcls for specific hosts/urls?
General focus on GSD
- make tickets more easily actionable
  - which problem are we trying to solve
  - how
  - see link in CONTRIBUTING
- multi-step process:
  - present the idea, get feedback early
  - write documentation
  - prepare the full PR
(bonus) New/updated CLI varnishadm API, with proper JSON output
- general agreement to do json first, then clear text
- add generic cli interaction code to libvarnishapi?
- phk to do work in the course of VIPC
Various additional notes
New varnish-cache based project: https://github.com/thechangelog/pipely
phk notes on Backend (and later other stuff) Import/Export
This will have to go through vcl labels, so the exporting VCL can be replaced.
But we do not want to resolve labels whenever we need a backend, so an intermediate layer with refcounts will be required.
When reloading/relabeling the providing VCL, should the stats be moved over?
If backend stats become configurable, the stats could live in the intermediate layer.
Exporting a sub is much more involved, because the VCC cannot tell from the import statement which objects the sub may access (the state/compat check is compile-time now), so that becomes a vcl.load-time check. /phk
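A toy model of the intermediate refcounted layer sketched in phk's note (an assumption about the design, not actual code): holders sit between a label and the backend it currently resolves to, so request handling never re-resolves the label, and relabeling simply retargets the holder:

```python
# Toy model of a refcounted indirection between a VCL label and a backend.
class BackendHolder:
    def __init__(self, backend):
        self.backend = backend
        self.refs = 0

    def ref(self):
        self.refs += 1
        return self

    def unref(self):
        self.refs -= 1

    def retarget(self, backend):
        # relabeling to a new VCL updates the holder in place;
        # existing references now transparently see the new backend
        self.backend = backend

holder = BackendHolder("backend_from_vcl_v1").ref()   # importing VCL takes a ref
holder.retarget("backend_from_vcl_v2")                # exporting VCL is replaced
```

If backend stats become configurable, they could live on the holder, answering the stats-migration question by construction.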