Every platform team eventually faces the same sequence of networking decisions. First: which CNI. Then: do we need a service mesh. Then: how do we observe traffic without drowning in data. Each decision constrains the next. Get the CNI wrong and you are migrating under pressure later. Add a service mesh too early and you are debugging two control planes instead of one.
This post walks through these decisions as we approach them on client engagements — with specifics, tradeoffs, and opinions. We have preferences. We will be honest about where those preferences have limits.
The CNI decision
The Container Network Interface plugin is the foundation. It handles pod-to-pod communication, network policy enforcement, IP address management, and — depending on your choice — much more. In 2026, the realistic options for production Kubernetes are Cilium and Calico. Flannel is fine for dev clusters. For production, the conversation is between these two.
Cilium
Cilium is eBPF-native. Every feature — networking, policy enforcement, observability, load balancing — runs through eBPF programs in the Linux kernel. No iptables chains. No kube-proxy. Traffic decisions happen at the kernel level before packets reach userspace.
What this means in practice:
- Throughput: ~9.2 Gbps in service routing scenarios, roughly 8% higher than Calico's ~8.5 Gbps in BGP mode.
- Latency: ~0.20 ms baseline. At scale — over 1,000 services — the gap widens because eBPF uses hash-table lookups with O(1) complexity. Iptables-based approaches degrade linearly with rule count.
- Observability: Hubble gives you L3/L4/L7 flow visibility without deploying a separate stack. You can see which pod is talking to which service, at what rate, with what latency — from the kernel, not from a sidecar.
- Network policy: Kubernetes NetworkPolicy plus Cilium's own CiliumNetworkPolicy, which adds L7 filtering (HTTP method, path, headers), DNS-aware policies, and identity-based rules that survive pod restarts and IP changes.
- kube-proxy replacement: Cilium can replace kube-proxy entirely, handling service load balancing in eBPF. One fewer component to run and debug.
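To make the L7 policy point concrete, here is a minimal sketch of a CiliumNetworkPolicy — all names, labels, ports, and paths are illustrative, not from any real deployment. It allows only GET requests on `/api/v1/` paths, and only from pods labeled `app=frontend`:

```yaml
# Hypothetical policy: restrict a backend to GET requests on /api/v1/
# paths, from frontend pods only. Labels and ports are illustrative.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l7-allowlist
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/.*"
```

Because the rule matches on identity labels rather than IPs, it keeps working across pod restarts and rescheduling.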
Where Cilium is the clear choice:
- Greenfield deployments on modern Linux kernels (5.10+)
- Performance-sensitive workloads (streaming, real-time APIs, high-throughput data pipelines)
- Teams that want CNI + observability + network policy in a single component
- Platforms where you plan to evaluate service mesh capabilities later (Cilium can grow into this)
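If you go this route, the kube-proxy replacement and Hubble are both Helm switches. A sketch of the relevant chart values, assuming a recent Cilium chart (option names vary by version — verify against your release; the API server host is a placeholder):

```yaml
# Hypothetical values.yaml fragment for the Cilium Helm chart.
# Verify option names against the chart version you deploy.
kubeProxyReplacement: true
k8sServiceHost: <api-server-host>   # required once kube-proxy is gone
k8sServicePort: 6443
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
```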
Calico
Calico takes a multi-dataplane approach. It supports iptables, eBPF, and Windows. It integrates natively with BGP for environments that need to peer with physical network infrastructure. It has been in production longer than Cilium and has a larger installed base.
What this means in practice:
- Throughput: ~8.5 Gbps in BGP mode. With eBPF dataplane enabled, performance approaches Cilium's numbers — but eBPF mode is a configuration option, not the default architecture.
- BGP integration: Native BGP peering with physical routers. If your network team manages a layer-3 fabric and requires BGP route advertisements from your Kubernetes nodes, Calico handles this without additional tooling.
- Network policy: Full Kubernetes NetworkPolicy support plus Calico's own GlobalNetworkPolicy for cluster-wide rules. Mature, well-understood, widely documented.
- Windows support: If you run Windows nodes alongside Linux, Calico is the established choice.
- Ecosystem: Larger community, more StackOverflow answers, more battle-tested edge cases.
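As a hedged sketch of what GlobalNetworkPolicy looks like (the rule shown is illustrative, not a recommended baseline): a cluster-wide egress policy that still permits DNS, so a default-deny posture does not break name resolution:

```yaml
# Hypothetical cluster-wide egress policy with an explicit DNS exception.
# GlobalNetworkPolicy applies across all namespaces; values illustrative.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny-with-dns
spec:
  order: 1000
  selector: all()
  types:
    - Egress
  egress:
    # Allow DNS to kube-dns; everything else selected by this policy
    # falls through to deny.
    - action: Allow
      protocol: UDP
      destination:
        selector: k8s-app == "kube-dns"
        ports: [53]
```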
Where Calico is the better choice:
- Brownfield environments with existing BGP infrastructure
- Mixed Linux/Windows clusters
- Teams that value operational simplicity and established debugging workflows over raw performance
- Enterprises where the network team requires BGP integration with physical infrastructure
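For the BGP case, peering is declared as a Calico resource. A sketch — the peer IP, ASN, and rack label below are placeholders your network team would supply:

```yaml
# Hypothetical BGP peering with a top-of-rack router.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack1-tor
spec:
  peerIP: 192.0.2.1            # placeholder (TEST-NET); use the router's IP
  asNumber: 64512              # private ASN, illustrative
  nodeSelector: rack == 'rack1'
```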
Our opinion
We default to Cilium for new platform builds. The eBPF-native architecture is cleaner, the observability (Hubble) eliminates the need for a separate network monitoring layer, and performance at scale is measurably better. The kube-proxy replacement simplifies the cluster.
But we do not migrate clients off Calico when it is working. If your Calico deployment is stable, your team knows how to debug it, and you do not have scale or observability requirements that Calico cannot meet — leave it. The migration cost (rolling restart of every node, draining workloads, testing rollback procedures) is real. Migrate during a planned platform refresh, not as a standalone project.
Do you need a service mesh?
The honest answer for most teams in 2026: probably not yet.
A service mesh adds mutual TLS between services, L7 traffic management (retries, timeouts, circuit breaking), and detailed per-request observability. These are real capabilities. The question is whether you need them today, and whether the operational cost is worth it.
When you do not need a service mesh
- You have fewer than 20 services. You can manage retries and timeouts in application code or with a lightweight library. The operational overhead of a mesh is not justified.
- You do not have a regulatory requirement for mTLS between all services. Cilium provides pod-to-pod encryption (WireGuard-based) and identity-based network policy without a mesh.
- Your observability needs are met by Cilium + Hubble + Grafana. L3/L4 flow data plus application-level metrics from your Grafana stack covers most debugging scenarios.
- Your team is already stretched. A service mesh is another control plane to understand, debug, and upgrade. If your platform team is a bottleneck (and if you are reading this blog, there is a chance it is), adding complexity is the wrong move.
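The pod-to-pod encryption mentioned above is a configuration switch, not a separate system to operate. A sketch of the Cilium Helm values, assuming recent chart option names:

```yaml
# Hypothetical Cilium Helm values enabling WireGuard-based transparent
# encryption between nodes. Verify option names against your chart version.
encryption:
  enabled: true
  type: wireguard
```

This encrypts node-to-node traffic without certificates to rotate or proxies to debug — often enough when the requirement is "traffic in transit must be encrypted" rather than "every service pair must mutually authenticate."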
When you do need a service mesh
- Regulatory or compliance requirements mandate mTLS for all east-west traffic — not just between zones, but between every service pair. Some financial services and healthcare platforms require this.
- You are running 50+ services with complex inter-service dependencies and need per-request observability (distributed tracing at the infrastructure level, not application level), fine-grained traffic shifting (canary deployments at 1% traffic), or request-level rate limiting.
- You need to enforce L7 policies across services you do not own — for example, controlling which HTTP methods a third-party service can call on your internal APIs.
If you do need one: the 2026 options
Cilium Service Mesh (sidecarless)
If you are already running Cilium as your CNI, this is the natural path. Cilium handles mTLS via SPIFFE identity, L7 policy, and observability through Hubble — without sidecar proxies. L4 traffic stays in eBPF. L7 traffic routes through a shared per-node Envoy proxy (DaemonSet), not a per-pod sidecar.
Advantages: No sidecar overhead. Single control plane (Cilium already manages your network). Lower resource consumption.
Tradeoffs: L7 scalability limits under extreme load (the per-node Envoy becomes a shared resource). Cilium's mutual authentication uses eventual consistency for policy sync — in environments where security policy must be instantaneous and auditable, this is a consideration. Debugging eBPF programs is harder than debugging sidecar logs.
Istio Ambient Mode
Istio reached production readiness for ambient mode in 2025. Ambient removes per-pod sidecars in favour of a node-level ztunnel proxy for L4 (mTLS, basic traffic management) and optional waypoint proxies for L7 (full Envoy capabilities, per-service).
Advantages: Mature L7 capabilities. Synchronous policy enforcement (stronger auditability). Large ecosystem and community. Can run on top of Cilium as the CNI — they are not mutually exclusive.
Tradeoffs: Two control planes (Istio + CNI). Higher operational complexity. Even in ambient mode, resource overhead is higher than Cilium's native mesh. Your team needs to understand both Cilium and Istio.
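If you do adopt ambient, the install surface is small. A sketch of an IstioOperator manifest selecting the ambient profile (for use with `istioctl install -f`); treat the API version as an assumption and check it against your Istio release:

```yaml
# Hypothetical manifest selecting Istio's ambient profile, which deploys
# ztunnel for L4 and leaves waypoint proxies as an opt-in per service.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ambient-install
spec:
  profile: ambient
```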
Our opinion
For the platforms we build, we start with Cilium and do not add a service mesh until there is a clear, specific requirement that Cilium alone cannot satisfy. Most platforms never reach that point. The ones that do typically need Istio's L7 capabilities for regulatory compliance — and in those cases, we run Istio ambient on top of Cilium.
The worst outcome is deploying a service mesh because it seems like best practice and then spending months debugging proxy injection, sidecar resource limits, and control plane version mismatches. If you do not have a specific problem a mesh solves, wait.
Observability: what to instrument and what to skip
Networking observability is where teams either get actionable data or drown in metrics nobody looks at.
The useful layer: flow data
Cilium + Hubble gives you network flow data out of the box: source pod, destination pod/service, protocol, port, response code (for L7), latency, bytes transferred. This answers the questions you actually ask during incidents:
- Which service is calling this endpoint?
- What is the request rate and error rate?
- Is latency between service A and service B normal?
- Are there unexpected connections (a pod talking to an external IP it should not)?
Pipe this into Grafana via the Hubble Prometheus exporter. Build four dashboards:
- Service map — who talks to whom, at what rate
- Error rates — per-service, per-endpoint HTTP 5xx and connection failures
- Latency — P50/P95/P99 per service pair
- Policy denials — network policy drops, DNS failures
These four cover 90% of networking debugging. Build them first. Resist the temptation to instrument everything.
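The metric series those dashboards need are enabled in the Cilium Helm values. A sketch, assuming a recent chart and the Prometheus Operator for scraping (adjust the metric list and label context to taste):

```yaml
# Hypothetical Hubble metrics configuration exporting the flow series
# the four dashboards above consume. Verify names against your chart.
hubble:
  enabled: true
  metrics:
    enabled:
      - dns
      - drop               # feeds the policy-denials dashboard
      - tcp
      - flow
      - port-distribution
      - httpV2:labelsContext=source_namespace,destination_namespace
    serviceMonitor:
      enabled: true        # assumes Prometheus Operator CRDs are present
```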
The trap: too much granularity too early
Per-request distributed tracing at the network level (not application level) is expensive — in storage, in processing, and in cognitive load. Most teams that deploy it at the infrastructure level discover they rarely use it. Application-level tracing (OpenTelemetry) gives you the same insight with more context and less overhead.
Our rule: instrument flows at the infrastructure level (Hubble), instrument requests at the application level (OpenTelemetry). If you find a gap between the two, add targeted L7 visibility in Cilium for specific services — do not turn it on globally.
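One way to add that targeted L7 visibility (an assumption about your setup, and not the only mechanism) is an allow-all HTTP rule on just the service in question: an empty `http` rule matches every request but forces that service's traffic through Cilium's proxy, so Hubble records request-level flows for it alone:

```yaml
# Hypothetical policy: the empty http rule matches all requests but routes
# this one service's traffic through the proxy, surfacing L7 flows in
# Hubble for app=checkout only. Labels and port are illustrative.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: checkout-l7-visibility
spec:
  endpointSelector:
    matchLabels:
      app: checkout
  ingress:
    - fromEndpoints:
        - {}               # any in-cluster endpoint
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - {}         # match all HTTP requests; enables L7 parsing
```

Remove the policy when the debugging session ends and the service drops back to pure eBPF forwarding.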
Decision framework
If you are making these decisions now, here is the short version:
| Decision | Default choice | Switch when |
|----------|----------------|-------------|
| CNI | Cilium | Calico if you need BGP peering with physical infrastructure or Windows nodes |
| Service mesh | None (Cilium handles mTLS + L4 policy) | Cilium service mesh when you need L7 policy; Istio ambient when you need auditable L7 compliance |
| Observability | Hubble flows → Grafana | Add L7 per-service visibility only when flow data is insufficient for a specific debugging scenario |
| kube-proxy | Cilium replacement | Keep kube-proxy only if your platform has a hard dependency on iptables-based tooling |
Every row has a default and a condition for changing it. Start with the defaults. Move when you have evidence, not assumptions.
What we got wrong
In the interest of intellectual honesty: we have made mistakes on networking decisions.
We deployed Istio too early on a client platform (2024). The team had 12 services. The mesh added complexity without solving a real problem. We spent more time debugging sidecar injection than the mesh saved in operational benefits. We removed it six months later and replaced it with Cilium network policies. Lesson: a service mesh is an answer to a specific question. If you cannot articulate the question, you do not need the mesh.
We underestimated Calico's eBPF mode (2025). We assumed Calico's eBPF support was secondary to its iptables dataplane. On a client engagement where Calico was already deployed, we tested eBPF mode and found performance within 5% of Cilium. The migration to Cilium was not worth the risk. Lesson: test your assumptions against your actual workload, not benchmarks from the internet.
These are the kinds of decisions that look obvious in retrospect and ambiguous in the moment. The framework above reflects what we have learned — including from getting it wrong.
Building a platform and need to make these decisions for your stack? We have opinions and we are happy to debate them. Check our open source work or start a conversation.