Software-Defined Networking (SDN) is an approach to networking that uses software-based controllers or application programming interfaces (APIs) to communicate with underlying hardware infrastructure and direct traffic on a network.

This model differs from that of traditional networks, which use dedicated hardware devices (i.e., routers and switches) to control network traffic. SDN can create and control a virtual network – or control a traditional hardware network – via software.
– VMware

Preamble

This post is part of a series of articles on project Superphénix (SPX) where we present how we rebuilt the infrastructure of an existing cloud provider from scratch, using cloud native technology and Kubernetes.

If you want to know more about specific topics, check out the other articles and the introduction!

Why Superphénix Needs an SDN

When we started designing the Superphénix infrastructure, one of our core goals was to abstract away physical hardware.

We didn’t want to care which vendor we were using, what interfaces they provided, or how they wired things up. Why? Because we believe in automated, software-defined everything, and networking is no exception.

Up to that point, this wasn’t the philosophy at Agora Calycé, who were in the middle of a major overhaul of their CSP. Although they used VMware, they didn’t use NSX, so as not to get entangled in the proprietary ecosystem.

To make up for that, the entire network was wired into the virtualization layer through port groups, and VLANs were defined to segment the VMs into their own L2 domains. OPNsense firewalls connected those private networks to the Internet, handled firewalling, and took care of NAT and port forwarding.

The results of such an architecture are the following:

  • Changes are slow and difficult to trace
  • Configurations are prone to errors
  • End users can’t order network resources by themselves and must rely on network administrators

That architecture was the opposite of what we’re building. In this project, everything must be automated, from compute to network. Self-service is paramount, and an SDN was going to help us get there.

What We Tried (and failed)

Agora Calycé relies on a typical spine and leaf architecture, with BGP used to carry routes throughout the network.

Our first idea was to use EVPN/VXLAN and BGP to plug the network of VMs provisioned by KubeVirt into the core network. This is a proven approach, with which SKALA SYSTEMS has experience.

But there were numerous problems:

  • The mix of network gear (Cumulus/Juniper/Cisco) didn’t allow for a seamless automated setup; some custom code would have to be written
  • We needed to automate the bridging of VMs
  • We needed to enforce policies to prevent MAC/ARP/NDP spoofing
  • We had to auto-detect VMs to announce their routes to the leaves over BGP

And more challenges like that were piling up. It could have been an interesting project to develop our own code to make this work, but we started to realize that it didn’t align with our philosophy at all.

Indeed, this entire setup would depend heavily on how the upper layers of the network are architected, without any abstraction. Kubernetes would also not be the network orchestrator, which means we couldn’t use GitOps to create the L2/L3 networks of our customers, define their firewalling rules and more. It would have been a big disappointment.

We tried some Kubernetes network plugins (note: we didn’t say CNIs) to automate the creation of bridges, interfaces and more.

But it never felt right. We even made our own fabric with GoBGP + FRRouting and EVPN/VXLAN as a proof of concept. While our SDN could have been built this way, it didn’t feel as clean as we wanted, and Kubernetes wasn’t the orchestrator we hoped it would be.

What we needed was a CNI truly capable of building an SDN on top of Kubernetes.

The Hunt For The Perfect CNI

A CNI in Kubernetes is an operator that sets up the network of individual pods so that they can communicate with one another. Sometimes, it also handles services, and it usually takes care of external connectivity (to the Internet, through NAT, BGP or something else).

Because it is fundamentally an operator, the process of creating network resources can be orchestrated from the Kubernetes API and with GitOps.

The problem is that most CNIs follow a very simple network model, dictated by the Kubernetes documentation. There’s usually one IP range for pods, one for services, and if you’re lucky they come in both IPv4 and IPv6. Every pod is assigned one address of each protocol, guaranteed to be collision-free. Pods have no concept of L2 networks, only a default gateway, usually the node on which they run. To make matters worse, the Pod CIDR is split into equal chunks, one allocated to each node.
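
To make that last point concrete, here is a minimal sketch (Go with client-go, assuming in-cluster credentials) that prints the per-node slice of the Pod CIDR handed out by the default model. It is this rigid carving-up, with no tenant-defined subnets, that a CSP network cannot live with.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		// Each node gets a fixed slice of the cluster Pod CIDR; pods scheduled
		// on that node can only draw their addresses from this slice.
		fmt.Printf("%s -> %v\n", node.Name, node.Spec.PodCIDRs)
	}
}
```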

For a CSP, this is just not the type of network you want. We need custom subnets that can overlap because two different customers may want to place their VMs in a 192.168.0.0/24 without minding if someone else is already using it. We need VPCs to route between L2 domains and to control how egress to the Internet is done. We need VMs to keep their IPs from one node to another.

And there aren’t many CNIs that allow for that. Cilium, for example, is a really good CNI for a standard Kubernetes cluster, but it doesn’t fulfill any of our requirements to be the SDN of Superphénix.

Luckily, there’s one that promises exactly what we need: a CNCF-hosted project called Kube-OVN. It’s based on OpenvSwitch and OVN, bringing the features of that ecosystem into Kubernetes. OpenvSwitch is used by OpenStack, making it a battle-tested SDN for cloud providers.

Kube-OVN offers everything we need: VPCs, Subnets (L2 networks), BGP, NAT gateways, egress gateways, firewalling and more. It advertises itself as a traditional SDN, but applied to Kubernetes. We tested it for a month before settling entirely on it. It was missing a couple of features we needed, so we implemented them and upstreamed them to Kube-OVN.

The main drawback, and the reason it took us so long to decide whether to use it, was the documentation. It was sometimes hard to truly grasp how Kube-OVN was meant to be used and how to set up some of its features. And that’s why we are also contributing to the documentation!

Multus

Now that we’ve got a CNI capable of doing what we want, we need to plug it into our KubeVirt VMs. The easiest way to do that is to use Multus as a meta CNI.

A meta CNI is called first when the CRI of the cluster requests a network for the pod. It doesn’t know how to do any networking by itself, but it knows what other CNIs are installed alongside it. It routes the request to the CNI defined by the user at the creation of the pod. This is done through an annotation.

Because KubeVirt represents VMs as pods, the behaviour is the same for VMs.
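
As an illustration, here is a hedged sketch of that annotation, built with the client-go types; the NetworkAttachmentDefinition name "customer-subnet" and the image are placeholders of ours.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// A pod (or the launcher pod backing a VM) that asks Multus for an
	// interface wired by a NetworkAttachmentDefinition named
	// "customer-subnet" (a placeholder name).
	pod := &corev1.Pod{
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "demo",
			Namespace: "default",
			Annotations: map[string]string{
				// Format: "<namespace>/<NetworkAttachmentDefinition>"; Multus
				// routes the CRI's network request to the CNI referenced there.
				"k8s.cni.cncf.io/networks": "default/customer-subnet",
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "app", Image: "registry.example.com/app:latest"}},
		},
	}

	out, err := yaml.Marshal(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // the manifest as it would be applied to the cluster
}
```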

Multus brings multiple features to the table:

  • We can have multiple CNIs on the cluster and choose which one to use for each VM (we only use Kube-OVN, but it can be useful for migrations)
  • We can plug multiple interfaces in the same VM
  • We can have multiple CNIs plugged into the same VM on different interfaces

With Multus, we have the ability to hotplug new interfaces whenever we want inside our VMs, and we can easily migrate from one CNI to another if the need arises.

The Big Picture: Our Network Architecture

Now, you know that at the core of our SDN stack, we use Kube-OVN, OpenvSwitch, and Multus to create a flexible, high-performance network fabric, tightly integrated with Kubernetes.

Let’s look at the different features we provide to our customers through this SDN.

VPCs: Logical Isolation at Scale

Every tenant gets one or more Virtual Private Clouds (VPCs). A VPC is a logically isolated network that spans multiple compute nodes. It encapsulates “subnets”, which are L2 networks, and it governs routing policies, isolation rules, and segmentation, providing logical separation of resources while enabling routing control and customization.

Key Features:

  • Full layer-3 isolation between VPCs
  • Custom routing tables per VPC (customers can inject routes if they want to)
  • Support for overlapping CIDRs across VPCs
  • VPC peering supported for interconnection

Each VPC consists of subnets, and routing between subnets is handled via distributed routers deployed on each node. These routers are managed by Kube-OVN and built using the OpenvSwitch dataplane.
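
Because VPCs are plain Kubernetes custom resources, creating one is an ordinary API call, which is exactly what makes them GitOps-friendly. Below is a minimal sketch using the dynamic client and our reading of the kubeovn.io/v1 Vpc schema; the VPC name, the static route and its field names are illustrative and should be checked against the Kube-OVN version you deploy.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Kube-OVN exposes VPCs as a cluster-scoped CRD (kubeovn.io/v1, resource "vpcs").
	vpcGVR := schema.GroupVersionResource{Group: "kubeovn.io", Version: "v1", Resource: "vpcs"}

	// Hypothetical tenant VPC with one customer-injected default route.
	vpc := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "kubeovn.io/v1",
		"kind":       "Vpc",
		"metadata":   map[string]interface{}{"name": "tenant-a-vpc"},
		"spec": map[string]interface{}{
			"staticRoutes": []interface{}{
				map[string]interface{}{
					"policy":    "policyDst",
					"cidr":      "0.0.0.0/0",
					"nextHopIP": "10.100.0.1",
				},
			},
		},
	}}

	if _, err := client.Resource(vpcGVR).Create(context.Background(), vpc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```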

Subnets: Self-Service and Smart Allocation

Subnets in SPX are dynamically allocated by the SDN and span across nodes in an AZ. Users and admins can request subnets directly, and we handle the rest. We use Kube-OVN’s internal IPAM for IP address management.

Features:

  • Centralized view of IPs per VM, pod, or service
  • Dual stack support
  • Persistent IPs across VM restarts
  • Multi-subnet attachment: A VM can be part of several subnets
  • Manual allocation available (for statically addressed VMs)
  • MAC address management

Subnet objects are defined as CRDs, and allocation is managed per-AZ. While the subnets are scoped to the AZ level, we maintain global uniqueness across regions by reserving ranges centrally.
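
For the same reason, a subnet is just a manifest in a Git repository. Here is a hedged sketch of what such a manifest could look like, rendered from Go; the field names follow our reading of the kubeovn.io/v1 Subnet CRD and the values are placeholders.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical tenant subnet attached to the VPC from the previous sketch.
	// Field names should be checked against the Kube-OVN version in use.
	subnet := map[string]interface{}{
		"apiVersion": "kubeovn.io/v1",
		"kind":       "Subnet",
		"metadata":   map[string]interface{}{"name": "tenant-a-lan"},
		"spec": map[string]interface{}{
			"vpc":        "tenant-a-vpc",
			"protocol":   "IPv4",
			"cidrBlock":  "192.168.0.0/24", // may overlap with other tenants' ranges
			"gateway":    "192.168.0.1",
			"excludeIps": []string{"192.168.0.1"},
		},
	}

	out, err := yaml.Marshal(subnet)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // the manifest as it would land in a GitOps repository
}
```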

NAT Gateway: Efficient Internet Access

Not every VM needs a public IP. That’s where our NAT Gateway comes into play: a containerized component that connects private subnets to the Internet securely.

It follows the CNF (cloud-native network function) architecture, packaging network functions as containers. Instead of provisioning an entire firewall such as OPNsense to handle the NATing (and consuming 3 GB of RAM), we can simply start small containers with a footprint of a few megabytes.

Here are a few features of our NAT gateways:

  • Runs as a set of HA containers across nodes (we’re working on upstreaming this to Kube-OVN)
  • Each gateway is backed by OpenvSwitch rules and iptables SNAT/DNAT
  • Handles Floating IPs (FIPs) for dynamic IP assignment and Elastic IPs to share an IP with many nodes
  • Implements BGP peering via GoBGP for dynamic IP advertisement to our core network

We’re implementing multi-node, highly available NAT gateways on top of Kube-OVN, contributing upstream to enable active-active HA and BGP announcements to our core network. Thanks to ECMP, we get load-balanced, resilient internet access.

We also upstreamed to Kube-OVN the ability to peer NAT gateways with BGP routers. Traffic egressing through a NAT gateway is SNATed to a public IP and routed via BGP to the core network, without introducing single points of failure.
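
For reference, a NAT gateway is itself declared as a Kube-OVN custom resource. The sketch below shows roughly what such an object looks like under our reading of the kubeovn.io/v1 VpcNatGateway CRD; names, addresses and field names are illustrative and should be verified against the deployed version.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical NAT gateway serving the tenant VPC/subnet from the earlier
	// sketches; field names should be checked against the Kube-OVN release in use.
	gw := map[string]interface{}{
		"apiVersion": "kubeovn.io/v1",
		"kind":       "VpcNatGateway",
		"metadata":   map[string]interface{}{"name": "tenant-a-natgw"},
		"spec": map[string]interface{}{
			"vpc":    "tenant-a-vpc",
			"subnet": "tenant-a-lan",
			"lanIp":  "192.168.0.254", // gateway leg inside the tenant subnet
			// Pin the gateway pods to dedicated network nodes (placeholder hostname).
			"selector": []string{"kubernetes.io/hostname: network-node-1"},
		},
	}

	out, err := yaml.Marshal(gw)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```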

Direct Exposure via BGP: When You Really Need It

Some workloads need direct public exposure (e.g., load balancers, DNS servers). In those cases, we can assign public IPs to VMs and advertise them directly using GoBGP routers we’ve embedded in our stack.

We support two BGP modes:

  • Local mode: Public IPs are announced from the node hosting the VM or pod only. This minimizes routing hops but sacrifices availability on failure.
  • Resilient mode: Public IPs are announced from all AZ nodes. Incoming traffic is routed to the correct node internally. This mode offers HA but adds latency.

This gives us the flexibility to choose between performance and redundancy, depending on the use case. We use GoBGP to manage route advertisements to our upstream providers. Each SDN node runs a GoBGP instance and advertises public IPs based on the selected mode. This BGP server is embedded in the NAT gateways or runs on dedicated Kube-OVN agents.
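
To give an idea of what such an embedded speaker looks like, here is a minimal sketch using GoBGP’s Go API; the ASNs, the peer address and the announced /32 are placeholders, and a real deployment would pull them from the gateway’s configuration.

```go
package main

import (
	"context"
	"log"

	api "github.com/osrg/gobgp/v3/api"
	gobgp "github.com/osrg/gobgp/v3/pkg/server"
	"google.golang.org/protobuf/types/known/anypb"
)

func main() {
	ctx := context.Background()

	// Embedded BGP speaker, as it would run inside a NAT gateway or agent pod.
	s := gobgp.NewBgpServer()
	go s.Serve()

	// Local ASN and router ID are placeholders.
	if err := s.StartBgp(ctx, &api.StartBgpRequest{
		Global: &api.Global{Asn: 64512, RouterId: "10.0.0.10", ListenPort: -1},
	}); err != nil {
		log.Fatal(err)
	}

	// Peer with an upstream leaf or route reflector (address/ASN are placeholders).
	if err := s.AddPeer(ctx, &api.AddPeerRequest{
		Peer: &api.Peer{Conf: &api.PeerConf{NeighborAddress: "10.0.0.1", PeerAsn: 64512}},
	}); err != nil {
		log.Fatal(err)
	}

	// Announce a public /32: resilient mode would do this from every node of
	// the AZ, local mode only from the node hosting the workload.
	nlri, _ := anypb.New(&api.IPAddressPrefix{Prefix: "203.0.113.10", PrefixLen: 32})
	origin, _ := anypb.New(&api.OriginAttribute{Origin: 0})
	nextHop, _ := anypb.New(&api.NextHopAttribute{NextHop: "10.0.0.10"})

	if _, err := s.AddPath(ctx, &api.AddPathRequest{
		Path: &api.Path{
			Family: &api.Family{Afi: api.Family_AFI_IP, Safi: api.Family_SAFI_UNICAST},
			Nlri:   nlri,
			Pattrs: []*anypb.Any{origin, nextHop},
		},
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep announcing until the pod is stopped
}
```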

But we also use GoBGP as our BGP speakers and reflectors. More on that in this section.

SDN Firewalling: Cloud-Native Security Rules

We use Kubernetes NetworkPolicies to enforce security rules across pods and VMs. But since we go beyond vanilla K8s, we extended these policies to support hypervisors as well. We use Kube-OVN’s network policies, implemented through OpenvSwitch ACLs.
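
As an example, here is a standard NetworkPolicy created through client-go, which Kube-OVN then translates into OpenvSwitch ACLs; the namespace, labels and port are placeholders.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	tcp := corev1.ProtocolTCP
	port := intstr.FromInt(443)

	// Only workloads labelled role=frontend may reach the app=backend
	// pods/VMs on TCP/443; everything else is dropped by the resulting ACLs.
	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-frontend", Namespace: "tenant-a"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"app": "backend"}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{MatchLabels: map[string]string{"role": "frontend"}},
				}},
				Ports: []networkingv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
			}},
		},
	}

	if _, err := client.NetworkingV1().NetworkPolicies("tenant-a").Create(
		context.Background(), policy, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```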

NOTE:
We initially tried chaining Cilium and Kube-OVN to use Cilium’s high-performance eBPF filters. It did work, and provided us with useful observability through Hubble. The major drawback is that Cilium doesn’t support overlapping CIDRs: if two VMs in different VPCs and Subnets ended up on the same hypervisor, Cilium would prevent one of them from starting.

This is because the identities Cilium uses to track pods and enforce policies are stored in a map keyed by the pod’s IP. We thought about pushing a fix to the codebase, but it would have been an enormous change for the project, way too much for what it was worth. And considering the time it would have taken to support our use case, we preferred to simply abandon Cilium on hypervisors.

We still use Cilium as a CNI on our storage clusters, as they don’t need to be part of the SDN. They only need to be exposed to the hypervisors through routing. Cilium stays our CNI of choice for standard clusters that do not run VMs.

For MAC/IP spoofing protection, we leverage OpenvSwitch port security features.

Network Interfaces and VM Connectivity

SPX VMs are managed via KubeVirt. Network attachment is handled via Multus, allowing each VM pod to be assigned multiple interfaces, each using a different CNI.

Placement and Limitations

All network resources (VPCs, subnets, gateways) are scoped per availability zone. For now, Kube-OVN only supports cross-cluster networking between two clusters, a technical limitation we’re working around to extend the SDN further.

Under the Hood: VM Networking

When we spin up a VM (via KubeVirt), it gets an eth0 interface managed like a Kubernetes pod. Behind the scenes:

  • A bridge connects the VM interface to the pod interface
  • veth pairs carry traffic from pod to host
  • OpenvSwitch tags and routes traffic based on VPC and Subnet
  • DHCP/DHCPv6/SLAAC configures IPs (we use Kube-OVN’s DHCP, not KubeVirt’s)

From the VM’s point of view, it’s just another network interface. From our point of view, it’s a pod with a full SDN behind it.

Our Cloud-Native SDN Stack

  • CNI orchestrator: Multus
  • Primary CNI: Kube-OVN
  • Secondary CNI for standard clusters: Cilium
  • Distributed switching: OpenvSwitch, managed by Kube-OVN
  • BGP speaker: GoBGP

Contributions to the Community

This stack wouldn’t exist without upstream collaboration. We’ve contributed quite a few PRs, including:

  • BGP support to Kube-OVN’s NAT gateways
  • HA NAT gateways (WIP)
  • Fixes and features for GoBGP
  • Issue tracking, discussions, and upstream PRs on Cilium, OVS, and more
  • A lot of documentation

It’s all open source, after all: we’re not just consuming it, we’re building it with the community.