Cisco Application Centric Infrastructure Calico Design White Paper

Networking Solutions White Paper

Updated: December 13, 2019

Introduction

With the increasing adoption of container technologies and Cisco® Application Centric Infrastructure (Cisco ACI) as a pervasive data-center fabric technology, it is only natural to see customers trying to integrate these two solutions.

At the time of writing, the most popular container orchestrator engine is Kubernetes (often abbreviated K8s). A key design choice that every Kubernetes administrator has to make when deploying a new cluster is selecting a network plugin. The network plugin is responsible for providing network connectivity, IP address management, and security policies to containerized workloads.

A series of network plugins is available, with different transport protocols and/or features being offered. To browse the full list of supported network plugins, follow this link: https://kubernetes.io/docs/concepts/cluster-administration/networking/#how-to-implement-the-kubernetes-networking-model.

Note:      While Cisco offers a CNI (container network interface) plugin directly compatible and integrated with Cisco ACI, this is not covered in this document. In this white paper, we will be discussing the current best practices for integrating Cisco ACI with Project Calico.

Calico

Calico supports two main network modes: direct container routing (no overlay transport protocol) or network overlay using VXLAN or IPinIP (default) encapsulations to exchange traffic between workloads. The direct routing approach means the underlying network is aware of the IP addresses used by workloads. Conversely, the overlay network approach means the underlying physical network is not aware of the workloads’ IP addresses. In that mode, the physical network only needs to provide IP connectivity between K8s nodes while container-to-container communications are handled by the Calico network plugin directly. This, however, comes at the cost of additional performance overhead as well as complexity in interconnecting container-based workloads with external noncontainerized workloads.

When the underlying networking fabric is aware of the workloads’ IP addresses, an overlay is not necessary. The fabric can directly route traffic between workloads inside and outside of the cluster as well as allowing direct access to the services running on the cluster. This is the preferred Calico mode of deployment when running on premises.[1] This guide details the recommended Cisco ACI configuration when deploying Calico in direct-routing mode.

You can read more about Calico at http://docs.projectcalico.org/.

Calico routing architecture

In a Calico network, each compute server acts as a router for all of the endpoints that are hosted on that compute server. We call that function a vRouter. The data path is provided by the Linux kernel, the control plane by a BGP protocol server, and the management plane by Calico’s on-server agent, Felix.

Each endpoint can only communicate through its local vRouter, and the first and last hop in any Calico packet flow is an IP router hop through a vRouter. Each vRouter announces all of the endpoints it is responsible for to all the other vRouters and other routers on the infrastructure fabric using BGP, usually with BGP route reflectors to increase scale.[2]

Calico proposes three BGP design options:

1.     The BGP AS per rack model

2.     The BGP AS per compute server model

3.     The downward default model.

They are all detailed at https://docs.projectcalico.org/v3.9/networking/design/l3-interconnect-fabric.

After taking into consideration the characteristics and capabilities of ACI and Calico, Cisco’s current recommendation is to follow a design approach similar to the AS per compute server model.

“AS per compute server” design – overview

In this design, a dedicated ACI L3Out will be created for the entire Kubernetes cluster. This will remove control-plane and data-plane overhead on the K8s cluster, providing greater performance and enhanced visibility to the workloads.

Every single Kubernetes node will have a dedicated AS number and will peer via eBGP with a pair of ACI Top-of-Rack (ToR) switches configured in a vPC pair. Having a vPC pair of leaf switches provides redundancy within the rack.

Each rack is allocated a range of AS numbers that follows this schema:

   ToR leaf switches configured in a vPC pair are assigned the same AS number.

   Each Kubernetes node is configured with a dedicated AS number and peers with the ACI leaf switches via eBGP.

This eBGP design does not require running any route reflector in the Calico infrastructure, resulting in a more scalable, simpler and easier to maintain architecture.

Once this design is implemented, the following connectivity is expected:

   Pods running on the Kubernetes cluster can be directly accessed from inside ACI or outside via transit routing.

   Pod-to-pod and node-to-node connectivity will happen over the same L3Out and external endpoint group (EPG).

   Exposed services can be directly accessed from inside or outside of ACI.

Figure 1 shows an example of such a design with two racks and a 6-node K8s cluster:

   Each rack has a pair of ToR leaf switches and three Kubernetes nodes.

   Rack 1 uses AS 65101 for its leaf switches (leaf 101 and leaf 102).

   The three Kubernetes nodes in rack 1 (nodes 1, 2, and 3) are allocated three sequential AS IDs:

     65102, 65103, and 65104, all in the 651xx range

   Rack 2 uses AS 65201 for its leaf switches (leaf 103 and leaf 104).

   The three Kubernetes nodes in rack 2 (nodes 4, 5, and 6) are allocated three sequential AS IDs:

     65202, 65203, and 65204, all in the 652xx range

Figure 1. “AS per compute server” design
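The numbering convention in the Figure 1 example can be sketched in a few lines of Python. This is only an illustration of the example above: the function name `allocate_rack_as_numbers` and the `base` offset of 65000 are assumptions made here, not part of any Cisco or Calico tooling, and the range check simply keeps the result inside the 16-bit private AS range.

```python
PRIVATE_AS_MAX = 65534  # last 16-bit private AS number (RFC 6996)


def allocate_rack_as_numbers(rack_id, node_count, base=65000):
    """Return the leaf-pair AS and the per-node AS numbers for one rack.

    Hypothetical helper: rack N uses the 65N01 AS for its vPC leaf pair
    and 65N02, 65N03, ... for its Kubernetes nodes, as in the example.
    """
    if rack_id < 1 or not 0 <= node_count <= 98:
        raise ValueError("rack_id must be >= 1 and node_count between 0 and 98")
    leaf_as = base + rack_id * 100 + 1
    node_as = [leaf_as + offset for offset in range(1, node_count + 1)]
    highest = node_as[-1] if node_as else leaf_as
    if highest > PRIVATE_AS_MAX:
        raise ValueError("allocation leaves the 16-bit private AS range")
    return {"leaf_as": leaf_as, "node_as": node_as}


if __name__ == "__main__":
    for rack in (1, 2):
        print(rack, allocate_rack_as_numbers(rack, node_count=3))
```

Any other non-overlapping scheme works equally well; the only requirements stated in this design are that both leaf switches of a vPC pair share one AS number and that every node has its own.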

Physical connectivity

Physical connectivity is provided by a virtual Port-Channel (vPC) configured on the ACI leaf switches toward the Kubernetes nodes. One L3Out is configured on the ACI fabric to run eBGP with the vRouter in each Kubernetes node via the vPC port-channel.

The vPC design supports both virtualized and bare-metal servers with the following design recommendations:

   A /29 subnet is allocated for every server.

   A dedicated encapsulation (VLAN) is allocated on a per-node basis.

   Each Kubernetes node uses ACI as its default gateway.

The first two recommendations ensure that no node-to-node bridging happens over the L3Out SVI. The L2 domain created by an L3Out with SVIs is not equivalent to a regular bridge domain, and it should be used to connect only “router”-like devices[3] and not as a regular bridge.

If the Kubernetes nodes are virtual machines, follow these additional steps:

   Configure the virtual switch’s port-group load-balancing policy (or its equivalent) to “route based on IP hash.”

   Avoid running more than one Kubernetes node per hypervisor. It is technically possible to run multiple Kubernetes nodes on the same hypervisor, but this is not recommended as a hypervisor failure would result in a double (or more) Kubernetes node failure.

   Ensure that each virtual machine is hard-pinned to a hypervisor and that no live migration of the VMs from one hypervisor to another can take place.

If the Kubernetes nodes are bare-metal, follow these additional steps:

   Configure the physical Network Interface Cards (NICs) in Link Aggregation Control Protocol (LACP) bonding (often called 802.3ad mode).

Kubernetes node egress routing

Because ACI leaf switches are the default gateway for the Kubernetes nodes, nodes only require a default route that uses the ACI L3Out as the next-hop both for node-to-node traffic and for node-to-outside world communications. In order to simplify the route exchange process, this validated design guide configures each node with a default gateway pointing to a secondary IP address on the ACI L3Out that is shared between the ACI leaf switches to which the nodes are connected.

The benefits of using this L3Out secondary IP as default gateway include the following:

   No routes need to be redistributed from ACI into the Kubernetes nodes, reducing control-plane load on the node and improving convergence times.

   Zero impact during leaf reload or interface failures: the secondary IP and its MAC address are shared between the two ACI leaves. In the event of one leaf failure, traffic will seamlessly converge to the remaining leaf.

   No need for a dedicated management interface. The node will be reachable even before eBGP is configured.

Note:      If required, it is possible to add additional interfaces to the Kubernetes nodes to access additional services, such as distributed storage. If this is required, it is still recommended to use ACI as the default gateway and simply add more-specific routes for these additional services over the additional interfaces.

Kubernetes node ingress routing

Each Kubernetes node will be configured to advertise the following subnets to ACI:

   Node subnet

   Its allocated subnet(s) inside the pod supernet (a /26 by default[4])

   Host route (/32) for any pod on the node outside of the pod subnets allocated to the node

   The whole service subnet advertised from each node

   Host route for each exposed service (a /32 route in the service subnet) for each service endpoint on the node with externalTrafficPolicy: Local

ACI BGP protection

In order to protect ACI against potential Kubernetes BGP misconfigurations, the following settings are recommended:

   Set the maximum AS limit to two:

a.   In this eBGP design, the AS path of any route learned from a Kubernetes node should never be longer than two.

   Configure BGP import route control to accept only the expected subnets from the Kubernetes cluster:

a.   Pod subnet

b.   Service subnet
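The intent of these two protections can be modeled in a few lines of Python. This is only an illustrative sketch of the policy logic, not a description of how ACI evaluates route maps internally. The function name `would_accept` is hypothetical, and the pod and service subnets are the example values used later in this paper.

```python
import ipaddress

# Example subnets from this paper; replace with your own pod and service ranges.
ALLOWED_SUBNETS = [
    ipaddress.ip_network("192.169.0.0/16"),  # pod subnet
    ipaddress.ip_network("10.96.0.0/12"),    # cluster service subnet
]
MAX_AS_PATH_LENGTH = 2


def would_accept(prefix, as_path):
    """Return True if a route passes both protections described above.

    prefix  -- advertised prefix as a string, for example "192.169.1.0/26"
    as_path -- list of AS numbers carried by the route
    """
    if len(as_path) > MAX_AS_PATH_LENGTH:
        return False
    network = ipaddress.ip_network(prefix)
    # Accept the allowed subnet itself or any more-specific prefix inside it,
    # which mirrors a prefix match with the "Aggregate" flag set.
    return any(network.subnet_of(allowed) for allowed in ALLOWED_SUBNETS)


if __name__ == "__main__":
    print(would_accept("192.169.1.0/26", [64501]))     # pod block
    print(would_accept("172.16.0.0/16", [64501]))      # unexpected prefix
```

A prefix outside both ranges, or one that arrives with a longer AS path than the design allows, is rejected in this model.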

Kubernetes Node maintenance and failure

Before performing any maintenance or reloading a node, you should follow the standard Kubernetes best-practice of draining the node. Draining a node ensures that all the pods present on that node are terminated first, then restarted on other nodes. For more info see: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/.

While draining a node, the BGP process running on that node stops advertising the service addresses toward ACI, ensuring there is no impact on the traffic. Once the node has been drained, it is possible to perform maintenance on the node with no impact on traffic flows.

BGP Graceful Restart

Both ACI and Calico are configured by default to use BGP Graceful Restart. When a BGP speaker restarts its BGP process or when the BGP process crashes, neighbors will not discard the paths received from the speaker, ensuring that connectivity is not affected as long as the data plane is still correctly programmed.

This feature ensures that if the BGP process on the Kubernetes node restarts (CNI Calico BGP process upgrade/crash) no traffic is affected. In the event of an ACI switch reload, the feature is not going to provide any benefit because Kubernetes nodes are not receiving any routes from ACI.

Scalability

As of Cisco ACI Release 4.1(2), the scalability of this design is bounded by the following parameters:

Nodes per cluster:

A single L3Out can be composed of up to 12 border leaves. This will limit the scale to a maximum of six racks per Kubernetes cluster. Considering current rack server densities, this should not represent a significant limit for most deployments. Should a higher rack/server scale be desired, it is possible to spread a single cluster over multiple L3Outs. This requires additional configuration that is not currently covered in this design guide. If you are interested in pursuing such a large-scale design, please reach out to Cisco Customer Experience (CX) for further network design assistance.

Nodes and clusters per rack:

   A border leaf can be configured with up to 400 BGP peers.

   A border leaf can be configured with up to 400 L3Outs.

These two parameters are intertwined. Here are a few possible scenarios:

   20 Kubernetes clusters, each composed of 20 nodes per rack

   100 Kubernetes clusters, each composed of 4 nodes per rack

   400 Kubernetes clusters, each composed of 1 node per rack

All of these options will result in having 400 BGP peers per rack.
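The arithmetic behind these scenarios can be checked with a short Python sketch. The limits are the ones quoted above for Cisco ACI Release 4.1(2); the function names are hypothetical, and the model assumes one L3Out per cluster and one BGP peer per Kubernetes node on each border leaf, as described in this design.

```python
MAX_BORDER_LEAVES_PER_L3OUT = 12
LEAVES_PER_RACK = 2            # one vPC pair per rack
MAX_BGP_PEERS_PER_LEAF = 400
MAX_L3OUTS_PER_LEAF = 400


def max_racks_per_cluster():
    """Racks a single-L3Out cluster can span."""
    return MAX_BORDER_LEAVES_PER_L3OUT // LEAVES_PER_RACK


def fits_on_border_leaf(clusters, nodes_per_rack_per_cluster):
    """Check both per-leaf limits for a rack hosting several clusters."""
    bgp_peers = clusters * nodes_per_rack_per_cluster
    return clusters <= MAX_L3OUTS_PER_LEAF and bgp_peers <= MAX_BGP_PEERS_PER_LEAF


if __name__ == "__main__":
    print("racks per cluster:", max_racks_per_cluster())
    for clusters, nodes in [(20, 20), (100, 4), (400, 1), (21, 20)]:
        print(clusters, "clusters x", nodes, "nodes per rack ->",
              fits_on_border_leaf(clusters, nodes))
```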

Longest prefix match scale

The routes that are learned by the border leaves through peering with external routers are sent to the spine switches. The spine switches act as route reflectors and distribute the external routes to all of the leaf switches that have interfaces that belong to the same tenant. These routes are called Longest Prefix Match (LPM) and are placed in the leaf switch's forwarding table with the VTEP IP address of the remote leaf switch where the external router is connected.

In this design, every Kubernetes node advertises to the local border leaf its pod-host routes (aggregated to /26 blocks when possible) and service subnets, plus host routes for each service with externalTrafficPolicy: Local. Currently, on ACI ToR switches of the -EX, -FX, and -FX2 hardware families it is possible to change the amount of supported LPM entries using “scale profiles” as described in the documentation at the following link: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/b_Cisco_APIC_Forwarding_Scale_Profile_Policy.html.

In summary, depending on the selected profile, the design can support from 20,000 to 128,000 LPMs.
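As a rough planning aid, the number of prefixes a cluster contributes can be estimated from the route types listed above. The sketch below is a simplified model under stated assumptions, not a Cisco sizing tool: the function names and parameters are hypothetical, each pod block and each exposed service is counted as one prefix, and the pod pool and block size are the example values used in this paper.

```python
import ipaddress


def blocks_in_pool(pool_cidr="192.169.0.0/16", block_size=26):
    """Number of per-node blocks that fit in the pod supernet."""
    pool = ipaddress.ip_network(pool_cidr)
    return 2 ** (block_size - pool.prefixlen)


def estimate_lpm_entries(nodes, blocks_per_node=1, borrowed_host_routes=0,
                         local_services=0, service_subnets=1):
    """Estimate the distinct prefixes advertised by one cluster.

    nodes * blocks_per_node  -- pod blocks (a /26 each by default)
    borrowed_host_routes     -- /32 routes for pods outside a node's blocks
    local_services           -- /32 routes for externalTrafficPolicy: Local
    service_subnets          -- the service subnet itself (same from all nodes)
    """
    return (nodes * blocks_per_node + borrowed_host_routes
            + local_services + service_subnets)


if __name__ == "__main__":
    print("blocks available:", blocks_in_pool())
    print("estimated prefixes:", estimate_lpm_entries(nodes=6, local_services=1))
```

Comparing the estimate with the LPM capacity of the selected scale profile gives a first indication of the available headroom.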

Detailed configurations – an example

Below are the steps required to configure the following architecture:

Figure 2. Example topology

Note:      It is assumed that the reader is familiar with essential Cisco ACI concepts and the basic fabric configuration (including interfaces, VLAN pools, external routed domains, and Attachable Access Entity Profiles [AAEPs]) and that a tenant with the required VRF already exists.

A possible IP allocation schema is shown below:

1.     Allocate a supernet big enough to contain as many /29 subnets as nodes. For example, a 32-node cluster could use a /24 subnet, a 64-node cluster a /23, and so on.

2.     For every /29 subnet:

a.   Allocate the first usable IP to the node.

b.   Allocate the last three usable IPs for the ACI border leaves:

i.    Last IP-2 for the first leaf

ii.    Last IP-1 for the second leaf

iii.   Last IP for the secondary IP
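The allocation schema above can be expressed with Python's standard `ipaddress` module. This is a minimal sketch that follows the schema text literally; the function name `plan_node_subnets` and the dictionary keys are hypothetical, and the supernet and node names are only example inputs.

```python
import ipaddress


def plan_node_subnets(supernet, node_names):
    """Carve one /29 per node and assign addresses as per the schema.

    First usable IP      -> Kubernetes node
    Last usable IP - 2   -> first leaf (vPC side A)
    Last usable IP - 1   -> second leaf (vPC side B)
    Last usable IP       -> shared secondary IP (node default gateway)
    """
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=29)
    plan = {}
    for name, subnet in zip(node_names, subnets):
        hosts = list(subnet.hosts())  # the six usable addresses of a /29
        plan[name] = {
            "subnet": str(subnet),
            "node": str(hosts[0]),
            "leaf_a": str(hosts[-3]),
            "leaf_b": str(hosts[-2]),
            "secondary": str(hosts[-1]),
        }
    if len(plan) < len(node_names):
        raise ValueError("supernet is too small for the requested nodes")
    return plan


if __name__ == "__main__":
    names = ["calico-master-1", "calico-node-1", "calico-node-2"]
    for name, addresses in plan_node_subnets("192.168.1.0/24", names).items():
        print(name, addresses)
```

A /24 supernet yields 32 such /29 subnets, which matches the 32-node sizing example in step 1.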

ACI BGP configuration

Note that, because each K8s node is essentially a BGP router, the nodes attach to the ACI fabric through an external routed domain instead of a regular physical domain. When configuring the interface policy groups and Attachable Access Entity Profiles (AAEP) for your nodes, bind them to that single external routed domain. You will attach that routed domain to your L3Out through the next steps:

1.     Create a new L3Out.

   Go to Tenant <Name> > Networking, right click on “External Routed Network,” and create a new L3Out.

2.     Configure the basic L3Out parameters.

   Name: <Your L3Out Name>

   VRF: <Your VRF Name>

   External Routed Domain: <Your External Routed Domain>

   Enable BGP

   Route Control Enforcement: Select Import (Export is selected by default)

3.     Click Next (we will configure the ACI nodes later).

4.     Create a new External EPG.

   Name: “default”

   Subnets:[5]

   Add a 0.0.0.0/1 subnet with a scope of:

    External Subnets for the External EPG

   Add a 128.0.0.0/1 subnet with a scope of:

    External Subnets for the External EPG

   (Optional) Add the Pod and Service Subnets to it with a scope of:

    Export Route Control Subnet

This is required if these subnets are to be advertised outside of ACI.

5.     Click Finish.

6.     Expand the newly created L3Out, right click on “Logical Node Profile,” and select “Create Node Profile.”

    Name: You could pick the K8s rack ID here, for example

    Select “+” to add a node.

     Node ID: Select the ACI node you want to add (for example, Leaf 203, as shown in Figure 2).

     Router ID: Insert a valid router ID for your network.

    Select “+” to add a node.

     Node ID: Select the node you want to add (for example, Leaf 204, as shown in Figure 2).

     Router ID: Insert a valid router ID for your network.

     Click OK.

7.     Repeat step 6 for all the other ACI switches in the other racks.

8.     Expand the newly created “Logical Node Profile,” right click on “Logical Interface Profile,” and select “Create Interface Profile.”

    Name: Select an appropriate name for your network.

    Disable “Config Protocol Policies.”

    Click Next.

    Select “+” to add a new SVI.

    Path Type: Virtual Port Channel.

    Path: Select the vPC link toward your first Kubernetes node (node 1).

    Encap VLAN: Select a valid VLAN ID (for example, Vlan 310, as shown in Figure 2).

    MTU: Select the desired MTU; 9000 is the recommended value.

    Side A IPv4 Primary/Side B IPv4 Primary: Insert a valid IP (for example, X.X.X.B/29 and X.X.X.C/29, as shown in Figure 2).

    Side A IPv4 Secondary/Side B IPv4 Secondary: Insert a valid IP (for example, X.X.X.D/29, as shown in Figure 2). This IP will be the default gateway of our Kubernetes node and is expected to be the same on both Side A and Side B.

    Click on “+” to create the “BGP Peer Connectivity Profile” for the Kubernetes node connected to this interface.

     Peer Address: Insert the IP address of the node (for example, X.X.X.A/29, as shown in Figure 2).

     BGP Control:

     Select: Next-Hop Self.

     Remote Autonomous System Number: Select the AS number you selected for your node (for example, 64501, as shown in Figure 2).

     Local-AS Number: Select the AS number you have selected for the rack you are configuring (for example, 64500, as shown in Figure 2).

     Press: OK.

    Press: OK.

9.     Repeat Step 8 for all the remaining nodes/vPCs in the same rack as well as for nodes in other racks. Ensure that you select the correct “Rack” Interface Profile.

10.  Set the BGP Timers to a 1s Keepalive Interval and a 3s Hold Interval to align with the default configuration of Calico and provide fast node-failure detection. We are also configuring the maximum AS limit to 2, as discussed previously.

    Expand “Logical Node Profiles,” right click on the Logical Node Profile, and select “Create BGP Protocol Profile.”

    Click on the “BGP Timers” drop-down menu and select “Create BGP Timers Policy.”

     Name: <Name>

     Keepalive Interval: 1s

     Hold Interval: 3s

     Maximum AS Limit: 2

     Press: Submit

   Select the <Name> BGP timer policy.

     Press: Submit.

11.  Repeat Step 10 for all the remaining “Logical Node Profiles.”

   Note: The 1s/3s BGP timer policy already exists; there is no need to re-create it.

12.  In order to protect the ACI fabric from potential BGP prefix misconfigurations on the Kubernetes cluster, import route control has been enabled in step 2. In this step we are going to configure the required route map to allow the Pod Subnet (for example, 192.169.0.0/16) and the Cluster Service Subnet (for example, 10.96.0.0/12) to be accepted by ACI.

   Expand your L3Out. Right click on “Route map for import and export route control” and select “Create Route Map for…”

   Name: From the drop-down menu, select “default-import.”

     Note: Do not create a new one; select the pre-existing one called “default-import.”

   Type: “Match Prefix and Routing Policy.”

   Click on “+” to Create a new Context.

     Order: 0

     Name: <Name>

     Action: Permit

     Click on “+” to Create a new “Associated Matched Rules.”

     From the drop-down menu, select “Create Match Rule For a Route Map.”

     Name: <Name>

     Click “+” on Match Prefix:

    IP: POD Subnet (that is, 192.169.0.0/16)

    Aggregate: True

    Click Update.

     Repeat the above steps for the Cluster Service Subnet (that is, 10.96.0.0/12)

   Click Submit.

   Ensure the selected Match Rule is the one we just created.

   Click OK.

   Click Submit.

Calico routing and BGP configuration

This section assumes that the basic Kubernetes nodes configurations have already been applied:

   The Kubernetes node network configuration is complete (that is, the default route is configured to be the ACI L3Out Secondary IP, and connectivity between the Kubernetes nodes is possible).

   Kubernetes is installed.

   Calico is installed with its default settings as per https://docs.projectcalico.org/v3.8/getting-started/kubernetes/installation/calico#installing-with-the-kubernetes-api-datastore50-nodes-or-less.

Note:      In the next examples, it is assumed that Calico and “calicoctl” have been installed. If you are using K8s as a datastore, you should add the following line to your shell environment configuration:

export DATASTORE_TYPE=kubernetes
export KUBECONFIG=~/.kube/config

1.     Disable IP-in-IP and VXLAN encapsulation mode:

   Export the current configuration with calicoctl.

./calicoctl get ippool -o yaml > ip_pool_NO_nat_NO_ipip.yaml

   Modify the exported configuration and set:

     ipipMode: Never

     vxlanMode: Never

     natOutgoing: False

   Re-apply the configuration with calicoctl.

calico-master-1# cat ip_pool_NO_nat_NO_ipip.yaml
apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
  kind: IPPool
  metadata:
    name: default-ipv4-ippool
  spec:
    blockSize: 26
    cidr: 192.169.0.0/16 #<== Change to your subnet
    # Disable IP-in-IP and VXLAN encapsulation as they are not needed.
    ipipMode: Never
    vxlanMode: Never
    # Disable outgoing NAT, as set in the step above.
    natOutgoing: false
    nodeSelector: all()
kind: IPPoolList

calico-master-1# ./calicoctl apply -f ip_pool_NO_nat_NO_ipip.yaml

2.     Because we are using eBGP, there is no need to form full-mesh adjacencies between the K8s nodes, so we should disable the “node to node mesh” default configuration:

calico-master-1# cat nodeToNodeMeshDisabled.yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false

calico-master-1# ./calicoctl apply -f nodeToNodeMeshDisabled.yaml

3.     Configure eBGP (this assumes that the node interface and ACI are already configured as per the previous section).

    Create new Calico BGP resources for every Kubernetes node with calicoctl. As part of this configuration we are going to:

     Set the node IP address.

     Allocate the node AS number.

     Define the two eBGP neighbors (the ToR ACI leaves).

Below is an example for calico-master-1 as shown in Figure 2:

# cat calico-master-1-bgp.yaml
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  name: calico-master-1
spec:
  bgp:
    ipv4Address: 192.168.1.1/29
    asNumber: 64501
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: leaf203-calico-master-1
spec:
  peerIP: 192.168.1.2
  node: calico-master-1
  asNumber: 64500
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: leaf204-calico-master-1
spec:
  peerIP: 192.168.1.3
  node: calico-master-1
  asNumber: 64500

# ./calicoctl apply -f calico-master-1-bgp.yaml

4.     Repeat step 3 for all the remaining Calico nodes.

Note:        The BGPPeer name (for example, leaf203-calico-master-1) must be unique across all nodes.
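Because the same three resources are repeated for every node, the manifests can be generated instead of being written by hand. The following Python sketch is one possible way to do so. The function `render_node_bgp_manifest` and the templates are hypothetical helpers written for this example, using only the Node and BGPPeer fields described in this step. It also rejects a leaf peer address that does not belong to the node's own subnet, a common copy-and-paste mistake.

```python
import ipaddress

NODE_TEMPLATE = """\
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  name: {name}
spec:
  bgp:
    ipv4Address: {cidr}
    asNumber: {asn}
"""

PEER_TEMPLATE = """\
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: {leaf}-{name}
spec:
  peerIP: {peer_ip}
  node: {name}
  asNumber: {asn}
"""


def render_node_bgp_manifest(name, cidr, node_as, leaf_as, leaf_peers):
    """Render the Node and BGPPeer documents for one Kubernetes node.

    leaf_peers maps a leaf label (for example "leaf203") to the leaf's
    IP address on the SVI of this node.
    """
    network = ipaddress.ip_interface(cidr).network
    docs = [NODE_TEMPLATE.format(name=name, cidr=cidr, asn=node_as)]
    for leaf, peer_ip in leaf_peers.items():
        if ipaddress.ip_address(peer_ip) not in network:
            raise ValueError("%s is not in the node subnet %s" % (peer_ip, network))
        docs.append(PEER_TEMPLATE.format(leaf=leaf, name=name,
                                         peer_ip=peer_ip, asn=leaf_as))
    return "---\n".join(docs)


if __name__ == "__main__":
    print(render_node_bgp_manifest(
        "calico-master-1", "192.168.1.1/29", 64501, 64500,
        {"leaf203": "192.168.1.2", "leaf204": "192.168.1.3"}))
```

Prefixing each BGPPeer name with the leaf label and suffixing it with the node name keeps the names unique across all nodes, as the note above requires.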

5.     Verify that the configuration is applied:

cisco@calico-master-1:~$ ./calicoctl get bgppeer
NAME                      PEERIP         NODE              ASN
leaf203-calico-master-1   192.168.1.2    calico-master-1   64500
leaf203-calico-node-1     192.168.1.10   calico-node-1     64500
leaf203-calico-node-2     192.168.1.18   calico-node-2     64500
leaf204-calico-master-1   192.168.1.3    calico-master-1   64500
leaf204-calico-node-1     192.168.1.11   calico-node-1     64500
leaf204-calico-node-2     192.168.1.19   calico-node-2     64500
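On larger clusters, checking this listing by eye becomes error-prone. The Python sketch below shows one way to confirm that every node has exactly two distinct leaf peers. It assumes the four whitespace-separated columns (NAME, PEERIP, NODE, ASN) of the output format shown in this step, which may differ in other calicoctl versions. The function names and the embedded sample are illustrative only.

```python
from collections import defaultdict

SAMPLE = """
NAME                      PEERIP         NODE              ASN
leaf203-calico-master-1   192.168.1.2    calico-master-1   64500
leaf203-calico-node-1     192.168.1.10   calico-node-1     64500
leaf203-calico-node-2     192.168.1.18   calico-node-2     64500
leaf204-calico-master-1   192.168.1.3    calico-master-1   64500
leaf204-calico-node-1     192.168.1.11   calico-node-1     64500
leaf204-calico-node-2     192.168.1.19   calico-node-2     64500
"""


def peers_per_node(calicoctl_output):
    """Map each node name to the list of peer IPs configured for it."""
    lines = [line for line in calicoctl_output.splitlines() if line.strip()]
    peers = defaultdict(list)
    for line in lines[1:]:  # skip the header row
        _name, peer_ip, node, _asn = line.split()
        peers[node].append(peer_ip)
    return dict(peers)


def nodes_missing_peers(calicoctl_output, expected=2):
    """Return the nodes that do not have `expected` distinct peers."""
    return sorted(node for node, ips in peers_per_node(calicoctl_output).items()
                  if len(set(ips)) != expected)


if __name__ == "__main__":
    print(peers_per_node(SAMPLE))
    print("nodes to fix:", nodes_missing_peers(SAMPLE))
```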

6.     Verify that BGP peering is established:

   From the Kubernetes node:

cisco@calico-node-1:~$ sudo ./calicoctl node status

Calico process is running.

 

IPv4 BGP status

+---------------+---------------+-------+----------+-------------+

| PEER ADDRESS  |   PEER TYPE   | STATE |  SINCE   |    INFO     |

+---------------+---------------+-------+----------+-------------+

| 192.168.1.2   | Node specific | up    | 00:25:00 | Established |

| 192.168.1.3   | Node specific | up    | 00:25:00 | Established |

+---------------+---------------+-------+----------+-------------+

   From ACI:

fab2-apic1# fabric 203 show ip bgp summary vrf common:calico

----------------------------------------------------------------

 Node 203 (Leaf203)

----------------------------------------------------------------

BGP summary information for VRF common:calico, address family IPv4 Unicast

BGP router identifier 1.1.4.203, local AS number 65002

BGP table version is 291, IPv4 Unicast config peers 4, capable peers 4

16 network entries and 29 paths using 3088 bytes of memory

BGP attribute entries [19/2736], BGP AS path entries [0/0]

BGP community entries [0/0], BGP clusterlist entries [6/24]

 

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd

192.168.1.1     4 64501    3694    3685      291    0    0 00:01:30 2

192.168.1.9     4 64502    3618    3612      291    0    0 00:00:10 2

192.168.1.17    4 64503    3626    3620      291    0    0 00:00:13 3

7.     Advertise the Cluster Service Subnet via eBGP. If you have changed the default cluster subnet (10.96.0.0/12), then modify the command below. This will result in all the nodes advertising the subnet to the ACI fabric.

calico-master-1# kubectl patch ds -n kube-system calico-node --patch '{"spec": {"template": {"spec": {"containers": [{"name": "calico-node", "env": [{"name": "CALICO_ADVERTISE_CLUSTER_IPS", "value": "10.96.0.0/12"}]}]}}}}'
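The nested JSON in this patch is easy to mistype when the service subnet has to be changed. As a convenience, the patch string can be built programmatically. The sketch below uses only the Python standard library; the function names are hypothetical, and the resulting command is the same kubectl invocation shown above with a different subnet substituted.

```python
import ipaddress
import json
import shlex


def build_advertise_patch(service_cidr="10.96.0.0/12"):
    """Return the JSON patch that sets CALICO_ADVERTISE_CLUSTER_IPS."""
    ipaddress.ip_network(service_cidr)  # raises ValueError on a malformed CIDR
    patch = {
        "spec": {"template": {"spec": {"containers": [{
            "name": "calico-node",
            "env": [{"name": "CALICO_ADVERTISE_CLUSTER_IPS",
                     "value": service_cidr}],
        }]}}}
    }
    return json.dumps(patch)


def build_kubectl_command(service_cidr="10.96.0.0/12"):
    """Return the full shell command with the patch safely quoted."""
    return ("kubectl patch ds -n kube-system calico-node --patch "
            + shlex.quote(build_advertise_patch(service_cidr)))


if __name__ == "__main__":
    print(build_kubectl_command("10.96.0.0/12"))
```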

8.     By default, Kubernetes service IPs and node ports are accessible via any node and will be load balanced by kube-proxy across all the pods backing the service.  To advertise a service directly from just the nodes hosting it (without kube-proxy load balancing), configure the service as “NodePort” and set the “externalTrafficPolicy” to “Local.” This will result in the /32 service IP being advertised to the fabric only by the nodes where the service is active.

apiVersion: v1

kind: Service

metadata:

  name: frontend

  labels:

    app: guestbook

    tier: frontend

spec:

  # if your cluster supports it, uncomment the following to automatically create

  # an external load-balanced IP for the frontend service.

  # type: LoadBalancer

  ports:

  - port: 80

  selector:

    app: guestbook

    tier: frontend

  type: NodePort

  externalTrafficPolicy: Local

9.     Verify ACI is receiving the correct routes:

    Every Calico node advertises a /26 subnet to ACI from the Pod Subnet.[6]

    Every exposed service should be advertised as a /32 host route. An example:

cisco@calico-master-1:~$ kubectl -n gb get svc frontend

NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE

frontend   NodePort   10.108.139.126   <none>        80:30702/TCP   6d1h

    Connect to one of the ACI border leaves and check that we are receiving these subnets.

Leaf203# show ip route vrf common:calico

IP Route Table for VRF "common:calico"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

<=========== SNIP =============>

10.96.0.0/12, ubest/mbest: 1/0 <== Service Subnet
    *via 192.168.1.1%common:calico, [20/0], 02:22:39, bgp-65002, external, tag 64500
10.108.139.126/32, ubest/mbest: 1/0 <== Exposed Service IP on node-1
    *via 192.168.1.9%common:calico, [20/0], 02:21:22, bgp-65002, external, tag 64500
192.169.0.0/24, ubest/mbest: 1/0 <== POD Subnet assigned to master-1
    *via 192.168.1.1%common:calico, [20/0], 02:22:39, bgp-65002, external, tag 64500
192.169.1.0/24, ubest/mbest: 1/0 <== POD Subnet assigned to node-1
    *via 192.168.1.9%common:calico, [20/0], 02:21:19, bgp-65002, external, tag 64500
192.169.2.0/24, ubest/mbest: 1/0 <== POD Subnet assigned to node-2
    *via 192.168.1.17%common:calico, [20/0], 02:21:22, bgp-65002, external, tag 64500

<=========== SNIP =============>

Note:      ACI strictly follows the eBGP specification by installing only one next-hop for each eBGP prefix regardless of the number of peers advertising that prefix.

This results in suboptimal traffic distribution for exposed services: A single node (node-1 in the above example) will be used for all the ingress traffic for a single service (10.108.139.126/32). A new feature in Cisco ACI Release 4.2 will relax this single next-hop rule to allow ECMP load balancing across all nodes that are advertising the host route for the service.

Cluster connectivity outside of the fabric - transit routing

In order for the nodes of the Kubernetes cluster to communicate with devices located outside the fabric, transit routing needs to be configured between the Calico L3Out and one (or more) L3Outs connecting to an external routed domain.

This configuration requires adding the required consumed/provided contract between the external EPGs. Please refer to https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-737909.html?cachemode=refresh#_Toc6453023 for best practices on how to configure transit routing.

Cluster connectivity inside the fabric

Once the eBGP configuration is completed, pods and service IPs will be advertised to the ACI fabric. To provide connectivity between the cluster’s L3Out external EPG and a different EPG, all that is required is a contract between those EPGs.

Conclusion

By combining Cisco ACI and Calico, customers can design Kubernetes clusters that deliver both high performance (with no overlay overhead) and exceptional resilience while keeping the design simple to manage and troubleshoot.

 



[4] The POD supernet is, by default, split into multiple /26 subnets and allocated to each node as needed. In case of IP exhaustion, a node can potentially borrow IPs from a different node-pod subnet. In this case a host route will be advertised from the node for the borrowed IP. More details on the Calico IPAM can be found here: https://www.projectcalico.org/calico-ipam-explained-and-enhanced
[5] Because each Kubernetes node is placed in a dedicated subnet, we will need to configure Single L3Out transit routing as per: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/2-x/L3_config/b_Cisco_APIC_Layer_3_Configuration_Guide/b_Cisco_APIC_Layer_3_Configuration_Guide_chapter_010100.html#id_69433; however, because we do not need to advertise any subnet to the Calico node, there is no need to configure the 0.0.0.0/0 subnet with Export Route Control.
[6] This is the default behavior. Additional /26 subnets will be allocated to nodes automatically if they exhaust their existing /26 allocations.  In addition, the node may advertise /32 pod–specific routes if it is hosting a pod that falls outside of its /26 subnets. Refer to https://www.projectcalico.org/calico-ipam-explained-and-enhanced for more details on Calico IPAM configuration and controlling IP allocation for specific pods, namespaces, or nodes.
