What Is a Service Mesh?

A service mesh uses lightweight proxy containers to provide a transparent infrastructure layer over a microservice-based app. It delivers operational features that improve app resilience, observability, and security, freeing the app's owners to focus on the business logic.

Why would an organization need a service mesh?

A service mesh is a configurable, low-latency infrastructure layer implemented in software, and adopting one is still a fairly new approach. It was born out of organizations' need to effectively manage the rapidly growing number of microservices they develop as they build applications.

For example, an e-commerce app typically has a microservices architecture, with front-end and back-end components, and needs those services, such as shopping cart and shipping services, to communicate securely to support customer transactions.

A service mesh can bind all the services in a Kubernetes cluster together so they can communicate with each other. It enables secure, service-to-service communication, typically by encrypting traffic between services with mutual Transport Layer Security (mTLS). Service mesh features include traffic management, observability, and security.
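
As a concrete illustration, here is a minimal sketch of how mesh-wide encryption might be switched on with Istio, one of the meshes discussed below. The resource type and the root-namespace convention are Istio's, not a property of service meshes in general.

```yaml
# Minimal sketch (assuming Istio): require mutual TLS for all
# service-to-service traffic in the mesh. Applied in istio-system,
# Istio's root configuration namespace, this covers every workload.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # reject plain-text connections between services
```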

What are data planes and control planes?

A typical service mesh has two main architectural components: a data plane and a control plane.

Data plane

The data plane is responsible for tasks such as health checking, routing, authentication, and authorization. It translates, forwards, and observes all network packets flowing to and from service instances. A service mesh may include more than one data plane.

Control plane

The control plane in a service mesh provides policy and configuration for all the data planes in the mesh. Unlike a data plane, the control plane doesn't touch any packets or requests in the system. Instead, it turns the collection of data plane proxies into a coordinated distributed system.

The control plane of a service mesh is usually human-operated through a command-line interface (CLI), web portal, or some other kind of user interface.
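
To make this concrete, the sketch below shows the kind of declarative policy an operator might apply through a CLI such as kubectl; the control plane then distributes it to every relevant sidecar proxy in the data plane. It assumes Istio, and the "payments" service name is hypothetical.

```yaml
# Hypothetical circuit-breaking policy (assuming Istio). The control
# plane pushes this configuration to every sidecar that calls "payments";
# the sidecars, not the application code, enforce it.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-circuit-breaker
spec:
  host: payments                     # hypothetical service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent connections
    outlierDetection:
      consecutive5xxErrors: 5        # eject an instance after 5 errors
      interval: 30s                  # how often instances are evaluated
      baseEjectionTime: 60s          # how long an ejected instance sits out
```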

What is a sidecar proxy?

A service mesh is, essentially, a mesh of network proxies. App development teams implement the service mesh using sidecar proxies: additional containers that proxy all connections to the containers where the services live, typically within a container orchestrator such as Kubernetes (also known as K8s).

As its name suggests, a sidecar proxy in a service mesh runs alongside a service instance, such as within a Kubernetes pod. Sidecar proxies enforce policies and collect telemetry in the data plane. They can handle communication, monitoring, and security-related concerns between microservices.
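
In Kubernetes, injecting the sidecar is usually automated by the orchestrator rather than written by hand. A minimal sketch, assuming Istio's automatic injection (the namespace name is hypothetical):

```yaml
# Labeling a namespace so that Istio injects an Envoy sidecar proxy
# into every pod created there; application manifests stay unchanged.
apiVersion: v1
kind: Namespace
metadata:
  name: shop                    # hypothetical namespace
  labels:
    istio-injection: enabled    # Istio's automatic sidecar injection label
```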

Sidecar benefits for developers

Sidecar proxies allow developers to focus on developing, supporting, and maintaining the application code for microservices and help operations teams stay focused on running applications and maintaining the service mesh. In short, the service mesh lets these teams decouple their work.

A container orchestration framework manages all the sidecar proxies and becomes an increasingly critical tool as an application's infrastructure expands.

What are options for building a service mesh?

There are several popular offerings, many of them open source, that organizations can consider using to build a service mesh. Consul, Kuma, AWS App Mesh, and Istio are examples of well-established options in the marketplace.

Key benefits of a service-mesh architecture

Controlled rollouts of new services

Traditional applications work like this: A client exchanges HTTP requests and responses with a web server. That server, in turn, communicates with an application server and perhaps a database. But if an organization wants to change or upgrade any one function, it must upgrade the entire app.

With microservices, each application function runs in its own container. These containers need to communicate with each other across the network, which acts as an operating system of sorts. The advantage of using microservices is that each one can be rolled out or upgraded gradually and carefully, without having to upgrade the whole app at once.

A top benefit of a service-mesh architecture for application developers is that it gives teams more flexibility in how and when they test and release new functions or services. For example, as explained below, a service mesh supports canary deployments, A/B testing, and blue-green testing.

Canary deployments

With a canary deployment, or canary rollout, application developers can push a new version of code into production and route a proportion of users to it while the rest remain on the current version.

For example, developers could route just 10 percent of traffic to the new version to start and rely on the service mesh to confirm that the service is working as intended before expanding the rollout. Through this approach, the developers can confidently shift more traffic to the new version over time.
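
A minimal sketch of such a canary, assuming Istio; the hypothetical "checkout" service's subsets v1 and v2 would be defined in a matching DestinationRule.

```yaml
# Canary rollout sketch (assuming Istio): route 10 percent of traffic
# to the new version (v2) while 90 percent stays on the current one (v1).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-canary
spec:
  hosts:
    - checkout                 # hypothetical service name
  http:
    - route:
        - destination:
            host: checkout
            subset: v1         # current version
          weight: 90
        - destination:
            host: checkout
            subset: v2         # canary version
          weight: 10
```

Expanding the rollout is then a matter of editing the weights, say to 50/50 and eventually 0/100, with no change to the application code.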

A/B testing

The service mesh also lets organizations road test new ideas. For example, say a retailer wants to use a new web design for a Black Friday campaign. It can use the service mesh to test the new design months ahead of time to gather user feedback. The retailer could test the design with 5 percent of customers and, if it resonates, make it available to all customers on Black Friday.
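
Percentage-based weights like those in the canary example above would cover the 5 percent split. For a stricter A/B test, where the same customer should always see the same variant, the mesh can instead route on a request attribute. A sketch assuming Istio, with a hypothetical header set by the application to mark the test cohort:

```yaml
# A/B testing sketch (assuming Istio): requests carrying the hypothetical
# x-experiment header see the new design; everyone else sees the current one.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: storefront-ab
spec:
  hosts:
    - storefront                    # hypothetical front-end service
  http:
    - match:
        - headers:
            x-experiment:           # hypothetical cohort header
              exact: new-design
      route:
        - destination:
            host: storefront
            subset: new-design
    - route:                        # default route for all other users
        - destination:
            host: storefront
            subset: current-design
```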

Blue-green testing

Developers can also use the service mesh to conduct blue-green testing, an engineering process for testing new versions of a service. It involves running two identical production environments, "blue" (the current version) and "green" (the new one), and monitoring for errors or unwanted changes in user behavior as traffic is moved from the old environment to the new.
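
A sketch of the traffic flip, again assuming Istio and a hypothetical "orders" service whose blue and green subsets are defined in a DestinationRule:

```yaml
# Blue-green sketch (assuming Istio): all traffic now goes to the green
# environment; setting the weights back to 100/0 rolls back instantly.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-blue-green
spec:
  hosts:
    - orders                # hypothetical service name
  http:
    - route:
        - destination:
            host: orders
            subset: blue    # old environment, kept warm for rollback
          weight: 0
        - destination:
            host: orders
            subset: green   # new environment
          weight: 100
```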


Splitting services across multiple clusters

Splitting services across multiple clusters is another key benefit of service-mesh architecture. It allows DevOps, ITOps, and other teams working with applications to gain insight into how, and how quickly, various and often disparate clusters are communicating.

This insight allows teams to set service-level objectives (SLOs). While an individual SLO isn't tied directly to a business outcome, organizations typically group multiple SLOs together to ensure they're meeting their service-level agreements (SLAs) with end customers.

For example, an SLO might require a microservice to respond within 250 milliseconds for 99.9 percent of requests over a rolling 14-day period.
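
Because sidecar proxies export per-request metrics, such an SLO can be checked against mesh telemetry. The sketch below expresses the example as a Prometheus alerting rule; it assumes Istio's standard istio_request_duration_milliseconds histogram and a hypothetical "checkout" service, and a real deployment would typically use recording rules rather than a raw 14-day query.

```yaml
# SLO sketch: alert when the 99.9th-percentile latency of the hypothetical
# "checkout" service exceeds 250 ms over a rolling 14-day window.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutLatencySLOBreach
        expr: |
          histogram_quantile(0.999,
            sum(rate(istio_request_duration_milliseconds_bucket{
              destination_service_name="checkout"}[14d])) by (le)
          ) > 250
        annotations:
          summary: "checkout p99.9 latency above 250 ms over 14 days"
```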


Managing all services more effectively 

A service-mesh architecture allows organizations to control the sharing of services among the many disparate front-end and back-end clusters in an application, whether those clusters are located in the cloud or in on-premises environments.

However, DevOps teams, ITOps teams, site-reliability engineers (SREs), and others working with hundreds of applications and their components also need the ability to visualize all the communication, including the TLS-encrypted traffic, occurring across and within those environments. The service mesh provides some insight into this communication.

With a service-mesh manager tool, which sits on top of the service mesh, organizations can also gain deep observability of the services' topologies across the multiple infrastructure clusters in an application. This tool can help simplify management of the service mesh and help organizations use it to their advantage.

Organizations should consider adopting a service-mesh manager tool that provides both real-time and historic insight into security, latency, workloads, and more. This would allow teams to make changes to optimize app performance, connect services, pinpoint problems, and gradually introduce new services into the application infrastructure.

The tool should also offer the ability to click on any microservice in the service mesh, so that teams can easily make changes to that service, the cluster, or even the service mesh itself. Note that only the service mesh features are changeable; no changes to the application code should be required.

If a service isn't working as intended or meeting the desired SLO requirements, the service-mesh manager tool should be able to create a ticket or page automatically. That helps teams move quickly to investigate and correct the issue.