Full-stack observability in action: stories from inside Cisco IT

Published: November 2021

Slow page loads or queries? Delayed message delivery? If there’s a user experience problem with any application, Cisco IT wants to pinpoint the cause as soon as possible so we can fix the problem. That’s true whether or not we own the code. It’s part of our commitment to a great experience—for our workforce, contractors, and customers.

But zeroing in on what’s causing performance issues isn’t so simple. The problem could be in the code, server, our network, internet service provider network, cloud provider network, or the user’s laptop. Troubleshooting each part of the stack can take hours—or days. Even small delays in issue detection or remediation take a toll on customer satisfaction, workforce efficiency, and our own workload in Cisco IT.

Faster issue identification, faster remediation

For applications whose code we own, Full-Stack Observability (FSO) has let us identify the source of experience problems much faster since 2020. We use Cisco ThousandEyes to monitor internet and network performance and the end-user experience, Cisco AppDynamics to monitor the application experience, and Cisco Intersight to troubleshoot Cisco UCS rack servers. The upshot is that we can see snags anywhere in our technology stack, from cloud or data center infrastructure, across various networks, all the way to the user device.

The first two stories below show how FSO helped us improve mean time to detection (MTTD) and mean time to resolution (MTTR) for applications we own. The third story describes how ThousandEyes helps us keep tabs on a SaaS application that we don’t own or control.

Use case 1 - Less than 2 minutes to identify the source of slow queries in a critical application

Customers use our internal Support Case Manager (SCM) application to open, view, and update their Technical Assistance Center (TAC) cases. Customers who need support are often already frustrated, so we want opening a case or looking up case status to be painless.

One day, customers who tried to look up their case status had to wait longer than usual. In the past, we wouldn’t have found out until one of those customers called the helpline. Then we’d assign someone to call the customer back to investigate. That took time, and meanwhile other customers would run into the same issue. Although we use various third-party monitoring tools, they don’t always detect user-experience problems.

With FSO, we detected and resolved the issue before a single customer complained. Likely before they even noticed. “We’ve set up AppDynamics to send an alert whenever any database that SCM queries is slow,” says Arunima Karunanidhi, Cisco IT analyst supporting SCM. “On the day queries were slow, it took us less than two minutes to see which database was causing the problem, and less than five minutes for the database team to resolve it. Without the information from AppDynamics, resolution probably would have taken about 30 minutes.” That’s a significant improvement in MTTD and MTTR.

“We helped Cisco IT teams adopt AppDynamics and get full value from its features,” says Sesh Surapaneni, technical program manager in Cisco IT assurance. To make sure AppDynamics alerts don’t get lost in email, the SCM team integrated AppDynamics with Webex Teams. Now alerts show up right away on team members’ mobile phones. With all alerts in one place, the team can quickly see if alert volume has increased. If so, they know the application has a problem—or that they need to raise the threshold for alerts.
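The alerting logic described above can be illustrated with a minimal Python sketch. It does not use the AppDynamics or Webex APIs. The database names, sample values, the 500 ms threshold, and the "twice the recent average" rule for alert volume are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QuerySample:
    database: str       # hypothetical database name
    response_ms: float  # measured query response time in milliseconds


def slow_databases(samples, threshold_ms):
    """Return, sorted by name, the databases whose average response time exceeds threshold_ms."""
    by_db = defaultdict(list)
    for sample in samples:
        by_db[sample.database].append(sample.response_ms)
    return sorted(
        db for db, times in by_db.items()
        if sum(times) / len(times) > threshold_ms
    )


def alert_volume_increased(alerts_per_hour, factor=2.0):
    """True when the latest hourly alert count is at least `factor` times the average of earlier hours."""
    *history, latest = alerts_per_hour
    if not history:
        return False
    baseline = sum(history) / len(history)
    return latest > 0 and latest >= factor * baseline


# Hypothetical measurements: one database is clearly slower than the other.
samples = [
    QuerySample("orders_db", 120.0),
    QuerySample("orders_db", 140.0),
    QuerySample("cases_db", 900.0),
    QuerySample("cases_db", 1100.0),
]
print(slow_databases(samples, threshold_ms=500))  # ['cases_db']
print(alert_volume_increased([3, 4, 3, 12]))      # True
```

A jump in alert volume like the one in the last line is the cue to either investigate the application or, if the alerts turn out to be noise, raise the threshold.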

Since starting to use AppDynamics for SCM, we’ve seen:

50% faster detection and root cause analysis.
9 major and 135 minor incidents avoided in fiscal year 2021. That’s because we follow up when AppDynamics reveals signs of impending failure. Slow queries, for instance, can signal that the application is likely to fail soon or isn’t performing up to par.
99.85% availability, up from 99.79%. The improvement might seem tiny, but for an application that runs around the clock it translates to roughly five hours less downtime in a year, helping our business run more smoothly and improving the customer experience.
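To see what the availability figures mean in hours, here is a short worked calculation in Python. It assumes the application is expected to be available around the clock, 365 days a year. The article does not state the measurement window, so treat the result as an approximation.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming 24x7 operation and ignoring leap years


def annual_downtime_hours(availability_percent):
    """Convert an availability percentage into expected hours of downtime per year."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


before = annual_downtime_hours(99.79)
after = annual_downtime_hours(99.85)
saved = before - after

print(f"{before:.1f} h -> {after:.1f} h, about {saved:.1f} h less downtime per year")
```

Under that assumption, a 0.06 percentage point improvement works out to a little over five hours of downtime avoided each year.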

Use case 2 - Speedy identification of a service provider issue affecting Webex Messaging

One way we use ThousandEyes is to view the performance of real-time collaboration applications like Cisco Webex, voice, and video. These applications are very sensitive to packet loss, latency, and jitter. ThousandEyes has helped lower MTTD in two ways. First, we can quickly see whether a performance problem is caused by the network or application so the right team can get right to work. Second, if the network is the culprit, we can zero in on the specific segment causing the problem—whether it’s our network, a service provider network, or an employee’s home network.

Here’s a real-life example. One day the Cisco IT Collaboration team heard from employees in Bangalore that messages sent via the Webex App were delayed. The problem was sporadic, making it difficult to troubleshoot. Enter ThousandEyes. “In just a few minutes I was able to set up a test with ThousandEyes to observe the Webex Messaging traffic flow,” says Dishendra De Silva, site reliability engineer. “Right away we could see significant packet loss and latency in the region—411 milliseconds compared to about 200 milliseconds ordinarily.” Drilling down to visualize the path, the team narrowed down the problem to a bad link between the local internet service provider network and the cloud service hosting Webex. When De Silva’s team shared the ThousandEyes data, the service provider acknowledged the issue and re-routed traffic around the problem area. Issue resolved.

“Without ThousandEyes, we would have had to do multiple trace routes and pull multiple logs, delaying detection,” says De Silva. “ThousandEyes let us immediately visualize the issue so we could work with our networking team and the service provider to troubleshoot.”

Figure 1 – The red lines indicate the network segment causing delays
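The idea behind the path drill-down can be sketched in a few lines of Python: given the cumulative latency measured at each hop, find the segment that adds the most delay. ThousandEyes does this visually. The code below is only an illustration, and the hop names and per-hop values are hypothetical. Only the 411 ms end-to-end figure comes from the story above.

```python
def worst_segment(hops):
    """Return (from_hop, to_hop, added_ms) for the segment with the largest latency increase.

    `hops` is a list of (name, cumulative_latency_ms) tuples in path order.
    """
    if len(hops) < 2:
        raise ValueError("need at least two hops to form a segment")
    worst = None
    for (src, src_ms), (dst, dst_ms) in zip(hops, hops[1:]):
        added = dst_ms - src_ms
        if worst is None or added > worst[2]:
            worst = (src, dst, added)
    return worst


# Hypothetical path from an office to the messaging service (values invented for illustration).
path = [
    ("office-lan", 2.0),
    ("corp-edge", 8.0),
    ("isp-core", 35.0),
    ("cloud-edge", 395.0),
    ("messaging-service", 411.0),
]
print(worst_segment(path))  # ('isp-core', 'cloud-edge', 360.0)
```

In this made-up path, almost all of the delay is added on the link between the service provider and the cloud edge, which is the kind of evidence that can be handed to the provider.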

Use case 3 - Resolving a critical Salesforce incident

We don’t own or control software-as-a-service (SaaS) applications, so troubleshooting used to be difficult or impossible. With ThousandEyes we can quickly spot user experience issues so we can alert the service provider.

For example, Cisco TAC engineers use a Salesforce SaaS product to create service requests and communicate with customers. Performance is ordinarily excellent. Then one day, it wasn’t. “From ThousandEyes we could see sporadic performance problems,” says Nandini Subbaraja, enterprise operations center duty manager. “Pages could load normally ten times in a row, but the next time the engineer might have to wait three minutes.”

Subbaraja detected the problem right away. That’s because her team uses ThousandEyes to perform test transactions that mimic the way engineers work: logging in, searching for and opening cases, and drilling down for the detail behind statistics. “On the day in question, we could see that over a 30-minute period, some transactions were fine while others were very slow,” she says. “No other tool would have shown that pages weren’t loading correctly. ThousandEyes also showed that the problem started in the SaaS provider’s environment, not our network or the internet.”
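Sporadic slowness like this is easy to miss when looking only at averages. The Python sketch below shows one simple way to flag it from a series of test-transaction timings by comparing each run with the median. It is not how ThousandEyes works internally. The timings and the "five times the median" rule are assumptions chosen to mirror the pattern described above (ten normal page loads, then a three-minute wait).

```python
from statistics import median


def summarize_transactions(durations_s, slow_factor=5.0):
    """Flag test transactions that take more than `slow_factor` times the median duration."""
    typical = median(durations_s)
    slow = [d for d in durations_s if d > slow_factor * typical]
    return {
        "typical_s": typical,
        "total_runs": len(durations_s),
        "slow_runs": len(slow),
        "sporadic": len(slow) > 0,
    }


# Hypothetical timings in seconds: ten normal runs followed by one three-minute outlier.
timings = [2.1, 1.9, 2.0, 2.2, 2.0, 1.8, 2.1, 2.0, 1.9, 2.0, 180.0]
print(summarize_transactions(timings))
```

Using the median as the baseline keeps a single very slow run from distorting what counts as typical, so the outlier stands out instead of being averaged away.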

Our application team alerted Salesforce, presenting data showing that the problem was in their environment. Salesforce traced it to a recent network change that left some customers unable to log in to their Salesforce services. Salesforce’s initial attempt at remediation—switching the active and ready sites—caused the incident to spread to three additional data centers around the world. “ThousandEyes helped us detect the problem sooner,” Subbaraja says. “That allowed us to fail over to our backup service, an internal web ticketing tool, and continue to meet the service-level agreements that customers count on.”

Future plans for great user experiences with FSO and SaaS monitoring

Plans are underway to implement Cisco Secure Application for security monitoring and managing application vulnerabilities. “We’ll be detecting runtime threats with Cisco Secure Application in addition to application availability and performance issues,” says Rajesh Bansal, senior director for assurance in Cisco IT. “The integration between ThousandEyes and AppDynamics helps us deliver an exceptional application experience by giving us end-to-end visibility into application and network issues.”

Stay tuned for lessons learned.

Learn more

Full Stack Observability

Employees do their own performance troubleshooting, with ThousandEyes (Part 1)

Zeroing in on network performance issues, with ThousandEyes (Part 2)

How AppDynamics helps improve IT applications