I’m sure many of you have heard terms such as segmentation, microsegmentation, nanosegmentation, picosegmentation, and so on, along with all the marketing buzz about how important they are. In my opinion, you shouldn’t buy a product just to check a box off your list; you want to satisfy a business goal. For the sake of this discussion, say the end goal is to materially improve your workloads’ security posture against attacks.
First, step back and look at what the experts who practice cybersecurity defense say you should do. These people are cyberanalysts at nation-state agencies and know a thing or two about security. More on the experts later.
Before hearing what the experts say, a few small observations:
● Observation 1: Keep in mind that most successful attacks exploit vulnerabilities in the application and software stack, not the network layer (TCP/IP). Put another way, even if an enterprise has generated and implemented a perfect allowed list network microsegmentation policy, it is still vulnerable to many attacks.
Most attacks are rooted either in the attacker having valid credentials and walking in through the front door, or in an application vulnerability (for example, SQL injection or a web shell acquired on a front-end server), or in a combination of both. If the application is vulnerable, the attacker might exploit the vulnerability and then simply pivot through the allowed microsegmentation rules between the application components (no amount of encryption or authentication will help). After all, the process running the application might be compromised, and the microsegmentation rule doesn’t know that. The process might be compromised through any number of techniques, such as (reflective) Dynamic-Link Library (DLL) injection, shell code exploits, and so on.
Is network segmentation a useless idea? Absolutely not. Network segmentation is an essential tool, but you need to understand its strengths and limitations and use it effectively. Segmentation is a compartmentalization approach. If a breach happens, it tries to contain the attack and make the life of the attacker a bit more difficult. However, if there are connections (doors) between the compartments, that’s what the attack/attacker can then exploit.
● Observation 2: Enforcing the network segmentation policy is the simpler part of the microsegmentation story. Generating an always up-to-date, correct network microsegmentation policy in a highly agile, cloud-scale environment is a massive challenge.
Now add operational requirements such as merging an existing business security policy with the autogenerated application policy, providing the right access through role-based access control, merging tribal knowledge from different application groups, maintaining versioning and rollbacks, tracking usage analytics around policies, identifying per-packet policy compliance deviations, threat modeling (you don’t want your microsegmentation solution to become a building block for ransomware because the vendor did a poor job of threat analysis and of protecting its own solution), and so on.
Hopefully you’ll see that enforcement, while essential, is really just a small part of a broader workload protection strategy.
● Observation 3: Are there differences between campus computers and workloads running in public or private data centers? Absolutely. Keeping this fact in mind when designing your security strategy is crucial.
Data center workloads are relatively easier to address with allowed listing than campus machines. On my laptop, the software I run differs from what other people in similar roles use, and the code that I run varies over time. Sometimes I code using vi or an Integrated Development Environment (IDE); other times I read email, use a messaging app, browse the web with different browsers and browsing personas, run isolated virtual machines to try out an exploit technique or a penetration test idea, and so on.
So the applications I run vary over time on the same machine and differ from those of my coworkers, who might run a different music app or use a different IDE. This happens because those of us using campus end devices are human: we like the freedom to install the software packages we like and to use devices we are comfortable with. This is why the bring-your-own-device movement picked up.
Now contrast this with workloads running in the data center. Most workloads are homogeneous for a given function. Think of the web front-end servers for a given application: they all run the same software stack. Without a model that can be automated, you can’t build a large data center, because the human cost of operating it would be immense.
It’s therefore easier to allowed list applications and their communication patterns on data center workloads. For campus machines, the easier approach is often a blocked list of what not to do or run, combined with highly coarse rather than fine-grained allowed list policies. Our observation is that the data center is a much more tractable problem for allowed listing and automation in general, thanks to its homogeneity and the automation systems available to plug into.
● Observation 4: Whatever solution we consider must work for today’s technology trends and those on the horizon. For example, mainframes (if you have them), bare-metal servers, virtual machines, and containers must all be natively supported, and so must public and private data centers. The solution must integrate with external orchestrators, load balancers, and external enforcement points (there is value in defense in depth) and pull metadata and context from Configuration Management Database (CMDB) systems. This integration is essential because application knowledge and context are distributed among all of these systems.
Back to the nation-state cyberdefense experts. The Australian Signals Directorate (ASD) came out with a great set of white papers (https://www.asd.gov.au/infosec/mitigationstrategies.htm) in which the ASD looked at the attacks it saw and how to mitigate or prevent them. I’m a big fan of this analysis. It’s pretty comprehensive and, most importantly, is data driven. The cyberanalyst team started from the attacks it saw across a wide set of workloads and mapped those attacks back to what would have prevented them. The study is vendor neutral, with no one selling any gear. I have seen similar recommendations from the NSA’s TAO group (https://www.wired.com/2016/01/nsa-hacker-chief-explains-how-to-keep-him-out-of-your-system/).
The primary takeaways from the ASD analysis are:
● The threat is real, but there are things every organization can do to significantly reduce the risk of a cyberintrusion.
● In 2009, based on its analysis of these intrusions, ASD produced strategies to mitigate targeted cyberintrusions. At least 85 percent of the intrusions to which ASD responded in 2011 involved adversaries using unsophisticated techniques that would have been mitigated by implementing the top four mitigation strategies as a package.
● The top four mitigations are application allowed listing, patching applications and operating systems, using the latest versions, and minimizing administrative privileges.
Why doesn’t everyone just follow the four steps, if it is that simple?
Allowed listing applications (their allowed system calls and interactions) and their communication patterns is a nontrivial problem (see the observations earlier, especially 2 and 4). In this paper, we will study the problem and then look at the solutions. There is hope.
Life evolved on planet Earth because multiple factors came together. NASA (https://science.nasa.gov/science-news/science-at-nasa/2003/02oct_goldilocks) uses the term Goldilocks zone for the conditions that must be present to support life.
In computer security for workload protection, multiple things have to come together to produce an effective security solution for cloud workloads.
A good cloud workload protection solution must have all of the following characteristics.
Characteristic 1: Independence from how the workload is instantiated
The solution must span everything from containers at one end of the spectrum to mainframes at the other, with a common policy enforcement and workload protection framework. For example, a Programmed Data Processor (PDP) mainframe and a container both have processes, and the same concepts apply to both. The solution must cover mainframes, bare-metal workloads, virtual machines, containers, and so on, and it must work across a plethora of operating systems.
This makes life much simpler for you as an end user. You should only have to worry about how to apply a policy that isolates workloads running a highly exploitable, vulnerable software package from high-risk workloads. You shouldn’t have to worry about where that workload resides (in the cloud or an on-premises data center), what operating system it runs, whether it is a container or a virtual machine, and so on. The solution should also be extensible to handle function as a service, which is on the horizon but not yet mainstream.
Characteristic 2: Workload location independence from public or private clouds to data centers
The solution must be able to handle workloads running both in on-premises data centers and in any public cloud. Like characteristic 1, this characteristic enables the customer to simply push policies and actions without having to worry about where the workload resides.
Characteristic 3: Federation and scale
Every system has scale limits. A highly desirable solution, arguably a mandatory requirement at scale, is the ability to federate with other systems. Federation lets the customer run multiple workload protection zones for availability and robustness while still letting the zones share information with each other.
Characteristic 4: Encourage and integrate with an ecosystem
The solution must be able to talk to other external systems, including:
● Security Information and Event Management (SIEM) or high-level log correlation systems on the northbound interface
● Enforcement products from other vendors on the southbound interface
● Ability to talk horizontally to the campus network security controllers to enrich its own knowledge and share its insights with them
● Ability to talk to compute orchestration systems such as AWS, Azure, GCP, VMware vSphere, Kubernetes API server, and so on
● Ability to talk to CMDB systems to get inventory metadata
● Ability to talk to application delivery controllers
● Ability to take in threat feeds and nonthreat feeds (for example, geo data)
Characteristic 5: Encourage multiple points of enforcement
Security is always best with layers of defense. Having just one enforcement point is risky. The more points of enforcement, the harder the target is to attack.
Characteristic 6: Visibility across multiple planes and domain boundaries
The solution must have visibility and enforcement reach across multiple planes: the network plane, storage plane, compute plane, and user plane. It must be able to see inside the workload (virtual machine, container, or bare metal) and also outside it, so that it can correlate events across these boundaries and make sense of them.
Here’s an example of how watching multiple domains helps. Suppose a kernel rootkit is present and the only visibility is from inside the virtual machine. If the rootkit is well written, chances are it will hide its traces, such as its Command and Control (C&C) channel, from the agent watching the workload. However, if the infrastructure sees the flows and attributes them to the workload, the difference between what the agent reports and what the infrastructure observes clearly raises suspicion.
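To make the differential analysis concrete, here is a minimal sketch. It assumes two hypothetical inputs: the set of flows that the network infrastructure attributes to a workload, and the set of flows that the agent inside that workload reports. Flow records are reduced to a simplified tuple, and the schema is illustrative. A real system would also reconcile timestamps and tolerate collection gaps.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """A simplified flow record (hypothetical schema)."""
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str


def unexplained_flows(infra_seen: set[Flow], agent_reported: set[Flow]) -> set[Flow]:
    """Flows the infrastructure attributed to the workload but the agent never reported.

    A non-empty result does not prove a rootkit, but it is exactly the
    kind of discrepancy that should raise suspicion.
    """
    return infra_seen - agent_reported


infra = {
    Flow("10.0.1.5", "10.0.2.8", 5432, "tcp"),
    Flow("10.0.1.5", "203.0.113.50", 443, "tcp"),  # hidden C&C channel
}
agent = {Flow("10.0.1.5", "10.0.2.8", 5432, "tcp")}

for flow in sorted(unexplained_flows(infra, agent)):
    print("suspicious:", flow)
```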
Now let’s look at the building blocks of a robust cloud workload protection solution.
Building block 1: Visibility
You want a solution that magically captures high-granularity interaction data between workloads and inside each workload. This data forms the foundation for all the layers above it. As they say, “If you can’t measure it, you can’t improve it.” A version applicable to security is “If you can’t see it, you can’t secure it.”
What should a system ideally collect?
● Packet-by-packet network activity
● Process metadata (command line, process hash, libraries loaded, dynamic or static, forks, exits, process I/O, File Descriptors (FD), and so on)
● Storage or file system activity: files accessed, hash of the content of the file, and so on
● User activity: who logged in, how the login was done (remote, local, tty, console, and so on), time of login, time of logout, and so on
In addition to collection, the system should be able to aggregate the data and summarize it at higher levels of abstraction.
Why is this so important?
To generate high-quality policies in a brownfield environment, you need access to high-resolution data. This data also lets you test policy changes against past traffic before pushing them into enforcement, search past data for Indicators of Compromise (IOCs) in the environment, and perform forensic analysis of an incident.
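Testing a policy change on past data can be as simple as replaying stored flows through a candidate policy. The sketch below assumes that flows have already been mapped to hypothetical workload tags ("web", "app", "db") and that a policy is a set of allowed (source, destination, port) triples. Both assumptions are illustrative.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """A historical flow, already mapped to workload tags (hypothetical labels)."""
    src: str
    dst: str
    port: int


def replay(policy: set[tuple[str, str, int]], history: list[Flow]) -> list[Flow]:
    """Return the historical flows that the candidate policy would have blocked."""
    return [f for f in history if (f.src, f.dst, f.port) not in policy]


history = [
    Flow("web", "app", 8080),
    Flow("app", "db", 5432),
    Flow("web", "db", 5432),  # web tier reaching the database directly
]
candidate = {("web", "app", 8080), ("app", "db", 5432)}

for flow in replay(candidate, history):
    print("would be blocked:", flow)
```

An empty result means the candidate policy would not have broken anything seen in the stored window. A non-empty result is either a missing rule or, as in the web-to-database case, traffic worth questioning before enforcement.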
Building block 2: Vulnerability detection
All installed software that runs in user space or in the kernel on the workload must be scanned for vulnerabilities. Vulnerability analysis can be done through static analysis of the code or by comparing the installed software against a known set of vulnerabilities. Vulnerability analysis also helps identify unused network services and packages so that the user can uninstall them.
One approach is an outside-in vulnerability scanner: an external tool that checks the workloads over the network. The other, the inside-out approach, uses an agent running on the workload to analyze it from within. Of the two, the inside-out approach is more reliable and produces better results than an external scanner, which is limited to discovering packages from their network fingerprint. Cloud operators often provide image analysis services that scan a workload image before it is launched, which assumes that nothing is installed at runtime.
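Comparing installed versions against known vulnerable ones can be sketched as follows. The advisory feed, package name, and advisory IDs are hypothetical, and versions are assumed to be plain dotted integers. Real package managers such as RPM and dpkg have their own version-comparison rules, and a production scanner must follow those rules.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.19.5' into (1, 19, 5). Assumes purely numeric dotted versions."""
    parts = [int(p) for p in version.split(".")]
    # Pad to three components so that '1.20' compares equal to '1.20.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


# Hypothetical advisory feed: package -> list of (advisory id, first fixed version).
ADVISORIES: dict[str, list[tuple[str, str]]] = {
    "examplepkg": [("ADV-0001", "1.20.0"), ("ADV-0002", "1.21.2")],
}


def open_advisories(package: str, installed: str) -> list[str]:
    """Return the advisories that still apply to the installed version."""
    have = parse_version(installed)
    return [
        adv_id
        for adv_id, fixed_in in ADVISORIES.get(package, [])
        if have < parse_version(fixed_in)
    ]


print(open_advisories("examplepkg", "1.20.5"))  # ['ADV-0002']
```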
Building block 3: Full policy lifecycle of workload segmentation
Workload segmentation is a very important building block of the protection solution. It protects the workload by isolating it from the parts of the universe with which it doesn’t need to communicate. Workload segmentation also contains any breach to as small a blast radius as possible. It often uses firewalling in the workload (for example, iptables/ipsets, Windows Firewall with Advanced Security, or the Windows Filtering Platform). Policies inserted inside the workload move with the workload if it is migrated elsewhere.
The primary stumbling block of workload segmentation is defining and discovering the allowed list policy at scale, which is practically impossible without an analytics engine. The other challenge is building a policy that supports elastic workloads, can be tested without being enforced, and provides real-time compliance reports, version tracking, and policy rollback.
Segmentation policy sometimes covers encryption or authentication of traffic between workloads, protecting their communication on a shared segment.
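As an illustration of in-workload firewalling, the sketch below renders a small allowed-list inbound policy into iptables commands. The rule set and subnets are hypothetical. The commands use standard iptables options (conntrack state matching and a default DROP policy). A real agent would more likely program ipsets and apply the rules atomically, for example with iptables-restore, rather than emit individual commands.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    src_cidr: str      # consumer subnet allowed to connect
    port: int          # destination port on this workload
    proto: str = "tcp"


def render_iptables(rules: list[Rule]) -> list[str]:
    """Render an allowed-list inbound policy as iptables commands.

    Return traffic and loopback are allowed first, then each permitted
    consumer; the default policy is switched to DROP only at the end.
    """
    cmds = [
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
        "iptables -A INPUT -i lo -j ACCEPT",
    ]
    for r in rules:
        cmds.append(
            f"iptables -A INPUT -p {r.proto} -s {r.src_cidr} --dport {r.port} -j ACCEPT"
        )
    cmds.append("iptables -P INPUT DROP")  # everything not listed is denied
    return cmds


for cmd in render_iptables([Rule("10.0.1.0/24", 8080)]):
    print(cmd)
```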
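To show why discovery needs analytics, here is a deliberately naive sketch of policy discovery. It aggregates observed flows into (source tier, destination tier, port) edges and keeps the edges seen often enough. The label map and the `min_count` threshold are hypothetical. Real systems cluster workloads by behavior instead of relying on preassigned labels. Rare but legitimate flows that fall under the threshold also need human review, not silent omission.

```python
from collections import Counter
from typing import Iterable


def discover_policy(
    flows: Iterable[tuple[str, str, int]],
    labels: dict[str, str],
    min_count: int = 3,
) -> set[tuple[str, str, int]]:
    """Aggregate (src_ip, dst_ip, dst_port) flows into tier-level allow rules."""
    counts = Counter(
        (labels.get(src, "unknown"), labels.get(dst, "unknown"), port)
        for src, dst, port in flows
    )
    # Keep only edges observed at least min_count times.
    return {edge for edge, n in counts.items() if n >= min_count}


labels = {"10.0.1.1": "web", "10.0.1.2": "web", "10.0.2.1": "app"}
flows = [
    ("10.0.1.1", "10.0.2.1", 8080),
    ("10.0.1.1", "10.0.2.1", 8080),
    ("10.0.1.2", "10.0.2.1", 8080),
    ("10.0.1.1", "10.0.2.1", 22),  # one-off SSH session
]
print(discover_policy(flows, labels))  # {('web', 'app', 8080)}
```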
Building block 4: Application behavior allowed listing
Application behavior allowed listing observes the behavior of each application running on the workload and optionally builds a baseline behavior model for each process. It observes network communication, the system call activity of each process, and the relationships between processes. It also examines CPU counters, the files opened by each process, the user associated with the process, memory analysis, memory cache spills, and more.
To do application behavior allowed listing, the agent must have a presence inside the workload or must be able to look into the workload’s namespace.
Well-known bad behaviors are often preloaded with the agent or the solution, and if such behavior is picked up, mitigation action is taken. Mitigation action is also taken if the application deviates from its known behavior model beyond some added tolerance.
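One simple way to express "deviates from its known behavior model with some added tolerance" is a per-process statistical baseline. The sketch below is an assumption, not the product's model. It learns the mean and standard deviation of per-interval system call counts. It then flags any system call never seen during training, and any count more than `k` standard deviations above the mean. The `k` and `floor` values are illustrative.

```python
import statistics


def build_baseline(samples: list[dict[str, int]]) -> dict[str, tuple[float, float]]:
    """samples: per-interval system call counts for one process during training."""
    names = {name for sample in samples for name in sample}
    baseline = {}
    for name in names:
        values = [sample.get(name, 0) for sample in samples]
        baseline[name] = (statistics.mean(values), statistics.pstdev(values))
    return baseline


def deviations(
    baseline: dict[str, tuple[float, float]],
    observed: dict[str, int],
    k: float = 3.0,
    floor: float = 1.0,
) -> list[str]:
    """Flag unseen system calls and counts above mean + k * max(stdev, floor)."""
    flagged = []
    for name, count in observed.items():
        if name not in baseline:
            flagged.append(name)  # never seen during training
            continue
        mean, std = baseline[name]
        if count > mean + k * max(std, floor):  # floor avoids a zero-width band
            flagged.append(name)
    return sorted(flagged)


samples = [{"read": 100, "write": 50}, {"read": 110, "write": 55}, {"read": 90, "write": 45}]
baseline = build_baseline(samples)
print(deviations(baseline, {"read": 105, "write": 200, "ptrace": 1}))  # ['ptrace', 'write']
```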
Enabling application behavior allowed listing with very few false positives or false negatives is a notoriously hard problem to implement in production systems at scale.
Some consider process hash analysis across a fleet of workloads, looking for known bad hashes, to be application behavior allowed listing. For it to be effective, both the hash of the executable on physical storage and the hash of the process running in memory must be checked.
Mitigation action on detection of bad behavior often involves using built-in operating system application control capabilities such as software restriction policies and AppLocker in Windows or SELinux (CentOS/Red Hat/Oracle Linux) or AppArmor with Ubuntu Linux.
Building block 5: Application allowed listing
Application allowed listing allows only a known set of processes to start; everything else is banned. Using allowed listing to control which executables run on a server is a powerful protection strategy: all malware that manifests itself as a file to be executed is blocked by default. The philosophy is that, for a server running an SQL database, it’s much easier to lock the server down to that application than to tell it the thousands of things it shouldn’t be doing.
The difference between building blocks 4 and 5 is that block 4 builds a behavioral model, whereas block 5 allows a known set of executables and libraries to be loaded. The allowed list-only model does not block bad behavior from an allowed application, because it has no visibility into the workload’s behavior.
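The default-deny decision at the heart of application allowed listing fits in a few lines. In practice, the operating system enforces the decision (AppLocker, SELinux, and similar). This sketch only shows the logic. It pins each allowed executable path to the SHA-256 hash of its content, so a tampered binary at a trusted path is also refused. The paths are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ExecAllowList:
    """Permit an executable only if both its path and its content hash are known."""

    def __init__(self) -> None:
        self._allowed: dict[str, str] = {}  # path -> expected SHA-256

    def allow(self, path: str, content: bytes) -> None:
        self._allowed[path] = sha256_hex(content)

    def may_execute(self, path: str, content: bytes) -> bool:
        # Default deny: unknown paths and modified binaries are both refused.
        return self._allowed.get(path) == sha256_hex(content)


policy = ExecAllowList()
policy.allow("/opt/db/bin/dbserver", b"original binary")
print(policy.may_execute("/opt/db/bin/dbserver", b"original binary"))      # True
print(policy.may_execute("/opt/db/bin/dbserver", b"patched by attacker"))  # False
print(policy.may_execute("/tmp/dropper", b"anything"))                     # False
```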
Building block 6: File integrity and memory monitoring
File integrity monitoring tracks the integrity of the file system, bootloaders, startup folders, the Windows registry, drivers, and so on, which helps ensure that the boot process has not been tampered with. Part of the solution must also include memory scans for memory-resident malware, for malware probing the cache lines, and for malware changing the read, write, and execute permissions in the page tables. The latter is a specific form of memory monitoring in which the memory’s control structures are watched rather than its contents.
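A minimal file integrity check takes a baseline of content hashes and later compares a fresh snapshot against it. The sketch assumes a single directory tree and reads whole files into memory. That is fine for configuration files and binaries but not for very large files. The monitored root is whatever path you choose, and the agent is assumed to have permission to read every file under it.

```python
import hashlib
from pathlib import Path


def snapshot(root: str) -> dict[str, str]:
    """Map every regular file under root to the SHA-256 of its content."""
    result = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            result[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def diff(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify changes between two snapshots."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            p for p in baseline.keys() & current.keys() if baseline[p] != current[p]
        ),
    }
```

Typical use: take `baseline = snapshot(root)` at a known-good time, then periodically call `diff(baseline, snapshot(root))` and alert on any non-empty category.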
Building block 7: Deception and decoys
Deception and decoys try to turn the tables on the attacker. The protection system creates fake vulnerabilities, systems, shares, cookies, and so on, often called honey tokens. If an attacker tries to exploit these fake resources, it’s a strong indicator that an attack is in progress, because a legitimate user should not see or try to access them.
Another technique is to tunnel traffic that the attacker sends to unbound network ports on the machine under attack to a decoy machine running the same operating system. The decoy binds to all ports. This promiscuous decoy lets the attacker play and explore while it continuously logs the attacker’s actions, and the monitoring system watching the decoy learns from those logs. The key in this model is to keep the attacker from noticing that it is attacking a decoy rather than a real system. A sophisticated decoy combined with an artificial intelligence (AI) system results in the attacker teaching the defending AI its tradecraft and techniques. The defense system can then look for those patterns across the real workload fleet.
Now that we’ve gone over the characteristics and essential building blocks of a solution, let’s see which of them the Cisco® Secure Workload platform already addresses and how.
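No legitimate user should touch a decoy, so the detection logic for honey tokens can be almost trivially simple. That simplicity is much of their appeal. The decoy names and the event format below are hypothetical.

```python
# Hypothetical decoys: a fake document on a file share and a fake service account.
HONEY_TOKENS = {"/srv/share/finance/passwords_backup.xlsx", "svc_backup_admin"}


def honey_token_hits(events: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """events: (user, resource) pairs; any access to a decoy is a high-confidence alert."""
    return [(user, resource) for user, resource in events if resource in HONEY_TOKENS]


events = [
    ("alice", "/srv/share/finance/q3_report.xlsx"),
    ("bob", "/srv/share/finance/passwords_backup.xlsx"),
    ("unknown", "svc_backup_admin"),
]
print(honey_token_hits(events))
```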
To track vulnerabilities in the operating system or in the user space software in the workload, the Cisco Secure Workload platform first establishes a baseline inventory of the data center, whether public or private. The platform tags every piece of workload inventory with an arbitrary set of user-defined tags or with tags it discovers from orchestration systems (for example, AWS, VMware, Kubernetes), and it scales to multiple millions of tracked workloads. Information is kept up to date in real time: any change in the environment (orchestrators) or on the workload is picked up. This information is then curated and indexed so that it is readily available for fast search and further processing, such as segmentation policy enforcement.
In addition to workload context, the Cisco Secure Workload platform gathers the full software inventory (installed software packages) from all monitored workloads in the data center, whether public or private, and curates it per workload just as it does the workload context. Any change to the software inventory triggers the platform to reevaluate policy and check for vulnerabilities.
The Cisco Secure Workload platform also bundles the vulnerabilities reported over the last 19 years. It uses Common Vulnerabilities and Exposures (CVE) data such as that available from NIST (https://nvd.nist.gov/) or MITRE. If the customer enables it, the platform also keeps the CVE databases on the cluster up to date with periodic updates. The platform cross-checks every software package against the full CVE database for vulnerable packages, taking into account the version of the software package, the impact scores, and so on.
Figure 1 shows a query, across the workloads that belong to a specific tenant, for the machines that have a vulnerable version of wget installed.
Inventory search to quickly identify all servers that run the wget application and have a known CVE
Figure 2 shows how the Cisco Secure Workload platform lets the user see what each software vulnerability is, along with its vulnerability score or ranking. It also deep links each CVE to NIST’s site for more details, as shown in Figure 3.
Software package inventory on a server and details on the known CVEs
Vulnerability details from NIST website
The Cisco Secure Workload platform also lets users or machines drive policy updates (over the open Secure Workload REST API) based on the presence of a package, a vulnerability, vulnerability scores, and so on.
It tracks vulnerability scores in both Common Vulnerability Scoring System (CVSS) versions, v2 and v3. Users can also acknowledge and disregard vulnerabilities that they deem unexploitable because of some other mitigating factor (for example, an air gap might be considered a sufficient barrier).
This system enables Cisco Secure Workload customers to:
● Search for any package in real time across their infrastructure without affecting the CPU load on their workloads.
● Deploy sophisticated policies across their workloads, which can track a vulnerability and automatically trigger a time-delayed policy, isolating the workload for remediation.
● Enable the information security team to mandate policies such as isolating all web-facing workloads with a CVE vulnerability score greater than 9.0 after giving the workload owner a five-day notice to fix the issue.
A picture is worth a thousand words. Figure 4 shows a system with a vulnerable version of ZooKeeper installed.
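The last policy in the list above combines a metadata condition, a score threshold, and a grace period. A small state machine captures it. The 9.0 threshold and the five-day notice come from the example above. The `Workload` fields and the return values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GRACE = timedelta(days=5)  # five-day fix notice
THRESHOLD = 9.0            # isolate when the score is greater than this


@dataclass
class Workload:
    name: str
    web_facing: bool
    max_cvss: float                          # highest score among open CVEs
    notified_at: Optional[datetime] = None   # when the owner was told to fix


def evaluate(w: Workload, now: datetime) -> str:
    """Return 'ok', 'notify', 'pending', or 'isolate' for one workload."""
    if not (w.web_facing and w.max_cvss > THRESHOLD):
        w.notified_at = None  # fixed (or never in scope): clear the clock
        return "ok"
    if w.notified_at is None:
        w.notified_at = now   # start the grace period
        return "notify"
    return "isolate" if now - w.notified_at >= GRACE else "pending"


start = datetime(2024, 1, 1)
w = Workload("web-01", web_facing=True, max_cvss=9.8)
for day in (0, 2, 5):
    print(day, evaluate(w, start + timedelta(days=day)))
# prints: 0 notify, 2 pending, 5 isolate (one per line)
```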
Information about software packages and vulnerability details
Cisco Secure Workload is a flexible metadata system and therefore can enforce policies based on inventory metadata. It reevaluates the policy associated with each inventory item whenever its metadata changes. (See Figure 5.)
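Metadata-driven policy can be modeled as selectors over tags. A rule applies to whichever workloads currently carry the matching tags, so changing a tag changes membership without editing the rule. The tag keys and workload names below are hypothetical.

```python
def matches(selector: dict[str, str], tags: dict[str, str]) -> bool:
    """A workload matches when it carries every key/value pair in the selector."""
    return all(tags.get(key) == value for key, value in selector.items())


def members(selector: dict[str, str], inventory: dict[str, dict[str, str]]) -> list[str]:
    """Reevaluate which workloads a selector covers, given current metadata."""
    return sorted(name for name, tags in inventory.items() if matches(selector, tags))


inventory = {
    "web-01": {"app": "shop", "env": "prod"},
    "web-02": {"app": "shop", "env": "prod"},
}
quarantine = {"app": "shop", "vuln": "critical"}

print(members(quarantine, inventory))      # []
inventory["web-02"]["vuln"] = "critical"   # a vulnerability scan adds a tag
print(members(quarantine, inventory))      # ['web-02']
```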
Policy definition using metadata
Figure 6 shows the sample enforced policy.
Policy enforcement using metadata
Cisco Secure Workload software agents track all activity on the workload, including processes, users, changes to memory or page table permissions, cache behavior, and many other system calls. They also track actions on the file system. All of this information is gathered with minimal load on the workloads (under 3 percent of CPU for all monitoring functions), and the agent enforces a strict SLA on itself. The Cisco Secure Workload platform curates all workload activity in a time-series database and associates the activity stream with the workload in its data lake.
The user can ask the Cisco Secure Workload platform to show the entire process lineage on any workload in the fleet since the workload was started; in fact, it tracks the full process lineage tree. The user can step forward or backward and replay the sequence. The roots in Figure 7 show the lineage for the kernel threads and for the user space processes.
Process lineage on a server host
The Cisco Secure Workload platform tracks the hashes of running processes. It calculates a SHA-256 hash of each process it observes and stores the hashes in its corpus. It then cross-checks the hashes against public hash databases such as VirusTotal.
Many security practitioners use process hash analytics to tell good software from bad by cross-checking hashes against databases. By automating that process, the Cisco Secure Workload platform makes it simple to apply it across the data center at scale.
In Figure 8, you can see the process hashes on the workload. The hash of the anvil process is cross-checked by following its process hash link to VirusTotal. The screenshot from VirusTotal in Figure 9 shows the verdicts of VirusTotal’s partners on the anvil process hash.
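Computing a process hash amounts to hashing the executable image. The sketch below streams a file through SHA-256 and sorts processes into verdict buckets using locally held sets of known-bad and known-good hashes. It does not call VirusTotal; the two sets stand in for whatever reputation source is used. On Linux, one way to reach the executable of a running process is the `/proc/<pid>/exe` link. That path is an assumption about the platform, not something taken from the original text.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large binaries are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(
    process_hashes: dict[str, str], known_bad: set[str], known_good: set[str]
) -> dict[str, str]:
    """Label each process 'malicious', 'known-good', or 'unknown' by its hash."""
    verdicts = {}
    for process, h in process_hashes.items():
        if h in known_bad:
            verdicts[process] = "malicious"
        elif h in known_good:
            verdicts[process] = "known-good"
        else:
            verdicts[process] = "unknown"
    return verdicts
```

For example, `triage({"anvil": sha256_file("/proc/1234/exe")}, known_bad, known_good)` would classify a single process, where 1234 is a placeholder process ID. The "unknown" bucket is the one worth sending to an external reputation service.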
Process hash information from servers
Verdict provided for the process hash by VirusTotal
In the future, the Secure Workload platform will programmatically integrate with VirusTotal, both pulling data from it and pushing data to it. It will share its own verdicts on process hashes and programmatically consume verdicts from other security companies, and this information could then be used in Secure Workload enforcement policies. In addition, it will cross-check the process hashes it sees against the National Software Reference Library (NSRL) RDS datasets of known binaries and applications.
Cisco Secure Workload platform and application behavior allowed listing
The Cisco Secure Workload platform studies the behavior of the various processes and applications in the workload, measuring them against known bad behavior sequences, and it also factors in the process hashes it collects. By studying many malware samples, the Secure Workload engineering team deconstructed malware into its basic building blocks. The platform therefore has clear, crisp definitions of these building blocks and watches for them.
In the current release, the Cisco Secure Workload platform looks for the following suspicious patterns:
● Shell code execution: Looks for the patterns used by shell code.
● Privilege escalation: Watches for privilege changes from a lower privilege to a higher privilege in the process lineage tree.
● Side channel attacks: Cisco Secure Workload platform watches for cache-timing attacks and page table fault bursts. Using these, it can detect Meltdown, Spectre, and other cache-timing attacks.
● Raw socket creation: Creation of a raw socket by a process other than the standard ones that legitimately need it (for example, ping).
● User login suspicious behavior: Cisco Secure Workload platform watches user login failures and user login methods.
● Interesting file access: The Cisco Secure Workload platform can be configured to watch sensitive files.
● File access from a different user: Cisco Secure Workload platform learns the normal behavior of which file is accessed by which user.
● Unseen command: The Cisco Secure Workload platform learns the set of commands, and the lineage of each command, over time. Any new command, or a known command with a different lineage, draws the platform’s attention. (See Figure 10.)
Process-related events that can be tracked and used to set up alerts
In the following examples, the Cisco Secure Workload platform did not have any malware signatures. It simply looked for the building blocks of bad behavior, cross-checked them against the process history and baseline, and raised an alert only when it was confident that the incident was worth noting. In other words, this is application behavior allowed listing. The platform retains these events longer than it retains the full process tree.
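Several of these patterns reduce to simple rules over a table of processes. The sketch below illustrates three of them: privilege escalation, raw socket creation, and unseen command lineage. It rests on stated assumptions. The process table format (process ID mapped to parent process ID, user ID, and process name) is hypothetical. The lists of expected setuid helpers and raw-socket users are illustrative. The baseline of known lineages is assumed to have been learned earlier. This is not the platform's implementation.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    ppid: int
    uid: int
    name: str


# Illustrative lists; a real deployment would learn or tune these per fleet.
EXPECTED_SETUID = {"sudo", "su", "passwd"}
EXPECTED_RAW_SOCKET = {"ping", "traceroute"}


def lineage(procs: dict[int, Proc], pid: int) -> tuple[str, ...]:
    """Process names from the oldest known ancestor down to pid."""
    chain, seen = [], set()
    while pid in procs and pid not in seen:  # 'seen' guards against cycles
        seen.add(pid)
        chain.append(procs[pid].name)
        pid = procs[pid].ppid
    return tuple(reversed(chain))


def privilege_escalations(procs: dict[int, Proc]) -> list[int]:
    """Root processes whose parent was unprivileged, minus expected setuid helpers."""
    flagged = []
    for pid, p in procs.items():
        parent = procs.get(p.ppid)
        if (parent is not None and parent.uid != 0 and p.uid == 0
                and p.name not in EXPECTED_SETUID):
            flagged.append(pid)
    return sorted(flagged)


def raw_socket_alerts(procs: dict[int, Proc], raw_socket_pids: list[int]) -> list[int]:
    """Raw socket creators that are not on the expected list (unknown pids included)."""
    return sorted(
        pid for pid in raw_socket_pids
        if pid not in procs or procs[pid].name not in EXPECTED_RAW_SOCKET
    )


def unseen_lineages(procs: dict[int, Proc], baseline: set[tuple[str, ...]]) -> list[int]:
    """Processes whose command lineage was never observed during learning."""
    return sorted(pid for pid in procs if lineage(procs, pid) not in baseline)


procs = {
    1: Proc(0, 0, "systemd"),
    10: Proc(1, 1000, "bash"),
    11: Proc(10, 0, "sudo"),     # expected privilege change
    12: Proc(10, 1000, "ping"),
    20: Proc(1, 33, "nginx"),
    21: Proc(20, 33, "sh"),      # web server spawning a shell...
    22: Proc(21, 0, "python3"),  # ...that then runs as root
}
baseline = {lineage(procs, pid) for pid in (1, 10, 11, 12, 20)}

print(privilege_escalations(procs))        # [22]
print(raw_socket_alerts(procs, [12, 22]))  # [22]
print(unseen_lineages(procs, baseline))    # [21, 22]
```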
Let’s take an example and see Cisco Secure Workload platform detecting a shell code execution. Figure 11 shows how Cisco Secure Workload platform detects shell code execution, triggers an alert, and tracks all the actions done by the malicious process.
Detecting shell-code execution and providing forensics
The user can then drill down into the malicious process and study it further. The Cisco Secure Workload platform tracks all 25 descendant processes forked by the malicious bash shell script. (See Figure 12.)
View of all the processes forked by the malicious shell script
Figure 13 shows a large process tree where stage one of the malware, which is the shell code, is loading or installing the second stage of the malware. In the process, 176 child processes are created and executed by the malware.
Tree view of all processes spawned by the malware
Custom rule engine
The Cisco Secure Workload platform also lets users define their own rules, based on its behavioral building blocks, to find custom malware. Users can define the desired action and severity level, as shown in Figure 14 and Figure 15.
Forensics alert definition window
Forensics alert definition window
When Meltdown and Spectre (https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html) were announced, our team tried the exploits in the lab and checked how the Secure Workload platform handled them. It detected anomalies in kernel page table fault behavior and also observed the cache timing attacks, without any extensions to the platform. We have since added the strings “Meltdown” and “Spectre” to the reports; other than that, the engine built by the engineering team stood the test.
The Cisco Secure Workload platform detected the Proof-of-Concept (PoC) exploits for both vulnerabilities. This was possible because it has a side channel attack detection and page table operations engine, which picks up bursts of kernel page table faults (used to detect Meltdown) and cache timing attack patterns (used to detect Spectre and other side channel attacks).
In Spectre attacks, the attacker needs to use a cache side channel to obtain the data, and Meltdown relies on the same behavior. There are a variety of cache side channel attacks, including Flush+Reload, Prime+Probe, and other variants. A primary behavior of these attacks is the large number of last-level cache (LLC) misses they generate.
In Meltdown, when an attacker accesses a memory location that is inaccessible to it, a page fault is triggered; when a user attacks kernel memory space, a kernel page fault is triggered. To dump the kernel contents, the attacker needs to trigger a large number of kernel page faults to reduce the noise in the cache side channel. This is what Meltdown does. The attacker can get stealthier by triggering only a small number of kernel page faults periodically, but then the cache side channel becomes polluted, because the attacker has no control over the execution of other processes. In the real world, the kernel rarely has page faults, because its pages are already loaded. Therefore, even a small number of kernel page faults in a short period (for example, one second) is an interesting signal for the platform.
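The cache-miss signal described above could be approximated from per-interval hardware counter samples. On Linux, the perf subsystem exposes last-level cache counters such as LLC-load-misses. The thresholds below are hypothetical placeholders. Realistic values depend on the CPU and the workload, and a production detector would baseline them per process rather than hard-code them.

```python
def cache_attack_intervals(
    samples: list[tuple[int, int, int]],
    min_refs: int = 100_000,
    miss_ratio: float = 0.9,
) -> list[int]:
    """samples: (interval, llc_references, llc_misses) for one process.

    Flag intervals with heavy LLC traffic in which almost every reference
    misses, a pattern that Flush+Reload style probing tends to produce.
    """
    flagged = []
    for interval, refs, misses in samples:
        # With min_refs > 0, this check also guards against dividing by zero.
        if refs >= min_refs and misses / refs > miss_ratio:
            flagged.append(interval)
    return flagged


print(cache_attack_intervals([(0, 1_000_000, 50_000), (1, 1_000_000, 950_000), (2, 10, 10)]))  # [1]
```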
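The kernel page-fault signal amounts to burst detection over a sliding window. The sketch uses the one-second window from the text. The default threshold of five faults is a hypothetical placeholder, and timestamps are assumed to arrive in nondecreasing order.

```python
from collections import deque


class FaultBurstDetector:
    """Flag bursts of kernel-mode page faults within a sliding time window."""

    def __init__(self, window_s: float = 1.0, threshold: int = 5) -> None:
        self.window_s = window_s      # one second, as in the text
        self.threshold = threshold    # hypothetical; tune against a quiet baseline
        self._times: deque[float] = deque()

    def observe(self, t: float) -> bool:
        """Record a kernel page fault at time t (seconds); return True during a burst."""
        self._times.append(t)
        # Drop faults that have slid out of the window.
        while self._times and t - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.threshold


detector = FaultBurstDetector(window_s=1.0, threshold=3)
print([detector.observe(t) for t in (0.0, 0.4, 0.8, 3.0)])  # [False, False, True, False]
```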
Figures 16 through 19 are some interesting snapshots of Meltdown and Spectre detection in action.
View of a Meltdown event
Meltdown event details
View of a Spectre event
Spectre event details
You now have in-depth knowledge of the new cloud workload protection capabilities added to the Cisco Secure Workload platform. The must-have cloud workload protection characteristics limit viable solutions to a Goldilocks zone, one that extends from the edge of the infrastructure deep into the workload.
You learned about software vulnerability detection, application behavior allowed listing, and how the Cisco Secure Workload platform ties these capabilities together with enforcement, enabling the building blocks of a robust workload protection solution. You can learn more about currently available features of the Secure Workload platform, including high-resolution network visibility and operationalizing microsegmentation enforcement for cloud workload protection at scale, by visiting the Cisco Secure Workload website: https://www.cisco.com/go/Secure Workload.