Pre-Upgrade/Downgrade Checklists

Check Basic Information on Your Fabric

Check some basic information on your fabric to ensure that you have everything that you need for a smooth upgrade. Specifically, it is critical that you clear all faults. Although some faults are described as specific issues in Check Configurations and Conditions That May Cause An Upgrade Failure, you should always clear any faults before performing an upgrade, except for faults that are expected due to configurations still in the staging phase.

  • Clear all your faults

  • Perform a configuration export with AES Encryption

  • Verify access to out-of-band IP addresses of all your ACI nodes (all your APIC nodes and switch nodes)

  • Verify CIMC access for all your APICs

  • Verify console access for all your switches

  • Understand Changes in Behavior in Release Notes of both APIC and ACI switches for versions between the target and current version

  • Understand Open Issues and Known Issues in Release Notes of both APIC and ACI switches for the target version
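The first item above, clearing all faults, can be audited through the APIC REST API, which exposes active faults through the faultInst class (the same class queried by the moquery examples later in this chapter). The following is a minimal Python sketch that summarizes unacknowledged faults by severity; the sample records are hypothetical, shaped like the API's JSON response:

```python
# Minimal sketch: summarize outstanding faults from an APIC fault export.
# The JSON shape mirrors a faultInst class query response; the sample
# records below are hypothetical, not from a real fabric.
from collections import Counter

def summarize_faults(fault_records):
    """Count unacknowledged faults by severity, e.g. {'major': 1}."""
    return dict(Counter(
        f["faultInst"]["attributes"]["severity"]
        for f in fault_records
        if f["faultInst"]["attributes"]["ack"] == "no"
    ))

sample = [
    {"faultInst": {"attributes": {"code": "F1528", "severity": "major", "ack": "no"}}},
    {"faultInst": {"attributes": {"code": "F0467", "severity": "minor", "ack": "no"}}},
    {"faultInst": {"attributes": {"code": "F0103", "severity": "warning", "ack": "yes"}}},
]

print(summarize_faults(sample))  # acknowledged faults are excluded
```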

Check Configurations and Conditions That May Cause An Upgrade Failure

The table below lists configurations and conditions that you should check to avoid upgrade failures or any known issues related to upgrades.

The items in the table should be detected automatically by the pre-upgrade validator embedded in APIC. However, some items might not be included in the APIC at this time or your APICs might be running a version that does not yet have the check implemented. In those cases, you can run the Pre-Upgrade Validator app from dcappcenter.cisco.com, or use the standalone script referenced below.

  • Pre-Upgrade Validator (APIC): A validator embedded in the APIC Upgrade Configuration. This is automatically performed when configuring an update group for APIC or switches.

  • Pre-Upgrade Validator (App Center app): A validator that can be installed on APICs as an app that can be downloaded through dcappcenter.cisco.com. This can be run on demand and is supported on release 3.2 and later.

  • Script: For any feature not currently implemented in the Pre-Upgrade Validator, a standalone script can be run directly on the APIC to validate any existing issues prior to upgrading. The script supports all versions of software. See https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script for more details about the script.

Details of each item are explained in Details of Configurations and Conditions for Pre-Upgrade Validations.

Each entry below lists the validation item, followed by the associated faults or defects (if any) and the applicable or version-specific release.

  • All Your APICs Are In a Fully Fit State – 4.2(6)

  • All Your ACI Switches Are In an Active State

  • Compatibility (Target ACI Version) – 4.2(4)

  • Compatibility (CIMC Version) – 4.2(4)

  • Compatibility (APIC, Switch Hardware) – 4.2(4)

  • Compatibility (Remote Leaf Switch) – To 5.0(1) or later

  • NTP (Clocks Are Synchronized Across the Fabric) – 4.2(5)

  • Implementation Change for Firmware Update Groups on APICs from Release 4.0(1)

  • Configurations That Must Be Disabled Prior To Upgrades – AppCenter App: 5.2(c)

  • Rule 1 – Divide your leaf and spine switches into at least two groups

  • Rule 2 – Determine how spine switches should be grouped – 4.2(4) [1]

  • Rule 3 – Determine how leaf switches should be grouped

  • Switch Graceful Upgrade Guidelines

  • All Switch Nodes In vPC – 4.2(4)

  • APIC Disk Space Usage (F1527: 80% - 85%, F1528: 85% - 90%, F1529: 90% or more) – 4.2(1)

  • ACI Switch bootflash Usage (F1821: 90% or more) – 4.2(4)

  • MD5sum Check for APIC and Switch Firmware – 5.2(3e)

  • APIC Firmware Synchronization Across APICs – 5.1(1)

  • Filesystem on Standby APICs – 5.2(3a)

  • EPG Configuration on Ports Connected to APICs (F0467: port-configured-for-apic) – 6.0(1g)

  • Conflicting Interface L2/L3 Mode (F0467: port-configured-as-l2, port-configured-as-l3) – 5.2(4d)

  • Conflicting L3Out Subnets for Contracts (F0467: prefix-entry-already-in-use) – 6.0(1g)

  • Overlapping BD Subnets In the Same VRF (F0469: duplicate-subnets-within-ctx, F1425: subnet-overlap) – 5.2(4d)

  • SSD Health Status on APICs (F0101: not available, F2730: less than 10% remaining, F2731: less than 5% remaining, F2732: less than 1% remaining) – 4.2(1)

  • SSD Health Status on ACI Switches (F3074: reached 80% lifetime, F3073: reached 90% lifetime) – 4.2(1)

  • VMM Controller Connectivity (F0130) – 4.2(1)

  • Missing LLDP/CDP Adjacency Between Leaf Nodes and VMM Hypervisors (F606391) – 4.2(1)

  • Different Infra VLAN Being Injected Through LLDP (F0454: infra-vlan-mismatch)

  • Policy CAM Programming for Contracts (F3545) – 5.1(1)

  • L3Out Subnets Programming for Contracts (F3544) – 5.1(1)

  • General Scalability Limits

  • Overlapping VLAN Pool

  • L3Out MTU Mismatch

  • L3Out BGP Peer Connectivity Profile Under a Node Profile Without a Loopback (CSCvm28482) – to 4.1(2) or later

  • L3Out Incorrect Route Map Direction (CSCvm75395) – to 4.1(1) or later

  • Compatibility (Remote Leaf Switch) (CSCvs16767) – to 14.2(2)

  • EP Announce Version Mismatch (CSCvi76161) – to 13.2(2) or later

  • Intersight Device Connector is Upgrading – 4.2(5)

[1] IPN connectivity is not checked by the pre-upgrade validator.

Note

Not every validation item is covered by each automated option (the APIC validator, the app, or the script); an item without a mark in a validator's column is not yet covered by that option.


Details of Configurations and Conditions for Pre-Upgrade Validations

All Your APICs Are In a Fully Fit State

Check the status in System > Dashboard > Controller to ensure that the cluster status on all your APICs is in a Fully Fit state. If one or more of the APICs are in another state, such as Data Layer Partially Diverged, you must resolve the status of your APIC cluster first.

If your APICs are currently on release 4.2(1) or later, the command acidiag cluster on each APIC CLI will check the basic items related to APIC clustering for you. If your APICs are on an earlier release, follow Initial Fabric Setup in ACI Troubleshooting Guide 2nd Edition (http://cs.co/9003ybZ1d).

All Your ACI Switches Are In an Active State

Check Fabric > Inventory > Fabric Membership in the APIC GUI to ensure that all your ACI switches are in an Active state. If one or more of the ACI switches are in another state, such as Inactive or Maintenance, you must resolve those issues first.

  • Inactive: This means that the switch has fabric discovery issues, such as IP reachability from APICs through the ACI infra network. If your switches are currently on release 14.2(1) or later, the command show discoveryissues on the switch CLI will check the basic items related to switch fabric discovery for you.

  • Maintenance: This means that the switch is in Maintenance Mode through the GIR (Graceful Insertion and Removal) operation. This implies that the switch is isolated from the fabric and does not process most APIC communications, including upgrade-related communications. You must bring the switch back to the Active state before you can perform an upgrade. If you want to upgrade switches gracefully by isolating them from the network first, consider a Graceful Upgrade instead. See Graceful Upgrade of ACI Switches for details.

Compatibility (Target ACI Version)

Check the APIC Upgrade/Downgrade Support Matrix for the supported upgrade paths from your current version.

Compatibility (CIMC Version)

Check the APIC Upgrade/Downgrade Support Matrix for the supported UCS HUU version for your target APIC version to make sure all server components are running the version from the supported HUU bundle.

Compatibility (APIC, Switch Hardware)

Check the Release Notes of both the APIC and ACI switches for the target version to make sure your hardware is supported.

Compatibility (Remote Leaf Switch)

It is critical to enable Direct Traffic Forwarding for remote leaf switches prior to upgrading to APIC release 5.0(1) as the option becomes mandatory starting from this release.

Direct Traffic Forwarding can be enabled starting from APIC release 4.1(2). Note that additional TEP IP address configuration, such as Routable Subnets or External TEP, might be required before this option can be enabled. This means that if you are running a version prior to 4.1(2) and have remote leaf switches configured, you cannot directly upgrade to release 5.0. In this case, we recommend that you upgrade to a 4.2 release, enable Direct Traffic Forwarding, and then upgrade to the desired 5.0 version.

See "Upgrade the Remote Leaf Switches and Enable Direct Traffic Forwarding" in the Cisco APIC Layer 3 Networking Configuration Guide for more information.

A related issue is addressed in "Remote Leaf Switch with Direct Traffic Forwarding Enabled" (CSCvs16767). If you are upgrading to the 14.2(2) release while Direct Traffic Forwarding is enabled for the remote leaf nodes, you may hit this defect, which can cause remote leaf nodes to crash due to the Multicast FIB Distribution Manager (MFDM) process. This issue happens only when spine nodes are upgraded to release 14.2(2) first while remote leaf nodes with Direct Traffic Forwarding are still on release 14.1(2). Note that Direct Traffic Forwarding was introduced in release 14.1(2).

To avoid this issue, it is critical that you upgrade to release 14.2(3) or a later release instead of release 14.2(2) when Direct Traffic Forwarding is enabled.

If you must upgrade to release 14.2(2) for any reason, you must upgrade the remote leaf nodes first to avoid this issue.

NTP (Clocks Are Synchronized Across the Fabric)

Ensure that NTP is configured on both the APICs and the switches along with the necessary IP reachability to the NTP servers through Out-of-band (OOB) or In-band (INB) from each individual node.

Check the following sections in Troubleshooting Cisco ACI – Second Edition:

  • In-band and out-of-band management

  • Pod Policies — BGP RR / Date&Time / SNMP

Implementation Change for Firmware Update Groups on APICs from Release 4.0(1)

Beginning with APIC release 4.0(1), there is only one type of switch update group instead of the two that were used in previous releases (the firmware groups and maintenance groups). By consolidating two groups into one, the upgrade configurations are simplified. However, when upgrading Cisco APICs from a pre-4.0 release to release 4.0(1) or later, you must remove all firmware group and maintenance group policies prior to the upgrade.

  • To remove a firmware group policy, navigate to Admin > Firmware > Fabric Node Firmware > Firmware Groups, then right-click on the name of firmware group and choose Delete the Firmware Group.

  • To remove a maintenance group policy, navigate to Admin > Firmware > Fabric Node Firmware > Maintenance Groups, then right-click on the name of maintenance group and choose Delete the Maintenance Group.
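If you prefer to script this cleanup rather than use the GUI, the same deletions can be expressed as REST payloads that set an object's status to deleted. The class names (firmwareFwGrp, maintMaintGrp) and DN patterns below are assumptions based on the pre-4.0 object model; verify them against your own configuration before use:

```python
# Minimal sketch: build the JSON bodies that delete firmware group and
# maintenance group policies via the APIC REST API. The class names and
# DNs are assumptions (hypothetical group name "G1"); confirm them on
# your fabric before posting anything.
import json

def delete_payload(class_name, dn):
    """Build the JSON body that marks one policy object as deleted."""
    return {class_name: {"attributes": {"dn": dn, "status": "deleted"}}}

# Hypothetical DNs for a firmware group and a maintenance group named G1
fw_body = delete_payload("firmwareFwGrp", "uni/fabric/fwgrp-G1")
maint_body = delete_payload("maintMaintGrp", "uni/fabric/maintgrp-G1")

print(json.dumps(fw_body))
```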

Once the APICs are upgraded to 4.0(1) or later, you can create new switch update groups and perform the switch upgrade from pre-14.0 release to 14.0(1) or later.

This is applicable only when you are upgrading your APICs from pre-4.0 to 4.0(1) or later. Once your APICs are on 4.0(1) or later, you do not have to worry about this for any further upgrades.


Note

Internally, APICs running the 4.0(1) or later release handle the switch update groups with the same objects as the old maintenance group policies (such as maintMaintP), but with additional attributes. If you are using an API to configure upgrade policies, you should only use maintenance group policies starting from APIC release 4.0(1) or later without manually creating any firmware group policies, unlike the old pre-4.0 releases.


Configurations That Must Be Disabled Prior To Upgrades

The following features must be disabled prior to upgrades:

  • App Center apps

  • Maintenance Mode through Fabric > Inventory > Fabric Membership > Maintenance (GIR)

  • Config Zone

  • Rogue Endpoint (only when the running version is 14.1(x) or when upgrading to 14.1(x))

Deprecated Managed Objects

The pre-upgrade validation script checks for the existence of the following deprecated managed objects on the running version of the software and blocks the upgrade if they exist in the configuration. You must update your script or code to use the newer managed objects.

  • Class: config:RsExportDestination

  • Class: config:RsImportSource

  • Class: fabric:RsResMonFabricPol

  • Class: infra:RsResMonInfraPol

  • Class: fabric:RsResMonCommonPol

  • Class: trig:Triggered

  • Class: trig:TriggeredWindow

  • Class: fv:CCg

  • Class: fv:RsToCtrct

  • Class: mgmt:RsOobEpg

  • Class: mgmt:RsInbEpg

  • Class: vns:RsCIfAtt

  • Class: fault:RsHealthCtrlrRetP

  • Class: fv:PndgCtrctCont

  • Class: vz:RsAnyToCtrct

  • Class: fv:PndgCtrctEpgCont

  • Class: fv:AREpPUpd

  • Class: vns:Chkr

  • Class: aaa:RsFabricSetup

  • Class: ap:PluginPol

  • Class: tag:ExtMngdInst

  • Class: telemetry:Server

  • Class: telemetry:FltPolGrp

  • Class: telemetry:FilterPolicy

  • Class: telemetry:FlowServerP

  • Class: pol:RsFabricSelfCAEp

  • Class: fabric:PodDhcpServer

  • Class: fabric:SetupAllocP

  • Class: fabric:AssociatedSetupP

  • Class: cloud:AEPgSelector

  • Class: fv:VmmSelCont
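As an illustration, a saved configuration export (JSON) can be scanned offline for these classes. This is a minimal sketch, not the actual validation script: the deprecated list is abbreviated, class names use the API format (package prefix joined to the class, e.g. config:RsExportDestination becomes configRsExportDestination), and the sample tree is hypothetical:

```python
# Minimal sketch: walk a config export tree and collect any deprecated
# classes in use. DEPRECATED is an abbreviated, API-format subset of the
# full list above; the sample tree is hypothetical.
DEPRECATED = {"configRsExportDestination", "trigTriggered", "fvCCg", "mgmtRsOobEpg"}

def find_deprecated(node, found=None):
    """Recursively collect deprecated class names found in the tree."""
    if found is None:
        found = set()
    for cls, body in node.items():
        if cls in DEPRECATED:
            found.add(cls)
        for child in body.get("children", []):
            find_deprecated(child, found)
    return found

sample = {"polUni": {"children": [
    {"fvTenant": {"children": [{"fvCCg": {}}]}},
]}}
print(sorted(find_deprecated(sample)))  # ['fvCCg']
```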

All Switch Nodes In vPC

High availability (HA) is always the key in network design. There are multiple ways to achieve this, such as with server configurations like NIC teaming, virtualization technology like VMware vMotion, or network device technology like link aggregation across different chassis. ACI provides high availability using virtual Port Channel (vPC) as the link aggregation across chassis.

It is important to keep the traffic flowing even during upgrades by upgrading one switch in the same HA pair at a time. In ACI, that will be a vPC pair unless you have other HA technologies on the server or virtualization side.

The Pre-Upgrade Validator checks if all switch nodes are in a vPC pair. This check is done when you upgrade APICs instead of switches because in ACI, APICs are upgraded first prior to switches, and configuring a new vPC pair potentially requires a network design change and that should be done prior to any upgrades. If you have other HA technologies in place, you can ignore this validation. vPC is not a requirement for the upgrade to complete, but the built-in tools to prevent leaf switches in a vPC domain from upgrading at the same time will not work if they are not in a vPC. If you are not using vPC, you must ensure the switches being upgraded will not cause an outage if done at the same time.
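The upgrade-grouping rule above can be sketched as a simple partition: put the two members of each vPC pair into different switch update groups so one side of every pair stays forwarding. A minimal sketch with hypothetical node IDs:

```python
# Minimal sketch: split vPC pairs across two switch update groups so that
# at most one member of each pair upgrades at a time. Node IDs below are
# hypothetical.
def split_vpc_pairs(pairs):
    """Return two upgrade groups, one member of each vPC pair per group."""
    group_a, group_b = [], []
    for left, right in pairs:
        group_a.append(left)
        group_b.append(right)
    return group_a, group_b

pairs = [(101, 102), (103, 104)]
print(split_vpc_pairs(pairs))  # ([101, 103], [102, 104])
```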

APIC Disk Space Usage (F1527, F1528, F1529)

If an APIC is running low on disk space for any reason, the APIC upgrade can fail. The APIC will raise three different faults depending on the amount of disk space remaining. If any of these faults are raised on the system, the issue should be resolved prior to performing the upgrade.

  • F1527: A warning level fault for APIC disk space usage. This is raised when the utilization is between 80% and 85%.

  • F1528: A major level fault for APIC disk space usage. This is raised when the utilization is between 85% and 90%.

  • F1529: A critical level fault for APIC disk space usage. This is raised when the utilization is 90% or higher.
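The thresholds above can be summarized as a small classification. This is an illustrative sketch of the mapping, not the APIC's actual fault logic:

```python
# Minimal sketch: map an APIC disk utilization percentage to the fault
# code it would raise per the thresholds above (F1527/F1528/F1529).
def apic_disk_fault(util_pct):
    """Return the fault code for a utilization percentage, or None."""
    if util_pct >= 90:
        return "F1529"
    if util_pct >= 85:
        return "F1528"
    if util_pct >= 80:
        return "F1527"
    return None

print(apic_disk_fault(86))  # a major fault, as in the example below
```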

You can run the following moqueries on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well. In the example below, where the faults are raised against /firmware, you can simply remove unnecessary firmware images under Admin > Firmware in the APIC GUI. Do not use the Linux command rm to remove an image directly from /firmware, as the firmware images are synchronized across APICs. If the fault is raised against a disk space that you are not aware of, contact Cisco TAC to resolve the issue prior to the upgrade.

Fault Example (F1528: Major Fault for APIC disk space usage)

The following shows an example situation where the disk space in /firmware is running low on APIC 1 (node 1).

admin@apic1:~> moquery -c faultInst -f 'fault.Inst.code=="F1528"'
Total Objects shown: 1
 
# fault.Inst
code             : F1528
ack              : no
annotation       :
cause            : equipment-full
changeSet        : available (Old: 5646352, New: 6036744), capUtilized (Old: 86, New: 85), used (Old: 33393968, New: 33003576)
childAction      :
created          : 2021-05-27T11:58:19.061-04:00
delegated        : no
descr            : Storage unit /firmware on Node 1 with hostname apic1 mounted at /firmware is 85% full
dn               : topology/pod-1/node-1/sys/ch/p-[/firmware]-f-[/dev/mapper/vg_ifc0-firmware]/fault-F1528
domain           : infra
extMngdBy        : undefined
highestSeverity  : major
lastTransition   : 2021-05-27T12:01:37.128-04:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : major
prevSeverity     : major
rn               : fault-F1528
rule             : eqpt-storage-full-major
severity         : major
status           :
subject          : equipment-full
type             : operational
uid              :

Note that all three faults look the same except for the utilization percentage and the fault’s severity.

ACI Switch bootflash Usage

ACI switches mainly have two different faults about the filesystem usage of each partition:

  • F1820: A minor level fault for switch partition usage. This is raised when the utilization of the partition exceeds the minor threshold.

  • F1821: A major level fault for switch partition usage. This is raised when the utilization of the partition exceeds the major threshold.

The minor and major thresholds depend on the partition. The critical partition for upgrades is /bootflash, whose thresholds are 80% (minor) and 90% (major).

On top of this, every switch node has a built-in behavior that takes action to keep /bootflash usage at or below 50%. This is specifically to ensure that switch upgrades can successfully transfer and extract the switch image during an upgrade.

To do this, an internal script monitors /bootflash usage and, if usage exceeds 50%, starts removing files to free up space. Given its aggressiveness, there are corner case scenarios in which this cleanup script can remove the very switch image the upgrade intends to use, which can result in the switch booting into the loader prompt because the boot image was removed from /bootflash.

To prevent this, check /bootflash prior to an upgrade and understand what is written there and why. Once understood, clear out unnecessary /bootflash files to ensure there is enough space to avoid the auto-cleanup corner case scenario.
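Both the fault thresholds and the 50% cleanup trigger can be summarized as a small classification. This is an illustrative sketch, not the switch's actual logic:

```python
# Minimal sketch: evaluate a switch /bootflash utilization percentage
# against the fault thresholds above (80% minor/F1820, 90% major/F1821)
# and the 50% internal cleanup trigger.
def bootflash_status(util_pct):
    """Return (fault_code_or_None, cleanup_may_trigger)."""
    fault = "F1821" if util_pct >= 90 else "F1820" if util_pct >= 80 else None
    return fault, util_pct > 50

print(bootflash_status(54))  # no fault yet, but the cleanup script may run
```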

The Pre-Upgrade Validator (both APIC and App) monitors the fault F1821, which can capture the high utilization of any partition. When this fault is present, we recommend that you resolve it prior to the upgrade even if the fault is not for bootflash.

The ACI Pre-Upgrade Validation script described earlier in this chapter focuses specifically on /bootflash utilization on each switch, checking whether usage exceeds 50% and might therefore trigger the internal cleanup script.

You can run either the Pre-Upgrade Validator or the script to check this issue. The following sections provide details on checking /bootflash usage and the internal 50% cleanup behavior.

Validation

Once logged into a leaf switch CLI, /bootflash usage can be checked using df -h:

leaf1# df -h
Filesystem             Size    Used   Avail   Use%   Mounted on
rootfs                 2.5G    935M   1.6G    38%    /bin
/dev/sda4               12G    5.7G   4.9G    54%    /bootflash
/dev/sda2              4.7G    9.6M   4.4G     1%    /recovery
/dev/mapper/map-sda9    11G    5.7G   4.2G    58%    /isan/lib
none                   3.0G    602M   2.5G    20%    /dev/shm
none                    50M    3.4M    47M     7%    /etc
/dev/sda6               56M    1.3M    50M     3%    /mnt/cfg/1
/dev/sda5               56M    1.3M    50M     3%    /mnt/cfg/0
/dev/sda8               15G    140M    15G     1%    /mnt/ifc/log
/dev/sda3              115M     52M    54M    50%    /mnt/pss
none                   1.5G    2.3M   1.5G     1%    /tmp
none                    50M    240K    50M     1%    /var/log
/dev/sda7               12G    1.4G   9.3G    13%    /logflash
none                   350M     54M   297M    16%    /var/log/dme/log/dme_logs
none                   512M     24M   489M     5%    /var/sysmgr/mem_logs
none                    40M    4.0K    40M     1%    /var/sysmgr/startup-cfg
none                   500M     0     500M     0%    /volatile

Verifying /bootflash auto-removal

If you suspect that the auto cleanup removed some files within /bootflash, you can review a log to validate this:

leaf1# egrep "higher|removed" /mnt/pss/core_control.log
[2020-07-22 16:52:08.928318] Bootflash Usage is higher than 50%!!
[2020-07-22 16:52:08.931990] File: MemoryLog.65%_usage removed !!
[2020-07-22 16:52:08.943914] File: mem_log.txt.old.gz removed !!
[2020-07-22 16:52:08.955376] File: libmon.logs removed !!
[2020-07-22 16:52:08.966686] File: urib_api_log.txt removed !!
[2020-07-22 16:52:08.977832] File: disk_log.txt removed !!
[2020-07-22 16:52:08.989102] File: mem_log.txt removed !!
[2020-07-22 16:52:09.414572] File: aci-n9000-dk9.13.2.1m.bin removed !!

You can run the following moquery on the CLI of any APIC to check the usage of bootflash of each switch node.

f2-apic1# moquery -c eqptcapacityFSPartition -f 'eqptcapacity.FSPartition.path=="/bootflash"'
Total Objects shown: 6

# eqptcapacity.FSPartition
name            : bootflash
avail           : 7214920
childAction     :
dn              : topology/pod-1/node-101/sys/eqptcapacity/fspartition-bootflash
memAlert        : normal
modTs           : never
monPolDn        : uni/fabric/monfab-default
path            : /bootflash
rn              : fspartition-bootflash
status          :
used            : 4320184

MD5sum Check for APIC and Switch Firmware

When performing an upgrade in an ACI fabric, there are multiple image transfers that must occur to prepare all nodes for upgrades. Most of these transfers perform a first level image validation. However, in the event of a failure, there is value in double-checking the image on each respective node.

Upgrade Image transfer touchpoints:

  1. Transfer the image onto your desktop/file server from cisco.com.

    Manually run MD5 against this image. You can validate the expected MD5 of the image from cisco.com.

  2. Upload the image from your desktop or ftp server onto one of the APICs.

  3. Once the image is added into the firmware repository, the initially uploaded APIC will copy that image to the remaining APICs in the cluster.

    You can manually check MD5 on each APIC's copy of the upgrade image by running the md5sum command against each APIC's copy of the image.

    For example:

    APIC1# md5sum /firmware/fwrepos/fwrepo/aci-apic-dk9.5.2.1g.bin
    f4c79ac1bb3070b4555e507c3d310826 /firmware/fwrepos/fwrepo/aci-apic-dk9.5.2.1g.bin
  4. The switches will eventually each get a copy of the switch .bin image when they are preparing for upgrade.

    You can run MD5 on the individual switch image within /bootflash.

    For example:

    leaf1# md5sum /bootflash/aci-n9000-dk9.15.2.1g.bin
    02e3b3fb45a51e36db28e7ff917a0c96 /bootflash/aci-n9000-dk9.15.2.1g.bin
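The md5sum checks above can also be scripted. Below is a minimal sketch using Python's hashlib to compare a computed digest against the expected value published on cisco.com; the file content here is a stand-in, not a real ACI image:

```python
# Minimal sketch: verify an image's MD5 digest against an expected value.
# The byte string below is a stand-in for a real firmware image.
import hashlib

def md5_matches(data: bytes, expected_hex: str) -> bool:
    """True if the MD5 digest of data equals the expected hex string."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()

blob = b"not-a-real-aci-image"
digest = hashlib.md5(blob).hexdigest()
print(md5_matches(blob, digest))  # True for a matching digest
```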

APIC Firmware Synchronization Across APICs

After the image is downloaded to one of the APICs, it is synchronized to all APICs in the cluster. This is especially critical for APIC images because each APIC needs a local copy of the image to upgrade itself.

To verify, you can log in to each APIC and check /firmware/fwrepos/fwrepo for the target image.

If the image is missing on one or more of the APICs, wait about five minutes if the download has just completed. If the image is still missing, ensure that the APIC clustering status is healthy on all of the APICs, remove the image through the GUI or API (but not with the Linux command rm), and then re-download the image to trigger the file synchronization again. If the image is still missing, contact Cisco TAC.
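A quick way to spot a missing image is to collect the /firmware/fwrepos/fwrepo listing from each APIC and compare them. A minimal sketch with hypothetical listings:

```python
# Minimal sketch: compare firmware repository listings gathered from each
# APIC and report which APICs lack the target image. The listings below
# are hypothetical.
def missing_image(listings, image):
    """Return the APICs whose firmware repository lacks the image."""
    return sorted(apic for apic, files in listings.items() if image not in files)

listings = {
    "apic1": {"aci-apic-dk9.5.2.1g.bin", "aci-n9000-dk9.15.2.1g.bin"},
    "apic2": {"aci-apic-dk9.5.2.1g.bin"},
    "apic3": set(),
}
print(missing_image(listings, "aci-apic-dk9.5.2.1g.bin"))  # ['apic3']
```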

Filesystem on Standby APICs

Because a standby APIC is a cold standby and is not part of the cluster, it is not actively monitored for fault conditions. Filesystem-full checks fall under this category, which means any standby APIC exhibiting these conditions will not raise a fault and must be verified manually.

To do so, you can log in to a standby APIC as rescue-user, then run df -h to manually verify the filesystem usage.

If any filesystem is found to be at 75% or above, contact TAC to identify and clear out the condition.
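This manual df -h check can be scripted to flag any filesystem at or above the 75% threshold. A minimal sketch parsing sample (hypothetical) output:

```python
# Minimal sketch: parse `df -h` output and flag filesystems whose Use%
# meets or exceeds a threshold. The sample output below is hypothetical.
def over_threshold(df_output, threshold=75):
    """Return (mount, use%) pairs at or above the threshold."""
    flagged = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        pct = int(parts[4].rstrip("%"))
        if pct >= threshold:
            flagged.append((parts[5], pct))
    return flagged

sample = """\
Filesystem   Size  Used Avail Use% Mounted on
/dev/sda4     12G  9.5G  2.5G  80% /firmware
/dev/sda7     12G  1.4G  9.3G  13% /logflash
"""
print(over_threshold(sample))  # [('/firmware', 80)]
```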

EPG Configuration on Ports Connected to APICs (F0467: port-configured-for-apic)

In a healthy ACI deployment, there should be no EPG or policy deployment pushed to any interfaces where an APIC controller is connected. When an APIC is connected to a leaf switch, LLDP validation occurs between the APIC and the leaf switch to allow it into the fabric without any configuration by the user. When a policy is pushed to a leaf switch interface that is connected to an APIC, that configuration will be denied and a fault will be raised. However, if the link to the APIC flaps for any reason, primarily during an upgrade when the APIC reboots, the policy can then be deployed to that leaf switch interface. This results in the APIC being blocked from re-joining the fabric after it has reloaded.

It is critical that you resolve these issues before the upgrade to prevent any issues. You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F0467: port-configured-for-apic):

The following fault shows an example of node 101 eth1/1 that is connected to an APIC that has some EPG configurations on it.

admin@apic1:~> moquery -c faultInst -x 'query-target-filter=wcard(faultInst.descr,"port-configured-for-apic")'
Total Objects shown: 1
 
# fault.Inst
code             : F0467
ack              : no
annotation       : 
cause            : configuration-failed
changeSet        : configQual:port-configured-for-apic, configSt:failed-to-apply, debugMessage:port-configured-for-apic: Port is connected to the APIC;, temporaryError:no
childAction      :
created          : 2021-06-03T07:51:42.263-04:00
delegated        : yes
descr            : Configuration failed for uni/tn-jr/ap-ap1/epg-epg1 node 101 eth1/1 due to Port Connected to Controller, debug message: port-configured-for-apic: Port is connected to the APIC;
dn               : topology/pod-1/node-101/local/svc-policyelem-id-0/uni/epp/fv-[uni/tn-jr/ap-ap1/epg-epg1]
/node-101/stpathatt-[eth1/1]/nwissues/fault-F0467
domain           : tenant
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2021-06-03T07:53:52.021-04:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F0467
rule             : fv-nw-issues-config-failed
severity         : minor
status           :
subject          : management
type             : config
uid              :

Conflicting Interface L2/L3 Mode (F0467: port-configured-as-l2, port-configured-as-l3)

This is another type of the F0467 fault code family that you should check before an upgrade. This fault alerts that an interface configured under a Layer3 Out (L3Out) has failed because the port that the policy is deployed for is operating in the opposite mode. For example, you might have configured a routed sub-interface under an L3Out, making the port an L3 port. However, there is already L2 policy on that port. A port in ACI can either be L2 or L3, but not both, just like a port on any layer 3 switches that can be either “switchport” (L2) or “no switchport” (L3), so this policy fails in this situation. The same rule applies if a port is already an L3 port, but you deploy L2 config onto it. After an upgrade, it’s possible that the previously working configuration will break if this faulty policy is deployed first after the switch reloads.

It is critical that you resolve these issues before the upgrade to prevent any issues. The interface that the fault is raised on should either be corrected or deleted in order to clear the fault. You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F0467: port-configured-as-l2):

The following fault shows an example of the configuration from L3Out OSPF under tenant jr has failed on node 101 eth1/7 because the same port is already configured as L2 by other components, such as EPGs or other L3Outs using the same port as SVI. It implies that, in this case, L3Out OSPF is trying to use node 101 eth1/7 as a routed port or routed sub-interface (L3) as opposed to SVI (L2).

admin@apic1:~> moquery -c faultDelegate -x 'query-target-filter=wcard(faultInst.changeSet,"port-configured-as-l2")'
Total Objects shown: 1
 
# fault.Delegate
affected         : resPolCont/rtdOutCont/rtdOutDef-[uni/tn-jr/out-OSPF]/node-101/stpathatt-[eth1/7]/nwissues
code             : F0467
ack              : no
cause            : configuration-failed
changeSet        : configQual:port-configured-as-l2, configSt:failed-to-apply, temporaryError:no
childAction      :
created          : 2021-06-23T12:17:54.775-04:00
descr            : Fault delegate: Configuration failed for uni/tn-jr/out-OSPF node 101 eth1/7 due to Interface Configured as L2, debug message:
dn               : uni/tn-jr/out-OSPF/fd-[resPolCont/rtdOutCont/rtdOutDef-[uni/tn-jr/out-OSPF]/node-101/
stpathatt-[eth1/7]/nwissues]-fault-F0467
domain           : tenant
highestSeverity  : minor
lastTransition   :2021-06-23T12:20:09.780-04:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fd-[resPolCont/rtdOutCont/rtdOutDef-[uni/tn-jr/out-OSPF]/node-101/stpathatt-[eth1/7]/nwissues]-fault-F0467
rule             : fv-nw-issues-config-failed
severity         : minor
status           :
subject          : management
type             : config

Fault Example (F0467: port-configured-as-l3):

The following fault shows an example of the opposite of above situation. In this example, L3Out IPV6 tries to use node 101 eth1/7 as an L2 port and it failed because other L3Outs are already using the same port as an L3 port.

admin@apic1:~> moquery -c faultDelegate -x 'query-target-filter=wcard(faultInst.changeSet,"port-configured-as-l3")'
Total Objects shown: 1

# fault.Delegate
affected         : resPolCont/rtdOutCont/rtdOutDef-[uni/tn-jr/out-IPV6]/node-101/stpathatt-[eth1/7]/nwissues
code             : F0467
ack              : no
cause            : configuration-failed
changeSet        : configQual:port-configured-as-l3, configSt:failed-to-apply, debugMessage:port-configured-as-l3: Port has one or more layer3 sub-interfaces;, temporaryError:no
childAction      :
created          : 2021-06-23T12:31:41.949-04:00
descr            : Fault delegate: Configuration failed for uni/tn-jr/out-IPV6 node 101 eth1/7 due to Interface Configured as L3, debug message: port-configured-as-l3: Port has one or more layer3 sub-interfaces;
dn               : uni/tn-jr/out-IPV6/fd-[resPolCont/rtdOutCont/rtdOutDef-[uni/tn-jr/out-IPV6]/node-101/
stpathatt-[eth1/7]/nwissues]-fault-F0467
domain           : tenant
highestSeverity  : minor
lastTransition   : 2021-06-23T12:31:41.949-04:00
lc               : soaking
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fd-[resPolCont/rtdOutCont/rtdOutDef-[uni/tn-jr/out-IPV6]/node-101/stpathatt-[eth1/7]/nwissues]-fault-F0467
rule             : fv-nw-issues-config-failed
severity         : minor
status           :
subject          : management
type             : config

Conflicting L3Out Subnets for Contracts (F0467: prefix-entry-already-in-use)

There is another fault in the F0467 fault code family that you should check before an upgrade. This fault alerts you that an external EPG defined under a Layer 3 Out (L3Out) has a subnet with the “External Subnet for the External EPG” scope configured that overlaps with another L3Out external EPG in the same VRF. After an upgrade, a previously working configuration may break if the faulty policy is deployed first after the switch reloads.

It is critical that you resolve these issues before the upgrade to prevent any unexpected outages when the switch(es) upgrade. The subnet that the fault is raised on should either be corrected or deleted in order to clear the fault. You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F0467: prefix-entry-already-in-use):

The following shows an example of L3Out OSPF with an external EPG called all. In this external EPG, an L3Out subnet 112.112.112.112/32 is configured with “External Subnet for the External EPG” in an attempt to classify the source or destination IP address of packets into this external EPG for contract application. However, it failed because the same subnet is already in use by another external EPG in the same VRF.

admin@apic1:~> moquery -c faultInst -x 'query-target-filter=wcard(faultInst.descr,"prefix-entry-already-in-use")'
Total Objects shown: 1
 
# fault.Inst
code             : F0467
ack              : no
annotation       :
cause            : configuration-failed
changeSet        : configQual:prefix-entry-already-in-use, configSt:failed-to-apply, debugMessage:prefix-entry-already-in-use: Prefix entry sys/ctx-[vxlan-2621440]/pfx-[112.
112.112.112/32] is in use;, temporaryError:no
childAction      :
created          : 2021-06-22T09:02:36.630-04:00
delegated        : yes
descr            : Configuration failed for uni/tn-jr/out-OSPF/instP-all due to Prefix Entry Already Used in Another EPG, debug message: prefix-entry-already-in-use: Prefix
entry sys/ctx-[vxlan-2621440]/pfx-[112.112.112.112/32] is in use;
dn               : topology/pod-1/node-101/local/svc-policyelem-id-0/uni/epp/rtd-[uni/tn-jr/out-OSPF/instP-all]/nwissues/fault-F0467
domain           : tenant
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2021-06-22T09:04:51.985-04:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F0467
rule             : fv-nw-issues-config-failed
severity         : minor
status           :
subject          : management
type             : config
uid              :
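As an alternative to running moquery on the APIC CLI, the same class query can be issued through the APIC REST API. The sketch below only constructs the query URL; the hostname apic1 is illustrative and the authentication step (aaaLogin) is omitted. Fetching the URL with a tool such as curl or Python requests returns the same fault objects in JSON.

```python
from urllib.parse import quote

# Sketch: build the APIC REST API class-query URL equivalent to the
# moquery above. Hostname and filter are illustrative examples.
def fault_query_url(apic_host, fault_class, filter_expr):
    """Return a class query URL with a URL-encoded query-target-filter."""
    safe_chars = '(),."'  # keep the filter expression readable in the URL
    return ("https://%s/api/class/%s.json?query-target-filter=%s"
            % (apic_host, fault_class, quote(filter_expr, safe=safe_chars)))

url = fault_query_url("apic1", "faultInst",
                      'wcard(faultInst.descr,"prefix-entry-already-in-use")')
print(url)
```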

Overlapping BD Subnets In the Same VRF (F0469: duplicate, F1425: subnet-overlap)

If at any point in time an overlapping IP address or subnet is deployed within a VRF, that policy will fail and a fault will be raised at the node level. However, on upgrade, it’s possible that this previously failing configuration will get pushed to the leaf switch before the previously working configuration. This results in a situation where the known working state before the upgrade is broken after the upgrade, and can cause connectivity issues for the previously working subnet.

There are two faults for this situation:

  • F0469 (duplicate-subnets-within-ctx) is raised when multiple BD subnets are configured with the exact same subnet in the same VRF

  • F1425 (subnet-overlap) is raised when BD subnets are not the same but overlapping

It is critical that you resolve these issues before the upgrade to prevent any issues. The subnet that the fault is raised on should either be corrected or deleted in order to clear the fault. You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F0469: duplicate-subnets-within-ctx):

admin@f1-apic1:~> moquery -c faultInst -f 'fault.Inst.code=="F0469"'
Total Objects shown: 4
 
# fault.Inst
code             : F0469
ack              : no
annotation       :
cause            : configuration-failed
changeSet        : configQual (New: duplicate-subnets-within-ctx), configSt (New: failed-to-apply), debugMessage (New: uni/tn-TK/BD-BD2,uni/tn-TK/BD-BD1)
childAction      :
created          : 2021-07-08T17:40:37.630-07:00
delegated        : yes
descr            : BD Configuration failed for uni/tn-TK/BD-BD2  due to duplicate-subnets-within-ctx: uni/tn-TK/BD-BD2 ,uni/tn-TK/BD-BD1
dn               : topology/pod-1/node-101/local/svc-policyelem-id-0/uni/bd-[uni/tn-TK/BD-BD2]-isSvc-no/bdcfgissues/fault-F0469
domain           : tenant
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2021-07-08T17:40:37.630-07:00
lc               : soaking
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F0469
rule             : fv-bdconfig-issues-config-failed
severity         : minor
status           :
subject          : management
type             : config
uid              :

Fault Example (F1425: subnet-overlap):

admin@apic1:~> moquery -c faultInst -f 'fault.Inst.code=="F1425"'
Total Objects shown: 1
 
# fault.Inst
code             : F1425
ack              : no
annotation       : 
cause            : ip-provisioning-failed
changeSet        : ipv4CfgFailedBmp (New: ipv4:Addraddr_failed_flag,ipv4:Addrctrl_failed_flag,ipv4:AddrlcOwn_failed_flag,
ipv4:AddrmodTs_failed_flag,ipv4:AddrmonPolDn_failed_flag,ipv4:Addrpref_failed_flag,ipv4:Addrtag_failed_flag,
ipv4:Addrtype_failed_flag,ipv4:AddrvpcPeer_failed_flag), ipv4CfgState (New: 1), operStQual (New: subnet-overlap)
childAction      : 
created          : 2020-02-27T01:50:45.656+01:00
delegated        : no
descr            : IPv4 address(10.10.10.1/24) is operationally down, reason:Subnet overlap on node 101 fabric hostname leaf-101
dn               : topology/pod-1/node-101/sys/ipv4/inst/dom-jr:v1/if-[vlan10]/addr-[10.10.10.1/24]/fault-F1425
domain           : access
extMngdBy        : undefined
highestSeverity  : major
lastTransition   : 2020-02-27T01:52:49.812+01:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : major
prevSeverity     : major
rn               : fault-F1425
rule             : ipv4-addr-oper-st-down
severity         : major
status           : 
subject          : oper-state-err
type             : operational
uid              : 

SSD Health Status on APICs (F0101, F2730, F2731, F2732)

Starting from APIC release 2.3(1), faults occur when the SSD media wearout indicator (life remaining) falls below certain percentages on APIC nodes. An SSD with little life remaining may cause any operation that requires updates to the internal database, such as an upgrade or downgrade, to fail. The APIC raises three different faults depending on the amount of SSD life remaining. If the most critical fault (F2732) is raised on the system, you must replace the SSD by contacting Cisco TAC prior to performing the upgrade.

  • F2730: A warning level fault for APIC SSD life remaining. This is raised when the life remaining is less than 10%.

  • F2731: A major level fault for APIC SSD life remaining. This is raised when the life remaining is less than 5%.

  • F2732: A critical level fault for APIC SSD life remaining. This is raised when the life remaining is less than 1%.

Also, on very rare occasions, the SSD might have operational issues other than lifetime wear. In such a case, look for the fault F0101.
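The thresholds above can be summarized in a small helper (the function name is illustrative; the thresholds come from the fault descriptions):

```python
# Map the APIC SSD life remaining (%) to the fault that would be raised,
# following the thresholds listed above. Returns None when no fault applies.
def apic_ssd_fault(life_remaining_pct):
    if life_remaining_pct < 1:
        return "F2732"  # critical
    if life_remaining_pct < 5:
        return "F2731"  # major
    if life_remaining_pct < 10:
        return "F2730"  # warning
    return None

print(apic_ssd_fault(1))   # F2731: 1% is below 5% but not below 1%
print(apic_ssd_fault(50))  # None: no fault raised
```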

You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

If your APICs are still running a release older than 2.3(1), contact Cisco TAC to check the SSD life remaining.

See the APIC SSD Replacement technote for more details.

Fault Example (F2731: Major Fault for APIC SSD life remaining):

The following shows an example of APIC 3 (node 3) with an SSD life remaining of 1% (the major fault F2731). In this case, although the critical fault F2732 for life remaining less than 1% is not raised, the value is close enough to F2732’s threshold that replacing the SSD is recommended.

APIC1# moquery -c faultInfo -f 'fault.Inst.code=="F2731"'
Total Objects shown: 1
 
# fault.Inst
code             : F2731
ack              : no
annotation       :
cause            : equipment-wearout
changeSet        : mediaWearout (Old: 2, New: 1)
childAction      :
created          : 2019-10-22T11:47:40.791+01:00
delegated        : no
descr            : Storage unit /dev/sdb on Node 3 mounted at /dev/sdb has 1% life remaining
dn               : topology/pod-2/node-3/sys/ch/p-[/dev/sdb]-f-[/dev/sdb]/fault-F2731
domain           : infra
extMngdBy        : undefined
highestSeverity  : major
lastTransition   : 2019-10-22T11:49:48.788+01:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : major
prevSeverity     : major
rn               : fault-F2731
rule             : eqpt-storage-wearout-major
severity         : major
status           :
subject          : equipment-wearout
type             : operational
uid              :

SSD Health Status on ACI Switches (F3074, F3073)

Starting from releases 2.1(4), 2.2(4), 2.3(1o), and 3.1(2m), faults occur if the flash SSD lifetime usage has reached a certain endurance limit on leaf or spine switches. A flash SSD with little endurance remaining may cause any operation that requires updates to the internal database, such as APIC communication, to fail, or the switch may fail to boot up. The ACI switch raises two different faults depending on the amount of SSD lifetime consumed. If the most critical fault (F3073) is raised on the system, you must replace the SSD by contacting Cisco TAC prior to performing the upgrade.

  • F3074: A warning level fault for switch SSD lifetime. This is raised when the lifetime has reached 80% of its limit.

  • F3073: A major level fault for switch SSD lifetime. This is raised when the lifetime has reached 90% of its limit.

You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

If your APICs are still running a release older than those listed above, contact Cisco TAC to check the SSD life status.

See the ACI Switch Node SSD Lifetime Explained technote for more details.

Fault Example (F3074: Warning Fault for switch SSD lifetime):

The following shows an example of node 101 that has reached 85% of its SSD lifetime.

APIC1# moquery -c faultInst -f 'fault.Inst.code=="F3074"'

Total Objects shown: 4
 
# fault.Inst
code             : F3074
ack              : no
annotation       : 
cause            : equipment-flash-warning
changeSet        : acc:read-write, cap:61057, deltape:23, descr:flash, gbb:0, id:1, lba:0, lifetime:85, majorAlarm:no, mfgTm:2020-09-22T02:21:45.675+00:00, minorAlarm:yes, model:Micron_M600_MTFDDAT064MBF, operSt:ok, peCycles:4290, readErr:0, rev:MC04, ser:MSA20400892, tbw:21.279228, type:flash, vendor:Micron, warning:yes, wlc:0
childAction      : 
created          : 2020-09-21T21:21:45.721-05:00
delegated        : no
descr            : SSD has reached 80% lifetime and is nearing its endurance limit. Please plan for Switch/Supervisor replacement soon
dn               : topology/pod-1/node-101/sys/ch/supslot-1/sup/flash/fault-F3074
domain           : infra
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2020-09-21T21:24:03.132-05:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F3074
rule             : eqpt-flash-flash-minor-alarm
severity         : minor
status           : 
subject          : flash-minor-alarm
type             : operational
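When scripting against such output, the consumed lifetime can be pulled out of the fault's changeSet string (lifetime:85 in the example above). A minimal sketch, using an abbreviated copy of that changeSet:

```python
import re

# Pull the "lifetime" field out of an F3074/F3073 changeSet string.
def ssd_lifetime_pct(change_set):
    m = re.search(r"\blifetime:(\d+)", change_set)
    return int(m.group(1)) if m else None

# Abbreviated changeSet from the fault example above:
sample = "acc:read-write, cap:61057, lifetime:85, minorAlarm:yes, warning:yes"
print(ssd_lifetime_pct(sample))  # 85
```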

VMM Controller Connectivity (F0130)

If there is an issue in communication between the APIC and the VMM controller, the VMM controller status is marked as offline and the fault F0130 is raised. Ensure that connectivity between them is restored prior to upgrades, so that resources currently deployed on switches based on communication with the VMM controller are not changed or lost because the APICs cannot retrieve the necessary information after an upgrade.

You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F0130: VMM Controller connection failure):

The following is an example of the APIC failing to communicate with the VMM controller MyVMMController with the IP 192.168.100.100 in VMM domain LAB_VMM.

apic1# moquery -c faultInst -f 'fault.Inst.code=="F0130"'
Total Objects shown: 1
 
# fault.Inst
code             : F0130
ack              : no
cause            : connect-failed
changeSet        : operSt (Old: unknown, New: offline)
childAction      :
created          : 2016-05-23T16:07:50.205-05:00
delegated        : yes
descr            : Connection to VMM controller: 192.168.100.100 with name MyVMMController in datacenter LAB1 in domain: LAB_VMM is failing repeatedly with error: [Failed to retrieve ServiceContent from the vCenter server 192.168.100.100]. Please verify network connectivity of VMM controller 192.168.100.100 and check VMM controller user credentials are valid.
dn               : comp/prov-VMware/ctrlr-[LAB_VMM]-MyVMMController/fault-F0130
domain           : external
highestSeverity  : major
lastTransition   : 2016-05-23T16:10:04.219-05:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : major
prevSeverity     : major
rn               : fault-F0130
rule             : comp-ctrlr-connect-failed
severity         : major
status           :
subject          : controller
type             : communications
uid              :

Missing LLDP/CDP Adjacency Between Leaf Nodes and VMM Hypervisors (F606391)

When a VMM domain is attached to an EPG with On Demand or Immediate resolution immediacy (as opposed to Pre-Provision), for some VMM integrations such as VMware DVS integration, the APIC checks LLDP or CDP information from the leaf switches connected to the hypervisors and from the VMM controller managing the hypervisors. This information is required from both the leaf switches and the hypervisors to dynamically detect the leaf interface connecting to each hypervisor, even when there is an intermediate switch in between, such as a Cisco UCS Fabric Interconnect. Once the interface is detected, the APIC dynamically deploys VLANs only on the necessary interface(s) of the leaf switch that the hypervisor is connected to.

Prior to APIC release 3.0(1), VLANs were removed from the leaf interfaces if the APIC lost connectivity to the VMM controller, because the APIC could no longer compare the LLDP or CDP information from the hypervisor’s point of view. Starting from APIC release 3.0(1), VLANs are not removed from the leaf interfaces even if the APIC loses connectivity to the VMM controller, to prevent transient management plane issues from impacting data plane traffic. However, this may cause churn in the APIC process as it repeatedly tries to obtain the LLDP/CDP information. When the LLDP/CDP information is missing, the fault F606391 is raised.

For these reasons, regardless of the APIC release, it is important to resolve this fault prior to the upgrade. If the faults are raised on a VMM domain configured for the Cisco Application Virtual Edge (AVE), LLDP and CDP can be disabled entirely, since the control plane that programs the switches is based on the OpFlex protocol rather than LLDP/CDP. When LLDP and CDP are disabled, the faults should clear. The LLDP/CDP state for a VMM domain is configured under the vSwitch Policy for the VMM domain.

You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F606391: LLDP/CDP adjacency missing for hypervisors):

apic1# moquery -c faultInst -f 'fault.Inst.code=="F606391"'
Total Objects shown: 5
 
# fault.Inst
code                      : F606391
ack                       : no
annotation                : 
cause                     : fsm-failed
changeSet                 : 
childAction               : 
created                   : 2019-07-18T01:17:39.435+08:00
delegated                 : yes
descr                     : [FSM:FAILED]: Get LLDP/CDP adjacency information for the physical adapters on the host: hypervisor1.cisco.com(TASK:ifc:vmmmgr:CompHvGetHpNicAdj)
dn                        : comp/prov-VMware/ctrlr-[LAB_VMM]-MyVMMController/hv-host-29039/fault-F606391
domain                    : infra
extMngdBy                 : undefined
highestSeverity           : major
lastTransition            : 2019-07-18T01:17:39.435+08:00
lc                        : raised
modTs                     : never
occur                     : 1
origSeverity              : major
prevSeverity              : major
rn                        : fault-F606391
rule                      : fsm-get-hp-nic-adj-fsm-fail
severity                  : major
status                    : 
subject                   : task-ifc-vmmmgr-comp-hv-get-hp-nic-adj
type                      : config
uid                       :

Different Infra VLAN Being Injected Through LLDP (F0454: infra-vlan-mismatch)

If you have interfaces connected back-to-back between two different ACI fabrics, you must disable LLDP on those interfaces prior to upgrades. This is because when the switch comes back up after the upgrade, it may receive and process LLDP packets from the other fabric that may be using a different infra VLAN. If that happens, the switch incorrectly tries to be discovered through the infra VLAN of the other fabric and will not be discoverable in the correct fabric.

There is a fault to detect whether an ACI switch is currently receiving an LLDP packet with an infra VLAN mismatch from another fabric.

You can run the moquery below on the CLI of any APIC to check if the fault exists on the system.

Fault Example (F0454: LLDP with mismatched parameters):

apic1# moquery -c faultInst -f 'fault.Inst.code=="F0454"'
Total Objects shown: 2
 
# fault.Inst
code             : F0454
ack              : no
alert            : no
annotation       :
cause            : wiring-check-failed
changeSet        : wiringIssues (New: ctrlr-uuid-mismatch,fabric-domain-mismatch,infra-ip-mismatch,infra-vlan-mismatch)
childAction      :
created          : 2021-06-30T10:44:25.576-07:00
delegated        : no
descr            : Port eth1/48 is out of service due to Controller UUID mismatch,Fabric domain name mismatch,Infra subnet mismatch,Infra vlan mismatch
dn               : topology/pod-1/node-104/sys/lldp/inst/if-[eth1/48]/fault-F0454
--- snip ---

Policy CAM Programming for Contracts (F3545)

The fault F3545 occurs when the switch fails to activate a contract rule (zoning-rule) due to either a hardware or software programming failure. Typically this means the policy CAM is full and no more contracts can be deployed on the switch; in that state, a different set of contracts may be deployed after a reboot or upgrade. This may lead to services that used to work before an upgrade failing after the upgrade. Note that the same fault can also occur for reasons other than policy CAM usage, such as an unsupported type of filter in the contract(s). For instance, first generation ACI switches support EtherType IP but not IPv4 or IPv6 in contract filters. When this fault is present, check Operations > Capacity Dashboard > Leaf Capacity in the APIC GUI for policy CAM usage.

You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F3545: Zoning Rule Programming Failure):

The following shows an example of node 101 with programming failure for 266 contract rules (zoneRuleFailed). Although it also shows the programming failure of L3Out subnets (pfxRuleFailed) in the changeSet, a separate fault F3544 is raised for that.

apic1# moquery -c faultInst -f 'fault.Inst.code=="F3545"'
Total Objects shown: 1
 
# fault.Inst
code             : F3545
ack              : no
annotation       : 
cause            : actrl-resource-unavailable
changeSet        : pfxRuleFailed (New: 80), zoneRuleFailed (New: 266)
childAction      : 
created          : 2020-02-26T01:01:49.256-05:00
delegated        : no
descr            : 266 number of Rules failed on leaf1
dn               : topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3545
domain           : infra
extMngdBy        : undefined
highestSeverity  : major
lastTransition   : 2020-02-26T01:03:59.849-05:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : major
prevSeverity     : major
rn               : fault-F3545
rule             : actrl-stats-report-zone-rule-prog-failed
severity         : major
status           : 
subject          : hwprog-failed
type             : operational
uid              :

L3Out Subnets Programming for Contracts (F3544)

The fault F3544 occurs when the switch fails to activate an entry that maps a prefix to a pcTag due to either a hardware or software programming failure. These entries are configured for L3Out subnets with the “External Subnets for the External EPG” scope under an external EPG in an L3Out, and are used to map L3Out subnets to L3Out EPGs. If this fault is raised because of the LPM or host route capacity on the switch, the switch may activate a different set of entries after a reboot or upgrade. This may lead to services that used to work before an upgrade failing after the upgrade. When this fault is present, check Operations > Capacity Dashboard > Leaf Capacity in the APIC GUI for LPM and /32 or /128 route usage.

You can run the moquery below on the CLI of any APIC to check if these faults exist on the system. The faults are visible within the GUI as well.

Fault Example (F3544: L3Out Subnet Programming Failure):

The following shows an example of node 101 with programming failure for 80 L3Out subnets with “External Subnets for the External EPG” (pfxRuleFailed). Although it also shows the programming failure of contracts themselves (zoneRuleFailed) in the changeSet, a separate fault F3545 is raised for that.

apic1# moquery -c faultInst -f 'fault.Inst.code=="F3544"'
Total Objects shown: 1
 
# fault.Inst
code             : F3544
ack              : no
annotation       : 
cause            : actrl-resource-unavailable
changeSet        : pfxRuleFailed (New: 80), zoneRuleFailed (New: 266)
childAction      : 
created          : 2020-02-26T01:01:49.246-05:00
delegated        : no
descr            : 80 number of Prefix failed on leaf1
dn               : topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3544
domain           : infra
extMngdBy        : undefined
highestSeverity  : major
lastTransition   : 2020-02-26T01:03:59.849-05:00
lc               : raised
modTs            : never
occur            : 1
origSeverity     : major
prevSeverity     : major
rn               : fault-F3544
rule             : actrl-stats-report-pre-fix-prog-failed
severity         : major
status           : 
subject          : hwprog-failed
type             : operational
uid              :

General Scalability Limits

Check the Capacity Dashboard from Operations > Capacity Dashboard in the APIC GUI to ensure that no capacity exceeds its limit. If a limit is exceeded, resources deployed before and after an upgrade may become inconsistent, as described in Policy CAM Programming for Contracts (F3545) and L3Out Subnets Programming for Contracts (F3544).

We recommend that you check the Capacity Dashboard for each switch through Operations > Capacity Dashboard > Leaf Capacity, because those limits are typically hardware limits rather than software-qualified limits: for instance, the number of endpoints such as MAC (Learned) and IPv4 (Learned), policy CAM, LPM, and host routes.

Overlapping VLAN Pool

Overlapping VLAN blocks across different VLAN pools may result in some forwarding issues, such as:

  • Packet loss due to issues in endpoint learning

  • Spanning tree loops due to mismatched BPDU forwarding domains

These issues may suddenly appear after upgrading your switches because switches fetch their policies from scratch after an upgrade and may apply the same VLAN ID from a different pool than the one used prior to the upgrade. As a result, the VLAN ID is mapped to a different VXLAN VNID than on other switch nodes, which causes the two problems mentioned above.

It is critical to ensure that there are no overlapping VLAN pools in your fabric unless it is intentional and done with an appropriate understanding of the VLAN ID and VXLAN VNID mapping behind the scenes. If you are not sure, consider enabling Enforce EPG VLAN Validation under System > System Settings > Fabric Wide Setting in the APIC GUI [available starting with release 3.2(6)], which prevents the most common problematic configuration (two domains containing overlapping VLAN pools being associated with the same EPG).
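As a rough illustration of what such a validation does, the sketch below flags VLAN ID ranges that overlap across different pools. The pool names and ranges are hypothetical; a real check would read the fvnsEncapBlk objects from the APIC instead.

```python
# Detect VLAN blocks that overlap across different VLAN pools.
# Input: pool name -> list of (from, to) VLAN ID ranges (inclusive).
def overlapping_pools(pools):
    items = [(name, lo, hi) for name, blocks in pools.items()
             for lo, hi in blocks]
    issues = []
    for i, (p1, lo1, hi1) in enumerate(items):
        for p2, lo2, hi2 in items[i + 1:]:
            # Two ranges overlap when each starts before the other ends.
            if p1 != p2 and lo1 <= hi2 and lo2 <= hi1:
                issues.append((p1, p2, max(lo1, lo2), min(hi1, hi2)))
    return issues

# Hypothetical pools: VLANs 2000-2010 appear in both.
print(overlapping_pools({
    "pool-A": [(2000, 2100)],
    "pool-B": [(1900, 2010)],
}))
```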

Refer to the following documents to understand how overlapping VLAN pools become an issue and when this scenario might occur:

L3Out MTU Mismatch

It is critical to ensure that the MTU values on ACI L3Out interfaces and on the routers that connect to them match. Otherwise, when the ACI switch boots up after an upgrade, it may cause problems during routing protocol neighborship establishment or when exchanging route information between peers.

See below for details on each protocol.

BGP establishes sessions without MTU consideration: the BGP Open and Establish messages are small, but the messages that exchange routes can be large.

OSPF fails to form a neighborship if the MTU on the two ends of the link does not match. However, although it is strongly discouraged, if OSPF is configured to ignore the MTU mismatch during neighborship establishment, the neighborship will still form.

During a border leaf switch upgrade, routing sessions are torn down. When the border leaf switch comes back online with the new version, it brings up its routing peers again. Once it starts to exchange routing prefixes, it may generate frames with a large payload; depending on the size of the routing table, the updates may require a frame size that cannot reach the other side. The size of this payload depends on the local MTU, so if the MTU on the other side is smaller than the local MTU, these exchanges fail, resulting in routing issues.

You can check and configure the MTU on L3Out interfaces through Tenant > Networking > L3Out > Logical Node Profile > Logical Interface Profile > Select interface type.

You can run the moquery below on the CLI of any APIC to check the configured MTU of all L3Out interfaces. Pipe the output through egrep for concise output if necessary, for example:

apic1# moquery -c l3extRsPathL3OutAtt | egrep "dn|encap|mtu"

In this example, an L3Out interface with VLAN 2054 is configured with MTU 9000 in tenant TK, L3Out OSPF, Logical Node Profile OSPF_nodeProfile, and Logical Interface Profile OSPF_interfaceProfile.

apic1# moquery -c l3extRsPathL3OutAtt
Total Objects shown: 1

# l3ext.RsPathL3OutAtt
addr      : 20.54.0.1/24
        --- snip ---
dn        : uni/tn-TK/out-OSPF/lnodep-OSPF_nodeProfile/lifp-OSPF_interfaceProfile/
rspathL3OutAtt-[topology/pod-1/paths-101/pathep-[eth1/12]]
encap     : vlan-2054
        --- snip ---
mtu       : 9000
        --- snip ---

Alternatively, you can run fabric <node_id> show interface on your border leaf nodes as well.

If the MTU shows inherit, the value is inherited from Fabric > Fabric Policies > Policies > Global > Fabric L2 MTU > default.

The script provided in this chapter checks the MTU of all L3Out interfaces for you. However, the script runs on the APIC, and the APIC has no visibility into the MTU values configured on the connected devices. Therefore, you should manually check the MTU on the connected devices.
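The comparison itself is simple once both values are known. The sketch below (function name and values are illustrative) resolves inherit to the fabric-wide L2 MTU, which is passed in as a parameter since its configured value depends on your fabric, and flags a mismatch with the connected device:

```python
# Compare an ACI L3Out interface MTU with the connected router's MTU.
# "inherit" resolves to the fabric-wide L2 MTU (Fabric > Fabric Policies >
# Policies > Global > Fabric L2 MTU > default), passed in explicitly here.
def mtu_mismatch(aci_mtu, peer_mtu, fabric_l2_mtu):
    resolved = fabric_l2_mtu if aci_mtu == "inherit" else int(aci_mtu)
    return resolved != peer_mtu

print(mtu_mismatch("9000", 9000, 9216))     # False: both sides use 9000
print(mtu_mismatch("inherit", 9000, 9216))  # True: 9216 vs 9000
```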

L3Out BGP Peer Connectivity Profile Under a Node Profile Without a Loopback

Prior to upgrading to release 4.1(2) or later, you must ensure that one of the following two requirements is met:

  • A node profile with a BGP Peer Connectivity Profile has a loopback configured for all switches in the profile, or

  • BGP Peer Connectivity Profiles are configured per interface.

You can configure the BGP Peer Connectivity Profile per node profile or per interface. The former is to source the BGP session from a loopback while the latter is to source the BGP session from each interface.

Prior to release 4.1(2), when a BGP Peer Connectivity Profile is configured at a node profile without configuring a loopback, APIC uses another available IP address on the same border leaf switch in the same VRF as the BGP source, such as a loopback IP address from another L3Out or an IP address configured for each interface. This has a risk of the BGP source IP address being changed unintentionally across reboots or upgrades. This behavior has been changed based on CSCvm28482 and ACI no longer establishes a BGP session through a BGP Peer Connectivity Profile at a node profile when a loopback is not configured in the node profile. Instead, a fault F3488 is raised in these situations. This fault cannot be used as a pre-upgrade check because it is raised only after an upgrade.

Due to this change, when upgrading from an older version to release 4.1(2) or later, a BGP session is no longer established if the session was generated via a BGP Peer Connectivity Profile under a node profile and a loopback is not configured in the node profile.

When multiple interfaces in the same node profile need to establish a BGP peering with the same peer IP address, you might have configured a BGP Peer Connectivity Profile at the node profile without a loopback, so that the same BGP peer configuration is applied, as a fallback due to the missing loopback, to each interface in the node profile. This was necessary because a BGP Peer Connectivity Profile with the same peer IP address could not be configured at multiple interface profiles within the same node profile. That limitation was relaxed by CSCvw88636 in release 4.2(7f). On earlier releases, for this specific requirement, you need to configure one interface per interface profile and configure the BGP Peer Connectivity Profile at each interface profile in a different node profile.
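One way to audit this before the upgrade is to look at the dn of each bgpPeerP object returned by moquery -c bgpPeerP: a peer configured per interface sits under a Logical Interface Profile (a /lifp- component in the dn), while a node-profile-level peer does not. The dn strings below are illustrative, and this parsing is a sketch rather than an official tool.

```python
# Classify bgpPeerP dn strings: a peer configured per interface sits under
# a Logical Interface Profile ("/lifp-"), a node-profile-level peer does not.
def is_node_profile_peer(dn):
    return "/lnodep-" in dn and "/lifp-" not in dn

# Illustrative dn strings (tenant/L3Out/profile names are hypothetical):
node_level = "uni/tn-TK/out-BGP/lnodep-nodeP/peerP-[10.0.0.1]"
intf_level = ("uni/tn-TK/out-BGP/lnodep-nodeP/lifp-intfP/"
              "rspathL3OutAtt-[topology/pod-1/paths-101/pathep-[eth1/1]]/"
              "peerP-[10.0.0.1]")
print(is_node_profile_peer(node_level), is_node_profile_peer(intf_level))
```

Node-profile-level peers found this way should then be checked for a configured loopback before upgrading to 4.1(2) or later.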

L3Out Incorrect Route Map Direction (CSCvm75395)

Prior to upgrading to release 4.1(1) or later, you must ensure that the route map (route profile) configuration is correct.

Due to the defect CSCvm75395, the following configuration might have worked prior to release 4.1(1) despite the wrong configuration (a mismatch in the direction):

  • A route map with export direction attached to an L3Out subnet with Import Route Control Subnet

  • A route map with import direction attached to an L3Out subnet with Export Route Control Subnet

Where the L3Out subnet means the subnet configured under an External EPG in an L3Out.

However, after you upgrade the fabric to release 4.1(1) or later, these incorrect configurations will no longer work, which is the expected behavior.

Although this method is not the most common or recommended of the many ways to control the routes advertised or learned by ACI L3Outs, the correct configuration with this method is as follows:

  • A route map with export direction attached to an L3Out subnet with Export Route Control Subnet

  • A route map with import direction attached to an L3Out subnet with Import Route Control Subnet

Or alternatively, you can follow the recommended configurations below to control the route exchange in L3Outs instead:

  • default-export route map with IP prefix-lists

  • default-import route map with IP prefix-lists

With this configuration, you do not need Export Route Control Subnet or Import Route Control Subnet in the external EPGs, and you can have external EPGs dedicated to contracts or route leaking while retaining full control of the routing protocol through route maps, just like on a normal router.

Also note that import-direction route maps take effect only when Route Control Enforcement is enabled for import under Tenant > Networking > L3Out > Main profile. Otherwise, everything is imported (learned) by default.
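To audit an existing fabric for the direction mismatch described above, you can compare each L3Out subnet's route-control scope against the direction of any attached route profile. The following is a minimal sketch, assuming the subnets were exported as JSON via a class query for l3extSubnet with the l3extRsSubnetToProfile relation included in the subtree; verify the attribute names against your APIC release.

```python
def route_map_direction_mismatches(imdata):
    """Return (subnet IP, direction) pairs where an attached route
    profile's direction has no matching route-control flag in the
    subnet's scope (e.g. direction "export" without Export Route
    Control Subnet, i.e. the "export-rtctrl" scope flag).
    """
    mismatches = []
    for item in imdata:
        subnet = item.get("l3extSubnet")
        if subnet is None:
            continue
        scope_flags = subnet["attributes"].get("scope", "").split(",")
        for child in subnet.get("children", []):
            rel = child.get("l3extRsSubnetToProfile")
            if rel is None:
                continue
            direction = rel["attributes"].get("direction", "")
            if f"{direction}-rtctrl" not in scope_flags:
                mismatches.append((subnet["attributes"]["ip"], direction))
    return mismatches
```

Any pair the function reports corresponds to a route map whose direction does not match the subnet's route-control flag and should be corrected before upgrading to release 4.1(1) or later.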

EP Announce Version Mismatch (CSCvi76161)

If your current ACI switch version is pre-12.2(4p) or 12.3(1) and you are upgrading to release 13.2(2) or later, you are susceptible to defect CSCvi76161, where a version mismatch between Cisco ACI leaf switches may cause an unexpected EP announce message to be received by the EPM process on a leaf switch, resulting in an EPM crash and the reload of the switch.

To avoid this issue, it is critical that you upgrade to a version that contains the fix for CSCvi76161 prior to upgrading to release 13.2(2) or later.

  • For fabrics running a pre-12.2(4p) ACI switch release, upgrade to release 12.2(4r) first, then upgrade to the desired destination release.

  • For fabrics running a 12.3(1) ACI switch release, upgrade to release 13.1(2v) first, then upgrade to the desired destination release.
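The two interim-release rules above can be expressed as a small helper that maps the current switch version to the release you must pass through first. This is only a sketch, assuming ACI switch version strings of the form major.minor(maintenance[letter]), such as 12.2(4p):

```python
import re


def required_interim_release(version):
    """Return the interim release to install before going to 13.2(2) or
    later, or None if the current version is not exposed to CSCvi76161
    per the two cases above (pre-12.2(4p) and 12.3(1))."""
    m = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z]?)\)", version)
    if not m:
        raise ValueError(f"unrecognized version string: {version}")
    key = (int(m[1]), int(m[2]), int(m[3]), m[4])
    if key[:2] == (12, 3):
        return "13.1(2v)"   # 12.3(1) train: go through 13.1(2v) first
    if key < (12, 2, 4, "p"):
        return "12.2(4r)"   # pre-12.2(4p): go through 12.2(4r) first
    return None
```

For example, required_interim_release("12.2(3j)") returns "12.2(4r)", while a fabric already on 13.2(1m) needs no interim step.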

Intersight Device Connector is Upgrading

If you start an APIC upgrade while an Intersight Device Connector (DC) upgrade is in progress, the DC upgrade may fail.

You can check the status of the Intersight DC from System > System Settings > Intersight. If the upgrade of the DC is in progress, wait a minute and retry the APIC upgrade. The upgrade of the Intersight Device Connector typically takes less than a minute.

Checklists for Downgrade

In general, the same checklists as for upgrades apply to downgrades. On top of that, you need to pay attention to new features that may not yet be supported on older versions. If you are using such features, disable them or change the configuration prior to the downgrade. Otherwise, some functionality will stop working after the downgrade.

The following list shows some example features that you should pay attention to prior to a downgrade. However, note that this list is not exhaustive, and we highly recommend that you check the Release Notes and Configuration Guides to confirm that the features you are using are also supported on the older release.

  • The ability to use the DUO application as an authentication method when logging in to Cisco APIC was introduced in Cisco APIC release 5.0(1). If you are running release 5.0(1) with DUO set up as your default authentication method and you decide to downgrade to a previous release where DUO was not supported, we recommend that you change the default authentication method from DUO to an option that was available prior to release 5.0(1), such as Local, LDAP, or RADIUS, before downgrading. Otherwise, you will have to log in using the fallback option after the downgrade and change the authentication method to a pre-5.0(1) option at that point.

    Navigate to Admin > AAA > Authentication, then change the setting in the Realm field in the Default Authentication area of the page to change the default authentication method before downgrading your system. You will also have to manually delete the DUO login domain after the downgrade.

  • Beginning in the 4.2(6) release, SNMPv3 supports the Secure Hash Algorithm-2 (SHA-2) authentication type. If you are running on Cisco APIC release 4.2(6) or later and you are using the SHA-2 authentication type, and then downgrade from Cisco APIC release 4.2(6) to a previous release, the downgrade will be blocked with the following error message:

    SHA-2 authentication type is not supported.

    You can choose to either change the authentication type to MD5 or delete the corresponding SNMPv3 users to continue.
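To find the users that would trigger this block before attempting the downgrade, you can filter the SNMPv3 user objects by authentication type. The following is a minimal sketch, assuming the users were exported as JSON from a class query for snmpUserP and that SHA-2 variants carry "sha2" in their authType value; verify both assumptions on your release.

```python
def sha2_snmp_users(imdata):
    """Return the names of SNMPv3 users configured with a SHA-2
    authentication type, which must be changed to MD5 or deleted
    before the downgrade can proceed."""
    blocked = []
    for item in imdata:
        user = item.get("snmpUserP")
        if user is None:
            continue
        if "sha2" in user["attributes"].get("authType", ""):
            blocked.append(user["attributes"]["name"])
    return blocked
```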

  • Changing the container bridge IP address on the APIC is supported only on APIC release 4.2(1) or later. If the container bridge IP address on the APIC for AppCenter is configured with a non-default IP address, change it back to the default 172.17.0.1/16 prior to downgrading to a version older than 4.2(1).

  • A static route (MO: mgmtStaticRoute) for an in-band or out-of-band management EPG under Tenants > mgmt > Node Management EPGs is supported only on APIC release 5.1 or later. Prior to the downgrade, delete this configuration and ensure that the required services are still reachable via other means.

  • Newly added microsegment EPG configurations must be removed before downgrading to a software release that does not support them.

  • Downgrading the fabric starting with the leaf switch will cause faults such as policy-deployment-failed with fault code F1371.

  • If you must downgrade the firmware from a release that supports FIPS to a release that does not support FIPS, you must first disable FIPS on the Cisco ACI fabric and reload all the switches in the fabric for the FIPS configuration change.

  • If you have Anycast services configured in your Cisco ACI fabric, you must disable the Anycast gateway feature and stop Anycast services on external devices before downgrading from Cisco APIC 3.2(x) to an earlier release.

  • Cisco N9K-C9508-FM-E2 fabric modules must be physically removed before downgrading to releases earlier than Cisco APIC 3.0(1). The same applies to any newer module with respect to its minimum supported release.

  • If you have remote leaf switches deployed, and you downgrade the Cisco APIC software from release 3.1(1) or later to an earlier release that does not support the remote leaf switches feature, you must decommission the nodes before downgrading. For information about prerequisites to downgrading Remote Leaf switches, see the Remote Leaf Switches chapter in the Cisco APIC Layer 3 Networking Configuration Guide.

  • If the following conditions are met:

    • You are running the 5.2(4) release and the Cisco APIC created one or more system-generated policies.

    • You downgrade the Cisco APIC from the 5.2(4) release, then later upgrade back to the 5.2(4) release.

    Then, one of the following behaviors will occur:

    • If the Cisco APIC finds a policy with the same name and parameters as a system-generated policy that it is trying to create, then the Cisco APIC will take ownership of the policy and you cannot modify the policy. This occurs if you did not modify the policy after downgrading from the 5.2(4) release.

    • If the Cisco APIC finds a policy with the same name as a system-generated policy that the Cisco APIC is trying to create, but the parameters are different, then the Cisco APIC will consider the policy to be a custom policy and you can modify the policy. This occurs if you modified the policy after downgrading from the 5.2(4) release.

    Because of this behavior, you should not modify the system-generated policies after you downgrade from the 5.2(4) release.

  • You must decommission an unsupported leaf switch that is connected to the Cisco APIC and move the cables to another leaf switch that is part of the fabric before you downgrade the image.

Examples of Pre-Upgrade Validator (APIC)

Example Error Messages and Override Options Through the GUI with APIC Release 4.2(5)

There are three situations where warning messages might appear through the GUI:

  • While the query is loading, where you might see a message indicating that the query is still in progress.

    This might occur because it sometimes takes a bit of time to load the data from a query. In this situation, be patient and wait for the system to finish loading the data from the query.

  • If the query fails for some reason, where you might see a warning message about the failed query.

    This warning will appear if the query failed for some reason (for example, there might be so many faults that the system is overloaded). In this case, it is up to you to verify if there are any faults that might cause an issue with the upgrade.

    However, if you want to override the block and proceed with an upgrade or downgrade without addressing the issue with the failed query, check the box next to the I understand there may be active faults on the system which can lead to unexpected issues, proceed with the upgrade field. This allows you to continue with the upgrade or downgrade process without addressing the issue with the failed query.

  • Once the fault query is complete, where you might see a warning message listing the detected faults.

    This warning message will appear when the fault query is complete and the system has found one or more faults. In this situation, click the Click Here link to get more information on the faults that the system found.

    If possible, we recommend that you resolve the issue that was raised in the fault before proceeding with the upgrade or downgrade process. For more information on these faults and the recommended action for each, see the Cisco APIC System Faults/Events Search Tool and the Cisco ACI System Messages Reference Guide.

    However, if you want to override the block and proceed with an upgrade or downgrade without addressing the issue that was raised in the fault, check the box next to the I understand there are active faults on the system which can lead to unexpected issues, proceed with the upgrade field. This allows you to continue with the upgrade or downgrade process without addressing the faults that were detected.

Example Error Messages and Override Options Through the NX-OS Style CLI

When you attempt to upgrade the software through the NX-OS style CLI:

apic# firmware upgrade controller-group

You might see an error message similar to the following if faults on the fabric are detected:

Error: Migration cannot proceed due to 23 active critical config faults. Resolve the faults to proceed

If possible, we recommend that you resolve the issue that was raised in the fault before proceeding with the upgrade or downgrade process. For more information on these faults and the recommended action for each, see the Cisco APIC System Faults/Events Search Tool and the Cisco ACI System Messages Reference Guide.

However, if you want to override the block and proceed with an upgrade or downgrade without addressing the issue that was raised in the fault, use the ignore-validation option to proceed with the upgrade:

apic# firmware upgrade controller-group ignore-validation
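Rather than overriding, you can enumerate the blocking faults first and resolve them. The sketch below queries the faultInst class filtered to critical configuration faults over the APIC REST API; the controller address and token handling are placeholders for illustration, and certificate verification is disabled only for lab use.

```python
import json
import ssl
import urllib.request

APIC = "https://apic.example.com"  # hypothetical APIC address
QUERY = ("/api/class/faultInst.json?query-target-filter="
         'and(eq(faultInst.severity,"critical"),eq(faultInst.type,"config"))')


def summarize_faults(resp_json):
    """Reduce an APIC faultInst response to (total, [(code, dn), ...])."""
    total = int(resp_json.get("totalCount", "0"))
    faults = [(f["faultInst"]["attributes"]["code"],
               f["faultInst"]["attributes"]["dn"])
              for f in resp_json.get("imdata", [])]
    return total, faults


def fetch_critical_config_faults(token):
    """Query the APIC using a session token obtained from /api/aaaLogin.json."""
    ctx = ssl._create_unverified_context()  # lab only; verify certs in production
    req = urllib.request.Request(
        APIC + QUERY, headers={"Cookie": f"APIC-cookie={token}"})
    with urllib.request.urlopen(req, context=ctx) as resp:
        return summarize_faults(json.load(resp))
```

The fault code and DN of each entry can then be looked up in the Cisco APIC System Faults/Events Search Tool to find the recommended action.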