Cisco ACI - Debugging Network Connectivity
We run Cisco ACI on a large-ish platform hosting OpenStack VMs and a number of external network connections. Recently, one of our internal teams reported that they had stopped seeing data from a specific external network. The traffic had simply dropped off one day.
Unfortunately for us, there was little documentation on how this link was configured, and everyone who had originally set up the environment was long gone.
Identifying root cause
- The trusty first port of call for any network debugging is the ping tool. ACI offers iping, which adds some extra functionality, such as pinging through a specific VRF:
# Navigate to the APIC controller
ssh admin@apic1
# SSH onto the leaf node
ssh leaf1
# Ping <ip address> through MyVRF hosted on MyTENANT
iping -V MyTENANT:MyVRF <ip address>
- This quick test revealed that packets could not reach the destination via the specified VRF.
- Next, we checked whether the interface was up:
show interface brief | grep 1/31
The state showed “admin up”, which indicates that the interface was enabled rather than shut down.
- By now it was clear that there was a configuration issue, so we needed to inspect the running config. Remember that the running config can only be viewed from the APIC, which hosts and manages the configuration for all physical components. It works much like a configuration management tool for networking components, ensuring that the specified configuration is maintained across the environment.
# From the APIC:
apic2> show running-config
- We knew that interface eth1/31 was the link previously configured to talk to this network; however, we were not sure how it had been configured.
- We did, however, know the following:
  - The IP of the remote server
  - The VLAN number configured to talk to the network (700)
- Checking the running config and searching for the interface number showed that other VLANs were configured on that interface to talk to another network, but VLAN 700 was missing. Those other VLANs were configured in another VRF.
- Initially we thought that an External Routed Network (L3) might have been configured for this network, and that its deletion might have caused the issue. So we created a new one.
  - This restored connectivity to the remote gateway; however, the remote network still could not access any of the VMs on our internal network.
  - We discovered later that an L3 external routed network establishes connectivity between the two gateways, but it needs something else, such as BGP, to learn and advertise routes (specifically the route for our subnet containing the VMs).
  - Since we had never established a BGP connection with this external network, we knew the connectivity method had to be different.
- Eventually we discovered the missing configuration:
  - An EPG (End Point Group) had been set up to allow communication with the external network (via a Bridge Domain).
  - A contract had been set up to allow all traffic between the EPG for VM instances and the EPG for the external network.
  - The EPG for the VM instances was already associated with a bridge domain containing the subnet used by all VMs.
  - However, the EPG for the external network was not deployed to any Static Ports (i.e. interfaces), which meant that VLAN 700 was not set up on eth1/31 and no communication was possible between the remote and local subnets.
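The running-config check described above (looking for VLAN 700 under eth1/31) can be sketched in a few lines of Python run against a saved copy of the config. This is only an illustration: the `SAMPLE_CONFIG` text, its tenant/EPG names, and the function `vlans_on_interface` are hypothetical, and the exact layout of your APIC's output may differ, so the patterns may need adjusting.

```python
import re

# Hypothetical excerpt of a saved `show running-config` dump. The names and
# the exact line layout are assumptions made for illustration only.
SAMPLE_CONFIG = """\
leaf 101
  interface ethernet 1/30
    switchport trunk allowed vlan 650 tenant OtherTENANT application OTHER_APP epg OtherEPG
    exit
  interface ethernet 1/31
    switchport trunk allowed vlan 710 tenant OtherTENANT application OTHER_APP epg OtherEPG
    switchport trunk allowed vlan 720 tenant OtherTENANT application OTHER_APP epg OtherEPG
    exit
  exit
"""


def vlans_on_interface(config_text, interface):
    """Return the sorted VLAN IDs mentioned inside one interface stanza."""
    vlans = set()
    in_stanza = False
    stanza_indent = 0
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        # A line at the same or a shallower indent ends the current stanza.
        if in_stanza and indent <= stanza_indent:
            in_stanza = False
        # Exact match, so "1/3" does not accidentally match "1/31".
        if stripped == f"interface ethernet {interface}":
            in_stanza = True
            stanza_indent = indent
            continue
        if in_stanza:
            vlans.update(int(v) for v in re.findall(r"\bvlan (\d+)\b", stripped))
    return sorted(vlans)


found = vlans_on_interface(SAMPLE_CONFIG, "1/31")
print("VLANs on eth1/31:", found)
print("VLAN 700 present:", 700 in found)
```

Matching the interface stanza by indentation, rather than grepping for the VLAN number alone, avoids false hits from the same VLAN appearing under a different port.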
We (re)deployed the EPG onto the interface (eth1/31) on both leaves and Bob's your uncle! Traffic flows recovered immediately.
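We made this change through the UI, but the same static port deployment could plausibly be scripted against the APIC REST API. The sketch below only builds the JSON body for a static path binding (the `fvRsPathAtt` object) and does not send anything. The helper names are hypothetical, node 102 is an assumed ID for the second leaf, and the encapsulation mode is left at the APIC default; check all of these against your own fabric before using anything like it.

```python
import json


def static_path_tdn(pod, node, port):
    """Target DN of a single leaf port, e.g. eth1/31 on node 101."""
    return f"topology/pod-{pod}/paths-{node}/pathep-[{port}]"


def static_path_dn(tenant, app_profile, epg, pod, node, port):
    """DN of the static path binding underneath the EPG."""
    tdn = static_path_tdn(pod, node, port)
    return f"uni/tn-{tenant}/ap-{app_profile}/epg-{epg}/rspathAtt-[{tdn}]"


def static_path_payload(tenant, app_profile, epg, pod, node, port, vlan):
    """Build the body that binds an EPG to a port with a VLAN encap."""
    return {
        "fvRsPathAtt": {
            "attributes": {
                "dn": static_path_dn(tenant, app_profile, epg, pod, node, port),
                "tDn": static_path_tdn(pod, node, port),
                "encap": f"vlan-{vlan}",
            }
        }
    }


# Node 102 is an assumption: the article only names Node-101 explicitly.
for node in (101, 102):
    body = static_path_payload(
        "MyTENANT", "MYTENANT_APP", "MyEPG", 1, node, "eth1/31", 700
    )
    # In a real script this body would be POSTed to the APIC after logging in
    # (for example to /api/mo/uni.json); here it is only printed for review.
    print(json.dumps(body, indent=2))
```

Printing the payload first makes it easy to compare against what the UI shows under Static Ports before committing to any change.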
Now to discover who deleted this component in the first place!
Investigating who performed the change and when
In the old days of manual config changes, you would have to manually compare the current config against a historic version from a specific date to see what had changed. And unless you had specific users set up on the switches, auditing was a nightmare.
ACI makes this very easy. You can see all events and audit logs relating to a specific component by navigating to it in the ACI UI.
In our case, the component that was changed was:
- Tenants
  - MyTENANT
    - MYTENANT_APP
      - Application EPGs
        - MyEPG
          - Static Ports
            - Pod-1/Node-101/eth1/31
Clicking on this component will give you an option of viewing:
- Audit Logs
- Events
- Faults
Even if this component is removed and re-added, all historical events/audit logs will be preserved, and displayed when it is re-added.
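The same audit records can likely also be pulled from the APIC REST API, which is handy if you want to script the lookup. The sketch below only builds the query URL for audit log records (class `aaaModLR`) whose affected object is the static port binding. The host name `apic1`, the function `audit_log_url`, and the page size are assumptions for illustration; authentication and the actual HTTP request are deliberately left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# DN of the static port binding from the tree above.
STATIC_PORT_DN = (
    "uni/tn-MyTENANT/ap-MYTENANT_APP/epg-MyEPG/"
    "rspathAtt-[topology/pod-1/paths-101/pathep-[eth1/31]]"
)


def audit_log_url(apic_host, affected_dn, page_size=50):
    """Build a class query for audit records that touched one object."""
    params = {
        # Only records whose affected object is exactly this DN.
        "query-target-filter": f'eq(aaaModLR.affected,"{affected_dn}")',
        # Newest changes first.
        "order-by": "aaaModLR.created|desc",
        "page-size": page_size,
    }
    return f"https://{apic_host}/api/node/class/aaaModLR.json?{urlencode(params)}"


url = audit_log_url("apic1", STATIC_PORT_DN)
print(url)
# Decode the query string again to double-check the filter that will be sent.
print(parse_qs(urlsplit(url).query)["query-target-filter"][0])
```

In the returned records, attributes such as `user`, `created`, `ind` and `descr` should show who made each change, when it happened, and whether it was a creation, modification or deletion.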
Documentation / References
- Cisco Documentation for Deploying EPG on a Specific Port: Cisco Deploy EPG to Specific Port
- Cheat Sheet of useful commands: Cisco ACI Commands Cheat Sheet