Thinking outside the network box
It is amazing how many changes datacenter technologies have undergone in the last 20 years, and network management is a good indicator. Twenty years ago, with largely static physical and storage infrastructures, networks were the most dynamic and challenging part of the environment: there were many configuration options and technologies (L2 and L3 networks, various VLAN protocols, MPLS, optical networks, and so on). Many of these introduced complex performance and availability dependencies, and sophisticated practices were developed to deal with them.
During the first ten years of virtualization, the center of gravity shifted to the datacenter itself, where many formerly static physical boundaries could now be controlled in software; abundant network bandwidth was rarely an issue, and once it was delivered to a configured perimeter, everything was fine.
But now, as the datacenter expands its compute and storage borders, the imperfections of legacy network architectures begin to impact overall performance and availability all over again. Even if the network spine is built on fast non-blocking devices and has plenty of capacity, delivering last-mile bandwidth may uncover unexpected bottlenecks. It is also important to remember that the actions needed to resolve network issues often lie outside the network domain.
Image courtesy of W.R. Koss at siwdt.com
We already looked at the frequent talkers' dilemma: when is it beneficial to keep interacting VMs close to each other versus separating them for better compute and storage performance? If you implement a large spanning VLAN (e.g., using one of the available VXLAN technologies), you may start experiencing the so-called "tromboning effect": two VMs that belong to the same large VXLAN talk to each other and reside on the same site, or even the same host, but the underlying network topology routes their traffic through an edge switch on the other side of a WAN link, introducing huge latencies.
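A toy sketch can make the tromboning effect concrete. All the host names, hop types, and latency figures below are invented assumptions for illustration, not measurements or a real API:

```python
# Hypothetical sketch of detecting a "tromboning" path between two VMs.
# The per-hop latency figures are illustrative assumptions only.

LOCAL_VSWITCH_MS = 0.05   # same-host virtual switch (assumed)
LAN_HOP_MS = 0.5          # a hop within the site (assumed)
WAN_HOP_MS = 40.0         # a hop across a WAN link (assumed)

def path_latency_ms(hops):
    """Sum latency over a list of hop types ('local', 'lan', 'wan')."""
    cost = {"local": LOCAL_VSWITCH_MS, "lan": LAN_HOP_MS, "wan": WAN_HOP_MS}
    return sum(cost[h] for h in hops)

def is_tromboning(vm_a_host, vm_b_host, actual_hops):
    """Two VMs on the same host should talk over the local virtual switch;
    if the actual overlay path is far slower than that, traffic trombones."""
    if vm_a_host != vm_b_host:
        return False
    return path_latency_ms(actual_hops) > 10 * LOCAL_VSWITCH_MS

# Both VMs sit on the same (hypothetical) host "esx-01", yet the overlay
# hairpins through an edge switch across a WAN link and back:
hops = ["lan", "wan", "wan", "lan"]
print(path_latency_ms(hops))                    # 81.0 ms instead of ~0.05 ms
print(is_tromboning("esx-01", "esx-01", hops))  # True
```

The point of the sketch is the ratio: co-located VMs that should see sub-millisecond latency over the hypervisor's virtual switch instead pay the full round trip across the WAN, purely because of where the topology placed the edge switch.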
So what can you do about this? If the VMs need to stay close to each other (say, because performance or compliance requirements demand it), you could try to increase the available link bandwidth: if you use channel groups, you can simply add more ports to compensate for the shortfall. But if you don't, you are limited by the physical capacity of the edge switch and its ports, and the only option left is a hardware upgrade, which can be very expensive.
But if you decide to think outside the box, a more appropriate action could be to move these VMs together, as a whole, to a place where the network link has more bandwidth (e.g., a virtual switch within a hypervisor on the same physical host) and where compute and storage resources are available too. To accomplish this, however, you need to be aware of the network topology, physical constraints (storage and compute cluster boundaries), compliance constraints, and so on. So what used to be strictly a network management problem no longer is! And frankly, none of today's datacenter challenges belongs to a single silo.
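Such a placement decision can be sketched as a simple filter-then-rank step. Everything here, from the host inventory to the "zone" compliance tag, is a made-up assumption used only to show the shape of the logic:

```python
# Hypothetical sketch: pick a host that can co-locate two chatty VMs while
# honoring compute, storage, and compliance constraints. All host data and
# field names below are invented for illustration.

hosts = [
    {"name": "esx-01", "free_cpu": 4,  "free_gb": 32,  "zone": "pci", "link_gbps": 10},
    {"name": "esx-02", "free_cpu": 16, "free_gb": 128, "zone": "pci", "link_gbps": 25},
    {"name": "esx-03", "free_cpu": 32, "free_gb": 256, "zone": "dev", "link_gbps": 40},
]

def pick_host(hosts, need_cpu, need_gb, required_zone):
    """Filter hosts by compute, storage, and compliance constraints,
    then prefer the fattest uplink among the survivors (co-located VMs
    also get the hypervisor's virtual switch between them for free)."""
    candidates = [h for h in hosts
                  if h["free_cpu"] >= need_cpu
                  and h["free_gb"] >= need_gb
                  and h["zone"] == required_zone]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h["link_gbps"])

# The two VMs together need 8 vCPUs and 64 GB, and must stay in the PCI zone:
best = pick_host(hosts, need_cpu=8, need_gb=64, required_zone="pci")
print(best["name"])  # esx-02: esx-01 is too small, esx-03 fails compliance
```

Note that only one of the three constraints in the filter is a network property; the rest come from the compute, storage, and compliance silos, which is exactly why the decision cannot be made inside the network domain alone.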
It is good to think outside the box, but do you know how big your box is and where exactly it sits in your datacenter?
Image source: Ryan Reynolds hatin’ the whole box thing in Buried