Like most things related to the cloud, you’ll often find that the less you start with, the easier things become. Take most corporate networks, for instance. They are security-driven by design, multi-layered by implementation, often include devices from multiple vendors, and typically have network inspection devices of some kind. It’s not strange to see six to nine layers of inspection between a corporate data network and the internet.
Inspection is great for securing a network from external attacks from the internet, and for filtering browsing traffic on its way out of the company. However, it does very little to help users adopt cloud services. Users often brag about how much faster their network is at home than at work, even though their internet capacity at work probably exceeds their home internet connection.
This leads me to believe that a corporate network built to secure internet browsing traffic can create problems for users who are legitimately consuming cloud services. As we begin to unpack the latency issue, focusing on Office 365 will allow us to base our thinking on connectivity principles laid down by Microsoft. Some of these principles will be difficult to implement on a corporate network that has been around a long time and was built with practices lovingly crafted and defended by in-house security teams.
Everybody loves a villain
The nemesis of cloud-based user experience is latency. While we can point to several other factors that contribute to cloud performance in network terms, such as TCP window scaling, session lifetimes, and maximum segment size, none adds up quicker than the most obvious one: network latency.
Latency is the foe we must defeat in order to deliver the best user experience possible. We may be tempted to believe that latency is a measure only applied to network-sensitive traffic such as Microsoft Teams voice or video calls, when in fact it also affects Outlook, OneDrive, and the plethora of Office 365 service portals.
For example, Outlook uses several long-running threads to connect to Exchange Online. When latency is introduced to the connection between Outlook and Exchange Online, at best it will create a slow experience while sending/receiving mail, changing folder views, etc; and at worst it will cause disconnections or force a user to reauthenticate several times.
To attain optimal connectivity and performance, Microsoft recommends several principles, including local network egress, local DNS resolution, avoiding network inspection, and SD-WAN (or split VPN tunnels). This article identifies what I believe to be the most cost-effective approach for implementing changes, and how you can measure the effects before and after.
It’s difficult to argue with physics
Ignoring other factors for a moment, let’s imagine a global or even national WAN which accesses the internet from the corporate head office. Traffic to and from Office 365 is forced to traverse the network between your users and head office until it breaks out at a single location.
Now think about the fact that Office 365 is built around the concept of service front doors, which include front doors for Exchange, SharePoint, Teams, and so on. If you force a user in Los Angeles to access the Internet via New York (and therefore the Office 365 service front door in the same location) this will certainly hamstring the user’s experience, adding well over 100ms worth of latency onto every transaction.
The result is an undeniably slow user experience. VPN solutions which force all internet traffic onto the corporate network, as opposed to excluding Office 365 or cloud destined traffic, are guilty of the same sin. The basic physics of the situation tells us that the quickest way between two points is usually the shortest.
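To put rough numbers on the physics, here is a minimal sketch that estimates the lowest possible round-trip time between two locations from distance alone. It assumes light travels through optical fibre at about 200,000 km/s, uses approximate city coordinates, and treats the path as a straight great-circle line. The function names are my own. Real routes are longer, every device along the way adds delay, and a single Office 365 transaction usually needs several round trips, so treat the result as a floor rather than a prediction.

```python
import math

# Assumption: signals in optical fibre travel at roughly two thirds of the
# speed of light in a vacuum, about 200,000 km/s.
FIBRE_SPEED_KM_PER_S = 200_000.0
EARTH_RADIUS_KM = 6371.0

# Approximate city-centre coordinates (latitude, longitude), for illustration.
LOS_ANGELES = (34.05, -118.24)
NEW_YORK = (40.71, -74.01)


def great_circle_km(point_a, point_b):
    """Haversine distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Best-case round-trip time in milliseconds from propagation delay only."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000


distance = great_circle_km(LOS_ANGELES, NEW_YORK)
floor_ms = min_round_trip_ms(distance)
print(f"{distance:.0f} km, at least {floor_ms:.1f} ms per round trip")
```

Multiplying that floor by the number of round trips in a transaction, and adding the overhead of the WAN or VPN path itself, shows how quickly a hairpinned connection falls behind a local one.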
What’s in a name
The Domain Name System (DNS) allows us to find internet services by name, for example ‘outlook.office365.com.’ What we may not be aware of is that Microsoft load balances namespaces such as outlook.office365.com and sharepoint.com behind several other namespaces, which in turn resolve to several physical locations via their IP addresses. This allows users in their local geography to find and connect to the closest point of presence (or service front door) for their respective service.
Understanding this service design principle encourages us to use local DNS for name resolution, as opposed to the DNS services hosted by a corporate head office or even one of the internet favorites such as 8.8.8.8 (Google) or 1.1.1.1 (Cloudflare). While using Google DNS is convenient, it does very little to empower a location-based lookup for the closest services.
For example, our user in Los Angeles, resolving names through a non-local resolver, is handed a service front door in Washington State as opposed to a local front door in California. Users in New York, London, Cape Town, or other parts of the world will receive a similar result, i.e. a front door in Washington State, as opposed to one in their local state or geographic region.
Why should we care? Well, if traffic egresses locally (meaning it’s using the local ISP and not hairpinning via other means such as WAN or VPN) but connects to services in another part of the world instead of local ones, we add unneeded latency to our user experience.
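One simple way to see which front door a location is being handed is to record the addresses its configured resolver returns and compare them between sites. The sketch below uses only Python’s standard library and whatever resolver the operating system is configured with; the function name is my own, and it does not by itself tell you where an address is physically located.

```python
import socket


def resolve_addresses(hostname):
    """Return the sorted, unique IP addresses the system resolver gives for hostname."""
    results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for *_, sockaddr in results})


# Usage (requires network access), run from each site you want to compare:
#   print(resolve_addresses("outlook.office365.com"))
```

Running this from a branch office using local DNS, and again using the head office resolver, gives you two address lists to compare with a latency test to each.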
Looking under the hood
Inspecting traffic can become the single most significant source of latency when consuming cloud-based services. Network inspection exists to tell us ‘Who did What and When’ for users browsing the internet. Using mechanisms like proxy servers to inspect traffic bound for YouTube, Facebook, or the internet in general allows us to understand what users do and restrict them where needed based on corporate policy.
With a multi-vendor security policy, we may find DLP technologies deployed to our users’ workstations, proxy servers performing SSL inspection, firewalls inspecting network traffic, and intrusion detection services silently looking for attack patterns in network traffic. Outside of our network, at minimum, we use a DDoS vendor to host our DNS records.
Unfortunately, each of the technologies I’ve mentioned increases latency. They often force users to authenticate in some way so that the source of the traffic is understood first, and only then perform their inspection before routing the traffic to the next point in the corporate network.
What makes the situation even worse is that these services are often built to support a single user performing a single task, such as browsing a website, as opposed to Outlook opening and maintaining many persistent service connections. Staying with Outlook for a second, opening multiple shared calendars, such as resource calendars or several executive calendars, multiplies those connections and can overwhelm the inspection layer entirely.
On top of that, proxy servers can unpredictably run out of threads, and firewalls can forcibly end long-running sessions, which in turn forces Outlook to reauthenticate many times.
Let’s take into consideration the fact that nothing is free, including any kind of inspection, because latency increases with every layer of inspection we add. This can quickly build up to a random latency component of between 10 and 200 milliseconds per transaction, with a random multiplier added depending on load and time of day. This makes network performance troubleshooting extremely difficult.
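The way these layers stack up can be sketched with some simple arithmetic. The layer names and per-layer figures below are purely illustrative assumptions of mine, not measurements; the point is only that the best and worst cases both grow with every layer, and the gap between them grows with every round trip a transaction needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InspectionLayer:
    name: str
    min_ms: float  # added delay on a quiet day (illustrative)
    max_ms: float  # added delay under load (illustrative)


def added_latency_range(layers, round_trips=1):
    """Return (best_case_ms, worst_case_ms) added by all layers combined."""
    low = sum(layer.min_ms for layer in layers) * round_trips
    high = sum(layer.max_ms for layer in layers) * round_trips
    return low, high


# Hypothetical figures chosen only to illustrate the accumulation.
EXAMPLE_LAYERS = [
    InspectionLayer("endpoint DLP agent", 1, 10),
    InspectionLayer("proxy with SSL inspection", 5, 80),
    InspectionLayer("firewall", 2, 30),
    InspectionLayer("intrusion detection", 2, 40),
]

print(added_latency_range(EXAMPLE_LAYERS))                 # one round trip
print(added_latency_range(EXAMPLE_LAYERS, round_trips=3))  # three round trips
```

With real measurements substituted for the placeholder values, the same calculation gives a quick estimate of what removing a single layer would buy you.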
The quickest way to increase performance in this scenario is to exclude Office 365 traffic from inspection for critical service namespaces. Microsoft has pre-prioritized these namespaces for us across the significant service families, including Exchange, SharePoint, and Teams. Public guidance is available for generating PAC files for proxy server exclusion, as well as for understanding which namespaces and IP ranges are critical to achieving the best possible performance. These public endpoints change from time to time, so it’s worth investing in change notification using mechanisms like Microsoft Flow or whatever your favorite method is to consume Web Services.
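As a rough illustration of consuming those published endpoints, the sketch below reads the JSON from Microsoft’s Office 365 IP Address and URL web service, keeps the entries whose category is "Optimize", and turns their URLs into a small PAC file that sends them direct. The URL and the `category` and `urls` field names reflect my understanding of that service, so check them against the current documentation. The proxy address and the function names are hypothetical placeholders.

```python
import json
import urllib.request
import uuid

# Assumption: worldwide instance of the Office 365 endpoints web service.
ENDPOINTS_URL = "https://endpoints.office.com/endpoints/worldwide?clientrequestid={guid}"


def fetch_endpoints():
    """Download the current endpoint sets (requires internet access)."""
    url = ENDPOINTS_URL.format(guid=uuid.uuid4())
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def optimize_hosts(endpoint_sets):
    """Collect the unique URL patterns from entries categorized as Optimize."""
    hosts = set()
    for entry in endpoint_sets:
        if entry.get("category") == "Optimize":
            hosts.update(entry.get("urls", []))  # some entries list only IPs
    return sorted(hosts)


def build_pac(hosts, proxy="PROXY proxy.example.com:8080"):
    """Build a PAC file that bypasses the (placeholder) proxy for the given hosts."""
    lines = ["function FindProxyForURL(url, host) {"]
    for pattern in hosts:
        lines.append(f'    if (shExpMatch(host, "{pattern}")) return "DIRECT";')
    lines.append(f'    return "{proxy}";')
    lines.append("}")
    return "\n".join(lines)


# Usage (requires network access):
#   print(build_pac(optimize_hosts(fetch_endpoints())))
```

Keeping the download separate from the filtering and PAC generation makes it easy to run the fetch on a schedule and regenerate the file only when the published data changes.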
Measuring before and after
In a previous article, Paul Robichaux wrote about one of the public-facing tools offered by Microsoft, specifically https://connectivity.office.com/. This connectivity analyzer provides a technical view of network connectivity between the user and Office 365. It will call out specific items such as proxy usage, network routing, DNS usage, closest service front-door acquisition, and more.
For a technical person who can understand the output and needs to prove to a manager that proxy inspection and other factors such as hairpinned internet egress are impacting performance, this tool is invaluable. While troubleshooting, you will use many other tools such as Fiddler, Wireshark, and command-line utilities. However, https://connectivity.office.com/ is particularly useful for creating a before/after view, and for measuring the specific impact of a change like DNS or ISP peering on finding the closest front door.
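If you want a quick number of your own to record alongside the analyzer’s report, a small script that times TCP connections to a service endpoint is one option. This is a minimal sketch with function names I have made up. It measures only connection setup time (including the DNS lookup when a hostname is given), not the full application experience, so use it for comparing before and after rather than as an absolute measure.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port=443, timeout=5.0):
    """Time a single TCP connection to host:port, in milliseconds."""
    start = time.perf_counter()
    # create_connection resolves the name (if any) and completes the TCP handshake.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000


def sample_latency(host, port=443, attempts=5):
    """Take several samples and summarize them."""
    samples = [tcp_connect_ms(host, port) for _ in range(attempts)]
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Usage (requires network access):
#   print(sample_latency("outlook.office365.com"))
```

Run it at the same time of day before and after a change, and keep the minimum and median figures; the maximum is often dominated by a single slow sample.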
Conclusion
Little value exists in inspecting the network traffic to a trusted cloud location, especially when other mechanisms are capable and in place to report on specific behavior in the cloud service. Those mechanisms can tell us ‘Who did What and When,’ whereas proxy servers or firewalls can only report that a particular user managed to connect to an address like outlook.office.com.
Network latency is the number one issue worth pursuing when it comes to improving Office 365 service performance. Adopting the changes needed to lower the latency caused by security inspection is sure to cause intense arguments between security and IT teams, but change might be possible when pain points are high enough amongst VIP users, forcing advocates of network inspection to reconsider their position.