Radio Resource Management, or RRM, is a key tool for large multi-site organizations to
efficiently manage their RF spectrum. Legacy controller-based implementations build
their channel plan on how the APs hear each other. Usually late at night, decisions on channel and power are then made and implemented.
The frustration we hear from our large customers is that these systems focus solely on channel reuse, don't take changing conditions during the day into account, and then overreact for no clear reason. We listened, and about two years ago we completely redesigned RRM.
Instead of just following the how-the-APs-hear-each-other vector, we wanted to take the user experience into account. We already had the capacity SLE, or service level expectation, which is an actual measurement of every user minute: whether each user had enough usable RF capacity available, taking into account client count, client usage (aka bandwidth hogs), and Wi-Fi and non-Wi-Fi interference. So we implemented a reinforcement learning-based feedback model.
We monitor the capacity SLE to see whether a channel and/or power change actually made things better for the users, or whether it had no impact. We train the system on these types of changes and validate them against the capacity SLE to make sure there was a measurable improvement. This auto-tuning continues on an ongoing basis.
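As a rough illustration of that feedback loop, here is a minimal sketch in Python. It is not Mist's implementation: get_capacity_sle and apply_change are hypothetical hooks, and the simple incremental value update only stands in for the production reinforcement learning model.

```python
# Minimal sketch of an SLE-driven feedback loop (illustrative, not Mist's code).
from collections import defaultdict

# Running value estimate per (site, action) pair, e.g. ("site-1", "ch36->ch149")
action_value = defaultdict(float)
LEARNING_RATE = 0.1

def evaluate_change(site, action, get_capacity_sle, apply_change):
    """Apply a candidate channel/power change and score it by the SLE delta."""
    sle_before = get_capacity_sle(site)    # fraction of good user-minutes
    apply_change(site, action)             # e.g. move the AP to a new channel
    sle_after = get_capacity_sle(site)     # re-measure after the change settles
    reward = sle_after - sle_before        # positive if users actually got better service
    # Incremental update so actions that helped are preferred on the next trigger
    key = (site, action)
    action_value[key] += LEARNING_RATE * (reward - action_value[key])
    return reward
```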
Rather than setting 50 or more different thresholds based on the raw metrics available from some vendor's controller-based system, we know from experience that there is no perfect value that works across all environments. Each environment is different, and probably not even consistent over the course of a single day. Picking static values and letting the system just run isn't feasible and won't scale.
If the capacity SLE is showing 90%, then there isn't much to gain by making changes. The client usage classifier tracks excess bandwidth hogging by certain clients: if we see a two-sigma deviation in bandwidth usage among clients, the higher-usage clients get flagged in the client usage classifier. If the bandwidth usage is pretty much ubiquitous across all clients, it is counted in the client count classifier instead. These two events would not cause a channel change, but they are visible in Marvis. If the capacity SLE is taking a hit based not on client usage but on Wi-Fi or non-Wi-Fi interference, then your end-user experience is genuinely degraded, and that is the kind of hit RRM acts on.
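As a concrete illustration of the two-sigma split described above, here is a small sketch. It is not Mist's actual classifier; the function name and the example usage numbers are invented for illustration.

```python
# Illustrative sketch of the two-sigma "bandwidth hog" split (not Mist's classifier).
import statistics

def classify_capacity_hit(client_usage_mbps):
    """Attribute a capacity hit to a few heavy clients or to sheer client count."""
    mean = statistics.mean(client_usage_mbps)
    sigma = statistics.pstdev(client_usage_mbps)
    # Clients more than two standard deviations above the mean are "hogs"
    hogs = [u for u in client_usage_mbps if u > mean + 2 * sigma]
    if hogs:
        return "client_usage_classifier", hogs            # a few heavy users
    return "client_count_classifier", client_usage_mbps   # load spread evenly

print(classify_capacity_hit([2, 3, 2, 4, 3, 45]))   # one obvious bandwidth hog
print(classify_capacity_hit([8, 9, 10, 9, 8, 10]))  # uniformly busy clients
```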
Our system is agile and dynamic. Rather than just setting min-max ranges and being
purely focused on channel reuse, we can let the system learn and adapt based on what the
end-users are experiencing. This is the underlying architecture for Mist AI-driven RRM.
Let's take a look at the available configuration options. You can choose the power range and your list of channels. These are the only things exposed, as everything else is auto-baselined, so you don't need to set a bunch of thresholds on each of your different sites.
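As an illustration of how little there is to configure, here is a sketch of pushing a site-level radio configuration through the API. The endpoint path and field names are assumptions made for the example, not a confirmed schema, so check the Mist API documentation before relying on anything like this.

```python
# Illustrative sketch of setting the power range and channel list for a site.
# Endpoint and field names are assumptions for this example, not the exact schema.
import requests

SITE_ID = "your-site-id"
radio_config = {
    "band_5": {
        "channels": [36, 40, 44, 48, 149, 153, 157, 161],  # allowed channel list
        "power_min": 8,    # dBm lower bound RRM may choose
        "power_max": 14,   # dBm upper bound RRM may choose
    },
    "band_24": {
        "channels": [1, 6, 11],
        "power_min": 5,
        "power_max": 11,
    },
}

resp = requests.put(
    f"https://api.mist.com/api/v1/sites/{SITE_ID}/setting",
    headers={"Authorization": "Token <your-api-token>"},
    json={"radio_config": radio_config},
)
resp.raise_for_status()
```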
The system will self-learn per site based on the capacity SLE. Mist has implemented RRM as a two-tier model. The first tier is global optimization, which runs once a day: it collects data throughout the day on an ongoing basis, creates a long-term trend baseline, and then, every day around 2 or 3 a.m. local time, makes changes if those changes are warranted. The second tier is event-driven RRM, or as we call it internally, local RRM. It is driven by the capacity SLE and acts immediately upon any deviation from baseline.
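The split between the two tiers could be sketched roughly like this. It is a conceptual illustration only; the deviation threshold and the helper callables are assumptions.

```python
# Conceptual sketch of the two RRM tiers (illustrative, not Mist's implementation).
BASELINE_SLE = {}            # per-site long-term capacity-SLE baseline
DEVIATION_THRESHOLD = 0.10   # assumed: act if the SLE drops 10 points below baseline

def global_rrm(site, daily_sle_history, apply_channel_plan):
    """Tier 1: runs once a day (around 2-3 a.m. site local time) on trend data."""
    BASELINE_SLE[site] = sum(daily_sle_history) / len(daily_sle_history)
    apply_channel_plan(site)                 # re-plan only if the trend warrants it

def local_rrm(site, current_sle, remediate):
    """Tier 2: event-driven, reacts immediately to a deviation from baseline."""
    baseline = BASELINE_SLE.get(site)
    if baseline is not None and current_sle < baseline - DEVIATION_THRESHOLD:
        remediate(site)                      # e.g. change the affected AP's channel now
```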
So, both of these are running in parallel. Conventional systems aren't able to leverage the compute available in the cloud to constantly crunch long-term trend data, nor can they cross-pollinate information across all your different sites, client types, and RF environments. An example would be buildings around an airport where we have seen radar hits triggering DFS events. The cloud learns the geolocation and the specific frequencies of these events and then cross-pollinates that learning to other sites that may also be close to that airport. Existing systems have no memory and no concept of long-term trend data; they just make changes once a day.
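A rough sketch of that cross-pollination idea, with an assumed 50 km proximity rule that is purely illustrative and not Mist's actual logic:

```python
# Sketch of sharing DFS-event learnings across nearby sites (illustrative only).
import math

def km_between(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def share_dfs_learnings(sites, dfs_events, radius_km=50):
    """Deprioritize channels with radar hits at every site near the event."""
    risky = {}  # site_id -> set of channels to push down the pecking order
    for event in dfs_events:                  # e.g. {"location": (lat, lon), "channel": 120}
        for site in sites:                    # e.g. {"id": "s1", "location": (lat, lon)}
            if km_between(site["location"], event["location"]) <= radius_km:
                risky.setdefault(site["id"], set()).add(event["channel"])
    return risky
```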
Here you can see events happening throughout the day. All of the events with a description are event-driven, and the scheduled ones are the optimizations that happen at night. Some systems try to implement a pseudo-local, event-type RRM, usually interference-based. The problem we see with those over time is drift: because there's no learning going on, you'll eventually need to manually rebalance the system, clear the drift, and start all over again. The reason is that there's no memory of what happened, nor the compute to understand context and learn from it.
Mist RRM might also try to make a similar channel change, but first we go back and look at the last 30 days. Even though these three available channels look great now, we know one has had multiple issues in the past, so we move that one to the bottom of the pecking order.
This makes our global RRM less disruptive than any legacy implementation. Using DFS as an example, clients don't respond well to DFS hits: they might not scan certain channels, and they might make poor AP choices. In our implementation, we reorder the channels into a pecking order based on what we've seen in that environment over time, so certain channels are automatically prioritized. You might see channels that appear to be a good choice based on current channel and spectrum utilization, but we know there is a high risk of DFS hits based on what we've learned over time, so those channels are deprioritized.
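A hypothetical sketch of that history-aware pecking order; the scoring weights are assumptions for illustration, not Mist's actual formula.

```python
# Rank channels on current utilization, then penalize ones with a 30-day
# history of DFS hits or other issues (illustrative weighting only).

def rank_channels(current_utilization, incidents_last_30d):
    """Return channels ordered best-first.

    current_utilization: {channel: fraction busy right now, 0.0-1.0}
    incidents_last_30d:  {channel: count of DFS hits / issues in the last 30 days}
    """
    def score(channel):
        utilization = current_utilization.get(channel, 1.0)
        history_penalty = 0.5 * incidents_last_30d.get(channel, 0)
        return utilization + history_penalty       # lower is better

    return sorted(current_utilization, key=score)

# Channel 120 looks clean right now but has a history of radar hits,
# so it drops to the bottom of the pecking order.
print(rank_channels({36: 0.30, 120: 0.05, 149: 0.25}, {120: 3}))
```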
This is truly a self-driving system, and it's not solely focused on channel reuse. Stepping back, legacy RRM systems lack the tools to measure whether things actually got better for your users.
With Mist, the capacity SLE is exactly that measurement that you've never had. If the
capacity SLE takes a hit and it's due to Wi-Fi or non-Wi-Fi interference and RRM is not
able to make any changes, then you obviously know there's something in your environment
you need to take a look at. Or if RRM is making changes and things are not getting
better, then you have some other issue that needs to be addressed.
But at least you know. Being able to quantify that the system is getting better is super important, especially once you start deploying a lot more devices. Today's requirements may not warrant this level of sophistication, but once you start throwing a lot of IoT devices and other unsophisticated RF devices on the network, our system will learn to accommodate them. To see the channel distribution, you can take a look at this graph.
This is from our office, and it's not a perfect RF environment. This graph shows you what the channel distribution looks like, but when you have hundreds of thousands of APs and thousands of sites, you need automation that baselines and monitors using metrics you trust. What we've done is add these top-level metrics into RRM. So instead of polling all of your APs and manually inspecting channel assignments, you can simply use our API to pull a single metric. We have a distribution score and a density score, an average number of co-channel neighbors, and an average number of neighbors.
So if you have a standard deployment policy that an installer did not follow, you will immediately see that the site isn't in compliance based on these values. You can pull this from the API and create a post-deployment report, and if any of these metrics are deviating, you will know exactly where to focus.
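For example, a post-deployment report could be assembled from these site-level metrics roughly like this. The endpoint path and metric field names are assumptions for illustration; the real resource names live in the Mist API documentation.

```python
# Sketch of pulling site-level RF metrics for a post-deployment report.
# Endpoint path and field names below are assumed for this example.
import requests

API = "https://api.mist.com/api/v1"
HEADERS = {"Authorization": "Token <your-api-token>"}

def site_rf_report(site_id, expected_max_co_channel=2):
    metrics = requests.get(f"{API}/sites/{site_id}/rrm/metrics",  # assumed path
                           headers=HEADERS).json()
    report = {
        "density_score": metrics.get("density_score"),
        "distribution_score": metrics.get("distribution_score"),
        "avg_co_channel_neighbors": metrics.get("avg_co_channel_neighbors"),
        "avg_neighbors": metrics.get("avg_neighbors"),
    }
    # Flag sites that deviate from the standard deployment policy
    report["in_compliance"] = (
        report["avg_co_channel_neighbors"] is not None
        and report["avg_co_channel_neighbors"] <= expected_max_co_channel
    )
    return report
```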
These SLEs and metrics are available on an ongoing basis. Compare this with existing vendors, where you would have to pull raw metrics and create your own formula to decide whether you need to take any action. We don't want you to pull raw data; we just want you to use site-level metrics. And if you want to maintain your own reports, we have already done the dedupe and aggregation for you. From a deep troubleshooting perspective, "why is this AP on a particular channel?" is a common question when chasing an RF issue that you suspect is due to Wi-Fi interference.
Each Mist AP has a dedicated radio that scans all the channels all the time and continually maintains a score for each of the channels it scans. This is the data that RRM uses to score the channels, so whenever it gets a trigger from the capacity SLE to make a change, it uses these AP and site scores to determine which channel to assign. If an AP is on a channel that doesn't seem optimal, you can look right here and then at the capacity SLE to see if the decision-making makes sense. If the SLE doesn't show a user hit, that explains why the AP hasn't changed channel yet; it will defer to the global plan and make the change at night.
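A simplified sketch of that trigger-to-selection flow; the 50/50 blend of AP and site scores is an assumption, not the actual weighting.

```python
# Pick the best allowed channel when the capacity SLE triggers a change
# (illustrative only; real scoring comes from the dedicated scanning radio).

def choose_channel(ap_scores, site_scores, allowed_channels):
    """Return the cleanest allowed channel.

    ap_scores:   {channel: score} from this AP's dedicated scanning radio
    site_scores: {channel: score} aggregated across the site
    Higher score = cleaner channel.
    """
    def blended(channel):
        return 0.5 * ap_scores.get(channel, 0) + 0.5 * site_scores.get(channel, 0)

    return max(allowed_channels, key=blended)

# Example: channel 149 is only average at this AP but strong site-wide.
print(choose_channel({36: 0.6, 149: 0.7}, {36: 0.4, 149: 0.9}, [36, 149]))
```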
If there had been user impact, the system would have made the change right away. In short, we have a self-driving, reinforcement learning-based RRM implementation. At the same time, we're also providing you with visibility into the decision-making process so you can validate the decisions made by RRM. You also have the ability to pull information at scale via our APIs and maintain baseline and trend data for all your sites. This is valuable if you're asked to deploy a bunch of new devices and the question comes up: hey, do we have the capacity to support this? With the baseline and trend information, you can make informed decisions without having to pull all kinds of raw data and make a guess.
Typically, you want to make adjustments in two- to three-dBm increments so you have enough wiggle room. Unlike Cisco and Meraki, we go up and down in increments of one, so there's more granularity, but as best practice suggests, we always give it a range of plus or minus three dBm from a median value, typically the target used by your site survey's predictive design.
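A tiny worked example of that range, assuming an 11 dBm predictive-design target (the target value itself is just an example).

```python
# Derive a power range from a predictive-design target: +/-3 dBm in 1 dBm steps,
# mirroring the practice described above.

def power_range(design_target_dbm, margin_db=3, step_db=1):
    """Return the list of power levels RRM may choose."""
    return list(range(design_target_dbm - margin_db,
                      design_target_dbm + margin_db + 1,
                      step_db))

print(power_range(11))   # [8, 9, 10, 11, 12, 13, 14]
```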
We had one customer ask us why their coverage SLE was only 99% when they had excellent coverage in their warehouse, which was full of Wi-Fi-driven robots. In the
past, when there was a robot problem, the client team would inevitably blame the
infrastructure team.
The infrastructure team would request detailed logs from the client team, and most of the time that led to no action. When Mist was installed and we saw the 99% coverage SLE, we looked at the affected clients, and it always seemed to be the same robot. When they asked the client team about it, they said, yeah, that robot has always been a little quirky.
So when they took the robot apart, they found a damaged antenna cable. This was eye-opening for this customer, and their quote to us was: you guys solved the needle-in-the-haystack problem. The coverage SLE is a powerful tool.
At another customer, a driver update was pushed to some of their older laptops. They have over 100,000 employees, so they did a slow rollout, but they started getting Wi-Fi complaints almost right away. Their laptops are configured with both 5 GHz and 2.4 GHz profiles because each of their sites is a little different in its capabilities. The update caused laptops to choose 2.4 GHz when they normally would have chosen 5 GHz, so the SLEs immediately showed a significant deviation from baseline that correlated with those specific device types and the sites that were having the problem. They stopped the push immediately because the correlation was obvious.
This customer told us that in the past, they would have asked a user to reproduce the problem so they could collect the telemetry they needed to diagnose it. Now they realize Mist already has the telemetry needed to tell them they have a growing problem and what that problem is, saving them a ton of time. That is the power of Mist AI RRM.