AIOps in Action
SUMMARY To gain a deeper understanding of AI-native operations, watch how an operations engineer, François, troubleshoots user issues. Compare different approaches including the Marvis conversational assistant and the Service Level Expectations (SLE) dashboard. See where to go for technical details, audit logs, and dynamic packet captures.
Scenario 1: Troubleshooting with Marvis Queries
In this scenario, François uses Marvis queries for help with troubleshooting.
- Often, you can get the information you need with only a basic query.
- Optionally, you can make a few extra clicks to view more details.
- If more questions come to mind while you're troubleshooting, you can refine the query.
- If you want more technical information, you can easily navigate to other Juniper Mist pages to investigate further.
Entering a Basic Query
To get started, François enters a basic query. Marvis provides fact-based, action-oriented answers in plain English. François quickly gets the insights that he needs to address the issue.
So a user just called me and she's telling me that you know the Wi-Fi doesn't work and she cannot get to her emails. So what I'm going to do now in the Mist dashboard is use Marvis to do some reactive troubleshooting. So we know we're having an issue, the user contacted us and now we can ask Marvis to, you know, to see what's wrong and get to the root cause of that potential issue.
So in the Mist dashboard, I can go back to the Marvis tool on the left here and I can click on Marvis Actions, and then, all the way on the top right there is a button called Ask a Question, and I can use this menu to ask Marvis a question. So, now I know that I have a problem with a specific client and what I could simply do is ask Marvis to troubleshoot that specific issue. So I can enter a specific query and you'll see if you click on it, it will pop up the different queries available.
The one I want to use here to do reactive troubleshooting will be the troubleshoot query, and then if I press space, it will ask me for the context: do you want to troubleshoot a site, an access point, or a client? In this case, I'm going to go to the client tab and the MAC address that I want to troubleshoot is this one here, the B280, and I can click on it and here I'm already learning a piece of information. I'm learning that the user is using a Pixel 7 and I can always go back and forth with the user and validate that piece of information.
And so, I can simply ask Marvis can you please troubleshoot that specific device and then press enter. And so Marvis here in the background it will gather all of the information that it knows about that specific user and it will kind of summarize what it thinks is happening with that specific device. And here you can see that, apparently, we have a problem with the connection.
It says that the successful connect is failing and what Marvis is giving us is pretty much, you know, what the problem is in plain English. So it's telling us that the client failed to connect a hundred percent of the time due to DHCP failures on the Corp Wi-Fi WLAN and on the 5 gigahertz band because of an unresponsive DHCP server. The problem is affecting a small number of clients.
Most failures across the site occurred on 5 gigahertz and the client is currently offline. So here we are learning a few pieces of information. We're learning that the client is still offline.
It's still not working for them and we're learning that apparently we have some sort of DHCP issue, right? And this is the power of Marvis. The power of Marvis is that it gives you in plain English what the problem is. You don't have to go through a whole bunch of different menus to get the information.
You can simply get the information here.
Viewing More Details
Continuing this scenario, François clicks the Investigate button to learn more.
If you do want to get more information and drill down a little bit more in that data, you can always click on the Investigate button here and Marvis will show you how it got access to that information. You can look at the service levels. So these are the SLEs.
So you can see that we had a couple of attempts and it gives you like the time and the date. It will tell you, you know, why it failed. So you have multiple reasons why a connection can fail.
In this case, 100% of the time that was because of a DHCP issue. Then, it will try to do some correlation to understand, you know, is it a client issue? Is it an AP related issue? Is it a WLAN related issue? And show you, you know, like the probable causes here. In this case, everything is 25% because we only had one client issue.
So, we cannot really learn much here in this case. Then it will tell us, you know, when changes have been made. You know, we had the RRM change.
We had some admin changes here. And then, the events, this one is interesting. It will take all of the client events, everything that happened with that specific client.
And it will, you know, kind of show us on a timeline when we had a bad experience. In this case, you can see we had the bad experiences, you know, around 7:10 AM this morning, 7:20 AM this morning. And we can go back and take a look at these events here.
Refining Your Query
François refines the query to focus on a specific timeframe.
What I can even do here, when I ask Marvis to troubleshoot the device, is I can specify a timeline. So I can say troubleshoot Pixel 7 during, let's say this week, and we can try to get more information that way. What Marvis is telling us is pretty much the same thing here.
We're having an unresponsive DHCP server. And when I investigate, if I go see my events, we'll be able to see what happened this week. And you can see that, indeed, we're having problems this morning, but we also had problems at 10 a.m. on Monday, actually, a couple of days ago.
And so we can see what happened across the week. We can go back in time, and we can also look at what changed, what happened on the network. Did we have any changes? Here it looked like we had some configuration changes when the first problem occurred.
And sometimes that can help us to understand where to look next to figure out what the problem could be.
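The idea of lining up network changes against the first failure can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something Marvis exposes: the function name `changes_before`, the 24-hour window, and the sample dates and descriptions are all assumptions made up for the example.

```python
from datetime import datetime


def changes_before(first_failure, changes, window_hours=24):
    """Return changes made within window_hours before first_failure, newest first.

    changes is a list of (timestamp, description) tuples. Both the shape of
    the data and the default window are assumptions for this sketch.
    """
    hits = []
    for when, description in changes:
        hours_before = (first_failure - when).total_seconds() / 3600
        if 0 <= hours_before <= window_hours:
            hits.append((when, description))
    return sorted(hits, reverse=True)


# Hypothetical sample data: the dates and descriptions are placeholders.
changes = [
    (datetime(2024, 1, 8, 9, 30), "configuration change"),
    (datetime(2024, 1, 5, 14, 0), "RRM change"),
]
first_failure = datetime(2024, 1, 8, 10, 0)  # a Monday at 10 a.m.

for when, description in changes_before(first_failure, changes):
    print(when.isoformat(), description)
```

Only the change that falls shortly before the first failure is listed, which is the same hint François gets from the timeline: start with whatever changed just before things broke.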
Investigating Further
Now François is curious to see more technical information. He easily navigates to other Juniper Mist pages to investigate client events, WAN edge performance, audit logs, and more.
From here, you know, I have a couple of options. I can either go straight to my unresponsive DHCP server, or if I wanted to get a little bit more information about that specific client, I could take a look at the Client Insights page. So, I can go back to the Monitor service level page and then select the Insights tab.
And then, under the scope here at the top, I can just select client and select the client that I want to study. And same thing here, you know, I can filter it for today, I can filter it for this week, and I can get information about what happened to that client this week. And you can see it kind of correlates to what Marvis was telling us.
And then here, I can kind of drill down into the events. And it looks like for us, like it looks like the events are repeating every time the client is associating to the Wi-Fi network. We can validate the SSID the client is connecting to, the RSSI, it looks like everything is working well here.
But then it looks like we're failing the DHCP, failing the DHCP discover on VLAN 10. We are not getting any response on that VLAN. And then therefore, the client gets disassociated.
So here, we kind of get a little bit more technical information related to what Marvis gave you, but we pretty much get the same, you know, outcome. And the problem seems to be related to the DHCP server. So, in this environment that I'm showing you, the DHCP server is actually the WAN gateway.
So, if I go and look at my WAN Edges here, you can see that I have a WAN gateway here. And this piece of equipment is supposed to provide IP addresses to the Corp Wi-Fi network. So if I wanted to continue the troubleshooting, I could, you know, take a look at that device and see, you know, what's happening.
Under the DHCP statistics, you can see that, you know, we have a DHCP pool for the management VLAN. It looks like we have a DHCP pool for the guest VLAN, but we don't have any DHCP pool for the Corp VLAN here. Right.
So, what we could do here is we could go back to the actual configuration of the WAN Edge, the WAN piece of equipment. In this case, it's an SRX. And if I go back into the configuration template and I scroll down to the IP configurations or the DHCP configurations, it looks like, you know, we don't have any DHCP configurations for the Corp VLAN or the Corp SSID in our case.
Right. Which is, you know, which is required if we wanted to provide IP addresses to the Corp user. So if I wanted to, you know, understand and trace back, OK, you know, how did we get there? How did we not get a Corp DHCP set of configuration for our Corp users? Because it used to work. It's not something that we just set up. It used to work. And then at one point, everything stopped working.
If I want to go one step further and understand this, what I could do is go into the audit logs. Under the organization menu, I can go to the audit logs and this will tell me what happened and who changed what in my organization. And here, you know, I can take a look at what happened yesterday, for instance.
You can see that we had an update on the WAN Edge template. And if I click on the details here, I can see that it looks like we used to have some DHCP configuration for the Corp network and it has been wiped out. So it looks like someone, someone called François, I wonder who that is, completely removed the configuration, the DHCP configuration for the Corp SSID.
Right. So now it's starting to make sense. You know, we had a configuration change. DHCP configuration got removed. And then the users, when they come back and connect to the Wi-Fi network, they're not getting an IP address. Nothing works. They can't get their emails and everything.
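What the audit log detail view shows here is a before-and-after comparison of a template. A minimal sketch of that comparison, assuming the DHCP settings can be represented as a dictionary keyed by network name, might look like the following. The dictionary layout, the `enabled` key, and the function name `removed_dhcp_networks` are hypothetical and do not reflect the Juniper Mist configuration schema.

```python
def removed_dhcp_networks(before, after):
    """Return names of networks that had DHCP settings before a change but not after."""
    return sorted(set(before) - set(after))


# Hypothetical snapshots of a template's DHCP section, before and after the edit.
before = {
    "management": {"enabled": True},
    "guest": {"enabled": True},
    "corp": {"enabled": True},
}
after = {
    "management": {"enabled": True},
    "guest": {"enabled": True},
}

print(removed_dhcp_networks(before, after))
```

With these sample snapshots, the only network reported is `corp`, which matches what François finds in the audit log details.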
Scenario 2: Troubleshooting with Service Level Expectations (SLEs)
In this scenario, François uses Service Level Expectations (SLEs) to get a quick snapshot of all issues affecting user experience and to explore the root causes of these issues.
- Use the SLE dashboard to see how your organization is performing against various success factors. View the Root Cause Analysis for current issues.
- Go to the Client Events page for deeper insights. Download a dynamic packet capture to learn more.
- Investigate further by viewing the technical details for network devices and by checking the audit logs.
- Viewing the SLEs and Root Cause Analysis
- Getting Deeper Insights
- Using Dynamic Packet Captures
- Investigating Further
Viewing the SLEs and Root Cause Analysis
François gets started by going to the SLE dashboard and viewing the Root Cause Analysis for current issues.
So, a user called me this morning and she's telling me that Wi-Fi doesn't work and she cannot get to her emails. And so what I want to do here in the Mist dashboard is try to troubleshoot the problem. And we're going to do this using the SLEs, the service level expectations.
So we're not going to use Marvis here, we're going to use the SLEs. And I'll show you a way that we can get to the bottom of things, to the root cause, using the SLEs instead. And so here, what we're doing, we're doing reactive troubleshooting, a user is complaining and we are going into the dashboard to try to understand what the issue could be.
So if I want to look at the SLEs, I can go back to Monitor and then the service level page. And then at the top, next to Monitor, you'll have a tab called Wireless, right? And so here, if I click on Wireless, I'm going to have access to all of the wireless SLEs. But what I want to do is I want to change the context.
And instead of looking at the SLEs for the entire site, I want to go and select a specific client. So, I can go ahead and select a specific client. I believe this is the client that I'm looking for.
And you can see that now that I've changed the context, you know, the dashboard is only showing me the SLEs of that specific client. And this is very powerful because now we can just focus on one specific client and understand the user experience of that client. And here, you know, if I look at, you know, how this client is doing today, it looks like it was never able to connect to the Wi-Fi.
So, you know, the user experience is very poor. And I can see here that I have a 0% success rate for my successful connect. So it looks like the device was never able to connect to the Wi-Fi.
And if I look at my classifiers on the right, it looks like, you know, 100% of the time it failed because of DHCP related issues. So, what I can do from here is click on the SLE itself, successful connect. And that will bring me to this root cause analysis page where it will provide a little bit more information about that specific SLE and that specific client.
So you can see, you know, successful connect, we have a rate of 0%, then 100% of the time it failed because of DHCP. And here I can even see that we had four attempts to connect from that specific device. And then at the bottom here, I can get access to a little bit more information.
You can see we have some statistics, have a timeline so I can understand when these failures occurred. And here you can see, you know, it was around 7:10, 7:20, 7:50 this morning. I can look at the distribution.
So, distribution will help me to understand, you know, the scope. So here it happened when the client tried to connect to AP01 on the Corp Wi-Fi SSID on the five gigahertz frequency band. And here we don't have any information, but sometimes it would give you the server IP addresses, for instance, in this case, for the DHCP server.
And then we have the correlation tab. So here we'll try to kind of do some correlation and let us, you know, understand if it's a client-specific issue or maybe AP-related issue, or maybe like a site related issue or a WLAN related issue. So, it's trying to score, you know, the scope pretty much of that specific issue.
Here it looks like it's more related to that specific client, even though, you know, the score on the WLAN and the AP is 21 percent. And then I can look at the summary here. This is, I guess, a summary written in plain English telling us, you know, what the system thinks the problem is.
And here it's telling us that the client failed to connect on 100 percent of attempts, primarily due to DHCP problems. The problem is client specific, with most client failures occurring on AP number one and the Corp Wi-Fi SSID. And so, here you can see that using the SLEs, I'm able to start at the beginning and then drill down based on the score of these different SLEs, drill down and understand what the root cause could potentially be.
So here, you know, if I believe what the system is telling me, it looks like the client was not able to connect to the Wi-Fi because it was never able to get an IP address from the DHCP server.
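The percentages in this walkthrough (a 0% success rate, with 100% of failures classified as DHCP, across four attempts) are simple ratios. The sketch below shows that arithmetic only. It is not how Juniper Mist computes its SLEs internally, and the function name and the list-of-reasons input format are assumptions for the example.

```python
from collections import Counter


def successful_connect_summary(attempts):
    """Summarize connection attempts.

    attempts is a list with one entry per attempt: None for a success,
    or a failure reason string. Returns (success_rate_percent,
    {reason: percent_of_failures}).
    """
    total = len(attempts)
    failures = [reason for reason in attempts if reason is not None]
    success_rate = 100.0 * (total - len(failures)) / total if total else 0.0
    classifiers = {
        reason: 100.0 * count / len(failures)
        for reason, count in Counter(failures).items()
    }
    return success_rate, classifiers


# Four attempts, all failing on DHCP, as in the scenario above.
attempts = ["DHCP", "DHCP", "DHCP", "DHCP"]
rate, classifiers = successful_connect_summary(attempts)
print(rate, classifiers)
```

Note that the classifier percentages are shares of the failed attempts, not of all attempts, which is why a single classifier can read 100% regardless of how many attempts were made.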
Getting Deeper Insights
Now François wants to see technical information about client events. Here, he also sees that a dynamic packet capture is available to download.
Now, if I want to get more information about this, what I can do is I can go to the Insights page. So from this root cause analysis page, because you have the client context specified here, you know, I have this View Insight button that will bring me back to the Insights page, having that very specific device selected. And here you can see that we're getting a little bit more information about this client.
You can see that we have some failed attempts, some failed events. You can see it on the timeline here. And if I look at the client events, I can try to study a little bit more in detail what happened to that client.
And it looks like we're having a pattern here, right? It looks like it's connecting and it's green, which means it's working properly. It's connecting to the Corp Wi-Fi SSID. Looks like it has a good signal, successful connect on VLAN 10.
And then after that, though, we have a bad event and it's telling us that, you know, DHCP timed out, right? And the description is telling us that the DHCP discover from that specific MAC address is failing on VLAN 10. And it's telling us that, you know, we are not able to find a DHCP server. So the client is sending its DHCP discover, but it's not getting any answer back from any DHCP server.
And then after that, a little bit after that, you can see that, you know, the client is being disassociated because, you know, it didn't get an IP address. So it decided to just disconnect from that Wi-Fi network. And then it tries again, right? You can see that it's like a cycle.
It tries and every time it tries, it fails. Now, one thing that's interesting here in the event is sometimes you're going to have a little paper clip and that paper clip indicates that you have a packet capture that's available for download. And that's a dynamic packet capture that has been uploaded to the Mist cloud from the access point when the access point is noticing a bad event.
And as an operator, I can just click, you know, on that download packet capture and I can download that PCAP file. And later on, you know, I can retrieve it and open it with Wireshark.
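The repeating pattern in the client events (associate, DHCP times out, disassociate, try again) is easy to spot by eye, and it can also be counted mechanically. The following sketch assumes the events have already been exported as a chronological list of labels. The labels `association`, `dhcp_timed_out`, and `disassociation` are hypothetical names chosen for the example, not the event identifiers used by Juniper Mist.

```python
# Hypothetical event labels describing one failed connection cycle.
FAILED_CYCLE = ("association", "dhcp_timed_out", "disassociation")


def count_failed_cycles(events, pattern=FAILED_CYCLE):
    """Count non-overlapping, contiguous occurrences of pattern in events."""
    count = 0
    i = 0
    n = len(pattern)
    while i + n <= len(events):
        if tuple(events[i:i + n]) == pattern:
            count += 1
            i += n  # skip past the matched cycle
        else:
            i += 1
    return count


# Three back-to-back failed cycles, in chronological order.
events = ["association", "dhcp_timed_out", "disassociation"] * 3
print(count_failed_cycles(events))
```

A count that matches the number of connection attempts suggests that every attempt ended the same way, which is what the timeline shows in this scenario.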
Using Dynamic Packet Captures
François opens the packet capture in Wireshark and analyzes the data.
So, if you want to look at the PCAP frame here, I've downloaded it and opened it in Wireshark. You can see, you know, the frame exchange between the client and the AP. And if you do some Wi-Fi analysis, you can see that we have the authentication, the association. All of this seems to have worked properly. We also have the four-way handshake here. So, it looks like the person had the right password to connect to the Wi-Fi.
And then we can see, if we look at the frames down the list here, that we have some neighbor solicitation. This is usually, you know, the client looking for an IPv6 address. And then, we have some boot requests, which are the DHCP IPv4 requests. And you can see that we have one request here, but we don't have any responses.
And we have a couple of other requests before the client is disconnected. So we can, you know, if we want to have more information or if we didn't get all of the information we wanted from the dashboard, we can also open the PCAP file and start looking deep into the frames.
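The check François does by eye in Wireshark (boot requests with no matching replies) can be approximated in code. The sketch below assumes you have already extracted the raw DHCP payloads (the bytes carried over UDP ports 67 and 68) from the PCAP file with a tool of your choice; that extraction step is not shown. It relies only on the fixed BOOTP header layout, where the first byte is the `op` code (1 for a request, 2 for a reply) and bytes 4 through 7 hold the transaction ID. The function names are made up for the example, and `fake_bootp` exists only to build demonstration data.

```python
import struct

BOOTREQUEST, BOOTREPLY = 1, 2  # BOOTP op codes


def unanswered_requests(payloads):
    """Return transaction IDs of boot requests that never received a reply."""
    requested, replied = [], set()
    for payload in payloads:
        if len(payload) < 8:
            continue  # too short to contain the op and xid fields
        op, _htype, _hlen, _hops, xid = struct.unpack("!BBBBI", payload[:8])
        if op == BOOTREQUEST and xid not in requested:
            requested.append(xid)
        elif op == BOOTREPLY:
            replied.add(xid)
    return [xid for xid in requested if xid not in replied]


def fake_bootp(op, xid):
    """Build a zero-filled 236-byte BOOTP header for demonstration only."""
    return struct.pack("!BBBBI", op, 1, 6, 0, xid) + bytes(228)


# Demonstration data: retried requests and no replies, as in the capture above.
payloads = [
    fake_bootp(BOOTREQUEST, 0x1111),
    fake_bootp(BOOTREQUEST, 0x1111),
    fake_bootp(BOOTREQUEST, 0x2222),
]
print([hex(xid) for xid in unanswered_requests(payloads)])
```

If every transaction ID in the capture comes back as unanswered, the client is sending requests that no DHCP server responds to, which is consistent with what the dashboard reported.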
Investigating Further
François views technical details for the DHCP server (the WAN Edge) and explores the audit logs.
So here, you know, if I believe what the system is telling me, we're having an issue with the DHCP server. So, the next step in the troubleshooting would be to go back to that DHCP server and then look at the configuration and make sure that it's configured properly. In my environment here, the DHCP server, it's supposed to be the WAN Edge, the gateway.
So, if I go back and look at my WAN Edges, you can see that we have a WAN Edge device here, which is the SRX300. And this device is supposed to be my DHCP server. And so, if I look at it, you can see that we have some configuration here.
You can see that under the DHCP statistics, you know, it looks like we have DHCP configuration for the corp network, the guest network, the management network. So, you know, the corp network is the one we need. So, it looks like it's configured properly.
You can see that two IP addresses actually have been given to a couple of clients. But still, you know, our user was reporting problems. So, what I can do here in this case, it looks like it's configured properly.
If I wanted to look at the configuration below of the LAN, I could look at the DHCP configuration. It looks like we have some corp server configuration for the corp network. So it looks like everything is configured properly, but still we had an issue on the DHCP end.
So here, if I want to take it one step further, what I could do is from here, I could go to the WAN Edge Insights page. That would bring me back to the Insights page, but now it's going to change the scope.
And it will show me only, you know, the insights for that WAN Edge. And then here I can look at the different events. And it looks like, you know, some configuration was changed and I can look at the details.
And if I look at the details here, it's telling me that, you know, someone added DHCP pool configuration for the corp network. And this configuration was added, you know, this morning at 7:55 after the user had issues. Right.
So it looks like the DHCP configuration was added after the client had issues. So, now what I could do is, you know, instead of looking at my logs for today, I could look at them for this week. And I can try to go back in time to understand what happened.
You know, and I go back to, you know, another configure event here that happened on the 4th of June. If I look at the details, well, here I can see that in this instance, the corp DHCP configuration was removed. Right.
So here we can understand that at one point, well, actually we know exactly when, on June 4th at 6:18 AM, the corp DHCP configuration was removed. And it was added back on June 5th at 7:55 in the morning. So here using the Insights page, we can actually go back and try to understand what happened to the configuration of that device.
What we could also do to understand this is go back to the organization and look at the audit logs. And here you will be able to see, you know, who did what on the organization and understand what happened as well. You can see here that the WLAN SRX template has been changed.
I can see the details and you can see that here the DHCP configurations were added this morning at 7:54 AM, which, you know, correlates to what we saw on the Insights page. And here we can see that, you know, someone called François added this configuration this morning. And if I want to look at it for yesterday or for a couple of days ago, I can go back to this week.
And then if I scroll down, we have another event here where, you know, the configurations were removed. Same person. So it looks like François here a couple of days ago thought it would be a good idea to remove the DHCP configuration for the corp network.
And then the devices were not able to get an IP address. And then he fixed it this morning by reconfiguring the corp network. And now we can, you know, we can see that under the client page, we actually have a couple of clients with IP addresses on the corp Wi-Fi.
So it's working well, but it was broken for a little bit. And we were able to figure all of this out, you know, starting at the service level expectation for that specific client that's not working. And then drilling down all the way to the DHCP server, to the configuration, and to the audit log.
Right. So at this point, I could go back to the user and ask her to connect and she should be fine. I can make sure she has an IP address, make sure that I see her on the dashboard and it should work fine from now on.
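The timeline reconstructed in this scenario can be double-checked with a short calculation: the corp DHCP configuration was removed on June 4 at 6:18 AM and restored on June 5 at 7:55 AM, and the client failures were seen around 7:10, 7:20, and 7:50 that morning. The sketch below confirms that each failure falls inside the outage window. The walkthrough does not state the year, so `YEAR` is an arbitrary placeholder, and the variable and function names are made up for the example.

```python
from datetime import datetime

YEAR = 2024  # placeholder: the walkthrough does not state the year

removed = datetime(YEAR, 6, 4, 6, 18)   # corp DHCP configuration removed
restored = datetime(YEAR, 6, 5, 7, 55)  # corp DHCP configuration added back
failures = [
    datetime(YEAR, 6, 5, 7, 10),
    datetime(YEAR, 6, 5, 7, 20),
    datetime(YEAR, 6, 5, 7, 50),
]


def in_outage(event, start, end):
    """True if event happened while the configuration was missing."""
    return start <= event < end


outage = restored - removed
explained = [f for f in failures if in_outage(f, removed, restored)]
print("Outage duration:", outage)
print("Failures inside the outage window:", len(explained), "of", len(failures))
```

When every reported failure sits inside the window between the removal and the restoration, the configuration change is a plausible root cause, and failures after the restoration time would point to a different problem.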