
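AIOps in Action

Summary: To take a closer look at AI-native operations, watch how operations engineer François resolves a user's problem. Compare different approaches, including the Marvis conversational assistant and the service level expectations (SLE) dashboard. See where to find technical details, audit logs, and dynamic packet captures.

Scenario 1: Troubleshoot with Marvis Queries

In this scenario, François uses Marvis queries to help with troubleshooting.

  • Often, a basic query is all you need to get the information you want.

  • Or, with a few extra clicks, you can view more details.

  • If more questions come to mind while troubleshooting, you can refine your query.

  • If you need more technical information, you can easily navigate to other Juniper Mist pages to investigate further.

Enter a Basic Query

First, François enters a basic query. Marvis provides fact-based, action-oriented answers in plain English. François quickly gets the insights he needs to solve the problem.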

So a user just called me and she's telling me that you know the Wi-Fi doesn't work and she cannot get to her emails. So what I'm going to do now in the Mist dashboard is use Marvis to do some reactive troubleshooting. So we know we're having an issue, the user contacted us and now we can ask Marvis to, you know, to see what's wrong and get to the root cause of that potential issue.

So in the Mist dashboard, I can go back to the Marvis tool on the left here and I can click on Marvis Actions, and then, all the way on the top right there is a button called Ask a Question, and I can use this menu to ask Marvis a question. So, now I know that I have a problem with a specific client and what I could simply do is ask Marvis to troubleshoot that specific issue. So I can enter a specific query and you'll see if you click on it, it will pop up the different queries available.

The one I want to use here to do reactive troubleshooting will be the troubleshoot query, and then if I press space, it will ask me for the context, right? It will ask me, do you want to troubleshoot a site, an access point, a client? In this case, I'm going to go to the client tab and the MAC address that I want to troubleshoot is this one here, the B280, and I can click on it and here I'm already learning a piece of information. I'm learning that the user is using a Pixel 7 and I can always go back and forth with the user and validate that piece of information.

And so, I can simply ask Marvis can you please troubleshoot that specific device and then press enter. And so Marvis here in the background it will gather all of the information that it knows about that specific user and it will kind of summarize what it thinks is happening with that specific device. And here you can see that, apparently, we have a problem with the connection.

It says that the successful connect is failing, and what Marvis is giving us is pretty much, you know, what the problem is in plain English. So it's telling us that the client failed to connect a hundred percent of the time due to DHCP failures on the Corp Wi-Fi WLAN and on the 5 gigahertz band because of an unresponsive DHCP server. The problem is affecting a small number of clients.

Most failures across the site occurred on 5 gigahertz and the client is currently offline. So here we are learning a few pieces of information. We're learning that the client is still offline.

It's still not working for them and we're learning that apparently we have some sort of DHCP issues right and this is the power of Marvis. The power of Marvis is that it gives you in plain English what the problem is. You don't have to go and, you know, to a whole bunch of different menus to get the information.

You can simply get the information here.

See More Details

Continuing the scenario, François clicks the Investigate button to learn more.

If you do want to get more information and drill down a little bit more in that data, you can always click on the Investigate button here and it will, you know, Marvis will show you, you know, how it got access to that information. You can look at the service levels. So these are the SLEs.

So you can see that we had a couple of attempts and it gives you like the time and the date. It will tell you, you know, why it failed. So you have multiple reasons why a connection can fail.

In this case, 100% of the time that was because of a DHCP issue. Then, it will try to do some correlation to understand, you know, is it a client issue? Is it an AP related issue? Is it a WLAN related issue? And show you, you know, like the probable causes here. In this case, everything is 25% because we only had one client issue.

So, we cannot really learn much here in this case. Then it will tell us, you know, when changes have been made. You know, we had the RRM change.

We had some admin changes here. And then, the events, this one is interesting. It will take all of the client events, everything that happened with that specific client.

And it will, you know, kind of show us on a timeline when we had a bad experience. In this case, you can see we had the bad experiences, you know, around 7:10 AM this morning, 7:20 AM this morning. And we can go back and take a look at these events here.

Refine the Query

François refines the query to focus on a specific time range.

What I can even do here, when I ask Marvis to troubleshoot the device, is I can specify a timeline. So I can say troubleshoot Pixel 7 during, let's say this week, and we can try to get more information that way. What Marvis is telling us is pretty much the same thing here.

We're having a DHCP, unresponsive DHCP server. And when I investigate, if I go see my events, we'll be able to see what happened this week. And you can see that, indeed, we're having problems this morning, but we also had problems at 10 a.m. on Monday, actually, a couple of days ago.

And so we can see what happened across the week. We can go back in time, and we can also look at what changed, what happened on the network. Did we have any changes? Here it looked like we had some configuration changes when the first problem occurred.

And sometimes that can help us to understand where to look next to figure out what the problem could be.

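Investigate Further

Now François wants to see more technical information. He can easily navigate to other Juniper Mist pages to investigate client events, WAN Edge performance, audit logs, and more.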

From here, you know, I have a couple of options. I can either go straight to my unresponsive DHCP server, or if I wanted to get a little bit more information about that specific client, I could take a look at the Client Insights page. So, I can go back to the Monitor service level page and then select the Insights tab.

And then, under the scope here at the top, I can just select client and select the client that I want to study. And same thing here, you know, I can filter it for today, I can filter it for this week, and I can get information about what happened to that client this week. And you can see it kind of correlates to what Marvis was telling us.

And then here, I can kind of drill down into the events. And it looks like for us, like it looks like the events are repeating every time the client is associating to the Wi-Fi network. We can validate the SSID the client is connecting to, the RSSI, it looks like everything is working well here.

But then it looks like we're failing the DHCP, failing the DHCP discover on VLAN 10. We are not getting any response on that VLAN. And then therefore, the client gets disassociated.

So here, we kind of get a little bit more technical information related to what Marvis gave you, but we pretty much get the same, you know, outcome. And the problem seems to be related to the DHCP server. So, in this environment that I'm showing you, the DHCP server is actually the WAN gateway.

So, if I go and look at my WAN Edges here, you can see that I have a WAN gateway here. And this piece of equipment is supposed to provide IP addresses to the Corp Wi-Fi network. So if I wanted to continue the troubleshooting, I could, you know, take a look at that device and see, you know, what's happening.

Under the DHCP statistics, you can see that, you know, we have a DHCP pool for the management VLAN. It looks like we have a DHCP pool for the guest VLAN, but we don't have any DHCP pool for the Corp VLAN here. Right.

So, what we could do here is we could go back to the actual configuration of the WAN Edge, the WAN piece of equipment. In this case, it's an SRX. And if I go back into the configuration template and I scroll down to the IP configurations or the DHCP configurations, it looks like, you know, we don't have any DHCP configurations for the Corp VLAN or the Corp SSID in our case.

Right. Which is, you know, which is required if we wanted to provide IP addresses to the Corp user. So if I wanted to, you know, understand and trace back, OK, you know, how did we get there? How did we not get a Corp DHCP set of configuration for our Corp users? Because it used to work. It's not something that we just set up. It used to work. And then at one point, everything stopped working.

If I want to go one step further and understand this, what I could do is go into the audit logs. Under the organization menu, I can go to the audit logs and this will tell me what happened and who changed what in my organization. And here, you know, I can take a look at what happened yesterday, for instance.

You can see that we had an update on the WLAN template for the WAN Edge. And if I click on the details here, I can see that it looks like we used to have some DHCP configuration for the Corp WLAN network and it has been wiped out. So it looks like someone, someone called François, I wonder who that is, completely removed the configuration, the DHCP configuration for the Corp SSID.

Right. So now it's starting to make sense. You know, we had a configuration change. DHCP configuration got removed. And then the users, when they come back and connect to the Wi-Fi network, they're not getting an IP address. Nothing works. They can't get the emails and everything.
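
The audit-log check described above comes down to comparing the DHCP configuration before and after a change and noticing which network lost its pool. The following Python sketch illustrates that comparison only. The dictionary layout, the network names, and the function `removed_dhcp_networks` are hypothetical; they are not the Mist audit-log format or API.

```python
# Minimal sketch: find networks whose DHCP configuration disappeared
# between two snapshots of a template. The snapshot structure below is
# an assumption made for illustration, not the Mist data model.

def removed_dhcp_networks(before, after):
    """Return the names of networks that had DHCP config before but not after."""
    return sorted(name for name in before if name not in after)


# Hypothetical snapshots: the Corp pool is present before the change
# and missing afterwards, as in the scenario.
before_change = {
    "mgmt": {"dhcp": "server"},
    "guest": {"dhcp": "server"},
    "corp": {"dhcp": "server"},
}
after_change = {
    "mgmt": {"dhcp": "server"},
    "guest": {"dhcp": "server"},
}

removed = removed_dhcp_networks(before_change, after_change)
print("DHCP configuration removed for:", removed)
```

A result that lists the Corp network matches what the audit log detail showed in the video: the pool existed, then a template update wiped it out.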

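Scenario 2: Troubleshoot with Service Level Expectations (SLEs)

In this scenario, François uses service level expectations (SLEs) to quickly understand all the issues affecting user experience and to explore their root causes.

  • Use the SLE dashboard to see how your organization is performing across various success factors. View the root cause analysis for current issues.

  • Go to the Client Events page for deeper insights. Download a dynamic packet capture to learn more.

  • Investigate further by viewing technical details for your network devices and checking the audit logs.

View SLEs and Root Cause Analysis

François starts by going to the SLE dashboard and viewing the root cause analysis for the current issue.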

So, a user called me this morning and she's telling me that Wi-Fi doesn't work and she cannot get to her emails. And so what I want to do here in the Mist dashboard is try to troubleshoot the problem. And we're going to do this using the SLEs, the service level expectations.

So we're not going to use Marvis here, we're going to use the SLEs. And I'll show you a way that we can get to the bottom of things, to the root cause, using the SLEs instead. And so here, what we're doing, we're doing reactive troubleshooting, a user is complaining and we are going into the dashboard to try to understand what the issue could be.

So if I want to look at the SLEs, I can go back to the monitor and then service level page. And then at the top, next to the monitor, you'll have a tab called wireless, right? And so here, if I look at the, if I click on wireless, I'm going to have access to all of the wireless SLEs. But what I want to do is I want to change the context.

And instead of looking at the SLEs for the entire site, I want to go and select a specific client. So, I can go ahead and select a specific client. I believe this is the client that I'm looking for.

And you can see that now that I've changed the context, you know, the dashboard is only showing me the SLEs of that specific client. And this is very powerful because now we can just focus on one specific client and understand the user experience of that client. And here, you know, if I look at, you know, how this client is doing today, it looks like it was never able to connect to the Wi-Fi.

So, you know, the user experience is very poor. And I can see here that I have a 0% success rate for my successful connect. So it looks like the device was never able to connect to the Wi-Fi.

And if I look at my classifiers on the right, it looks like, you know, 100% of the time it failed because of DHCP related issues. So, what I can do from here is click on the SLE itself, successful connect. And that will bring me to this root cause analysis page where it will provide a little bit more information about that specific SLE and that specific client.

So you can see, you know, successful connect, we have a rate of 0%, then 100% of the time it failed because of DHCP. And here I can even see that we had four attempts to connect from that specific device. And then at the bottom here, I can get access to a little bit more information.

You can see we have some statistics, have a timeline so I can understand when these failures occurred. And here you can see, you know, it was around 7:10, 7:20, 7:50 this morning. I can look at the distribution.

So, distribution will help me to understand, you know, the scope. So here it happened when the client tried to connect to AP01 on the Corp Wi-Fi SSID on the five gigahertz frequency band. And here we don't have any information, but sometimes it would give you server's IP addresses, for instance, in this case, DHCP server.

And then we have the correlation tab. So here we'll try to kind of do some correlation and let us, you know, understand if it's a client-specific issue or maybe AP-related issue, or maybe like a site related issue or a WLAN related issue. So, it's trying to score, you know, the scope pretty much of that specific issue.

Here it looks like it's more related to that specific client, even though, you know, the score on the WLAN and the AP is 21 percent. And then I can look at the summary and the summary here. This is like a, I guess, a summary written in plain English telling us, you know, what the system thinks the problem is.

And here it's telling us that the client failed to connect 100 percent of attempts, primarily due to DHCP problems. The problem is client specific with most client failure occurring on AP number one and Corp Wi-Fi SSID. And so, here you can see that using the SLEs, I'm able to start at the beginning and then drill down based on the score of these different SLEs, drill down and understand what the root cause could potentially be.

So here, you know, if I believe what the system is telling me, it looks like the client was not able to connect to the Wi-Fi because it was never able to get an IP address from the DHCP server.
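
The numbers on the root cause analysis page (a 0% successful connect rate, with 100% of failures classified as DHCP across four attempts) follow from simple counting. The Python sketch below shows that arithmetic under stated assumptions: the attempt records and the field names `success` and `failure_stage` are hypothetical and do not represent how Mist stores or computes SLEs.

```python
from collections import Counter

# Minimal sketch: derive a success rate and a classifier breakdown from
# a list of connection attempts. Field names are illustrative only.

def successful_connect_summary(attempts):
    """Summarize attempts as a success rate and per-stage failure shares."""
    total = len(attempts)
    successes = sum(1 for a in attempts if a["success"])
    failed = total - successes
    stages = Counter(a["failure_stage"] for a in attempts if not a["success"])
    rate = 100.0 * successes / total if total else 0.0
    classifiers = {stage: 100.0 * n / failed for stage, n in stages.items()}
    return {"attempts": total, "success_rate": rate, "classifiers": classifiers}


# Four failed attempts, all at the DHCP stage, as in the scenario.
attempts = [{"success": False, "failure_stage": "DHCP"} for _ in range(4)]

summary = successful_connect_summary(attempts)
print(summary)
```

With a single affected client, every failure lands in one classifier, which is why the dashboard shows DHCP at 100%.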

Get Deeper Insights

Now, François wants to see technical information about client activity. Here, he also sees that a dynamic packet capture is available for download.

Now, if I want to get more information about this, what I can do is I can go to the Insights page. So from this root cause analysis page, because you have the client context specified here, you know, I have this View Insight button that will bring me back to the Insights page, having that very specific device selected. And here you can see that we're getting a little bit more information about this client.

You can see that we have some failed attempts, some failed events. You can see it on the timeline here. And if I look at the client events, I can try to study a little bit more in detail what happened to that client.

And it looks like we're having a pattern here, right? It looks like it's connecting and it's green, which means it's, it's working properly. It's connecting to the Corp Wi-Fi SSID. Looks like it has a good signal, successful connect on VLAN 10.

And then after that, though, we have a bad event and it's telling us that, you know, we have a DHCP timeout, right? And the description is telling us that, you know, the DHCP discover from that specific MAC address on VLAN 10 is failing. And it's telling us that, you know, we are not able to, you know, find pretty much a DHCP server. So the client is sending its DHCP discover, but it's not getting any answer back from any DHCP server.

And then after that, a little bit after that, you can see that, you know, the client is being disassociated because, you know, it didn't get an IP address. So it decided to just disconnect from that Wi-Fi network. And then it starts, it starts, it tries again, right? You can see that it's like a cycle.

It tries and every time it tries, it fails. Now, one thing that's interesting here in the event is sometimes you're going to have a little paper clip and that paper clip indicates that you have a packet capture that's available for download. And that's a dynamic packet capture that has been uploaded to the Mist cloud from the access point when the access point is noticing a bad event.

And as an operator, I can just click, you know, on that download packet capture and I can download that PCAP file. And later on, you know, I can retrieve it and open it with Wireshark.
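
The repeating pattern in the client events (associate, DHCP timeout, disassociate, try again) can be counted mechanically once the events are in a list. The Python sketch below is an illustration only. The event type strings and the function `count_failed_cycles` are hypothetical and are not the event names used by the Mist dashboard.

```python
# Minimal sketch: count how many times a client associated, hit a DHCP
# timeout, and was then disassociated. Event type names are assumptions.

def count_failed_cycles(events):
    """Count association -> dhcp_timeout -> disassociation sequences."""
    cycles = 0
    associated = False
    timed_out = False
    for event in events:
        kind = event["type"]
        if kind == "association":
            # A new attempt starts; forget any earlier timeout.
            associated, timed_out = True, False
        elif kind == "dhcp_timeout" and associated:
            timed_out = True
        elif kind == "disassociation":
            if associated and timed_out:
                cycles += 1
            associated, timed_out = False, False
    return cycles


# Hypothetical event stream with three failed attempts in a row.
one_cycle = [
    {"type": "association"},
    {"type": "dhcp_timeout"},
    {"type": "disassociation"},
]
events = one_cycle * 3

cycle_count = count_failed_cycles(events)
print("Failed connection cycles:", cycle_count)
```

A count that keeps growing while the association itself succeeds points away from radio or authentication problems and toward the DHCP path, which is the conclusion reached in the video.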

Use the Dynamic Packet Capture

François opens the packet capture in Wireshark and analyzes the data.

So, if you want to look at the PCAP frame here, I've downloaded it and opened it in Wireshark. You can see, you know, the frame exchange between the client and the AP. And if you do some Wi-Fi analysis, you can see that we have the authentication, the association. All of this seems to have worked properly. We also have the four-way handshake here. So, it looks like the person had the right password to connect to the Wi-Fi.

And then we can see, if we look at the frames down the list here, that we have some neighbor solicitation. This is usually, you know, the client looking for an IPv6 address. And then, we have some boot requests, which are the DHCP IPv4 requests. And you can see that we have one request here, but we don't have any responses.

And we have a couple of other requests before the client is disconnected. So we can, you know, if we want to have more information or if we didn't get all of the information we wanted from the dashboard, we can also open the PCAP file and start looking deep into the frames.
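
What the capture shows, boot requests with no matching replies, can also be checked programmatically once the DHCP messages have been decoded. The Python sketch below assumes the messages are already available as simple records with a transaction ID (`xid`) and a BOOTP `op` code, where 1 is a request and 2 is a reply. Decoding the PCAP file itself is out of scope here, and the record layout and sample values are hypothetical.

```python
# Minimal sketch: list DHCP transaction IDs that were requested but
# never answered. Input records are assumed to be pre-decoded.

BOOTREQUEST = 1  # BOOTP op code for client-to-server messages
BOOTREPLY = 2    # BOOTP op code for server-to-client messages


def unanswered_requests(messages):
    """Return transaction IDs of requests that have no reply, in order seen."""
    replied = {m["xid"] for m in messages if m["op"] == BOOTREPLY}
    pending = []
    for m in messages:
        xid = m["xid"]
        if m["op"] == BOOTREQUEST and xid not in replied and xid not in pending:
            pending.append(xid)
    return pending


# Hypothetical capture: the client sends a request three times and
# never receives a reply.
messages = [
    {"xid": 0x1A2B3C4D, "op": BOOTREQUEST},
    {"xid": 0x1A2B3C4D, "op": BOOTREQUEST},
    {"xid": 0x1A2B3C4D, "op": BOOTREQUEST},
]

pending = unanswered_requests(messages)
print("Unanswered transaction IDs:", [hex(x) for x in pending])
```

An empty result would mean every request was answered, so the problem would lie later in the exchange or elsewhere.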

Investigate Further

François looks at the technical details for the DHCP server (the WAN Edge) and explores the audit logs.

So here, you know, if I believe what the system is telling me, we're having an issue with the DHCP server. So, the next step in the troubleshooting would be to go back to that DHCP server and then look at the configuration and make sure that it's configured properly. In my environment here, the DHCP server, it's supposed to be the WAN Edge, the gateway.

So, if I go back and look at my WAN Edges, you can see that we have a WAN Edge device here, which is the SRX300. And this device is supposed to be my DHCP server. And so, if I look at it, you can see that we have some configuration here.

You can see that under the DHCP statistics, you know, it looks like we have DHCP configuration for the corp network, the guest network, the management network. So, you know, the corp network is the one we need. So, it looks like it's configured properly.

You can see that two IP addresses actually have been given to a couple of clients. But still, you know, our user was reporting problems. So, what I can do here in this case, it looks like it's configured properly.

If I wanted to look at the configuration below of the LAN, I could look at the DHCP configuration. It looks like we have some corp server configuration for the corp network. So it looks like everything is configured properly, but still, we had an issue on the DHCP end.

So here, if I want to take it one step further, what I could do is from here, I could go to the WAN Edge Insights page. So that would bring me back to the Insights page. But now that's going to change the scope here.

And it will show me only, you know, the insights of that WAN Edge. And then here I can look at the different events. And it looks like, you know, some configurations were changed and I can look at the details.

And if I look at the details here, it's telling me that, you know, someone added DHCP pool configuration for the corp network. And this configuration was added, you know, this morning at 7:55 after the user had issues. Right.

So DHCP, it looks like the DHCP configuration was added after the client had issues. So, now what I could do is, you know, instead of looking at my logs for today, I could look at them for this week. And I can try to go back in time to understand what happened.

You know, and I go back to, you know, another configure event here that happened on the 4th of June. If I look at the details, well, here I can see that in this instance, the corp DHCP configuration was removed. Right.

So here we can understand that at one point, well, actually we know exactly when, on June 4th at 6:18 AM, the corp DHCP configuration was removed. And it was added back on June 5th at 7:55 in the morning. So here using the Insights page, we can actually go back and try to understand what happened to the configuration of that device.

What we could also do to understand this is go back to the organization and look at the audit logs. And here you will be able to see, you know, who did what on the organization and understand what happened as well. You can see that here where we see that the WLAN SRX template has been changed.

I can see the details and you can see that here the DHCP configurations were added this morning at 7:54 AM, which, you know, correlates to what we saw in Insights. And here we can see that, you know, someone called François added this configuration this morning. And if I want to look at it for yesterday or for a couple of days ago, I can go back to this week.

And then if I scroll down, we have another event here where, you know, the configurations were removed. Same person. So it looks like François here a couple of days ago thought it would be a good idea to remove the DHCP configuration for the corp network.

And then the devices were not able to get an IP address. And then he fixed it this morning by reconfiguring the corp network. And now we can, you know, we can see that under the client page, we actually have a couple of clients with IP addresses on the corp Wi-Fi.

So it's working well, but it was broken for a little bit. And we were able to figure all of this out, you know, starting at the service level expectation for that specific client that's not working. And then drilling down all the way to the DHCP server, to the configuration, and to the audit log.

Right. So at this point, I could go back to the user and ask her to connect and she should be fine. I can make sure she has an IP address, make sure that I see her on the dashboard and it should work fine from now on.
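
The correlation made in this scenario is a time-window check: the client failures (around 7:10, 7:20, and 7:50 on the morning of June 5) all fall between the removal of the corp DHCP configuration (June 4 at 6:18 AM) and its restoration (June 5 at 7:55 AM). The Python sketch below reproduces that check. The year is an assumption, because the video does not state one, and the function name `failures_during_outage` is hypothetical.

```python
from datetime import datetime

# Minimal sketch: check which client failures fall inside the window
# during which the DHCP configuration was missing.

YEAR = 2024  # assumption: the transcript gives month and day only

removed_at = datetime(YEAR, 6, 4, 6, 18)   # corp DHCP configuration removed
restored_at = datetime(YEAR, 6, 5, 7, 55)  # corp DHCP configuration added back

# Approximate failure times read from the SLE timeline.
failures = [
    datetime(YEAR, 6, 5, 7, 10),
    datetime(YEAR, 6, 5, 7, 20),
    datetime(YEAR, 6, 5, 7, 50),
]


def failures_during_outage(failures, removed_at, restored_at):
    """Return the failures that occurred while the configuration was missing."""
    return [t for t in failures if removed_at <= t < restored_at]


explained = failures_during_outage(failures, removed_at, restored_at)
print(f"{len(explained)} of {len(failures)} failures fall inside the outage window")
```

Failures outside the window would suggest a second cause and would be worth a separate look in the client events.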