Mesh: AP drops from the mesh and loses traffic randomly

Ok, just did…

Provisioning

Finished provisioning, but the mesh is too weak. Not sure why… same spot, same everything. I just disconnected it from the wall and plugged it back in after trashing it from the portal.

Same behavior after reset.

image

I’m not sure if the order changed, but I can’t see the AP names. For me, the meshed AP is listed 2nd. Looks like the channel was changed as well.

I’ll grab some diagnostic data and send it to the devs.

The order changed when I deleted and reprovisioned it… it’s now last. Same name, “Afuera”.

It has been more stable since the reprovisioning. Still flaky, but not as bad as before. What puzzles me now is that I can’t see any upstream stats… upload speed stays at 0 pretty much all the time.

image

Not sure why the cams don’t like the AP anymore. They jump back to the Oficina Pucca AP after a few minutes, so all I have on the Afuera AP is the traffic of 2 smart bulbs.

Reloaded the page and Afuera is now back on the second line. The issue persists.

image

@izaguirre1285 I always like to ask permission before just up and doing things on someone else’s network.

Would it be OK if I invited at least one developer to your site if needed?

Sure thing! Feel free and go ahead :smiley:

We have found why the mesh is shutting off and it’s by design.

A log entry that reads:

Received packet on m5 with own address

indicates that a loop was detected on the network. When a loop is detected, the mesh is immediately shut off, for obvious reasons. I can see in the logs that there doesn’t appear to be an Ethernet link (it sounds counterintuitive, but you can actually use the APs as a wireless bridge), which leaves only WiFi devices. Are there any devices on your network that are connected to both WiFi and Ethernet? If so, I would verify that they have Internet Connection Sharing (or the equivalent feature) disabled.
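If it helps to see how often this happens, an exported log can be scanned for the message. This is only a rough sketch: the pattern is based solely on the one line quoted above, and the `find_loop_events` helper and the sample lines are made up for illustration. The real log format on the AP may well differ.

```python
import re
from collections import Counter

# Pattern derived only from the log line quoted in this thread;
# the actual format on the AP is an assumption.
LOOP_RE = re.compile(r"received packet on (\S+) with own address", re.IGNORECASE)

def find_loop_events(lines):
    """Count loop-detection messages per interface name."""
    counts = Counter()
    for line in lines:
        match = LOOP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    "Received packet on m5 with own address",
    "some unrelated message",
    "Received packet on m5 with own address",
]
print(dict(find_loop_events(sample)))  # {'m5': 2}
```

Counting per interface makes it easy to tell whether the message always shows up on the same mesh interface.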

If that’s not the case, there are a couple of other options.

  1. Set up a new SSID that exists only on Afuera, and remove your other SSIDs from Afuera using the color codes. Then move all but 1 of the devices that most commonly connect to Afuera to the new, temporary SSID. Then, once a day, add 1 more of those devices to Afuera until you see the drops occur again. Then you have your culprit.
  2. This one’s probably easier, but it does require forcibly removing devices from the network: block all but one device on Afuera, then unblock another device each day. You would stop when the mesh starts dropping out, and that way we know the last device you unblocked is the culprit.
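Both options boil down to the same elimination loop, which can be sketched like this. The device names and the `mesh_drops_with` callback are hypothetical; in practice each "call" is a day of watching the mesh, not a function.

```python
def find_culprit(devices, mesh_drops_with):
    """Enable devices one at a time; return the first one whose addition causes drops.

    devices: ordered list of device names (hypothetical).
    mesh_drops_with: callable that takes the list of currently enabled devices
        and returns True if the mesh dropped during that observation period.
    """
    enabled = []
    for device in devices:
        enabled.append(device)
        if mesh_drops_with(enabled):
            return device  # the most recently enabled device is the suspect
    return None  # no drops observed even with every device enabled

# Simulated run: pretend the laptop is the device that triggers the drops.
devices = ["bulb-1", "bulb-2", "cam-1", "laptop", "cam-2"]
print(find_culprit(devices, lambda enabled: "laptop" in enabled))  # laptop
```

This assumes a single culprit whose effect shows up within one observation period. If the drops need two devices together, the last device added is only one half of the answer.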

Why would the behavior of other devices affect how the AP hooks up to its peers? I have a few devices like that, mainly laptops and servers… There are usually at least 30 devices connected to the network, with a peak of 60, because I run my own company at home and I have smart devices, cellphones, laptops, and servers, the last two with redundant Ethernet and WiFi connections.

So setting up a new SSID is not an option. I can try to identify which device could be causing it, but that reasoning doesn’t make sense to me. How would you prevent that in a bigger company’s network? You just can’t.

This is pretty much the layout; I didn’t add the individual devices.

That is how layer 2 loops happen, typically. It’s the equivalent of taking a single Ethernet cable and plugging both ends of the cable into the same switch. That creates an L2 loop. In this case, it’s more complex and thus, easier to do. Any network admin who’s honest will tell you they’ve done it at least once.
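As a rough illustration of the idea, the topology can be treated as a graph and checked for a cycle. The link list below is a guess at a simplified layout (one switch, the two APs named in this thread, and one dual-homed laptop), not the actual network, and `has_l2_loop` is a made-up helper.

```python
def has_l2_loop(links, hosts, bridging=()):
    """Return True if the layer 2 topology contains a cycle.

    links: list of (node_a, node_b) pairs.
    hosts: end devices. A host only forwards frames between its
        interfaces if it is also listed in `bridging`.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for index, (a, b) in enumerate(links):
        # A non-bridging host does not pass frames between its NICs,
        # so each of its links ends at a separate node.
        if a in hosts and a not in bridging:
            a = f"{a}#{index}"
        if b in hosts and b not in bridging:
            b = f"{b}#{index}"
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # this link closes a cycle
        parent[root_a] = root_b
    return False

# Hypothetical layout: laptop is on Ethernet (switch) and WiFi (Afuera).
links = [
    ("switch", "Oficina Pucca"),   # wired uplink
    ("Oficina Pucca", "Afuera"),   # mesh link
    ("switch", "laptop"),          # laptop Ethernet
    ("Afuera", "laptop"),          # laptop WiFi
]
print(has_l2_loop(links, hosts={"laptop"}))                       # False
print(has_l2_loop(links, hosts={"laptop"}, bridging={"laptop"}))  # True
```

The same four links are loop-free while the laptop acts as a plain end device and become a loop as soon as it forwards frames between its two interfaces.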

The key here is “Internet Connection Sharing”: having it enabled will do it, and a device that is simply in bridge mode will do it as well.

I’m not saying definitively that there is an L2 loop, I’m saying that’s what the logs are telling us along with the expected behavior of the AP should an L2 loop occur.

To clarify, I was suggesting adding an SSID for testing purposes only.

The typical way L2 loops are prevented is with STP or RSTP, a protocol that assigns a weight (cost) to each link, usually dynamically, though it can also be adjusted manually, and blocks redundant paths. This has the added benefit of offering switch failure/failover capabilities. That’s how it’s prevented in not just large businesses, but regular home networks as well. STP = Spanning Tree Protocol, RSTP = Rapid Spanning Tree Protocol. If you’re familiar with OSPF routing, it’s roughly the same concept.
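A very simplified sketch of the outcome STP aims for: pick a root, keep the cheapest path from every switch to the root, and block the remaining links. This is not how the protocol itself is implemented (real STP/RSTP exchanges BPDUs and uses more tie-breakers); the switch names, the costs, and the `spanning_tree` helper are illustrative assumptions.

```python
import heapq

def spanning_tree(links):
    """links: dict mapping (bridge_a, bridge_b) -> cost.

    Returns (forwarding, blocked) as sets of frozenset link pairs.
    Assumes every bridge is reachable from the root.
    """
    neighbors = {}
    for (a, b), cost in links.items():
        neighbors.setdefault(a, []).append((b, cost))
        neighbors.setdefault(b, []).append((a, cost))

    root = min(neighbors)  # stand-in for "lowest bridge ID wins the root election"
    best = {root: 0}       # cheapest known cost from each bridge to the root
    via = {}               # bridge -> the link it uses to reach the root
    queue = [(0, root)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best[node]:
            continue  # stale queue entry
        for peer, link_cost in neighbors[node]:
            new_cost = cost + link_cost
            if new_cost < best.get(peer, float("inf")):
                best[peer] = new_cost
                via[peer] = frozenset((node, peer))
                heapq.heappush(queue, (new_cost, peer))

    forwarding = set(via.values())
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked

# Three switches wired in a triangle; the sw1-sw3 link is the expensive one.
links = {("sw1", "sw2"): 4, ("sw2", "sw3"): 4, ("sw1", "sw3"): 19}
forwarding, blocked = spanning_tree(links)
print(sorted(sorted(link) for link in blocked))  # [['sw1', 'sw3']]
```

The blocked link is still physically there, which is where the failover benefit comes from: if one of the forwarding links goes down, the tree can be recomputed to use it.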

Thanks for the info. Yeah, I’m familiar with RSTP. I was looking into loop issues, and the MAC of the other device is logged at the OS level… could you please ask engineering if they have that?

Update after almost a couple of days…

I’ve checked, and the router and all 3 switches have RSTP enabled. The issue persists.
I’ve disabled fast roaming and things got way better, but there are still issues (it now happens about once or twice an hour, instead of every few minutes in a loop).
I’ve run a 60-foot cable from the switch to the AP just for the sake of testing, and everything has been rock solid.

Which makes me think:

  1. If all the switches and the router have RSTP enabled, where is the loop coming from in the AP?
  2. Why is the issue affecting only the meshed part? Cabled works perfectly.
  3. If the problem was already in my environment, why weren’t there any noticeable errors before? The MikroTik worked just fine.
  4. Did engineering say which MAC was involved in the loop? That can be seen in the device logs.

Thanks.

  1. The AP itself is preventing the loop, which is why it doesn’t hit the switches. If the AP didn’t prevent the loop, there’s a very high likelihood that the switch interface for Oficina Pucca would get shut down to prevent the loop from occurring.
  2. I’m not sure how to answer this question, to be honest. Most devices that are both wired and wireless will typically prefer to send traffic over the wired connection even when wireless is available. I’m not sure which device you mean when you compare wired vs. wireless. The AP? If so, that’s because with the cable attached there’s no secondary link (i.e. the mesh) to loop through; the AP sends everything down the wire, which is its only path out to the rest of the network. Perhaps adding more context to the question would help me understand it better?
  3. I’m not sure if MikroTik APs have a loop prevention mechanism in place, how it behaves, etc. MikroTik may just let the packets through and assume the switch is going to deal with it.
  4. No, we don’t know what the MAC is, unfortunately. We could try a packet capture written to a file. If we did that, we’d ideally want to revert all settings to the scenario where the issue occurs most frequently, so we don’t run an hour-long (or longer) packet capture just to get the one packet whose source and destination MAC are the same.

Have you had a chance to drop some devices from the network to see if the issue goes away?

In addition to what @Alta-Matt_v2 suggested, you may want to see if there are any MAC flap logs on the switches while the network is in the broken state. That might help identify where the loops are coming from. For example, if we suspect the right-hand-side host is causing a loop, you may see MAC moves on the connected MikroTik switch indicating the AP MAC is moving between the host’s interface and the fiber interface connecting to the UI fiber switch.
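If the switches don’t flag flaps on their own, the same check can be done over exported MAC-learning events. A minimal sketch, assuming the events have already been parsed into `(timestamp, mac, port)` tuples; that tuple layout, the `find_mac_flaps` helper, and the MACs and port names below are all hypothetical and not any vendor’s log format.

```python
from collections import defaultdict

def find_mac_flaps(events, min_moves=2):
    """Return {mac: [ports in the order they were seen]} for flapping MACs.

    events: iterable of (timestamp, mac, port) learn events, sorted by time.
    A MAC is reported if it changed port at least `min_moves` times.
    """
    history = defaultdict(list)
    for _timestamp, mac, port in events:
        ports = history[mac]
        if not ports or ports[-1] != port:
            ports.append(port)  # first sighting, or the MAC moved to another port
    return {mac: ports for mac, ports in history.items() if len(ports) - 1 >= min_moves}

# Made-up events: the first MAC bounces between a host port and a fiber port.
events = [
    (0, "02:00:00:00:00:01", "ether5"),
    (1, "02:00:00:00:00:01", "sfp1"),
    (2, "02:00:00:00:00:01", "ether5"),
    (3, "02:00:00:00:00:02", "ether2"),
    (4, "02:00:00:00:00:02", "ether2"),
]
print(find_mac_flaps(events))  # {'02:00:00:00:00:01': ['ether5', 'sfp1', 'ether5']}
```

A single move is normal (a client roaming, for instance), which is why the default threshold only reports a MAC after it has moved at least twice.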