I can't keep the network up for more than 5 hours?

Me too! I’m holding my breath because it has lasted overnight before, but we are certainly beyond the longest I’ve ever been able to keep it up.

I will ping back if it goes down. If it doesn’t, then I will start playing with the transceiver to see if it’s just bad. I have a few, so I can just swap it. Maybe it’s some kind of strange incompatibility.

Thanks for the diligence!

I concur about not calling it resolved, or even “safely improved”, until we have higher uptime. Based on the overall report, I would say 36 hours is a solid soft metric, with Monday as the hard metric, maybe?

I’m still trying to find the actual root of the issue, but if it is what I think it is, it’s very much timing related. That is, there may only be brief periods, measured in minutes, in which I would be able to confirm my theory.
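
Since the uptime targets and the timing theory both depend on knowing exactly when the link was up, here is a rough sketch of how timestamped reachability checks could be turned into uptime stretches. It assumes you already collect `(timestamp, ok)` samples from some probe of your own (a ping script, for example); the function names are made up for illustration and nothing here is specific to the Alta gear.

```python
from datetime import datetime, timedelta

def up_stretches(samples):
    """Return (start, end) pairs for each continuous run of successful probes.

    `samples` is an iterable of (datetime, bool) pairs sorted by time,
    where the bool says whether the probe succeeded.
    """
    stretches = []
    start = last = None
    for ts, ok in samples:
        if ok:
            if start is None:
                start = ts          # a new up stretch begins here
            last = ts               # remember the latest good probe
        elif start is not None:
            stretches.append((start, last))
            start = None            # the stretch ended at the last good probe
    if start is not None:
        stretches.append((start, last))
    return stretches

def longest_uptime(samples):
    """Return the longest continuous up stretch as a timedelta."""
    return max((end - start for start, end in up_stretches(samples)),
               default=timedelta(0))
```

Comparing `longest_uptime(...)` against `timedelta(hours=36)` would give a simple pass/fail for the soft metric, and the start and end times of each stretch show when the drops happened.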

Yeah, that sounds great to me; both would be longer than any stretch of sustained workload I’ve been able to manage.

If you need me to try anything if it goes down again, I’m more than willing to help you get whatever you need!

Just passed the 36 hour mark.

I see! I’m excited. Now, what do you think? Is this a product of the super secret changes? Or the move to WAN1?

I see the DNS servers are manually assigned in LAN management. I’m not sure what else you may have done, as I’ve not really looked.

Do you think we should move in a certain direction to see what’s failing?

The DNS change was the change I referenced earlier.

I think the key here was switching all the Alta gear to DHCP. After doing so, I checked the ARP tables across the Route10 and the S8. It showed addresses that were assigned to Alta gear long after the cache should’ve timed out.

The Route10 showed an IP assigned to the S8. The S8 showed a different IP assigned to an AP. In both cases, the MAC portion of ARP showed <incomplete>. They persisted for well over 15 minutes. I cleared the ARP cache to see if they would show back up and they haven’t thus far.
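
For anyone who wants to spot these entries without eyeballing the table, here is a minimal sketch that picks the `<incomplete>` entries out of ARP output. It assumes the text looks like Linux net-tools `arp -a` output (`? (ip) at <mac-or-<incomplete>> ... on iface`); the Route10 and S8 may format theirs differently, so treat the pattern and the function name as assumptions.

```python
import re

# Matches the "(ip) at mac" part of an `arp -a` style line (assumed format).
ARP_LINE = re.compile(r"\((?P<ip>\d{1,3}(?:\.\d{1,3}){3})\)\s+at\s+(?P<mac>\S+)")

def incomplete_entries(arp_output: str) -> list[str]:
    """Return the IPs whose ARP entry has no resolved MAC address."""
    stale = []
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if match and match.group("mac") == "<incomplete>":
            stale.append(match.group("ip"))
    return stale
```

Running this on a dump taken every few minutes would show whether the same unresolved addresses keep persisting past the point where the cache should have aged them out.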

For now, I’ve reverted the DNS change to make sure that was a benign change. Other than that I’d recommend we let it sit until we get to that hard metric of Monday. If there are no anomalous issues between now and then, we can be confident that it’s the IP issue. Then we can start re-assigning static IPs, one device per day and see if the issue comes back again. On paper, it appeared to be an IP conflict, but I would’ve expected to see a complete ARP entry in that event; very peculiar situation.

Any noteworthy changes @solaris17? I’m seeing 2d 8h uptime, but that is likely because of the firmware updates that were pushed last week.

Hey, sorry Matt, it was a busy Monday. I think you’re on the money, because I left the FW alone on purpose but did see it wanted to update. Looks like everything but the router updated by itself. I didn’t run into any issues at all though; it’s been totally smooth sailing.

If you would like I can begin slowly moving the IPs for the network gear back to static after updating the router firmware; unless you want to do it, or want to try something else.

I’d say you’re good to move one device per 24 hours. 24 hours without any communications interruptions I think is a safe schedule (it’s what I’d be doing on my own network).

I apologize, I took a screenshot of one device’s static IP configuration but not all of them, so I don’t have a complete log of which device had which IP, and the last thing I want to do is cause a new IP conflict. But yes, I’d say it’s good to re-implement slowly, and if we suddenly start the comms loss loop again, we know the culprit.
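
To avoid introducing a new conflict while moving devices back, a small check like the following could be run over the planned assignments before applying them. This is only a sketch: the device names and addresses in any plan are whatever you keep in your own notes, and `find_conflicts` is a hypothetical helper, not something from the Alta tooling.

```python
from collections import defaultdict
from ipaddress import ip_address

def find_conflicts(assignments: dict[str, str]) -> dict[str, list[str]]:
    """Return each IP that is planned for more than one device.

    `assignments` maps a device name to the static IP planned for it.
    Invalid addresses raise ValueError via ip_address().
    """
    by_ip = defaultdict(list)
    for device, ip in assignments.items():
        by_ip[ip_address(ip.strip())].append(device)
    return {str(ip): sorted(devices)
            for ip, devices in by_ip.items()
            if len(devices) > 1}
```

An empty result means no two devices in the plan share an address; it does not check against DHCP leases already handed out, so the static addresses should still sit outside the DHCP pool.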

Awesome; you’re fine, I have a method so it’s no big deal. I’ll start and let you know how it goes. Thanks Matt!

Update! So after Monday this past week I began moving my network gear back to static IPs, one device per day. No issues.

Today I moved my WAN connection back to the SFP. After all, my LAN already hangs off of the other SFP. Well, what do you know? After about 4 hours the connection dies.

I replaced the SFP and I will see if it lasts overnight, and then over the weekend. They are both 10Gtek but two different models. If it was an SFP I’ll kick myself for not testing it earlier; if it goes down again, then I’d be curious to see if there is any kind of signaling I can look into to understand why the LAN side does not drop.

In either case, link light activity remains the same.

Happy days otherwise though! Glad it’s down to this transceiver or port.

Well, that lasted 11 min until the WAN went down with my second transceiver. Back on WAN1 now. I think I might know what’s happening here. I am auto-negotiating at 10 Gb/s and I wonder if Google’s ONT gets mad. WAN1 only trains at 2.5 Gb/s. I am going to manually slow the link to 2.5 Gb/s and see what happens.

EDIT: It hated that. So back on WAN1.

Sorry about the delay, I wasn’t notified you replied.

So with either transceiver you get link but only for a bit? Did you happen to notice what the link lights were doing on both sides?