Me too! I’m holding my breath because it has lasted overnight before, but we’re certainly past the longest I’ve ever been able to keep it up.
I will ping back if it goes down. If it doesn’t, then I will start playing with the transceiver to see if it’s just bad. I have a few, so I can just swap it. Maybe it’s some kind of strange incompatibility.
I concur about not calling it resolved, or even “safely improved,” until we have higher uptime. Based on the overall report, I would say 36 hours is a solid soft metric, with Monday as the hard metric, maybe?
I’m still trying to find the actual root of the issue, but if it is what I think it is, it’s very much timing related; that is, there may only be brief periods, measured in minutes, in which I would be able to confirm my theory.
The DNS change was the change I referenced earlier.
I think the key here was switching all the Alta gear to DHCP. After doing so, I checked the ARP tables across the Route10 and the S8. They showed addresses that had been assigned to Alta gear long after the cache should’ve timed out.
The Route10 showed an IP assigned to the S8. The S8 showed a different IP assigned to an AP. In both cases, the MAC portion of the ARP entry showed <incomplete>. The entries persisted for well over 15 minutes. I cleared the ARP cache to see if they would show back up, and they haven’t thus far.
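In case anyone wants to catch those stale entries in the act, below is a minimal sketch of the kind of watcher I’d use on a plain Linux box: it polls /proc/net/arp and timestamps any entry that shows up unresolved. The Route10 and S8 only expose ARP through their own UI, so this is just a generic illustration, not something that runs on the Alta gear itself.

```python
#!/usr/bin/env python3
"""Poll the Linux ARP table and log unresolved (incomplete) entries.

Generic-Linux sketch only: it reads /proc/net/arp; the Alta gear does
not expose this file, so on the Route10/S8 the same data has to come
from their own UI.
"""
import time
from datetime import datetime

INCOMPLETE_MAC = "00:00:00:00:00:00"   # how an <incomplete> entry appears
POLL_SECONDS = 30

def read_arp_table():
    entries = []
    with open("/proc/net/arp") as f:
        next(f)  # skip the header line
        for line in f:
            ip, hw_type, flags, mac, mask, device = line.split()
            entries.append((ip, flags, mac, device))
    return entries

while True:
    stamp = datetime.now().isoformat(timespec="seconds")
    for ip, flags, mac, device in read_arp_table():
        # Flags 0x0 or an all-zero MAC means the entry never resolved.
        if flags == "0x0" or mac == INCOMPLETE_MAC:
            print(f"{stamp}  INCOMPLETE  {ip}  on {device}")
    time.sleep(POLL_SECONDS)
```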
For now, I’ve reverted the DNS change to confirm it was a benign change. Other than that, I’d recommend we let it sit until we get to that hard metric of Monday. If there are no anomalous issues between now and then, we can be confident that it’s the IP issue. Then we can start re-assigning static IPs, one device per day, and see if the issue comes back. On paper, it appeared to be an IP conflict, but I would’ve expected to see a complete ARP entry in that event; very peculiar situation.
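If we want to test the IP-conflict theory directly later, a rough sketch of one way to do it is below: broadcast an ARP probe for a single address from a Linux host and see whether more than one MAC answers, since a true conflict should produce two distinct responders. This assumes scapy and root privileges, and the target IP and interface name are placeholders; I haven’t actually run this against our network.

```python
#!/usr/bin/env python3
"""ARP-probe one IP and report every MAC that answers.

Illustrative only (uses scapy, needs root); a genuine IP conflict
should show two or more distinct responder MACs for the same address.
"""
from scapy.all import ARP, Ether, srp

TARGET_IP = "192.168.1.10"   # hypothetical address to test
IFACE = "eth0"               # hypothetical interface name

# Broadcast a who-has for the target and collect every reply.
answered, _ = srp(
    Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=TARGET_IP),
    iface=IFACE, timeout=2, retry=2, verbose=False,
)

macs = {reply.hwsrc for _, reply in answered}
if len(macs) > 1:
    print(f"Possible conflict: {TARGET_IP} answered from {sorted(macs)}")
elif macs:
    print(f"{TARGET_IP} answered from {macs.pop()} only")
else:
    print(f"No ARP reply for {TARGET_IP}")
```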
Hey, sorry Matt, it was a busy Monday. I think you’re on the money, because I left the firmware alone on purpose but did see it wanted to update. Looks like it updated everything but the router by itself. I didn’t run into any issues at all though; it’s been totally smooth sailing.
If you’d like, I can begin slowly moving the network gear’s IPs back to static after updating the router firmware, unless you want to do it yourself or want to try something else.
I’d say you’re good to move one device per 24 hours. I think 24 hours without any communication interruptions is a safe schedule (it’s what I’d be doing on my own network).
I apologize: I took a screenshot of one device’s static IP configuration but not all of them, so I don’t have a complete record of which device had which IP, and the last thing I want to do is cause a new IP conflict. But yes, I’d say it’s good to re-implement slowly, and if we suddenly start the comms-loss loop again, we’ll know the culprit.
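To keep the “24 hours clean” judgment objective, this is roughly the monitor I’d leave running somewhere on the LAN: it probes an external host over TCP at a fixed interval and logs when the connection drops and recovers. The host, port, and interval here are arbitrary picks on my part.

```python
#!/usr/bin/env python3
"""Log WAN outages so a clean 24-hour window is easy to verify.

Rough sketch: probes 8.8.8.8:53 over TCP every 30 s (arbitrary choices)
and prints a line whenever the link drops or comes back.
"""
import socket
import time
from datetime import datetime

HOST, PORT = "8.8.8.8", 53   # any reliable external host works
INTERVAL = 30                # seconds between probes

def link_up(timeout=3):
    try:
        with socket.create_connection((HOST, PORT), timeout=timeout):
            return True
    except OSError:
        return False

was_up = True
while True:
    up = link_up()
    if up != was_up:
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp}  WAN {'restored' if up else 'DOWN'}")
        was_up = up
    time.sleep(INTERVAL)
```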
Update! So after Monday this past week, I began moving my network gear back to static IPs, one device at a time, one per day. No issues.
Today I moved my WAN connection back to the SFP port; after all, my LAN already hangs off of the other SFP. Well, what do you know? After about 4 hours, the connection died.
I replaced the SFP and will see if it lasts overnight, and then over the weekend. They are both 10Gtek, but two different models. If it was the SFP, I’ll kick myself for not testing it earlier. If it goes down again, I’d be curious to see if there is any kind of signaling I can look into to understand why the WAN drops while the LAN does not.
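For the “signaling” part, on a plain Linux box with a supported NIC, `ethtool -m` will dump the module’s diagnostic (DOM) data, such as temperature, voltage, and rx/tx power, which would be the obvious numbers to compare between the WAN and LAN transceivers. Not every module supports DOM (copper ones often don’t), and I’m not sure the Route10 exposes it at all, so treat this as a generic sketch with made-up interface names.

```python
#!/usr/bin/env python3
"""Dump SFP diagnostic (DOM) data for a couple of interfaces.

Generic-Linux illustration: `ethtool -m` reads the module EEPROM if the
NIC driver and the transceiver both support it (many copper modules
don't), and the interface names here are hypothetical.
"""
import subprocess

for iface in ("wan0", "lan0"):   # hypothetical interface names
    print(f"=== {iface} ===")
    try:
        out = subprocess.run(
            ["ethtool", "-m", iface],
            capture_output=True, text=True, check=True,
        ).stdout
        # Keep only the lines most relevant to link health.
        for line in out.splitlines():
            if any(k in line.lower() for k in ("power", "temperature", "voltage")):
                print(line.strip())
    except subprocess.CalledProcessError as err:
        print(f"(no DOM data: {err.stderr.strip()})")
```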
In either case, link light activity remains the same.
Happy days otherwise though! Glad it’s down to this transceiver or port.
Well, that lasted 11 minutes until the WAN went down with my second transceiver. Back on WAN1 now. I think I might know what’s happening here: I’m auto-negotiating at 10Gb/s, and I wonder if Google’s ONT gets mad. WAN1 only trains at 2.5Gb/s. I’m going to manually slow the port to 2.5Gb/s and see what happens.
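For reference, on generic Linux hardware, pinning a multi-gig port to 2.5Gb/s instead of letting it auto-negotiate 10G looks roughly like the sketch below (via `ethtool -s`). On the Route10 itself, the port speed is a setting in Alta’s management UI, so this is only to illustrate the idea; the interface name is made up.

```python
#!/usr/bin/env python3
"""Pin a multi-gig port to 2.5Gb/s instead of auto-negotiating 10G.

Generic-Linux sketch only; on the Route10 the equivalent is a port-speed
setting in the management UI. Interface name is hypothetical; needs root.
"""
import subprocess

IFACE = "wan0"   # hypothetical interface name

# Disable auto-negotiation and force 2.5G full duplex.
subprocess.run(
    ["ethtool", "-s", IFACE, "speed", "2500", "duplex", "full", "autoneg", "off"],
    check=True,
)

# Confirm what the link actually trained at.
print(subprocess.run(["ethtool", IFACE], capture_output=True, text=True).stdout)
```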