At approximately 11:51 local time we were alerted to degraded performance on paths preferring transit through Charter AS20115. We collected data to open a ticket and attempted to apply a BGP community to lower localpref and move traffic away from AS20115. Oddly, we noticed, the alerts continued and no change was observed.
After attempting to tag a BGP community to lower localpref on announcements to AS20115 we decided to simply shut down the BGP neighbor completely at 11:59. However, we were horrified to discover that even after shutting down the BGP neighbor – effectively withdrawing all routes – Charter continued to announce ours and customer prefixes from AS20115.
The original problem we wanted to work around turns out to be a malfunctioning attenuator in a link bundle somewhere upstream, but this behavior of continuing to announce prefixes after we have withdrawn them or shutdown the BGP neighbor is a catastrophic loss of control over the network announcements from our autonomous system. We did employ what we like to call “stupid routing tricks” like deaggragation in a last ditch effort to drive traffic away from AS20115. However this could not help customer prefixes that were already at the minimum accepted size.
At this time there is no resolution. We’re simply at a loss in stopping Charter’s prefix hijacking other than to wait for them to address it.
UPDATE: The prefixes appear to have finally withdrawn this morning. We will post a complete update later, it’s been a long night.
UPDATE 2: Charter had a second emergency maintenance last night on the same equipment. We haven’t reestablished BGP with AS20115 yet.
UPDATE 3: We’re told that an IOS upgrade was performed on the device that hijacked the prefixes. On the morning of the 29th the affected device was rebooted at approximately 02:30 local time. We were told this solved our problem and our ticket was closed. However, we delayed reestablishing BGP until we could confirm a fix as a reboot would only clear the immediate problem, not fix the underlying issue. a second emergency maintenance occurred the next morning on the 30th with two observed reboots at 05:23 and again at 06:01. We’re told these were due to an IOS upgrade (through two independent sources) that should provide a fix for the bug. We did not reestablish BGP with AS20115 until October 1 at 17:45 local time. The time between our withdraw of prefixes and Charter’s propagation of our withdraw was approximately 14.5 hours. As far as we are aware no traffic was completely lost but was still affected by ~25% packet loss, which initiated our initial desire to withdraw routes.
This information is provided in an effort to maintain transparency in network operations at Roller Network.