A meditation on Viral Chains and VPN nodes

2020-03-13

“Come with me

On a journey under the skin

We will look together

For the Pan within

Close your eyes

Breathe slow we’ll begin” - Waterboys.

[Gentle, relaxing music playing in the background]

So running Windows brings such happiness and joy, with blue glares staring out of every monitor and spinning circles… reminding you that life isn’t all a rush and you should take things easy, rest a while and drink tea until you need to pee and only then start work.

Microsoft has your back (and short and curlies) - Relax and wait while we install yet another dodgy patch for you (and install an antivirus package that came out of a condom wrapper that was stapled straight through the middle)

When it eventually comes to life, feel the warming flow of air as the CPUs spin up to help make some script-kiddie in Russia a multimillion Monero mogul.

If ever there was a place of peace and tranquillity, it will be a corporate network swimming with happy little obligate intracellular PowerShell parasites jumping from window to window, SMB share to SMB share.

[Breathe in]

Enter the spunky new swot team and their splunk.

“Firewall logs consume we must. Only then, whack-a-mole game commence, it can,” they say, in their weird black-dressed l33t speak.

So we faithfully point our Fortigate analyzer running in a VPC at their splunk and pull the trigger.

And the spunky new swot team blow their hunting horns, don their white, black, red, and kali hats and mount their laptops for a 24/(20..and counting) hunt to patchy and fixy those leaky OS buckets filled with what came straight out of Ballmer & Gates’ collective ass.

[Breathe out]

Some woke teams that usually complain a lot complained some more:

  • “Our image pulling from our self-hosted GitLab monster in AWS is slow!” and we were like: “Sure, sure, it is all latency related, you guys. You know it is a thing. And why aren’t you using ECR?”
  • “But it is fast now, and then later it is slow” and we were like: “Yes, yes - the network must just have had a wobble.”
  • “Why are public VPC endpoints faster than the VPN?” and we were like: “Well, like, IKE is doing IPsec, and you know encryption only makes things slower.”

Eventually, even spreadsheet jockeys started to notice they weren’t getting their CSVs into their Excels in double quick time anymore.

[Breathe in]

[Rope in the cavalry.]

“So like AWS dudes, how the fluffing F do you test dodgy network throughput when neither DX nor the VPN is even remotely close to its limits?”

[copy and paste]

Use this

“Thanks dudes, here’s the output - the egress traffic is dodgy and does a lot of retries”

Days go by - lots of head-scratching…

[Rope in the network BOFH]

“This doesn’t make sense man, let’s pcap that ass”

And 50 shades of pcapping we did, ad nauseam: uploads fast, downloads slow (most of them). Oodles of TCP “WTF man?! I didn’t receive that packet, send it again dipshit.” retransmits.
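If you would rather watch the retransmits pile up without opening a pcap, here is a minimal sketch, assuming a Linux host on one end of the transfer. It reads the kernel's cumulative TCP counters from the text of `/proc/net/snmp` and returns retransmitted segments as a fraction of segments sent. The function name is made up for illustration; this is not what we actually ran at the time.

```python
def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from the text of /proc/net/snmp.

    The file holds a 'Tcp:' header line with counter names followed by a
    'Tcp:' line with the matching values (cumulative since boot).
    """
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    # Skip the leading 'Tcp:' token on both lines, then pair names with values.
    counters = dict(zip(header[1:], (int(v) for v in values[1:])))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0
```

The counters are totals since boot, so a single reading mostly tells you about history. Read the file before and after a slow download and compare the two readings to see what that transfer did.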

More days go by - lots more head-scratching…

ECMP transit gateway, tunnel balancing algorithms, tunnel metrics, firewall settings, channel snuffing, DX snuffing DR failovers… we scratched in all the places.

[Breathe out]

Cavalry dude #4:

“so like you might be hitting the Packets Per Second limits on our VPN nodes… What kind of traffic are you sending over the wire?”

Us:

“Well, ya know, we play like both kinds of music here - you dee pee and tea sea pee from all 57 VPCs via a TGW”

Cavalry dude #4:

“so like what kind of egress traffic are you floating back to on-prem?”

Us:

“well we gots us some docker, obviously CSVs, and hey, recently some firewall logs”

Cavalry dude #4:

“is that Fortigate Analyzer sending UDP traffic?”

Us:

“sure it is! way more efficient don’t you know?”

Cavalry dude #4:

“Googled it for you… You know that Reliable Delivery tickbox in Fortigate Analyzer - turn it on. Namaste.”

Us:

<click>…

“No more complaining, just champagning”

[parties with the desert animals]

PPS, Mʌðəfʌkə. Do you speak it?

So it turns out our overly butch EC2 instance of Fortigate Analyzer was firing out UDP packets at a rate that exceeds what both of the VPN nodes can handle.

The VPN node then goes “Whoah there sailor, put some of those packets back in your pants” and starts dropping packets, which makes the TCP stacks back off, retransmit and wobble into a tailspin.

This, it turns out, is pretty invisible to CloudWatch’s Eye and it is pretty weird they don’t have a metric for it.
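One way to get a rough look at packet rates anyway is VPC Flow Logs. This is a sketch, assuming flow logs are enabled on the chatty instance's network interface and use the default version 2 record layout (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status). The `FlowRate` type and `flow_record_pps` function are hypothetical names. The result is an average over the capture window, so short bursts will be smoothed out.

```python
from typing import NamedTuple, Optional


class FlowRate(NamedTuple):
    src: str
    dst: str
    protocol: int   # IANA protocol number: 6 = TCP, 17 = UDP
    pps: float      # average packets per second over the capture window


def flow_record_pps(line: str) -> Optional[FlowRate]:
    """Average packets per second for one default-format flow log record.

    Returns None for records that carry no traffic data (NODATA, SKIPDATA)
    or that do not have the 14 fields of the default layout.
    """
    fields = line.split()
    if len(fields) != 14 or fields[13] != "OK":
        return None
    src, dst = fields[3], fields[4]
    protocol = int(fields[7])
    packets = int(fields[8])
    start, end = int(fields[10]), int(fields[11])  # Unix seconds
    window = max(end - start, 1)  # guard against a zero-length window
    return FlowRate(src, dst, protocol, packets / window)
```

Sorting the records by `pps` and looking at who tops the list with protocol 17 would probably have pointed at the log shipper a lot sooner than we managed.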

Anyhoos, if you ever see your TCP traffic intermittently taking a nosedive while staying nowhere near the VPN’s 1.25 Gbps max rate, you might be hitting your head against the Packets Per Second rate limit on the VPN node.
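The arithmetic behind that is worth a back-of-the-envelope sketch. The packet rate ceiling below is an assumed, illustrative figure and the packet sizes are made up, so check AWS's published Site-to-Site VPN quotas for the real number. The point is only that small packets run out of packets per second long before they run out of bits per second.

```python
def throughput_gbps(packets_per_second: float, packet_bytes: int) -> float:
    """Bandwidth in Gbps consumed by a stream of equally sized packets."""
    return packets_per_second * packet_bytes * 8 / 1e9


# ASSUMPTION: illustrative per-tunnel packet rate ceiling, not taken from this post.
ASSUMED_PPS_CEILING = 140_000


def bandwidth_at_pps_ceiling(packet_bytes: int) -> float:
    """Gbps on the wire at the moment the assumed PPS ceiling is reached."""
    return throughput_gbps(ASSUMED_PPS_CEILING, packet_bytes)


if __name__ == "__main__":
    # Hypothetical sizes: small syslog-style datagrams up to near-MTU packets.
    for size in (200, 500, 1400):
        print(f"{size:>5} byte packets: {bandwidth_at_pps_ceiling(size):.3f} Gbps")
```

Under those assumptions, a flood of 200 byte log datagrams tops out at roughly a fifth of a Gbps, which is why the bandwidth graphs looked so innocent.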

Anyhoos #2, UDP is a nasty piece of work.