Chris Linden here. I wanted to briefly share an example of some of the interesting challenges we tackle on the Systems Engineering team here at Linden Lab. We recently noticed strange failures while testing a new API endpoint hosted in Amazon. Out of 100 https requests to the endpoint from our datacenter, 1 or 2 of the requests would hang and eventually time out. Strange. We understand that networks are unreliable, so we write our applications to handle it and try to make it more reliable when we can.
We began to dig. Did it happen from other hosts in the datacenter? Yes. Did it fail from all hosts? No, and this was maddening to be true. Did it happen from outside the datacenter? No. Did it happen from different network segments within our datacenter? Yes.
Sadly, this left our core routers as the only similar piece of hardware between the hosts showing failures and the internet at large. We did a number of traceroutes to get an idea of the various paths being used, but saw nothing out of the ordinary. We took a number of packet captures and noticed something strange on the sending side.
1521 9.753127 216.82.x.x -> 188.8.131.52 TCP 74 53819 > 443 [sYN] Seq=0 Win=29200 Len=0 MSS=1400 SACK_PERM=1 TSval=2885304500 TSecr=0 WS=128
1525 9.773753 184.108.40.206 -> 216.82.x.x TCP 74 443 > 53819 [sYN, ACK] Seq=0 Ack=1 Win=26960 Len=0 MSS=1360 SACK_PERM=1 TSval=75379683 TSecr=2885304500 WS=128
1526 9.774010 216.82.x.x -> 220.127.116.11 TCP 66 53819 > 443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2885304505 TSecr=75379683
1527 9.774482 216.82.x.x -> 18.104.22.168 SSL 583 Client Hello
1528 10.008106 216.82.x.x -> 22.214.171.124 SSL 583 [TCP Retransmission] Client Hello
1529 10.292113 216.82.x.x -> 126.96.36.199 SSL 583 [TCP Retransmission] Client Hello
1530 10.860219 216.82.x.x -> 188.8.131.52 SSL 583 [TCP Retransmission] Client Hello
We saw the tcp handshake happen, then at the SSL portion, the far side just stopped responding. This happened each time there was a failure. Dropping packets is normal. Dropping them consistently at the Client Hello every time? Very odd. We looked more closely at the datacenter host and the Amazon instance. We poked at MTU settings, Path MTU Discovery, bugs in Xen Hypervisor, tcp segmentation settings, and NIC offloading. Nothing fixed the problem.
We decided to look at our internet service providers in our datacenter. We are multi-homed to the internet for redundancy and, like most of the internet, use Border Gateway Protocol to determine which path our traffic takes to reach a destination. While we can influence the path it takes, we generally don't need to.
We looked up routes to Amazon on our routers and determined that the majority of them prefer going out ISP A. We found a couple of routes to Amazon that preferred to go out ISP B, so we dug through regions in AWS, spinning up Elastic IP addresses until we found one in the route preferring to go out ISP B. It was in Ireland. We spun up an instance in eu-west-1 and hit it with our test and ... no failures. We added static routes on our routers to force traffic to instances in AWS that were previously seeing failures. This allowed us to send requests to these test hosts either via ISP A or ISP B, based on a small configuration change. ISP A always saw failures, ISP B didn't.
We manipulated the routes to send outbound traffic from our datacenter to Amazon networks via the ISP B network. Success. While in place, traffic preferred going out ISP B (the network that didn't show failures), but would fall back to going out ISP A if for any reason ISP B went away.
After engaging with ISP A, they found an issue with a piece of hardware within their network and replaced it. We have verified that we no longer see any of the same failures and have rolled back the changes that manipulated traffic. We chalk this up as a win, and by resolving the connection issues we've been able to make Second Life that much more reliable.