Building Redundant Networks in Data Centers

Monday, 7. June 2010

I was recently asked to put together a brief web presentation on the different methods of creating redundant networks. I couldn’t think of a better place to put it than right here on my blog. After all, I was overdue for a post anyway…

What do I mean by redundant networks?

A redundant network has two or more distinct paths for data to travel to and from an upstream network. In its simplest form, it can be a spare piece of equipment that can easily be placed into service by hand after a failure. More often, though, it is set up so that any single device or connection can fail and, without user intervention, a backup system or connection will automatically step in and take over the job of the failed device or connection. A redundant network does not mean that no matter what happens, your data will still be reachable. Many factors, ranging from your providers to your applications, can still cause a failure.

Budget Alert!!!

In this article, I’m going to be talking about some reasonably advanced hardware features and protocols. My company provides, supports, and manages open-source-based network firewalls and routers. So, for us and our clients, all we pay for is the commodity PC hardware and technical expertise. If you want to do this with Cisco or Juniper gear, get out the big checkbook… You’ve been warned!

Working with your data center.

The first step in any network project that concerns where networks meet is to contact the network operators you will be connecting to and verify what they are willing to provide. Many smaller data centers may not have the expertise to set up some of the more complicated configurations that full redundancy requires. In almost all cases, they can deliver redundant connections to your equipment, but will expect you to manage your own fail-over. There are a few standard methods that most large data centers will allow you to set up.

Let’s start with something familiar…

So, I’m sure you recognize this little diagram. This is probably the most common setup that you will see in the IT world. A single data connection, fed into a single firewall, attached to a single switch, which then feeds the servers. We have the following failure points:

  • Data Center Connection
  • Firewall
  • Switch

Notice that I didn’t mention the servers here? That’s because the servers need to be redundant all on their own. I’m only concerned with the network. We have three distinct failure points. What can we do? Well, let’s take the “simple” approach for a second. We could just buy an additional firewall and switch. Then, if one of them fails, we can run to the data center and put the replacement in service. This would result in a downtime of:

DiagTime + TravelTime + ConfigTime = Total Downtime

So, with this design, we have to look at whether this is acceptable. In some cases it might be, but for our little project, let’s say it isn’t.
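To make the formula concrete, here is a minimal sketch that adds up the three terms. The function name and the minute figures are hypothetical, chosen only for illustration; plug in your own numbers.

```python
from datetime import timedelta


def total_downtime(diag: timedelta, travel: timedelta, config: timedelta) -> timedelta:
    """Cold-spare outage: diagnose the fault, travel to the data center, configure the spare."""
    return diag + travel + config


# Hypothetical figures, not measurements.
outage = total_downtime(
    diag=timedelta(minutes=20),
    travel=timedelta(minutes=45),
    config=timedelta(minutes=30),
)
print(outage)  # 1:35:00
```

Note that travel time is usually the term you have the least control over, which is a big part of why a cold spare on a shelf rarely cuts it.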

The Next Level…

So, if we are willing to buy a second Firewall and Switch, we should probably use them. Here is the next level:

So, what do we have here? Looks like we are using the Virtual Router Redundancy Protocol (VRRP). What is VRRP? VRRP is a service that runs on both firewalls and allows them to “elect” a master firewall. It can be configured so that if either of the connections goes down on one firewall, the other can step in and take over. Problem solved! Or is it? See, here’s the deal: this works “most” of the time. The sad truth is that sometimes the firewalls can’t figure out whether one of them is down. Sometimes a firewall will hang, and the other firewall won’t be able to take over.

Also, we are using Network Address Translation (NAT). NAT complicates things because the firewalls are actually managing the NAT connections, so if the active firewall goes down, all your connections will be reset. Worst of all, if one of the firewalls hangs without releasing the addresses, it could cause another trip to the data center, and that would put us back to where we were before:

DiagTime + TravelTime + ConfigTime = Total Downtime
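The election idea behind VRRP can be sketched in a few lines. This is a simplified model, not a VRRP implementation: the `Firewall` class, the `advertising` flag, and the example addresses are all assumptions made for illustration. The rule it models is that the highest priority wins, with the higher address breaking ties.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address
from typing import List, Optional


@dataclass(frozen=True)
class Firewall:
    name: str
    priority: int             # higher priority wins the election
    address: IPv4Address      # primary address, used only to break ties
    advertising: bool = True  # do peers still hear its advertisements?


def elect_master(firewalls: List[Firewall]) -> Optional[Firewall]:
    """Pick the master among firewalls whose advertisements are still heard."""
    candidates = [fw for fw in firewalls if fw.advertising]
    if not candidates:
        return None
    return max(candidates, key=lambda fw: (fw.priority, fw.address))


# Hypothetical pair of firewalls using documentation addresses.
fw1 = Firewall("fw1", priority=200, address=IPv4Address("192.0.2.2"))
fw2 = Firewall("fw2", priority=100, address=IPv4Address("192.0.2.3"))

print(elect_master([fw1, fw2]).name)  # fw1

# fw1 stops advertising, so the backup takes over.
fw1_down = replace(fw1, advertising=False)
print(elect_master([fw1_down, fw2]).name)  # fw2
```

The model also shows the weak spot: the only signal is whether advertisements are still heard. A firewall that hangs but keeps looking alive to its peer stays master here, which is the kind of failure that sends you back to the data center.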

So now what? Well, if we can get the data center to do it, we can use some routing protocols to make the fail-overs even more transparent. We also should stop using NAT, as we are running public servers.

I know some of you out there believe that NAT is a great thing for security, and I also know that you are probably getting ready to write me a nasty email regarding my total disregard for your opinion regarding NAT. I respect your opinion, and am sure you could bring an awesome argument to the table about this. I, however, know that a properly configured stateful firewall will kick NAT’s butt all day long. And since we are talking about a server farm in a data center, I respectfully submit that NAT is not necessary here.

BGP, OSPF, STP, the Universe, and everything!

So, we have arrived. This is pretty much the holy grail of data network redundancy. Here’s a quick breakdown:

  • Each Firewall / Router is talking to a different provider router via Border Gateway Protocol (BGP).
  • The two Firewalls are talking to each other over a cross-over cable, using Internal BGP (IBGP) and Open Shortest Path First (OSPF) to exchange local and learned routes.
  • Both routers are each plugged into both switches via a bridge interface.
  • Both routers are using VRRP on the interior interfaces.
  • The switches are connected to each other via a cross-over, or even via stacking.
  • Spanning Tree Protocol (STP) is managing all the cross-connected switch connections so that any active loops are shut down.
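To see why the last bullet matters, here is a rough sketch of what a spanning tree does with the bridged links in this layout. It is a simplification and an assumption on my part: real STP elects the root by bridge ID and weighs port costs, while this sketch just walks the topology breadth-first from a root you pick. The node names are hypothetical, and the routed cross-over between the firewalls is left out because it is not part of the bridged loop.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets using a breadth-first walk from root."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbours[node]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((node, peer)))
                queue.append(peer)

    # Every link that is not part of the tree would form a loop, so it is blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Both firewalls plugged into both switches, plus the switch-to-switch link.
links = [
    ("fw1", "sw1"), ("fw1", "sw2"),
    ("fw2", "sw1"), ("fw2", "sw2"),
    ("sw1", "sw2"),
]
forwarding, blocked = spanning_tree(links, root="sw1")
print(sorted(sorted(link) for link in blocked))  # [['fw1', 'sw2'], ['fw2', 'sw2']]
```

Five links between four devices means two of them have to sit idle. If a forwarding link fails, one of the blocked links can be brought back into service, which is where the redundancy comes from.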

But wait… Don’t you have to have an Autonomous System Number (ASN) and a direct allocation to do this? Well, no. You can do this if your provider is willing to accept address advertisements from you on a reserved (private-use) ASN. This is actually a pretty common setup in most larger data centers.
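As a small aid, the sketch below checks whether an ASN falls in the private-use ranges. It assumes that the "reserved ASN" your provider hands you comes from the ranges set aside in RFC 6996; the function and constant names are hypothetical.

```python
# Private-use ASN ranges from RFC 6996 (inclusive bounds).
PRIVATE_ASN_RANGES = (
    (64512, 65534),            # 16-bit private-use range
    (4200000000, 4294967294),  # 32-bit private-use range
)


def is_private_asn(asn: int) -> bool:
    """Return True if the ASN is in a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


print(is_private_asn(65001))  # True
print(is_private_asn(15169))  # False
```

A check like this is handy as a sanity test in config tooling, so a public ASN does not end up in a session that was supposed to use a provider-assigned private one.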

That said, there are obstacles to doing this. You have to understand how to set up all the protocols we’ve been talking about. And you need to understand how those protocols work, so that when something does go wrong, you can get to the bottom of it FAST.

The only thing I would add here is some sort of out-of-band connectivity to the inside network. The last thing you want is to have a BGP crash and not be able to get to your systems.

What’s next?

Well, I guess you could set up a hot standby data center somewhere else… Here are some methods of doing that:

Set up an extremely short TTL for DNS records.

If you set your Time To Live (TTL) to something really short, say 15 minutes, then you could connect to your DNS server and change the records to point to the hot standby IP addresses. This would get you switched over in about 15 to 30 minutes. It’s not perfect, but if you have to hot swap to a standby data center, 30 minutes of unavailability is the least of your worries.
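The reason the switch-over takes up to a full TTL after you change the record can be sketched as follows. This assumes resolvers honor the TTL exactly, which is not guaranteed in practice; the function name and timestamps are hypothetical.

```python
def seconds_until_switch(ttl: int, cached_at: int, changed_at: int) -> int:
    """Seconds after the DNS change until a resolver drops its cached old record.

    ttl        -- record TTL in seconds
    cached_at  -- time (in seconds) the resolver last cached the old record
    changed_at -- time (in seconds) the record was changed to the standby address
    """
    expires_at = cached_at + ttl
    return max(0, expires_at - changed_at)


TTL = 15 * 60  # the 15-minute TTL from the text

# Resolver cached the old record one second before the change: nearly a full TTL to wait.
print(seconds_until_switch(TTL, cached_at=999, changed_at=1000))  # 899

# Resolver's cache expires exactly at the change: it picks up the new address right away.
print(seconds_until_switch(TTL, cached_at=100, changed_at=1000))  # 0
```

Add the time it takes you to notice the outage and make the change, and you land in the 15 to 30 minute range.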

Get a second location from the same provider.

You could get a second cabinet / cage from your current provider in a different geographical location, and they could then re-route your traffic to your hot standby location. This could take anywhere from a few minutes to hours, depending on the time of day and who is on call for your data center provider. If you are running the BGP setup above, it is possible for your provider to allow you to also advertise your production network on your hot standby routers, but this is unlikely to work at different locations. As a matter of fact, unless the provider is willing to redirect your production network to another data center, you will be stuck with the DNS record solution anyway. And if the provider is having problems, is that really where you want your backup to be?

Get a real ASN and a direct allocation of addresses from your Regional Internet Registry (RIR).

You know, if you are big enough to need a hot standby data center, you are probably big enough to justify the paperwork and expense to get your own address allocation. Then, all you would need to do is advertise your route out your hot standby data center routers, and in a matter of minutes, you are up and running…

I hope this helps you get a better understanding of your options when it comes to redundancy. As always, feel free to let me know your thoughts!

— Stu

