Last summer, the investment company I work for decided to build a disaster recovery site. This location (40 miles from New York City) would provide a mirror of our downtown Manhattan operation. We decided to use Linux as much as possible for this project for the following reasons:
- We are primarily a Linux operation already, so we could use our existing experience.
- We could customize the configuration as much as we wanted, since everything would be open-source.
- We hoped Linux solutions would be less expensive than other solutions (e.g. Cisco).
In this article, I focus on our use of Linux in wan routers. I define a wan router as a system which connects to both wide-area links (e.g. T1 or T3 lines) and local-area networks (e.g. 100baseT) and forwards packets between them.
We purchased dedicated connections since this is a disaster recovery site and we need the connections to be as reliable as possible. Based on our calculations, a T3 (45mbit/sec) and 4 T1s (1.544mbit/sec each) would provide sufficient bandwidth for our operations. Ultimately we decided to use the T3 link as the primary connection and leave the T1s as a bonded 5.7mbit/sec backup link.
The choice of wan connectivity determined our network design. For redundancy, we installed two wan routers at each site. Every router is identical and contains hardware to connect to both the T1 and T3 links. With the use of splitter hardware, we hoped to connect all the wan links to all the routers, as shown in this diagram:
[Figure 1: Redundant WAN links]
However, that ultimately turned out to be extremely difficult to implement, due to technical issues I will discuss below.
In addition to the wan links, we also connected the remote site to the internet via the hosting company backbone. We operated on the principle that more connectivity was better, and this turned out to be very useful while we were designing the network. There's nothing like accidentally bringing down your T3 with a mistyped command to make you appreciate a backdoor to your routers over the internet.
Our space for servers at the hosting company was limited to one standard rack. This put space at a premium because we needed to install a good number of servers. Thus, we decided to use 1U systems for the wan routers. This was a difficult decision to make as hardware options are extremely limited in that form factor. In retrospect, it would have been much easier to use 2U systems for the wan routers.
The next step was the selection of T1 and T3 interface cards. The main choice here is whether to use a card with an integrated csu/dsu (i.e. one that connects directly to the incoming wan circuit), or a card with a high-speed serial connection and a standalone csu/dsu. Given our space constraints, an integrated card made the most sense. For previous wan installations, we had always used Cisco 2620 router boxes with T1 cards installed. However, that was not appropriate for this project because we wanted to connect multiple T1s and T3s.
After much searching, the only vendor we found that could supply both T3 and multiport T1 cards was SBE, Inc. The market for these cards is small and the number of vendors limited. My suggestion for finding wan cards is to start asking tech support a lot of questions and see how they respond. Also, look over the driver and hardware very carefully before committing to a particular vendor.
Designing the Router Computers
With the T3 and 4T1 cards from SBE, we would require a system with two free full-height half-length PCI slots. We decided on Tyan S5102 motherboards with a single Pentium 4 Xeon 2.4GHz cpu. For memory, we used 256MB of ECC RAM for maximum reliability.
To cut down on the chance of system failure, we used flash-based IDE devices. We found a device from SimpleTech that connects and operates like a conventional hard drive. We decided on a 256MB device as we thought that would be enough room for Fedora Core 1.
The complete computer systems (minus the wan cards) were purchased from a white box system supplier. This proved troublesome as the supplier was not able to produce four systems that were totally identical (there were variations in cpu fan manufacturers, memory speeds, etc.). My suggestion is to find the hardware you need, and assemble the systems yourself.
One area where the system supplier was helpful was in finding the right case. Only one of the numerous system vendors I contacted could supply a motherboard and case combination that would hold two full-height PCI cards. We had hoped to use a stock system from a supplier such as Dell or IBM, but none of the big names could give us a system that matched all our criteria.
Circuits and Cabling
It's critical to have redundant circuits connecting an office to a backup site. Determine who serves your sites, and find a backup site served by multiple providers. Our office is physically connected to two providers, so we ended up ordering the T3 from one and the T1s from the other. If you don't carefully research which providers have actual physical connections to your sites, you very likely will end up with all your circuits running through one vendor's cable at some point.
T1s come on standard RJ45 cables. Typically the provider will terminate the T1s (and their responsibility) at your demarcation point (demarc). The demarc is generally where all your phone connections are made. From the demarc, it is a simple matter to run regular ethernet cables to your racks.
T3s are more complicated: the physical connection is two coaxial cables (one for transmit and one for receive). T3s use RG-59A cable with BNC connectors. The T3 provider informed us that our server room was too far from their equipment in our building, so a T3 repeater was necessary. This required 4U of space and a 120 volt outlet in our rack. Luckily, this wasn't the case at the hosting facility.
Splitting the Circuits
Our goal was to connect all circuits to all wan routers (see figure one) and leave the circuits turned off on the spare system on each end. One router at each end would be the master for the T3, and the other would be the master for the 4 T1s. If either router at one end failed, the circuits could be brought up on the other router.
We also discovered that the T3 signals on the coaxial cables must be impedance-matched. The impedance on a T3 cable is 75 ohms. If you just split that connection, the impedance on the two resulting cables is 37.5 ohms, which may or may not work, depending on your hardware. The correct way to split T3 cables is to use what's called a "power splitter", which contains a transformer to properly balance the impedance at 75 ohms on all connections. We used passive power splitters from Micro Circuits, Inc.
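The 37.5 ohm figure comes from treating the two 75 ohm branches as equal loads wired in parallel. As a quick check of that arithmetic (the function name is mine, purely for illustration):

```shell
#!/bin/sh
# Impedance seen by the source when n equal loads of z ohms are
# connected in parallel: z / n.
parallel_impedance() {
    awk -v z="$1" -v n="$2" 'BEGIN { printf "%.1f\n", z / n }'
}

parallel_impedance 75 2   # a plain tee on a 75 ohm T3 cable
```

A power splitter avoids this mismatch, which is why a simple tee is not a safe substitute on the T3 side.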
Splitting the T1s was much simpler. It's sufficient to just use RJ-45 tee connectors to turn one incoming cable into two outgoing cables. Also, the SBE 4T1 card is designed to not turn the transmitter on until the driver is loaded, so you can share the connection between systems.
We were able to make all these split connections work. However, due to the startup problems with the T3 cards and other issues, we currently do not have the splitters installed. If you want to try that route, you have to get everything working rock solid without splitting before even attempting it. Otherwise, you will be continually removing the splitters every time there is a problem with a circuit because you won't have confidence in your setup.
Rolling Our Own Distro
The choice of a 256 megabyte flash drive for storage dictated a compact OS install. At Telemetry, we have standardized on Fedora Core 1 for all Linux systems. Thus it would be convenient to run FC1 on the router systems as well. We had two goals:
- Create something similar to stock Fedora Core 1 that would fit on a small drive.
- Change the system configuration to avoid unnecessary writes to the drive. (This is important because flash drives have a finite lifetime, so placing log files on them is a bad idea.)
It turns out that it is relatively easy to build a custom Fedora system, especially compared to what was available in previous Red Hat releases. The key is to build your own system image on another machine with a fresh rpm database, then transfer that image to the router. This script shows how to build a basic system image. The procedure is to create a new rpm database somewhere on your build system, install a minimal set of rpms to create the system, then install all other rpms you want. I use the '--aid' switch to rpm to tell it to satisfy all dependencies automatically by looking in a directory where I have placed copies of all the Fedora RPMs. This saves me the work of manually determining all the dependencies. Once you have the system image built, copy it over to the router for testing.
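The build procedure described above might look roughly like the following. This is a sketch under stated assumptions, not the original script: the IMAGE_ROOT and RPM_DIR paths, the package file names, and the DRY_RUN switch are all placeholders, and how '--aid' locates packages depends on how rpm is configured on your build system. With DRY_RUN=1 (the default here) the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: paths, package lists, and DRY_RUN are placeholders.
IMAGE_ROOT=${IMAGE_ROOT:-/tmp/router-image}    # image is assembled here
RPM_DIR=${RPM_DIR:-/tmp/fedora-rpms}           # copies of the Fedora RPMs
# Placeholder file names; real ones look like name-version-release.arch.rpm.
BASE_RPMS=${BASE_RPMS:-"filesystem.rpm glibc.rpm bash.rpm"}
EXTRA_RPMS=${EXTRA_RPMS:-"iproute.rpm iptables.rpm"}
DRY_RUN=${DRY_RUN:-1}                          # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

install_rpms() {
    for file in "$@"; do
        # --aid asks rpm to add packages needed to satisfy dependencies.
        run rpm --root "$IMAGE_ROOT" -ivh --aid "$RPM_DIR/$file"
    done
}

build_image() {
    run mkdir -p "$IMAGE_ROOT/var/lib/rpm"
    run rpm --root "$IMAGE_ROOT" --initdb      # fresh, empty rpm database
    install_rpms $BASE_RPMS                    # minimal base system first
    install_rpms $EXTRA_RPMS                   # then everything else
}

build_image
```

Keeping the image under its own root directory means the build machine's rpm database is never touched, and the finished tree can be copied to the flash drive as a unit.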
We were able to create a workable system that used 171 of the 256MB available on the flash drive.
Tweaking the Router Configuration
The goal with a Linux router is to minimize disk reads and writes. This is necessary because the memory in a flash drive can only be written to a fixed number of times (typically in the hundreds of thousands). The way to minimize writes is to treat the router like a laptop. First, enable laptop mode in the kernel (which has been standard since kernel 2.4.20 or so). This causes the system to delay writes until a read is requested, instead of sending writes to the disk as soon as they occur.
Second, adjust your filesystem mounting options to delay writes as well. For ext3, set the commit interval to 60 seconds. Then, set the filesystem "noatime" so that reads on files don't generate a write of modified access times.
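The first two steps can be sketched as a pair of commands. The assumptions here are mine: that the kernel exposes laptop mode as the vm.laptop_mode sysctl (any nonzero value enables it), that the root filesystem is the ext3 filesystem to tune, and that a DRY_RUN switch is wanted so the commands can be reviewed before they are run as root.

```shell
#!/bin/sh
# Sketch only: ROOT_MOUNT and DRY_RUN are placeholders for illustration.
ROOT_MOUNT=${ROOT_MOUNT:-/}
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

tune_for_flash() {
    # Laptop mode: assumes the kernel provides /proc/sys/vm/laptop_mode.
    run sysctl -w vm.laptop_mode=1
    # ext3: commit the journal every 60 seconds, skip access-time updates.
    run mount -o remount,noatime,commit=60 "$ROOT_MOUNT"
}

tune_for_flash
```

To make the mount options survive a reboot, the same "noatime,commit=60" string would go in the options field of the filesystem's /etc/fstab entry.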
Third, move all log files off the drive and into a ram-based filesystem (tmpfs). This script shows how to restructure your filesystem to move all log files out of /var and into a tmpfs called /var/impermanent. For this to really be useful, you also need a script such as the one I wrote that saves all the log files in a tarball on system shutdown and restores them on boot. This script should be called as early as possible in /etc/rc.d/rc.sysinit on startup and as late as possible during shutdown in /etc/init.d/halt.
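A save-and-restore helper of the kind described above could be as small as the following. This is a minimal sketch rather than the script I refer to: the archive path is hypothetical, and it assumes /var/impermanent has already been mounted as tmpfs before the restore step runs.

```shell
#!/bin/sh
# Sketch of a save/restore helper for logs kept in tmpfs.
LOG_DIR=${LOG_DIR:-/var/impermanent}             # tmpfs holding the logs
ARCHIVE=${ARCHIVE:-/var/lib/impermanent.tar.gz}  # hypothetical path on flash

save_logs() {
    # Pack the tmpfs contents into one tarball: a single write at shutdown.
    tar -czf "$ARCHIVE" -C "$LOG_DIR" .
}

restore_logs() {
    # Unpack the previous tarball, if there is one, into the fresh tmpfs.
    [ -f "$ARCHIVE" ] || return 0
    tar -xzf "$ARCHIVE" -C "$LOG_DIR"
}

# "restore" early in boot, "save" late in shutdown; no argument does nothing.
case "${1:-}" in
    save)    save_logs ;;
    restore) restore_logs ;;
esac
```

Writing one compressed archive per shutdown trades a small risk (logs are lost on a power failure) for far fewer writes to the flash device.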
Configuring the WAN Links
Wan links are confusing! For example, the T3 and T1 drivers use different versions of the kernel HDLC stack. This means we have to keep two different versions of the sethdlc program (used to set the protocol on the wan circuit), one built against each hdlc stack.
There are many configuration parameters to set on a T3 or T1 circuit (external or internal timing? CRC size? HDLC mode? and so on). Fortunately SBE's tech support was very helpful and supplied many configuration and troubleshooting tips.
We decided to bond the four T1s into one virtual circuit, using teql. This worked, but performance was terrible if one of the T1s was removed (even after it was reconnected). My coworker Bill Rugolsky tracked the problem down to a lack of link status reporting. The SBE card could report whether the link was up or down, but this message was not propagated up the stack. Thus teql would continue to try to send packets out on down interfaces. Bill resolved this by patching the SBE driver and installing patches others had created to fix teql and linkwatch notification. The driver patches were provided to SBE and we hope they are included in the next revision of their driver.
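For readers who have not used teql, the basic bonding setup looks something like this. It is a sketch of the standard tc approach, not our exact configuration: addressing and the driver patches mentioned above are left out, and the DRY_RUN switch is my addition so the commands can be inspected without touching the interfaces.

```shell
#!/bin/sh
# Sketch: attach several links to one teql device so traffic is spread
# across them. Interface names follow the article (hdlc0 through hdlc3).
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

bond_links() {
    bond=$1; shift
    run modprobe sch_teql                         # load the teql scheduler
    for dev in "$@"; do
        run tc qdisc add dev "$dev" root "$bond"  # enslave each T1
    done
    run ip link set dev "$bond" up                # bring up the bonded device
}

bond_links teql0 hdlc0 hdlc1 hdlc2 hdlc3
```

Because teql simply rotates packets across its member queues, it depends on something else telling it when a member is down, which is exactly the gap the patches closed.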
Our boss Andy Schorr did the work to set up OSPF to handle routing over the wan links. The open-source package Quagga (a successor to Zebra, of course) provides the necessary framework. If one of the links goes down (remember there are two links: the T3 and the virtual link over the bonded T1s), Quagga detects this and starts routing packets over the other interface. Traditionally, point-to-point links are configured to borrow the address of another interface (typically eth0). However, we decided to use dedicated subnets for each PtP link. Andy had to modify the source code to make Quagga work properly in this setup.
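To give a feel for the failover arrangement, here is a sketch that prints a minimal ospfd.conf preferring the T3 over the bonded T1s by giving it a lower OSPF cost. Everything specific in it is hypothetical: the T3 interface name, the two /30 subnets (taken from a documentation address range), and the cost values. It also does not reflect the source changes Andy made to Quagga.

```shell
#!/bin/sh
# Sketch only: interface names, subnets, and costs are hypothetical.
T3_IF=${T3_IF:-hdlc4}                 # assumed name of the T3 interface
BOND_IF=${BOND_IF:-teql0}             # bonded T1 device
T3_NET=${T3_NET:-192.0.2.0/30}        # dedicated subnet for the T3 link
BOND_NET=${BOND_NET:-192.0.2.4/30}    # dedicated subnet for the T1 bond

# Emit an ospfd.conf fragment; OSPF prefers the path with the lower cost.
write_ospfd_conf() {
    cat <<EOF
interface $T3_IF
 ip ospf cost 10
!
interface $BOND_IF
 ip ospf cost 100
!
router ospf
 network $T3_NET area 0.0.0.0
 network $BOND_NET area 0.0.0.0
EOF
}

write_ospfd_conf
```

With costs set this way, traffic uses the T3 while it is up and shifts to the bonded T1s only when the T3 adjacency is lost.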
We also had to figure out some iptables rules to make Quagga work correctly with the bonded T1s. The teql device is send only, so packets never appear on it. This causes Quagga to drop the packets because they come in on the wrong interface. The fix is a couple of iptables rules to make packets arriving on all the T1 interfaces (hdlc0 through hdlc3) appear on teql0:
iptables -t mangle -A PREROUTING -i hdlc\+ -j TTL --ttl-inc 1
iptables -t mangle -A PREROUTING -i hdlc\+ -j ROUTE --iif teql0
The bottom line is that setting up wan links is tricky work and requires much study and tweaking. Don't expect things to just work when you connect the cables.
Obstacles Along the Way
We had to resolve a number of problems while configuring these wan routers. Some of the earlier ones were with the wan drivers, as mentioned above. As I was writing this article, we discovered that our T1 performance had deteriorated badly, with highly variable ping times up to 1 second instead of the usual 10ms. We traced this to one of the wan cards not generating interrupts because it had come slightly loose in the PCI slot. The widely varying packet delays were occurring because the other device sharing the same interrupt line (eth0) was sending interrupts. This in turn caused the SBE driver to wake up and process its interrupts.
This type of non-obvious failure highlights the importance of link quality monitoring.
We are satisfied with the basic architecture, but there are a number of improvements to be made. Given the annoyances of managing multiple T1s in a bonded interface, we are now planning on upgrading the T1s to a second T3. When we do that, we may drop the circuit splitting entirely. Circuit splitting adds a whole new level of complexity to the entire system and we are unsure if it is worth it.
We have to continue to improve our monitoring of both line status and line quality. It is difficult to complain to circuit vendors about performance if you don't have historical data to back up your assertions.
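Even a very simple sampler run from cron would build up the kind of history that helps in those conversations. The sketch below is not what we run; the peer address and log path are hypothetical, and it assumes a ping whose summary includes "packet loss" and "min/avg/max" lines.

```shell
#!/bin/sh
# Sketch: record one timestamped loss/latency sample per invocation.
PEER=${PEER:-192.0.2.1}                        # hypothetical far-end address
LOG=${LOG:-/var/impermanent/linkquality.log}   # hypothetical log location

# Read ping output on stdin; print "<loss%> <average rtt in ms>".
parse_ping() {
    awk '
        /packet loss/   { for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i }
        /min\/avg\/max/ { split($0, a, "="); split(a[2], r, "/"); avg = r[2] }
        END {
            if (loss == "") loss = "?"     # no summary seen at all
            if (avg == "")  avg = "-"      # no replies, so no rtt line
            print loss, avg
        }
    '
}

# Ping the peer five times and print one line: time, peer, loss, avg rtt.
sample_link() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $PEER $(ping -c 5 -q "$PEER" 2>/dev/null | parse_ping)"
}

# Example cron usage (not run here): sample_link >> "$LOG"
```

A few weeks of such samples would have made the loose-card problem described earlier visible as a sudden jump in average round-trip time.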
It would have been convenient to use off-the-shelf servers for the router boxes. We have been investigating the latest 1U rackmount from a major manufacturer, but for several reasons it is unsuitable. The showstopper is that the BIOS will not allow booting from any flash IDE device. The vendor knows of this limitation but will not fix the BIOS. Thus we see ourselves building our own systems for the foreseeable future.
We will be building additional internal router boxes for handling lan traffic, based on the wan router model we have developed (1U systems with flash drives running a minimal Fedora kernel).
While this project is not complete, I feel we've accomplished enough to take a moment to evaluate its success. The key question is: would we do it again? The answer is a qualified yes. Our wan routers perform the task of providing redundant connections between our office and backup sites. The usefulness of splitting the wan circuits for redundancy is a wash since it adds so much complexity to the design.
This project has taken significantly longer than we anticipated. This is a general symptom of developing solutions based on open-source software. The answers are there, but you will expend more time to find them. Having a sharp, dedicated team (as I did) is crucial to making it all work. Just make sure to budget extra time for all the annoying little problems that are sure to arise.
Phil Hollenback is a system administrator at Telemetry Investments in New York City.
When he's not upgrading Linux servers or skateboarding, Phil spends his time updating his website www.hollenback.net.
Originally published in the October 2004 issue of Linux Journal.