A coworker of mine from when I was at Tellme has started a new blog to show and tell what he’s learning and thinking about network configuration management. In his most recent posting it turns out he’s found a paper written by a friend of mine from college! What a small world!
This problem is an important one. Automated system administration is a well-accepted concept at this point. You simply cannot run 10,000-node clusters without extensive automation, and there are a lot of clusters that size these days (not to mention service providers who use over 10,000 computers to implement their service). Networks are growing in absolute size due to the pervasive nature of the computers they now connect. But as if that weren’t enough to demand automation, network admins are adding a myriad of devices on top of (and embedded within) their networks, all of which have configurations that depend on the overall configuration. Over and over again, we see outages in production networks precipitated by seemingly unrelated changes in far-off parts of the network. Assuming you can predict the dependencies and write some software to address them, automated network configuration management should be able to nip some of these outages in the bud.
Unfortunately, that “assuming” is a big one. There’s a chance that smart people will go work hard on this and come back saying, “by the time we limited the problem enough so that it was solvable, the solution wasn’t saleable”. I don’t have a gut feeling for how this will turn out. It’s going to take people going and doing it to see. It seems Brent, who left Tellme in January, is getting ready to relaunch his consulting practice, focusing on automated network administration.
Cool! Good luck, Brent!