Facebook's Open Compute Plans to Remake Servers

by Sean Michael Kerner

The social networking site’s lead systems engineer details his company's approach and plans for changing server infrastructure.

Facebook isn't just a massive web property; it's also a massive consumer and builder of server infrastructure. For over a year, Facebook has been open sourcing many of its server and data center designs under the auspices of the Open Compute Project.

In an interview with ServerWatch, Amir Michael, Facebook's system engineering team lead, detailed his firm's approach with Open Compute. Among the most recent Open Compute specifications is one known as Open Rack, which was first announced in May.

Currently most server racks have an outer dimension of 24 inches and an interior width of 19 inches for servers. With Open Rack the interior width would be resized to 21 inches to provide greater density.

While Facebook is a leading proponent of the Open Rack specification, it is not yet in use at the social media giant. Michael said that Facebook's current racks are neither 19 nor 21 inches, but are instead around 20 inches.

He explained that with the initial Open Compute servers, Facebook optimized around power and let the dimensions work themselves out.

"Looking forward with Open Rack, the idea is to keep as much flexibility as possible, so we don't need to revisit the design of the rack and the power infrastructure every couple of years," Michael said.

He added that the reality is that rack and power technologies don't have rapid refresh cycles in comparison with the commodity components that go into racks. Infrastructure refreshes cost money and have environmental impact, so the idea is to keep it in place for a few generations of servers. By open sourcing the specification, the goal is to further commoditize rack technology to keep the refresh cycle to a minimum.

"Facebook is still relatively young as far as its infrastructure goes, and we don't do a whole lot of refresh right now.  Most of it is net new builds at this point," Michael said.

Determining when a server refresh is required involves examining multiple factors. Among them are how many web requests the infrastructure can service and how many photos can be stored on a particular device.

"We look at the energy required to maintain that infrastructure and how much power is going into generating each page view and storing each photo," Michael said. "At some point, it really falls out of favor, where the amount of energy we're spending, when compared to newer technology, might be higher for the older technology and that's where we make the decision to cut over to new technology."

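The comparison Michael describes can be sketched as an energy-per-request calculation. This is a minimal illustration only, not Facebook's actual method: the function names, the `min_saving` threshold and all of the numbers are hypothetical.

```python
def joules_per_request(avg_power_watts: float, requests_per_second: float) -> float:
    """Energy spent per request. A watt is one joule per second."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return avg_power_watts / requests_per_second


def should_cut_over(old_j_per_req: float, new_j_per_req: float,
                    min_saving: float = 0.0) -> bool:
    """Return True when newer hardware saves more than min_saving
    (a fraction of the old energy per request). The threshold is a
    hypothetical knob, not something described in the interview."""
    saving = (old_j_per_req - new_j_per_req) / old_j_per_req
    return saving > min_saving


# Hypothetical measurements for an older and a newer server generation.
old = joules_per_request(avg_power_watts=300.0, requests_per_second=1000.0)
new = joules_per_request(avg_power_watts=250.0, requests_per_second=2000.0)
print(should_cut_over(old, new, min_saving=0.2))
```

The same ratio could be computed per photo stored instead of per page view; only the denominator changes.
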
ARM vs x86

Big hardware vendors like HP are beginning to explore the potential of ARM chip-based server architectures for web scale-out deployments. From a Facebook perspective, the choice of architecture is always about supporting the required workload at the lowest possible power and cost.

"Today the majority goes to x86," Michael said. "We are always looking at alternate architectures."

He added that Facebook will not always write the software for its infrastructure as x86 code. As such, the company is looking at ways to let its infrastructure shift to non-x86 platforms at some point in the future.

Server Management

From a server management technology perspective, Michael said that Facebook today only uses a very limited set of functions.

"Really what we care about are remote console access and a remote method for rebooting the server," Michael said. "Those two basic functions allow us to do everything we want to do."

The first version of Open Compute had something known as Reboot On LAN, which took the standard Wake on LAN packet and then wired it to the reboot system on a server motherboard.
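For reference, the standard Wake on LAN "magic packet" is six 0xFF bytes followed by the target's MAC address repeated 16 times. The sketch below builds and sends only that standard packet. How Open Compute motherboards route it to the reset circuit is not described here, and the broadcast address and UDP port 9 are common conventions assumed for illustration.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a standard Wake on LAN magic packet for a MAC address
    written as aa:bb:cc:dd:ee:ff or aa-bb-cc-dd-ee-ff."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("expected a 48-bit MAC address")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the packet over UDP. Address and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling `send_magic_packet("00:11:22:33:44:55")` would broadcast the packet on the local network; the MAC address shown is a placeholder.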

The second round of Open Compute servers provides a serial console over a network port.

"Without using any additional hardware, we used some features from the Intel chip management engine, tied it into our network interface, and can now get console access to servers," Michael said.

From a reporting mechanism perspective, any server failures are reported over the network and then aggregated into a database. That data is used for repair efforts as well as a method for predicting when certain types of hardware are likely to fail.
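A reporting pipeline of this kind could be sketched as follows. This is an illustration under stated assumptions, not Facebook's system: the `failures` table, its columns and the helper names are all hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) failures table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failures "
        "(host TEXT, component TEXT, reported_at TEXT)"
    )
    return conn


def record_failure(conn: sqlite3.Connection, host: str,
                   component: str, reported_at: str) -> None:
    """Store one failure report received over the network."""
    conn.execute(
        "INSERT INTO failures VALUES (?, ?, ?)",
        (host, component, reported_at),
    )
    conn.commit()


def failures_by_component(conn: sqlite3.Connection) -> list:
    """Count failures per component type, most frequent first."""
    return conn.execute(
        "SELECT component, COUNT(*) FROM failures "
        "GROUP BY component ORDER BY COUNT(*) DESC, component"
    ).fetchall()
```

Counts like these serve the repair queue directly; predicting failures would need further data, such as component age, that this sketch does not model.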

What's Next for Open Compute?

"We're just a year and a half into the project and we're already seeing a lot of momentum and interest," Michael said. "It's a project like any other open source project in that it requires time for momentum to build and people to gain interest."

The interesting part for Michael is seeing how people look at what Facebook has done, apply the same principles, and then come up with something new.

"I know this for sure, our designs weren't the best [initially], and getting people to contribute and provide commentary is really valuable feedback that you don't get otherwise," Michael said.

Sean Michael Kerner is a senior editor at InternetNews.com, the news service of the IT Business Edge Network, the network for technology professionals. Follow him on Twitter @TechJournalist.


This article was originally published on Tuesday Sep 4th 2012