The proliferation of mobile devices, rapid generation of data, and cheap broadband access (all contributing to the “Digital Tsunami”) have forced many data center providers to rethink how they build capacity, specifically public cloud capacity. Only a few of the traditional providers have transitioned to public cloud. For our part at CenturyLink, we’ve expanded our reach over the last few years to 14 public cloud nodes around the world. Each one offers a compelling mix of elastic services at a competitive price, and we are constantly expanding in each data center.
Because of this expertise, many enterprise customers offload infrastructure management to us. But even as many of these businesses shift to public cloud, IT leaders often ask us how they can run their on-premises deployments more effectively. To paraphrase: “How can I bring some public cloud automation pixie dust to my private cloud deployment?”
With that backdrop, we wanted to share some of the best practices behind one of our internal automation projects, Tinman. Tinman is a collection of APIs and scripts that ingests a rack of gear, then quickly transforms it into usable compute services. It is how we bring new gear online in a completely automated way.
Where to Get Started?
Bryan Friedman, the product owner for the newly formed Tinman team, points to the server lifecycle.
"We know how to build out capacity, but we wanted to get more efficient and, where possible, faster. We started by re-examining the actual states of a server end-to-end, from gear arrival at the site, to live in the platform."
The “server state” transitions, shown below, provided the map to automating three parts of the process (purple boxes).
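One way to picture a lifecycle like this in code is a small state machine. The sketch below is illustrative only: the state names are assumptions drawn from the steps this article describes (discovery, remediation, provisioning) plus the two endpoints Friedman mentions (gear arrival and live in the platform), not Tinman's actual states.

```python
from enum import Enum


class ServerState(Enum):
    # Hypothetical state names; the article only names the endpoints
    # and the three automated steps.
    ARRIVED = "arrived"
    DISCOVERED = "discovered"
    REMEDIATED = "remediated"
    PROVISIONED = "provisioned"
    LIVE = "live"


# Each state may only move to the next one in the lifecycle.
ALLOWED_TRANSITIONS = {
    ServerState.ARRIVED: ServerState.DISCOVERED,
    ServerState.DISCOVERED: ServerState.REMEDIATED,
    ServerState.REMEDIATED: ServerState.PROVISIONED,
    ServerState.PROVISIONED: ServerState.LIVE,
}


def advance(current: ServerState) -> ServerState:
    """Return the next lifecycle state, or raise if there is none."""
    try:
        return ALLOWED_TRANSITIONS[current]
    except KeyError:
        raise ValueError(f"{current.value} is a terminal state") from None


def full_path(start: ServerState = ServerState.ARRIVED) -> list:
    """Walk the lifecycle from `start` to the end and return every state visited."""
    path = [start]
    while path[-1] in ALLOWED_TRANSITIONS:
        path.append(advance(path[-1]))
    return path
```

Making the allowed transitions explicit means a server cannot skip a step, for example jumping from discovery straight to provisioning.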
Once a server is ready to add to the CenturyLink Cloud underlying infrastructure, Tinman needs to “discover” that it exists.
"Discovery is not only about identifying that a server exists but also registering the specific configuration of that server with Tinman. To simplify our automation as much as possible, we try to have a very specific set of configurations," says Friedman.
This is the paradox of cloud automation: the constraints set you free.
“We narrowed the number of hardware SKUs we buy from our vendors to optimize for speed of delivery and lower cost. Where possible, we look for multi-purpose configurations - SKUs that we can use to deliver a wide variety of products on top of.”
A server could be identified as one of three potential types: public cloud servers for customer use, public cloud servers for internal use (i.e. as part of CenturyLink’s management network to administer the platform), or a bare metal server for customers.
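A registration step along these lines might look like the following sketch. It assumes a small catalog of approved SKUs, each allowed to serve one or more of the three server types; the SKU names, the catalog contents, and the `register` function are all hypothetical and only illustrate how a narrow set of configurations keeps discovery simple.

```python
from enum import Enum


class ServerRole(Enum):
    PUBLIC_CLOUD_CUSTOMER = "public cloud, customer use"
    PUBLIC_CLOUD_INTERNAL = "public cloud, internal management"
    BARE_METAL = "bare metal, customer use"


# Hypothetical catalog: a multi-purpose SKU can back more than one role.
APPROVED_SKUS = {
    "SKU-A": {ServerRole.PUBLIC_CLOUD_CUSTOMER, ServerRole.PUBLIC_CLOUD_INTERNAL},
    "SKU-B": {ServerRole.BARE_METAL},
}


def register(serial: str, sku: str, role: ServerRole, registry: dict) -> None:
    """Record a discovered server, rejecting anything outside the catalog."""
    allowed_roles = APPROVED_SKUS.get(sku)
    if allowed_roles is None:
        raise ValueError(f"unknown SKU {sku!r} for server {serial}")
    if role not in allowed_roles:
        raise ValueError(f"SKU {sku!r} is not approved for role {role.value!r}")
    registry[serial] = (sku, role)
```

Because the catalog is small, an unexpected SKU or role fails loudly at registration instead of surfacing later in the build.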
Next up for Tinman: getting the server into an “automatable state.” The tech press advocates for “immutable infrastructure” and “infrastructure-as-code”, and that’s exactly this step.
“Each server needs to look the same to its consumers; it needs to be standardized based on the configuration Tinman detected in discovery.”
Specific remediation steps include standardizing the firmware and the BIOS configuration.
“The underlying tech we use to enable this step has been around for a long time - network booting or PXE (Preboot Execution Environment) - allowing us to run our automation in memory and without provisioning anything onto a local disk, or even requiring any local disk.”
The magic of Tinman, according to Friedman, is in the orchestration – the sequencing of these steps.
“The remediation tasks Tinman goes through are conditional, based on the configuration of the server. If Tinman sees a local disk, then we know to kick off the Bare Metal disk sanitization method. If Tinman sees 2 or 4 NIC adapters, then we know whether the server is a newer or older bill of materials (BOM), respectively. For the most part we use REST APIs that work through the server’s Intelligent Platform Management Interface (IPMI) and are defined as part of the Redfish standard, but in some cases we’ve had to wrap REST APIs around some command line tools that run against the server itself to perform the automation.”
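The conditional logic Friedman describes can be sketched as a function that turns a discovered configuration into an ordered list of tasks. This is a minimal illustration, not Tinman's code: the step names, their order, and the `ServerConfig` fields are assumptions, while the two rules (a local disk triggers sanitization, and 2 or 4 NICs indicate a newer or older BOM) come from the quote above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    has_local_disk: bool
    nic_count: int


def bom_generation(nic_count: int) -> str:
    """Map the NIC count to a hardware generation, per the rule in the quote."""
    if nic_count == 2:
        return "newer"
    if nic_count == 4:
        return "older"
    raise ValueError(f"unexpected NIC count: {nic_count}")


def remediation_plan(config: ServerConfig) -> list:
    """Build the ordered task list for one server (step names are hypothetical)."""
    steps = []
    if config.has_local_disk:
        # A local disk means the bare metal sanitization method applies.
        steps.append("sanitize_disk")
    steps.append("standardize_firmware")
    steps.append("standardize_bios")
    steps.append(f"apply_{bom_generation(config.nic_count)}_bom_profile")
    return steps
```

For example, `remediation_plan(ServerConfig(has_local_disk=True, nic_count=2))` yields a plan that starts with disk sanitization and ends with the newer-BOM profile.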
If there is an anomaly, meaning something doesn’t match the standard expected configurations, Tinman will flag this, too. When something in the server is incorrect or otherwise not functioning, the engineers know about it immediately. Remediation doesn’t just apply when adding a new server to the compute pool; the same commands are also run to rebuild servers when upgrades become available.
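Anomaly flagging of this kind can be as simple as comparing what was discovered against the expected baseline. The sketch below assumes both are flat dictionaries; the field names in the usage example are made up for illustration.

```python
def find_anomalies(expected: dict, discovered: dict) -> list:
    """Return one human-readable message per field that deviates from the baseline."""
    problems = []
    for key in sorted(expected):
        want = expected[key]
        got = discovered.get(key, "<missing>")
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems


# Hypothetical baseline and discovery result for one server.
baseline = {"bios_version": "2.1", "nic_count": 2, "memory_gb": 256}
found = {"bios_version": "2.1", "nic_count": 2, "memory_gb": 192}
for message in find_anomalies(baseline, found):
    print(message)
```

An empty result means the server matches its expected configuration; anything else can be routed to the engineers right away.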
“At scale, it is often unmanageable and risky to patch servers. Instead, we just modify Tinman’s configuration to use the updated software, and then do a clean install of the new version. There is a much lower chance of failure that way. The ‘cattle’ approach is really what you need to be moving towards, away from the ‘pets’ concept.”
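The rebuild-instead-of-patch idea can be illustrated with a short sketch: the desired software version lives in configuration, and any server running something else is queued for a clean install. The function name and the data shapes here are assumptions made for the example.

```python
def servers_to_rebuild(desired_version: str, installed: dict) -> list:
    """Return the servers whose installed version differs from the desired one.

    `installed` maps a server name to the version it currently runs.
    Nothing is patched in place; every server returned gets a clean install.
    """
    return sorted(
        name for name, version in installed.items() if version != desired_version
    )


# Hypothetical fleet: two of three servers are behind the configured version.
fleet = {"host-01": "6.0", "host-02": "6.5", "host-03": "6.0"}
print(servers_to_rebuild("6.5", fleet))
```

Changing one value in configuration is then the only manual step; the rebuild itself reuses the same remediation and provisioning path as a new server.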
After these steps are completed, the infrastructure can now be added to the platform, and start hosting customer applications.
Provisioning, the last step, is also the simplest. Depending on the configuration, Tinman will install VMware’s ESXi (for public cloud) or the customer’s chosen OS (for Bare Metal): Windows, Red Hat, CentOS, or Ubuntu.
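That decision could be expressed roughly as follows. The function and the server type strings are hypothetical; only the install targets (ESXi for public cloud, and the four operating system choices for Bare Metal) come from the description above.

```python
from typing import Optional

BARE_METAL_OS_CHOICES = ("Windows", "Red Hat", "CentOS", "Ubuntu")


def install_target(server_type: str, customer_os: Optional[str] = None) -> str:
    """Pick what to install on a server based on its type."""
    if server_type == "public_cloud":
        return "ESXi"
    if server_type == "bare_metal":
        if customer_os not in BARE_METAL_OS_CHOICES:
            raise ValueError(f"unsupported bare metal OS: {customer_os!r}")
        return customer_os
    raise ValueError(f"unknown server type: {server_type!r}")
```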
“We use Cobbler, a tried-and-true open source tool that also leverages PXE, for managing the provisioning. Tinman gives the server its network configuration and hostname, all through an API. Once the server has those things, it’s in the pool and ready to be claimed by a customer.”
Let’s revisit the server state diagram, but with some of the Tinman tooling shown alongside:
What is Tinman doing now in the real world? Friedman highlights three areas where Tinman has delivered:
• Speed: Build tasks that previously took 2 weeks now take 2 days, roughly an 80% reduction in build time. Overall data center build-outs that used to take around 4 months now take closer to 1 month.
• People: The technical skill level required to add capacity is now lower, since many of the tasks are automated. That means that more engineers can perform key tasks, instead of requiring the attention of more senior technical staff.
• Consistency: Everything now looks the same in our infrastructure builds going forward. That simplifies future build-outs dramatically, and introduces predictability when discussing future capacity expansions with leadership.
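The speed figures can be checked with a little arithmetic. The sketch below assumes “2 weeks” means either 10 working days or 14 calendar days, since the source does not say which; under the working-day reading the time saved is 80%.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage of time saved when a task drops from `before` to `after`."""
    return (before - after) * 100 / before


print(percent_reduction(10, 2))            # build task, 10 working days -> 2 days
print(round(percent_reduction(14, 2), 1))  # build task, 14 calendar days -> 2 days
print(percent_reduction(4, 1))             # data center build-out, 4 months -> 1 month
```

Either way the numbers describe a reduction in elapsed time: about 80% to 86% for individual build tasks and 75% for a full data center build-out.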
Three Best Practices Before You Embark on Automation
Of course, Tinman alone wasn’t enough. You have to change your approach to hardware to make sure you can perform the automation and run at scale in the first place. Friedman follows three rules:
The secret to great hardware? Software. “Our hardware has to have an API, ideally one that’s REST-based. This is our top requirement. If a vendor comes to speak with us, and they don’t have an API, the conversation ends very quickly.”
Expect failure – at every level. Application developers in public cloud have a truism: “build for failure.” The same rationale extends to infrastructure. In CenturyLink’s case, for every 1,000 servers purchased, only 80-90% are implemented. “The rest we have as a buffer so we can fail over without impacting machines,” said Friedman.
When failure happens, the failure state needs to be determined. Gear cannot go dark, so the chipset is now mission critical for CenturyLink. The engineering team uses IPMI management tooling to determine what’s wrong. Friedman adds that this requirement is easy to meet, since most products these days, thankfully, do not shut down the power to the chipset.
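To make the first and third rules concrete, here is a hedged sketch of asking a server about its own state over a REST API. It follows the Redfish resource layout mentioned earlier in the article (a `/redfish/v1/Systems` collection whose members expose `PowerState` and `Status`); the article names IPMI tooling for failure diagnosis, so using Redfish here is an assumption. The `fetch` callable is hypothetical and stands in for an authenticated HTTPS GET against the server's management controller.

```python
import json
from typing import Callable


def list_system_paths(fetch: Callable[[str], str]) -> list:
    """Return the resource path of every system in the Redfish collection."""
    collection = json.loads(fetch("/redfish/v1/Systems"))
    return [member["@odata.id"] for member in collection.get("Members", [])]


def health_report(fetch: Callable[[str], str], system_path: str) -> dict:
    """Summarize identity, power, and health for one system resource."""
    system = json.loads(fetch(system_path))
    status = system.get("Status", {})
    return {
        "model": system.get("Model"),
        "serial": system.get("SerialNumber"),
        "power": system.get("PowerState"),
        "health": status.get("Health"),
    }


def needs_attention(report: dict) -> bool:
    """Flag any system whose reported health is not OK."""
    return report["health"] != "OK"
```

Passing the HTTP call in as a parameter keeps the parsing logic testable without hardware; in practice `fetch` would wrap whatever client and credentials the management network requires.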
One More Thing…
When asked for a single piece of advice to give to other data center operations teams, Friedman immediately responded. “Simplify everywhere you can. Don’t use vendor-specific tools or proprietary software controls. And make sure you standardize the hardware. Once you have a baseline for immutable infrastructure, life is a little less painful.”
So What’s Next?
Friedman notes that the quest for full automation continues. While the server lifecycle described here represents a giant leap forward, there are still many areas for Tinman to automate. Among other things, automation of the network and related configuration is critical to improving upon the process, cutting down build-out times even further and minimizing failures.