Pantheon is the professional website platform top developers, marketers, and IT pros use to build, launch, and run all their Drupal and WordPress sites. Pantheon powers 70,000+ sites, including Intel, Cisco, Arizona State University, and the United Nations.
Recently Zack Rosen, Pantheon’s CEO, published a blog post entitled “Why we built Pantheon with Containers instead of Virtual Machines”. We were intrigued and wanted to learn more about Pantheon’s container-based platform, so David Strauss, Pantheon’s CTO, met with me (virtually) to walk through it.
David Strauss, CTO at Pantheon
After co-founding Four Kitchens, a successful web development shop, David found himself gravitating away from custom client work and toward infrastructure solutions. Large clients like Creative Commons, Internet Archive, The Economist, and Wikimedia had already benefited from his scalability and database optimization work.
In addition to his role as Pantheon CTO, David also co-maintains the systemd/udev layer that runs on most of the world’s Linux systems, serves on the Advisory Board for the Drupal Association, contributes to the infrastructure and security teams at Drupal.org, and leads the development of Pressflow.
Why did you build Pantheon with Linux Containers? What was it about containers that drew you away from virtual machines?
This goes right to the primary drivers we had for designing the product. One of them was a consistent experience, which is hard to implement using VMs. Production may be a fleet of VMs while a developer instance may be a single VM or a set of shared VMs. You then go from local database access to network database access, and from local file system access to something like GlusterFS. These differences introduce major consistency issues.
Even the tools used for local development might be different from production; locally you may not have access to memcached or SOLR. The way to overcome these limitations and provide a consistent experience at a reasonable price is to use containers.
You can’t deploy a fleet of VMs for every developer in every single environment along the way to deployment at a reasonable cost. There is huge overhead, and cost, associated with replicating database servers and the like.
With containers we can still spread the application over the network but only take a small slice of each system. You can have the application in a small container, have the database server in another small container, have SOLR in another small container, and the overall footprint is small. In addition, the containers can be started and stopped on demand, so even their memory and CPU footprint is not persistent; it exists only while they are actually being accessed.
This allows the dev and test environments to be representative of the later stages of deployment in an economical way. This architecture also pays off in production, where the on-demand nature allows for scaling without the costs of over-provisioning. Since the overhead cost in resources is small, the Pantheon grid can absorb blips in usage from customers with little effect on the overall edge traffic it processes.
Why don’t you use Docker?
For one, our container approach pre-dates Docker by a number of years. We’ve been using containers since 2011 on Pantheon’s platform.
The Linux kernel doesn’t actually have the concept of containers. There is a set of APIs that, used in unison, give you something that looks like a container. It’s not like Zones on Solaris or Jails in BSD, where there is a monolithic thing that you configure that creates that isolation.
On Linux there are mandatory access controls, capability filters, cgroups, namespaces, and probably one or two other things beyond standard users and groups. If you configure enough of them to isolate a process, you get something that looks like a container. These are called containers on Linux because, for all practical purposes, they behave the same as containers on other platforms that have the concept at the kernel level.
What Docker does is configure all these resources for you. We use systemd to do the same thing.
Docker also provides a packaging and deployment model, plus infrastructure for publishing containers so other people can install them. Unfortunately, some of Docker’s design goals have not aligned with our needs. We already have technology for configuring what is in the container: we use Chef to set the context of the container itself rather than the base system. Also, our density needs are considerably more advanced than what Docker provides right now.
The technology that lets containers be activated on use is called “socket activation”. systemd sits on the base system, opens a socket (the “listener”), and puts an epoll watch on it, which lets it see that a connection is coming in without actually processing it. Knowing that a request is coming in, systemd can activate a service or container on the system.
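The idea can be sketched in a few lines of Python. This is an illustration of the pattern only, not systemd's implementation: the supervisor here is an ordinary function, the "service" is a thread rather than a container, and the names handle_one_request, activate_on_first_connection, and demo are hypothetical. It uses the standard selectors module (which picks epoll on Linux) to notice a pending connection without accepting it.

```python
import selectors
import socket
import threading


def handle_one_request(listener: socket.socket, results: list) -> None:
    """The on-demand "service": accept the queued connection and answer it."""
    conn, _ = listener.accept()
    with conn:
        data = conn.recv(1024)
        results.append(data)
        conn.sendall(b"handled: " + data)


def activate_on_first_connection(listener: socket.socket, timeout: float = 5.0):
    """Wait until a connection is pending, then start the service.

    The supervisor never calls accept() itself. It only watches the
    listening socket for readability, which signals a pending connection.
    """
    results: list = []
    with selectors.DefaultSelector() as sel:
        sel.register(listener, selectors.EVENT_READ)
        if not sel.select(timeout=timeout):
            return None, results  # nothing arrived, so the service stays off
    worker = threading.Thread(target=handle_one_request, args=(listener, results))
    worker.start()
    return worker, results


def demo() -> bytes:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    # The client connects and sends before any service exists. The kernel
    # queues the connection on the listening socket in the meantime.
    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(b"ping")

    worker, _ = activate_on_first_connection(listener)
    reply = client.recv(1024)
    worker.join()
    client.close()
    listener.close()
    return reply
```

Calling demo() shows the key property: the client's request is held by the kernel until the service starts, so nothing has to sit in front of the service to buffer it.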
This also allows containers to be hibernated: processes that have not been accessed or have not processed a request in a while are identified and “reaped”, so they consume no resources. When a new request comes in, it takes a few seconds to restart the container. We deploy thousands of containers to any particular box, but only about 5-10% are running at any particular point.
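A minimal sketch of the reaping side, assuming a simple "last request time" policy. The interview does not describe how Pantheon decides a container is idle, so the IdleReaper class, its threshold, and the container names below are all hypothetical.

```python
import time
from typing import Callable, Dict, List


class IdleReaper:
    """Tracks when each container last served a request (hypothetical policy)."""

    def __init__(self, idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_seconds = idle_seconds
        self.clock = clock  # injectable so the policy can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def touch(self, name: str) -> None:
        """Record that the named container just handled a request."""
        self.last_seen[name] = self.clock()

    def reap(self) -> List[str]:
        """Return and forget every container idle for at least idle_seconds.

        A real supervisor would stop the container here. With socket
        activation the listener stays open, so the next request restarts it.
        """
        now = self.clock()
        idle = sorted(name for name, seen in self.last_seen.items()
                      if now - seen >= self.idle_seconds)
        for name in idle:
            del self.last_seen[name]
        return idle
```

A supervisor would call touch() whenever a container serves a request and call reap() on a timer, stopping whatever it returns.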
Is any of this open source?
All of the stuff we do for socket activation is entirely open source and built into the systemd project. This is shipping widely now. Back when we started on systemd it was only on Fedora; now it is on SuSE, it’ll be on the next Ubuntu release, and it’s in the latest Red Hat Enterprise Linux 7. So this architecture is available to everyone except those on Ubuntu LTS.
And is it in systemd, or is it under a different project name?
It’s in systemd. You basically configure a listener separately as a socket unit and then, in the service itself, you can have it start a container with something called systemd-nspawn. It also integrates with Red Hat’s libvirt infrastructure to do activation of LXC containers. At that point it is a matter of configuration. I have published articles on the systemd wiki about this.
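On the receiving end, a socket-activated service finds its listeners through systemd's documented convention: the LISTEN_PID and LISTEN_FDS environment variables, with the passed file descriptors starting at 3. The sketch below reimplements that check in plain Python for illustration. The function name inherited_listen_fds and its injectable parameters are hypothetical, and real services would normally use sd_listen_fds() from libsystemd or an existing binding.

```python
import os
from typing import List, Mapping, Optional

SD_LISTEN_FDS_START = 3  # first descriptor systemd passes, after stdin/stdout/stderr


def inherited_listen_fds(environ: Optional[Mapping[str, str]] = None,
                         pid: Optional[int] = None) -> List[int]:
    """Return the descriptor numbers systemd handed to this process, if any.

    environ and pid are injectable only so the logic can be exercised
    without actually running under systemd.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        # LISTEN_PID guards against a child inheriting stale variables.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or malformed: not socket-activated
    if count < 0:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))
```

Each returned number can be wrapped with socket.socket(fileno=fd) and used exactly like a listener the service had opened itself.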
You can actually trick a lot of daemons into thinking they are reloading their configuration and they will accept an inherited socket. For example, the way that nginx or php-fpm handle their safe reload without interrupting their request availability is by forking off a new child and then handing off the current listeners to the new child. Once the child appears to be fully operational the parent just dies off. They are kind of socket activating their child, and that chain can continue indefinitely.
This can go on indefinitely with no performance impact because there is no proxy involved; nothing is actually copying the bytes from one process to another. When you hand off a socket this way, the hand-off happens entirely in the kernel. There is no performance degradation versus the application having created the listener itself.
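The hand-off described here can be sketched with a plain fork, assuming a Unix-like system. This is a simplified illustration, not how nginx or php-fpm are actually written: the parent creates the listener, the child inherits it, and the parent then closes its own copy. The function name handoff_demo is hypothetical.

```python
import os
import socket


def handoff_demo() -> bool:
    """Return True if the forked child, not the parent, served the request."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    pid = os.fork()
    if pid == 0:
        # Child: the listening socket was inherited across fork().
        try:
            conn, _ = listener.accept()
            conn.sendall(str(os.getpid()).encode())
            conn.close()
        finally:
            os._exit(0)  # leave immediately without running parent cleanup

    # Parent: drop its copy. The kernel keeps the listener alive because
    # the child still holds a descriptor for the same socket.
    listener.close()
    with socket.create_connection(("127.0.0.1", port)) as client:
        answered_by = int(client.recv(64).decode())
    os.waitpid(pid, 0)
    return answered_by == pid
```

No bytes pass through the parent at any point. The client talks directly to whichever process currently holds the listener, which is why the chain can repeat without adding overhead.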
What about security? Is the security profile of using containers this way any different from using Docker?
The systemd-nspawn project has full support for mandatory access controls with SELinux. This matters for container security because root escalation vulnerabilities turn up all the time: people figure out that if you make this kernel call with this data, you can escalate your current user to root. With SELinux, access controls are paired with the concept of operating under a “label”, which confines the process to the resources allocated to that label.
This means that to attack an SELinux system you not only have to find a way to become root, you also have to manipulate the SELinux state machine or find a way to alter your label. Labels are great for containers: you just label everything in the container with one value and launch it with permission to access things only through that label. Even if the stuff in the container got root, it wouldn’t have permission to read or write anything on the base system except those things you have granted it access to.
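A toy model makes the point about root concrete. It is not SELinux policy syntax or any SELinux API, and the label strings, paths, and the may_access function are all hypothetical. It only shows the shape of a mandatory check in which the decision depends on labels and ignores the user ID.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Process:
    uid: int    # 0 means root; deliberately unused by the check below
    label: str  # label the process was launched with


@dataclass(frozen=True)
class Resource:
    path: str
    label: str


def may_access(proc: Process, res: Resource,
               granted: FrozenSet[str] = frozenset()) -> bool:
    """Toy mandatory check: allow only a matching or explicitly granted label."""
    return res.label == proc.label or res.label in granted
```

In this model a process running as uid 0 with label "container_a" can reach resources labeled "container_a", plus anything whose label was explicitly granted, and nothing labeled for the host.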
Are you thinking about porting back some of this to Docker, and potentially adopting Docker in the future?
We are running a couple of our more central services in Docker containers because it is a convenient way to package applications that are always running. We haven’t looked at using it directly for our users’ containers yet, because of the lack of on-demand launching. So far, the Docker upstream project has not been receptive to accepting the type of changes necessary to support on-demand launching.
However, it looks like Red Hat is starting to maintain a patch set for the version of Docker it is going to be shipping on Fedora and RHEL, and then of course on CentOS and other derivatives too. This patch set will support on-demand launching.
Red Hat has some use cases where this really fits, such as the OpenShift project. OpenShift is moving to Docker as its foundation, but they don’t run Docker directly. They run Docker with Geard, which is more like CoreOS. The difference is instead of launching a container that dockerd launches and manages, it creates a systemd service and then it runs Docker within that systemd service. If this is in place you can patch Docker to hand off the socket into the container and support socket activation.
OpenShift has a need to be able to launch containers on-demand. Currently, they do this at the application level by utilizing an HTTP proxy. This requires the proxy and there isn’t a clean way to hold the request until the container launches. At Pantheon we are hoping they see socket activation as the future model for how you say a container needs to listen on something and then have it start when a request comes in, so you can have an on-demand model that doesn’t require a lot of complexity.
What is your opinion about the whole PaaS space?
We looked at Cloud Foundry a couple of years ago, but haven’t taken a fresh look at it. So I can’t really comment on Cloud Foundry, although they have had some uptake.
I have more connection with Red Hat and the Fedora community and so am more familiar with OpenShift. We are probably not going to use OpenShift directly because it is pretty heavy-weight and requires running a lot of components to get the infrastructure set up.
We are much more interested in something lighter-weight for container orchestration like CoreOS. We want to provision, hibernate, migrate, etc. containers without having the overhead of a GUI, authentication layer and manager, and the other infrastructure that goes hand-in-hand with something like OpenShift.
That being said, OpenShift will be much more interesting once they have the Docker and on-demand launching abilities.
Have you looked at any of the Docker PaaSes, like Flynn, Deis, or Dokku?
We haven’t. The main issue for running our product on them is that these systems are based on a billing model where you are paying for the container, and you are paying for the container to be persistent and run indefinitely until you stop paying.
Our model is based on several kinds of compounded efficiency. One is that we are moving more to running containers on bare-metal which is massively cheaper than occupying an entire VM in the cloud. This is due to the fact that we have very predictable provisioning needs in terms of our customer growth path. We can know 30 days in advance what our container growth is going to be. Systems like Rackspace OnMetal are providing API provisioning access to bare-metal where there is no downside as long as you have a way to slice up the servers.
We get our compute cheap, because we eat all of it by not using VMs as our isolation layer. We get tons of density by using the on-demand container stuff. We are twenty times denser than we would be running all the containers all the time. We also get massive underlying efficiencies from how we do our packaging and distribution of binaries. We have the binaries and libraries on the base OS, have the containers use those, and take advantage of the fact that Linux will map the running binaries to a single image. This provides gigabytes of RAM savings by not having every container instance run its own set of binaries and libraries.
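The twenty-fold figure lines up with the earlier remark that only about 5-10% of containers are running at any moment. As a rough check, assuming density scales as the inverse of the active share and ignoring the fixed cost of idle containers (the function name density_multiplier is hypothetical):

```python
def density_multiplier(active_percent: float) -> float:
    """How many times more containers fit if only active_percent of them run."""
    if not 0 < active_percent <= 100:
        raise ValueError("active_percent must be in (0, 100]")
    return 100 / active_percent
```

With 5% active this gives 20x, and with 10% active it gives 10x, so the quoted figure corresponds to the low end of that range.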
Do you use containers primarily on virtual machines or on bare-metal today?
Primarily on virtual machines. We had a project called “Pod One” that was a rack of bare-metal hardware provisioned in a traditional way, and it was a pain to manage. We really like API-based provisioning even though we have pretty predictable needs. Rackspace OnMetal is changing that, and we are looking for OnMetal to come to some of the data centers we use. We were launch partners with Rackspace and participated in the design of the system; Rackspace came to us and asked what we needed and what we thought future infrastructure needs would be.
I wouldn’t be surprised if we move almost entirely to this system with Rackspace data centers because it is 2-3 times cheaper than cloud infrastructure. It turns out that due to the size of the VMs on the cloud we were essentially the only tenant on the box. You have to remember that with cloud infrastructure you are paying on-demand for your compute and other services, but there is an overhead that you are charged to allow the cloud provider to have the idle machines standing by.
Do you think that the “container revolution” will push more people away from VM-based clouds and toward bare-metal over the next 3-5 years?
Certainly products like OnMetal, where you provision a machine the way you provision a VM, will change a lot.
It depends on what people are running. Many people wouldn’t know what to do with an entire machine, even if they could slice it up into containers. If you have an application that takes 4 gigs, and another application that takes 4 gigs, what do you do with a machine that has 128 gigs of RAM?
Providers want to get the best match of machine to their datacenter profile: racks, power consumption, cooling requirements. This means they will have a “sweet spot” for the profile of their machines. Bare-metal machines that only have 4 gigs of RAM are not in the sweet spot and are not economical. VMs have utility when the size of the machine you require is less than the sweet spot of the bare-metal server. VMs will persist and be offered to satisfy this demand.
Complementary to this issue are the super light-weight VMs. These are VMs with a minimal kernel and just the application. Over the past 5 years the emergence of high-level virtual device drivers (instead of existing device emulation) has led to “thin VMs” that only support these types of virtual devices. This allows for less complexity, makes them look the same no matter where you deploy them, and allows for clean aggregation.
Since there is no user land required, the thin VM only has an OS plus the application, and the OS can be deployed in less than one hundred megabytes. The OS can be very much stripped down because it is specialized for running a single application. It doesn’t give all the efficiency we get with containers, but it gives better isolation for security and performance: its page tables are protected by hardware, the hypervisor is protected by the processor as a separate system, and you’re only paying a couple of percent for the virtualization overhead.
In either case, containers or light-weight VMs, you will want to run these on bare-metal as opposed to trying to run virtual systems on virtual systems. Whether for cost or capability, this is the future.
One last question, what do you think about Hack and the Facebook alternative to PHP?
The great thing about Hack is that it is not an alternative to PHP. Hack is an additional layer within their HHVM that allows you to port existing code and write new code that supports the new semantics of Hack. Typing is a key concept, and it is just better for complex projects, but it is all just running in the JIT-compiled virtual environment.
You can run existing code, you can add new code with new semantics and features and there isn’t an overhead associated with running this mixed program. It retains complete compatibility with existing code.
You can iterate gradually by porting code over, which is neat. The biggest problem with moving code out of PHP is that the only option with the Zend runtime is to write a C extension, which has a really nasty API. It’s hard; it’s like halfway C and halfway macros. It’s some of the ugliest stuff I’ve ever seen.
With Hack and HHVM you don’t have to change the way your developers work.
So you think Hack is here to stay?
The biggest concern people have with Hack is that it is not supported at all by the Zend runtime. So the only way you can include Hack stuff in your code is to run it with HHVM. Facebook is trying to mitigate this with a new initiative to jointly standardize the PHP language with the Zend engine. This would allow Hack or HHVM to be adopted without having to “never look back”. Right now it is a pretty big step.
We demoed experimental support for HHVM on Pantheon at DrupalCon Austin. We showed benchmarks, and it is a pretty massive improvement in performance. It is pretty exciting. While it is not ready as a customer offering, we have built the system to be able to support it long term.