What if your laptop asked you which core you would like to run Word on today? It sounds archaic, but that's the current state of the art in server infrastructure. You still have to pick which virtual machine you want to put your Rails app in.

Mesosphere offers a layer of software that organizes your machines, VMs, and cloud instances and lets applications draw from a single pool of intelligently- and dynamically-allocated resources, increasing efficiency and reducing operational complexity.

It provides an HA and fault-tolerant platform, and powers production environments at some of the most agile companies in the world. Mesosphere is based on Apache Mesos, an open source technology invented at UC Berkeley's AMPLab and run at large scale at Twitter.

Enjoy my interview with Florian as I learn about Mesosphere and find out why it is making big waves in the Docker world.

Background

Florian Leibert, CEO and Founder of Mesosphere

Florian Leibert's mission at Mesosphere is to create a smart infrastructure mesh that lets developers stop thinking about which piece of code should run where.

Show Notes

[01:37] Can you tell us about your background?

I came to Silicon Valley in 2009 and started with a company called Ning. Later I moved from Ning to Twitter, at the time that Twitter's user base was scaling up. The demand for the site was going crazy, but at the same time the company was resource constrained in terms of hardware and engineering resources.

My host family when I came to the US in 2000 included Benjamin Hindman, one of the co-creators of Mesos [http://mesos.apache.org/], who was working over at UC Berkeley. He was invited to talk at Twitter, and some of the ex-Google employees recognized in it something like the technology they had used at Google to really help them scale. This was Google Borg, which became Google Omega.

We convinced Ben to join Twitter, eventually full-time, and Twitter invested heavily into this technology. This solved a lot of issues for them. One of the main issues was that it really increased the level of automation and simplified the infrastructure.

We were pulling micro-services out of the monolithic Rails application (called Monorail), creating micro-services written in Scala or Java, and we needed to be able to deploy them. It took forever to coordinate with Ops to get the Puppet config files correct for deployment. The whole cycle took two or three months at the time to get a piece of code deployed to production.

Mesos is very different from many of the technologies out there. It gathers all of your resources together and presents them as one large pool. With software like Marathon or Aurora you can actually deploy new applications in minutes to this shared pool of resources.

After two years of adopting Mesos within Twitter I joined Airbnb. At Airbnb we also started using Mesos and we built a system called Chronos. Chronos targeted the analytics stack; it really was an ETL tool that allowed you to construct dependency chains of individual jobs.
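For illustration (this sketch is not from the interview and is not Chronos code), a dependency chain of jobs can be modelled as a graph and ordered so that each job runs only after the jobs it depends on. The job names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL jobs: each key maps to the jobs it depends on.
JOBS = {
    "extract_logs": [],
    "clean_logs": ["extract_logs"],
    "aggregate_daily": ["clean_logs"],
    "email_report": ["aggregate_daily"],
}


def run_order(jobs):
    """Return job names ordered so every job follows its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    for name in run_order(JOBS):
        print("would run:", name)
```

A real scheduler would also handle timing, retries, and failures; the sketch only shows the ordering step.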

After we successfully created Chronos and got traction within Airbnb, and outside Airbnb, we decided to start a company around Mesos, Chronos, Marathon and a number of other technologies. It was Tobi (Knaup) and myself, and Ben has now joined us full-time and we are working on Mesosphere.

[06:00] Can you explain what Mesos does for people?

Yes, certainly. Our laptops today have multiple cores. If you think back in time, when we first started working in Assembler and low-level programming languages, we didn't really specify the resources we were using. As we moved up the stack to higher-level programming languages we were able to use abstractions that the kernel provided, which allowed us to fire off threads that ran on an arbitrary CPU.

In our data centers today we write against individual machines; in virtualized environments these are often virtual machines. For a distributed application we have to manually wire these machines together. If one of these machines fails, we as developers have to manually take action to recover. What Mesos does is provide a higher-level abstraction that allows us to think about only the business logic when writing a distributed application.

That's well and good when you are writing a brand new application, think Apache Spark [https://spark.apache.org/]. Spark is often touted to replace Map-Reduce, but Spark was a sample application for Mesos when it was invented over at UC Berkeley as well. They argued that they could only write Spark and get so much adoption so quickly because they were building on top of Mesos. They were able to use the abstractions that Mesos provides and forget there was a network underneath.

We are trying to use Marathon to run existing programs that run on a particular box, like a Ruby on Rails application that runs on a single machine, and to schedule those in the cluster, scale those in the cluster, and also deal with failures of individual nodes.

Marathon is a native Mesos application and uses the Mesos abstractions to carry out these tasks.

[09:36] Has job scheduling become more important to cloud application development?

Yes. As you build out the higher levels, something has to schedule the lower levels. Think of a hybrid environment where you may have some nodes at Amazon and some nodes in your private data center. Imagine that something abstracted all of these resources away, and you as a developer only had to tell the system, "Hey, I want to launch this application five times, and I don't really care where." Under the hood is a scheduler that takes the application and makes sure it's running five times across all this hardware, and if it fails somewhere it will restart it.

On a laptop with multiple cores, that is exactly what the kernel does. What Mesos is trying to bring is this metaphorical kernel for your data center or cloud.
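For illustration (not from the interview, and not how Mesos or Marathon is actually implemented), the "keep this application running five times" behaviour can be sketched as a reconciliation step: drop instances on failed nodes, then relaunch on the least-loaded healthy node until the desired count is met. The `reconcile` function and node names are hypothetical.

```python
from collections import Counter


def reconcile(desired, running, healthy_nodes):
    """Return a placement with `desired` instances, all on healthy nodes.

    `running` holds one node name per running instance. Instances on nodes
    that are no longer healthy are dropped and relaunched elsewhere.
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes to schedule on")
    healthy = sorted(set(healthy_nodes))
    # Keep surviving instances, trimming any excess over the desired count.
    placement = [node for node in running if node in healthy][:desired]
    load = Counter(placement)
    while len(placement) < desired:
        # Pick the healthy node with the fewest instances (ties: by name).
        target = min(healthy, key=lambda node: load[node])
        placement.append(target)
        load[target] += 1
    return placement


if __name__ == "__main__":
    # Node "c" has died, taking two of the five instances with it.
    print(reconcile(5, ["a", "a", "b", "c", "c"], ["a", "b"]))
```

Running the same step repeatedly is what gives the self-healing effect: whenever a node disappears, the next pass restores the requested instance count.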

[11:56] There is Mesos the open source project, and there is Mesosphere the corporation that you are the CEO of. What is the difference?

Mesos is an Apache project that has a large community behind it and a lot of adoption. Many of the companies that have adopted it, like Twitter and Airbnb, have a lot of engineering resources: they can go in, they can fix bugs, they can install it everywhere.

Many other companies that are not as focused on engineering might just want a drop that is efficient. If they want to install their own PaaS layer, they can use our products and services to get up and running with Mesosphere, which is more than Mesos because it encompasses a series of products, all of which are currently open source.

[13:15] Mesosphere is a way to do Mesos easier?

If you think of Mesos itself, it is like the metaphorical kernel for your data center. Most users don't just download the kernel, they want the full system. That's what we want to deliver.

[13:58] For Mesosphere is that a project as well, that you can do for free, or is it pay only?

Right now all of the public products are free. A number of them are heavily used, for example Mesosphere for GCE. You go to google.mesosphere.io and you can launch a Mesosphere cluster in less than 10 minutes. It comes with things like a file system installed, Marathon installed, it comes with Mesos of course installed, and we are going to add more and more components to that. It really allows you to get up and running with a cluster that we built up at Airbnb over the course of a year, in ten minutes or so.

[15:00] Who should be focusing their time and efforts learning how to use Mesos?

Mesos has multiple entry points. If all you care about is launching a Rails application and keeping it running, it is less interesting. However, if you are building the next distributed system, let's say the next distributed database or the next distributed message bus, you can look at Mesos and its abstractions and write a fully distributed system much more easily and with very little work.

Chronos is a good example. It is a fully distributed system; it's elastic, highly available, and fault tolerant, and the core of it has about 2500 lines of code. Even though it is a distributed system it has almost no lines of networking code.

Mesos is the base for all of that, and it has hooks for most modern programming languages: Go, the JVM languages (Java, Scala, and so forth), and Python. You can write new Mesos frameworks in no time at all in almost any language.

[16:43] If you are a Rails developer and want to run a Rails application, should you be looking at Mesosphere or is that overkill?

We want to make it as easy as possible to use Mesos and Mesosphere products so we argue that if you can build for scale with zero overhead, why not do that from day 1?

You can actually run Marathon in local mode, or you can start a small cluster on GCE or on Amazon. You can start the application via Marathon and then maintain it. I argue that even if you are a shop with two or three developers it is important to run something that is fully automated, because you probably don't want to be paged in the middle of the night when one of your machines goes down. Maybe it's more cost effective to have three machines running, and then if one goes down you can deal with it during your lunch break.

[17:49] How is Mesosphere different from Platform as a Service?

One aspect of Mesos, particularly if you look at Marathon on top of Mesos, is this very programmable PaaS layer. But we don't want to limit it to that; we think the spectrum is much wider. When you think of Mesos as this base for launching tasks into the cluster, it becomes much more than just a PaaS, it becomes Infrastructure as a Service.

If you are thinking of running a truly multi-tenant cluster you can think of running Spark, maybe your Hadoop distribution of choice, and Marathon; then you run long running services alongside your analytics batch services. That's when it becomes very compelling, because you have shared hardware, everything runs on the same cluster, you are really simplifying your infrastructure quite dramatically.

[19:27] How is Mesos different from Kubernetes, or CoreOS with Fleet?

That's a great question. Kubernetes is more like Marathon, and Mesos is a layer below. As a matter of fact, we have ported Kubernetes onto Mesos, because Kubernetes really allows you to run existing applications, such as your Rails app, and if you want to run it at scale you still need a good scheduler underneath. That is where Mesos comes in.

Fleet is similar to Kubernetes, but it is not as feature rich as either Marathon or Kubernetes, and I don't know of anyone who uses it at scale. We have a customer running Marathon at scale on tens of thousands of physical servers, and Twitter runs Mesos on 30,000+ servers as far as I know. It's really a case of scaling. Marathon is this PaaS-like component, you can look at Kubernetes as a PaaS-like component, and Fleet is this PaaS-like component, but you cannot run Spark on top of Kubernetes or Fleet. You can run it on Mesos.

[21:25] How do Mesos and Mesosphere deal with data, such as MySQL databases and the file system underneath?

Today if you want to deploy MySQL on top of the Mesos architecture, using Marathon or Kubernetes, you would use static associations between hosts and the database. A database is not going to be a system that is very dynamic, for example if there is a network blip you cannot just move the database to somewhere else. You would want to wait until an event occurs and then declare the database host to be dead. Then, right now, you would have to do manual recovery.

Nobody has a universal solution that really deals with databases and transactional systems. That being said we are building abstractions into Mesos itself that make it much easier to build new frameworks to build transactional systems on top of Mesos directly. We call these "persistent offers" where a certain application will always be offered a host with certain data. This is on our very short term roadmap and later this year we will have a very good solution for this as well.
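For illustration (not from the interview, and not a Mesos API), the idea behind "persistent offers" can be sketched as a filter: an application that already has data on a host is only offered resources on that host, while stateless applications may take any offer. The offer dictionaries, `data_location` map, and `filter_offers` function are hypothetical.

```python
def filter_offers(offers, data_location, app):
    """Keep only the resource offers that `app` may accept.

    `data_location` maps an application name to the host holding its data.
    Applications without recorded data may accept any offer.
    """
    pinned_host = data_location.get(app)
    if pinned_host is None:
        return list(offers)
    return [offer for offer in offers if offer["host"] == pinned_host]


if __name__ == "__main__":
    offers = [{"host": "h1", "cpus": 2}, {"host": "h2", "cpus": 4}]
    # Hypothetical: the MySQL data directory lives on host h2.
    print(filter_offers(offers, {"mysql": "h2"}, "mysql"))
```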

[23:14] Why did Mesos move to Docker, rather than continue to use cgroups directly?

What we actually did was make the containerizer pluggable. Mesos has been using cgroups for three or four years, since its inception, so we've always been able to use that isolation. We've seen a lot of momentum with Docker and the Docker ecosystem, so it was natural for us to allow someone to plug in the Docker container as one of the implementations of isolation and packaging.

We are seeing a lot of companies that are really jumping onto Docker for that. But we also see companies that are really happy with having a tar archive and they still use cgroups for isolation.

[24:38] Do you have to think differently about building applications to use Mesosphere, or do you build your application as you normally do and use the Marathon deployment mechanisms?

If you want to write and run a Ruby on Rails application you don't have to change anything. All you need to do, once you are ready to deploy it, is upload it to the cluster, into HDFS or S3, and then point Marathon at the binary that represents your application. It takes zero additional effort to start your application on a Mesos/Marathon cluster.
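For illustration (not from the interview), "pointing Marathon at the binary" amounts to submitting an app definition to Marathon's REST API (`POST /v2/apps`). The sketch below only builds the JSON payload and does not contact a cluster. The app name, command, and HDFS path are hypothetical, and it assumes the `uris` field used by Marathon releases of that period, so check the API reference for your version.

```python
import json


def marathon_app(app_id, cmd, artifact_uri, instances=1, cpus=0.5, mem=256):
    """Build an app definition in the shape Marathon's /v2/apps accepts."""
    return {
        "id": app_id,
        "cmd": cmd,                # shell command that starts the app
        "uris": [artifact_uri],    # artifacts fetched before the task starts
        "instances": instances,    # how many copies to keep running
        "cpus": cpus,              # CPU share per instance
        "mem": mem,                # memory per instance, in MB
    }


if __name__ == "__main__":
    app = marathon_app(
        "rails-app",                                   # hypothetical name
        "cd rails-app && bundle exec rails server -p $PORT",
        "hdfs://namenode/apps/rails-app.tgz",          # hypothetical path
        instances=3,
    )
    print(json.dumps(app, indent=2))
```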

That being said, if you are writing a new application, let's say you are trying to write the next distributed message bus, you could write directly against the Mesos abstractions and write a fully distributed system that is fully elastic, that is self-healing, and then launch it against a GCE cluster just like you could on your laptop.

The great thing about Mesos is that it really turns your compute resources into this pluggable system. You can just add resources, you don't have to restart any applications, resources are added to the pool and then applications can be scheduled against the resource pool.
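For illustration (not from the interview, and far simpler than Mesos itself), the pooling idea can be sketched as a structure that nodes join at any time and that applications are placed against without knowing which machines exist. The `ResourcePool` class and node names are hypothetical.

```python
class ResourcePool:
    """Toy model of a shared pool of CPU resources."""

    def __init__(self):
        self.free = {}  # node name -> free CPUs

    def add_node(self, name, cpus):
        """Add a node; running tasks are unaffected."""
        self.free[name] = self.free.get(name, 0) + cpus

    def total_free(self):
        return sum(self.free.values())

    def place(self, cpus):
        """Reserve `cpus` on the node with the most free capacity.

        Returns the chosen node name, or None if no node can fit the task.
        """
        candidates = [n for n, free in self.free.items() if free >= cpus]
        if not candidates:
            return None
        node = max(sorted(candidates), key=lambda n: self.free[n])
        self.free[node] -= cpus
        return node


if __name__ == "__main__":
    pool = ResourcePool()
    pool.add_node("n1", 4)
    print(pool.place(6))   # no node is large enough yet
    pool.add_node("n2", 8)
    print(pool.place(6))   # fits once the new node has joined the pool
```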

[27:00] Can you have Mesosphere/Marathon deploy Docker containerized applications, instead of tarballs?

Absolutely, you can just create a Docker image and put it into your Docker registry of choice and then specify the URI to Marathon to launch it. You can scale it up and down, and if one of the container instances fails the system will self heal and restart the application instance.
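For illustration (not from the interview), a Docker-based Marathon app definition replaces the artifact and command with a `container` section naming the image. The sketch builds the payload and a scaled copy of it without contacting a cluster. The image name and app id are hypothetical, and the field layout assumes Marathon's Docker container format.

```python
import json


def docker_app(app_id, image, instances=1, cpus=0.25, mem=128):
    """Build a Marathon app definition that runs a Docker image."""
    return {
        "id": app_id,
        "container": {"type": "DOCKER", "docker": {"image": image}},
        "instances": instances,
        "cpus": cpus,
        "mem": mem,
    }


def scaled(app, instances):
    """Return a copy of the definition with a new instance count."""
    return {**app, "instances": instances}


if __name__ == "__main__":
    # Hypothetical image in a registry of your choice.
    app = docker_app("web", "registry.example.com/team/web:1.0", instances=2)
    print(json.dumps(scaled(app, 5), indent=2))
```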

[28:03] What is the future of Mesos, where is it going and what will it look like?

We really want to enable developers to write stateful applications, specifically transactional applications directly against the Mesos API. We are really aiming to have a rich ecosystem of native Mesos applications. We want developers to be able to leverage the underlying system to reduce their complexity and make distributed systems programmatic and as easy as programming against a single machine.

With the Internet of Things and the need for more frameworks and models for machine learning applications, we will see more need for writing distributed applications. We aim to be the platform for all of these distributed systems.

[29:26] How does a developer choose/decide between Marathon and Kubernetes?

I think it is similar to the discussion of whether one programming language is better than another. A lot of these things are very opinionated. For Marathon I can say it is running in production at a very large scale at a number of companies. It really comes down to your opinion on how you want to deploy your application. You can just try it out: go to the GCE cluster and try Marathon, and you can install Kubernetes on the same cluster and try it also.

[30:42] How much does it cost to try out a simple cluster?

We don't actually charge anything, your Google charges will carry forward. It's roughly 60 cents a minute for a small cluster and you can start clusters of arbitrary size. We have instructions on how you can add nodes if one dies.

[31:14] What is a small cluster size?

I think it is four nodes.

[31:43] What is the future of cloud application architecture(s)?

In my opinion, we need to start architecting for fault tolerance and scale out. We need to stop thinking about individual machines that all have their snowflake configuration. We should treat compute as a resource, as a big aggregate resource pool. Particularly in larger enterprises it is really inefficient: every department does its own purchasing of hardware, and you end up with these clusters that are terribly underused.

[32:44] Do you think that containers will become more prevalent than virtual machines in the future?

I think they will co-exist for a while. Changes in IT happen slowly, there are early adopters but there are many laggards. It will be a while until the whole landscape will have shifted.

In some industries, such as financials, we are seeing a shift now.

[33:44] What do you think is driving that shift?

Linux for the enterprise is driving a lot of this. Data centers are fundamentally changing as well. In the past, the way we thought about applications was exactly that: we thought about individual machines and connecting them together. Next-generation hardware will be this disaggregated hardware, these racks. Open Compute is going in this direction as well.

[34:40] What is the difference between this and the Amazon public cloud strategy?

Amazon has given developers the ability to start a cluster, whether it is a Hadoop cluster or a cluster of Rails instances, with a couple of mouse clicks. What we are doing is very similar to that. You can run your analytics, you can run your long running services, all on a shared infrastructure.

The advantage of Mesos and Marathon is that these are open source projects; they are not tied to a cloud provider. They are not tied to specific hardware, and they are not tied to virtual machines. They support hybrid configurations, some running in the cloud and some running on-prem, and can merge them together logically and make them look like one big pool.