The world is in a golden age of information technology (IT) innovation. The mega-forces of cloud, mobile, big data analytics powered by machine learning, IoT, and massively scalable apps are re-shaping all aspects of business and society. At the center of this IT renaissance is an unprecedented global data center (DC) buildout in public, private, and hybrid cloud computing. According to Synergy Research Group, the number of hyperscale DCs around the globe grew from 300 at YE16 to 390 at YE17, with another 69 hyperscale DCs that are in various stages of planning or building.
In this paper, we briefly review the three major waves of DC infrastructure innovation to date. We then introduce the fourth wave of IT infrastructure: Application-Defined Infrastructure (ADI), and the technology forces and operational challenges driving its adoption by large enterprises.
A BRIEF HISTORY OF DC INFRASTRUCTURE
A DC is a purpose-built structure used to house computer systems and associated components, such as networking equipment, storage systems, and telecommunications equipment. It is the brain of the knowledge economy and of our connected world. The modern DC has its roots in the mainframe room of the 1960s, the telecommunications central office, and the enterprise IT wiring closet. The past two decades have seen an explosion of creativity perfecting the art of the modern DC.
1997-2007, First Wave – Bare Metal Servers
Bare metal servers are single-tenant physical servers. Their strengths are high application (app) performance and predictability. Their weaknesses are high cost, medium complexity to provision apps, and low flexibility once apps are deployed. They continue as a solution of choice for specific performance-sensitive workloads that merit dedicated infrastructure (e.g., databases). Bare metal is still often used for dedicated computer clusters that are built to support specific scale-out distributed computing apps (e.g., a Hadoop cluster). However, the requirements for greater flexibility and improved economics have made this approach limiting given the continually evolving app landscape.
2005-Present, Second Wave – Virtualization with Hypervisors
Virtualization is an emulation of a computer system that enables one physical computer to run one or more virtual machines (VMs).
Figure 1: Making One Computer Look Like Many with Hypervisor-Based Virtualization
While this concept goes back to the 1960s and the era of mainframe computers, it was brought to the forefront of IT efficiency gains by VMware in 1998 with its commercialization of the modern hypervisor. Before VMware, a significant amount of expensive computer resources was underutilized. The VMware hypervisor helped solve the need for greater IT efficiency by making one computer look like multiple computers, each with its own guest operating system. As of July 2017, VMware had grown into a highly profitable $50B market cap company with $1.9B of revenue. With hundreds of thousands of businesses around the world running important portions of their operations on VMware’s virtualized systems, VMware is the market share leader in hypervisor-based virtualization solutions for enterprise private clouds. Other hypervisors include Microsoft Hyper-V, Linux KVM, and Xen.
The strengths of virtualization with hypervisors include: technology maturity; broad adoption; improved computer utilization by enabling multiple VMs per physical server; and the availability of infrastructure software to build and operate clouds based on VMs.
The weaknesses of virtualization with hypervisors include: high complexity; hypervisor resource overhead; guest operating system resource overhead for each VM; a non-negligible application performance hit when compared to bare metal infrastructure; the “noisy neighbor effect,” when one user impacts the performance and stability of other users within the same physical server; the “IO blender effect,” when multiple VMs send their IO requests at the same time and degrade storage performance; and the time required to instantiate new VMs.
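The “IO blender effect” can be illustrated with a toy model. The sketch below assumes that each VM issues a purely sequential stream of block addresses in its own region of a shared disk, and that requests from all VMs are interleaved round-robin. The function names and the "total seek distance" metric are illustrative assumptions, not a measurement of any real hypervisor or storage array.

```python
from itertools import chain, zip_longest


def sequential_stream(start_block, length):
    """Block addresses for one VM reading its own region sequentially."""
    return list(range(start_block, start_block + length))


def blend_round_robin(streams):
    """Interleave requests from all VMs, one request per VM per turn."""
    interleaved = chain.from_iterable(zip_longest(*streams))
    return [block for block in interleaved if block is not None]


def total_seek_distance(blocks):
    """Sum of address jumps between consecutive requests (toy cost metric)."""
    return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))


# Four hypothetical VMs, each reading 100 blocks from a separate region.
streams = [sequential_stream(vm * 10_000, 100) for vm in range(4)]

one_vm_alone = total_seek_distance(streams[0])
blended = total_seek_distance(blend_round_robin(streams))

print(f"one VM alone:     {one_vm_alone} blocks of seeking")
print(f"four VMs blended: {blended} blocks of seeking")
```

Each stream is perfectly sequential on its own, yet the blended stream jumps between regions on almost every request, which is why the storage layer sees a near-random workload.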
2010-Present, Third Wave – Hyper-Converged Infrastructure (HCI) (also based on hypervisors)
A Hyper-Converged Infrastructure (HCI) is a wholly software-defined IT infrastructure that virtualizes all of the elements of conventional “hardware-defined” systems. HCI includes, at a minimum, virtualized computing (a hypervisor), virtualized storage (a software-defined SAN), and virtualized networking (software-defined networking).
Figure 2: Hyper-Converged Infrastructure (Source: Nutanix Definitive Guide for HCI)
Simply stated, HCI integrates compute, storage, and network connectivity into a “cloud-in-a-box,” then provides a unified management view of both hardware and software assets to hide the complexity of the cloud. HCI uses sophisticated infrastructure software on top of bare metal commodity parts to simplify management and increase ease-of-use for end users in certain high-value apps (e.g., virtual desktop). HCI vendors include Dell/EMC, IBM, Lenovo, HP, Nutanix, Stratoscale, and Cisco.
The strengths of HCI include: ease of use, achieved by pre-packaging hardware and software together and hiding the underlying complexity of virtualization with the hypervisor; graceful scaling of infrastructure by growing clusters of HCI appliances; and simplification of the do-it-yourself approach to building a private cloud.
The weaknesses of HCI include: ratios between compute and storage that are locked in at the time these system resources are packaged together into an appliance; lack of support for some important classes of stateful apps (e.g., relational databases); limited support for massively scalable modern data stack distributed computing apps like Hadoop, MongoDB, Cassandra, and Spark; and little to no deployment in larger clouds.
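The cost of a locked-in compute-to-storage ratio can be made concrete with a small worked calculation. The appliance size and workload figures below are hypothetical numbers chosen only for illustration; they do not describe any vendor's product.

```python
import math


def hci_appliances_needed(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """Appliances required when compute and storage can only scale together."""
    by_compute = math.ceil(cores_needed / cores_per_node)
    by_storage = math.ceil(tb_needed / tb_per_node)
    return max(by_compute, by_storage)


# Hypothetical appliance and workload sizes (assumptions for this example).
CORES_PER_NODE, TB_PER_NODE = 32, 20
cores_needed, tb_needed = 64, 200  # a storage-heavy workload

nodes = hci_appliances_needed(cores_needed, tb_needed, CORES_PER_NODE, TB_PER_NODE)
stranded_cores = nodes * CORES_PER_NODE - cores_needed

print(f"appliances purchased: {nodes}")
print(f"cores purchased but not needed: {stranded_cores}")
```

Under these assumptions, storage drives the purchase of ten appliances even though two would cover the compute requirement, so most of the purchased cores sit idle.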
The 4th Wave of IT Infrastructure
In the first three waves of IT transformation, every IT project started with planning out the underlying infrastructure first. For example, to deploy a database, planning first started with procuring and configuring servers or VMs, networks, and storage. The chosen infrastructure components had to be planned to ensure they met the application’s current SLAs and anticipated growth. Only after all this infrastructure was planned and configured were apps brought online. But infrastructure exists to serve apps, not the other way around. Wouldn’t it be better to start an IT project at the application, by describing just its needs and letting the infrastructure self-assemble and configure itself to meet those needs (current and anticipated)?
We are entering the 4th wave of IT infrastructure innovation where apps will define the infrastructure that serves them. In this 4th wave both apps and people are liberated from the shackles of the specifics of the underlying IT infrastructure. The underlying infrastructure itself might change, from bare metal to VMs to private or public cloud, but the interaction with an application remains unchanged. This 4th wave is an era of Application Defined Infrastructure (ADI), where infrastructure becomes increasingly invisible, and simplicity is once again the ultimate sophistication. The drivers for this 4th wave of IT infrastructure are given below, starting with a discussion of a significant secular trend – containers.
Containers are a technology that packages an application and its dependencies in a manner that allows it to be reliably moved from one computing environment to another.
Figure 3: Containers vs. Virtualization with Hypervisor (Source: Docker)
Unlike VMs, which package an entire operating system along with the application, multiple containers running on a machine share the operating system. Each application running inside its own container continues to enjoy an isolation boundary that makes it appear like it is the only one running on that machine.
Containers became widely popular with Docker’s introduction of application containers in 2013. Docker is playing a leadership role in popularizing the concept of containers by making application packaging in the form of Docker images an industry standard.
Containers, specifically Docker, significantly simplify how apps are configured. They strike the right balance between merging configuration with the application payload (the container image) and, where applicable, separating the two. Containers operate at close to zero overhead because containers do not virtualize in the hypervisor sense of the word. Instead, they isolate apps (or portions of apps) from one another in secure partitions that run in a shared user space over one host operating system. This means apps run at close to bare metal speeds with little additional resource consumption.
However, containers by themselves are not sufficient as an IT infrastructure management paradigm because they are not “infrastructure aware.” Organizations are discovering that existing data center infrastructure is not capable of dealing with large numbers of containerized apps since a single modern microservices-based web application can easily span hundreds or more containers. Organizations run many apps and often find their systems administration teams overwhelmed attempting to match resources with containers.
Containers improve server utilization by allowing multiple apps to run on the same server. But since all apps share the same storage, storage performance can be erratic, which impacts overall application performance. To combat this, some organizations deploy critical apps on siloed infrastructure to ensure good performance, which leads to overprovisioned hardware and poor resource utilization. Cloud computing is evolving to address this. Cloud service providers have long offered Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS). The first wave of Containers-as-a-Service (CaaS) on bare metal is beginning to be developed by the largest Cloud service providers. The concept of “application state” is important to understand when discussing a CaaS offering.
Stateless and Stateful Apps
Understanding the concept of “app state” helps to understand the evolving requirements of the IT infrastructure that serves apps. App state is the data that application components need to perform their intended function. Apps may require configuration information, user credentials, user profile information, user history, and clickstream data. Data associated with apps can be stored in many different physical locations: in a local server cache, in a file system, in a database table, or in a storage resource. There are many elements that contribute to a full understanding of app state: app persistence requirements (e.g., uptime, restart requirements, data loss windows); configuration state; session state; and infrastructure state (e.g., network addresses, cluster state).
Stateless apps do not save client data generated in one session for use in the next session with that client. Each session is carried out as if it were the first time, and responses are not dependent upon data from a previous session. Protocols like HTTP are stateless because the web server does not remember any state across the page requests that it processes.
In contrast, stateful apps store data from previous sessions, limiting the data that needs to be stored on the client end and retaining information on the server from one use to the next. Applications that need to perform real-time work typically maintain some state locality to get very fast response times. Examples include content delivery networks, streaming media servers, identity management and authentication servers, and core transaction systems for payment processing. Many of the most important mission-critical applications need to preserve and manage state. Complex, distributed Big Data, NoSQL, and database apps are stateful and need to run both on premises and in the cloud. Simply attaching a storage volume to Docker is not sufficient to support stateful apps because that does not address performance predictability, app portability, high availability, or lifecycle management. Thus, there is a pressing need for cloud computing infrastructure to do a much better job of supporting mission-critical stateful apps.
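The distinction between stateless and stateful apps can be shown in a few lines. The following is a minimal sketch with hypothetical names (`stateless_greeting`, `StatefulGreeter`); it stands in for no particular product and keeps its state in memory only.

```python
def stateless_greeting(request):
    """Stateless: the response depends only on what is in this request."""
    return f"Hello, {request['user']}!"


class StatefulGreeter:
    """Stateful: the server remembers earlier sessions, so responses depend on history."""

    def __init__(self):
        self.visits = {}  # state that must survive restarts and migrations

    def greeting(self, request):
        user = request["user"]
        self.visits[user] = self.visits.get(user, 0) + 1
        return f"Hello, {user}! Visit number {self.visits[user]}."


request = {"user": "ada"}
greeter = StatefulGreeter()

print(stateless_greeting(request))
print(stateless_greeting(request))  # identical every time
print(greeter.greeting(request))
print(greeter.greeting(request))    # changes because state is retained
```

Any copy of the stateless function can serve any request, so it can be restarted or moved freely. The stateful version loses its `visits` history if the process is replaced, unless the infrastructure preserves and relocates that state along with the app.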
Putting applications into containers becomes interesting only when those containers can be efficiently deployed, managed, and scaled. Container orchestration engines perform the important function of managing clusters of containers, and are an important building block of the 4th wave of IT infrastructure, important enough to inspire a “container orchestration war.” Sample players include Docker’s Swarm, Red Hat’s OpenShift, Rancher’s Cattle, Mesosphere’s Marathon, AWS’ ECS, and CoreOS’ Fleet. Container orchestration is strategic, but it is only one component of a broader infrastructure management solution. The focus of these container orchestration engines is mainly on stateless cloud-native apps. There are efforts such as Kubernetes StatefulSet that hope to improve the ability to support open source databases like MySQL and PostgreSQL in containerized environments, but full infrastructure control is required to meet the service level agreements, quality-of-service, and high availability requirements of demanding stateful apps.
The 4th Wave – Application-Defined Infrastructure
ADI Requirements Summary
With the rise of containers, and the need to gracefully run both stateless and stateful applications on a shared multi-tenant infrastructure, the Application-Defined Infrastructure (ADI) is now required.
The ADI can be described as a container-based, application-aware compute and storage platform. The software efficiently abstracts underlying server, VM, network, and storage boundaries to produce a compute, storage, and data continuum. Many different containerized apps can run in this continuum without impacting one another’s performance.
Application portability and scalability are increased because compute and storage are decoupled; apps can be freely moved around the continuum without moving or copying data. Complex distributed apps like NoSQL, Hadoop, Cassandra, and Mongo can be deployed quickly and easily.
The ADI enables the intelligent provisioning of containers and storage based on individual application requirements as well as the topology of the environment, and it configures the application to make the best use of those components. The ADI ensures that all apps get sufficient compute, storage, and network resources to meet user-defined quality of service requirements; the result is predictable performance for all apps.
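One way to picture provisioning that is driven by application requirements and environment topology is a simple greedy placement routine. The sketch below is an assumption-laden illustration, not the algorithm of any actual ADI product: the `Node` and `ContainerRequest` types, the resource fields, and the "spread replicas across racks" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    free_cores: int
    free_iops: int


@dataclass
class ContainerRequest:
    name: str
    cores: int
    min_iops: int


def place(requests, nodes):
    """Greedy placement: pick a node with enough cores and IOPS,
    preferring racks and nodes this app is not already using."""
    placement, used_racks, used_nodes = {}, set(), set()
    for req in requests:
        candidates = [n for n in nodes
                      if n.free_cores >= req.cores and n.free_iops >= req.min_iops]
        if not candidates:
            raise RuntimeError(f"no node can satisfy {req.name}")
        # Stable sort: unused racks first, then unused nodes.
        candidates.sort(key=lambda n: (n.rack in used_racks, n.name in used_nodes))
        node = candidates[0]
        node.free_cores -= req.cores
        node.free_iops -= req.min_iops
        used_racks.add(node.rack)
        used_nodes.add(node.name)
        placement[req.name] = node.name
    return placement


nodes = [
    Node("n1", "rack-a", free_cores=16, free_iops=20_000),
    Node("n2", "rack-a", free_cores=16, free_iops=20_000),
    Node("n3", "rack-b", free_cores=16, free_iops=20_000),
]
# A hypothetical three-replica database; each replica needs 8 cores and 10k IOPS.
replicas = [ContainerRequest(f"db-{i}", cores=8, min_iops=10_000) for i in range(3)]
placement = place(replicas, nodes)
print(placement)
```

The application states only what each replica needs; the routine decides where it runs and keeps replicas apart so that a single rack or node failure does not take out the whole app.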
The ADI should provide the ability to automatically recover failed nodes and disks, and seamlessly move workloads between servers. As a result, hardware can be used more efficiently, and less hardware is required in reserve for inevitable performance spikes.
Robin Systems ADI Description
Robin Systems Application-Defined Infrastructure (ADI) is a 4th generation IT infrastructure solution designed specifically to meet the fast-moving emerging requirements that are not well served by bare metal, virtualization with hypervisors, or HCI. Robin ADI is enabled by containers but goes well beyond container orchestration or conventional notions of container-based IT infrastructure.
Figure 4: Robin’s ADI Functional Block Diagram
Robin’s ADI is a container-based, application-aware compute and storage platform. The software efficiently abstracts underlying server, VM, network, and storage boundaries to produce a compute, storage, and data continuum. Many different containerized apps can run in this continuum without impacting one another’s performance. Application portability and scalability are increased because compute and storage are decoupled; apps can be freely moved around the continuum without moving or copying data. Complex distributed apps like NoSQL, Hadoop, Cassandra, and Mongo can be deployed quickly and easily. Robin’s application-level snapshots that include the entire application environment make it easy to quickly create a copy of a production environment without impacting production performance and to quickly roll back to a previous point in time to correct a problem.
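Snapshots, rollback, and thin clones are commonly built on the idea that copies share unchanged blocks and diverge only on write. The toy `Volume` class below sketches that idea for the storage layer alone; it is a hypothetical illustration, and it does not model the OS, application configuration, or topology that an application-level snapshot would also capture.

```python
class Volume:
    """Toy block volume whose snapshots and clones share unchanged blocks."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})  # block number -> data
        self.snapshots = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def snapshot(self, name):
        # Copies only the block map (references), not the block contents.
        self.snapshots[name] = dict(self.blocks)

    def rollback(self, name):
        self.blocks = dict(self.snapshots[name])

    def thin_clone(self, name):
        # The clone starts from the snapshot's block map and diverges on write.
        return Volume(self.snapshots[name])


prod = Volume()
prod.write(0, "schema-v1")
prod.write(1, "rows-v1")
prod.snapshot("before-upgrade")

prod.write(0, "schema-v2-broken")          # an upgrade goes wrong
test = prod.thin_clone("before-upgrade")   # copy of production for testing
test.write(1, "rows-test")                 # does not touch production

prod.rollback("before-upgrade")
print(prod.blocks)
print(test.blocks)
```

Because a snapshot or clone copies only the block map, creating one is cheap regardless of how much data the volume holds.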
The key to Robin’s technology is its Application-Aware Fabric Controller, which serves as the management layer for all application deployment and data movement. It controls and manages two primary assets, the Application-Aware Compute Plane and the Application-Aware Storage Plane, which virtualize the compute and storage separately, eliminating silos of capacity. The controller enables the intelligent provisioning of containers and storage based on individual application requirements as well as the topology of the environment, and it configures the application to make the best use of those components. The result is one-click deployments of complex, multi-container apps like Cassandra and Hadoop. Further, the Fabric Controller ensures that all apps get sufficient compute, storage, and network resources to meet user-defined quality of service requirements; the result is predictable performance for all apps. Since Robin controls the entire I/O path, it manages the priorities of read/write requests and so can provide guaranteed minimum and maximum levels of IOPS to ensure that application performance is maintained, despite any noisy neighbors.
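The idea of guaranteed minimum and maximum IOPS per application can be sketched as a static allocation problem. The function below is a simplified assumption, not Robin's scheduler: it divides a fixed device capacity among apps rather than prioritizing individual read/write requests, and the app names and numbers are hypothetical.

```python
def allocate_iops(capacity, apps):
    """Give each app its guaranteed minimum, then share the rest up to each app's maximum.

    `apps` maps an app name to a (min_iops, max_iops) pair.
    """
    total_min = sum(lo for lo, _ in apps.values())
    if total_min > capacity:
        raise ValueError("guaranteed minimums exceed device capacity")

    alloc = {name: lo for name, (lo, _) in apps.items()}
    spare = capacity - total_min
    # Hand out spare capacity in equal shares; apps that reach their cap drop out.
    while spare > 0:
        open_apps = [n for n in apps if alloc[n] < apps[n][1]]
        if not open_apps:
            break
        share = max(spare // len(open_apps), 1)
        for name in open_apps:
            grant = min(share, apps[name][1] - alloc[name], spare)
            alloc[name] += grant
            spare -= grant
    return alloc


apps = {
    "cassandra": (30_000, 60_000),
    "hadoop": (10_000, 100_000),
    "noisy-batch": (0, 20_000),   # capped so it cannot crowd out the others
}
alloc = allocate_iops(100_000, apps)
print(alloc)
```

The minimum protects an app from its neighbors, and the maximum keeps any one app from becoming the noisy neighbor in the first place.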
Robin Systems also offers data lifecycle management through one-click granular snapshots and thin clones, which can be created in seconds. Unlike other implementations of clones and snapshots, Robin clones the entire application environment: the storage, the OS, the application configuration, and the topology. If something goes wrong in an application, the data can be rolled back to a known-working snapshot in seconds. To ensure high levels of availability, the Fabric Controller monitors the infrastructure and automatically recovers failed nodes and disks. With Robin’s ability to seamlessly move workloads between servers, hardware can be used more efficiently, and less hardware is required in reserve for inevitable performance spikes. Robin’s ADI technology innovations include:
- 100% container-based. There is no hypervisor-based virtualization, so the performance, memory, and provisioning-time overhead of instantiating a full guest operating system in each VM is eliminated.
- Support of current and future distributed computing apps and analytics frameworks.
- Agnostic to underlying hardware components, with de-coupling of compute and storage to allow these resources, which are on different innovation curves, to be scaled independently. In HCI the compute and storage are tightly coupled at the time of purchase and are guaranteed to be wrong as application requirements evolve. HCI use cases where data and storage locality in the same appliance as compute is an advantage can be fully supported by Robin’s ADI solution, but it is also possible to optimize compute nodes and storage nodes differently to meet the very different needs of compute-intensive and data-heavy workloads.
- Robust Quality-of-Service and traffic management mechanisms to enable Service Level Agreements (SLAs) for each application that is concurrently running across the ADI. This is delivered via application-to-spindle infrastructure control requiring deep storage tier innovations that are not possible from any container management/orchestration framework alone. Just as networking evolved with strong packet-level quality of service (QoS) to enable Voice over IP, ADI is evolving IT infrastructure with strong QoS to enable massively scalable stateless and stateful apps to be gracefully supported on a unified platform.
- A purpose-built storage stack that is application-aware. This enables several application-centric lifecycle management workflows that would otherwise be relegated to plain old storage volume management. This is hard to bolt on to off the shelf storage solutions. Note that no storage can be application-aware by itself. Robin’s ADI solution has a storage stack with programmable primitives and application-to-spindle integration that configures it to make it app-aware. This end-to-end and top-to-bottom infrastructure goes well beyond what container orchestration engines achieve when combined with third party commodity storage.
- Very high storage stack performance, proven in third-party testing and customer environments to perform at the same level as a high-end EMC VMAX array at a fraction of the cost using all commodity components. Robin’s ADI solution storage stack is proven to provide higher performance benchmark results than alternative storage stacks such as Ceph, Gluster, Nutanix, Rubrik, and Cohesity.
- An application-aware orchestration fabric which has application-centric primitives built in to enable a PaaS experience for a broad and fast-growing library of data-driven apps.
- The industry’s first ADI that provides PaaS for both stateless and stateful data-driven apps. CloudFoundry, Kubernetes, and Mesos do not support a comparable breadth of Big Data, NoSQL, and ACID-compliant relational database apps.
- Ease-of-use for leading open source distributed computing apps. A broad and fast-growing library of strategic open source apps is supported, including MongoDB, Couchbase, Hortonworks, Elasticsearch, ELK (ElasticSearch + Logstash + Kibana), Solr, Oracle, Cassandra, Hadoop, Cloudera, Redis, Spark, VoltDB, MariaDB, MySQL, PostgreSQL, and Kafka.
- Extensibility as a stable infrastructure management substrate to support a growing library of apps. Third party partners and customers have already independently built app manifests for popular third-party software packages such as Splunk.
- Application support with single-click deployment and an app store-like experience with full application lifecycle management. An application deployment profile/template called a “bundle” is customizable. Extreme ease of use elevates every single complex operation to a 1-click experience. This is a conscious design philosophy at Robin that is rigorously enforced as new functionality is added to the product.
- Performance that is very close to bare metal. Third party benchmark results from the Enterprise Strategy Group while running seven different executions of the Yahoo Cloud Serving Benchmark (YCSB) for a multi-node Cassandra database:
Figure 5: ESG Benchmark Results of Gen 1 Bare Metal, Gen 2 VM, and Gen 4 Robin’s ADI
Robin’s ADI solution is deployed and proven at significant scale in mission-critical production environments at Fortune 50 companies. A notable deployment is concurrently running a Cassandra virtual cluster, Hadoop virtual cluster, an ELK virtual cluster, a complex analytics pipeline, and multiple ACID-compliant relational databases on one shared infrastructure built from commodity hardware. Each of these complex applications was instantiated with an Apple app store-like click on an icon.
Robin’s ADI and Kubernetes (K8S) Comparison
Kubernetes (K8S) was discussed previously and is a container orchestration framework that targets stateless apps. Kubernetes is an important emerging technology with strengths for stateless container management, for example service discovery and load balancing. These are less relevant in the stateful world and/or the class of modern data stack apps that Robin’s ADI serves. Kubernetes was developed at Google, where the approach to addressing contention for shared resources was to aggressively over-build cloud hardware resources. This approach of throwing hardware and CS PhDs at the problem seems unlikely to meet the needs of most private and hybrid clouds moving forward.
In contrast, Robin’s ADI is a Big Data, NoSQL, Database deployment and application life cycle management framework that targets complex distributed and legacy apps. Robin’s ADI has brought rock solid QoS and SLAs to each container operating in a warehouse-scale computing facility.
Robin’s ADI has specific strengths in 5 key areas when compared to Kubernetes (K8S):
- Container Management: Robin has done extensive work in virtualizing Linux namespaces and cgroups – without this, one can’t seriously consider containerizing complex apps like Cloudera, DB2, SAP HANA, etc. which Robin’s ADI supports out of the box. Robin has done extensive work in working around Docker’s limitation of managing state preserved to the container’s rootfs. Because of this, they are uniquely positioned to seamlessly migrate containers of complex apps from one physical host to another (for High Availability). Robin has extended container configuration management via app-specific “hooks,” a concept that is lacking in Kubernetes. Because of this Robin’s ADI can put together bundles of complex stateful apps quickly.
- Network Management: Kubernetes uses overlay networking. This makes IP address management tricky for complex distributed apps. For example, when a container fails over from one host to another in Kubernetes, it is given a new IP address. This breaks the network topology view of a distributed app. There are complicated, app-specific workarounds for this limitation. Because Kubernetes focuses on stateless apps, which do not suffer from this problem, Kubernetes does not address it. Robin uses bridge networking (higher performance than overlay networking) and manages IP address binding to containers as they move around from one physical host to another. This is a big plus for Robin’s ADI and a key enabler for a broad range of capabilities such as massive infrastructure, app, and container scalability.
- Application portability: Robin’s ADI allows full application and data portability within clouds and across clouds. This capability to take an active cluster of containers executing in run time and move it from one host to another without service interruption is novel. For example, suppose you have started a job, let it run for 3 days, and realized that it is running too slowly. Because Robin’s ADI provides infrastructure control from the application to the storage spindle, you can snapshot the entire run-time application and data, migrate all components of the job to a larger cluster, and re-start the job where you left off with no loss of work done so far. With Kubernetes (K8S), migrating an application running on a container cluster mid-job to a larger cluster requires stopping the job and re-starting it from scratch.
- User Experience: Kubernetes is widely considered complex to install and maintain. Robin’s ADI installs in minutes and has an Apple-like user experience, where every operation has been simplified to a 1-click experience. This is enabled by fine-grained application-to-storage infrastructure control.
- Storage Management: Discussed in detail above.
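The network management point above, keeping a container's IP address fixed as it moves between hosts, can be sketched as an address manager that binds addresses to containers rather than to hosts. The `StickyIpam` class, the container name, and the address pool are hypothetical; this is a minimal illustration of the idea, not Robin's implementation.

```python
class StickyIpam:
    """Toy IP address manager that binds an address to a container, not to a host."""

    def __init__(self, pool):
        self.free = list(pool)
        self.ip_of = {}    # container -> IP address
        self.host_of = {}  # container -> current host

    def start(self, container, host):
        if container not in self.ip_of:
            self.ip_of[container] = self.free.pop(0)
        self.host_of[container] = host
        return self.ip_of[container]

    def fail_over(self, container, new_host):
        # The container moves; its address binding is left untouched.
        return self.start(container, new_host)


ipam = StickyIpam(["10.0.0.10", "10.0.0.11"])
before = ipam.start("cassandra-1", "host-a")
after = ipam.fail_over("cassandra-1", "host-b")

print(before, after, ipam.host_of["cassandra-1"])
```

Peers in a distributed app that recorded the address of `cassandra-1` before the failover can keep using it afterward, so the app's view of its own topology stays valid.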
Robin’s ADI Solution Enables the 4th Wave of Infrastructure Innovation
A key challenge facing the IT community today is to enable many distributed computing apps to peacefully co-exist on a highly efficient shared infrastructure, where each application can be gracefully scaled in near-real time as workloads demand without damaging the performance of others.
The traditional notion of business continuity and disaster recovery in today’s private, public and hybrid clouds is centered on data portability. The requirement moving forward is not only data portability, but full application and data portability across these environments. This means not just movement of an application executable, but movement of the actual run-time application and associated data while in use.
Infrastructure tends to get added to and extended rather than getting fully replaced. Thus, it will take many years for the 2nd and 3rd waves of IT infrastructure based on virtualization and HCI to be replaced with what comes next. However, the transition to the 4th wave of IT infrastructure has begun.
Robin Systems has built the first true ADI solution. It is a turning point in the way that both stateless and stateful apps are run and gracefully scaled on a stable, high performance, and highly cost-effective shared infrastructure built completely from cost-effective commodity hardware. There is no need to hide the complexity or drive down the cost of the hypervisor found at the foundation of other approaches because it is eliminated completely from the cloud management stack.
With Robin, you can start an IT project at the application, by describing just its needs and letting the infrastructure self-assemble and configure itself to meet those needs. Robin’s ADI solution contains significant innovations in the storage tier. Finally, Robin’s ADI enables entire active clusters built from containers to be migrated from one host OS to another and scaled up or down with no interruption of service. This gives applications and data a new level of cloud independence, as they can be migrated between private, public, and hybrid clouds in a seamless way. The unique insight and breakthrough innovation is application-to-spindle QoS and robust SLAs on a per-container-cluster basis. This requires innovations that span the compute, storage, and network tiers, all working together. This application-centric workflow management drives a new level of simplicity, the ultimate sophistication, for both cloud-native and legacy applications.
Christopher Rust – Clear Ventures Founder & General Partner, Robin Systems Series A Investor and Advisor