Optimizing Today’s Data Centers: Metrics that Matter


Dave Wagner
OpsDataStore

Dave Wagner is Co-founder and Chief Technology Officer for OpsDataStore.
Chuck Rego is Chief Architect for Intel.

Over the last decade, huge growth in demand for Internet and mobile services has driven rapid transformation in digital businesses.  This growth has been highly disruptive, and it has created new business opportunities and challenged the status quo.  In the data center, two forces have created much of this change:  the evolution of virtualization and the rise of cloud computing.

Latest-generation computing hardware and software platform technologies, including unified computing, pervasive virtualization, containerization, new rack designs, disaggregation of compute resources, and improved telemetry and analytics, have all contributed to a lower total cost of ownership (TCO) and a greater return on investment (ROI).  This has set the stage for agile infrastructure and a further explosion in the number and type of instrumentation metrics available to today’s data center managers.

Optimization, as applied to data centers, means always having the right amount of resources to cost-effectively enable the business use of those data centers. Right resourcing means, in effect, enough to get the data center “job” done, but not so much as to waste money. That applies to everything from power and floor space to “computes.” Easily said, but increasingly challenging to accomplish.

It used to be that one would optimize any given data center resource by measuring its utilization (for example, how busy a CPU is) and then make a considered determination of what level was sufficiently busy to warrant an upgrade or extension, or sufficiently idle to warrant consolidation. This approach was used, and useful, for everything from CPUs, memory and other server metrics, to things like power consumption, where metrics like PUE (power usage effectiveness) were created and applied. However, these types of optimizations were always done in domain isolation – silos in effect.
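To make the traditional approach concrete, here is a minimal sketch in Python. PUE is conventionally defined as total facility energy divided by IT equipment energy. The function names and the 80%/20% utilization cut-offs are assumptions chosen for illustration, not values from any standard.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def classify_utilization(avg_util: float, high: float = 0.80, low: float = 0.20) -> str:
    """Silo-style threshold rule for a single resource.

    avg_util is a fraction between 0 and 1. The default cut-offs are
    illustrative assumptions; real thresholds are a judgment call.
    """
    if avg_util >= high:
        return "extend"        # busy enough to upgrade or add capacity
    if avg_util <= low:
        return "consolidate"   # idle enough to fold into another host
    return "leave as is"
```

A facility drawing 1,500 kWh overall to run 1,000 kWh of IT load has a PUE of 1.5. Note that each function looks at one resource in isolation, which is exactly the silo limitation described above.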

On the software side, pervasive virtualization, containerization and software automation have completely changed the measurement landscape. The increasingly rich metrics embedded in server chipsets open exciting new possibilities.

Metric That Matters #1: Transaction Throughput and Response Time

Because high levels of virtualization abstract the real hardware resources (CPU, memory, storage, network) from the workloads using them, and those resources are ever more dynamically allocated to the workloads running upon them, the measurement of specific resource utilization is becoming increasingly unrelated to how those workloads actually perform. This has driven an explosion in the adoption of Application Performance Management (APM) solutions that measure how much work applications are getting done and how quickly they are responding. This, at the end of the day, is one of the key metrics of today’s data centers: how much business work is being accomplished, and how responsively.  Where such metrics are simply not available, reasonable proxies need to be found; typically anything that can measure “waiting” in and across the environment.
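As a rough illustration of these two measures, the sketch below derives throughput and response time from a list of transaction records. The `Transaction` record, its millisecond timestamps, and the nearest-rank percentile are assumptions made for the example; an actual APM product would collect and aggregate this data in its own way.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    start_ms: int  # when the request arrived (hypothetical field)
    end_ms: int    # when the response was sent (hypothetical field)


def throughput_per_second(txns: List[Transaction], window_s: float) -> float:
    """Completed transactions per second over an observation window."""
    return len(txns) / window_s


def response_times_ms(txns: List[Transaction]) -> List[int]:
    """Elapsed time of each transaction in milliseconds."""
    return [t.end_ms - t.start_ms for t in txns]


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


txns = [
    Transaction(0, 100),
    Transaction(1000, 1300),
    Transaction(2000, 2200),
    Transaction(3000, 4000),
]
times = response_times_ms(txns)
print(throughput_per_second(txns, window_s=10.0), percentile(times, 95))
```

Reporting a high percentile alongside throughput matters because an average can hide the slow transactions that users actually notice.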

Metric That Matters #2: Cost – Especially OpEx

In a perfect world, one would be able to completely cost-optimize the entirety of the infrastructure used to deliver applications and services in an acceptably performant fashion. In other words, if each and every transaction is simultaneously cost AND performance optimized and, importantly, can be kept in that state, then one has reached the pinnacle of data center optimization.

This implies a need for a good basis for OpEx costing: a mechanism by which to allocate a cost measure to transactions.  Again, because of the dynamic complexity of the “map” from transactions to the underlying supporting infrastructure, some type of management framework is clearly required to accomplish this allocation accurately. To be useful, system management frameworks need to be as dynamic and scalable as their underlying systems.  Management tools built on industry standards and open source reference implementations provide powerful automation solutions that increase performance, boost efficiency, and speed time to market.  New power and cooling management and metrics provide monitoring, optimization and prediction capabilities, all leading to reduced cost of ownership.
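One simple way to picture such an allocation is to split a period’s OpEx across services in proportion to the resources each consumed, then divide by the transactions each served. The sketch below assumes a single usage measure (for example CPU-hours) and made-up service names; it is an illustration of proportional allocation, not a description of any particular management framework.

```python
from typing import Dict, Optional


def cost_per_transaction(
    opex_dollars: float,
    usage_by_service: Dict[str, float],
    txns_by_service: Dict[str, int],
) -> Dict[str, Optional[float]]:
    """Allocate OpEx proportionally to usage, then divide by transaction count.

    usage_by_service holds one hypothetical usage measure per service.
    Services with no recorded transactions get None rather than a cost.
    """
    total_usage = sum(usage_by_service.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")

    result: Dict[str, Optional[float]] = {}
    for service, usage in usage_by_service.items():
        allocated = opex_dollars * usage / total_usage
        txns = txns_by_service.get(service, 0)
        result[service] = allocated / txns if txns else None
    return result


costs = cost_per_transaction(
    opex_dollars=1000.0,
    usage_by_service={"checkout": 30.0, "search": 10.0},
    txns_by_service={"checkout": 6000, "search": 4000},
)
print(costs)
```

In a dynamic environment the usage figures themselves shift constantly as workloads move, which is why the mapping has to be maintained by tooling rather than by hand.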

Research now suggests that in many real-world applications, storage performance is the main limiter of application performance.  New rack designs enable large gains in storage performance and cost efficiency without imposing a significant cost burden.  Like compute and network resources, storage resources are now increasingly software defined.

Over the last few years, quantum leaps in technology and telemetry availability have provided excellent proxies for determining costs.  Metrics such as dollar-per-watt, power-per-transaction, dollar-per-VM, performance-per-watt, and many other performance metrics provide detail on the actual operational expense of running the data center.
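These ratios are straightforward to derive once the underlying telemetry is in hand. The sketch below uses one plausible set of definitions, which are assumptions: “power-per-transaction” is read as energy (joules) per transaction, and “performance” is read as transactions per second. Other definitions are equally defensible.

```python
from typing import Dict


def efficiency_metrics(
    opex_dollars: float,
    avg_power_watts: float,
    transactions: int,
    vm_count: int,
    period_s: float,
) -> Dict[str, float]:
    """Derive cost and efficiency ratios for one measurement period.

    All inputs are assumed to cover the same period and must be positive.
    """
    if min(avg_power_watts, transactions, vm_count, period_s) <= 0:
        raise ValueError("power, transactions, VM count and period must be positive")

    throughput = transactions / period_s  # transactions per second
    return {
        "dollar_per_watt": opex_dollars / avg_power_watts,
        "joules_per_transaction": avg_power_watts * period_s / transactions,
        "dollar_per_vm": opex_dollars / vm_count,
        "tps_per_watt": throughput / avg_power_watts,
    }


metrics = efficiency_metrics(
    opex_dollars=7200.0,
    avg_power_watts=2000.0,
    transactions=3_600_000,
    vm_count=40,
    period_s=3600.0,
)
print(metrics)
```

Tracked over time, ratios like these show whether a change in the environment improved cost, performance, or both.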

Summary

The closer that data centers can get to complete instrumentation of important metrics like transactional response time and throughput, the lower the overall associated costs will be to successfully deliver those business services, and the more efficient such data centers can become.
