The Data Center Perfect Storm is my term for summing up different factors that are combining in a way that will require data centers to make significant (maybe radical) changes to handle the onslaught of greater (and different) demands, higher expectations, and constrained resources.
In this section, we're going to lay out what we see happening and the associated impact on data centers.
Rapidly Rising Compute Demand
- Natural increase: driven by economic/organizational growth, but also by the increased instrumentation of, well, everything. Nearly everything that goes into modern business today is tracked, stored, analyzed, and used to make decisions. As we know, the world is increasingly driven by digital technologies, meaning more users generating more data that will be typically stored/manipulated somewhere other than on their local device.
- The Rush To AI: This is by far the biggest driver of growth and will remain so for many years. Embracing AI is seen as THE strategic imperative today. It's hard to name a segment (government, private sector, academic, consumer) that isn't either actively implementing or considering some sort of AI initiative. Whether it's machine learning, generative AI, or a large language model, the rush is real and it has major consequences for data centers.
- AI processing is a different animal for many data centers. It has more in common with HPC/scientific computing than it does with typical transactional or database processing. Training AI models is very compute intensive, running vast amounts of data through complex and multi-layered models in order to make the model useful for finding unique data relationships, supporting decisions, recognizing natural language, or generating new content.
- Models are retrained to refine them further, increase accuracy, and account for new information. You can train an AI model on pretty much any computer - even a laptop. But this assumes you have literally years (decades?) of time to devote to processing and getting results. To train a model in a reasonable time frame, you need fast systems with accelerators (like GPUs), lots of memory, fast networks, and fast I/O - essentially the same systems that are used in scientific computing. (There's a rough back-of-the-envelope comparison after this list.)
- After a model is trained, it is put into production for users. The process they use to get usable predictions, conclusions, or anything out of the model is called inferencing. This can also be compute intensive, depending on the size/complexity of the model and the size/complexity of the queries it has to handle. Time to answer in inferencing is also an important factor. Users need their answers as quickly as possible (if not quicker!), which means the inferencing systems also need to be configured for fast performance along with varying levels of demand.
- AI isn't going to go away, but the terms will change as AI mechanisms are folded into nearly every type of software. This means that using those new "AI-enabled" versions will require the same hardware capabilities we've outlined above.
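To put the laptop-versus-accelerated-server point above in perspective, here's a minimal back-of-the-envelope sketch. The model size, token count, and sustained throughput figures are assumptions chosen for illustration, and the 6 x parameters x tokens rule is a common approximation of training cost, not an exact model:

```python
# Rough training-time estimate: laptop vs. an 8-GPU server.
# All figures below are illustrative assumptions, not benchmarks.

SECONDS_PER_DAY = 86_400

def training_days(params: float, tokens: float, sustained_flops: float) -> float:
    """Estimate wall-clock days using the ~6 * params * tokens FLOPs heuristic."""
    total_flops = 6 * params * tokens
    return total_flops / sustained_flops / SECONDS_PER_DAY

params = 7e9       # a modest 7-billion-parameter model (assumed)
tokens = 1e12      # 1 trillion training tokens (assumed)

laptop = 100e9            # ~100 GFLOP/s sustained on a laptop CPU (assumed)
gpu_server = 8 * 400e12   # 8 accelerators x ~400 TFLOP/s sustained each (assumed)

print(f"Laptop:     {training_days(params, tokens, laptop) / 365:,.0f} years")
print(f"GPU server: {training_days(params, tokens, gpu_server):,.0f} days")
```

Even with generous assumptions for the laptop, the gap is measured in thousands of years versus months, which is why accelerated, HPC-style systems are the only practical option.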
Much Higher Electrical Demand
Computer components are really good at converting electricity to heat. They are so efficient at doing this that the term TDP (Thermal Design Power, expressed as the maximum heat an item can generate in watts) is typically used as the measure of how much electricity a component can use at its upper limit. Actual power consumption of, say, a CPU, will vary with workload, but for reliability purposes, the system design point is the maximum potential draw, which makes sense, right? Your infrastructure should also be designed to handle the maximum power draw of your IT devices if they all hit peak load at the same time. Improbable? Yes. But a good idea? Also yes.
Some trends:
- From 2007 to 2014, the average CPU TDP was 98.78 watts. From 2014 to 2023, the average increased to 200 watts. Not so bad, eh? But it's the rate that's concerning: from 2019 to 2023, CPU TDP increased at a 17% CAGR (there's a quick sanity check on that growth rate after this list). Today, the average AMD 9000 series server CPU has a TDP of 320 watts. In 2025, the top CPUs from Intel will hit 330 watts and 500 watts for their Sierra Forest and Granite Rapids CPUs respectively. AMD will introduce Zen 5 processors that hit 500 watts.
- GPUs started coming into wider use around 2010, and from then until 2017 or so their average TDP was around 300 watts. The top Nvidia GPUs now have TDPs of 700 watts, and the upcoming B200 will hit 1,000 watts. In 2025-26, Intel's Gaudi 3 will carry a TDP of 1,200 watts and their Falcon Shores GPU (probably 2026) will probably hit 1,500 watts. Finally, AMD's MI325X GPU, available in 2025, will push over 1,000 watts TDP.
- Other server components also generate heat, including memory, NICs, local storage, and even PCIe ports. In fact, Intel is finding that they're forced to develop mechanisms to throttle the speed of PCIe 6 slots due to excessive/damaging heat when they're running flat out.
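If you want to check or extend that growth-rate math yourself, the compound annual growth rate calculation is simple enough to script. The 2019 starting figure below is an assumed value for illustration, not a sourced number:

```python
# Quick sanity check on the CPU TDP growth rate. The 2019 figure is an
# assumption for illustration; the 2023 figure is the average cited above.

def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1 / years) - 1

tdp_2019 = 107.0   # assumed average server CPU TDP in 2019 (watts)
tdp_2023 = 200.0   # average server CPU TDP in 2023 (watts)

rate = cagr(tdp_2019, tdp_2023, years=4)
print(f"Implied CAGR: {rate:.1%}")                                # ~17%
print(f"Same rate through 2027: {tdp_2023 * (1 + rate)**4:.0f} W")
```

Compounding is the point: hold a 17% rate for just four more years and the average roughly doubles again.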
The average 42U server rack today is probably generating something like 15-18kW of heat and is air cooled. What kind of TDP would you be looking at for a new rack today? Here are some scenarios (with a small calculator sketch after them):
CPU server only: Assuming 2U form factor servers, each equipped with dual AMD EPYC 9000 series CPUs at a total TDP of 586 watts (roughly 2x the average TDP of the AMD 9000 product line). Add roughly another 140 watts per server for memory (32 sticks of DDR5), another 10 watts for local storage (assuming a couple of NVMe drives), and a NIC at about 15 watts. Total server TDP comes to about 752 watts; total rack TDP = 752 x 21, or about 16kW - not too bad, right in the middle of the average today. (However, CPU TDPs are heading to 500 watts in the next few years - at least for the highest performing SKUs. Lower performance SKUs will be increasing as well.)
Dipping your toes in AI (experimental training, small models/parameter count, experimental inferencing): Same base server as above, but with four 4U nodes each carrying four Nvidia H100 (SXM5, NVLink-connected) GPUs, plus 13 CPU-only 2U servers to fill out the rack. So we have a base server TDP of 17 x 752 watts, plus 16 x 700 watts for the GPUs. Total TDP for the rack is about 24kW, which is 50% higher than the average rack today.
More AI capabilities (light training, small models/parameter count, light inference): Same server mix, but each of the four GPU nodes now carries eight GPUs. Total TDP is about 35kW.
Production: But this won't be enough training or even inferencing compute power for a production model. While there isn't much definitive information on what you'll need to train a moderately sized model or to run production inference, I would (conservatively) estimate that a typical production AI rack will contain six GPU nodes, each with eight GPUs, and have a total TDP approaching 50kW - about 210% higher than the average rack today. And you'll need way more than just one rack with this type of configuration. In fact, it's not hard to argue that every rack will be configured like this in the near-ish future.
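To make that rack arithmetic easy to rerun with your own component counts, here's a minimal calculator sketch. The wattages are the illustrative figures used in the scenarios above (including the simplifying assumption that a GPU node's base power matches a CPU-only server's), not vendor specifications:

```python
# Rough rack TDP calculator for the scenarios above. Wattages are the
# illustrative figures from the text, not vendor-published specs.

CPU_SERVER_W = 752   # 2U dual-socket server: CPUs + memory + storage + NIC
GPU_W = 700          # one H100-class (SXM) accelerator
RACK_UNITS = 42

def rack_tdp_kw(gpu_nodes: int, gpus_per_node: int,
                gpu_node_u: int = 4, cpu_server_u: int = 2) -> float:
    """Fill a 42U rack with GPU nodes, then pack the rest with 2U CPU-only servers."""
    cpu_servers = (RACK_UNITS - gpu_nodes * gpu_node_u) // cpu_server_u
    base_w = (gpu_nodes + cpu_servers) * CPU_SERVER_W   # same assumed base power per node
    gpu_w = gpu_nodes * gpus_per_node * GPU_W
    return (base_w + gpu_w) / 1000

print(f"CPU only:          {rack_tdp_kw(0, 0):.1f} kW")   # ~16 kW
print(f"Dipping your toes: {rack_tdp_kw(4, 4):.1f} kW")   # ~24 kW
print(f"More AI:           {rack_tdp_kw(4, 8):.1f} kW")   # ~35 kW
print(f"Production:        {rack_tdp_kw(6, 8):.1f} kW")   # ~45 kW
```

Swap in the TDPs from your own quotes (and the 500-1,000+ watt parts coming over the next few years) and you can see how quickly a rack blows past what most air-cooled rooms were designed for.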
Yet another problem for many data centers will be getting access to enough electrical supply to cover their greatly increased need.
In our Infrastructure Efficiency list (LINK), we have several sections that hit on getting the most performance out of your electricity. In supercomputing, power-efficient compute is the most important goal - and there is a lot to learn from the technologies, techniques, and practices that their maniacal drive for more FLOP/s per watt has spawned.
Radically Increased Heat Load
As noted above, with computing, electricity going in pretty much equals heat coming out. Up until now, the heat in most data centers could be removed with air cooling alone, but no longer. The era of air cooling is over. Let that sit for a minute. It's true.
Liquid is much more efficient at capturing and transporting heat than air. We all know this through experience: jumping into a 60F lake will cool you much faster and more thoroughly than standing in front of an air conditioner blowing 60F air at you, right?
Adding liquid cooling to your IT infrastructure will cost money, absolutely. It's a significant chunk of change upfront, but over time your operating costs for electricity and maintenance will be lower. Why? Very briefly...
Consider what you're doing: With air cooling, you're cooling the entire cubic footage of your data center so that cool air can be pulled through your systems to keep the components cool. With liquid cooling, you're only cooling the components that actually generate heat - square inches of material per server with direct liquid cooling (DLC), or entire racks of servers with immersion cooling.
With liquid cooling, you directly remove 80-95% of the heat generated by your servers and take it out of your data center using pipes and pumps, rather than large air handlers. This dramatically lowers the heat load in the entire data center, meaning that air conditioners (yeah, you'll still need them for human comfort) will run at a much lower level and use significantly less power.
You can also take advantage of natural cooling, which will reduce costs associated with using chillers.
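One rough way to see why the operating costs drop is to compare facility power for the same IT load under different PUE (power usage effectiveness) figures. The 1.6 and 1.15 values below are illustrative assumptions, not measurements from any particular facility:

```python
# Rough facility-power comparison for the same IT load under air vs. liquid
# cooling. The PUE values are illustrative assumptions, not measurements.

def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility draw = IT load x PUE (PUE folds in cooling, fans, losses)."""
    return it_load_kw * pue

it_load_kw = 20 * 45          # twenty racks at ~45 kW each (assumed)
air = facility_power_kw(it_load_kw, pue=1.6)      # assumed air-cooled facility
liquid = facility_power_kw(it_load_kw, pue=1.15)  # assumed liquid-cooled facility

savings = air - liquid
print(f"Air cooled:    {air:,.0f} kW")
print(f"Liquid cooled: {liquid:,.0f} kW")
print(f"Savings:       {savings:,.0f} kW ({savings / air:.0%} of the air-cooled draw)")
```

At utility rates, a few hundred kilowatts of continuous savings adds up to a meaningful line item every year, which is how the upfront plumbing pays for itself.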
We have a Cooling category in our Infrastructure Efficiency (LINK) list that explains various cooling technologies and approaches, plus shows you who provides what.
Rising Costs
Every category of IT is striving to provide more performance per watt and per dollar. But the easy gains from process shrinks and/or cranking up clock frequencies are a thing of the past. The performance you're getting from new gear is better on nearly every measure, but it's not getting any less expensive - and it won't, because the physics are just getting too difficult to handle given the materials we have today. Plus the move to AI and our future of near-universal AI-enabled applications means that you will need some level of accelerated computing (with GPUs or other accelerators) in your data center. These are not inexpensive items, as you probably know. The cost of accelerators completely swamps the cost of a server, and you'll need plenty of accelerators to handle training and inferencing workloads.
Some of you are probably thinking, "hey, I'll just go to the public cloud and sidestep all of these problems, let them deal with it..." In the short term, this is a viable strategy, but long term? Not so much, unless your IT needs are small and you're very flexible when it comes to getting accelerator-centric jobs done.
Public clouds are computing hotels, and best utilized for bursting to handle demand spikes and special situations where you simply don't have the capacity (or the 'right' capacity) for a particular job. They're also great for testing new workloads, new hardware, or new techniques. But for day in, day out heavy utilization, they're significantly more costly than on-prem IT and less flexible. When you consider your AI training or inferencing workloads, the costs can spiral a lot higher due to high demand for accelerators in the cloud, the massive (and often unpredictable) amount of data and processing needed for AI, and the time and performance demands you'll be placing on inference operations. There are also security issues to address, particularly when it comes to custom LLMs (large language models).
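If you want to pressure-test the cloud-versus-on-prem claim for your own situation, a simple utilization breakeven calculation is a good start. Every number below (the hourly rate, the server price, the yearly operating cost) is an assumed placeholder, not a quote:

```python
# Rough 3-year cost comparison: renting an 8-GPU cloud instance vs. buying an
# 8-GPU server. All dollar figures are assumed placeholders -- use your own quotes.

HOURS_PER_YEAR = 8_760
YEARS = 3

cloud_rate_per_hour = 40.0        # assumed on-demand price for an 8-GPU instance
onprem_capex = 300_000.0          # assumed purchase price of an 8-GPU server
onprem_opex_per_year = 60_000.0   # assumed power, cooling, space, and admin

def cloud_cost(utilization: float) -> float:
    """Cloud spend if the instance runs this fraction of the time."""
    return cloud_rate_per_hour * HOURS_PER_YEAR * YEARS * utilization

onprem_total = onprem_capex + onprem_opex_per_year * YEARS

for util in (0.10, 0.25, 0.50, 0.75):
    print(f"{util:>4.0%} utilization: cloud ${cloud_cost(util):>10,.0f} "
          f"vs. on-prem ${onprem_total:,.0f}")
```

With these placeholder numbers the crossover lands somewhere around half-time utilization: below that, bursting to the cloud is the cheaper play; above it, owning the gear wins, and the gap widens the harder you drive it.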
Co-location or the co-location-like programs that major vendors are promoting can be solid alternatives, particularly in cases where you can't get more electricity into your data center (or can't free up enough juice by managing it better). But carefully investigate what you're getting into, make sure you've identified the lock-ins (and there are always lock-ins), and understand what you're agreeing to.
On our Infrastructure Efficiency page (LINK), most of the categories have some sort of impact on cost efficient computing, particularly the Datacenter Assessment, Workload Management, and Performance (System and Task) sections.
Upbeat Summary: Is Your Data Center Doomed to a Bleak Future?
Nah, it's not. But it does have to change. The folks in supercomputing have been facing these issues for decades. Their mission is to drive science (including computing science) at the very highest level possible. The same is true for private supercomputing organizations like those in the energy sector, life sciences, aerospace, financial services, and several other segments. While their computing budgets are large, they aren't unlimited, and they're in competition with others to get their share of the spending. All of this drives them to maximize their performance bang for the buck and, as a result, they have the most efficient IT organizations.
But the same technologies and methods that bring them way higher IT efficiency are available to any data center that can spend a little time seeing what's out there and how it might help in their particular situation. Our whole goal here is to explain how.