Paragon Fellowship

Federal Government

The Current State of
USG Supercomputer Capacities and Utilization



Project Description

The supercomputers owned and operated by the United States federal government (USG) constitute a fleet of sophisticated computational resources that can be deployed at a moment's notice to address critical use cases. Historically, these supercomputers have played an important role in solving complex problems and modeling predictions, achieving scientific breakthroughs that opened opportunities for multidisciplinary applications across both the private and public sectors in the United States. In this era of increasing computational complexity, addressing our dependency on legacy systems and expanding the supercomputer fleet are consequential if the U.S. is to remain globally competitive while satisfying its domestic needs.

To help meet the computational needs of today and the demands of tomorrow, the Energy Team of the Paragon Policy Fellowship's Fall 2024 cohort aggregated and consolidated pertinent information on the highest-performing USG supercomputers. We then processed and analyzed the data and created a Wiki site.

Project Aims

Federal agencies are required by rule to publish data on their facilities and utilization. Yet there is no integrated database that provides a bird's-eye view of all high-performance computing (HPC) facilities in the federal government. Our team set out to fill this gap in three ways:


1) Data Consolidation: Conduct a comprehensive survey of USG supercomputers, their operational purposes, and their current capabilities.

2) Performance Comparison: Calculate the approximate amounts of compute and time required to train a large language model (LLM) similar to OpenAI's GPT-4 on the newest supercomputers owned and operated by the U.S. Department of Energy (DOE), and compare the results to industry-funded supercomputers. This supplements CSET Georgetown's May 2024 analysis of the National Artificial Intelligence Research Resource (NAIRR) and its compute power.

3) Resource Allocation Analysis: Analyze changes from 2019 to 2024 in compute allocation data from three main DOE research initiatives: Innovative and Novel Computational Impact on Theory and Experiment (INCITE), the ASCR Leadership Computing Challenge (ALCC), and the Energy Research Computing Allocations Process (ERCAP).


This project charts the capabilities and utilization of federal HPC resources in recent years in a manner that is comprehensive yet accessible to the public. At the same time, the database may serve as an impetus for more coordinated discussions on future HPC investments among federal agencies.

Methodology

Our research process began with identifying pertinent data on USG supercomputers. As mentioned in the previous section, the project is divided into three distinct parts. Each part required a unique set of data; together, these data form a complete picture of federal HPC resources.


1) Data Consolidation: We used the June 2024 TOP500 list to streamline and expedite data collection. TOP500 is an internationally recognized and authoritative source of data on supercomputer capabilities across the globe. We corroborated the TOP500 data against the official websites of individual supercomputers.

2) Performance Comparison: We collected training hardware and performance-per-GPU data from official reports on each relevant supercomputer. For Eagle, Frontier, and Aurora, we pulled figures on total compute power from the TOP500 HPL (64-bit precision) benchmark rankings. HPL-MxP benchmarks at lower precisions are also available online; however, published HPL-MxP results are not tied to a single, well-defined precision. We therefore took the most conservative approach when estimating training time, cost, and energy usage, assuming a relatively low floating-point operations per second (FLOPS) speedup of 2x when converting from 64-bit to 32-bit precision (see the estimation sketch following this list). Nevertheless, when available, we show comparisons using HPL-MxP data for DOE computers and label them as such for interested parties.

3) Resource Allocation Analysis: Our team initially searched the official websites of DOE national laboratories for data. We followed this with a search through the document repositories of the DOE Office of Scientific and Technical Information (OSTI) and the Advanced Scientific Computing Research (ASCR) program. Finally, we tracked down the data needed for analysis on the websites of individual DOE research programs, including INCITE, ALCC, and NERSC. The process took two months to complete. After using R to extract the data from the INCITE, ALCC, and NERSC websites (a sketch of this step also follows this list), we processed the data in Google Sheets and Microsoft Excel, where we performed the planned analyses, interpreted the results, and created figures.
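
The training-time estimates in the Performance Comparison follow a simple relationship, sketched below in our own illustrative notation (the symbols are ours, not drawn from TOP500 or vendor documentation):

\[
T_{\text{train}} \approx \frac{C_{\text{train}}}{R_{\text{HPL}} \times s \times u}
\]

where \(C_{\text{train}}\) is the total training compute (in FLOP) for a GPT-4-scale model, \(R_{\text{HPL}}\) is the system's measured HPL (64-bit) throughput (in FLOP/s), \(s \approx 2\) is the assumed speedup from converting 64-bit to 32-bit precision, and \(u \in (0, 1]\) is a utilization factor. The alternative estimates substitute a system's HPL-MxP throughput for \(R_{\text{HPL}} \times s\).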
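
For the Resource Allocation Analysis, the extraction step can be sketched as follows. This is a minimal illustration assuming the rvest package; the URL and file name are hypothetical placeholders, and the actual award pages and table layouts vary by program and year.

# Sketch: scrape an HTML allocation table into a data frame with rvest.
library(rvest)

# Hypothetical placeholder URL; actual program pages differ.
url <- "https://example.gov/incite-awards-2024"

page <- read_html(url)

# Collect every HTML table on the page as a list of data frames.
tables <- html_table(html_elements(page, "table"))

# Export the first table for downstream processing in a spreadsheet.
write.csv(tables[[1]], "incite_allocations_2024.csv", row.names = FALSE)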

Project Deliverables

Data Consolidation

We present our analysis of 59 USG supercomputers currently in operation and their computational capabilities. We find an overwhelming dominance of DOE in the field of HPC within the federal government: DOE systems account for 85.2% of the processing power, measured in FLOPS, of the extant fleet, even though only 18 of the 59 USG supercomputers (30.51%) belong to DOE. Frontier and Aurora rank among the world's most powerful supercomputers in operation. These figures indicate that DOE supercomputers can, on average, process more data in a shorter timeframe than comparable systems owned and operated by other federal agencies.

The USG Supercomputer Wiki supplements our report.


Performance Comparison

Relative comparisons using HPL benchmarks indicate that xAI's Colossus, which we estimate could train a GPT-4-scale model in 3.9 days, outperforms both Frontier (324.5 days) and Aurora (385 days) by significant margins. Comparisons based on HPL-MxP benchmarks yield noticeably lower estimates for Frontier (76.4 days) and Aurora (73.5 days), but these figures still exceed the estimate for Colossus. Moreover, requiring at least 13 GWh and roughly $1.3 million to train such a model, Colossus also exhibits greater energy efficiency than either Frontier (147.6 GWh, roughly $14.7 million) or Aurora (268.18 GWh, roughly $26.82 million). We assess that, although DOE supercomputers are the most powerful among those already deployed and in operation, private sector supercomputers have caught up and will soon surpass them in performance.
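
As a consistency check, the dollar figures above follow from the energy figures at an implied electricity price of roughly $0.10 per kWh (our inference from the report's own numbers, not a rate quoted by any operator):

\[
\text{Cost} \approx E \times p, \qquad p \approx \$0.10/\text{kWh}
\]

For example, 147.6 GWh × $0.10/kWh ≈ $14.76 million (Frontier), 268.18 GWh × $0.10/kWh ≈ $26.82 million (Aurora), and 13 GWh × $0.10/kWh ≈ $1.3 million (Colossus).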


Resource Allocation Analysis

The computational capabilities of USG supercomputers have improved rapidly, which implies that even as scientific research grows more complex, DOE facilities will be able to accommodate many more users in the future. Indeed, the deployment of more advanced supercomputers such as Frontier, Aurora, and Perlmutter has made more compute power available to the scientific community. The total FLOPS supplied to qualified projects through INCITE, ALCC, and ERCAP increased from 8.02 × 10²⁴ in 2019 to 3.52 × 10²⁵ in 2024, a roughly 4.4-fold increase (see the calculation below). Simultaneously, we observe a steady increase in the number of projects that either utilized artificial intelligence (AI) or conducted foundational research on the subject. Over 20% of projects accepted by INCITE and ALCC today relate to AI and machine learning (ML).
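
For transparency, the growth figure is the ratio of the two annual totals:

\[
\frac{3.52 \times 10^{25}}{8.02 \times 10^{24}} \approx 4.39
\]

That is, 2024 allocations were roughly 4.4 times the 2019 level, equivalent to a 339% increase.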


Project Impact and Future Work

We show in this report that DOE's dominance in HPC highlights its critical role in advancing scientific research and protecting national priorities. DOE supercomputers, such as Frontier and Aurora, remain the most powerful systems in the world. However, the rise of private sector supercomputers reveals a competitive landscape in which corporate systems are quickly surpassing government capabilities in certain fields, such as AI. Despite this shifting landscape, DOE has made significant strides in recent years in improving access to its HPC resources and in supporting the increasing adoption of AI and ML in federally supported projects. Moving forward, while DOE's supercomputing infrastructure remains robust, it must anticipate the rapidly evolving demands of AI research and provide an alternative to private sector systems that can process more data at a faster pace and at lower energy cost. This report provides a foundation for DOE to assess its strategic direction and prioritize investments in HPC innovations, ensuring the U.S. remains globally competitive and well equipped to address future computational challenges.

Contributors

Jae Wan Ahn (Project Lead)

University of Chicago

Audrey Berlie

Stetson University

Eric Gong

Harvard University

Bryn Kerslake

Colby College

Arshi Mahajan

Dartmouth College

Madison Moreau

University of Chicago

Uchenna Andrew Offorjebe

University of Chicago

Virginia Washington

University of Chicago