
Blueprinting AI for Science at Exascale (BASE-II)

BASE-II was a UK research project whose main goal was to develop a software suite and designs to serve as blueprints for using AI for scientific discovery at exascale. BASE-II was phase II of the BASE-I (Benchmarking for AI for Science at Exascale) project; it started in December 2022 and ran until the end of March 2025, facilitating faster scientific discovery and cross-scientific knowledge exchange.

 

BASE-II delivered software:

  • to support exascale computing from the AI point of view by developing a software suite and designs
  • to address challenges in critical areas of AI:
    • AI in benchmarking
    • AI in HPC
    • Hardware-software co-design
    • Learning from large-scale datasets, tools, frameworks, and workflows.

Throughout the project, BASE-II

  • Supported collaboration and knowledge exchange between academia, national laboratories, and industry.
  • Encouraged cooperation between various scientific disciplines.

Serving the exascale community, BASE-II

  • Actively engaged and supported knowledge exchange with ExCALIBUR-funded projects in the UK, and relevant initiatives in Europe and the USA.
  • Shared experience with evolving UKRI e-Infrastructures, such as Science Clouds, the IRIS science platform, and exascale-cloud middleware.

About the project

The project was split into four work packages:

  • An AI Benchmark Toolbox consisting of benchmarks, datasets and frameworks for an extensive number of scientific, climate and fusion research cases
  • Methodologies and solutions for different classes of surrogate cases for complex simulations, including an open-source model zoo for surrogate models (SMs)
  • Hardware-software co-design for exascale systems, supporting both communities and stimulating their partnership
  • Knowledge exchange and community building, by organising training and communication events, between exascale, research software engineering, AI and science communities

Aim and impact of BASE-II

Following the success of the BASE-I project, BASE-II set out to address identified challenges and requirements in developing AI for Science solutions at exascale. SciML bench, released as part of BASE-I, provided the scientific community with examples and templates for several challenging problems from different research areas. However successful, benchmarks alone cannot be the only solution for the AI for Science community’s challenges.

This work helped to formulate a core set of requirements gathered from the community, which we decided to concentrate on:

  • Understanding the interactions between AI algorithms and systems across a range of scientific problems using benchmarks, so that better AI software can be designed, with an emphasis on reproducibility (AI Benchmarking)
  • Better HPC scalability using AI (AI/HPC Convergence)
  • Cross-leveraging information about AI hardware and AI software for the betterment of the two through co-design principles (AI Hardware/Software Co-Design)
  • Enabling scientists to better understand very large-scale, complex, and multi-modal datasets, with attention to incremental learning and generative modelling (Learning from Large-scale Datasets)
  • Equipping the community with an ecosystem of tools, frameworks, and workflows to support the development of AI for Science applications for exascale (AI at Exascale Toolbox)

The project aimed to develop a suite of exascale-ready software and relevant designs for addressing these highly prioritised requirements from the AI for Science community: Blueprinting AI for Science at Exascale.

The plan was to keep our deliverables relevant to UKRI’s e-Infrastructures, and to the communities, through close engagement with ExCALIBUR-funded projects, industry, user bases, academia, national laboratories, and international organisations. In addition, numerous knowledge exchange activities were intended to maximise the flow of information between relevant communities, supporting the project's success.

Outreach and knowledge exchange

To ensure BASE-II significantly impacts scientific, academic, and industry communities in the UK and worldwide, we created an extensive plan for cross-disciplinary Knowledge Exchange. This plan included the following:

  • Regular communication with other projects in the ExCALIBUR portfolio and other exascale communities.
  • Training events for all AI communities in the UK, including all ExCALIBUR projects, academic institutions, and public sector research establishments.
  • In-Field Placements to upskill research software engineers working across other ExCALIBUR projects.
  • Community Workshops open to industry, academia, and national laboratories.
  • ExCALIBUR-themed Mini Workshops for knowledge exchange between different projects.
  • Quarterly meetings with industry.
  • International engagements in the form of collaborations and monthly meetings.
  • Other types of communication and engagement via institutional seminars, publications, conferences, and tutorials.

The project was promoted across the wider scientific and business communities via LinkedIn and YouTube.

The Benchmark Suite

The purpose of AI benchmarking is to

  • assess the merits and limitations of various AI solutions,
  • rank the multi-GPU systems and software platforms, and
  • interpret the measurements.

AI benchmarking is currently experiencing intensive growth, characterised by the appearance of numerous benchmark suites. Variety is good, but it makes the selection of suitable benchmarks more challenging, since at present there is no generally accepted set of benchmarks, let alone agreement within the AI research community or industry.

In BASE-II, we suggest a characterisation approach that describes the resource requirements of AI applications in terms of computations and data movements, in order to predict runtime and scalability. The work on benchmarking is an ongoing effort that aims to keep pace with the latest developments in GPU chip design.
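As an illustration of characterising an application by its computations and data movements, the sketch below estimates runtime with a simple roofline-style bound. This is an assumption made for illustration, not the prediction model developed in BASE-II; the class names, the function names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    flops: float        # total floating-point operations (hypothetical value)
    bytes_moved: float  # total bytes moved to and from GPU memory


@dataclass
class Gpu:
    peak_flops: float     # sustained FLOP/s, e.g. measured by a small benchmark
    mem_bandwidth: float  # sustained memory bandwidth in bytes/s


def predict_runtime(w: Workload, g: Gpu) -> float:
    """Roofline-style estimate: the slower of compute and data movement dominates."""
    compute_time = w.flops / g.peak_flops
    memory_time = w.bytes_moved / g.mem_bandwidth
    return max(compute_time, memory_time)


def is_compute_bound(w: Workload, g: Gpu) -> bool:
    """True when the compute time is at least as large as the data movement time."""
    return w.flops / g.peak_flops >= w.bytes_moved / g.mem_bandwidth


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
example_gpu = Gpu(peak_flops=1e13, mem_bandwidth=1e12)
example_workload = Workload(flops=2e13, bytes_moved=1e12)
example_runtime = predict_runtime(example_workload, example_gpu)
```

With these made-up parameters the compute time (2 s) exceeds the data movement time (1 s), so the workload would be classified as compute bound. A fuller model would also need terms for communication and I/O.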

The repository BASE-II Benchmarks contains small, medium and large-scale benchmarks, where

    • Small benchmarks measure the runtime of frequently used operations and the key hardware parameters,
    • Medium-sized benchmarks focus on convolutional networks (ResNet, VGG) and transformer models (GPT, BERT, T5), and
    • Large-scale benchmarks represent complex applications drawn from various domains such as numerical weather prediction, material science and natural language processing.

Small benchmarks

Deriving algebraic expressions for predicting the runtime, power usage and scalability of AI applications is an important step of the benchmarking activity. These algebraic expressions combine parameters of the workload and the multi-GPU system. A collection of simple benchmarks has been developed for measuring key performance characteristics such as the FLOP rate of GPUs and the memory, communication and I/O bandwidths. The FLOP rate of frequently used operations, including matrix multiplication, vector operations and collective communications, is also measured. The advantage of these benchmarks is that they are simple and portable, and the measurements are easy to interpret.
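A minimal sketch of what such a small benchmark might look like is given below. It times a matrix multiplication and converts the result into a FLOP rate. It is not taken from the BASE-II Benchmarks repository: it uses pure Python on the CPU so that it runs anywhere, whereas a real benchmark would call a GPU library, and the function names are hypothetical.

```python
import time


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_columns = list(zip(*b))  # transpose b so that columns can be iterated
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


def matmul_flops(n, k, m):
    """FLOP count of an (n x k) by (k x m) product: one multiply and one add per term."""
    return 2 * n * k * m


def measure_flop_rate(n=64, repeats=3):
    """Return the best observed FLOP/s for an n x n matrix multiplication."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return matmul_flops(n, n, n) / best


if __name__ == "__main__":
    print(f"approximate FLOP rate: {measure_flop_rate():.3e} FLOP/s")
```

Taking the best of several repeats reduces the influence of timing noise. The same pattern (count the operations, time them, divide) carries over to vector operations and bandwidth measurements.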

Medium-sized benchmarks

The medium-sized benchmarks represent convolutional (ResNet, VGG) and transformer networks (GPT-2, BERT, T5).

  1. GPT-2 (Generative Pre-trained Transformer 2): A decoder-only architecture developed by OpenAI. It is one of the dominant LLMs and is used mainly for text generation, completion, and understanding. The decoder architecture processes the input text in a unidirectional manner, and the Self-Attention Mechanism (SAM) uses multi-head self-attention to capture both short and long-term relationships between tokens. GPT-2 is pre-trained using the Causal Language Modelling (CLM) algorithm, which predicts the next token based on all previous ones. This makes GPT-2 suitable for text generation, sentence completion and summarisation. Various versions of GPT-2 are available, ranging from 117 million to 1.5 billion parameters (https://huggingface.co/openai-community/gpt2).
  2. BERT: A bidirectional transformer, which is an encoder-based model trained by the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) algorithms (https://huggingface.co/docs/transformers/en/model_doc/bert). BERT is mainly used for text classification, sentiment analysis, question answering and named entity recognition (NER). The Attention Masking (AM) algorithm analyses all tokens in the input sequence simultaneously.
  3. T5: A versatile sequence-to-sequence model which can handle various NLP tasks such as text classification, summarization, translation and question answering (https://huggingface.co/docs/transformers/en/model_doc/t5). T5 is often pre-trained on large-scale datasets like C4 (Colossal Clean Crawled Corpus, about 200B words, 750 GBytes). For the fine-tuning of T5, a subset (af) of C4 was used which contains 2.15 million rows of text (https://huggingface.co/datasets/allenai/c4).
  4. Convolutional networks: ResNet (Residual Network) is a typical representative of Convolutional Neural Networks (CNNs), frequently used for image processing tasks such as segmentation, classification and object detection. The main idea of ResNet is the introduction of shortcut connections which can skip an individual layer or a group of layers, allowing the network to learn the residual much faster. Different versions are available, ranging from ResNet-18 to ResNet-152, with 18 and 152 layers respectively. VGG networks have a simpler modular architecture; they achieve a higher FLOP rate on GPUs and are often used for object recognition tasks. For benchmarking compute and memory intensive AI applications, the VGG16 and VGG19 variants were used.
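The difference between the unidirectional processing of GPT-2 and the bidirectional processing of BERT described in the list above can be sketched with attention masks. This is a simplified illustration in plain Python with hypothetical function names. Real implementations apply such masks to the attention scores inside each attention head, and the sketch does not model BERT's masked-token training objective.

```python
def causal_mask(n):
    """Unidirectional mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_tokens(tokens, mask, position):
    """Tokens that the given position is allowed to attend to under the mask."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


tokens = ["the", "cat", "sat"]
decoder_view = visible_tokens(tokens, causal_mask(len(tokens)), 1)
encoder_view = visible_tokens(tokens, bidirectional_mask(len(tokens)), 1)
```

Here `decoder_view` contains only "the" and "cat", because a causal model predicting the next token must not see "sat", while `encoder_view` contains all three tokens.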

Large-scale benchmarks

  1. Numerical Weather Prediction (NWP), NERSC benchmark: The NWP benchmark is based on the NERSC application presented at the SC23 tutorial (https://github.com/NERSC/sc23-dl-tutorial). The importance of NWP has been outlined in the latest NOAA report (https://sab.noaa.gov/wp-content/uploads/4.0-DL4NWP_NOAAResponse_Nov2024.pdf). The full dataset (~1TB) consists of 28 years of ERA5 planet observations containing hourly estimates of numerous variables at a 3D grid resolution of 0.25 degrees, from the surface to 100km altitude (https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5). The benchmark uses a smaller dataset which can be found at https://portal.nersc.gov/project/dasrepo/pharring/sc23_data. The input data is in H5 format and split into three categories: training (728GB), validation (56GB) and testing (28GB). This application provides better forecast accuracy for surface wind speed and precipitation than traditional weather prediction methods. Testing of the code has been performed on the Polaris (ANL), Summit (ORNL) and Crusher (ORNL) machines. The application scales well: 256 V100 GPUs were used on Summit. The best performance was achieved on Polaris, which uses NVIDIA A100 GPUs. The Crusher machine uses AMD MI250 GPUs, and its runtimes were similar to those achieved on Summit.
  2. DeepCam: A climate modelling application developed by Lawrence Berkeley National Laboratory; a modified version has been included in the set of MLPerf benchmarks (https://github.com/mlcommons/hpc/tree/main/deepcam). The application uses a Convolutional Network model, and the training dataset is generated using CAM5 simulations. The small training dataset is 650 MBytes in size and contains 1537 samples. The shape of each input image sample is (768, 1152, 16), and there are three output labels: background, atmospheric river and tropical cyclone. The code has been profiled to determine the volume of data movement and the computational load. Using the application and the multi-GPU system parameters, a prediction model was derived which indicates that this application is compute bound. Testing has been performed on an NVIDIA DGX-2 system with A100 GPUs.
  3. OLMo: An open-source large language model based on a transformer network, widely used in academic research (https://huggingface.co/allenai/OLMo-7B). Models of various sizes are available, from 1 to 7 billion parameters. For training, the Dolma dataset was used, which contains 2.45 trillion tokens (https://huggingface.co/datasets/allenai/dolma). The 1 and 7 billion parameter versions of OLMo were tested on an NVIDIA DGX-2 system.
  4. Megatron: Training LLMs presents many challenges with respect to data distribution, orchestration of computations and scalability. Megatron-LM, developed by NVIDIA, is an environment for large-scale training of LLMs ranging from 1B to 1T parameters (https://github.com/NVIDIA/Megatron-LM). Megatron supports three types of parallelism: Data, Tensor (intra-layer) and Pipeline (inter-layer). Data parallelism is used for small models which can fit into a GPU’s memory. The model is replicated on the GPUs, and each GPU processes a different portion of the data. During backpropagation the gradients are aggregated on the master node and averaged across GPUs. In Tensor parallelism, a single layer of the model is split across GPUs. This is an example of horizontal slicing of the network (intra-layer parallelism), where GPUs collaborate to compute the outputs of a single layer. In Pipeline (inter-layer) parallelism, the layers of the network are grouped and assigned to different GPUs, and the data flows across the GPUs. This is an example of vertical slicing of the network. For benchmarking purposes, Megatron-LM was used for training GPT-2 models of different sizes (345M and 1.5B parameters). The application was tested on an NVIDIA DGX-2 using 16 GPUs, and all three types of parallelism have been tested.
  5. EquiformerV2: A GNN (Graph Neural Network) used for modelling atomic structures, developed as part of the Open Catalyst challenge (https://github.com/atomicarchitects/equiformer_v2). The model size is 153 million parameters. Two input datasets were used for training, OC20 and OC22, with 130 million data points representing 3D atomic coordinates (https://fair-chem.github.io/core/datasets/oc22.html). EquiformerV2 was ported to Polaris (ANL) and its scalability was tested on 256 A100 GPUs.
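The data and tensor parallelism described for Megatron in the list above can be illustrated with a toy example. This is a sketch in plain Python with hypothetical function names and does not use the Megatron-LM API. Each "GPU" is just a list entry, and the weight matrix is assumed to have a column count divisible by the number of shards. Pipeline parallelism is not shown.

```python
def average_gradients(per_gpu_grads):
    """Data parallelism: average the gradient vectors computed by each replica."""
    n = len(per_gpu_grads)
    return [sum(values) / n for values in zip(*per_gpu_grads)]


def matvec(x, w):
    """Compute x @ w for a vector x of length k and a k x m matrix w."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Tensor parallelism: split the columns of one layer's weights into shards."""
    step = len(w[0]) // parts  # assumes the column count is divisible by parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matvec(x, w, parts):
    """Each shard computes its slice of the output; the slices are concatenated."""
    output = []
    for shard in split_columns(w, parts):  # each shard would live on its own GPU
        output.extend(matvec(x, shard))
    return output


# Toy values chosen for easy checking by hand.
weights = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0]]
inputs = [1.0, 1.0]
single_gpu_output = matvec(inputs, weights)
sharded_output = tensor_parallel_matvec(inputs, weights, parts=2)
averaged = average_gradients([[1.0, 2.0], [3.0, 6.0]])
```

The sharded computation reproduces the single-device result, which is the property that lets a layer too large for one GPU be spread across several. In a real system the concatenation and the gradient averaging are performed with collective communication operations.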

Enquiries

For more information and all enquiries, please contact us.
