
Blueprinting AI for Science at Exascale (BASE-II)
BASE-II was a UK research project whose main goal was to develop a software suite and designs to serve as blueprints for applying AI to scientific discovery at exascale. It was phase II of the BASE-I (Benchmarking for AI for Science at Exascale) project; it started in December 2022 and ran until the end of March 2025, facilitating faster scientific discovery and cross-disciplinary knowledge exchange.
BASE-II delivered software:
- to support exascale computing from the AI point of view by developing a software suite and designs
- to address challenges in critical areas of AI:
  - AI in benchmarking
  - AI in HPC
  - Hardware-software co-design
  - Learning from large-scale datasets, tools, frameworks, and workflows.
Throughout the project, BASE-II
- Supported collaboration and knowledge exchange between academia, national laboratories, and industry.
- Encouraged cooperation between various scientific disciplines.
Serving the exascale community, BASE-II
- Actively engaged and supported knowledge exchange with ExCALIBUR-funded projects in the UK, and with relevant initiatives in Europe and the USA.
- Shared experience with evolving UKRI e-Infrastructures, such as Science Clouds, the IRIS science platform, and exascale-cloud middleware.
About the project
The project was split into four work packages:
- An AI Benchmark Toolbox consisting of benchmarks, datasets and frameworks for an extensive number of scientific, climate and fusion research cases
- Methodologies and solutions for different classes of surrogate cases for complex simulations, including an open-source model zoo for surrogate models (SMs)
- Hardware-Software codesign for exascale systems, supporting both communities and stimulating their partnership
- Knowledge exchange and community building between the exascale, research software engineering, AI and science communities, by organising training and communication events
Aim and impact of BASE-II
Following the success of the BASE-I project, BASE-II addressed challenges and requirements identified in developing AI for Science solutions at exascale. SciMLBench, released as part of BASE-I, provided the scientific community with examples and templates for several challenging problems from different research areas. However successful, benchmarks alone cannot address all of the AI for Science community's challenges.
This work helped to formulate a core set of requirements, gathered from the community, on which we decided to concentrate:
- Understanding the interactions between AI algorithms and systems across a range of scientific problems using benchmarks, so that better AI software can be designed, with an emphasis on reproducibility (AI Benchmarking)
- Better HPC scalability using AI (AI/HPC Convergence)
- Cross-leveraging information about AI hardware and AI software for the betterment of both through co-design principles (AI Hardware/Software Co-Design)
- Enabling scientists to better understand very large-scale, complex, and multi-modal datasets, with attention to incremental learning and generative modelling (Learning from Large-scale Datasets)
- Equipping the community with an ecosystem of tools, frameworks, and workflows to support the development of AI for Science applications for exascale (AI at Exascale Toolbox)
This project developed a suite of exascale-ready software and relevant designs addressing these high-priority requirements from the AI for Science community: Blueprinting AI for Science at Exascale.
We ensured that our deliverables remained relevant to UKRI's e-Infrastructures, and to the communities, through close engagement with ExCALIBUR-funded projects, industry, various user bases, academia, national laboratories, and international organisations. In addition, numerous knowledge exchange activities underpinned the flow of information between the relevant communities.
Outreach and knowledge exchange
To ensure BASE-II significantly impacts scientific, academic, and industry communities in the UK and worldwide, we created an extensive plan for cross-disciplinary Knowledge Exchange. This plan included the following:
- Regular communication with other projects in the ExCALIBUR portfolio and other exascale communities.
- Training events for all AI communities in the UK, including all ExCALIBUR projects, academic institutions, and public sector research establishments.
- In-Field Placements to upskill research software engineers working across other ExCALIBUR projects.
- Community Workshops open to industry, academia, and national laboratories.
- ExCALIBUR-themed Mini Workshops for knowledge exchange between different projects.
- Quarterly meetings with industry.
- International engagement through collaborations and monthly meetings.
- Other communication and engagement via institutional seminars, publications, conferences, and tutorials.
The project was promoted across the wider scientific and business communities via LinkedIn and YouTube.

The Benchmark Suite
The purpose of AI benchmarking is to
- assess the merits and limitations of various AI solutions,
- rank the multi-GPU systems and software platforms, and
- interpret the measurements.
AI benchmarking is currently experiencing intensive growth, characterised by the appearance of numerous benchmark suites. Variety is good, but it makes the selection of suitable benchmarks more challenging: at present there is no generally accepted set of benchmarks, let alone agreement on one within the AI research community or industry.
In BASE-II, we suggest a characterisation approach that describes the resource requirements of AI applications in terms of computation and data movement, in order to predict runtime and scalability. The benchmarking work is an ongoing effort whose aim is to keep pace with the latest developments in GPU design.
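To make this concrete, the sketch below shows the general shape of such a characterisation as a simple roofline-style runtime model (an illustrative Python form with made-up numbers; the project's actual expressions combine many more workload and system parameters):

    # Roofline-style lower bound on step runtime: a training step needs at
    # least flops / peak_flops seconds of compute and bytes / bandwidth
    # seconds of data movement; the larger term bounds the step time.
    def step_time_lower_bound(flops, bytes_moved, peak_flops, mem_bandwidth):
        compute_time = flops / peak_flops          # compute-bound component
        memory_time = bytes_moved / mem_bandwidth  # memory-bound component
        return max(compute_time, memory_time)

    # Hypothetical workload: 10 TFLOP of compute and 400 GB of data movement
    # on a GPU with 100 TFLOP/s peak and 2 TB/s memory bandwidth.
    t = step_time_lower_bound(flops=10e12, bytes_moved=400e9,
                              peak_flops=100e12, mem_bandwidth=2e12)
    print(f"lower bound: {t:.3f} s")  # 0.200 s -> memory-bound in this case

A measured runtime far above such a bound usually points to overheads (I/O, communication, kernel launch) that the characterisation then has to account for explicitly.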
The repository BASE-II Benchmarks contains small, medium and large-scale benchmarks, where
- Small benchmarks measure the runtime of frequently used operations and the key hardware parameters,
- Medium-sized benchmarks focus on convolutional networks (ResNet, VGG) and transformer models (GPT, BERT, T5), and
- Large-scale benchmarks represent complex applications drawn from various domains such as numerical weather prediction, materials science and natural language processing.
Small benchmarks
Deriving algebraic expressions for predicting the runtime, power usage and scalability of AI applications is an important step in the benchmarking activity. These algebraic expressions combine parameters of the workload and of the multi-GPU system. A collection of simple benchmarks has been developed for measuring key performance characteristics such as the FLOP rate of GPUs and the memory, communication and I/O bandwidths. The FLOP rate of frequently used operations, including matrix multiplication, vector operations and collective communications, is also measured. The advantage of these benchmarks is that they are simple and portable, and their measurements are easy to interpret.
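As an illustration of what such a small benchmark can look like, here is a minimal sketch (assuming PyTorch and a CUDA-capable GPU; the project's own benchmarks may be structured differently) that measures the achieved FLOP rate of a dense matrix multiplication:

    import time
    import torch

    def matmul_tflops(n=8192, repeats=10, device="cuda"):
        """Measure the achieved TFLOP/s of an n x n matrix multiplication."""
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        _ = a @ b                         # warm-up (kernel selection, caches)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            _ = a @ b
        torch.cuda.synchronize()          # wait for all kernels to finish
        elapsed = time.perf_counter() - start
        return 2 * n**3 * repeats / elapsed / 1e12  # 2*n^3 FLOPs per matmul

    if __name__ == "__main__":
        print(f"achieved: {matmul_tflops():.1f} TFLOP/s")

Comparing the achieved figure against the vendor's quoted peak immediately shows how efficiently a given operation uses the hardware.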
Medium-sized benchmarks
The medium-sized benchmarks represent convolutional (ResNet, VGG) and transformer networks (GPT-2, BERT, T5); a minimal timing sketch follows the list below.
- GPT-2 (Generative Pre-trained Transformer 2): A decoder-only architecture developed by OpenAI, and one of the dominant LLMs, used mainly for text generation, completion, and understanding. The decoder architecture processes the input text in a unidirectional manner, and the Self-Attention Mechanism (SAM) uses multi-head self-attention to capture both short- and long-term relationships between tokens. GPT-2 is pre-trained using the Causal Language Modelling (CLM) objective, which predicts the next token based on all previous ones. This makes GPT-2 suitable for text generation, sentence completion and summarisation. Various versions of GPT-2 are available, ranging from 117 million to 1.5 billion parameters (https://huggingface.co/openai-community/gpt2).
- BERT: A bidirectional transformer, an encoder-based model trained with the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives (https://huggingface.co/docs/transformers/en/model_doc/bert). BERT is mainly used for text classification, sentiment analysis, question answering and named entity recognition (NER). Its Attention Masking (AM) algorithm analyses all tokens in the input sequence simultaneously.
- T5: A versatile sequence-to-sequence model which can handle various NLP tasks such as text classification, summarisation and translation (https://huggingface.co/docs/transformers/en/model_doc/t5). T5 is typically pre-trained on large-scale datasets such as C4 (Colossal Clean Crawled Corpus, about 200B words, 750GB) for tasks such as translation, summarisation and question answering. For the fine-tuning of T5, the 'af' subset of C4, containing 2.15 million rows of text, was used (https://huggingface.co/datasets/allenai/c4).
- Convolutional Networks: ResNet (Residual Network) is a typical representative of Convolutional Neural Networks (CNNs), frequently used for image processing tasks such as segmentation, classification and object detection. The main idea of ResNet is the introduction of shortcut connections which can skip individual layers or groups of layers, enabling the network to learn the residual mapping much faster. Different versions are available, from ResNet-18 to ResNet-152, with 18 to 152 layers. VGG networks have a simpler, modular architecture; they achieve a higher FLOP rate on GPUs and are often used for object recognition tasks. For benchmarking compute- and memory-intensive AI applications, the VGG16 and VGG19 variants were used.
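As an illustration, the core of a medium-sized benchmark can be as simple as timing training steps of one of these models. The sketch below assumes the Hugging Face transformers library and a CUDA GPU, and uses a randomly initialised GPT-2 so that no weights need downloading; the project's benchmarks record considerably more detail:

    import time
    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    # Randomly initialised GPT-2 (117M-parameter configuration); pre-trained
    # weights could instead be loaded with
    # GPT2LMHeadModel.from_pretrained("gpt2").
    model = GPT2LMHeadModel(GPT2Config()).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    batch = torch.randint(0, 50257, (8, 512), device="cuda")  # random token ids

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):                    # time 10 training steps
        out = model(batch, labels=batch)   # CLM loss: predict the next token
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()
    print(f"{(time.perf_counter() - start) / 10:.3f} s per step")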
Large-scale benchmarks
- Numerical Weather Prediction (NWP), NERSC Benchmark: This benchmark is based on the NERSC application presented in the SC23 tutorial (https://github.com/NERSC/sc23-dl-tutorial). The importance of NWP has been outlined in the latest NOAA report (https://sab.noaa.gov/wp-content/uploads/4.0-DL4NWP_NOAAResponse_Nov2024.pdf). The full dataset (~1TB) consists of 28 years of ERA5 reanalysis data, containing hourly estimates of numerous variables on a 3D grid at 0.25-degree resolution, from the surface to 100km altitude (https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5). The benchmark uses a smaller dataset, which can be found at https://portal.nersc.gov/project/dasrepo/pharring/sc23_data. The input data is in HDF5 format and is split into three categories: training (728GB), validation (56GB) and testing (28GB). This application provides better forecast accuracy for surface wind speed and precipitation than traditional weather prediction methods. Testing of the code has been performed on the Polaris (ANL), Summit (ORNL) and Crusher (ORNL) machines. The application scales well; 256 GPUs (V100) were used on Summit. The best performance was achieved on Polaris, which uses NVIDIA A100 GPUs. The Crusher machine uses AMD MI250 GPUs, and its performance results were similar to the runtimes achieved on Summit.
- DeepCam: A climate modelling application developed by Lawrence Berkeley National Laboratory; a modified version has been included in the MLPerf HPC benchmark suite (https://github.com/mlcommons/hpc/tree/main/deepcam). The application uses a convolutional network model, and the training dataset is generated using CAM5 simulations. The small training dataset is 650MB and contains 1537 samples. The shape of each input image sample is (768, 1152, 16), and there are three output labels: background, atmospheric river and tropical cyclone. The code has been profiled to determine the volume of data movement and the computational load. Using the application and multi-GPU system parameters, a prediction model was derived which indicates that this application is compute-bound. Testing has been performed on an NVIDIA DGX-2 system with A100 GPUs.
- OLMo: A transformer-based open-source large language model, widely used in academic research (https://huggingface.co/allenai/OLMo-7B). Models of various sizes are available, from 1 to 7 billion parameters. For training, the Dolma dataset, containing 2.45 trillion tokens, was used (https://huggingface.co/datasets/allenai/dolma). The OLMo 1-billion and 7-billion versions were tested on an NVIDIA DGX-2 system.
- Megatron: Training LLMs presents many challenges with respect to data distribution, orchestration of computation, and scalability. Megatron-LM, developed by NVIDIA, is an environment for the large-scale training of LLMs ranging from 1B to 1T parameters (https://github.com/NVIDIA/Megatron-LM). Megatron supports three types of parallelism: data, tensor (intra-layer) and pipeline (inter-layer). Data parallelism is used for small models which can fit into a GPU's memory: the model is replicated on the GPUs, and each GPU processes a different portion of the data; during backpropagation the gradients are aggregated on the master node and averaged across GPUs. With tensor parallelism, a single layer of the model is split across GPUs; this is a horizontal slicing of the network (intra-layer parallelism) in which the GPUs collaborate to compute the outputs of a single layer (a minimal sketch of this idea follows the list). With pipeline (inter-layer) parallelism, the layers of the network are grouped and assigned to different GPUs, and the data flows across the GPUs; this is a vertical slicing of the network. For benchmarking purposes, Megatron-LM was used to train GPT-2 models of different sizes (345M and 1.5B parameters). The application was tested on an NVIDIA DGX-2 using 16 GPUs, and all three types of parallelism were exercised.
- EquiformerV2: A GNN (Graph Neural Network) used for modelling atomic structures, developed as part of the Open Catalyst challenge (https://github.com/atomicarchitects/equiformer_v2). The model has 153 million parameters. Two input datasets, OC20 and OC22, were used for training, comprising 130 million data points representing 3D atomic coordinates (https://fair-chem.github.io/core/datasets/oc22.html). EquiformerV2 was ported to Polaris (ANL) and its scalability was tested on 256 A100 GPUs.
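To make the tensor (intra-layer) parallelism described under Megatron concrete, the sketch below splits a single linear layer column-wise across two GPUs in plain PyTorch (an illustration of the idea only, assuming two visible CUDA devices; Megatron-LM's real implementation uses NCCL collectives, fused kernels and purpose-built layers):

    import torch

    # Column-parallel linear layer: the weight matrix W (in_features x
    # out_features) is split column-wise, so each GPU holds half of the
    # columns and computes half of the layer's outputs.
    in_features, out_features, batch = 1024, 4096, 8
    w0 = torch.randn(in_features, out_features // 2, device="cuda:0")
    w1 = torch.randn(in_features, out_features // 2, device="cuda:1")
    x = torch.randn(batch, in_features, device="cuda:0")  # replicated input

    y0 = x @ w0                   # left half of the outputs, on cuda:0
    y1 = x.to("cuda:1") @ w1      # right half of the outputs, on cuda:1

    # Gather the shards to recover the full layer output (in a real system
    # this gather is a collective operation, not a .to() copy).
    y = torch.cat([y0, y1.to("cuda:0")], dim=1)
    assert y.shape == (batch, out_features)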



Dr Jeyan Thiyagalingam heads the Scientific Machine Learning (SciML) Research Group and the AI for Science initiatives at the Rutherford Appleton Laboratory, Science and Technology Facilities Council (STFC) in the UK. The SciML group is one of the premier research groups working on AI for Science, and is central to all of STFC's experimental facilities. Prior to joining STFC, he was a faculty member in the School of Electrical Engineering, Electronics and Computer Science at the University of Liverpool, and before that a James Martin Fellow at the University of Oxford. He has also worked in industry, including at MathWorks UK. His research interests and expertise are in representation learning, complex machine learning models, and signal processing. He is a Fellow of the British Computer Society, a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE), and a Turing Fellow at the Alan Turing Institute, the UK's hub for AI research.
Professor Jeyan Thiyagalingam, Head of AI for Science, Scientific Computing


Mark Wilkinson is the national director of the STFC DiRAC HPC Facility (www.dirac.ac.uk), which provides high performance computing resources for the theoretical astrophysics, particle physics, cosmology, and nuclear physics communities in the UK. He is a Professor of Astrophysics at the University of Leicester, specialising in the study of dark matter in galaxies using a combination of observations, theoretical models, and machine learning. He has published almost 100 peer-reviewed papers and has more than 15,000 citations.
Mark was the editor of the 2019 community-led white paper “UKRI National Supercomputing Roadmap 2019-30” and chaired the editorial board for the peer-reviewed “UKRI Science case for UK Supercomputing” which was published in 2020.
Professor Mark Wilkinson, DiRAC


Dr. Paul Calleja is the Director of Research Computing Services at the University of Cambridge and Director of the Cambridge Open Zettascale Lab. He obtained his Ph.D. in computational biophysics at the University of Bath. After securing a post-doctoral research position at Birkbeck, University of London, he moved into private industry, where he spearheaded the early commercialisation of High-Performance Computing cluster solutions in the UK.
Following six years in the commercial sector – during which time he led the market transition from proprietary SMP systems to commodity cluster-based solutions – Dr. Calleja returned to academia. At Imperial College London, Dr Calleja led the formation of a new HPC service, before moving in 2006 to the University of Cambridge to direct a major reorganisation of research computing services. This has resulted in university-wide HPC capabilities using a novel pay-per-use cloud computing model. The University of Cambridge is now home to the fastest academic supercomputer in the UK.
Dr. Calleja sits on numerous national and international HPC committees and advisory boards, as well as being a founding member of the UK HPC Special Interest Group.
Dr. Paul Calleja, University of Cambridge


Revd. Prof. Jeremy Yates was a Professorial Research Fellow in the Department of Computer Science at UCL, the Founder-Director of the STFC DiRAC Supercomputing Facility (2011-2017), the Deputy Director of the STFC IRIS facility (2017-2019), the joint lead of the UKRI ExCALIBUR Hardware and Enabling Software programme (2019-present), and the Director of the UK SKA Regional Centre (UKSRC) Project (2021 to date). These projects require the creation of powerful digital research infrastructures to deliver peer-reviewed, internationally competitive, community-generated science cases in the areas of high-energy particle physics, astrophysics and nuclear physics, using supercomputing and high-throughput computing design principles. His current project, UKSRC, is part of the international SKA Regional Centre Network and has to deliver a federated service to process and analyse the data generated by the two SKA telescopes, which by 2028 will amount to 2PB a day.
Revd. Prof. Yates is an expert in, and undertakes research on, computational astrophysics, HPC system design, photon transport simulation, probabilistic mechanics and machine learning, and radio astronomy and interferometry. His current interests include developing in situ uncertainty propagation methods for simulations using ML techniques, to allow simulations to be compared with measurements and thereby turn simulations into science.
Revd. Professor Jeremy Yates


Marion Samler is the Knowledge Exchange Co-ordinator for the BASE-II project. She is also the Business Development Manager for Scientific Computing (SC), and her primary focus is to build partnerships for SC, develop value propositions, and support the department's extensive bid activity as well as tech transfer opportunities. Marion's background is in business, finance and partnership development, with previous roles in project management, training and business development; she was also an Operations Director for a UK learning technology charity.
Marion Samler, Research and Innovation Development Lead, Scientific Computing


Archit Mantry is the Project Co-ordinator for the BASE-II Project. He oversees daily project activities, resource allocation, milestone management and progress monitoring. In his previous role, he successfully managed language translation projects, engaging stakeholders, updating tasks, and delivering timely results to clients. Archit has a background in Computer Engineering and Project Management.
Archit Mantry, Operations, Scientific Computing


Pam Slingsby joined UKRI STFC SCD in July 2023 and is the Administration Co-ordinator for the BASE-II Project. She has a diverse background, ranging from qualitative analysis of automatic weather station data at the Met Office, to payments facilitation and debit control in the telecoms industry, and latterly primary school business management. This breadth of experience enables her to support the project as an efficient, effective and accomplished administrator. She is a Fellow of the Institute of Administrative Management.
Pam Slingsby FInstAM, Operations, Scientific Computing