A Primer on AI Chips: The Brains Behind the Bots


Executive Summary

The rise of Machine Learning, Deep Learning, and Natural Language Processing has driven unprecedented demand for specialised AI chips. These systems require substantial computational resources and can be deployed either in cloud data centres for maximum processing power or at the network edge for reduced latency and enhanced privacy.

The AI chip ecosystem comprises three critical components: accelerators (including CPUs, GPUs, FPGAs, and ASICs), memory and storage systems, and networking infrastructure. Each component plays a vital role in handling AI workloads, with different architectures offering varying trade-offs between performance and efficiency. The market for these technologies is heavily concentrated among a few key players: NVIDIA, Intel, AMD, Google, and TSMC.

A particular concern is NVIDIA's dominance in the GPU market and its proprietary software ecosystem, which creates significant dependencies for organisations and nations seeking to build sovereign AI infrastructure. As AI becomes increasingly critical to techno-national strategies worldwide, policymakers must understand these technological dependencies and support the development of alternative hardware and software solutions to ensure a more diverse and resilient AI chip ecosystem.

1. Abbreviations

AI - Artificial Intelligence

ASIC - Application-Specific Integrated Circuit

CGI - Computer Generated Imagery

CPU - Central Processing Unit

CUDA - Compute Unified Device Architecture

CXL - Compute Express Link

DRAM - Dynamic Random-Access Memory

FPGA - Field-Programmable Gate Array

GDDR - Graphics Double Data Rate

GPU - Graphics Processing Unit

HBM - High Bandwidth Memory

HDD - Hard Disk Drive

HPC - High-Performance Computing

ISA - Instruction Set Architecture

LAN - Local Area Network

LLM - Large Language Model

ML - Machine Learning

NLP - Natural Language Processing

NVMe - Non-Volatile Memory Express

NPU - Neural Processing Unit

PCIe - Peripheral Component Interconnect Express

PIM - Processing-in-Memory

ROCm - Radeon Open Compute

SMRs - Small Modular Reactors

SoC - System-on-Chip

SSD - Solid State Drive

TPU - Tensor Processing Unit

UALink - Ultra Accelerator Link

UEC - Ultra Ethernet Consortium

UCIe - Universal Chiplet Interconnect Express

UPI - Ultra Path Interconnect

2. Background

The emergence of AI marks a significant milestone in the information age. As a General-Purpose Technology, AI has the potential to transform different sectors in different ways: autonomous driving in the automotive industry, fraud detection and risk assessment in finance, personalised marketing in retail, AI-driven diagnosis and personalised medicine in healthcare, AI-driven weapons and decision support systems in defence; the list is endless.

There is a pervasive interest in leveraging AI technologies for their economic, social, and strategic benefits. The AI hardware market was valued at over $50 billion in 2023 and is estimated to grow almost tenfold by 2030. As AI permeates various sectors, it brings massive computational needs that the hardware has to enable and sustain.

A big chunk of this computational need is being met using GPUs. NVIDIA, the world's largest GPU company, has emerged as the world leader in AI computing with its AI-centric GPUs and extensive software ecosystem. It has positioned GPUs as the default choice for companies, government organisations, universities, and any other entity that wants to deploy AI solutions. Case in point: about half of India's AI mission outlay of over ₹10,000 crore has been earmarked for procuring GPUs to build AI computational infrastructure.

Why is such a large portion of the budget earmarked to build AI computing capacity? Why did the Indian government choose GPUs? How do GPUs compare to other accelerators like CPUs, FPGAs and ASICs for AI workloads? Does the growing complexity of AI algorithms challenge the traditional reliance on GPUs? Are there scenarios where FPGAs and ASICs outperform GPUs in AI applications? What implications does the choice of hardware architecture have for cost-effectiveness, energy consumption, flexibility and scalability?

As AI technologies evolve, policymakers should have a clear and thorough understanding of the available AI hardware options and their suitability for different use cases. Informed decision-making is necessary to build effective, efficient, and future-proof AI computing infrastructure under national missions like INDIAai.

This discussion document serves as a primer to understand the key elements of AI Chips, and is divided into three broad sections. The first section explains the workloads involved in AI tasks in order to understand the computational requirements that the hardware has to fulfil. The second section provides a comprehensive overview of the key elements of AI computing hardware. These elements include AI accelerators (also called processing units), memory, storage, interconnects and networking systems. The section also distinguishes AI-specific hardware from other general-purpose computing hardware. The third section discusses the linkages between AI accelerators and software development ecosystems.

3. Understanding AI and its hardware requirements

Artificial Intelligence, as a diverse bundle of technologies, has existed for decades. As such, the underlying hardware that runs these technologies is similarly disparate and continuously evolving. For instance, the computer systems that ran the first image recognition algorithms operated differently from those running today's state-of-the-art facial recognition models.

AI hardware therefore encompasses a wide range of computing systems, but it has recently gained prominence in the public eye due to dramatic progress in fields such as machine learning and an exponential growth in digitised data. At the same time, the ability of algorithms to crunch massive amounts of data is directly attributable to the drastic increase in computing power seen over the past few decades. While other subdomains of AI are still used, whenever AI is mentioned today, chances are that it refers to Machine Learning (ML). Machine Learning and associated subdomains such as Natural Language Processing (NLP) and Deep Learning form the most significant chunk of the global AI market. The scope of this paper is restricted to computing hardware relevant to ML and associated fields.

Three main technological inputs come together to make these models work:

1. The algorithms that form the brains of the AI models,

2. The data that these algorithms learn from,

3. And finally, the hardware that enables the algorithms to learn and run.

Source: Authors' Visualisation

Understanding the interaction between algorithms and the data in machine learning models provides a useful background to realise the computational requirements that the hardware has to fulfil.

These interactions can be broadly divided into two stages: training and inference. Algorithms undergo training where they learn from existing data. Once sufficiently trained, they can be used for inference, that is, to make predictions and draw conclusions about new data.

3.1. Training

The foundation of an AI model is laid during the training phase, which creates a mathematical model that can process new data to make valid predictions and draw accurate conclusions. Training enables AI models to learn from data, through supervised or unsupervised approaches, and improve over time, essentially programming themselves.

There are various types of training, all of which generally proceed over multiple stages or iterations. In each iteration, the model learns from some of the data points in the dataset. The nature of the data varies depending on the model. For a model trained to interpret visual information, the datasets consist of images and videos. For an AI model trained to understand human language, called a Language Model (a Large or Small Language Model, i.e., LLM or SLM), the data consists of language in textual form.

The process of training an AI model, and the corresponding computational demands, are explained below using the example of an LLM.

Collecting the data points in the dataset: The dataset should contain enough data points to capture the nuances of human language. Sources include user-generated content on the internet, books, web pages, and more. The datasets for recent LLMs may consist of terabytes of text.

Pre-processing the datasets: The datasets have to be converted into a format that AI models can process. The data is cleaned of irrelevant content and converted from text into numbers, the format that the AI understands and interprets.
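To illustrate this text-to-numbers conversion, here is a minimal, hedged sketch of a toy word-level tokeniser in Python (the vocabulary is invented for this example; production LLMs use learned subword tokenisers such as Byte Pair Encoding):

```python
# Illustrative only: a toy word-level tokeniser with an invented vocabulary.
# Real LLM pipelines use learned subword tokenisers (e.g. BPE) built over huge corpora.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}

def tokenise(text: str) -> list[int]:
    """Convert raw text into the integer token IDs that an AI model consumes."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenise("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```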

Training and Testing: The model starts with an initial understanding of the language, which may be random gibberish. A partial sentence of a certain length is fed into the model, which uses this input to predict the next part of the sentence. Based on the accuracy of the prediction, the AI algorithm readjusts its understanding.

A single one of these iterations can require up to billions of mathematical calculations. One complete pass of the entire dataset through the model is called an epoch. Given that processing one input requires billions of calculations, a single epoch may require billions of billions of calculations. Training typically involves multiple epochs, as many as several thousand in some cases.
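As a schematic of this loop structure only, the sketch below uses a toy "model" (a table of next-word counts, not a neural network) and the token IDs from the tokeniser example above; it is illustrative and makes no claim about how real LLMs are implemented:

```python
# Illustrative only: a toy next-word "model" trained over multiple epochs.
# Real LLMs adjust billions of neural-network parameters in each iteration, but the
# loop structure -- one full pass over the data per epoch, predict, then adjust -- is the same idea.
from collections import defaultdict

corpus = [[0, 1, 2, 3, 0, 4], [0, 1, 2]]    # two tokenised training sentences
counts = defaultdict(lambda: defaultdict(int))

EPOCHS = 3
for epoch in range(EPOCHS):                 # one epoch = one complete pass over the dataset
    for sentence in corpus:                 # one iteration = one training example
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1          # adjust the model based on what it observes

def predict_next(token: int) -> int:
    """Predict the most frequently observed next token."""
    return max(counts[token], key=counts[token].get)

print(predict_next(0))  # 1, i.e. after "the" the toy model predicts "cat"
```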

These calculations are mostly matrix multiplications. Since one matrix multiplication rarely requires the result of another, they can be run independently and in parallel.
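To make this parallelism concrete, a minimal sketch using NumPy (the array shapes are arbitrary placeholders):

```python
# Illustrative only: the per-example matrix products in a training step do not depend
# on one another, so they can be computed independently; batching them into a single
# call is the pattern that GPUs exploit at massive scale.
import numpy as np

rng = np.random.default_rng(0)
batch = rng.standard_normal((32, 128))     # 32 independent input examples
weights = rng.standard_normal((128, 64))   # one layer's weight matrix

# One batched operation stands in for 32 separate, order-independent matrix products.
out = batch @ weights                      # shape (32, 64)

# Equivalent loop form: note that no iteration uses the result of another.
out_loop = np.stack([example @ weights for example in batch])
assert np.allclose(out, out_loop)
```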

According to OpenAI's estimates, training the GPT-3 model took over 300 billion trillion floating-point calculations. Running these operations on a single NVIDIA Tesla V100 GPU would take 355 years; given that GPT-3 was trained on 10,000 V100 GPUs, the total training time is estimated to have been around 34 days.

Therefore, training involves processing massive amounts of data, necessitating significant computational resources. While sustaining computing power at this scale, AI hardware's cost-effectiveness, performance, and power consumption are important considerations.

3.2. Inference

AI models are deployed in real-world environments after training. This stage is called inference, during which the models process new, real-world data to make valid predictions and draw accurate conclusions. While generally less computationally demanding than training, inference use cases have their own requirements, which may include latency, performance, memory, storage, energy efficiency, privacy, and scalability. The type of inference use case has a bearing on the type of hardware infrastructure required to run the AI models.

3.2.1 AI on the cloud

LLM chatbots like OpenAI's ChatGPT, for instance, are computationally very intensive, requiring significant processing power and memory. Such extensive utilisation of compute hardware also brings extensive energy demands and heat generation. Therefore, these AI systems are deployed in data centres that have the necessary high-performance hardware, along with dedicated cooling and power infrastructure, to service high-volume, sustained AI workloads.

3.2.1.1 AI run by Data Centres

Data centres have emerged as an integral part of AI-on-the-cloud infrastructure. They are large-scale facilities that host computing hardware at scale and can thus provide computational resources efficiently. They are increasingly used to train and run large AI models, and barring a few considerations such as software platform support, most customers running AI workloads do not have to worry about the minutiae of these data centres' computing hardware.

Optimised not just for AI workloads but also for other HPC tasks, data centres house dense clusters of specialised hardware. The image below shows a cluster of GPUs in a data centre.

A GPU Cluster at a Data Centre © CSIRO

Along with specialised processing chips, data centres also have to feature enormous amounts of high-performance memory and storage. All these components need to be connected via highly performant networking and interconnect solutions to enable rapid data transfers without delays. These networked clusters of compute infrastructure are so well coordinated that they are usually considered a single unit of high-performance compute. Consequently, as data centres composed of such compute clusters have become the principal choice available to customers running AI workloads, they have become the de facto unit of AI compute infrastructure.

Scalability is also a key requirement for these centres as units of compute. They have to be designed to be modular and easy to expand, and to be amenable to quick upgrades to newer generations of specialised hardware. The goal is data centres that can evolve as AI workloads change, without needing to be redesigned or rebuilt.

All of the operations in a data centre require large amounts of uninterrupted energy. The largest data centres can consume tens to hundreds of megawatts and therefore also require effective thermal management with advanced cooling solutions. Additionally, they need sufficient levels of redundancy built in to ensure reliability and minimise downtime. These high standards of operation demand significant hardware investment and innovation.

An AI-focused Data Centre © Free Malaysia

However, AI in the cloud, powered by data centres that are often located far from end-users, cannot cover all AI use cases. For instance, even the powerful computing resources of a data centre cannot always meet the ultra-low-latency requirements of ML workloads on a fitness tracker or a smartwatch, since user data must contend with the transmission and processing delays associated with cloud computing. These use cases and devices require the processing to happen in close proximity to the source of the data.

3.2.2 AI on the Edge

There are many use cases, such as autonomous driving or traffic management, where it would be impractical to centralise the computing power because the AI algorithm needs to operate with very low latency or to preserve data privacy. In such scenarios, computing resources are instead placed closer to the source of inputs: the users at the network's edge. These systems are referred to as Edge AI.

Edge AI can vastly increase the scope of AI applications in the real world, with accessible and innovative use cases across multiple sectors. These possibilities have emerged because computing hardware has become small, fast, efficient, and specialised enough to run AI workloads locally.

For instance, Edge AI systems can assess sensor data in industrial equipment to detect anomalies and potential issues early and minimise downtime. In these systems, the Edge AI hardware must continuously monitor and assess data in harsh conditions and balance edge and cloud communication.

In healthcare, remote monitoring through continuous health data analysis on Edge AI hardware, with secure processing and low power consumption, can improve patient outcomes. Edge AI enables user health data to be stored and processed locally rather than sent to the cloud, ensuring privacy and giving users greater control over their personal data.

In the case of autonomous vehicles, Edge AI, with low latency, high energy efficiency and bespoke processing capabilities for interpreting multiple sensor inputs, is necessary to process data from cameras, GPS and other sensors for real-time decision-making. Similar use cases exist in traffic management, agriculture, customer analytics, etc.

Even consumer laptops now ship with AI models deployed on-device, running on customised mobile processors. Each of these use cases has a bearing on the hardware chosen for the computations: low latency is essential to ensure a real-time, seamless experience, while high processing power and memory efficiency are necessary to handle high-traffic applications.

4. Understanding the hardware that makes AI computation possible

The interaction of data and algorithms that results in functional AI models is powered by the underlying hardware architectures, enabling the computational prowess required for AI tasks ranging from facial recognition in smartphone photography to climate modelling on supercomputers.

This document focuses on three key categories of AI hardware components: processing units or accelerators, memory and storage systems, and networking and interconnects infrastructure.

4.1. Processors/Accelerators: The 'engines'

The processors, or accelerators, are the "engines" of AI systems. They are responsible for crunching the complex mathematical and algorithmic operations over the course of both the training and inference stages. There are broadly four types of processing units.

4.1.1. Central Processing Units (CPUs)

CPUs are general-purpose accelerators that can handle a wide range of tasks, including AI workloads. These are at the heart of almost all consumer computing electronics like PCs, smartphones, and laptops.

While they are flexible and easy to program, they may not provide optimal performance for AI applications compared to more specialised accelerators. To remedy this, newer CPU Systems-on-Chip (SoCs) dedicate a section of the silicon die to a specialised architecture meant solely for running AI workloads locally. Often referred to as Neural Processing Units (NPUs), these blocks are intended to address the traditional weakness of CPUs at training and inference tasks.

American companies Intel and AMD are the major players in the CPU market. While Intel maintains a significant market share and long-standing relationships with OEMs, AMD's processors have wrested market share away in recent years with gains in performance and power/thermal efficiency. Both AMD and Intel maintain a quasi-duopoly over the x86 Instruction Set Architecture (ISA) that forms the foundation of their chip designs. NVIDIA, known for its GPUs, is a recent entrant into the CPU space with its ARM-based "Grace" processor.

In the mobile devices market, ARM-based accelerators dominate. UK-based ARM creates design blueprints for processing units and licenses them to other companies like Apple, Samsung and Qualcomm, which modify or incorporate the licensed ARM designs, make the necessary customisations, and create their own CPUs within Systems-on-Chip (SoCs). These customisations include adding unique features like NPUs.

Taiwan's TSMC manufactures most of these cutting-edge accelerators. Samsung in South Korea also manufactures advanced accelerators, but at lower volumes. Meanwhile, Intel, which manufactures its own chips in the US and Israel, has struggled to keep pace with TSMC's manufacturing capabilities. The US is therefore also trying to catch up and build the capability to fabricate the most advanced processors.

An Intel CPU © Intel

4.1.2. Graphics Processing Units (GPUs)

GPUs have become the dominant processing units for AI, particularly for training highly complex AI models. The term was originally coined for accelerators meant for rendering graphics in video games and other Computer-Generated Imagery (CGI); GPUs are designed to perform parallel computations on large datasets, making them well suited to the matrix operations of AI algorithms.

GPUs have been instrumental in the rapid advancement of AI capabilities in recent years, partly due to their established use in scientific computing tasks that leverage parallel processing, as well as the increasing ease of programming them. Their use is prevalent in data centres specialised to run AI workloads.

NVIDIA dominates the GPU market, having captured nearly all of the market share, with AMD a distant second. It has also pivoted sharply towards AI and data-centre GPUs, away from its traditional gaming roots. Intel also makes GPUs of its own but remains a minor player.

GPU design is heavily concentrated in the US where NVIDIA, AMD and Intel are headquartered. NVIDIA and AMD are fabless companies that design and sell their own chips but do not manufacture them. The manufacturing is outsourced to East Asia where it is concentrated, particularly at TSMC in Taiwan. This combination of factors has prompted investments in the US, Europe, China, and India to explore production capabilities and enhance supply chain resilience.

A GPU © Nvidia

4.1.3. Application-Specific Integrated Circuits (ASICs)

Being the most specialised of the processing units, ASICs are chips designed from the outset for a specific set of tasks, such as AI inference, often within tighter temperature, power, and space constraints than CPUs and GPUs.

As such, they provide the highest performance and energy efficiency for their target workloads, but as a trade-off they lack the flexibility of other processing units. Examples of AI ASICs include Google's Tensor Processing Units (TPUs) and Cerebras' Wafer-Scale Engine. TPUs, for instance, are designed at the silicon level for AI workloads that take advantage of Google's TensorFlow framework. Since the specific range of operations enabled by this framework is known beforehand, the chip design can be optimised for them alone. Because of this, TPUs can deliver high performance and energy efficiency for both AI inference and training tasks in Google's data centres, where vast numbers of them are deployed in compute clusters to train and run AI models. Because of their custom nature, ASICs can prove expensive.

The market for ASICs is less concentrated than that for GPUs and CPUs. Many big tech companies as well as startups are building ASICs for AI workloads, with demand similarly disaggregated in terms of volumes shipped. Given the large size of the market and the scope for specialisation, companies try to find their own niche. Google and Intel are the notable big players, alongside emerging ones like Graphcore, Cerebras Systems, Groq, Tenstorrent, Mythic and Blaize. However, as with all advanced semiconductor manufacturing, the manufacturing of ASICs remains significantly concentrated in East Asia.

A TPU © Google

4.1.4. Field-Programmable Gate Arrays (FPGAs)

FPGAs are reconfigurable integrated circuits that can be programmed to perform specific tasks, including AI workloads. Their versatility lies in being able to switch between different types of workloads after manufacturing, treading a middle ground between the flexibility of CPUs and the performance of ASICs. This makes them attractive for certain AI applications, especially in scenarios where algorithms evolve rapidly, without the costs associated with ASICs.

It also makes FPGAs an essential input in AI hardware and software R&D, allowing for experimentation and prototyping by researchers and students. FPGAs are predominantly used in areas like defence electronics, networking, and space research and exploration, where adaptability of functions is an important factor.

Like the CPU market, the FPGA market is dominated by Intel and AMD. While Intel fabricates some FPGAs in-house, TSMC again manufactures a major share of FPGAs.

We can evaluate different types of AI-specific accelerators across two key dimensions: performance and efficiency.

"Performance" encompasses considerations that enable an accelerator to quickly process high volumes of data and complex AI workloads. This captures multiple metrics: raw processing power to crunch precise mathematical operations; high memory bandwidth and capacity to feed data to the processing unit; low latency to provide fast response times to end-users; scalability for tackling massive datasets and models as per use cases; and finally, the ease of programmability.

"Efficiency" encapsulates the cost and sustainability aspects of operating AI systems built using different accelerators. This covers metrics such as the energy consumption of powering and cooling the units, the upfront purchase costs and long-term operating expenses, and also the overall environmental footprint.

Source: Authors' Visualisation: A rule-of-thumb comparison of some popular accelerators

Source: Author's Recreation

4.2. Memory and Storage Systems: The 'fuel lines' and 'fuel tanks'

If processing units are visualised as the "engines" of AI systems, memory and storage systems are the veritable fuel lines and fuel tanks that keep them running. AI models need to be fed extremely large volumes of data as they are trained, and that data must be stored somewhere. Further, during inference, models need to access input data and return output data rapidly, consistently, and reliably. This storage, access and transfer of data is made possible by different types of memory and storage systems.

4.2.1. Memory

4.2.1.1. Random-Access Memory (Dynamic RAM)

Dynamic RAM (DRAM) is the most common type of main memory used in AI systems. It offers relatively high capacity and bandwidth, but compared to other types of memory it can be a bottleneck for data-intensive AI workloads. It is typically used by processing units like CPUs and is physically situated away from them. A combination of CPUs and DRAM is the most common configuration found in consumer computing devices like smartphones and PCs.

4.2.1.2. High Bandwidth Memory (HBM)

HBM is a specialised type of memory that provides much higher bandwidth than traditional DRAM and is increasingly used in high-performance AI hardware.

HBM takes advantage of advanced packaging technologies to stack memory chips vertically and place them within the same package as the logic processing unit, much closer than conventional DRAM. HBM is therefore better suited to latency-sensitive tasks, since electrical signals travel a shorter physical distance between processing and memory. This significantly reduces the time and energy required to move data, enabling more complex AI models to run efficiently. HBM has been crucial in advancing areas like real-time video analysis and scientific simulations.

4.2.1.3. Graphics Double Data Rate (GDDR)

GDDR is a type of DRAM that offers much higher bandwidth and lower latency than standard DRAM. GDDR can be used in GPUs for AI workloads when cost is a primary concern or when the workloads and datasets are relatively small.

In contrast to HBM, GDDR memory modules are situated on the circuit board rather than within the GPU package. GDDR is therefore generally less efficient than HBM in terms of power consumption and compares less favourably on latency and bandwidth. On the flip side, GDDR offers lower cost and wider availability, and suits workloads with lower memory requirements.

Given the high barriers to entry, the global market for memory systems is oligopolistic: Samsung, SK Hynix and Micron Technology hold almost all of the market share. China's YMTC has been hampered in its efforts to develop HBM production facilities by US sanctions. Samsung and SK Hynix are South Korean companies, while Micron is American. Owing to the highly commoditised nature of memory chips, all the major players in the DRAM market are Integrated Device Manufacturers and fabricate their own memory chips.

4.2.2. Storage

While memory systems focus on rapid data access for active computations, data storage is crucial for maintaining vast amounts of data that AI systems need for training and inference. AI datasets can easily reach into the petabytes, requiring massive storage capacity and fast access speeds. The viability of a data storage architecture is dependent on a variety of factors such as scalability, availability, security, performance, and resiliency. Storage solutions can therefore be configured to favour one or some of these factors based on system requirements. Key components include:

4.2.2.1. Solid State Drives (SSDs)

These offer faster read and write speeds compared to traditional hard disk drives, making them valuable for AI workloads that require frequent data access. NVMe (Non-Volatile Memory Express) SSDs, in particular, provide high-speed storage and retrieval, low latency, and high-throughput. NVMe SSDs are a popular choice in data centres to host the datasets required to run AI models.

In addition to Samsung, SK Hynix and Micron, Western Digital, Seagate and Kioxia (Japan) are some notable players in SSD manufacturing. The production is mainly concentrated in South Korea, Japan, China and the US.

4.2.2.2. Hard Disk Drives (HDDs)

While slower than SSDs, HDDs offer larger capacities at lower costs, making them suitable for storing vast datasets used in AI training. They are often used in tiered storage systems, where frequently accessed data is stored on faster SSDs while less frequently used data resides on HDDs.

The HDD industry has consolidated into three major players, Seagate, Western Digital and Toshiba, which together hold almost all of the market share.

4.2.2.3. Tape Storage

For archival purposes and extremely large datasets that do not require quick retrieval, tape storage provides a cost-effective solution. While access times are slow, tape storage can be useful for storing historical data or backups of AI models and datasets.

The tape storage market is characterised by a limited number of suppliers of key components, with most production concentrated in Japan and the US among companies like IBM, Quantum and Fujifilm.

Memory and storage solutions may not directly figure in investment decisions for AI infrastructure; due to tight integration, the choice of processing unit and use case will usually also determine the choice of memory and storage solutions. However, as mentioned, memory chips are highly commoditised and subject to intense geopolitical and geoeconomic pressures, owing to the nature of global value chains as well as a steadily escalating US-China trade competition.

Source: Authors' Visualisation

4.3. Interconnects and Networking Capabilities: The 'highways'

Interconnects serve as the metaphorical highways of AI hardware, enabling data transfer between processing units, memory, and storage. Together with networking capabilities, they are essential for efficient communication between AI hardware components, both within a single system and across multiple data centres. Broadly, three kinds of interconnect and networking technologies are leveraged across the AI hardware technology stack:

4.3.1. On-Chip Interconnects

On-chip interconnects facilitate communication between different parts within a system-on-chip (SoC). An SoC contains various aforementioned components like a CPU, a GPU, memory and storage.

As Moore's law runs up against the limits of physics, chip designers have pivoted towards splitting chips into smaller dies dedicated to separate tasks, called chiplets, and integrating them on a common foundational structure. Chiplet-based designs offer better performance, improved thermal and power characteristics, and easier manufacturing. On-chip interconnects enable fast and efficient communication between the chiplets.

As of now, only a few firms, such as TSMC (Taiwan), Samsung (South Korea), and Intel (US), have the ability to develop advanced packaging technologies. The fabrication stage of the semiconductor global value chain (GVC) is already dominated by these players, and this dominance is further amplified in AI chips.

4.3.2. Chip-to-Chip Interconnects

Chip-to-chip interconnects enable high-speed communication channels between multiple chips within the same system. Technologies like NVIDIA's NVLink or Intel's Ultra Path Interconnect (UPI) enable high-bandwidth, low-latency connections between multiple NVIDIA GPUs or Intel CPUs, respectively. This is what allows for the creation of powerful compute clusters of processing units, and therefore, for scaling up computations and performance to tackle complex AI workloads.

However, owing to the degree of their integration with the architecture of the processing units themselves, these interconnect technologies are typically proprietary. For instance, AMD's GPU offerings are not compatible with NVLink, as the latter requires specific hardware and software support from NVIDIA. Therefore, the choice of processing unit also determines the nature and capabilities of downstream technologies essential to the functioning of AI compute infrastructure.

The capabilities of processing units, motherboards, and storage solutions are often tied to their support for the newest iteration of the PCIe standard. The total number of PCIe lanes supported by an AI system directly affects its scalability, as more GPUs and other devices can be connected to it.

Therefore, the choice of platforms and other peripherals directly correlates with the choice of processing units or memory solutions. Higher-end platforms, which can support a greater number of interconnects, cost more and may be in higher demand amidst potential supply-chain constraints. The question of vendor and ecosystem lock-in also becomes pertinent, since the ability to mix and match similar processing units sourced from different vendors remains limited.

4.3.3. Node-to-Node Interconnects

These refer to communication channels between different "nodes" of a compute cluster. Compute clusters range from simple arrangements of home PCs connected via LAN to more complex ones such as server clusters in data centres. Node-to-node interconnects enable this communication, providing high-speed data transfers and low-latency responses.

The most common technologies used here are Ethernet and InfiniBand. The former is an open, 50-year-old connectivity technology used both for providing broadband internet to homes and for communications between data centres and enterprises. The latter was developed in the 1990s as a more performant replacement for Ethernet and initially found success only in High-Performance Computing (HPC) environments such as supercomputers. It has since become a widely deployed interconnect technology in HPC data centres and cloud computing.

Estimates suggest that upwards of 90% of scalable AI systems use InfiniBand-based networking. InfiniBand's main strength over existing Ethernet solutions is its relatively high data integrity during the transfer of data between nodes; a lack of data integrity can slow down AI training workloads and therefore has a direct impact on costs and efficiency in the AI value chain. On the other hand, Ethernet has a slightly higher bandwidth ceiling and relatively lower implementation costs.

Since InfiniBand is an open industry standard interconnect specification, other firms can still produce InfiniBand solutions to enable high-bandwidth HPC networking for their AI system offerings. However, NVIDIA's partnerships with leading server vendors and data centre operators, its continued innovation of the standard, and its dominant share of the upstream GPU market have ensured that its InfiniBand-based products command the lion's share of the downstream networking solutions market as well. The openness of the standard theoretically makes customers of InfiniBand technology less susceptible to vendor lock-in; in practice, however, the combination of these factors creates network effects that make it difficult for competitors to unseat NVIDIA's market dominance.

Investments in advanced interconnect technologies can be as important as investments in processing units themselves for building a presence in the AI global value chain (AI GVC). Standards like CXL, UALink, UCIe, and UEC are expected to play a significant role in the future of AI hardware, providing a standardised, interoperable foundation for high-performance, multi-vendor systems. A long-term policy strategy to incentivise homegrown hyperscalers to add to, and implement open standards like UEC can lower entry barriers for small AI data centre players. The presence of a large number of such networking-solution providers can potentially exert a countervailing pressure on NVIDIA’s market concentration in this downstream market.

Source: Authors' Visualisation

5. Understanding the software that supports AI hardware

While the interaction of data and algorithms that results in functional AI models is powered by the underlying hardware, the hardware itself depends on certain software components. Much like how consumer PCs and smartphones depend on their operating systems, the performance of AI processing units, also known as AI accelerators, can only be realised through the software ecosystems that support their deployment. Software frameworks, libraries, and programming languages harness the processing capabilities of accelerators and simplify the development process for the models and applications that run on them. As such, whether an AI accelerator achieves widespread success and adoption in the industry depends heavily on the maturity and ease of use of its compatible software ecosystem.

5.1. The AI Software Ecosystem

The AI software ecosystem broadly consists of: AI frameworks, programming languages, and programming platforms. This paper focuses on the software ecosystem relevant in the downstream stages of the AI value chain, i.e., models, and applications. While outside the scope of this paper, various software and data analysis tools also exist for processing data before it is used for training and inference.

The term AI Frameworks broadly refers to the pre-made tools and libraries that developers can use to create, train, and test AI models. Frameworks relieve developers of the need to be minutely aware of the complexities of managing the hardware's low-level operations (like memory management) and are usually hardware-agnostic, meaning they can run on CPUs and GPUs as well as other specialised accelerators. That said, many frameworks have optimisations for specific chip architectures. Prominent examples of AI frameworks include TensorFlow, PyTorch, and MXNet.
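As a minimal illustration of this hardware abstraction, the hedged PyTorch sketch below runs the same code on a CPU or, if one is available, an NVIDIA GPU (it assumes PyTorch is installed; the layer sizes are arbitrary):

```python
# Illustrative only: the same framework-level code runs on a CPU or a GPU;
# the framework handles the low-level hardware details behind the scenes.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(128, 64).to(device)    # a single dense layer
inputs = torch.randn(32, 128, device=device)   # a batch of 32 examples
outputs = model(inputs)                        # identical code on either device

print(outputs.shape, device)  # torch.Size([32, 64]) and the device actually used
```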

Programming languages used in AI development, such as Python and Julia, serve as the interface between developers and AI frameworks. Python, in particular, has become the de facto standard for AI developers since it is simple and easy to learn and has a mature and extensive ecosystem of libraries useful for scientific computing.

As mentioned earlier, high-level languages and frameworks aim to be hardware-agnostic, but developers often rely on lower-level, accelerator-specific features to achieve optimal performance. This is where a custom software development platform can come into play.

Programming platforms like NVIDIA's proprietary CUDA (Compute Unified Device Architecture) are a prime example. CUDA encapsulates a suite of software tools, libraries, and APIs specifically designed for NVIDIA GPUs. It provides a familiar programming interface to developers using common languages like C, C++, and Fortran, and allows them to write code that directly accesses the parallel computing capabilities of the GPU to greatly speed up computing tasks.
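For a flavour of this programming model, the hedged sketch below uses Numba's Python bindings to CUDA purely as an illustration (CUDA itself is most commonly written in C or C++; the sketch assumes an NVIDIA GPU, a CUDA toolkit, and the numba package are available):

```python
# Illustrative only: a CUDA-style kernel in which each GPU thread handles one
# element of the arrays, expressed through Numba's CUDA bindings for Python.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # this thread's global index across the whole grid
    if i < out.size:              # guard threads that fall past the end of the array
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = 2 * np.ones(n, dtype=np.float32)

d_a = cuda.to_device(a)           # copy inputs from host (CPU) memory to the GPU
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](d_a, d_b, d_out)   # launch the kernel

print(d_out.copy_to_host()[:3])   # [3. 3. 3.]
```

The point of the example is the division of labour: the developer writes near-ordinary code, while the platform maps it onto thousands of parallel GPU threads.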

5.2. CUDA and its absent competition

CUDA was developed to address the challenges in programming GPUs for general-purpose computing tasks. GPUs could potentially accelerate heavily parallelised workloads (graphics rendering was just such a task), but before CUDA, programming for them required low-level coding skills and a deep understanding of the underlying chip architecture.

NVIDIA tackled this problem in two ways. First, it introduced a GPU architecture composed of smaller programmable units, generally termed "shader units" in the industry. Second, it created the CUDA software development platform, which specifically allowed coders to write programs for these units (now referred to as "CUDA cores") on its GPUs.

The CUDA platform was designed to attract developers by offering the massive parallel computing power of GPUs with very little in the way of learning barriers, highlighting its similarities with other common programming languages. In a nutshell, CUDA as a software platform is inextricably integrated with the silicon-level hardware architecture.

This closed CUDA-GPU integration means that potential competitors cannot leverage the CUDA platform, as NVIDIA's hardware architecture IP remains proprietary. CUDA itself is free to use, and NVIDIA has invested in optimising sub-platforms of CUDA for specific use cases in industry and research, such as robotics, machine learning, and data centres. The commonality afforded by the platform ensured that applications across a wide range of domains would be compatible with all NVIDIA GPUs. NVIDIA invested heavily in training courses and outreach in this regard (and continues to do so), ensuring that both academia and industry adopted its GPUs for their needs.

The CUDA ecosystem has therefore created two-sided network effects stemming from both developers (supply) and industry (demand) utilising the same GPUs and software platform.

CUDA's exclusivity has been a key factor in NVIDIA's dominance of the AI hardware market. The closed integration of the software development ecosystem with the hardware has enabled NVIDIA to charge supra-competitive prices for its GPUs across both the gaming and enterprise sectors.

Source: Authors' Visualisation (Data from Visualcapitalist)

5.3 CUDA alternatives

Several alternative software ecosystems to CUDA exist; however, these have struggled to match CUDA's maturity and performance, owing to NVIDIA's first-mover advantage and the network effects created by its large user base. The most prominent competitor to CUDA is AMD's Radeon Open Compute (ROCm) platform.

ROCm is a platform designed for use with AMD's GPUs, providing a suite of software tools and libraries to developers, similar to CUDA. While relatively new and lacking in overall support, ROCm has two key benefits: first, it includes an abstraction layer, HIP (Heterogeneous-Compute Interface for Portability), that allows developers to convert CUDA applications to run on AMD GPUs with relatively little effort; second, its open-source nature potentially allows for long-term developer buy-in and crowdsourced additions to its range of libraries. These two factors offer a major value proposition for developers and organisations concerned about vendor lock-in.

CUDA has undoubtedly accelerated the adoption and innovation of AI. However, from a policy perspective, it is a case study that highlights the unsavoury implications of proprietary software ecosystems in the AI hardware market. Besides market concentration risks, vendor lock-in, and other competition barriers, nation-states seeking to build sovereign AI infrastructure using GPUs will have to contend with the strategic dependency associated with being reliant on a single provider like NVIDIA.

6. Conclusion

This primer hopes to serve as a foundational resource for understanding the key facets and components of AI compute hardware. Policymakers must gain a comprehensive understanding of this hardware that powers transformative AI technologies.

The long-term implications of their hardware choices are magnified when we consider that computing infrastructure under national missions like INDIAai is expected to be not only effective, versatile, and efficient, but also future-proof. This document demonstrates how factors like performance, efficiency, cost, and the availability of a robust and developer-friendly software ecosystem play crucial roles in determining the suitability of different hardware options for various AI applications.

GPUs remain the popular choice for AI computing. Overreliance on a single GPU vendor or proprietary technologies can lead to strategic dependencies for nation-states, high switching costs and vendor lock-in, as well as a reduced scope for competition and innovation. It is useful to consider alternatives like ASICs and FPGAs while taking note of their technical characteristics, trade-offs, and market dynamics.

Given the importance of this hardware, long-term national strategies for building compute infrastructure should encompass exploring and supporting the development of alternative hardware and software solutions to mitigate the aforementioned risks. Future research documents will identify specific policy levers for AI compute governance and pathways through which nation-states can develop and maintain strategic footholds in the compute hardware global value chain.

7. Glossary

A

● AI (Artificial Intelligence): A broad term encompassing technologies that enable computers to mimic human intelligence, such as learning, problem-solving, and decision-making. This primer primarily focuses on AI powered by Machine Learning.

● AI Accelerator: See Processing Unit.

● Algorithm: A set of instructions or rules that a computer follows to solve a problem or complete a task. In the context of AI, algorithms form the "brains" of AI models, learning from data to make predictions.

● Application-Specific Integrated Circuit (ASIC): A type of processing unit custom-designed for a specific task, such as AI inference. ASICs offer the highest performance and energy efficiency for their target workloads but lack flexibility. Examples: Google's Tensor Processing Units (TPUs), Cerebras' Wafer-Scale Engine.

● ARM: A UK-based company that designs processing unit blueprints and licenses them to other companies like Apple, Samsung, and Qualcomm. ARM processors dominate the mobile device market.

C

● Central Processing Unit (CPU): A general-purpose processor that can handle a wide range of tasks, including AI workloads. CPUs are found in most consumer electronics like PCs, smartphones, and laptops. While flexible, CPUs may not be as performant as specialised processors for AI. Key manufacturers: Intel, AMD.

● Chiplet: A modular approach to chip design where separate parts of a chip are dedicated to specific tasks and integrated onto the same foundational structure. This allows for better performance, improved thermal and power efficiency, and easier manufacturing.

● Cloud AI: AI systems deployed in data centres, providing computational resources remotely. Suitable for computationally-intensive tasks that require significant processing power and memory.

● Compute Cluster: A group of interconnected computers (nodes) that work together to perform complex computations. Examples range from home PCs connected via LAN to server clusters in data centres.

● Compute Unified Device Architecture (CUDA): NVIDIA's proprietary software platform specifically designed for programming NVIDIA GPUs. CUDA provides a familiar programming interface and allows developers to access the parallel computing capabilities of GPUs. It has played a significant role in NVIDIA's dominance in the AI hardware market.

● CXL (Compute Express Link): An open industry standard for high-speed CPU-to-device and CPU-to-memory interconnects, expected to play a significant role in the future of AI hardware.

D

● Data Centre: A large-scale facility that houses and efficiently provides computational resources, often used to train and run large AI models. Data Centres contain clusters of specialised hardware, including processing units, memory, storage, and networking solutions.

● Deep Learning: A subfield of Machine Learning that uses artificial neural networks with multiple layers to analyse and learn from data.

● Dynamic Random-Access Memory (DRAM): The most common type of main memory used in AI systems. It offers relatively high capacity and bandwidth but can be a bottleneck for data-intensive AI workloads.

E

● Edge AI: AI systems where computational resources are located closer to the source of data, such as on edge devices. This enables low-latency operation and data privacy. Examples: autonomous vehicles, fitness trackers, smartwatches.

● Epoch: One complete pass of an entire dataset through an AI model during training.

● Ethernet: An open standard networking technology used for communication between computers and other devices. It's widely used in both home and enterprise networks, including data centres. While not as performant as InfiniBand for AI training, it offers higher bandwidth and lower implementation costs.

F

● Field-Programmable Gate Array (FPGA): A reconfigurable integrated circuit that can be programmed to perform specific tasks, including AI workloads. FPGAs offer a balance between CPU flexibility and ASIC performance. They are commonly used in defence electronics, networking, and space research. Key manufacturers: Intel, AMD.

● Floating Point Calculation: A mathematical operation involving decimal numbers, performed by computers. AI training often requires billions of trillions of floating-point calculations.

G

● General-Purpose Technology: A technology with the potential to have a transformative impact across various sectors and industries. AI is considered a General-Purpose Technology.

● Global Value Chain (GVC): The interconnected network of activities involved in the design, production, distribution, and use of a product or service across different geographical locations. The AI hardware market has a complex GVC, with concentration in certain regions and companies.

● Graphics Double Data Rate (GDDR): A type of DRAM that offers higher bandwidth and lower latency than standard DRAM. Often used in GPUs for AI workloads when cost is a concern or datasets are relatively small. It's generally less efficient than HBM.

● Graphics Processing Unit (GPU): A type of processing unit originally designed for graphics rendering but now widely used for AI, particularly for training complex models. GPUs excel at parallel processing, making them well-suited for AI's matrix operations. NVIDIA dominates the GPU market.

● GPT-3: A large language model (LLM) developed by OpenAI, demonstrating the massive computational requirements of AI training. Training GPT-3 involved over 300 billion trillion floating-point calculations and took an estimated 34 days using 10,000 NVIDIA V100 GPUs.

H

● Hard Disk Drive (HDD): A storage device that uses magnetic disks to store data. HDDs offer large storage capacities at lower costs compared to SSDs but are slower.

● Heterogeneous-Compute Interface for Portability (HIP): An abstraction layer in AMD's ROCm platform that allows developers to easily convert CUDA applications to run on AMD GPUs.

● High Bandwidth Memory (HBM): A specialised type of memory that provides higher bandwidth and lower latency than traditional DRAM. HBM stacks memory chips vertically and places them closer to the processor, improving data transfer speed and efficiency.

● High-Performance Computing (HPC): The use of supercomputers and parallel processing techniques to solve complex computational problems. AI training often requires HPC infrastructure.

I

● Inference: The stage where a trained AI model processes new data to make predictions or draw conclusions. Less computationally demanding than training but has different requirements, such as latency, performance, and efficiency.

● InfiniBand: A high-speed networking technology commonly used in HPC and data centre environments. It offers high bandwidth and low latency, making it suitable for data-intensive AI workloads. InfiniBand is known for its high data integrity, which is crucial for AI training.

● Instruction Set Architecture (ISA): The fundamental set of instructions that a processor can understand and execute. Intel and AMD processors are based on the x86 ISA.

● Interconnect: A communication channel that enables data transfer between different hardware components, such as processing units, memory, and storage. Types: on-chip, chip-to-chip, node-to-node.

L

● Latency: The time delay between a request for data and the data's arrival. Low latency is critical for real-time AI applications.

● Large Language Model (LLM): A type of AI model trained on massive text datasets to understand and generate human-like language. Examples: OpenAI's ChatGPT.

M

● Machine Learning (ML): A type of AI that enables computers to learn from data without explicit programming. ML algorithms identify patterns and make predictions based on data. This primer primarily focuses on AI powered by ML.

● Memory: A temporary storage space where a computer stores data that it is actively using. Different types of memory are used in AI systems, including DRAM, HBM, and GDDR.

● Micron Technology: A leading manufacturer of memory and storage solutions, including DRAM, HBM, and SSDs.

N

● Natural Language Processing (NLP): A subfield of AI focused on enabling computers to understand, interpret, and generate human language.

● Networking: The interconnection of computers and other devices to enable communication and data exchange. Networking technologies like Ethernet and InfiniBand are essential for AI hardware, especially in data centre environments.

● Neural Processing Unit (NPU): A specialised processing unit integrated into some CPUs, specifically designed for AI workloads. NPUs aim to improve the performance of CPUs for AI tasks.

● Node: A single computer or server within a compute cluster. Nodes are interconnected to enable distributed computing for AI workloads.

● Non-Volatile Memory Express (NVMe): A communication protocol for SSDs that offers high-speed storage and retrieval, low latency, and high-throughput. NVMe SSDs are commonly used in data centres for AI workloads.

● NVLink: NVIDIA's proprietary chip-to-chip interconnect technology that enables high-bandwidth, low-latency connections between multiple NVIDIA GPUs.

O

● On-Chip Interconnect: A type of interconnect that facilitates communication between different components within a System-on-Chip (SoC). Chiplets and advanced packaging technologies are examples of on-chip interconnects.

● OpenAI: An AI research and deployment company known for developing large language models, including GPT-3.

P

● Packaging Technology: Techniques used to enclose and connect semiconductor chips to other components on a printed circuit board. Advanced packaging technologies, such as chiplets and 3D packaging, are critical for improving the performance and efficiency of AI hardware.

● Parallel Processing: The ability to execute multiple computations simultaneously, significantly speeding up complex tasks. GPUs excel at parallel processing, making them suitable for AI workloads.

● Peripheral Component Interconnect Express (PCIe): A widely used expansion standard that enables communication between various hardware components, including processing units, storage, and networking cards.

● Performance: A measure of how quickly and efficiently a processing unit can handle AI workloads. Factors considered include processing power, memory bandwidth, latency, scalability, and programmability.

● Processing-in-Memory (PIM): An emerging memory technology that integrates processing capabilities into memory itself, reducing data movement and improving efficiency.

● Processing Unit: The "engine" of an AI system responsible for executing the mathematical and algorithmic operations involved in training and inference. Types: CPUs, GPUs, ASICs, FPGAs. Also known as an AI accelerator.

● Programmability: The ease with which a processing unit can be programmed to perform specific tasks. A developer-friendly programming environment is crucial for AI hardware adoption.

● Programming Language: A formal language used to write instructions for computers to execute. Python is a popular programming language for AI development due to its simplicity and extensive libraries.

● Programming Platform: A set of software tools, libraries, and APIs that provide a framework for developing and deploying AI applications. Examples: NVIDIA's CUDA, AMD's ROCm.

● Proprietary Technology: Technology that is owned and controlled by a specific company, limiting access and competition. NVIDIA's CUDA is an example of proprietary technology that has contributed to its market dominance but also raised concerns about vendor lock-in.

● PyTorch: An open-source AI framework known for its flexibility and research-oriented features.

R

● Radeon Open Compute (ROCm): AMD's open-source software platform for programming AMD GPUs, designed to compete with NVIDIA's CUDA. ROCm offers an abstraction layer (HIP) for easier porting of CUDA applications and benefits from community contributions.

● Random-Access Memory (RAM): A type of computer memory that allows data to be accessed in any order, regardless of its physical location within the memory.

S

● Scalability: The ability of an AI system to handle increasing workloads or larger datasets by adding more resources. Scalability is crucial for data centres and cloud AI platforms.

● Samsung: A South Korean multinational conglomerate that is a leading manufacturer of memory chips, SSDs, and advanced packaging technologies.

● SK Hynix: A South Korean memory semiconductor manufacturer, specialising in DRAM, NAND flash, and other memory products.

● Small Modular Reactor (SMR): A type of nuclear reactor that is smaller and more scalable than traditional reactors. Microsoft is exploring the use of SMRs to power its data centres.

● Software Ecosystem: The collection of software components, including frameworks, programming languages, and platforms, that support the development and deployment of AI applications. A robust and developer-friendly software ecosystem is crucial for the success of AI hardware.

● Solid State Drive (SSD): A type of storage device that uses flash memory to store data. SSDs offer significantly faster read and write speeds compared to HDDs. NVMe SSDs are commonly used in data centres for AI workloads.

● System-on-Chip (SoC): An integrated circuit that combines multiple components of a computer system, such as a CPU, GPU, memory, and storage, onto a single chip. SoCs are commonly used in mobile devices and embedded systems.

T

● Taiwan Semiconductor Manufacturing Company (TSMC): A Taiwanese multinational semiconductor contract manufacturing and design company. TSMC is a major manufacturer of CPUs, GPUs, and other advanced processors.

● Tape Storage: A storage technology that uses magnetic tape to store data. Tape storage is often used for archival purposes and for storing extremely large datasets that do not require quick retrieval.

● Tensor Processing Unit (TPU): Google's custom-designed ASIC specifically optimised for AI workloads, particularly those using Google's TensorFlow framework.

● TensorFlow: An open-source AI framework developed by Google, known for its scalability and performance. TensorFlow is heavily optimised for Google's TPUs.

● Training: The process of creating an AI model by "teaching" an algorithm using data. During training, the AI model learns to identify patterns and make predictions based on the provided data.

U

● Ultra Accelerator Link (UALink): A proposed open industry standard interconnect designed for high-speed accelerator-to-accelerator (GPU-to-GPU) communication. UALink aims to create an open alternative to proprietary solutions like NVIDIA's NVLink.

● Ultra Ethernet Consortium (UEC): An industry initiative backed by major players in the AI hardware and software industry to optimise Ethernet for high-performance computing and AI networking.

● Ultra Path Interconnect (UPI): Intel's proprietary chip-to-chip interconnect technology that enables high-speed communication between multiple Intel CPUs.

● Universal Chiplet Interconnect Express (UCIe): An open industry standard for interconnecting chiplets from different vendors, promoting interoperability and innovation in AI hardware.

V

● Vendor Lock-In: A situation where a customer becomes reliant on a single vendor for products or services, making it difficult and costly to switch to a competitor. Proprietary technologies can lead to vendor lock-in.

W

● Wafer-Scale Engine: A massive AI processor developed by Cerebras Systems, known for its large size and processing power by virtue of having been fabricated on an entire silicon wafer.

This glossary provides a starting point for understanding the key terms and concepts related to AI hardware. Further exploration of the sources and other resources is encouraged for a deeper understanding.
