Whether in data centers or at the edge, artificial intelligence (AI) accelerators address the limitations of traditional von Neumann architecture by rapidly processing massive datasets. Despite the gradual slowing of Moore’s law, these accelerators efficiently enable key applications such as generative AI (GenAI), deep reinforcement learning (DRL), advanced driver assistance systems (ADAS), smart edge devices, and wearables.
This article discusses the limitations of classic von Neumann architecture and highlights how AI accelerators overcome these challenges. It reviews various AI accelerator types and explains how engineers optimize performance and energy efficiency with advanced electronic design automation (EDA).
Addressing the limitations of von Neumann architecture
Historically, von Neumann-based systems built around general-purpose central processing units (CPUs) relied on coprocessors, such as discrete graphics processing units (GPUs), for specialized tasks like gaming, video editing, and cloud-based data processing. However, whether at the edge, in data centers, or in PCs, this paradigm can’t efficiently meet the high-performance computational demands of advanced AI inference and training. This is because the von Neumann architecture and the legacy software designed around it inherently create bottlenecks by processing data sequentially rather than in parallel.
Today, AI accelerators running large language models (LLMs) leverage parallel processing to meet AI-specific requirements. They break down complex problems into smaller tasks and execute billions of calculations simultaneously.
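To make the contrast concrete, here is a minimal Python sketch (illustrative only, not drawn from any particular accelerator) that compares a sequential multiply-accumulate loop with the same workload expressed as a single data-parallel vector operation, the form that accelerator hardware executes across many lanes at once.

```python
import time
import numpy as np

# Illustrative workload: one million multiply-accumulate operations,
# the core primitive behind neural-network inference.
a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

# Sequential, von Neumann-style processing: one operation per loop iteration.
start = time.perf_counter()
acc = 0.0
for i in range(len(a)):
    acc += a[i] * b[i]
sequential_s = time.perf_counter() - start

# Data-parallel processing: the same math expressed as one vector operation,
# which an accelerator (or a CPU's SIMD unit) executes across many lanes at once.
start = time.perf_counter()
acc_vec = np.dot(a, b)
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.3f} s, vectorized: {parallel_s:.5f} s")
```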
In addition to parallel processing, many AI accelerators use reduced precision techniques — employing 8-bit or 16-bit numbers instead of the standard 32-bit format — to minimize processing cycles and save power. Neural networks (Figure 1) are highly tolerant of reduced precision during training and inference, and this technique can sometimes even improve accuracy. Put simply, reduced precision enables faster calculations and significantly reduced power consumption, with each operation requiring up to 30x less silicon area than standard 32-bit precision.
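The effect of reduced precision is easy to demonstrate. The sketch below is a simple illustration with randomly generated weights, not a production quantization flow: it casts a 32-bit weight matrix to 16-bit floating point and to symmetric 8-bit integers, then reports the memory savings and the resulting quantization error.

```python
import numpy as np

# Randomly generated weights standing in for one layer of a network,
# stored in the standard 32-bit floating-point format.
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

# Reduced precision, option 1: 16-bit floating point.
weights_fp16 = weights_fp32.astype(np.float16)

# Reduced precision, option 2: symmetric 8-bit integer quantization.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# The memory footprint shrinks 2x (fp16) and 4x (int8) ...
print("fp32:", weights_fp32.nbytes // 1024, "KiB")
print("fp16:", weights_fp16.nbytes // 1024, "KiB")
print("int8:", weights_int8.nbytes // 1024, "KiB")

# ... while the error introduced stays small, which is why neural networks
# tolerate reduced precision during training and inference.
reconstructed = weights_int8.astype(np.float32) * scale
print("mean absolute quantization error:", float(np.abs(weights_fp32 - reconstructed).mean()))
```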
Specialized memory, such as on-chip SRAM caches, high-bandwidth memory (HBM), or graphics double data rate (GDDR) memory, further reduces latency and optimizes throughput for a wide range of AI workloads.
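A rough roofline-style calculation shows why memory bandwidth matters as much as raw compute. The numbers below are assumed round figures, not vendor specifications; the point is the comparison between workloads that reuse each byte many times and those that do not.

```python
# Roofline-style check with assumed round numbers (not vendor specifications):
# is a given workload limited by compute or by memory bandwidth?
peak_compute_tops = 200.0      # assumed accelerator peak throughput, INT8 TOPS
memory_bandwidth_gbs = 3000.0  # assumed HBM bandwidth, GB/s

# Arithmetic intensity = operations performed per byte moved from memory.
# Token-by-token LLM generation reuses each weight byte roughly once, while
# large-batch training matrix multiplies reuse each byte hundreds of times.
workloads = {"LLM token generation": 2, "batched training GEMM": 400}

for name, ops_per_byte in workloads.items():
    # Achievable throughput is capped by whichever resource saturates first.
    memory_limited_tops = memory_bandwidth_gbs * ops_per_byte / 1000.0
    achievable = min(peak_compute_tops, memory_limited_tops)
    bound = "memory-bound" if memory_limited_tops < peak_compute_tops else "compute-bound"
    print(f"{name}: ~{achievable:.0f} TOPS achievable ({bound})")
```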
From the data center to the intelligent edge
Semiconductor companies design AI accelerators to enable both data center and edge applications. These include:
Wafer-scale integration (WSI) silicon: integrates large AI chip networks into a single “super” chip. In data centers, WSI chips such as the Cerebras wafer-scale engine (WSE) are used for high-performance deep learning tasks, including training LLMs and handling complex AI workloads. The WSE (Figure 2) supports multiple model sizes, including 8B and 70B parameter versions of LLaMA models.
GPUs: accelerate a wide range of applications in data centers and at the edge, such as machine learning (ML), DRL, GenAI, and computer vision. In data centers, massive NVIDIA GPU clusters boost processing power for AI training, scientific computing, and GenAI queries. At the edge, lower-power GPUs handle tasks like object detection in smart cameras and image processing in autonomous vehicles.
Neural processing units (NPUs): excel at tasks like image recognition and natural language processing (NLP). Low-power, high-performance NPUs process data efficiently in edge applications like industrial IoT (IIoT), automotive, smartphones, wearables, and smart home appliances. Companies like BrainChip, Synopsys, Cadence, Intel, AMD, and Apple design NPUs with low latency for applications like voice commands, rapid image generation, and inference.
Field-programmable gate arrays (FPGAs): offer reprogrammable acceleration that can be customized for specific tasks, targeting edge applications that require diverse I/O protocols, low latency, and low power. Intel, AMD, Achronix Semiconductor, Flex Logix, and others design AI FPGA accelerators.
Application-specific integrated circuits (ASICs): deliver high performance for specific tasks like DRL, video encoding, and cryptographic processing. Although ASICs aren’t reprogrammable, their application-specific design ensures high efficiency and low latency for target use cases. Google’s tensor processing unit (TPU) is an ASIC developed for neural network machine learning. It is used in edge and data center environments to optimize AI workloads.
How EDA optimizes AI accelerator energy efficiency
AI accelerators are typically 100x to 1,000x more efficient than general-purpose systems, offering significant improvements in processing speed, power consumption, and computational throughput. Despite these advantages, the computational power required for the largest AI training runs has doubled approximately every 3.4 months since 2012.
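The arithmetic behind a 3.4-month doubling time is worth spelling out; the short calculation below simply evaluates 2^(t / 3.4) for a few time horizons, with t in months.

```python
# Growth implied by a 3.4-month doubling time: 2 ** (t / 3.4) for t in months.
doubling_months = 3.4
for years in (1, 2, 5):
    growth = 2 ** (years * 12 / doubling_months)
    print(f"{years} year(s): ~{growth:,.0f}x more training compute")
```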
This rapid increase is driven by the growing complexity of training datasets and LLMs and the demand for higher accuracy and capabilities. Consequently, the U.S. Department of Energy (DoE) has recommended a 1,000-fold improvement in semiconductor energy efficiency, making performance-per-watt (PPW) optimization a top industry priority.
Improving the AI accelerator power delivery network (PDN) architecture helps deliver high-speed, energy-efficient, and cost-effective performance. EDA engineers begin the design process with AI-driven architectural exploration platforms that assess power, performance, and area (PPA) tradeoffs. Sophisticated emulation systems running billions of cycles allow engineers to precisely assess power consumption and thermal dissipation across diverse scenarios.
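As a rough illustration of what architectural exploration involves (a toy model with invented cost functions, not any vendor's tool), the sketch below enumerates candidate accelerator configurations and keeps only those that are Pareto-optimal across power, performance, and area.

```python
from itertools import product

# Toy design space: number of multiply-accumulate (MAC) units, on-chip SRAM,
# and clock frequency. The cost models below are invented for illustration.
candidates = []
for macs, sram_mb, freq_ghz in product([1024, 2048, 4096], [2, 4, 8], [0.8, 1.0, 1.2]):
    utilization = sram_mb / (sram_mb + 2)                # more SRAM, fewer stalls
    tops = macs * freq_ghz * 2 / 1000 * utilization      # rough performance estimate
    watts = 0.002 * macs * freq_ghz**2 + 0.05 * sram_mb  # toy power model
    mm2 = macs / 6400 + 0.5 * sram_mb                    # toy area model
    candidates.append((macs, sram_mb, freq_ghz, tops, watts, mm2))

def dominates(b, a):
    # b dominates a if it is at least as good on every axis and better on one.
    return (b[3] >= a[3] and b[4] <= a[4] and b[5] <= a[5]) and (b[3:] != a[3:])

# Keep only Pareto-optimal configurations: no other point beats them on all of
# performance (TOPS), power (W), and area (mm^2) simultaneously.
pareto = [a for a in candidates if not any(dominates(b, a) for b in candidates)]
for macs, sram_mb, freq_ghz, tops, watts, mm2 in sorted(pareto, key=lambda x: x[3]):
    print(f"{macs} MACs, {sram_mb} MB SRAM, {freq_ghz} GHz -> "
          f"{tops:.1f} TOPS, {watts:.1f} W, {mm2:.2f} mm^2")
```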
EDA engineers use register transfer level (RTL) power analysis to optimize dynamic and static power consumption, leveraging timing-driven and physically aware synthesis for accuracy (Figure 3). Timing-driven synthesis prevents power calculation errors by ensuring proper cell sizing, while physically aware synthesis incorporates first-pass placement and global routing for precise capacitance estimation.
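The quantities RTL power analysis estimates follow the familiar relationships P_dyn = alpha * C * V^2 * f for dynamic (switching) power and P_stat = V * I_leak for static (leakage) power. The short example below plugs in illustrative values and shows how an error in estimated switched capacitance propagates directly into the dynamic-power figure, which is why physically aware capacitance estimation matters.

```python
# Standard relationships estimated by RTL power analysis, with illustrative
# values (not from a real design):
#   dynamic (switching) power:  P_dyn  = alpha * C * V^2 * f
#   static (leakage) power:     P_stat = V * I_leak
alpha = 0.15          # average switching activity per clock cycle
c_switched = 5.0e-8   # total switched capacitance, farads
v_dd = 0.75           # supply voltage, volts
freq = 1.5e9          # clock frequency, Hz
i_leak = 0.8          # total leakage current, amperes

p_dynamic = alpha * c_switched * v_dd**2 * freq
p_static = v_dd * i_leak
print(f"dynamic: {p_dynamic:.2f} W, static: {p_static:.2f} W")

# A 10% error in estimated switched capacitance becomes a 10% error in
# dynamic power, which is why accurate placement-aware estimation matters.
p_dynamic_high = alpha * (1.10 * c_switched) * v_dd**2 * freq
print(f"with +10% capacitance error: {p_dynamic_high:.2f} W")
```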
Some RTL power analysis tools include a signoff-quality computation engine to accurately calculate glitch power, which can account for a significant portion of a chip’s power consumption in certain scenarios. After RTL analysis, engineers use physical implementation tools to refine PPA further.
Many EDA tools feature an integrated data model architecture, interleaved engines, and unified interfaces to ensure scalability and reliability. Additionally, they accurately model advanced node effects, expediting engineering change orders (ECOs) and final design closure.
Conclusion
Classic von Neumann architecture creates bottlenecks by processing data sequentially rather than in parallel. In contrast, AI accelerators break down complex problems and execute billions of calculations simultaneously with parallel processing. Design engineers use advanced EDA tools to optimize AI accelerators for performance and energy efficiency in data centers and at the intelligent edge.
Related EE World content
What’s the Difference Between GPUs and TPUs for AI Processing?
High-Speed, Low-Power Embedded Processor Technology Helps Advance Vision AI
How Are High-Speed Board-to-Board Connectors Used in ML and AI Systems?
How does UCIe on Chiplets Enable Optical Interconnects in Data Centers?
What is TinyML?