Academic Journal

A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs

Bibliographic Details
Title: A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs
Authors: Ledoux Pardo, Luis Eduardo; Casas, Marc
Source: UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Publisher Information: IEEE, 2022.
Publication Year: 2022
Subject Terms: Field programmable gate arrays, Space exploration, Systolic arrays, Energy consumption, System performance, Throughput, 7. Clean energy, Generators, Hardware, Energy efficiency, UPC subject areas::Computer science::Computer architecture
Description: We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering more accurate and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations. We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, a CAPI (Coherent Accelerator Processor Interface) link, and a Virtex Ultrascale Plus FPGA. Arrays can operate at the speed of the link and saturate it to reach a throughput of 13 GB/s. Our fine-grain customization approach covers a wide range of accuracy versus efficiency scenarios and can reach 0.65 GOps/s/W while producing 1024 accurate bits, or 148.7 GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613 GOps/s system performance and power efficiencies of up to 240 GOps/s/W for the FPGA. This automatic generator is the first able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86×.
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 955606. Marc Casas is supported by Grant RYC-2017-23269, funded by MCIN/AEI/10.13039/501100011033 and by “ESF Investing in your future”.
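Illustrative note: the second point of the description (dot products accumulated without intermediate rounding, giving reproducible results) can be sketched in software. The snippet below is only an analogy under stated assumptions: it uses Python's exact rational arithmetic in place of the S3 format, which the record does not specify, and the function names `dot_rounded` and `dot_exact` are hypothetical. It is not the authors' hardware design.

```python
from fractions import Fraction


def dot_rounded(a, b):
    """Dot product that rounds to double precision after every operation."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y  # each multiply and each add is rounded
    return acc


def dot_exact(a, b):
    """Dot product accumulated exactly, rounded only once at the end.

    Fraction(float) converts a float without error, so every product and
    every partial sum is exact. This stands in for a wide exact accumulator;
    it is an assumption for illustration, not the S3 format itself.
    """
    acc = Fraction(0)
    for x, y in zip(a, b):
        acc += Fraction(x) * Fraction(y)
    return float(acc)  # single final rounding


ones = [1.0, 1.0, 1.0]
order_1 = [1e16, 1.0, -1e16]
order_2 = [1e16, -1e16, 1.0]  # same values, different order

# Rounded accumulation depends on the order of the operands.
print(dot_rounded(order_1, ones), dot_rounded(order_2, ones))
# Exact accumulation gives the same answer for both orders.
print(dot_exact(order_1, ones), dot_exact(order_2, ones))
```

With per-step rounding, the small term is absorbed by the large one in the first ordering and survives in the second, so the two results differ. With exact accumulation, both orderings yield the same value, which is the sense in which deferring rounding makes results reproducible.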
Document Type: Article
Conference object
File Description: application/pdf
DOI: 10.1109/fccm53951.2022.9786164
DOI: 10.13039/100010661
Rights: STM Policy #29
Accession Number: edsair.doi.dedup.....fdb749e858b418994b2749f5e9c280c0
Database: OpenAIRE