Academic Journal
A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs
| Title: | A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs |
|---|---|
| Authors: | Ledoux Pardo, Luis Eduardo; Casas, Marc |
| Source: | UPCommons. Portal del coneixement obert de la UPC Universitat Politècnica de Catalunya (UPC) |
| Publisher Information: | IEEE, 2022. |
| Publication Year: | 2022 |
| Subject Terms: | Field programmable gate arrays, Space exploration, Systolic arrays, Energy consumption, System performance, Throughput, 7. Clean energy, Generators, Hardware, Energy efficiency, UPC subject areas::Computer science::Computer architecture |
| Description: | We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering more accuracy and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations. We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, a CAPI (Coherent Accelerator Processor Interface) link, and a Virtex Ultrascale Plus FPGA. Arrays can operate at the speed of the link and saturate it to reach a 13 GB/s throughput. Our fine-grain customization approach allows us to cover a wide range of accuracy versus efficiency scenarios and can reach 0.65 GOps/s/W while producing 1024 accurate bits or 148.7 GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613 GOps/s system performance and power efficiencies of up to 240 GOps/s/W for the FPGA. This automatic generator is the first able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86×. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 955606. Marc Casas is supported by Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and by “ESF Investing in your future”. |
| Document Type: | Article, Conference object |
| File Description: | application/pdf |
| DOI: | 10.1109/fccm53951.2022.9786164 |
| DOI: | 10.13039/100010661 |
| Rights: | STM Policy #29 |
| Accession Number: | edsair.doi.dedup.....fdb749e858b418994b2749f5e9c280c0 |
| Database: | OpenAIRE |
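
The description states that the processing elements compute dot products without any intermediate rounding, which yields more accurate and reproducible results. The following is a minimal software sketch of that idea only, not the authors' hardware design or their S3 format: Python's `Fraction` stands in for a wide exact accumulator, and the function names are hypothetical.

```python
from fractions import Fraction


def dot_rounded_each_step(a, b):
    """Conventional float64 dot product: every multiply and add rounds."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_single_rounding(a, b):
    """Accumulate exactly, then round once when converting back to float."""
    acc = Fraction(0)
    for x, y in zip(a, b):
        # Every finite float is an exact rational, so no error is introduced here.
        acc += Fraction(x) * Fraction(y)
    return float(acc)  # the only rounding step


# Same terms in two different orders (illustrative values, not from the paper).
b = [1.0, 1.0, 1.0]
order_1 = [1e16, 1.0, -1e16]
order_2 = [1e16, -1e16, 1.0]

print(dot_rounded_each_step(order_1, b), dot_rounded_each_step(order_2, b))
print(dot_single_rounding(order_1, b), dot_single_rounding(order_2, b))
```

With per-step rounding the result depends on summation order, because the small term is absorbed when it is added to the large one. With a single final rounding both orders give the same value, which is the reproducibility property the description refers to.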