Abstract
We introduce a 64-bit ANSI/IEEE Std 754-1985 floating point design of a hardware matrix multiplier optimized for FPGA implementations. A general block matrix multiplication algorithm, applicable for an arbitrary matrix size is proposed. The algorithm potentially enables optimum performance by exploiting the data locality and reusability incurred by the general matrix multiplication scheme and considering the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PE) supporting the proposed algorithm in the Xilinx Virtex II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X consuming the least reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching performance of, e.g., 15.6 GFLOPS with 1600 KB local memory and 400 MB/s external memory bandwidth.
| Original language | English |
|---|---|
| Pages | 86-95 |
| Number of pages | 10 |
| Publication status | Published - 20-Jun-2005 |
| Externally published | Yes |
| Event | ACM/SIGDA Thirteenth ACM International Symposium on Field Programmable Gate Arrays - FPGA 2005 - Monterey, CA, United States Duration: 20-Feb-2005 → 22-Feb-2005 |
Conference
| Conference | ACM/SIGDA Thirteenth ACM International Symposium on Field Programmable Gate Arrays - FPGA 2005 |
|---|---|
| Country/Territory | United States |
| City | Monterey, CA |
| Period | 20/02/2005 → 22/02/2005 |
Keywords
- Floating-point
- FPGA
- Matrix multiplication