TY - GEN
T1 - Elastic pipeline
T2 - 8th ACM International Conference on Computing Frontiers, CF'11
AU - Gou, Chunyang
AU - Gaydadjiev, Georgi N.
PY - 2011/9/13
Y1 - 2011/9/13
N2 - One of the major problems with GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These varied latencies cause conflicts at the writeback stage of the in-order pipeline and pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that the proposed elastic pipeline, together with the co-designed bank-conflict-aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for our benchmark applications, at trivial hardware overhead.
KW - B.3.2 [Design Styles]: Interleaved memories
KW - C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: SIMD
KW - Design
KW - Performance
UR - http://www.scopus.com/inward/record.url?scp=80052516856&partnerID=8YFLogxK
U2 - 10.1145/2016604.2016608
DO - 10.1145/2016604.2016608
M3 - Conference contribution
AN - SCOPUS:80052516856
SN - 9781450306980
T3 - Proceedings of the 8th ACM International Conference on Computing Frontiers, CF'11
BT - Proceedings of the 8th ACM International Conference on Computing Frontiers, CF'11
PB - Association for Computing Machinery (ACM)
Y2 - 3 May 2011 through 5 May 2011
ER -