Home > correlate_opencl

correlate_opencl

Correlate_opencl is a project mainly written in C++ and C, based on the View license.

Implementation of an image correlation algorithm in OpenCL for benchmark purposes

An implementation of the algorithm used in the paper "Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive Applications!" (http://bit.ly/a1xvZB) for comparing CPU and GPU performance on a cache-friendly image correlation algorithm.

And it was a good excuse to learn OpenCL.

Hardware tested: ATI Radeon HD 4850, 67.9 GBps memory bandwidth, 384 GBps L2, 480 GBps L1, 1000 mul-add SP GFLOPS, hobbled by lack of local mem and texture support in ATI's OpenCL drivers, resulting in most reads coming from main memory Core 2 Duo E6400, 6.4 GBps memory bandwidth, 18 GBps L2, 90.5 GBps L1, 37.5 mul-add SP GFLOPS, hobbled by lack of performance Athlon II X4 640, 15 GBps memory bandwidth, 50 GBps L2, 250 GBps L1, 96 mul-add SP GFLOPS, hobbled by small cache (not a problem here though) Interesting CPU result differences between the 32-bit and 64-bit versions.

Current results (8 FLOPS per 32 bytes memory read): The results from the paper are on 500x500 input images. The paper reported runtimes in seconds, I calculated the bandwidth numbers by dividing 281 GB by the reported runtimes.

The OpenCL results are measured without the context and kernel creation and release times, as those are likely only relevant for a single-run workload (and a 0.7 sec one-time overhead to do a 1 sec operation would be a bit skewy). The OpenCL results do include the data write and read times.

850 GBps (212 GFLOPS) 2x2 blocked GLSL on Radeon HD 6950 (640x640 input) 772 GBps (193 GFLOPS) 2x2 blocked GLSL on Radeon HD 6950 (500x500 input) 671 GBps (168 GFLOPS) OpenCL image kernel on Radeon HD 6950 (512x512 input) 618 GBps (154 GFLOPS) OpenCL image kernel on Radeon HD 6950 (500x500 input) 339 GBps (85 GFLOPS) 2x2 blocked GLSL on Radeon HD 4850 (500x500 input) 278 GBps (70 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (640x640 input) [276 GBps (69 GFLOPS) the paper's optimized impl on 8-core 3 GHz Power7] 259 GBps (65 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (512x512 input) 243 GBps (61 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (500x500 input) 240 GBps (60 GFLOPS) optimized OpenCL CPU on Athlon II X4 640 (32-bit) [230 GBps (58 GFLOPS) the paper's GTX 285 implementation using local mem] 204 GBps (51 GFLOPS) naive GLSL implementation on Radeon HD 4850 201 GBps (50 GFLOPS) manually optimized OpenMP SSE on Athlon II X4 640 (64-bit) [159 GBps (40 GFLOPS) the paper's GTX 285 implementation] [150 GBps (38 GFLOPS) the paper's optimized impl on 2x4-core 2.93 GHz Nehalem] 125 GBps (31 GFLOPS) optimized OpenMP SSE on Athlon II X4 (64-bit) 103 GBps (26 GFLOPS) manually optimized OpenMP SSE on Athlon II X4 (32-bit) 86 GBps (22 GFLOPS) naive OpenCL GPU implementation (1D workgroup) 84 GBps (21 GFLOPS) optimized OpenCL CPU on Core 2 Duo E6400 45 GBps (11 GFLOPS) manually optimized OpenMP SSE implementation on C2D 25 GBps (6 GFLOPS) naive OpenCL GPU implementation (2D workgroup) 13 GBps (3 GFLOPS) naive OpenMP SSE implementation on Athlon II X4 5 GBps (1 GFLOPS) naive OpenMP SSE implementation on C2D 5 GBps (1 GFLOPS) naive scalar C implementation on C2D 1.1 GBps (0.3 GFLOPS) naive OCaml implementation on C2D 0.03 GBps (0.008 GFLOPS) naive pure-Python implementation on C2D

Compiling and running: export ATISTREAMSDKROOT=path_to_ati_stream_sdk export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib/x86:$LD_LIBRARY_PATH

./build.sh ./corr ./corr_500 ./corr_gl ./corr_naive ./memsum

ocamlopt -o ocorr corr.ml time ./ocorr # divide time by 1.21 to get GBps

time python corr.py # divide time by 1.21 to get GBps

Previous:citrus-plugin