The latest OCaml Journal article High-performance parallel programming with HLVM (23rd January 2010) described a simple but effective parallelization of the HLVM implementation of the ray tracer. Comparing with similarly naive parallelizations of the fastest serial implementations in C++ and Haskell, we obtained the following results:
These results exhibit several interesting characteristics:
- Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.
- C++ and HLVM scale well, with performance improving as more cores are used.
- Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup with up to 5 cores and its performance even degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.
The efficiency of the parallelization can be quantified as the speed of the parallel version on multiple cores relative to its speed on a single core:
These results show that HLVM obtained the largest speedup: 6.3× faster on 8 cores. C++ obtained the next largest speedup: 5.2x faster on 8 cores. Haskell obtained the worst speedup: only 2.9× faster on 8 cores.
More sophisticated parallelizations could obtain better performance still. Currently, the parallel for loop over the rows of the image causes each thread to compete for an iteration and results in poor locality: if a thread computes one row then it is unlikely to compute the next. Work-stealing queues and recursive subdivision of the problem space would greatly improve locality and, therefore, performance. Also, the initial construction of the scene tree has some opportunity for parallelization and further performance improvements.
Furthermore, there is plenty of room to optimize HLVM itself and some simple hacks can quantify the possibilities. Disabling bounds checking provides a substantial 20% performance improvement. Disabling the shadow stack (and, therefore, disabling GC) provides another 25% performance improvement. With these two changes, HLVM is only 24% slower than C++. In particular, HLVM's current design makes heavy reuse of its own generic routines even in performance critical sections such as manipulating the shadow stack. Optimizing these routines by hand could leverage some of this potential. For example, by removing bounds checks from manipulations of the shadow stack.