Clang spends a decent amount of time in the LineOffsetMapping::get(...)
function. This function used to be vectorized (through SSE2) then the
optimization got dropped because the sequential version was on-par performance
wise.
This provides an optimization of the sequential version that works on a word at
a time, using (documented) bithacks to provide a portable vectorization.
When preprocessing the sqlite amalgamation, this yields a sweet 3% speedup.
This is a recommit of 6951b72334bbe4c189c71751edc1e361d7b5632c with endianness
and unsigned long vs uint64_t issues fixed (hopefully).
Differential Revision: https://reviews.llvm.org/D99409