This PR optimizes the performance of `std::ranges::rotate` for
`vector<bool>::iterator`. The optimization yields a performance
improvement of up to 2096x.
Closes#64038.
This PR optimizes the performance of `std::ranges::swap_ranges` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations yield performance improvements of up to **611x** for
aligned range swap and **78x** for unaligned range swap comparison.
Additionally, comprehensive tests covering up to 4 storage words (256
bytes) with odd and even bit sizes are provided, which validate the
proposed optimizations in this patch.
This PR optimizes the performance of `std::ranges::equal` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations yield performance improvements of up to 188x for
aligned equality comparison and 82x for unaligned equality
comparison. Moreover, comprehensive tests covering up to 4 storage words
(256 bytes) with odd and even bit sizes are provided, which validate the
proposed optimizations in this patch.
As a follow-up to #121013 (which optimized `ranges::copy`) and #121026
(which optimized `ranges::copy_backward`), this PR enhances the
performance of `std::ranges::{move, move_backward}` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations bring performance improvements analogous to those
achieved for the `{copy, copy_backward}` algorithms: up to 2000x for
aligned moves and 60x for unaligned moves. Moreover, comprehensive
tests covering up to 4 storage words (256 bytes) with odd and even bit
sizes are provided, which validate the proposed optimizations in this
patch.
We have some benchmarks that were benchmarking very specific
functionality, namely the optimizations in vector<bool>::iterator. Call
this out in the benchmarks by renaming them appropriately. In the future
we will also increase the coverage of these benchmarks to test other
containers.
This patch does not significantly change how the sequence container
benchmarks are done, but it adopts the same style as the associative
container benchmarks.
This commit does adjust how we were benchmarking push_back, where we
never really measured the overhead of the slow path of push_back (when
we need to reallocate).
This patch implements generic associative container benchmarks for
containers with unique keys. In doing so, it replaces the existing
std::map benchmarks which were based on the cartesian product
infrastructure and were too slow to execute.
These new benchmarks aim to strike a balance between exhaustive coverage
of all operations in the most interesting case, while executing fairly
rapidly (~40s on my machine).
This bumps the requirement for the map benchmarks from C++17 to C++20
because the common header that provides associative container benchmarks
requires support for C++20 concepts.
Rewrite the sequence container benchmarks to only rely on the actual
operations specified in SequenceContainer requirements and add
benchmarks for std::list, which is also considered a sequence container.
One of the major goals of this refactoring is also to make these
container benchmarks run faster so that they can be run more frequently.
The existing benchmarks have the significant problem that they take so
long to run that they must basically be run overnight. This patch
reduces the size of inputs such that the rewritten benchmarks each take
at most a minute to run.
This patch doesn't touch the string benchmarks, which were not using the
generic container benchmark functions previously.
As a follow-up to #121013 (which focused on `std::ranges::copy`), this
PR optimizes the performance of `std::ranges::copy_backward` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations yield performance improvements of up to 2000x for
aligned copies and 60x for unaligned copies.
While implementing this feature and its associated LWG issues it turns
out
- LWG3316 Correctly define epoch for utc_clock / utc_timepoint only
added non-normative wording to the standard.
Implements parts of:
- P0355 Extending <chrono> to Calendars and Time Zones
- P1361 Integration of chrono with text formatting
- LWG3359 <chrono> leap second support should allow for negative leap
seconds
As a follow-up to #113852, this PR optimizes the performance of the
`insert(const_iterator pos, InputIt first, InputIt last)` function for
`input_iterator`-pair inputs in `std::vector` for cases where
reallocation occurs during insertion. Additionally, this optimization
enhances exception safety by replacing the traditional `try-catch`
mechanism with a modern exception guard for the `insert` function.
The optimization targets cases where insertion trigger reallocation. In
scenarios without reallocation, the implementation remains unchanged.
Previous implementation
-----------------------
The previous implementation of `insert` is inefficient in reallocation
scenarios because it performs the following steps separately:
- `reserve()`: This leads to the first round of relocating old
elements to new memory;
- `rotate()`: This leads to the second round of reorganizing the
existing elements;
- Move-forward: Moves the elements after the insertion position to
their final positions.
- Insert: performs the actual insertion.
This approach results in a lot of redundant operations, requiring the
elements to undergo three rounds of relocations/reorganizations to be
placed in their final positions.
Proposed implementation
-----------------------
The proposed implementation jointly optimize the above 4 steps in the
previous implementation such that each element is placed in its final
position in just one round of relocation. Specifically, this
optimization reduces the total cost from 2 relocations + 1 std::rotate
call to just 1 relocation, without needing to call `std::rotate`,
thereby significantly improving overall performance.
Since that benchmark is testing n*n inputs, the batch size reported to
GoogleBenchmark should be that amount. Otherwise, GoogleBenchmark
reports the timing for calling std::gcd on the whole sequence, which is
misleading.
The benchmark currently uses makeCartesianProductBenchmark, which
doesn't make a ton of sense, since minmax_element always goes through
every element one by one. The runtime doesn't depend on the values
of the elements.
Fixes#120758
That benchmark isn't really useful, since it doesn't benchmark anything
from libc++ (besides `operator new`). The implementation of the
benchmark also has serious problems like the fact that it allocates an
unknown amount of memory without deallocating it.
This patch refactors the tests around aligned allocation and sized
deallocation to avoid relying on passing the -fsized-deallocation or
-faligned-allocation flags by default. Since both of these features are
enabled by default in >= C++14 mode, it now makes sense to make that
assumption in the test suite.
A notable exception is MinGW and some older compilers, where sized
deallocation is still not enabled by default. We treat that as a "bug"
in the test suite and we work around it by explicitly adding
-fsized-deallocation, but only under those configurations.
This PR optimizes the input iterator overload of `assign(_InputIterator,
_InputIterator)` in `std::vector<_Tp, _Allocator>` by directly assigning
to already initialized memory, rather than first destroying existing
elements and then constructing new ones. By eliminating unnecessary
destruction and construction, the proposed algorithm enhances the
performance by up to 2x for trivial element types (e.g.,
`std::vector<int>`), up to 2.6x for non-trivial element types like
`std::vector<std::string>`, and up to 3.4x for more complex non-trivial
types (e.g., `std::vector<std::vector<int>>`).
### Google Benchmarks
Benchmark tests (`libcxx/test/benchmarks/vector_operations.bench.cpp`)
were conducted for the `assign()` implementations before and after this
patch. The tests focused on trivial element types like
`std::vector<int>`, and non-trivial element types such as
`std::vector<std::string>` and `std::vector<std::vector<int>>`.
#### Before
```
-------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------
BM_AssignInputIterIter/vector_int/1024/1024 1157 ns 1169 ns 608188
BM_AssignInputIterIter<32>/vector_string/1024/1024 14559 ns 14710 ns 47277
BM_AssignInputIterIter<32>/vector_vector_int/1024/1024 26846 ns 27129 ns 25925
```
#### After
```
-------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------
BM_AssignInputIterIter/vector_int/1024/1024 561 ns 566 ns 1242251
BM_AssignInputIterIter<32>/vector_string/1024/1024 5604 ns 5664 ns 128365
BM_AssignInputIterIter<32>/vector_vector_int/1024/1024 7927 ns 8012 ns 88579
```
This addresses the issue uncovered by #115361. Previously, we weren't
building benchmarks in many cases due to the following block:
e58949632e/libcxx/CMakeLists.txt (L162-L172)
We need to passthrough the necessary variables into the benchmarks
subbuild and use correct syntax.
In particular, test everything with both a normal and a min_allocator,
add tests for a few corner cases and add tests with types that are
trivially relocatable. Also add tests that count the number of
assignments performed by vector::erase, since that is mandated by the
Standard.
This patch is a preparation for optimizing vector::erase.
The input generating functions for benchmark tests in the GenerateInput.h
file can be slightly improved by invoking vector::reserve before calling
vector::push_back. This slight performance improvement could potentially
speed-up all benchmark tests for containers and algorithms that use these
functions as inputs.
This PR refines the code in `GenerateInput.h` used for benchmark testing
by implementing the following changes:
- Replaced all unqualified usages of `size_t` with `std::size_t`.
- Removed unnecessary curly braces `{}` from for loops that contain
simple single-statement bodies, in accordance with LLVM coding
standards.
Instead of building the benchmarks separately via CMake and running them
separately from the test suite, this patch merges the benchmarks into
the test suite and handles both uniformly.
As a result:
- It is now possible to run individual benchmarks like we run tests
(e.g. using libcxx-lit), which is a huge quality-of-life improvement.
- The benchmarks will be run under exactly the same configuration as
the rest of the tests, which is a nice simplification. This does
mean that one has to be careful to enable the desired optimization
flags when running benchmarks, but that is easy with e.g.
`libcxx-lit <...> --param optimization=speed`.
- Benchmarks can use the same annotations as the rest of the test
suite, such as `// UNSUPPORTED` & friends.
When running the tests via `check-cxx`, we only compile the benchmarks
because running them would be too time consuming. This introduces a bit
of complexity in the testing setup, and instead it would be better to
allow passing a --dry-run flag to GoogleBenchmark executables, which is
the topic of https://github.com/google/benchmark/issues/1827.
I am not really satisfied with the layering violation of adding the
%{benchmark_flags} substitution to cmake-bridge, however I believe
this can be improved in the future.
This patch fixes warnings and errors that come up when running the
benchmarks as part of the test suite. It also adds the necessary Lit
annotations to make it pass in various configurations and increases the
portability of the benchmarks.
As a drive-by, remove unused functor inside the unordered_set benchmark.
That benchmark still isn't very exhaustive, but that can be addressed
separately.
This patch is the start of a series to improve the speed of std::format,
std::format_to, std::format_to_n, and std::formatted_size.
This is mostly achieved by changing the __output_buffer class. This new
__output_buffer class also makes it easier to implement buffering for
P3107R5 "Permit an efficient implementation of std::print"
This is an intermediate and fairly mechanical step towards unifying the
benchmarks with the rest of the test suite. Moving this around requires
a few changes, notably making sure we don't throw a wrench into the
discovery process of the normal test suite. This won't be a problem
anymore once benchmarks are taken into account by the test setup out of
the box.