7 Commits

Author SHA1 Message Date
Mark de Wever
5b98be4e0b
[lib++][Format] Updates Unicode database. (#125712)
Updates the databease to the Unicode release 16.0.0. The algorithms of
the Grapheme clustering rules have not changed.
2025-02-05 18:55:02 +01:00
Mark de Wever
59e66c515a
[libc++][format] Switches to Unicode 15.1. (#86543)
In addition to changes in the tables the extended grapheme clustering
algorithm has been overhauled. Before I considered a separate state
machine to implement the rules. With the new rule GB9c this became more
attractive and the design has changed.

This change initially had quite an impact on the performance. By making
the state machine persistent the performance was improved greatly. Note
it is still slower than before due to the larger Unicode tables.

Before
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             1891 ns         1889 ns       369504
BM_unicode_text<char>         106642 ns       106397 ns         6576
BM_cyrillic_text<char>         73420 ns        73277 ns         9445
BM_japanese_text<char>         62485 ns        62387 ns        11153
BM_emoji_text<char>             1895 ns         1893 ns       369525
BM_ascii_text<wchar_t>          2015 ns         2013 ns       346887
BM_unicode_text<wchar_t>       92119 ns        92017 ns         7598
BM_cyrillic_text<wchar_t>      62637 ns        62568 ns        11117
BM_japanese_text<wchar_t>      53850 ns        53785 ns        12803
BM_emoji_text<wchar_t>          2016 ns         2014 ns       347325

After
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             1906 ns         1904 ns       369409
BM_unicode_text<char>         265462 ns       265175 ns         2628
BM_cyrillic_text<char>        181063 ns       180865 ns         3871
BM_japanese_text<char>        130927 ns       130789 ns         5324
BM_emoji_text<char>             1892 ns         1890 ns       370537
BM_ascii_text<wchar_t>          2038 ns         2035 ns       343689
BM_unicode_text<wchar_t>      277603 ns       277282 ns         2526
BM_cyrillic_text<wchar_t>     188558 ns       188339 ns         3727
BM_japanese_text<wchar_t>     133084 ns       132943 ns         5262
BM_emoji_text<wchar_t>          2012 ns         2010 ns       348015

Persistent
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             1904 ns         1899 ns       367472
BM_unicode_text<char>         133609 ns       133287 ns         5246
BM_cyrillic_text<char>         90185 ns        89941 ns         7796
BM_japanese_text<char>         75137 ns        74946 ns         9316
BM_emoji_text<char>             1906 ns         1901 ns       368081
BM_ascii_text<wchar_t>          2703 ns         2696 ns       259153
BM_unicode_text<wchar_t>      131497 ns       131168 ns         5341
BM_cyrillic_text<wchar_t>      87071 ns        86840 ns         8076
BM_japanese_text<wchar_t>      72279 ns        72099 ns         9682
BM_emoji_text<wchar_t>          2021 ns         2016 ns       346767
2024-04-09 19:20:06 +02:00
Louis Dionne
0133e25d99 [runtimes][NFC] Remove trailing whitespace 2023-11-17 16:50:49 -05:00
Mark de Wever
68c3d66a97 [libc++][format] Improves width estimate.
As obvious from the paper's title this is an LWG issue and thus retroactively
applied to C++20. This change may the output for certain code points:
1 Considers 8477 extra codepoints as having a width 2 (as of Unicode 15)
  (mostly Tangut Ideographs)
2 Change the width of 85 unassigned code points from 2 to 1
3 Change the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER
  TEN ON BLACK SQUARE ... U+324F CIRCLED NUMBER EIGHTY ON BLACK
  SQUARE) from 2 to 1, because it seems questionable to make an exception
  for those without input from Unicode

Note that libc++ already uses Unicode 15, while the Standard requires Unicode 12.
(The last time I checked MSVC STL used Unicode 14.)

So in practice the only notable change is item 3.

Implements
  P2675 LWG3780: The Paper
  format's width estimation is too approximate and not forward compatible

Benchmark before these changes
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             3928 ns         3928 ns       178131
BM_unicode_text<char>          75231 ns        75230 ns         9158
BM_cyrillic_text<char>         59837 ns        59834 ns        11529
BM_japanese_text<char>         39842 ns        39832 ns        17501
BM_emoji_text<char>             3931 ns         3930 ns       177750
BM_ascii_text<wchar_t>          4024 ns         4024 ns       174190
BM_unicode_text<wchar_t>       63756 ns        63751 ns        11136
BM_cyrillic_text<wchar_t>      44639 ns        44638 ns        15597
BM_japanese_text<wchar_t>      34425 ns        34424 ns        20283
BM_emoji_text<wchar_t>          3937 ns         3937 ns       177684

Benchmark after these changes
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             3914 ns         3913 ns       178814
BM_unicode_text<char>          70380 ns        70378 ns         9694
BM_cyrillic_text<char>         51889 ns        51877 ns        13488
BM_japanese_text<char>         41707 ns        41705 ns        16723
BM_emoji_text<char>             3908 ns         3907 ns       177912
BM_ascii_text<wchar_t>          3949 ns         3948 ns       177525
BM_unicode_text<wchar_t>       64591 ns        64587 ns        10649
BM_cyrillic_text<wchar_t>      44089 ns        44078 ns        15721
BM_japanese_text<wchar_t>      39369 ns        39367 ns        17779
BM_emoji_text<wchar_t>          3936 ns         3934 ns       177821

Benchmarks without "if(__code_point < (__entries[0] >> 14))"
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             3922 ns         3922 ns       178587
BM_unicode_text<char>          94474 ns        94474 ns         7351
BM_cyrillic_text<char>         69202 ns        69200 ns        10157
BM_japanese_text<char>         42735 ns        42692 ns        16382
BM_emoji_text<char>             3920 ns         3919 ns       178704
BM_ascii_text<wchar_t>          3951 ns         3950 ns       177224
BM_unicode_text<wchar_t>       81003 ns        80988 ns         8668
BM_cyrillic_text<wchar_t>      57020 ns        57018 ns        12048
BM_japanese_text<wchar_t>      39695 ns        39687 ns        17582
BM_emoji_text<wchar_t>          3977 ns         3976 ns       176479

This optimization does carry its weight for the Unicode and Cyrillic
test. For the Japanese tests the gains are minor and for emoji it seems
to have no effect.

Reviewed By: ldionne, tahonermann, #libc

Differential Revision: https://reviews.llvm.org/D144499
2023-04-20 21:18:33 +02:00
Mark de Wever
a48007355a [libc++][format] Implements string escaping.
Implements parts of
- P2286R8 Formatting Ranges

Reviewed By: #libc, tahonermann

Differential Revision: https://reviews.llvm.org/D134036
2022-10-20 17:29:34 +02:00
Mark de Wever
deaad6a1b6 [libc++][format] Updates to Unicode 15.
This adds support for the new code points in the Extended Grapheme
Cluster algorithm. The algorithm itself has remained unchanged.

The width estimation still follows the rules of the Standard.
@cor3ntin filed
  LWG3780 format's width estimation is too approximate and not forward compatible
to improve the estimate.

Reviewed By: ldionne, #libc

Differential Revision: https://reviews.llvm.org/D134106
2022-10-04 18:58:09 +02:00
Mark de Wever
130b1816c5 [libc++] Improve updating data files.
This changes makes it easier to update the Unicode data files used for
the Extended Graphme Clustering as added in D126971.

Reviewed By: ldionne, #libc

Differential Revision: https://reviews.llvm.org/D129668
2022-08-16 18:55:46 +02:00