llvm-project

mirror of https://github.com/llvm/llvm-project.git synced 2025-04-27 13:06:06 +00:00

Author	SHA1	Message	Date
Mark de Wever	5b98be4e0b	[lib++][Format] Updates Unicode database. (#125712 ) Updates the databease to the Unicode release 16.0.0. The algorithms of the Grapheme clustering rules have not changed.	2025-02-05 18:55:02 +01:00
Mark de Wever	59e66c515a	[libc++][format] Switches to Unicode 15.1. (#86543 ) In addition to changes in the tables the extended grapheme clustering algorithm has been overhauled. Before I considered a separate state machine to implement the rules. With the new rule GB9c this became more attractive and the design has changed. This change initially had quite an impact on the performance. By making the state machine persistent the performance was improved greatly. Note it is still slower than before due to the larger Unicode tables. Before -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- BM_ascii_text<char> 1891 ns 1889 ns 369504 BM_unicode_text<char> 106642 ns 106397 ns 6576 BM_cyrillic_text<char> 73420 ns 73277 ns 9445 BM_japanese_text<char> 62485 ns 62387 ns 11153 BM_emoji_text<char> 1895 ns 1893 ns 369525 BM_ascii_text<wchar_t> 2015 ns 2013 ns 346887 BM_unicode_text<wchar_t> 92119 ns 92017 ns 7598 BM_cyrillic_text<wchar_t> 62637 ns 62568 ns 11117 BM_japanese_text<wchar_t> 53850 ns 53785 ns 12803 BM_emoji_text<wchar_t> 2016 ns 2014 ns 347325 After -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- BM_ascii_text<char> 1906 ns 1904 ns 369409 BM_unicode_text<char> 265462 ns 265175 ns 2628 BM_cyrillic_text<char> 181063 ns 180865 ns 3871 BM_japanese_text<char> 130927 ns 130789 ns 5324 BM_emoji_text<char> 1892 ns 1890 ns 370537 BM_ascii_text<wchar_t> 2038 ns 2035 ns 343689 BM_unicode_text<wchar_t> 277603 ns 277282 ns 2526 BM_cyrillic_text<wchar_t> 188558 ns 188339 ns 3727 BM_japanese_text<wchar_t> 133084 ns 132943 ns 5262 BM_emoji_text<wchar_t> 2012 ns 2010 ns 348015 Persistent -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- BM_ascii_text<char> 1904 ns 1899 ns 367472 BM_unicode_text<char> 133609 ns 133287 ns 5246 BM_cyrillic_text<char> 90185 ns 89941 ns 7796 BM_japanese_text<char> 75137 ns 74946 ns 9316 BM_emoji_text<char> 1906 ns 1901 ns 368081 BM_ascii_text<wchar_t> 2703 ns 2696 ns 259153 BM_unicode_text<wchar_t> 131497 ns 131168 ns 5341 BM_cyrillic_text<wchar_t> 87071 ns 86840 ns 8076 BM_japanese_text<wchar_t> 72279 ns 72099 ns 9682 BM_emoji_text<wchar_t> 2021 ns 2016 ns 346767	2024-04-09 19:20:06 +02:00
Louis Dionne	0133e25d99	[runtimes][NFC] Remove trailing whitespace	2023-11-17 16:50:49 -05:00
Mark de Wever	68c3d66a97	[libc++][format] Improves width estimate. As obvious from the paper's title this is an LWG issue and thus retroactively applied to C++20. This change may the output for certain code points: 1 Considers 8477 extra codepoints as having a width 2 (as of Unicode 15) (mostly Tangut Ideographs) 2 Change the width of 85 unassigned code points from 2 to 1 3 Change the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ... U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE) from 2 to 1, because it seems questionable to make an exception for those without input from Unicode Note that libc++ already uses Unicode 15, while the Standard requires Unicode 12. (The last time I checked MSVC STL used Unicode 14.) So in practice the only notable change is item 3. Implements P2675 LWG3780: The Paper format's width estimation is too approximate and not forward compatible Benchmark before these changes -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- BM_ascii_text<char> 3928 ns 3928 ns 178131 BM_unicode_text<char> 75231 ns 75230 ns 9158 BM_cyrillic_text<char> 59837 ns 59834 ns 11529 BM_japanese_text<char> 39842 ns 39832 ns 17501 BM_emoji_text<char> 3931 ns 3930 ns 177750 BM_ascii_text<wchar_t> 4024 ns 4024 ns 174190 BM_unicode_text<wchar_t> 63756 ns 63751 ns 11136 BM_cyrillic_text<wchar_t> 44639 ns 44638 ns 15597 BM_japanese_text<wchar_t> 34425 ns 34424 ns 20283 BM_emoji_text<wchar_t> 3937 ns 3937 ns 177684 Benchmark after these changes -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- BM_ascii_text<char> 3914 ns 3913 ns 178814 BM_unicode_text<char> 70380 ns 70378 ns 9694 BM_cyrillic_text<char> 51889 ns 51877 ns 13488 BM_japanese_text<char> 41707 ns 41705 ns 16723 BM_emoji_text<char> 3908 ns 3907 ns 177912 BM_ascii_text<wchar_t> 3949 ns 3948 ns 177525 BM_unicode_text<wchar_t> 64591 ns 64587 ns 10649 BM_cyrillic_text<wchar_t> 44089 ns 44078 ns 15721 BM_japanese_text<wchar_t> 39369 ns 39367 ns 17779 BM_emoji_text<wchar_t> 3936 ns 3934 ns 177821 Benchmarks without "if(__code_point < (__entries[0] >> 14))" -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- BM_ascii_text<char> 3922 ns 3922 ns 178587 BM_unicode_text<char> 94474 ns 94474 ns 7351 BM_cyrillic_text<char> 69202 ns 69200 ns 10157 BM_japanese_text<char> 42735 ns 42692 ns 16382 BM_emoji_text<char> 3920 ns 3919 ns 178704 BM_ascii_text<wchar_t> 3951 ns 3950 ns 177224 BM_unicode_text<wchar_t> 81003 ns 80988 ns 8668 BM_cyrillic_text<wchar_t> 57020 ns 57018 ns 12048 BM_japanese_text<wchar_t> 39695 ns 39687 ns 17582 BM_emoji_text<wchar_t> 3977 ns 3976 ns 176479 This optimization does carry its weight for the Unicode and Cyrillic test. For the Japanese tests the gains are minor and for emoji it seems to have no effect. Reviewed By: ldionne, tahonermann, #libc Differential Revision: https://reviews.llvm.org/D144499	2023-04-20 21:18:33 +02:00
Mark de Wever	a48007355a	[libc++][format] Implements string escaping. Implements parts of - P2286R8 Formatting Ranges Reviewed By: #libc, tahonermann Differential Revision: https://reviews.llvm.org/D134036	2022-10-20 17:29:34 +02:00
Mark de Wever	deaad6a1b6	[libc++][format] Updates to Unicode 15. This adds support for the new code points in the Extended Grapheme Cluster algorithm. The algorithm itself has remained unchanged. The width estimation still follows the rules of the Standard. @cor3ntin filed LWG3780 format's width estimation is too approximate and not forward compatible to improve the estimate. Reviewed By: ldionne, #libc Differential Revision: https://reviews.llvm.org/D134106	2022-10-04 18:58:09 +02:00
Mark de Wever	130b1816c5	[libc++] Improve updating data files. This changes makes it easier to update the Unicode data files used for the Extended Graphme Clustering as added in D126971. Reviewed By: ldionne, #libc Differential Revision: https://reviews.llvm.org/D129668	2022-08-16 18:55:46 +02:00

7 Commits