jaime-m-p
37bef89433
tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18 18:40:52 +02:00