Expand description
Tokenizer load + no-pad tokenization + chunk-array build helpers.
Functions§
- build_
chunk_ 🔒arrays - Builds
input_idsandattention_maskarrays for a single chunk. - load_
tokenizer 🔒 - Loads and configures the BGE-M3 tokenizer with truncation at
max_seq_lengthbut no padding. Padding is applied per-chunk inbuild_chunk_arrays. - tokenize_
no_ 🔒pad - Tokenizes
textswithout applying any padding. Returns oneEncodingper text, each truncated to the tokenizer’s configuredmax_length.