Skip to main content

Module tokenize

Module tokenize 

Source
Expand description

Tokenizer load + no-pad tokenization + chunk-array build helpers.

Functions§

build_chunk_arrays 🔒
Builds input_ids and attention_mask arrays for a single chunk.
load_tokenizer 🔒
Loads and configures the BGE-M3 tokenizer with truncation at max_seq_length but no padding. Padding is applied per-chunk in build_chunk_arrays.
tokenize_no_pad 🔒
Tokenizes texts without applying any padding. Returns one Encoding per text, each truncated to the tokenizer’s configured max_length.