Expand description
Bin-packing algorithm that groups tokenized sequences into session.run()
calls that each fit within the per-worker workspace budget.
The central type is CostModel, which captures the quadratic memory
scaling of BGE-M3 attention and is used by bin_pack to partition an
incoming batch into chunks that are safe to run in a single ORT call.
Structs§
- Cost
Model - Quadratic-aware workspace cost model for ONNX attention inference.
Functions§
- bin_
pack 🔒 - Length-sorted greedy bin-packer.