Skip to main content

Module binpack

Module binpack 

Source
Expand description

Bin-packing algorithm that groups tokenized sequences into session.run() calls that each fit within the per-worker workspace budget.

The central type is CostModel, which captures the quadratic memory scaling of BGE-M3 attention and is used by bin_pack to partition an incoming batch into chunks that are safe to run in a single ORT call.

Structs§

CostModel
Quadratic-aware workspace cost model for ONNX attention inference.

Functions§

bin_pack 🔒
Length-sorted greedy bin-packer.