Struct Config

pub struct Config {
    pub cache_dir: String,
    pub bind_addr: String,
    pub workers: usize,
    pub intra_threads: usize,
    pub max_batch: usize,
    pub max_seq_length: usize,
    pub idle_timeout: Option<Duration>,
    pub model_variant: ModelVariant,
    pub memory_safety_factor: f64,
    pub cost_model_override: Option<CostModel>,
    pub heartbeat_secs: u64,
}

Runtime configuration loaded from environment variables.

All fields are read once at startup via Config::from_env. Changes to environment variables after startup have no effect.

Fields§

§cache_dir: String

Path to the directory where ONNX model files are cached.

Set with BGE_M3_CACHE_DIR. Defaults to /cache.

§bind_addr: String

TCP bind address for the HTTP server.

Set with BGE_M3_BIND. Defaults to 0.0.0.0:8081. The 0.0.0.0 default is intentional for Docker container deployments.

§workers: usize

Number of embedding worker threads to spawn.

Set with BGE_M3_WORKERS. Defaults to 2. Minimum effective value is 1. Each worker loads its own model instance.

§intra_threads: usize

Number of intra-op threads each ORT session may use for a single session.run() call (matmul / attention kernels).

Set with BGE_M3_INTRA_THREADS. Defaults to 1. Minimum effective value is 1.

The default of 1 keeps per-worker RSS predictable: the workspace probe and quadratic cost model are calibrated against single-threaded MLAS runs. Raise it on under-utilized hosts, keeping workers * intra_threads <= num_cpus. For example, on an 8 vCPU task with workers=2, setting intra_threads=4 lets each worker fan out to four cores during inference, raising CPU utilization under load from ~25% to ~100%. Values above floor(num_cpus / workers) oversubscribe threads and hurt throughput.

After changing this value, let the startup probe run again rather than pinning cost-model coefficients, so the cost model captures any new scratch-buffer overhead.
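
The sizing rule above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of this crate: max_intra_threads and fits_cpu_budget are hypothetical helper names, and the core count is passed in as a plain number.

```rust
/// Largest `intra_threads` value that avoids oversubscription:
/// floor(num_cpus / workers), never below 1.
/// Hypothetical helper for illustration; not part of this crate's API.
fn max_intra_threads(num_cpus: usize, workers: usize) -> usize {
    // Both settings have a minimum effective value of 1.
    let workers = workers.max(1);
    (num_cpus / workers).max(1)
}

/// True when `workers * intra_threads` fits within the available cores.
/// Hypothetical helper for illustration; not part of this crate's API.
fn fits_cpu_budget(num_cpus: usize, workers: usize, intra_threads: usize) -> bool {
    workers.max(1).saturating_mul(intra_threads.max(1)) <= num_cpus
}
```

On a real host the core count could be taken from std::thread::available_parallelism(), which may differ from the vCPU allocation of a container.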

§max_batch: usize

Maximum number of input texts accepted in a single request.

Set with BGE_M3_MAX_BATCH. Defaults to 256. Minimum effective value is 1.

§max_seq_length: usize

Maximum sequence length (tokens) for a single text.

Set with BGE_M3_MAX_SEQ_LENGTH. Defaults to 8192 (BGE-M3’s published max). Range: [1, 8192]. Set lower to reduce memory footprint on constrained hardware.

The tokenizer will silently truncate any input exceeding this length. The probe and bin-packer use this as the upper bound when computing workspace costs.
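
As an illustration of the default and range described above, here is a minimal sketch of how such a value might be parsed. It assumes that unparsable values fall back to the default and that out-of-range values are clamped rather than rejected; the documentation does not state either behavior, and parse_max_seq_length is a hypothetical name.

```rust
/// Published maximum for BGE-M3, used as both default and upper bound.
const MAX_SEQ_LENGTH_LIMIT: usize = 8192;

/// Hypothetical parser for a raw `BGE_M3_MAX_SEQ_LENGTH` value.
/// Assumption: bad input falls back to the default, and values outside
/// [1, 8192] are clamped into that range.
fn parse_max_seq_length(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(MAX_SEQ_LENGTH_LIMIT)
        .clamp(1, MAX_SEQ_LENGTH_LIMIT)
}
```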

§idle_timeout: Option<Duration>

Duration of inactivity after which workers unload their model instances from memory.

Set with BGE_M3_IDLE_TIMEOUT_SECS. Defaults to 300 (5 minutes). Set to 0 to disable idle unloading entirely.

When unloaded, models are automatically reloaded on the next incoming request. The reload blocks the request until complete (~5–10 s from CoreML compiled cache; ~15–30 s cold).
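
The mapping from the environment value to Option<Duration> might look like the sketch below. It assumes that 0 is represented as None and that an unset or unparsable value yields the 300-second default; parse_idle_timeout is a hypothetical name, not a function of this crate.

```rust
use std::time::Duration;

/// Hypothetical parser for a raw `BGE_M3_IDLE_TIMEOUT_SECS` value.
/// Assumption: `0` disables idle unloading and is stored as `None`.
fn parse_idle_timeout(raw: Option<&str>) -> Option<Duration> {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(300); // documented default: 5 minutes
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}
```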

§model_variant: ModelVariant

ONNX model variant to load.

Set with BGE_M3_MODEL. Accepts "fp32", "fp16", or "int8". Defaults to "fp16" for fleet-wide embedding consistency and reduced RAM on Linux/Intel deployments. Set BGE_M3_MODEL=fp32 on Apple Silicon to recover CoreML GPU acceleration. See ModelVariant for per-variant performance and memory trade-offs.
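
A minimal sketch of parsing the three accepted strings follows. Variant is a stand-in enum defined only for this example, not the crate's ModelVariant, and the handling of unknown strings (returning an error) is an assumption.

```rust
use std::str::FromStr;

/// Stand-in for the crate's `ModelVariant`; defined here for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Variant {
    Fp32,
    Fp16,
    Int8,
}

impl FromStr for Variant {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            "fp32" => Ok(Variant::Fp32),
            "fp16" => Ok(Variant::Fp16),
            "int8" => Ok(Variant::Int8),
            // Assumption: unknown strings are reported as errors.
            other => Err(format!("unknown model variant: {other:?}")),
        }
    }
}

/// Applies the documented default ("fp16") when the variable is unset.
fn variant_or_default(raw: Option<&str>) -> Result<Variant, String> {
    match raw {
        Some(s) => s.parse(),
        None => Ok(Variant::Fp16),
    }
}
```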

§memory_safety_factor: f64

Fraction of estimated available workspace to actually use per worker.

Set with BGE_M3_MEMORY_SAFETY_FACTOR. Defaults to 0.7 (30% headroom for ORT arena fragmentation and spike overhead not captured by the probe). Range: 0.1..=1.0.
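
To make the effect of the factor concrete, the sketch below scales an available-memory estimate by the safety factor. usable_workspace_bytes is a hypothetical helper, and clamping out-of-range factors into 0.1..=1.0 is an assumption about how the range is enforced.

```rust
/// Hypothetical helper: workspace bytes a worker may use after headroom.
/// Assumption: factors outside 0.1..=1.0 are clamped into that range.
fn usable_workspace_bytes(available_bytes: u64, safety_factor: f64) -> u64 {
    let factor = safety_factor.clamp(0.1, 1.0);
    // Truncates toward zero, so the result never exceeds the scaled estimate.
    (available_bytes as f64 * factor) as u64
}
```

With the default of 0.7, roughly 30% of the estimate is left unused as headroom.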

§cost_model_override: Option<CostModel>

If Some, skip the startup probe and use this cost model directly.

Populated when:

BGE_M3_DISABLE_AUTO_BUDGET=1 is set (uses conservative defaults), or
BGE_M3_TOKEN_BUDGET is set (translates the legacy token count to a max_workspace_bytes using conservative a/b coefficients), or
BGE_M3_COST_MODEL_A and BGE_M3_COST_MODEL_B are both set with BGE_M3_AVAILABLE_MEMORY_BYTES (full explicit override).

§heartbeat_secs: u64

Interval (seconds) between periodic heartbeat log events.

Set with BGE_M3_HEARTBEAT_SECS. Defaults to 60. Set to 0 to disable heartbeat logging entirely.

Heartbeat events log RSS, live/loaded worker counts, queue depth, available request permits, and current probe status — useful for detecting slow memory leaks or queue saturation between requests.

Implementations§

impl Config

pub fn from_env() -> Self

pub(crate) fn from_lookup<F: Fn(&str) -> Option<String>>(lookup: F) -> Self
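
The from_env / from_lookup pair suggests a common testing pattern: the environment is read through a lookup closure so that tests can supply values without mutating process state. The sketch below illustrates that pattern on a two-field stand-in. MiniConfig and lookup_from_pairs are hypothetical names, and the fallback for unparsable values is an assumption; only the variable names and defaults come from the field documentation above.

```rust
use std::collections::HashMap;

/// Two-field stand-in for `Config`; defined here for illustration only.
#[derive(Debug, PartialEq)]
struct MiniConfig {
    bind_addr: String,
    workers: usize,
}

impl MiniConfig {
    /// Builds the config from any key-to-value lookup.
    fn from_lookup<F: Fn(&str) -> Option<String>>(lookup: F) -> Self {
        let workers = lookup("BGE_M3_WORKERS")
            // Assumption: unparsable values fall back to the default.
            .and_then(|v| v.trim().parse::<usize>().ok())
            .unwrap_or(2)
            .max(1); // minimum effective value is 1
        let bind_addr = lookup("BGE_M3_BIND").unwrap_or_else(|| "0.0.0.0:8081".to_string());
        MiniConfig { bind_addr, workers }
    }

    /// Reads the real process environment through the same code path.
    fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}

/// Test helper: turns a slice of pairs into a lookup closure.
fn lookup_from_pairs(pairs: &[(&str, &str)]) -> impl Fn(&str) -> Option<String> {
    let map: HashMap<String, String> = pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    move |key| map.get(key).cloned()
}
```

A test can then call MiniConfig::from_lookup(lookup_from_pairs(&[("BGE_M3_WORKERS", "4")])) and assert on the resulting fields, leaving the process environment untouched.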

Auto Trait Implementations§

impl Freeze for Config

impl RefUnwindSafe for Config

impl Send for Config

impl Sync for Config

impl Unpin for Config

impl UnsafeUnpin for Config

impl UnwindSafe for Config

Blanket Implementations§

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

fn in_current_span(self) -> Instrumented<Self>

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self> where F: FnOnce(&Self) -> bool,

impl<T> Pointable for T

const ALIGN: usize

type Init = T

unsafe fn init(init: <T as Pointable>::Init) -> usize

unsafe fn deref<'a>(ptr: usize) -> &'a T

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

unsafe fn drop(ptr: usize)

impl<T> PolicyExt for T where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P> where T: Policy<B, E>, P: Policy<B, E>,

fn or<P, B, E>(self, other: P) -> Or<T, P> where T: Policy<B, E>, P: Policy<B, E>,

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<V, T> VZip<V> for T where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self> where S: Into<Dispatch>,

fn with_current_subscriber(self) -> WithDispatch<Self>
