pub struct Config {
    pub cache_dir: String,
    pub bind_addr: String,
    pub workers: usize,
    pub intra_threads: usize,
    pub max_batch: usize,
    pub max_seq_length: usize,
    pub idle_timeout: Option<Duration>,
    pub model_variant: ModelVariant,
    pub memory_safety_factor: f64,
    pub cost_model_override: Option<CostModel>,
    pub heartbeat_secs: u64,
}
Runtime configuration loaded from environment variables.
All fields are read once at startup via Config::from_env. Changes to
environment variables after startup have no effect.
Fields
cache_dir: String
Path to the directory where ONNX model files are cached.
Set with BGE_M3_CACHE_DIR. Defaults to /cache.
bind_addr: String
TCP bind address for the HTTP server.
Set with BGE_M3_BIND. Defaults to 0.0.0.0:8081.
The 0.0.0.0 default is intentional for Docker container deployments.
workers: usize
Number of embedding worker threads to spawn.
Set with BGE_M3_WORKERS. Defaults to 2. Minimum effective value is 1.
Each worker loads its own model instance.
intra_threads: usize
Number of intra-op threads each ORT session may use for a single
session.run() call (matmul / attention kernels).
Set with BGE_M3_INTRA_THREADS. Defaults to 1. Minimum effective
value is 1.
The default of 1 preserves predictable per-worker RSS (the workspace
probe and quadratic cost model are calibrated against single-threaded
MLAS runs). Raise this on under-utilized hosts where
BGE_M3_WORKERS * intra_threads <= num_cpus. For example, on an 8 vCPU task with workers=2,
setting intra_threads=4 lets each worker fan out to four cores during
inference, taking CPU utilization from ~25% to ~100% under load. Going
above floor(num_cpus / workers) causes thread oversubscription and
hurts throughput.
Re-run the startup probe (do not pin coefficients) after changing this value so the cost model captures any new scratch-buffer overhead.
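The sizing rule above can be sketched as a small helper. This is an illustration only, not code from the crate: the function names max_intra_threads, fits, and detected_cpus are hypothetical, and the real service may detect core counts differently (for example, container CPU quotas may not match what the standard library reports).

```rust
use std::thread;

/// Largest intra-op thread count that avoids oversubscription:
/// floor(num_cpus / workers), never below 1.
/// (Hypothetical helper; not part of the documented API.)
fn max_intra_threads(num_cpus: usize, workers: usize) -> usize {
    let workers = workers.max(1); // minimum effective worker count is 1
    (num_cpus / workers).max(1)
}

/// True when workers * intra_threads fits within the available cores.
fn fits(num_cpus: usize, workers: usize, intra_threads: usize) -> bool {
    workers.max(1).saturating_mul(intra_threads.max(1)) <= num_cpus
}

/// Logical CPUs visible to this process, falling back to 1 if the
/// standard library cannot determine the count.
fn detected_cpus() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

With the 8 vCPU, workers=2 case from the text, max_intra_threads(8, 2) yields 4, and fits(8, 2, 5) is false because 2 * 5 exceeds 8.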
max_batch: usize
Maximum number of input texts accepted in a single request.
Set with BGE_M3_MAX_BATCH. Defaults to 256. Minimum effective value is 1.
max_seq_length: usize
Maximum sequence length (tokens) for a single text.
Set with BGE_M3_MAX_SEQ_LENGTH. Defaults to 8192 (BGE-M3’s published max).
Range: [1, 8192]. Set lower to reduce memory footprint on constrained hardware.
The tokenizer will silently truncate any input exceeding this length. The probe and bin-packer use this as the upper bound when computing workspace costs.
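One plausible way to apply a default and a range such as [1, 8192] is shown below. It is a sketch under assumptions: the helper name parse_clamped is hypothetical, and the documentation does not say whether the crate clamps out-of-range values or rejects them, so clamping here is a guess.

```rust
/// Parse `raw` as a usize, fall back to `default` when the value is
/// missing or malformed, then clamp the result into [min, max].
/// (Hypothetical helper; clamping behavior is an assumption.)
fn parse_clamped(raw: Option<&str>, default: usize, min: usize, max: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(default)
        .clamp(min, max)
}

/// Resolve a max_seq_length-style setting using the documented
/// default (8192) and range [1, 8192].
fn resolve_max_seq_length(raw: Option<&str>) -> usize {
    parse_clamped(raw, 8192, 1, 8192)
}
```

Under these assumptions, an unset variable resolves to 8192, "512" resolves to 512, and "0" is raised to the minimum of 1.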
idle_timeout: Option<Duration>
Duration of inactivity after which workers unload their model instances from memory.
Set with BGE_M3_IDLE_TIMEOUT_SECS. Defaults to 300 (5 minutes).
Set to 0 to disable idle unloading entirely.
When unloaded, models are automatically reloaded on the next incoming request.
The reload blocks the request until complete (~5–10 s from CoreML compiled
cache; ~15–30 s cold).
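The mapping from BGE_M3_IDLE_TIMEOUT_SECS to an Option<Duration> might look like the following sketch. The function name parse_idle_timeout is hypothetical, and it assumes that None represents "idle unloading disabled" and that malformed values fall back to the 300-second default.

```rust
use std::time::Duration;

/// Translate the raw env-var value into an idle timeout.
/// Assumptions: missing or malformed input uses the documented default
/// of 300 seconds, and 0 disables unloading (represented as None).
fn parse_idle_timeout(raw: Option<&str>) -> Option<Duration> {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(300);
    if secs == 0 {
        None // idle unloading disabled
    } else {
        Some(Duration::from_secs(secs))
    }
}
```

Representing "disabled" as None lets callers skip the idle check entirely instead of comparing against a sentinel value.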
model_variant: ModelVariant
ONNX model variant to load.
Set with BGE_M3_MODEL. Accepts "fp32", "fp16", or "int8".
Defaults to "fp16" for fleet-wide embedding consistency and reduced RAM
on Linux/Intel deployments. Set BGE_M3_MODEL=fp32 on Apple Silicon to
recover CoreML GPU acceleration. See ModelVariant for per-variant
performance and memory trade-offs.
memory_safety_factor: f64
Fraction of estimated available workspace to actually use per worker.
Set with BGE_M3_MEMORY_SAFETY_FACTOR. Defaults to 0.7 (30% headroom
for ORT arena fragmentation and spike overhead not captured by the probe).
Range: 0.1..=1.0.
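As a rough illustration of how a safety factor scales a workspace estimate, consider the sketch below. The function name usable_workspace_bytes is hypothetical, and the clamping of out-of-range or NaN factors is an assumption; the crate may validate this setting differently.

```rust
/// Scale an estimated per-worker workspace by the safety factor.
/// Assumptions: factors outside 0.1..=1.0 are clamped into range and
/// NaN falls back to the documented default of 0.7.
fn usable_workspace_bytes(available_bytes: u64, safety_factor: f64) -> u64 {
    let factor = if safety_factor.is_nan() {
        0.7
    } else {
        safety_factor.clamp(0.1, 1.0)
    };
    // Floating-point product truncated toward zero; precision loss is
    // negligible at realistic memory sizes.
    (available_bytes as f64 * factor) as u64
}
```

With a factor of 0.5, an estimate of 1000 bytes yields a usable budget of 500 bytes; a factor above 1.0 is treated as 1.0, so the budget never exceeds the estimate.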
cost_model_override: Option<CostModel>
If Some, skip the startup probe and use this cost model directly.
Populated when:
- BGE_M3_DISABLE_AUTO_BUDGET=1 is set (uses conservative defaults), or
- BGE_M3_TOKEN_BUDGET is set (translates the legacy token count to a max_workspace_bytes using conservative a/b coefficients), or
- BGE_M3_COST_MODEL_A and BGE_M3_COST_MODEL_B are both set with BGE_M3_AVAILABLE_MEMORY_BYTES (full explicit override).
heartbeat_secs: u64
Interval (seconds) between periodic heartbeat log events.
Set with BGE_M3_HEARTBEAT_SECS. Defaults to 60.
Set to 0 to disable heartbeat logging entirely.
Heartbeat events log RSS, live/loaded worker counts, queue depth, available request permits, and current probe status — useful for detecting slow memory leaks or queue saturation between requests.
Implementations
impl Config
pub fn from_env() -> Self
Creates a Config by reading environment variables.
Unrecognized or missing variables fall back to their defaults.
pub(crate) fn from_lookup<F: Fn(&str) -> Option<String>>(lookup: F) -> Self
Creates a Config by resolving each setting through lookup.
lookup receives an env-var name and returns its value if set, or
None to fall back to the default for that setting. Used by
Config::from_env with the real environment and in tests with a
closure over a HashMap.
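The lookup-closure pattern described above can be illustrated with a reduced stand-in type. MiniConfig and config_from_map are hypothetical names introduced for this sketch; the real Config has more fields and its parsing details may differ. Only the env-var names and defaults are taken from the field documentation.

```rust
use std::collections::HashMap;

/// Hypothetical two-field stand-in for the real Config.
#[derive(Debug, PartialEq)]
struct MiniConfig {
    bind_addr: String,
    workers: usize,
}

impl MiniConfig {
    /// Resolve each setting through `lookup`, using defaults when it
    /// returns None or the value does not parse.
    fn from_lookup<F: Fn(&str) -> Option<String>>(lookup: F) -> Self {
        let bind_addr = lookup("BGE_M3_BIND").unwrap_or_else(|| "0.0.0.0:8081".to_string());
        let workers = lookup("BGE_M3_WORKERS")
            .and_then(|v| v.trim().parse::<usize>().ok())
            .unwrap_or(2)
            .max(1); // minimum effective value is 1
        MiniConfig { bind_addr, workers }
    }

    /// Production path: the lookup reads the real process environment.
    fn from_env() -> Self {
        Self::from_lookup(|name| std::env::var(name).ok())
    }
}

/// Test path: the lookup is a closure over an in-memory map, so tests
/// never need to mutate process-wide environment variables.
fn config_from_map(vars: &HashMap<&str, &str>) -> MiniConfig {
    MiniConfig::from_lookup(|name| vars.get(name).map(|v| v.to_string()))
}
```

Routing both paths through one function keeps default handling in a single place and makes tests independent of the process environment, which matters because Rust runs tests in parallel threads by default.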
Auto Trait Implementations
impl Freeze for Config
impl RefUnwindSafe for Config
impl Send for Config
impl Sync for Config
impl Unpin for Config
impl UnsafeUnpin for Config
impl UnwindSafe for Config
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.