pub async fn run_readiness_probe(
    init_handle: JoinHandle<Result<()>>,
    state: Arc<AppState>,
    cfg_max_seq: usize,
    cfg_workers: usize,
    cfg_safety: f64,
    cost_model_override: Option<CostModel>,
    cache_dir: PathBuf,
    model_variant_str: String,
    disable_probe_cache: bool,
) -> Result<()>
Runs after all workers finish loading their model instances.
§Sequence
- Wait for worker pool initialisation to finish.
- Read `pool.model_rss_per_worker_bytes()`: the median RSS delta measured inside each worker’s `spawn_blocking` closure around `load_models()`. Workers load sequentially (one at a time), so each delta reflects only that worker’s ORT session allocation, with no parallel-load contamination.
- Detect available memory; compute `per_worker_workspace` via `compute_workspace_budget`. Fail fast if the budget is below the physics-based floor (cannot fit even one text at `max_seq_length`).
- Write static `TuningInfo` to `OnceLock`.
- Resolve the cost model via one of three paths:
  - cost-model override set: apply immediately, `probe_status = Disabled`.
  - EFS cache hit: apply cached `(a, b)` via `ArcSwap`, `probe_status = CacheHit`.
  - cache miss: set `probe_status = Running`, launch background probe task.
- Run dense + sparse readiness calls to confirm the worker pool is healthy.
- Flip `state.ready = true`; `/health` returns `200 ok` from this point on. If the probe is still running in the background, the bin-packer uses conservative defaults until the `ArcSwap` is updated (typically ~120 s).
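The three-way cost-model resolution in the sequence above can be sketched as a pure function. This is a minimal illustration using only the standard library, not the crate's actual code: the `CostModel` fields, the `ProbeStatus` enum shape, the `CONSERVATIVE_DEFAULT` values, and the reading of `disable_probe_cache` as "skip the cache lookup" are all assumptions. The real function applies the result through `ArcSwap` and launches a background task on a miss; here the caller simply receives the model to use in the meantime.

```rust
/// Two cost coefficients `(a, b)`. The field names are assumed; the docs
/// only say that a cached `(a, b)` pair is applied.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CostModel {
    pub a: f64,
    pub b: f64,
}

/// Hypothetical mirror of the `probe_status` values named in the docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeStatus {
    Disabled,
    CacheHit,
    Running,
}

/// Placeholder values standing in for the "conservative defaults" the
/// bin-packer uses while the probe runs. The real values are not documented.
pub const CONSERVATIVE_DEFAULT: CostModel = CostModel { a: 1.0, b: 0.0 };

/// Picks the cost model and the probe status in priority order:
/// override, then cache, then background probe.
pub fn resolve_cost_model(
    cost_model_override: Option<CostModel>,
    cached: Option<CostModel>,
    disable_probe_cache: bool,
) -> (CostModel, ProbeStatus) {
    // Path 1: an explicit override wins and no probe is run.
    if let Some(model) = cost_model_override {
        return (model, ProbeStatus::Disabled);
    }
    // Path 2: use the cached coefficients unless the cache is disabled
    // (assumed meaning of `disable_probe_cache`).
    if !disable_probe_cache {
        if let Some(model) = cached {
            return (model, ProbeStatus::CacheHit);
        }
    }
    // Path 3: cache miss. The caller would start the background probe and
    // run on conservative defaults until it completes.
    (CONSERVATIVE_DEFAULT, ProbeStatus::Running)
}
```

Keeping the decision separate from the side effects (the `ArcSwap` store and the task spawn) makes the priority order easy to unit-test without a worker pool.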
§Errors
- Worker pool init panicked (`JoinError`) or returned an error from model loading.
- Per-worker workspace budget falls below the physics floor (cannot fit even one text at `max_seq_length`; the container is restarted by the orchestrator).
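The error cases above could be separated roughly as follows. This sketch swaps the async `JoinHandle` for `std::thread::JoinHandle` so that it runs with the standard library alone; the nesting is analogous (an outer join failure for a panic, an inner `Result` for a load error). `ReadinessError`, `join_init`, and `check_budget` are hypothetical names, and the floor is passed in as a plain number because the docs do not say how it is computed.

```rust
use std::thread::JoinHandle;

/// Hypothetical error type covering the documented failure cases.
#[derive(Debug, PartialEq)]
pub enum ReadinessError {
    /// The init task panicked (the `JoinError` case in the async version).
    InitPanicked,
    /// The init task finished but model loading returned an error.
    InitFailed(String),
    /// The per-worker workspace budget is below the minimum floor.
    BudgetBelowFloor { budget_bytes: u64, floor_bytes: u64 },
}

/// Waits for initialisation and flattens the nested result:
/// the outer `Err` is a panic, the inner `Err` is a load failure.
pub fn join_init(handle: JoinHandle<Result<(), String>>) -> Result<(), ReadinessError> {
    match handle.join() {
        Err(_panic_payload) => Err(ReadinessError::InitPanicked),
        Ok(Err(msg)) => Err(ReadinessError::InitFailed(msg)),
        Ok(Ok(())) => Ok(()),
    }
}

/// Fails fast when the budget cannot hold even one maximum-length text.
pub fn check_budget(budget_bytes: u64, floor_bytes: u64) -> Result<(), ReadinessError> {
    if budget_bytes < floor_bytes {
        Err(ReadinessError::BudgetBelowFloor { budget_bytes, floor_bytes })
    } else {
        Ok(())
    }
}
```

Returning an error instead of retrying matches the documented behaviour: the process exits and the orchestrator restarts the container.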