Function run_readiness_probe

pub async fn run_readiness_probe(
    init_handle: JoinHandle<Result<()>>,
    state: Arc<AppState>,
    cfg_max_seq: usize,
    cfg_workers: usize,
    cfg_safety: f64,
    cost_model_override: Option<CostModel>,
    cache_dir: PathBuf,
    model_variant_str: String,
    disable_probe_cache: bool,
) -> Result<()>

Runs after all workers finish loading their model instances.

§Sequence

1. Wait for worker pool initialisation to finish.
2. Read pool.model_rss_per_worker_bytes(): the median of the RSS deltas, each measured inside a worker’s spawn_blocking closure around load_models(). Workers load sequentially (one at a time), so each delta reflects only that worker’s ORT session allocation, with no parallel-load contamination.
3. Detect available memory and compute per_worker_workspace via compute_workspace_budget. Fail fast if the budget is below the physics-based floor (cannot fit even one text at max_seq_length).
4. Write the static TuningInfo to its OnceLock.
5. Resolve the cost model via one of three paths:
   - cost-model override set: apply immediately, probe_status = Disabled.
   - EFS cache hit: apply cached (a, b) via ArcSwap, probe_status = CacheHit.
   - cache miss: set probe_status = Running, launch background probe task.
6. Run dense and sparse readiness calls to confirm the worker pool is healthy.
7. Set state.ready = true; /health returns 200 OK from this point on. If the probe is still running in the background, the bin-packer uses conservative defaults until the ArcSwap is updated (typically after ~120 s).
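
The three-way cost-model resolution in the sequence above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the `resolve_cost_model` helper, the `CostModel` field names, the shape of `ProbeStatus`, and the way `disable_probe_cache` is treated as a forced cache miss are all assumptions. Only the three status names and the `(a, b)` pair come from the description.

```rust
/// Two coefficients; the description only names them `(a, b)`.
/// Field names and types are assumptions for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CostModel {
    pub a: f64,
    pub b: f64,
}

/// Mirrors the three `probe_status` values named in the sequence.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeStatus {
    Disabled,
    CacheHit,
    Running,
}

/// Hypothetical helper: returns the cost model to apply immediately (if any)
/// and the status to publish. `None` means the caller keeps its conservative
/// defaults and launches the background probe.
pub fn resolve_cost_model(
    cost_model_override: Option<CostModel>,
    cached: Option<CostModel>,
    disable_probe_cache: bool,
) -> (Option<CostModel>, ProbeStatus) {
    // Path 1: an explicit override always wins and disables the probe.
    if let Some(model) = cost_model_override {
        return (Some(model), ProbeStatus::Disabled);
    }
    // Path 2: a cached model is reused.
    // Assumption: `disable_probe_cache` makes the lookup behave like a miss.
    if !disable_probe_cache {
        if let Some(model) = cached {
            return (Some(model), ProbeStatus::CacheHit);
        }
    }
    // Path 3: nothing to apply yet; the probe has to run in the background.
    (None, ProbeStatus::Running)
}
```

Returning the status together with the optional model keeps the decision in one place; the caller would then store the model (for example through the ArcSwap mentioned above) and spawn the probe only when the status is `Running`.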

§Errors

- Worker pool init panicked (JoinError) or returned an error from model loading.
- Per-worker workspace budget falls below the physics floor (cannot fit even one text at max_seq_length); the orchestrator then restarts the container.
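
The two failure conditions can be modelled as follows. This is a hedged sketch under stated assumptions: `ReadinessError`, `wait_for_init`, and `check_budget` are hypothetical names, a `std::thread::JoinHandle` stands in for the async `JoinHandle<Result<()>>` in the signature, and `String` stands in for the crate's error type. How the real budget and floor are computed is not shown here.

```rust
use std::fmt;
use std::thread::JoinHandle;

/// Hypothetical error type covering the documented failure conditions.
#[derive(Debug, PartialEq)]
pub enum ReadinessError {
    /// The init task panicked (the analogue of a JoinError).
    InitPanicked,
    /// The init task ran to completion but model loading failed.
    InitFailed(String),
    /// The per-worker workspace budget is below the floor.
    BudgetBelowFloor { budget_bytes: u64, floor_bytes: u64 },
}

impl fmt::Display for ReadinessError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReadinessError::InitPanicked => write!(f, "worker pool init panicked"),
            ReadinessError::InitFailed(msg) => write!(f, "worker pool init failed: {msg}"),
            ReadinessError::BudgetBelowFloor { budget_bytes, floor_bytes } => write!(
                f,
                "workspace budget {budget_bytes} B is below the floor {floor_bytes} B"
            ),
        }
    }
}

impl std::error::Error for ReadinessError {}

/// Std-thread stand-in for awaiting the init handle: the outer layer of the
/// join result reports a panic, the inner layer reports a loading error.
pub fn wait_for_init(handle: JoinHandle<Result<(), String>>) -> Result<(), ReadinessError> {
    match handle.join() {
        Err(_panic_payload) => Err(ReadinessError::InitPanicked),
        Ok(Err(msg)) => Err(ReadinessError::InitFailed(msg)),
        Ok(Ok(())) => Ok(()),
    }
}

/// Fail fast when the budget cannot fit even one text at max_seq_length.
/// Both values are inputs here; their computation is outside this sketch.
pub fn check_budget(budget_bytes: u64, floor_bytes: u64) -> Result<u64, ReadinessError> {
    if budget_bytes < floor_bytes {
        Err(ReadinessError::BudgetBelowFloor { budget_bytes, floor_bytes })
    } else {
        Ok(budget_bytes)
    }
}
```

Keeping the panic case and the loading-error case as separate variants lets a caller log them differently, even though both abort startup.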
