
Function run_readiness_probe 

pub async fn run_readiness_probe(
    init_handle: JoinHandle<Result<()>>,
    state: Arc<AppState>,
    cfg_max_seq: usize,
    cfg_workers: usize,
    cfg_safety: f64,
    cost_model_override: Option<CostModel>,
    cache_dir: PathBuf,
    model_variant_str: String,
    disable_probe_cache: bool,
) -> Result<()>

Runs after all workers finish loading their model instances.

§Sequence

  1. Wait for worker pool initialisation to finish.
  2. Read pool.model_rss_per_worker_bytes() — the median RSS delta measured inside each worker’s spawn_blocking closure around load_models(). Workers load sequentially (one at a time), so each delta reflects only that worker’s ORT session allocation with no parallel-load contamination.
  3. Detect available memory; compute per_worker_workspace via compute_workspace_budget. Fail fast if the budget is below the physics-based floor (cannot fit even one text at max_seq_length).
  4. Write static TuningInfo to OnceLock.
  5. Resolve the cost model — one of three paths:
    • cost-model override set: apply immediately, probe_status = Disabled.
    • EFS cache hit: apply cached (a, b) via ArcSwap, probe_status = CacheHit.
    • cache miss: set probe_status = Running, launch background probe task.
  6. Run dense + sparse readiness calls to confirm the worker pool is healthy.
  7. Flip state.ready = true; /health returns 200 ok from this point on. If the probe is still running in the background, the bin-packer uses conservative defaults until the ArcSwap is updated (typically ~120 s).
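
A minimal sketch of steps 2, 3 and 5, using only the standard library. Everything here is a hypothetical stand-in: the ProbeStatus and CostModel definitions, the sketch_workspace_budget formula (available memory times the safety factor, split evenly across workers) and the reading of disable_probe_cache as "skip the cache lookup" are assumptions for illustration, not the crate's actual code. The real compute_workspace_budget is not shown in this page.

```rust
/// Hypothetical stand-in for the crate's probe status values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeStatus {
    Disabled,
    CacheHit,
    Running,
}

/// Hypothetical cost model holding the (a, b) pair mentioned in step 5.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CostModel {
    pub a: f64,
    pub b: f64,
}

/// Step 2 (sketch): median of the per-worker RSS deltas, None for an empty pool.
pub fn median_rss_bytes(deltas: &[u64]) -> Option<u64> {
    if deltas.is_empty() {
        return None;
    }
    let mut sorted = deltas.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Midpoint of the two central values, written to avoid overflow.
        Some(sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2)
    }
}

/// Step 3 (sketch): assumed budget formula plus the fail-fast floor check.
pub fn sketch_workspace_budget(
    available_bytes: u64,
    workers: usize,
    safety: f64,
    floor_bytes: u64,
) -> Result<u64, String> {
    if workers == 0 {
        return Err("worker count must be non-zero".to_string());
    }
    // Assumption: keep a `safety` fraction of available memory, split evenly.
    let usable = (available_bytes as f64 * safety) as u64;
    let per_worker = usable / workers as u64;
    if per_worker < floor_bytes {
        return Err(format!(
            "per-worker budget {per_worker} B is below the floor {floor_bytes} B"
        ));
    }
    Ok(per_worker)
}

/// Step 5 (sketch): pick one of the three cost-model paths.
/// Returns the model to apply now (if any) and the resulting probe status.
pub fn resolve_cost_model(
    override_model: Option<CostModel>,
    cached: Option<CostModel>,
    disable_probe_cache: bool,
) -> (Option<CostModel>, ProbeStatus) {
    if let Some(model) = override_model {
        // Override wins; no probe is run.
        return (Some(model), ProbeStatus::Disabled);
    }
    match cached {
        // Assumption: the cache is consulted only when it is not disabled.
        Some(model) if !disable_probe_cache => (Some(model), ProbeStatus::CacheHit),
        // Cache miss: the caller launches the background probe and keeps
        // conservative defaults until the probe publishes a model.
        _ => (None, ProbeStatus::Running),
    }
}
```

In this sketch the Running path returns no model, which mirrors the note in step 7 that the bin-packer keeps conservative defaults until the background probe finishes.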

§Errors

  • Worker pool init panicked (JoinError) or returned an error from model loading.
  • Per-worker workspace budget falls below the physics floor (cannot fit even one text at max_seq_length). The orchestrator then restarts the container.
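
The first error case distinguishes a panic in the init task from an error it returned. A minimal sketch of that distinction follows, using std::thread::JoinHandle as a stand-in for the async JoinHandle in the signature (joining a thread reports a panic as Err, much as awaiting a task handle reports a JoinError). The InitError type and wait_for_init function are hypothetical names introduced for illustration.

```rust
use std::thread;

/// Hypothetical error type separating the two init failure modes.
#[derive(Debug, PartialEq)]
pub enum InitError {
    /// The init thread panicked before returning a result.
    Panicked,
    /// The init thread returned an error, e.g. from model loading.
    LoadFailed(String),
}

/// Wait for pool initialisation and flatten the nested result.
pub fn wait_for_init(
    handle: thread::JoinHandle<Result<(), String>>,
) -> Result<(), InitError> {
    match handle.join() {
        // Outer Err: the thread panicked.
        Err(_) => Err(InitError::Panicked),
        // Inner Err: the thread ran to completion but reported a failure.
        Ok(Err(msg)) => Err(InitError::LoadFailed(msg)),
        Ok(Ok(())) => Ok(()),
    }
}
```

Either failure would stop the function before the readiness flag is set, so the health endpoint never reports ready for a pool that failed to initialise.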