Enum ModelVariant

pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}

ONNX model variant to load.

Controlled by the BGE_M3_MODEL environment variable. Defaults to ModelVariant::Fp16 when the variable is unset.
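As an illustration of how such a setting could be resolved, here is a minimal sketch. It is not the crate's actual implementation. The enum is a local stand-in so the snippet compiles on its own, and `parse_variant` and `variant_from_env` are hypothetical helpers. The values `fp32` and `fp16` come from this page; the `int8` spelling, the case-insensitive matching, and the error on unknown values are assumptions.

```rust
use std::env;

/// Local stand-in for the documented enum, so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}

/// Hypothetical helper: maps a raw BGE_M3_MODEL value to a variant.
/// `None` (variable unset) falls back to the documented default, Fp16.
pub fn parse_variant(raw: Option<&str>) -> Result<ModelVariant, String> {
    // Assumption: surrounding whitespace and letter case are ignored.
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        None | Some("fp16") => Ok(ModelVariant::Fp16),
        Some("fp32") => Ok(ModelVariant::Fp32),
        // Assumed by analogy with fp32/fp16; this page does not spell it out.
        Some("int8") => Ok(ModelVariant::Int8),
        // Assumption: unknown values are rejected rather than silently defaulted.
        Some(other) => Err(format!("unrecognized BGE_M3_MODEL value: {other:?}")),
    }
}

/// Hypothetical helper: reads the environment variable and delegates to `parse_variant`.
pub fn variant_from_env() -> Result<ModelVariant, String> {
    parse_variant(env::var("BGE_M3_MODEL").ok().as_deref())
}
```

Separating the pure parsing step from the environment lookup keeps the mapping testable without mutating process-wide state.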

Variants

Fp32

BAAI/bge-m3 FP32 model (~2.16 GB per session).

Set BGE_M3_MODEL=fp32 to enable. Recommended for Apple Silicon CoreML deployments where latency is the primary constraint. The FP32 ONNX graph contains no Cast nodes, so ORT can dispatch the entire multi-head attention + FFN block to the GPU as one contiguous CoreML subgraph, which delivers 20–61% lower latency than the MLAS CPU baseline.

Not the default. Linux/Intel (MLAS-only) deployments should prefer ModelVariant::Fp16 for lower RAM and fleet-wide embedding consistency.

Fp16

Xenova/bge-m3 FP16 model (~1.08 GB per session). Default. Halves per-session memory relative to FP32 (~1.08 GB vs ~2.16 GB).

This is the fleet default: all Apple Silicon LaunchAgent deployments set BGE_M3_MODEL=fp16 explicitly, and the server default matches so that Linux/Docker deployments produce consistent embeddings without any additional configuration.

Latency caveat (CoreML only). The Xenova FP16 ONNX model contains FP16↔FP32 Cast nodes at every transformer-layer boundary. ORT's CoreML EP cannot fuse these into the attention/FFN subgraphs, so each Cast executes on the CPU and the transformer block never forms a single contiguous GPU subgraph. As a result, FP16 + CoreML EP runs 6–10× slower than FP32 + CoreML EP. On the MLAS CPU EP (Linux, Intel) the Cast overhead is also present, but the MLAS FP16 penalty (~6–9×) is the accepted trade-off for lower RAM and fleet consistency. Use BGE_M3_MODEL=fp32 on Apple Silicon to recover CoreML GPU acceleration.

Int8

Xenova/bge-m3 INT8 quantized model (~568 MB per session). Weights-only quantization; ORT dequantizes to f32 internally. Reduces peak memory by ~74% per worker vs FP32.
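The memory figures quoted for the three variants can be checked with a little arithmetic. The sketch below is illustrative only: the enum is a local stand-in, `approx_session_mb`, `savings_vs_fp32_percent`, and `max_sessions` are hypothetical helpers that are not part of the documented API, and it assumes 1 GB = 1000 MB and counts model weights only.

```rust
/// Local stand-in for the documented enum, so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}

impl ModelVariant {
    /// Approximate per-session footprint in MB, taken from the figures on
    /// this page (assumption: 1 GB = 1000 MB).
    pub fn approx_session_mb(self) -> u32 {
        match self {
            ModelVariant::Fp32 => 2160,
            ModelVariant::Fp16 => 1080,
            ModelVariant::Int8 => 568,
        }
    }

    /// Memory saved relative to FP32, rounded to the nearest whole percent.
    pub fn savings_vs_fp32_percent(self) -> u32 {
        let fp32 = f64::from(ModelVariant::Fp32.approx_session_mb());
        let this = f64::from(self.approx_session_mb());
        ((1.0 - this / fp32) * 100.0).round() as u32
    }
}

/// Hypothetical capacity estimate: how many sessions fit in a memory budget,
/// ignoring activations, tokenizer state, and other runtime overhead.
pub fn max_sessions(variant: ModelVariant, budget_mb: u32) -> u32 {
    budget_mb / variant.approx_session_mb()
}
```

Under these assumptions an 8000 MB budget holds 3 FP32 sessions, 7 FP16 sessions, or 14 INT8 sessions.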

Embedding quality validated: dense cosine similarity ≥ 0.963 vs the FP32 reference across a 184-text corpus, which is suitable for ANN search and semantic ranking. Avoid this variant where ranking must distinguish candidates whose similarity scores are less than 0.05 apart.

Use with MLAS (CPU EP) only. DequantizeLinear nodes fragment the CoreML execution plan in the same way as the FP16 Cast nodes; INT8 + CoreML EP runs 42–79% slower than INT8 + MLAS and gains no GPU benefit.
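The variant and execution-provider pairings that this page advises against could be surfaced at startup with a small lookup. This is a sketch under stated assumptions, not the crate's behavior: `ExecutionProvider` and `pairing_warning` are hypothetical names, the enum is a local stand-in, and the messages simply restate the caveats given above.

```rust
/// Local stand-in for the documented enum, so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}

/// Hypothetical enum naming the two execution providers discussed on this page.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ExecutionProvider {
    CoreMl,
    Mlas,
}

/// Hypothetical helper: returns a warning for pairings the documentation
/// flags as slow, and `None` for every other combination.
pub fn pairing_warning(
    variant: ModelVariant,
    ep: ExecutionProvider,
) -> Option<&'static str> {
    use ExecutionProvider::*;
    use ModelVariant::*;
    match (variant, ep) {
        // Cast nodes at layer boundaries prevent a contiguous GPU subgraph.
        (Fp16, CoreMl) => Some(
            "FP16 Cast nodes fragment the CoreML plan; expect 6–10× slower than FP32 + CoreML",
        ),
        // DequantizeLinear nodes fragment the plan in the same way.
        (Int8, CoreMl) => Some(
            "DequantizeLinear nodes fragment the CoreML plan; expect 42–79% slower than INT8 + MLAS",
        ),
        _ => None,
    }
}
```

Returning an `Option` lets the caller decide whether to log the message or refuse to start.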

Trait Implementations

impl Clone for ModelVariant

fn clone(&self) -> ModelVariant

fn clone_from(&mut self, source: &Self)

impl Debug for ModelVariant

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl Display for ModelVariant

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl PartialEq for ModelVariant

fn eq(&self, other: &ModelVariant) -> bool

fn ne(&self, other: &Rhs) -> bool

impl Copy for ModelVariant

impl Eq for ModelVariant

impl StructuralPartialEq for ModelVariant

Auto Trait Implementations

impl Freeze for ModelVariant

impl RefUnwindSafe for ModelVariant

impl Send for ModelVariant

impl Sync for ModelVariant

impl Unpin for ModelVariant

impl UnsafeUnpin for ModelVariant

impl UnwindSafe for ModelVariant

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<Q, K> Equivalent<K> for Q
where Q: Eq + ?Sized, K: Borrow<Q> + ?Sized,

fn equivalent(&self, key: &K) -> bool

impl<Q, K> Equivalent<K> for Q
where Q: Eq + ?Sized, K: Borrow<Q> + ?Sized,

fn equivalent(&self, key: &K) -> bool

impl<T> From<T> for T

fn from(t: T) -> T

impl<T> FromRef<T> for T
where T: Clone,

fn from_ref(input: &T) -> T

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

fn in_current_span(self) -> Instrumented<Self>

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<T> Pointable for T

const ALIGN: usize

type Init = T

unsafe fn init(init: <T as Pointable>::Init) -> usize

unsafe fn deref<'a>(ptr: usize) -> &'a T

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

unsafe fn drop(ptr: usize)

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

impl<T> ToCompactString for T
where T: Display,

fn try_to_compact_string(&self) -> Result<CompactString, ToCompactStringError>

fn to_compact_string(&self) -> CompactString

impl<T> ToOwned for T
where T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T> ToString for T
where T: Display + ?Sized,

fn to_string(&self) -> String

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,