pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}
ONNX model variant to load.
Controlled by BGE_M3_MODEL. Defaults to ModelVariant::Fp16.
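As an illustration of the selection described above, the following is a minimal sketch of how a `BGE_M3_MODEL` value might be mapped to a variant. It is not the crate's actual parser. The enum is redeclared locally so the sketch compiles on its own, `variant_from_env_value` and `variant_from_env` are hypothetical names, and three behaviours are assumptions: the `int8` spelling (by analogy with `fp32` and `fp16`), case-insensitive matching, and falling back to the default on unrecognized values.

```rust
/// Local stand-in for the documented enum so the sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}

/// Hypothetical helper: maps the raw value of `BGE_M3_MODEL` to a variant.
/// An unset variable yields the documented default (Fp16). Treating
/// unrecognized values the same way is an assumption of this sketch.
pub fn variant_from_env_value(raw: Option<&str>) -> ModelVariant {
    // Trimming and lowercasing are assumptions, not documented behaviour.
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("fp32") => ModelVariant::Fp32,
        Some("int8") => ModelVariant::Int8, // spelling assumed by analogy
        _ => ModelVariant::Fp16,            // covers "fp16", unset, and unknown
    }
}

/// Hypothetical helper: resolves the variant from the process environment.
pub fn variant_from_env() -> ModelVariant {
    variant_from_env_value(std::env::var("BGE_M3_MODEL").ok().as_deref())
}
```

Separating the pure mapping from the environment lookup keeps the mapping testable without mutating process-wide state.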
Variants
Fp32
BAAI/bge-m3 FP32 model (~2.16 GB per session).
Set BGE_M3_MODEL=fp32 to enable. Recommended for Apple Silicon CoreML
deployments where latency is the primary constraint: the FP32 ONNX graph
contains no Cast nodes, so ORT can dispatch the entire multi-head
attention + FFN block as one contiguous CoreML subgraph to the GPU —
delivering 20–61% lower latency than the MLAS CPU baseline.
Not the default. Linux/Intel (MLAS-only) deployments should prefer
ModelVariant::Fp16 for lower RAM and fleet-wide embedding consistency.
Fp16
Xenova/bge-m3 FP16 model (~1.08 GB per session). Default. Halves per-session memory relative to FP32 (~2.16 GB).
This is the fleet default: all Apple Silicon LaunchAgent deployments set
BGE_M3_MODEL=fp16 explicitly, and the server default matches so that
Linux/Docker deployments produce consistent embeddings without any
additional configuration.
Latency caveat (CoreML only). The Xenova FP16 ONNX model contains
FP16↔FP32 Cast nodes at every transformer-layer boundary. ORT’s CoreML EP
cannot fuse these into the attention/FFN subgraphs; each Cast executes on
CPU and the transformer block never forms a single contiguous GPU subgraph.
Result: FP16 + CoreML EP runs 6–10× slower than FP32 + CoreML EP. On the
MLAS/CPU EP (Linux, Intel) the Cast overhead is also present; the
resulting FP16 penalty (~6–9×) is accepted as the trade-off for lower RAM
and fleet consistency. Use BGE_M3_MODEL=fp32 on Apple Silicon to recover
CoreML GPU acceleration.
Int8
Xenova/bge-m3 INT8 quantized model (~568 MB per session). Weights-only quantization; ORT dequantizes to f32 internally. Reduces peak memory by ~74% per worker vs FP32.
Embedding quality validated: dense cosine similarity ≥ 0.963 vs FP32 reference across a 184-text corpus — suitable for ANN search and semantic ranking. Avoid for applications requiring ranking precision within very small similarity margins (< 0.05 apart).
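The validation harness behind the 0.963 figure is not shown here. As a rough sketch of the metric, the code below computes cosine similarity between a reference (FP32) embedding and a candidate (INT8) embedding and checks a set of pairs against the threshold. The function names are hypothetical, and reading the bound as a per-text minimum (rather than, say, a mean) is an assumption.

```rust
/// Threshold quoted in the Int8 documentation.
pub const MIN_COSINE: f32 = 0.963;

/// Cosine similarity between two dense embeddings.
/// Returns `None` if the lengths differ, the inputs are empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate in f64 to limit rounding error on long vectors.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (f64::from(*x), f64::from(*y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// Hypothetical check: true when both slices have the same length and
/// every reference/candidate pair reaches `MIN_COSINE`.
pub fn all_pairs_meet_threshold(reference: &[Vec<f32>], candidate: &[Vec<f32>]) -> bool {
    reference.len() == candidate.len()
        && reference
            .iter()
            .zip(candidate)
            .all(|(r, c)| cosine_similarity(r, c).map_or(false, |s| s >= MIN_COSINE))
}
```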
Use with MLAS (CPU EP) only. DequantizeLinear nodes fragment the
CoreML execution plan identically to FP16 Cast nodes; INT8 + CoreML EP
runs 42–79% slower than INT8 + MLAS with no GPU benefit.
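The guidance across the three variants can be condensed into a small decision helper. The sketch below is one possible reading of it, not code from this crate: `Backend`, `Priority`, `recommend`, and the methods on the locally redeclared enum are all hypothetical, the megabyte figures are the approximate per-session sizes quoted above, and `fits_budget` assumes one session per worker.

```rust
/// Local stand-in for the documented enum so the sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelVariant {
    Fp32,
    Fp16,
    Int8,
}

/// Hypothetical: the ORT execution provider a deployment runs on.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    CoreMl,
    MlasCpu,
}

/// Hypothetical: the constraint that matters most for the deployment.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Priority {
    Latency,
    Memory,
    FleetConsistency,
}

impl ModelVariant {
    /// Approximate per-session footprint in MB, taken from the variant docs.
    pub fn approx_session_mb(self) -> u32 {
        match self {
            ModelVariant::Fp32 => 2160,
            ModelVariant::Fp16 => 1080,
            ModelVariant::Int8 => 568,
        }
    }

    /// Memory saved relative to FP32, in percent.
    pub fn saving_vs_fp32_pct(self) -> f64 {
        let fp32 = f64::from(ModelVariant::Fp32.approx_session_mb());
        (1.0 - f64::from(self.approx_session_mb()) / fp32) * 100.0
    }

    /// Whether `workers` sessions fit in `budget_mb` (one session per worker assumed).
    pub fn fits_budget(self, workers: u32, budget_mb: u64) -> bool {
        u64::from(workers) * u64::from(self.approx_session_mb()) <= budget_mb
    }
}

/// One reading of the documented guidance: FP32 for latency-bound CoreML,
/// INT8 for memory-bound MLAS, and the FP16 default everywhere else.
pub fn recommend(backend: Backend, priority: Priority) -> ModelVariant {
    match (backend, priority) {
        (Backend::CoreMl, Priority::Latency) => ModelVariant::Fp32,
        (Backend::MlasCpu, Priority::Memory) => ModelVariant::Int8,
        _ => ModelVariant::Fp16,
    }
}
```

INT8 is never returned for CoreML here because the documentation reports it running slower on the CoreML EP than on MLAS with no GPU benefit.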
Trait Implementations
impl Clone for ModelVariant
fn clone(&self) -> ModelVariant
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for ModelVariant
impl Display for ModelVariant
impl PartialEq for ModelVariant
impl Copy for ModelVariant
impl Eq for ModelVariant
impl StructuralPartialEq for ModelVariant
Auto Trait Implementations
impl Freeze for ModelVariant
impl RefUnwindSafe for ModelVariant
impl Send for ModelVariant
impl Sync for ModelVariant
impl Unpin for ModelVariant
impl UnsafeUnpin for ModelVariant
impl UnwindSafe for ModelVariant
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<Q, K> Equivalent<K> for Q
fn equivalent(&self, key: &K) -> bool
impl<Q, K> Equivalent<K> for Q
fn equivalent(&self, key: &K) -> bool
Compare self to key and return true if they are equal.
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<T> PolicyExt for T
where
    T: ?Sized,
impl<T> ToCompactString for T
where
    T: Display,
fn try_to_compact_string(&self) -> Result<CompactString, ToCompactStringError>
Fallible version of ToCompactString::to_compact_string().
fn to_compact_string(&self) -> CompactString
Converts the given value to a CompactString.