How it works
The HuggingFace model card describes a pipeline — tokenize, run the encoder, pool, normalize. Oracle's ONNX runtime, however, sees only a single graph. So we stitch the pipeline into one graph before handing it over.
The augmented graph
Every preset produces an ONNX file with exactly one string input and one float32 vector output.
The string is the raw document; the vector is ready to go into a VECTOR(dims, FLOAT32) column with cosine distance.
One input (pre_text), one output (embedding), four internal stages fused into a single ONNX graph.
Stage by stage
01 · Tokenizer as ONNX ops
Normally the HuggingFace tokenizer is a Python object. We wrap it with onnxruntime-extensions so the tokenization step becomes real ONNX operators inside the graph. That means Oracle's runtime does the string → token-id conversion — no Python interpreter needed at inference time.
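A toy sketch of what the in-graph tokenization computes. The real graph uses onnxruntime-extensions operator kernels, not Python; the hand-rolled vocab below is purely hypothetical and exists only to show the string → token-id mapping that those ops perform.

```python
# Toy illustration of "tokenization as graph ops". The vocab and special
# tokens here are made up; real graphs embed a full tokenizer kernel.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100, "hello": 7592, "world": 2088}

def tokenize(text: str) -> list[int]:
    """Map a raw string to token ids, the graph's first internal tensor."""
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in text.lower().split()]
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]

print(tokenize("Hello world"))  # [101, 7592, 2088, 102]
```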
02 · The transformer encoder
This is the unmodified encoder from the HF repo. We export it with onnx>=1.16 at opset 18, which Oracle's 23ai/26ai ONNX runtime can ingest. Attention, layer norms, residuals — nothing special, just the encoder a sentence-transformer ships with.
03 · Pooling
Two shapes show up in practice:
- Mean pooling — ReduceMean across the sequence axis, respecting the attention mask. Used by all-MiniLM, all-mpnet, e5, and nomic.
- CLS pooling — Gather(axis=1, indices=[0]) to grab the first token's hidden state. Used by BGE.
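The two pooling variants can be sketched in NumPy. This mirrors the math the ONNX ops perform (ReduceMean with the mask applied, and a Gather of the first token), with shapes (batch, seq, dim) for the hidden states and (batch, seq) for the mask:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked average over the sequence axis; padding rows contribute nothing."""
    m = mask[..., None].astype(hidden.dtype)          # (batch, seq, 1)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

def cls_pool(hidden: np.ndarray) -> np.ndarray:
    """Hidden state of the first ([CLS]) token."""
    return hidden[:, 0, :]

hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])  # last row is padding
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 3.]] -- the 9s are masked out
print(cls_pool(hidden))         # [[1. 2.]]
```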
The preset registry encodes the correct choice per model. If you're loading via --from-huggingface, you pass --pooling mean or --pooling cls yourself.
04 · L2 normalization
Assemble it from primitives Oracle's runtime already supports:
# conceptual; emitted as ONNX ops
norm = Max(Sqrt(ReduceSum(Pow(x, 2), axes=[-1])), eps)  # eps is a 1e-12 constant
out = Div(x, norm)
That Max(..., eps) is the one opinionated detail — it prevents divide-by-zero on empty or whitespace-only input.
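The same chain of primitives, written out in NumPy to show the divide-by-zero guard doing its job:

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Same primitives the graph uses: Pow -> ReduceSum -> Sqrt -> Max -> Div.
    norm = np.maximum(np.sqrt(np.sum(x ** 2, axis=-1, keepdims=True)), eps)
    return x / norm

print(l2_normalize(np.array([[3.0, 4.0]])))  # [[0.6 0.8]]
# A zero vector stays zero instead of producing NaN, thanks to the eps floor:
print(l2_normalize(np.array([[0.0, 0.0]])))  # [[0. 0.]]
```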
How Oracle ingests it
Three facts worth knowing:
- DBMS_VECTOR.LOAD_ONNX_MODEL has an overload that takes the model as a raw BLOB argument, so we never write the augmented graph to disk.
- The loader builds the bytes in memory, opens a connection, and streams them straight in.
- The metadata JSON that accompanies the model tells Oracle the input tensor name (pre_text) and output shape ([dims], name embedding). If those names don't line up with the graph's actual IO, the load fails — see #ora-20000.
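Assembling that companion metadata might look like the sketch below. The field names here are illustrative, not Oracle's exact metadata schema; the point is only that the declared IO (pre_text in, embedding with shape [dims] out) has to match the graph byte-for-byte:

```python
import json

def build_metadata(dims: int) -> str:
    # Hypothetical field names -- check Oracle's DBMS_VECTOR docs for the
    # real schema. What matters: these must match the graph's actual IO.
    meta = {
        "function": "embedding",
        "input": "pre_text",    # the graph's single string input
        "output": "embedding",  # the graph's single float32 output
        "shape": [dims],
    }
    return json.dumps(meta)

print(build_metadata(384))
```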
Once loaded, the model name is what you reference in SQL:
SELECT VECTOR_EMBEDDING(ALL_MINILM_L6_V2 USING :doc AS DATA)
FROM my_table;
Oracle invokes the graph row by row. For bulk loads, batch the :doc bind in your application code — the graph is single-row, but the runtime is happy to be called often.
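Application-side batching can be as simple as chunking the document list. The helper below is self-contained; the executemany call in the trailing comment is illustrative python-oracledb usage, with table and column names made up:

```python
def batches(docs, size):
    """Yield fixed-size chunks of the document list for bulk binds."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [f"doc {i}" for i in range(5)]
print([len(b) for b in batches(docs, 2)])  # [2, 2, 1]

# Hypothetical bulk load with python-oracledb (names illustrative):
# for chunk in batches(docs, 500):
#     cursor.executemany(
#         "INSERT INTO my_table (body, vec) VALUES "
#         "(:doc, VECTOR_EMBEDDING(ALL_MINILM_L6_V2 USING :doc AS DATA))",
#         [{"doc": d} for d in chunk],
#     )
```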
What this means for you
- No separate tokenizer deployment. The tokenizer lives in the graph.
- No Python on the DB host. Oracle's ONNX runtime handles the whole pipeline.
- Upgrading a model is a reload. The old blob gets dropped and replaced atomically.
- Vectors are already L2-normalized, so cosine distance and dot product give the same ranking.