How it works
The HuggingFace model card describes a pipeline — tokenize, run the encoder, pool, normalize. Oracle's ONNX runtime, however, sees only a single graph. So we stitch the pipeline into one graph before handing it over.
The augmented graph
Every preset produces an ONNX file with exactly one string input and one float32 vector output.
The string is the raw document; the vector is ready to go into a VECTOR(dims, FLOAT32) column with cosine distance.
One input (pre_text), one output (embedding), four internal stages fused into a single ONNX graph.
Stage by stage
01 · Tokenizer as ONNX ops
Normally the HuggingFace tokenizer is a Python object. We wrap it with onnxruntime-extensions so the tokenization step becomes real ONNX operators inside the graph. That means Oracle's runtime does the string → token-id conversion — no Python interpreter needed at inference time.
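A toy sketch of what the in-graph tokenization computes. The real graph uses onnxruntime-extensions operator kernels, not Python; the hand-rolled vocab below is purely hypothetical and exists only to show the string → token-id mapping that those ops perform.

```python
# Toy illustration of "tokenization as graph ops". The vocab and special
# tokens here are made up; real graphs embed a full tokenizer kernel.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100, "hello": 7592, "world": 2088}

def tokenize(text: str) -> list[int]:
    """Map a raw string to token ids, the graph's first internal tensor."""
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in text.lower().split()]
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]

print(tokenize("Hello world"))  # [101, 7592, 2088, 102]
```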
02 · The transformer encoder
This is the unmodified encoder from the HF repo. We export it with onnx>=1.16 at opset 18, which Oracle's 23ai/26ai ONNX runtime can ingest. Attention, layer norms, residuals — nothing special, just the encoder a sentence-transformer ships with.
03 · Pooling
Two shapes show up in practice:
- Mean pooling — ReduceMean across the sequence axis, respecting the attention mask. Used by all-MiniLM, all-mpnet, e5, and nomic.
- CLS pooling — Gather(axis=1, indices=[0]) to grab the first token's hidden state. Used by BGE.
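The two pooling variants can be sketched in NumPy. This mirrors the math the ONNX ops perform (ReduceMean with the mask applied, and a Gather of the first token), with shapes (batch, seq, dim) for the hidden states and (batch, seq) for the mask:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked average over the sequence axis; padding rows contribute nothing."""
    m = mask[..., None].astype(hidden.dtype)          # (batch, seq, 1)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

def cls_pool(hidden: np.ndarray) -> np.ndarray:
    """Hidden state of the first ([CLS]) token."""
    return hidden[:, 0, :]

hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])  # last row is padding
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 3.]] -- the 9s are masked out
print(cls_pool(hidden))         # [[1. 2.]]
```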
The preset registry encodes the correct choice per model. If you're loading via --from-huggingface, you pass --pooling mean or --pooling cls yourself.
04 · L2 normalization
Assemble it from primitives Oracle's runtime already supports:
# conceptual; emitted as ONNX ops
norm = Max(Sqrt(ReduceSum(Pow(x, 2), axes=[-1])), eps)  # eps is a 1e-12 constant
out = Div(x, norm)
That Max(..., eps) is the one opinionated detail — it prevents divide-by-zero on empty or whitespace-only input.
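The same chain of primitives, written out in NumPy to show the divide-by-zero guard doing its job:

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Same primitives the graph uses: Pow -> ReduceSum -> Sqrt -> Max -> Div.
    norm = np.maximum(np.sqrt(np.sum(x ** 2, axis=-1, keepdims=True)), eps)
    return x / norm

print(l2_normalize(np.array([[3.0, 4.0]])))  # [[0.6 0.8]]
# A zero vector stays zero instead of producing NaN, thanks to the eps floor:
print(l2_normalize(np.array([[0.0, 0.0]])))  # [[0. 0.]]
```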
How Oracle ingests it
Three facts worth knowing:
- DBMS_VECTOR.LOAD_ONNX_MODEL has an overload that takes the model as a raw BLOB argument, so we never write the augmented graph to disk.
- The loader builds the bytes in memory, opens a connection, and streams them straight in.
- The metadata JSON that accompanies the model tells Oracle the input tensor name (pre_text) and output shape ([dims], name embedding). If those names don't line up with the graph's actual IO, the load fails — see #ora-20000.
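Assembling that companion metadata might look like the sketch below. The field names here are illustrative, not Oracle's exact metadata schema; the point is only that the declared IO (pre_text in, embedding with shape [dims] out) has to match the graph byte-for-byte:

```python
import json

def build_metadata(dims: int) -> str:
    # Hypothetical field names -- check Oracle's DBMS_VECTOR docs for the
    # real schema. What matters: these must match the graph's actual IO.
    meta = {
        "function": "embedding",
        "input": "pre_text",    # the graph's single string input
        "output": "embedding",  # the graph's single float32 output
        "shape": [dims],
    }
    return json.dumps(meta)

print(build_metadata(384))
```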
Once loaded, the model name is what you reference in SQL:
SELECT VECTOR_EMBEDDING(ALL_MINILM_L6_V2 USING :doc AS DATA)
FROM my_table;
Oracle invokes the graph row by row. For bulk loads, batch the :doc bind in your application code — the graph is single-row, but the runtime is happy to be called often.
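Application-side batching can be as simple as chunking the document list. The helper below is self-contained; the executemany call in the trailing comment is illustrative python-oracledb usage, with table and column names made up:

```python
def batches(docs, size):
    """Yield fixed-size chunks of the document list for bulk binds."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [f"doc {i}" for i in range(5)]
print([len(b) for b in batches(docs, 2)])  # [2, 2, 1]

# Hypothetical bulk load with python-oracledb (names illustrative):
# for chunk in batches(docs, 500):
#     cursor.executemany(
#         "INSERT INTO my_table (body, vec) VALUES "
#         "(:doc, VECTOR_EMBEDDING(ALL_MINILM_L6_V2 USING :doc AS DATA))",
#         [{"doc": d} for d in chunk],
#     )
```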
What this means for you
- No separate tokenizer deployment. The tokenizer lives in the graph.
- No Python on the DB host. Oracle's ONNX runtime handles the whole pipeline.
- Upgrading a model is a reload. The old blob gets dropped and replaced atomically.
- Vectors are already L2-normalized, so cosine distance and dot product give the same ranking.