onnx2oracle

For Coding Agents

This page teaches a coding agent how to load any sentence-transformer ONNX model into an Oracle AI Database 23ai or 26ai instance using raw Python and SQL. Paste this URL into your agent, give it a target model plus DSN, and it'll write the code itself.

How to read this page. If you're a human, copy the prompt block below and point your agent at this URL. If you're the agent, read top-to-bottom and write the code yourself. The manual is self-contained: no outside lookups needed.

Prompt block

Paste this into Claude, Codex, Gemini, or any other coding agent. Fill in the bracketed target details before sending.

You are loading an ONNX embedding model into an Oracle AI Database
26ai instance.

Read the manual at:
https://jasperan.github.io/onnx2oracle/for-agents.html

Target: <describe your model (e.g. "sentence-transformers/all-MiniLM-L6-v2")
and DSN (user/pw@host:port/service)>.

Write the code yourself following the manual. Don't add external CLI
dependencies. Use python-oracledb, onnx, onnxruntime, onnxruntime-extensions,
transformers, and huggingface_hub only. Run the verification query at the
end and report its output back.

Prerequisites

Confirm each of the checks below before writing any code. If a check fails, stop and tell the user exactly which one.

Local evidence loop. In an onnx2oracle checkout:

- scripts/run_real_db_integration.sh proves the pipeline against a real Oracle 26ai Free container: it starts Docker, records the database banner and image id, loads MiniLM, verifies VECTOR_EMBEDDING, and runs the live integration test. Adapt the run with ORACLE_PWD, ORACLE_PORT, ORACLE_IMAGE, and ONNX2ORACLE_WAIT_TIMEOUT instead of editing files.
- onnx2oracle preflight --target local checks version, package, SQL function, privilege, and catalog readiness before a manual load.
- scripts/check_model_compatibility.py all-MiniLM-L6-v2 tests a preset and writes machine-readable evidence.
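Outside a checkout, a few cheap local checks catch most misconfigurations before any database work. A minimal sketch in plain Python (the DSN regex is an assumption covering only the user/pw@host:port/service form from the prompt block, not Oracle's full EZConnect grammar):

```python
import re

# Covers the user/pw@host:port/service form from the prompt block only;
# this is NOT Oracle's full EZConnect grammar.
DSN_RE = re.compile(
    r"^(?P<user>[^/@]+)/(?P<password>[^@]+)@"
    r"(?P<host>[^:/]+):(?P<port>\d+)/(?P<service>\S+)$"
)

def parse_dsn(dsn: str) -> dict:
    """Split a user/pw@host:port/service string into connect parameters."""
    m = DSN_RE.match(dsn)
    if m is None:
        raise ValueError(f"unrecognized DSN: {dsn!r}")
    return m.groupdict()

def missing_packages() -> list[str]:
    """Names of required packages that fail to import in this environment."""
    required = ("oracledb", "onnx", "onnxruntime", "onnxruntime_extensions",
                "transformers", "huggingface_hub")
    out = []
    for mod in required:
        try:
            __import__(mod)
        except ImportError:
            out.append(mod)
    return out
```

Report the parsed DSN fields and any missing packages to the user before proceeding.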

Mental model of the task

The goal is a single self-contained ONNX graph whose input is a raw string and whose output is a normalized float32 vector. Oracle invokes the graph through DBMS_VECTOR.LOAD_ONNX_MODEL and queries it with VECTOR_EMBEDDING(MODEL USING :text AS DATA).

The core problem: HuggingFace ships a Python tokenizer. Oracle needs that tokenizer compiled into the ONNX graph itself. This page bridges the gap.

One input (pre_text), one output (embedding), four internal stages fused into a single ONNX graph.
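The contract can be made concrete by mimicking the fused pipeline in plain Python (a toy sketch using numpy only; fake_tokenize and fake_encoder are hypothetical stand-ins for the compiled tokenizer and the transformer, with a made-up hidden size of 8):

```python
import numpy as np

def fake_tokenize(text: str) -> np.ndarray:
    # Stand-in for the compiled tokenizer: one int per character, shape [1, seq].
    return np.array([[ord(c) % 100 for c in text]], dtype=np.int64)

def fake_encoder(ids: np.ndarray) -> np.ndarray:
    # Stand-in for the transformer: deterministic floats, shape [1, seq, 8].
    rng = np.random.default_rng(int(ids.sum()))
    return rng.standard_normal((1, ids.shape[1], 8), dtype=np.float32)

def embed(text: str) -> np.ndarray:
    hidden = fake_encoder(fake_tokenize(text))   # stages 1-2: tokenize + encode
    pooled = hidden.mean(axis=1)                 # stage 3: mean pooling over seq
    norm = np.maximum(np.linalg.norm(pooled, axis=-1, keepdims=True), 1e-12)
    return (pooled / norm).squeeze(0)            # stage 4: L2 normalize, drop batch

v = embed("hello world")
assert v.shape == (8,)
assert abs(float(np.linalg.norm(v)) - 1.0) < 1e-4
```

The real graph implements exactly this shape contract: raw string in, unit-norm float32 vector out.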

Build the augmented ONNX

Stages 1 through 7 below build the graph in memory (3b and 3c are sub-steps of the merge; stage 7 is an optional local sanity check). Two more sections after this one (Load, Verify) ship it to Oracle and confirm it works. Use MiniLM-L6 as the worked example; parametrize where an agent would generalize. Every block is copy-pastable.

Stage 1: download the core transformer ONNX

from huggingface_hub import hf_hub_download, list_repo_files

HF_REPO = "sentence-transformers/all-MiniLM-L6-v2"
core_path = hf_hub_download(HF_REPO, "onnx/model.onnx")

# Some repos store weights in a sidecar file next to model.onnx (e.g. model.onnx_data
# or similar). Pull any external-data files in the onnx/ folder so onnx.load finds them.
for f in list_repo_files(HF_REPO):
    if f.startswith("onnx/") and f != "onnx/model.onnx":
        hf_hub_download(HF_REPO, f)

Some repos don't ship a pre-exported ONNX at all. In that case, export via transformers.onnx or the optimum exporter first, then point core_path at the resulting file.

Stage 2: wrap the tokenizer as ONNX ops

from transformers import AutoTokenizer
from onnxruntime_extensions import gen_processing_models

tokenizer = AutoTokenizer.from_pretrained(HF_REPO)
pre_model, _ = gen_processing_models(
    tokenizer, pre_kwargs={}, post_kwargs=None, opset=14,
)

Tokenizer compatibility. BertTokenizer-based models (WordPiece vocab) work. SentencePiece and Unigram tokenizers (XLM-R, LLaMA, T5, e5) do NOT work: gen_processing_models produces ragged outputs Oracle's BertTokenizer op can't consume. Skip those, or convert offline with a custom tokenizer op.
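An agent can gate on the tokenizer class name before doing any graph work. A coarse sketch (the class lists are illustrative examples drawn from the rule above, not an exhaustive registry, and the vocab-attribute fallback is a heuristic assumption):

```python
WORDPIECE_CLASSES = {
    "BertTokenizer", "BertTokenizerFast",
    "MPNetTokenizer", "MPNetTokenizerFast",
}
SENTENCEPIECE_CLASSES = {
    "XLMRobertaTokenizer", "XLMRobertaTokenizerFast",
    "LlamaTokenizer", "LlamaTokenizerFast",
    "T5Tokenizer", "T5TokenizerFast",
}

def tokenizer_supported(tok) -> bool:
    """Coarse gate: True for WordPiece-family tokenizers, False for the
    SentencePiece/Unigram families excluded above."""
    name = type(tok).__name__
    if name in SENTENCEPIECE_CLASSES:
        return False
    if name in WORDPIECE_CLASSES:
        return True
    # Unknown class: WordPiece tokenizers typically expose a vocab mapping.
    return hasattr(tok, "vocab")
```

If this returns False, stop and tell the user the model's tokenizer family is unsupported.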

Stage 3: align opsets and merge

import onnx
from onnx import compose, version_converter

core_model = onnx.load(core_path)
core_model = version_converter.convert_version(core_model, 18)
core_model.ir_version = pre_model.ir_version

# Copy custom opset domains from pre-model to core
core_domains = {o.domain for o in core_model.opset_import}
for o in pre_model.opset_import:
    if o.domain not in core_domains:
        new_o = core_model.opset_import.add()
        new_o.domain = o.domain
        new_o.version = o.version

Stage 3b: unsqueeze tokenizer outputs from [seq_len] to [1, seq_len]

The tokenizer emits 1-D tensors. The transformer expects 2-D (batch + sequence). Insert an Unsqueeze(axis=0) on each relevant output.

import numpy as np
from onnx import helper, numpy_helper, TensorProto

# Only touch tensor names that actually exist in the pre-model outputs.
# (MPNet-style encoders have no token_type_ids, and neither does RoBERTa.)
pre_output_names = {o.name for o in pre_model.graph.output}
core_input_names = {i.name for i in core_model.graph.input}
bert_tensor_names = [
    n for n in ("input_ids", "token_type_ids", "attention_mask")
    if n in pre_output_names
]

# Add a shared axes=[0] initializer for the Unsqueeze ops.
axes_0 = numpy_helper.from_array(np.array([0], dtype=np.int64), name="unsqueeze_axes_0")
pre_model.graph.initializer.append(axes_0)

for name in bert_tensor_names:
    flat_name = f"{name}_flat"
    # Rename the tokenizer's raw output so we can feed it into Unsqueeze.
    for node in pre_model.graph.node:
        for i, out in enumerate(node.output):
            if out == name:
                node.output[i] = flat_name
    # Emit Unsqueeze(flat_name, axes=[0]) -> original name.
    pre_model.graph.node.append(
        helper.make_node("Unsqueeze", [flat_name, "unsqueeze_axes_0"], [name])
    )
    # Update the graph-output shape declaration to [1, sequence_length].
    for out in pre_model.graph.output:
        if out.name == name:
            shape = out.type.tensor_type.shape
            while len(shape.dim) > 0:
                shape.dim.pop()
            shape.dim.add().dim_value = 1
            shape.dim.add().dim_param = "sequence_length"

Stage 3c: merge pre + core via compose

io_map = [
    ("input_ids", "input_ids"),
    ("attention_mask", "attention_mask"),
    ("token_type_ids", "token_type_ids"),
]
# Keep only tensors present in both sides. RoBERTa-style tokenizers skip
# token_type_ids; MPNet cores skip it too.
io_map = [(a, b) for a, b in io_map
          if a in pre_output_names and b in core_input_names]

merged = compose.merge_models(
    pre_model, core_model, io_map=io_map, prefix1="pre_", prefix2="core_",
)

# Strip existing outputs; we'll declare the embedding output explicitly later.
while len(merged.graph.output) > 0:
    merged.graph.output.pop()

The prefix1="pre_" argument matters: it determines the final input tensor name. In this example, the merged graph's input becomes pre_text. Oracle's metadata JSON must reference that exact name.

Stage 4: add pooling

Two options, depending on the model. Mean pooling for MiniLM, MPNet, and Nomic:

import numpy as np
from onnx import helper, numpy_helper, TensorProto

pool_axes = numpy_helper.from_array(np.array([1], dtype=np.int64), name="pool_axes_1")
merged.graph.initializer.append(pool_axes)
merged.graph.node.append(
    helper.make_node(
        "ReduceMean", ["core_last_hidden_state", "pool_axes_1"],
        ["pooled"], keepdims=0,
    ),
)

CLS pooling for BGE-style models:

cls_idx = numpy_helper.from_array(np.array([0], dtype=np.int64), name="cls_idx")
sq_axes = numpy_helper.from_array(np.array([1], dtype=np.int64), name="cls_sq_axes")
merged.graph.initializer.append(cls_idx)
merged.graph.initializer.append(sq_axes)
merged.graph.node.append(
    helper.make_node(
        "Gather", ["core_last_hidden_state", "cls_idx"],
        ["pooled_gathered"], axis=1,
    ),
)
merged.graph.node.append(
    helper.make_node("Squeeze", ["pooled_gathered", "cls_sq_axes"], ["pooled"]),
)

Stage 5: L2 normalize

pow_exp = numpy_helper.from_array(np.array(2.0, dtype=np.float32), name="pow_exp")
axes_neg1 = numpy_helper.from_array(np.array([-1], dtype=np.int64), name="axes_neg1")
eps = numpy_helper.from_array(np.array(1e-12, dtype=np.float32), name="eps_val")
merged.graph.initializer.extend([pow_exp, axes_neg1, eps])

merged.graph.node.append(helper.make_node("Pow", ["pooled", "pow_exp"], ["squared"]))
merged.graph.node.append(
    helper.make_node("ReduceSum", ["squared", "axes_neg1"], ["sum_sq"], keepdims=1),
)
merged.graph.node.append(helper.make_node("Sqrt", ["sum_sq"], ["l2_norm"]))
merged.graph.node.append(helper.make_node("Max", ["l2_norm", "eps_val"], ["l2_safe"]))
merged.graph.node.append(helper.make_node("Div", ["pooled", "l2_safe"], ["emb_2d"]))

Stage 6: squeeze the batch dim and declare the final output

sq_final = numpy_helper.from_array(np.array([0], dtype=np.int64), name="sq_final_axes")
merged.graph.initializer.append(sq_final)
merged.graph.node.append(
    helper.make_node("Squeeze", ["emb_2d", "sq_final_axes"], ["embedding"]),
)

EMBEDDING_DIMS = 384  # change to match your model (768 for MPNet/Nomic, etc.)
merged.graph.output.append(
    helper.make_tensor_value_info("embedding", TensorProto.FLOAT, [EMBEDDING_DIMS]),
)

# Fix leftover dynamic batch dims so Oracle's runtime can infer shapes.
for vi in list(merged.graph.value_info) + list(merged.graph.input):
    for dim in vi.type.tensor_type.shape.dim:
        if dim.dim_param == "batch_size":
            dim.ClearField("dim_param")
            dim.dim_value = 1

onnx.checker.check_model(merged)
augmented_bytes = merged.SerializeToString()

Stage 7 (optional): sanity-check on CPU before uploading

Run the graph locally with onnxruntime to confirm shape and L2 norm before sending ~100 MB over the network.

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
sess = ort.InferenceSession(
    augmented_bytes, sess_options=so, providers=["CPUExecutionProvider"],
)
vec = sess.run(None, {"pre_text": np.array(["hello world"])})[0]
assert vec.shape == (EMBEDDING_DIMS,)
assert 0.99 < float(np.linalg.norm(vec)) < 1.01

Load into Oracle

Oracle's DBMS_VECTOR.LOAD_ONNX_MODEL has an overload that takes the model as a raw BLOB argument. No need to write the file to disk first.

import json
import oracledb

conn = oracledb.connect(
    user=USER, password=PASSWORD,
    dsn=f"{HOST}:{PORT}/{SERVICE}",
    tcp_connect_timeout=30,
)

# Oracle identifier: uppercase, letters/digits/underscores, starts with letter or _.
ORACLE_NAME = "ALL_MINILM_L6_V2"

metadata = json.dumps({
    "function": "embedding",
    "embeddingOutput": "embedding",
    # The input tensor name comes from prefix1="pre_" plus the original
    # tokenizer input name ("text"). Match it exactly.
    "input": {"pre_text": ["DATA"]},
})

cur = conn.cursor()
cur.execute("""
    BEGIN
        DBMS_VECTOR.LOAD_ONNX_MODEL(
            model_name => :model_name,
            model_data => :model_data,
            metadata   => JSON(:metadata)
        );
    END;
""", {
    "model_name": ORACLE_NAME,
    "model_data": augmented_bytes,
    "metadata": metadata,
})
conn.commit()

Reloading an existing model. If ORACLE_NAME already exists, Oracle raises ORA-20000. Drop the old one first:
cur.execute("""
    BEGIN DBMS_VECTOR.DROP_ONNX_MODEL(model_name => :n, force => TRUE); END;
""", {"n": ORACLE_NAME})
conn.commit()

Verify

Three checks: shape, L2 norm, and semantic sanity. Run all three and report the output.

import math
import json

# Note: VECTOR_EMBEDDING takes the model name as a SQL token, not a bind
# variable. The ORACLE_NAME must be validated (see Security section below).
cur.execute(f"""
    SELECT VECTOR_EMBEDDING({ORACLE_NAME} USING :t AS DATA) FROM dual
""", {"t": "hello world"})
vec = cur.fetchone()[0]
v = list(vec) if not isinstance(vec, str) else json.loads(vec)
print(f"dims={len(v)}")  # expect 384 for MiniLM

norm = math.sqrt(sum(x * x for x in v))
print(f"norm={norm:.6f}")  # expect ~1.0

def embed(text):
    cur.execute(
        f"SELECT VECTOR_EMBEDDING({ORACLE_NAME} USING :t AS DATA) FROM dual",
        {"t": text},
    )
    r = cur.fetchone()[0]
    return list(r) if not isinstance(r, str) else json.loads(r)

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b, strict=True)) / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    )

vk, vq, vb = embed("king"), embed("queen"), embed("banana")
assert cosine(vk, vq) > cosine(vk, vb), "semantic similarity is broken"
print("verification passed.")

Security: guard the model name against SQL injection

VECTOR_EMBEDDING takes the model identifier as a SQL token, not a bind. So the name must be validated before any f-string interpolation.

import re

_SAFE = re.compile(r"^[A-Z_][A-Z0-9_]{0,127}$")

def safe_oracle_name(name: str) -> str:
    if not _SAFE.match(name):
        raise ValueError(f"invalid Oracle identifier: {name!r}")
    return name

# Apply to every model name before interpolation.
ORACLE_NAME = safe_oracle_name(ORACLE_NAME)

All other values (the input text, model blob, metadata) go through bind variables already. The model name is the one place where a raw SQL identifier is unavoidable.

Model cheatsheet

Five models known to work end-to-end with this exact pipeline. All are WordPiece. Pick by the constraint that matters (speed, quality, long docs, language).

HF repo                                  Dims  Size     Pooling
sentence-transformers/all-MiniLM-L6-v2   384   ~90 MB   mean
sentence-transformers/all-MiniLM-L12-v2  384   ~130 MB  mean
sentence-transformers/all-mpnet-base-v2  768   ~420 MB  mean
BAAI/bge-small-en-v1.5                   384   ~130 MB  cls
nomic-ai/nomic-embed-text-v1             768   ~540 MB  mean
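For an agent, the cheatsheet is easier to consume as data. A sketch mirroring the table (dims and pooling taken directly from it; sizes omitted):

```python
PRESETS = {
    "sentence-transformers/all-MiniLM-L6-v2":  {"dims": 384, "pooling": "mean"},
    "sentence-transformers/all-MiniLM-L12-v2": {"dims": 384, "pooling": "mean"},
    "sentence-transformers/all-mpnet-base-v2": {"dims": 768, "pooling": "mean"},
    "BAAI/bge-small-en-v1.5":                  {"dims": 384, "pooling": "cls"},
    "nomic-ai/nomic-embed-text-v1":            {"dims": 768, "pooling": "mean"},
}

def preset_for(repo: str) -> dict:
    """Look up known dims/pooling; unknown repos need manual inspection."""
    try:
        return PRESETS[repo]
    except KeyError:
        raise ValueError(
            f"no preset for {repo}; derive dims and pooling manually"
        ) from None
```

Use dims to set EMBEDDING_DIMS and pooling to pick the Stage 4 variant.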

Appendix: if you'd rather just pip install something

This entire manual is packaged as a CLI if you don't need to customize the pipeline:

pip install onnx2oracle
onnx2oracle load all-MiniLM-L6-v2 --target local

The onnx2oracle CLI wraps everything on this page into one command. Use it when you want the shortest path. Use the manual above when you need to customize pooling, swap the tokenizer, or target an unusual model architecture.