Context
Saudi regulatory law is a dynamic corpus. New regulations supersede older ones, jurisdictions overlap, and the boundaries between labour, commercial, and criminal law are legally significant even where the terminology overlaps. A legal AI assistant operating in this environment carries a correctness requirement that most retrieval systems are not designed for: retrieving a regulation that was valid six months ago but has since been superseded does not register as a retrieval miss. It looks like a correct answer. The system returns a citation, the model generates a coherent legal explanation, and the error is undetectable until a practitioner catches it downstream.
A Saudi legal tech startup building an AI assistant for this corpus needed more than semantic retrieval. The accuracy threshold was not a performance target. It was a product prerequisite: if the system could not guarantee domain-scoped, legally current retrieval, it could not be positioned as a legal assistant.
The Hard Problem
The failure mode of standard RAG on a dynamic legal corpus is not retrieval failure. It is confident retrieval of outdated context. Dense vector search optimizes for semantic similarity, not legal validity. A superseded regulation describing the same legal concept as a current one sits close to it in embedding space. Without a structural mechanism to scope retrieval by domain and version, the system surfaces that outdated regulation, and the model generates a legally coherent but incorrect answer with no indication that anything went wrong.
Before introducing metadata-scoped retrieval, we ran a baseline evaluation on approximately 300 labeled bilingual QA pairs. Full-corpus vector search with token-based chunking produced recall@5 of 68–72% and answer correctness of 62–66%. Cross-domain contamination, cases where a query about labour law retrieved criminal or commercial law context, reached 20–25%. End-to-end latency sat at 5–10 seconds, driven by large candidate sets and reranking over an unfiltered corpus.
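The two retrieval metrics above can be computed along the following lines. This is a minimal sketch, not the evaluation harness used in the project: it assumes recall@k is measured as a hit rate (at least one ground-truth segment in the top k) and that a query counts as contaminated when any top-k segment comes from a different domain. The `EvalCase` structure and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    query_domain: str                  # domain the question belongs to, e.g. "labour"
    relevant_ids: set[str]             # ground-truth segment ids for the question
    retrieved: list[tuple[str, str]]   # ranked (segment_id, segment_domain) pairs


def recall_at_k(cases: list[EvalCase], k: int) -> float:
    """Share of cases with at least one ground-truth segment in the top k."""
    hits = sum(
        any(seg_id in case.relevant_ids for seg_id, _ in case.retrieved[:k])
        for case in cases
    )
    return hits / len(cases)


def contamination_rate(cases: list[EvalCase], k: int) -> float:
    """Share of cases whose top k contains a segment from another domain."""
    contaminated = sum(
        any(domain != case.query_domain for _, domain in case.retrieved[:k])
        for case in cases
    )
    return contaminated / len(cases)
```

Other definitions are possible (for example, recall as the fraction of all relevant segments retrieved), so reported numbers should always state which one was used.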
A retrieval from a superseded regulation is indistinguishable from a correct answer until a practitioner catches the error downstream. In a system positioned for professional legal use, that silent failure mode made full-corpus semantic search an unacceptable architecture, not an optimization problem.
How It Works
Ingestion
Legal validation
Each regulation passes through a legal validation stage where domain experts verify the document before it is accepted into the index. Indexing errors in a legal corpus are not correctable at retrieval time.
Structural segmentation
Documents are segmented at legal structural boundaries (articles, clauses, and rule delimitations) rather than fixed token windows, because legal reasoning depends on atomic rules and a token window can cut a rule in half.
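As an illustration of boundary-based segmentation, the sketch below splits an English-language regulation at "Article N" headings. The heading pattern is an assumption made for this example; real regulations, and Arabic texts in particular, would need their own patterns and finer handling of clauses.

```python
import re

# Assumed heading format for this sketch: a line starting with "Article <number>".
ARTICLE_HEADING = re.compile(r"^Article\s+\d+", re.MULTILINE)


def split_into_articles(text: str) -> list[str]:
    """Return one segment per article, plus any preamble before the first one."""
    starts = [m.start() for m in ARTICLE_HEADING.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    bounds = starts + [len(text)]
    segments = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Keep text that precedes the first article as its own segment.
    preamble = text[: starts[0]].strip()
    return ([preamble] if preamble else []) + segments
```

Each returned segment then holds exactly one article, so a retrieved chunk never begins or ends in the middle of a rule.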
Metadata enrichment
Each segment is tagged with its legal domain, regulatory grouping, and scope classification. These tags let retrieval be restricted to domain-relevant chunks instead of searching the corpus as one flat embedding space.
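A segment record with these tags might look like the sketch below. The field names, the domain list, and the validation rules are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import asdict, dataclass

# Hypothetical domain taxonomy for this sketch.
ALLOWED_DOMAINS = frozenset({"labour", "commercial", "criminal"})


@dataclass(frozen=True)
class LegalSegment:
    segment_id: str
    text: str
    domain: str            # legal domain, must be in ALLOWED_DOMAINS
    regulatory_group: str  # regulatory grouping label
    scope: str             # scope classification label

    def __post_init__(self) -> None:
        # Reject segments that would be unreachable through a metadata filter.
        if self.domain not in ALLOWED_DOMAINS:
            raise ValueError(f"unknown legal domain: {self.domain!r}")
        for name in ("regulatory_group", "scope"):
            if not getattr(self, name):
                raise ValueError(f"missing metadata field: {name}")
```

Calling `asdict(segment)` yields a plain dictionary that can be stored next to the embedding. Rejecting untagged segments at ingestion matters because a segment without a domain tag would be invisible to every domain-filtered query.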
Indexing
Validated, segmented, and enriched chunks are indexed into Milvus with filterable scalar fields for domain and regulatory scope.
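Those fields are what make query-time filtering possible. The helper below sketches how a Milvus boolean filter expression could be built from an inferred domain. The field names `domain` and `regulatory_group` are assumptions for this example; the resulting string is the kind of expression Milvus accepts as a search filter (the `filter` argument of `MilvusClient.search`, or `expr` in the older `Collection.search` API).

```python
import json
from typing import List, Optional


def build_filter(domain: str, regulatory_groups: Optional[List[str]] = None) -> str:
    """Build a Milvus-style boolean expression over scalar metadata fields."""
    # json.dumps produces double-quoted, escaped string literals.
    clauses = [f"domain == {json.dumps(domain, ensure_ascii=False)}"]
    if regulatory_groups:
        groups = json.dumps(regulatory_groups, ensure_ascii=False)
        clauses.append(f"regulatory_group in {groups}")
    return " and ".join(clauses)
```

For example, `build_filter("labour", ["wages", "leave"])` produces `domain == "labour" and regulatory_group in ["wages", "leave"]`.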
Query
Intent routing
A ReAct-based planning layer determines whether retrieval is required at all. Conversational queries, out-of-scope requests, and greetings resolve without it.
Query decomposition
Multi-intent queries are decomposed into sub-queries aligned to distinct legal intents, each scoped to its own regulatory domain, before any retrieval runs.
Metadata-scoped retrieval
Each sub-query is filtered by inferred regulatory domain before vector search runs, then reranked to surface legally precise matches over purely semantic ones.
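The order of operations (filter, then vector search, then rerank) can be shown with a small in-memory sketch. It stands in for the Milvus-backed retrieval and is not the production code: segments are plain dictionaries with assumed `domain` and `vector` keys, similarity is cosine, and the reranker is an optional callable supplied by the caller.

```python
import math
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(
    query_vec: list[float],
    query_domain: str,
    segments: list[dict],
    k: int,
    rerank: Optional[Callable[[list[dict]], list[dict]]] = None,
) -> list[dict]:
    """Filter by domain first, then rank the survivors by cosine similarity."""
    candidates = [s for s in segments if s["domain"] == query_domain]
    ranked = sorted(
        candidates, key=lambda s: cosine(query_vec, s["vector"]), reverse=True
    )[:k]
    # The reranker only ever sees in-domain candidates.
    return rerank(ranked) if rerank else ranked
```

Because the filter runs first, a segment from another domain cannot enter the candidate set even when it is the closest vector to the query.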
Context validation
Retrieved context is checked for domain consistency and language alignment before generation. This keeps the response in the language of the query (Arabic or English) and blocks prompt injection carried in through retrieved or user-supplied text.
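A simplified version of such a check is sketched below. It assumes that language alignment means retrieved chunks share the query's script, uses a crude character count to tell Arabic from English, and flags prompt injection with a short hypothetical phrase list. A production guardrail would need far more robust detection than this.

```python
import re

ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")  # Arabic Unicode block
LATIN_LETTER = re.compile(r"[A-Za-z]")

# Hypothetical marker phrases, for illustration only.
INJECTION_MARKERS = ("ignore previous instructions", "disregard the above")


def detect_language(text: str) -> str:
    """Return 'ar' if Arabic characters outnumber Latin letters, else 'en'."""
    arabic = len(ARABIC_CHAR.findall(text))
    latin = len(LATIN_LETTER.findall(text))
    return "ar" if arabic > latin else "en"


def validate_context(query: str, expected_domain: str, chunks: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the context may be used."""
    problems = []
    query_lang = detect_language(query)
    for chunk in chunks:
        if chunk["domain"] != expected_domain:
            problems.append(f"{chunk['id']}: domain is {chunk['domain']}, expected {expected_domain}")
        if detect_language(chunk["text"]) != query_lang:
            problems.append(f"{chunk['id']}: language differs from query")
        if any(marker in chunk["text"].lower() for marker in INJECTION_MARKERS):
            problems.append(f"{chunk['id']}: possible prompt injection")
    return problems
```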
Generation and observability
Arize Phoenix captures a structured span per query: query type, retrieved domain, retrieval confidence, per-stage latency, and guardrail events, surfacing drift that aggregate accuracy metrics miss.
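The per-query span described above could be modelled as a flat set of attributes, as in this sketch. It does not use the Phoenix API; the class and attribute names are hypothetical and only illustrate which fields are recorded before they are attached to a trace span.

```python
from dataclasses import dataclass, field


@dataclass
class QuerySpan:
    query_type: str
    retrieved_domain: str
    retrieval_confidence: float
    stage_latency_ms: dict[str, float] = field(default_factory=dict)
    guardrail_events: list[str] = field(default_factory=list)

    def to_attributes(self) -> dict:
        """Flatten the span into key/value attributes suitable for a tracer."""
        attrs = {
            "query.type": self.query_type,
            "retrieval.domain": self.retrieved_domain,
            "retrieval.confidence": self.retrieval_confidence,
            "guardrail.events": list(self.guardrail_events),
        }
        # One attribute per pipeline stage keeps latencies individually queryable.
        for stage, ms in self.stage_latency_ms.items():
            attrs[f"latency_ms.{stage}"] = ms
        return attrs
```

Recording domain and confidence per query is what allows drift in a single domain to be spotted even when aggregate accuracy stays flat.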
Evaluation
Evaluation methodology
- Test set: Approximately 300–500 bilingual QA pairs covering English and Arabic legal queries across multiple Saudi regulatory domains, built from real-world style questions mapped to verified source regulations with ground-truth answers reviewed by legal domain experts.
- Query types: Exact regulation citation lookup, paraphrased legal intent, multi-intent and multi-hop reasoning, ambiguous queries requiring domain inference, cross-domain distractors targeting the metadata filter, and out-of-scope queries validating guardrail behavior.
- Pass/fail: Binary. Correct only if retrieved context was legally valid and current, the answer matched ground-truth legal interpretation, no incorrect authority was introduced, and guardrail and language constraints were respected. Partial retrieval with incorrect generation counted as failure.
- Tooling: LlamaIndex evaluators for automated retrieval relevance, faithfulness, and context-answer alignment at scale, with a dedicated QA engineer for manual review of edge cases, adversarial inputs, and jailbreak attempts, serving as the final authority on legal correctness.
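The binary pass/fail rule above reduces to a conjunction of checks, which the sketch below makes explicit. The field names are hypothetical; in practice each flag would come from the automated evaluators or from the QA engineer's review.

```python
from dataclasses import dataclass


@dataclass
class CaseOutcome:
    context_valid_and_current: bool
    answer_matches_ground_truth: bool
    no_incorrect_authority: bool
    guardrails_respected: bool
    language_respected: bool


def passed(outcome: CaseOutcome) -> bool:
    """A case passes only if every criterion holds; there is no partial credit."""
    return all(
        (
            outcome.context_valid_and_current,
            outcome.answer_matches_ground_truth,
            outcome.no_incorrect_authority,
            outcome.guardrails_respected,
            outcome.language_respected,
        )
    )


def pass_rate(outcomes: list[CaseOutcome]) -> float:
    """Fraction of cases that pass; 0.0 for an empty list."""
    return sum(passed(o) for o in outcomes) / len(outcomes) if outcomes else 0.0
```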
Key Decisions
Decision
Embedding model: open-source fine-tuning vs OpenAI embeddings
The choice was between an open-source model (BGE-M3 or E5-large-v2) with domain-specific PEFT fine-tuning and OpenAI embedding models as the retrieval backbone. I ran experiments on the 300–500 QA evaluation set comparing top-k hit rate, downstream answer correctness, and cross-lingual retrieval consistency between Arabic and English queries. BGE-M3 was the strongest open-source baseline; fine-tuning on domain legal data improved cross-lingual alignment marginally but did not close the gap in semantic precision. I standardized on OpenAI embeddings because they produced higher semantic precision and more stable cross-lingual alignment across legal domains, which directly affected answer correctness in the hybrid evaluation.
We lost control over model internals and the ability to customize domain representation. In a deployment where a client prohibits sending data to external embedding APIs, the open-source path becomes necessary regardless of the precision delta. We accepted that constraint in exchange for retrieval reliability given the accuracy requirements and three-month timeline.
Decision
Retrieval architecture: full-corpus dense search vs metadata-scoped retrieval
The baseline evaluation produced recall@5 of 68–72% with cross-domain contamination at 20–25%. The contamination problem was structural: legal domains share terminology, and dense search alone could not distinguish between semantically similar but legally distinct regulations. I introduced a pre-filtering layer where the agent infers legal domain from query intent and constrains Milvus retrieval to the relevant regulatory subset before vector search runs. This reduced cross-domain contamination to under 5% and brought recall@10 to 97–98%. The baseline figure was measured at k=5 and the final figure at k=10, so the two recall numbers are not strictly like-for-like.
The tradeoff is a deliberate reduction in theoretical recall coverage. A query that crosses domain boundaries or touches an emerging regulatory area not well represented in the metadata taxonomy can be under-retrieved. We accepted this because surfacing cross-domain contamination in a legal context is more damaging than a contained miss.
Decision
Vector database: ChromaDB vs Milvus vs Qdrant
Under concurrent load (100+ async users) and growing corpus size, ChromaDB produced inconsistent retrieval latency and occasional instability. I benchmarked Milvus and Qdrant under the same conditions. Qdrant performed well for filtered retrieval, but Milvus provided more predictable behavior under increasing load and better support for the metadata-heavy filtering patterns the retrieval architecture depended on.
Milvus requires cluster management and operational overhead that ChromaDB and Qdrant do not. On a three-month MVP timeline, this meant engineering time spent on infrastructure rather than product iteration. We accepted it because retrieval stability under concurrency was not negotiable for a legal system in production.
Implementation Detail
The non-obvious component in the system was the query planning layer inside the ReAct agent that constructed a structured retrieval plan before any vector search ran.
Rather than treating retrieval as a single tool call, the agent transformed raw user intent into a structured plan: decomposing multi-intent queries into sub-queries aligned to distinct legal intents, and attaching explicit metadata constraints to each sub-query before passing it to the retrieval layer. A query spanning two regulatory domains produced two retrieval calls, each pre-scoped to its domain.
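The shape of such a plan can be sketched as a small data structure plus an executor. In the system described here the plan is produced by the ReAct agent; in this sketch it is written by hand, and the `search` callable and all class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubQuery:
    text: str
    domain: str  # metadata constraint attached before retrieval runs


@dataclass(frozen=True)
class RetrievalPlan:
    needs_retrieval: bool
    sub_queries: tuple[SubQuery, ...] = ()


def execute(
    plan: RetrievalPlan, search: Callable[[str, str], list]
) -> list[tuple[SubQuery, list]]:
    """Run one domain-scoped search per sub-query, preserving plan order."""
    if not plan.needs_retrieval:
        return []
    return [(sq, search(sq.text, sq.domain)) for sq in plan.sub_queries]
```

A question covering both termination notice and director liability would yield a plan with two `SubQuery` entries, one scoped to labour and one to commercial, and therefore two retrieval calls.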
The metadata constraints were not applied post-retrieval as a filter on results. They were embedded into the agent's decision step, meaning the effective search space was determined before vector search ran. This shifted part of the retrieval optimization problem into the reasoning layer and produced two concrete effects: fewer vector search operations per query, and retrieval candidates that arrived at the reranking stage already domain-constrained.
The response assembly stage applied a final consistency pass across all retrieved chunks before context was passed to the generation model. This caught a class of subtle failures where individual retrieval results were correct but context mixing across sub-queries introduced domain inconsistency at generation time, a failure mode that neither retrieval metrics nor automated faithfulness evaluators would have flagged on their own.
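One way to picture this consistency pass is the sketch below. It assumes the pass drops any chunk whose domain does not match the sub-query that fetched it, removes duplicates, and labels each block of context with its domain so that the generation model does not blend rules across sub-queries. The actual checks in the system may differ.

```python
def assemble_context(results: list[tuple[str, list[dict]]]) -> str:
    """Build labelled context blocks from (sub_query_domain, chunks) pairs."""
    seen_ids = set()
    blocks = []
    for domain, chunks in results:
        kept = []
        for chunk in chunks:
            # Skip chunks from the wrong domain and chunks already included.
            if chunk["domain"] != domain or chunk["id"] in seen_ids:
                continue
            seen_ids.add(chunk["id"])
            kept.append(chunk["text"])
        if kept:
            blocks.append(f"[{domain}]\n" + "\n".join(kept))
    return "\n\n".join(blocks)
```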
Results
- recall@10: 97–98%. 300–500 bilingual QA test set, validated by legal domain experts. Baseline full-corpus recall@5 was 68–72%.
- Answer correctness: 95–98%. Hybrid evaluation: LlamaIndex automated scoring plus QA engineer as final authority on legal correctness.
- Cross-domain contamination: <5%. Down from 20–25% in the full-corpus baseline, after introducing metadata-scoped retrieval.
- Latency, p50–p95: 1.0–3.0s. Staging under production-like load, down from 5–10s in the baseline system.
The metadata-scoped architecture recovered roughly 25–30 points of recall over the full-corpus baseline (recall@10 of 97–98% against a baseline recall@5 of 68–72%, so part of the gain reflects the larger k) while simultaneously reducing cross-domain contamination from 20–25% to under 5%. The latency improvement came primarily from reducing the retrieval candidate set through pre-filtering, not from component-level tuning.
The system reached production with a live user base within three months. Retrieval accuracy and domain-scoped reasoning were demonstrated to investors during the fundraising process, and the pre-seed round closed at $600K.
Reflection
The current system is designed for codified regulatory retrieval, where answers trace to a single authoritative statute. It is less effective on precedent-based queries requiring synthesis across judicial reasoning and case law, where no single document is authoritative and correctness depends on interpreting patterns across multiple unstructured sources. Extending to that query class would require a different context assembly strategy, most likely a hierarchical or multi-document synthesis layer, rather than the current single-pass retrieval and generation design.
If starting again, I would invest earlier in a fine-tuned open-source embedding strategy alongside the OpenAI-based approach, not to replace it, but to have a compliance-ready alternative before entering commercial discussions. Data governance constraints in regulated industries often surface late in the sales cycle, and a retrieval architecture that requires external API calls for embeddings becomes a deployment blocker in some client environments. Having that alternative validated on the evaluation set before it becomes a client requirement would remove a late-stage negotiation variable.