Monday, May 18, 2026

Bridging the Gap Between GenAI and Data Consistency: Re-Engineering a Canadian Used-Car Valuation Engine

 

Executive Summary

Over the past 48 hours, I overhauled the architecture of the backend valuation engine for Kaestify (a Canadian used-car depreciation and report platform). The goal was twofold: eliminate the data inconsistency caused by Large Language Model (LLM) hallucinations, and standardize highly fragmented Canadian automotive trim data against live market anchors from AutoTrader.ca.

By moving from a purely prompt-dependent AI generation model to a hybrid deterministic-stochastic architecture, the platform now bounds every valuation with a deterministic server-side formula while preserving the narrative power of GenAI.


1. The Problem: The Perils of LLM-Dependent Valuation

In the initial iteration, the valuation logic relied heavily on the Gemini API to estimate the current market value (marketValueCad) dynamically through few-shot prompt benchmarking (e.g., using a 2018 Honda Civic EX as a static anchor).

However, this approach suffered from three critical flaws:

  • Stochastic Inconsistency (Hallucination): Sending the LLM the same vehicle query with slight mileage variations could produce wildly different valuations from one day to the next.

  • Flawed Caching Resolution: The initial caching specification keyed entries only on Year + Make + Model. This meant a high-mileage base coupe and a low-mileage top-tier convertible shared the same cache row, so subsequent users were served a valuation computed for a different vehicle.

  • Trim & Market Mismatch: Missing localized Canadian trims (e.g., Toyota's Preferred/Ultimate naming conventions or discontinued brands like Scion) forced the LLM to fall back on U.S. data, polluting the report with inaccurate cross-border MSRPs.
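To make the cache-scoping flaw concrete, here is a minimal sketch of the collision. The `VehicleQuery` shape, the key format, and the 10,000 km mileage bucket are all illustrative assumptions, not the platform's actual schema; the point is only that a Year + Make + Model key cannot tell two very different cars apart.

```typescript
// Hypothetical request shape; field names are assumptions for illustration.
interface VehicleQuery {
  year: number;
  make: string;
  model: string;
  trim: string;
  mileageKm: number;
}

// The flawed key described above: Year + Make + Model only.
function coarseCacheKey(q: VehicleQuery): string {
  return [q.year, q.make, q.model].join("|").toLowerCase();
}

// One possible finer-grained key: adds trim and a mileage bucket.
// The bucket width (10,000 km) is an assumed value.
function scopedCacheKey(q: VehicleQuery, bucketKm: number = 10000): string {
  const bucket = Math.floor(q.mileageKm / bucketKm);
  return [q.year, q.make, q.model, q.trim, bucket].join("|").toLowerCase();
}

const baseCoupe: VehicleQuery = {
  year: 2018, make: "Ford", model: "Mustang",
  trim: "EcoBoost Coupe", mileageKm: 180000,
};
const topConvertible: VehicleQuery = {
  year: 2018, make: "Ford", model: "Mustang",
  trim: "GT Premium Convertible", mileageKm: 25000,
};

// Both vehicles collapse onto the same coarse key, but get distinct scoped keys.
console.log(coarseCacheKey(baseCoupe) === coarseCacheKey(topConvertible));
console.log(scopedCacheKey(baseCoupe), scopedCacheKey(topConvertible));
```

Bucketing mileage rather than keying on the exact odometer reading is a trade-off: it keeps the hit rate reasonable while still separating a high-mileage car from a low-mileage one.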


2. The Solution: A Hybrid Deterministic-Stochastic Architecture

To resolve these vulnerabilities, I decoupled the mathematical core of vehicle valuation from the textual synthesis role of the AI.

[User Input] 
     │
     ▼
[Cache Check (72h)] ──(Hit)──► Return Fresh Data
     │
   (Miss)
     ▼
[msrpResolveServer.ts] ──► Query DB Anchors ──(Miss)──► Real-time Gemini + Google Search
     │
     ▼
[systemBaseMarketValue.ts] ──► Apply Mathematical Depreciation Formula (Age & Mileage Penalties)
     │
     ▼
[Gemini Prompt Injection] ──► Pass System Base Value as Absolute Anchor
     │
     ▼
[Server-Side Post-Processing] ──► Clamp AI Output to ±5% of System Base Value
     │
     ▼
[Final JSON Payload] (Includes Debug Metadata: msrpResolvedBy, systemCalculatedBaseCad)
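The 72-hour cache check at the top of this flow can be sketched as a small read-through helper. This is an illustration under assumptions: the in-memory `Map`, the `CachedReport` wrapper, and the function names are hypothetical stand-ins for whatever store the real backend uses.

```typescript
// 72-hour freshness window, as shown in the flow diagram.
const CACHE_TTL_MS = 72 * 60 * 60 * 1000;

// Hypothetical cache entry: the payload plus the time it was stored.
interface CachedReport<T> {
  payload: T;
  storedAtMs: number;
}

// An entry is fresh only if it exists and is younger than the TTL.
function isFresh<T>(
  entry: CachedReport<T> | undefined,
  nowMs: number,
): entry is CachedReport<T> {
  return entry !== undefined && nowMs - entry.storedAtMs < CACHE_TTL_MS;
}

// Return the cached payload on a fresh hit; otherwise recompute and store.
function readThrough<T>(
  cache: Map<string, CachedReport<T>>,
  key: string,
  nowMs: number,
  compute: () => T,
): T {
  const hit = cache.get(key);
  if (isFresh(hit, nowMs)) {
    return hit.payload;
  }
  const payload = compute();
  cache.set(key, { payload, storedAtMs: nowMs });
  return payload;
}
```

Passing `nowMs` in explicitly, rather than calling `Date.now()` inside the helper, keeps the expiry logic easy to test.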

Key Implementation Details:

A. Core Schema Hardening & Data Cleansing (The Foundation)

I manually audited and cross-referenced fragmented automotive data with AutoTrader.ca to establish clean database anchors.

  • Normalized volatile string formats (e.g., standardizing C-HR, bZ4X, and RAV4 hyphenations).

  • Isolated structural variants from model names into distinct trim categories (e.g., moving Convertible status out of Ford Mustang's model field into a specific trim sub-layer to prevent relational DB bloat).

  • Fully mapped historical outlier brands like Scion (FR-S, tC, xB) to stitch together uninterrupted 13-year enthusiast car depreciation curves.
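The first two cleansing steps above can be sketched as small pure functions. The lookup table, the `bodyStyle` field name, and the list of body styles are assumptions chosen for illustration; the real anchors were audited by hand against AutoTrader.ca rather than generated by code like this.

```typescript
// Canonical spellings, keyed by the input reduced to lowercase alphanumerics.
// The entries mirror the examples in the text; the table itself is hypothetical.
const CANONICAL_MODELS = new Map<string, string>([
  ["chr", "C-HR"],
  ["bz4x", "bZ4X"],
  ["rav4", "RAV4"],
]);

// Map variants such as "c hr", "Rav-4" or "BZ4X" onto one canonical spelling.
// Unknown models are returned trimmed but otherwise unchanged.
function normalizeModel(raw: string): string {
  const key = raw.toLowerCase().replace(/[^a-z0-9]/g, "");
  return CANONICAL_MODELS.get(key) ?? raw.trim();
}

// Body-style words to pull out of the model field (illustrative list).
const BODY_STYLES = ["Convertible", "Coupe"];

// Split a trailing body-style word off the model name so it can be stored
// in a separate trim-level field instead of multiplying model rows.
function splitBodyStyle(model: string): { model: string; bodyStyle: string | null } {
  for (const style of BODY_STYLES) {
    const pattern = new RegExp(`\\s+${style}$`, "i");
    if (pattern.test(model)) {
      return { model: model.replace(pattern, "").trim(), bodyStyle: style };
    }
  }
  return { model: model.trim(), bodyStyle: null };
}
```

For example, `splitBodyStyle("Mustang Convertible")` yields the model `"Mustang"` with `"Convertible"` held separately.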

B. msrpResolveServer.ts — The Single Source of Truth

I abstracted MSRP resolution into a reusable server module. When a cache miss occurs:

  1. The system checks our newly optimized canadaMsrpAnchors database.

  2. If it's a cold start or an obscure vehicle, it triggers an isolated Gemini + Google Search validation loop to fetch the Canadian MSRP, which is fed straight back into the pipeline.
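The two-step resolution order above can be sketched as follows. This is not the contents of msrpResolveServer.ts: the function name, the injected `lookupAnchor` and `askLlm` callbacks, and the validation rule are assumptions, and the Gemini + Google Search call is deliberately left as an injected stub rather than a real API call. Only the three `msrpResolvedBy` values come from the payload described later in this post.

```typescript
type MsrpResolvedBy = "canada_msrp_anchor" | "gemini" | "none";

interface MsrpResolution {
  msrpCad: number | null;
  msrpResolvedBy: MsrpResolvedBy;
}

// Hypothetical dependencies, injected so the ordering logic stays testable.
type AnchorLookup = (vehicleKey: string) => number | undefined;
type LlmMsrpLookup = (vehicleKey: string) => Promise<number | null>;

async function resolveMsrp(
  vehicleKey: string,
  lookupAnchor: AnchorLookup,
  askLlm: LlmMsrpLookup,
): Promise<MsrpResolution> {
  // Step 1: prefer the curated database anchor.
  const anchored = lookupAnchor(vehicleKey);
  if (anchored !== undefined) {
    return { msrpCad: anchored, msrpResolvedBy: "canada_msrp_anchor" };
  }

  // Step 2: fall back to the LLM-backed lookup, accepting only a
  // finite, positive number (assumed validation rule).
  try {
    const fetched = await askLlm(vehicleKey);
    if (fetched !== null && Number.isFinite(fetched) && fetched > 0) {
      return { msrpCad: fetched, msrpResolvedBy: "gemini" };
    }
  } catch {
    // A failed lookup is treated the same as "no answer".
  }

  return { msrpCad: null, msrpResolvedBy: "none" };
}
```

Tagging each result with its provenance at the point of resolution means downstream code never has to guess where an MSRP came from.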

C. Deterministic Post-Processing & Clamp Controls (systemBaseMarketValue.ts)

Instead of letting the AI guess the valuation blindly, a strict server-side formula now dictates the boundaries:

  • Base Depreciation: Calculates a structured age decay (e.g., 20% hit in year one, stepped degradation thereafter).

  • Mileage Penalty: Dynamically penalizes vehicles exceeding the Canadian average baseline of 20,000 km/year.

  • The Server-Side Clamp: The calculated base value is injected into the LLM prompt as a mandatory anchor rule. Upon receiving the JSON payload from the AI, the server executes enforceMarketValueAnchorBand(), strictly clamping the final marketValueCad to within ±5% of our deterministic formula.
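A rough sketch of how the formula and the clamp fit together is shown below. Only three numbers come from the description above: the 20% first-year hit, the 20,000 km/year baseline, and the ±5% band. The later-year rate, the per-kilometre penalty, the value floor, and both function signatures are assumptions made up for this example, so treat it as an illustration rather than the production formula.

```typescript
const FIRST_YEAR_RATE = 0.2;          // from the text: 20% hit in year one
const LATER_YEAR_RATE = 0.1;          // ASSUMED stepped rate for later years
const BASELINE_KM_PER_YEAR = 20000;   // from the text
const PENALTY_CAD_PER_EXCESS_KM = 0.05; // ASSUMED penalty per excess km
const MIN_VALUE_FRACTION = 0.05;      // ASSUMED floor: 5% of MSRP
const ANCHOR_BAND = 0.05;             // from the text: ±5%

// Deterministic base value: age decay first, then the mileage penalty.
// ageYears is expected to be a whole number of years.
function systemBaseMarketValue(msrpCad: number, ageYears: number, mileageKm: number): number {
  let value = msrpCad;
  for (let year = 1; year <= ageYears; year++) {
    value *= 1 - (year === 1 ? FIRST_YEAR_RATE : LATER_YEAR_RATE);
  }
  // Only kilometres beyond the baseline for the car's age are penalized.
  const excessKm = Math.max(0, mileageKm - BASELINE_KM_PER_YEAR * ageYears);
  const penalized = value - excessKm * PENALTY_CAD_PER_EXCESS_KM;
  return Math.round(Math.max(msrpCad * MIN_VALUE_FRACTION, penalized));
}

// Clamp the AI's figure into [base * 0.95, base * 1.05].
// A non-numeric AI answer falls back to the base value itself.
function enforceMarketValueAnchorBand(
  aiValueCad: number,
  baseCad: number,
  band: number = ANCHOR_BAND,
): number {
  if (!Number.isFinite(aiValueCad)) {
    return baseCad;
  }
  const low = baseCad * (1 - band);
  const high = baseCad * (1 + band);
  return Math.round(Math.min(high, Math.max(low, aiValueCad)));
}

// A 3-year-old car with a 30,000 CAD MSRP and 70,000 km on the odometer.
const baseCad = systemBaseMarketValue(30000, 3, 70000);
// An AI figure far above the base is pulled down to the top of the band.
const finalCad = enforceMarketValueAnchorBand(25000, baseCad);
console.log(baseCad, finalCad);
```

Under these assumed rates, the example car decays to 19,440 CAD on age alone, loses 500 CAD for 10,000 excess kilometres, and any AI figure outside 17,993 to 19,887 CAD is clamped to the nearest edge of that range.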


3. Key Takeaways & Architectural Impact

  • Deterministic Guardrails: The LLM is no longer the "accountant" calculating the price; it is now the "narrator" explaining the market dynamics, able to move the figure only within the ±5% band. If the AI hallucinates a random number, the server-side post-processor clamps it back into that band.

  • Enriched Metadata for UI Extensibility: The API payload now explicitly passes msrpResolvedBy (canada_msrp_anchor | gemini | none) and systemCalculatedBaseCad. This allows for telemetry monitoring and provides clean hooks to render transparent price breakdown charts on the client-side UI.

  • Production-Ready Optimization: By fixing the cache-key scoping bug and populating the database with high-traffic Canadian trims (Toyota, Hyundai, BMW Diesel/M, Volvo Recharge), I reduced the number of external API calls, lowering token costs and latency ahead of the production launch.
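As a sketch of how the debug metadata could be consumed, the snippet below types the three fields named above and derives two simple signals from them. The `ValuationPayload` interface is a guess at a minimal shape (the real payload carries more fields), and both helper functions are hypothetical.

```typescript
type MsrpResolvedBy = "canada_msrp_anchor" | "gemini" | "none";

// Minimal assumed shape: only the fields named in this post.
interface ValuationPayload {
  marketValueCad: number;
  msrpResolvedBy: MsrpResolvedBy;
  systemCalculatedBaseCad: number;
}

// Telemetry: how often each MSRP source was used across a batch of reports.
function countByResolver(payloads: ValuationPayload[]): Record<MsrpResolvedBy, number> {
  const counts: Record<MsrpResolvedBy, number> = {
    canada_msrp_anchor: 0,
    gemini: 0,
    none: 0,
  };
  for (const p of payloads) {
    counts[p.msrpResolvedBy] += 1;
  }
  return counts;
}

// UI hook: how far (in percent) the final price sits from the system base.
function adjustmentPct(p: ValuationPayload): number {
  return ((p.marketValueCad - p.systemCalculatedBaseCad) / p.systemCalculatedBaseCad) * 100;
}
```

A rising share of `gemini` or `none` resolutions in the counts would be a cue to add more anchors to the database.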
