Building an LLM Evaluation Dashboard with AWS Bedrock and Rails Admin

1. The Problem: The Confidence Trap in AI Strategy

RevSec AI is an agentic pricing-intelligence platform. The core deliverable for this eval feature is a 12-part strategy synthesis generated via AWS Bedrock. It covers everything from executive overviews to risk assessments, representing one critical module in the full pricing strategy generation engine.

The Risk: LLM outputs look confident even when they are wrong. A strategy document can read like it was written by a senior analyst while quietly inventing churn rates or fabricating infrastructure costs.
The Philosophy: Our architecture follows a "Math-First, AI-Second" principle.
Deterministic Layer: All financial modeling (Price Elasticity, Van Westendorp, Stress Testing) is performed by Ruby service objects with a full logic trace.
Narrative Layer: Bedrock only narrates these pre-computed outputs.

To ensure the Narrative Layer never drifts from the Math Layer, I built a custom Eval Dashboard inside Rails Admin. This allows us to treat AI quality as a measurable engineering metric, rather than an assumption.

Rails Admin Eval Performance Dashboard

2. Architecture: The Performance Center

I chose to extend Rails Admin rather than building a separate React dashboard or using Retool.

Why Rails Admin?

Zero Context-Switching: Keeping observability within the core monolith means the team doesn't have to manage a second repo.
Service-Object Alignment: Our StrategyEval services live alongside the code they monitor.
Deployment Simplicity: One git push updates both the product and the monitoring tools.

The Stack:

Backend: Rails 8 (Service-Object pattern).
LLMs: Claude 3.5 Sonnet (Generation) & Claude 3.5 Haiku (Evaluation).
Frontend: Chart.js via CDN for p95 latency and cost visualizations.
Storage: PostgreSQL JSONB for granular evaluation dimensions.

3. The Dual-Layer Eval Framework

We don't rely on a single score. Every simulation passes through a two-stage framework.

Layer 1: The Deterministic Gate (Fast & Free)

Before calling an LLM to judge quality, a Ruby service runs 8 boolean checks.

Hard Checks: Does it have an Executive Overview? Is it free of placeholder text (e.g., [Insert Company Name])?
Data Presence: Does it actually mention the user's specific competitors and price points?
Result: Catches ~15% of failures (malformed JSON or truncated responses) instantly, at no API cost.

Layer 2: LLM-as-Judge (Haiku 3.5)

We use Claude 3.5 Haiku at a temperature of 0.0 to score the narrative across five dimensions:

Dimension	Focus
Faithfulness	Does the AI invent numbers not found in the source math (ingested inputs/data)?
Coverage	Did it omit critical projections or risk factors?
Specificity	Is the advice concrete (e.g., "$39/mo") or generic fluff?
Actionability	Are there clear next steps tied to the data?
Non-Redundancy	Does it repeat the same insight across 12 sections?

Note: This post specifically covers the Pricing Strategy Engine. Agentic and AI Assistant features are evaluated via a separate performance pipeline, but follow a similar approach.

Eval dashboard for the strategy aynthesis engine

4. Lessons from the Trenches

Faithfulness is the hardest problem

Bedrock is too good at being helpful. It will invent a 2.5% inflation rate because it sounds realistic. By tracking a fabricated_numbers array in our JSONB logs, we identified that the LLM was hallucinating industry benchmarks. We used this data to tighten our system prompts, explicitly forbidding any figure not present in the <computed> data block.

Scoring Honesty over Perfection

We learned that Actionability and Specificity are often correlated, but Faithfulness is usually our lowest-scoring dimension. In a business-critical pricing app, a 5/5 prose style is useless if the faithfulness is a 2/5.

Temperature 0.0 is non-negotiable

For evals to be useful for regression testing, they must be deterministic. At higher temperatures, the Judge is inconsistent. At 0.0, our Judge provides a stable baseline that allows us to run Rake-based Matrix Testing across various pricing/business models (SaaS, Hybrid, Usage-based) to see how prompt changes affect quality.

Eval detailed breakdown showing flagged metrics

5. Data Strategy & Schema

I avoided creating a separate Evaluation model. Instead, I utilized JSONB columns on the Simulation model. This allows us to perform complex SQL aggregations on the fly:

# Example: Identifying the most common hallucinations
Simulation.where.not(eval_critical_issues: [])
          .pluck(:eval_critical_issues)
          .flatten
          .group_by(&:itself)
          .transform_values(&:count)

6. The Optimization Loop

Ultimately, this framework supports a systematic prompt engineering workflow. By identifying specific failure modes, like the hallucination of industry benchmarks, ingested macroeconomic data, etc, we can iterate on our system prompts with a clear feedback loop - Measurable and highly actionable through clear delineation. This ensures that the 12-part synthesis, as well as the parallel service layers powering our pricing dashboards, recommended pricing moves across pricing models, plan management, and CPQ, remain grounded in the deterministic service layer's logic.

Closing Thoughts

Building an internal eval tool rather than using a third-party black box observability platform allowed us to align our monitoring with our specific product philosophy. We don't just know if the AI is good, we know if it's being faithful to the math, i.e., the ingested data and inputs that form the basis for strategy generation.

What’s Next?

Real-time Slack Alerting: Automated notifications when gate_pass_rate drops below 80%.
User Feedback Loops: Correlating high "Actionability" scores with actual user engagement.
Finalize the EVAl and Performance Monitoring service for each agentic chat experience, scoped to the users’ private business and platform data.