Job Title
Evaluation Engineer – AI Systems
Role Overview
We are seeking an Evaluation Engineer with hands-on experience testing AI-powered products, particularly large language models and other generative or probabilistic systems. This role goes beyond traditional functional QA and focuses on evaluating model behavior, hallucinations, edge cases, and subjective or custom-defined quality metrics. You will play a critical role in ensuring our AI systems are reliable, safe, and aligned with real-world user and business expectations.
Key Responsibilities
- Design and execute evaluation strategies for AI-driven products, including LLM-based and other non-deterministic systems
- Identify, categorize, and track hallucinations, factual inaccuracies, and unsafe or misaligned model outputs
- Develop and maintain test frameworks for subjective evaluation criteria (e.g., relevance, coherence, tone, clinical or domain appropriateness, and pharma-specific criteria)
- Define and operationalize custom quality metrics in collaboration with product, ML, and domain experts
- Perform regression testing on model updates, prompt changes, and pipeline modifications
- Create structured test datasets, adversarial prompts, and edge-case scenarios to stress-test AI behavior
- Analyze qualitative and quantitative evaluation results and translate findings into actionable recommendations
- Document evaluation processes, rubrics, and known failure modes clearly and rigorously
- Partner closely with the engineering team to support model improvement and release readiness
Required Qualifications
- 2+ years of experience in evaluation or test engineering, with mandatory experience focused on AI- or ML-based products
- Direct experience testing generative AI systems, including identifying hallucinations and non-deterministic failures
- Experience designing and applying subjective or human-in-the-loop evaluation frameworks
- Familiarity with custom metrics beyond accuracy (e.g., groundedness, relevance, completeness, consistency)
- Strong analytical skills and comfort working with ambiguous or probabilistic outputs