1. Abstract
This benchmark evaluates how five leading large language models (Claude Sonnet 4.5, GPT-5.2 Chat, Gemini 3 Flash, Grok 4.1 Fast, and Mistral Large) perform on coaching-style communication tasks, as judged by DeepSeek-V3 across six dimensions. Using 42 scenarios and 3-turn conversations, this study measures relative model performance on reflective questioning and advice-giving patterns. Claude Sonnet 4.5 scored highest at 16.5/30, showing stronger inquiry patterns compared to others.
2. Experiment Setup
Overview
Pipeline: Each scenario follows this flow:
- Scenario: Personal growth challenge (e.g., career transition)
- Model Response 1: Model answers initial question
- User Response 1: AI-simulated user reply (Qwen)
- Model Response 2: Model responds to user
- User Response 2: AI-simulated user reply (Qwen)
- Model Response 3: Model's final response
- Evaluation: DeepSeek judges the full 3-turn conversation
Key design: Each conversation is evaluated 3 times (separate runs) to measure consistency.
a. Scenario Generation
Input Model: Qwen2.5-72B-Instruct (synthetic scenario generation via OpenRouter)
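The exact generation prompts are not reproduced in this report; the sketch below shows one plausible way to generate a scenario through OpenRouter's OpenAI-compatible API. The `generate_scenario` helper, its `category` argument, and the prompt wording are illustrative assumptions, not the benchmark's actual script.

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; the API key is read from the environment.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def generate_scenario(category: str) -> str:
    """Ask Qwen2.5-72B-Instruct for one synthetic personal-growth scenario.

    The prompt wording here is illustrative, not the benchmark's exact prompt.
    """
    response = client.chat.completions.create(
        model="qwen/qwen-2.5-72b-instruct",
        messages=[{
            "role": "user",
            "content": (
                f"Write a short first-person personal growth scenario about {category}. "
                "End with the question the person would bring to a coach."
            ),
        }],
    )
    return response.choices[0].message.content
```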
b. Conversation Flow
Each model engages in a 3-turn conversation per scenario:
- Turn 1: Model responds to initial scenario prompt
- Turn 2: Model responds to user follow-up (AI-generated by Qwen)
- Turn 3: Model responds to final user query (AI-generated by Qwen)
This structure mirrors real coaching conversations where the coach must build on earlier exchanges.
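A minimal sketch of this turn-taking loop follows, assuming the models are called through the OpenRouter API (as in the reference links) and reusing the `client` from the scenario-generation sketch above. The `coach_turn`, `simulate_user`, and `run_conversation` helpers are illustrative, not the benchmark's exact code.

```python
def coach_turn(model_id: str, messages: list[dict]) -> str:
    """One response from the model under test (temperature=0, per the setup)."""
    reply = client.chat.completions.create(
        model=model_id, messages=messages, temperature=0,
    )
    return reply.choices[0].message.content

def simulate_user(conversation: list[dict]) -> str:
    """Qwen plays the client and reacts to the coach's latest message."""
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    reply = client.chat.completions.create(
        model="qwen/qwen-2.5-72b-instruct",
        messages=[{
            "role": "user",
            "content": "You are the coaching client in the conversation below. "
                       "Reply briefly and realistically to the coach's last message.\n\n"
                       + transcript,
        }],
    )
    return reply.choices[0].message.content

def run_conversation(model_id: str, scenario: str) -> list[dict]:
    """Three model responses with two simulated user follow-ups in between."""
    messages = [{"role": "user", "content": scenario}]
    for turn in range(3):
        messages.append({"role": "assistant", "content": coach_turn(model_id, messages)})
        if turn < 2:
            messages.append({"role": "user", "content": simulate_user(messages)})
    return messages
```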
c. Model Evaluation
- Anthropic: Claude Sonnet 4.5 (Free Web via claude.ai)
- OpenAI: GPT-5.2 Chat (Free Web via chat.openai.com)
- Google: Gemini 3 Flash Preview (Free Web via gemini.google.com)
- xAI: Grok 4.1 Fast (Default via grok.com, X, mobile apps)
- Mistral AI: Mistral Large (powers Le Chat web/iOS/Android)
Methodology: Models tested without coaching-specific system prompts (out-of-box behavior).
d. Multi-Run Evaluations
- Purpose: Measure judge consistency and ensure reproducibility.
- Process: Each 3-turn conversation is presented to DeepSeek-V3 three separate times. Scores are averaged.
- Result: A standard deviation of ~0.3-0.4 points on the 30-point total indicates the judge is highly consistent.
- Note: This measures evaluation reliability, not model behavior. Models are run at temperature=0 (near-deterministic).
e. Independent Assessment
Evaluator Model: DeepSeek-V3 (independent judge via OpenRouter)
f. Reproducibility and Variance
- Why 3 runs: LLMs can vary slightly even at temperature=0. Multiple runs reveal how consistent the judge is.
- What std shows: We calculate std per scenario (across the 3 runs), then average across all 42 scenarios. A std of ~0.37 means re-running the evaluation would typically shift a score by only about ±0.37 points (see the sketch after this list).
- What it does NOT mean: This is NOT a confidence interval. This study shows relative ranking, not statistical significance.
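A minimal sketch of this aggregation, assuming one model's per-run totals are stored as a 42 × 3 array; the variable names and placeholder values are illustrative.

```python
import numpy as np

# totals[i, j] = total score (out of 30) for scenario i on judge run j,
# for a single model. Placeholder values stand in for real judge output.
totals = np.random.uniform(10, 20, size=(42, 3))

per_scenario_mean = totals.mean(axis=1)        # score per scenario, averaged over runs
per_scenario_std = totals.std(axis=1, ddof=1)  # judge spread per scenario (sample std)

model_score = per_scenario_mean.mean()   # the reported total for this model
judge_std = per_scenario_std.mean()      # the ~0.3-0.4 consistency figure in the text
```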
g. Scoring Framework
- Total Scenarios: 42 personal growth scenarios
- Scenario Categories: 6 categories (career transitions, relationship patterns, identity perception, decision making, habit formation, motivation resistance)
- Conversation Structure: 3-turn conversations with each model
- Total Evaluations: 630 (42 scenarios × 5 models × 3 runs)
- Evaluation Method: Independent assessment using DeepSeek-V3
- Scoring Framework: Based on ICF (International Coaching Federation) Core Competencies
- Focus: Reflective questioning vs advice-giving
h. ICF Competency Scoring Framework
Each conversation was scored across six dimensions (0-5 points each, total 30 points):
- Evokes Awareness: Helps clients discover insights through questioning
- Active Listening: Reflective and clarifying statements that show understanding
- Maintains Agency: Avoids directive advice, lets client drive the process
- Question Depth: Movement from surface questions to deeper inquiry
- Client-Centered: Focuses on client's perspective and experience
- Ethical Boundaries: Maintains appropriate professional scope
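The judge prompt itself is not reproduced in this report; the sketch below shows one plausible shape for it, assuming the DeepSeek API's OpenAI-compatible endpoint (as in the reference links) and a JSON reply with one 0-5 score per dimension. The rubric wording, dimension keys, and `judge_conversation` helper are illustrative assumptions.

```python
import json
import os
from openai import OpenAI

# DeepSeek's API is OpenAI-compatible; "deepseek-chat" serves DeepSeek-V3.
judge_client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

DIMENSIONS = [
    "evokes_awareness", "active_listening", "maintains_agency",
    "question_depth", "client_centered", "ethical_boundaries",
]

def judge_conversation(transcript: str) -> dict:
    """Score one 3-turn conversation on the six ICF-derived dimensions."""
    rubric = (
        "Score the coach in the conversation below on each dimension from 0 to 5: "
        + ", ".join(DIMENSIONS)
        + ". Reply with only a JSON object mapping each dimension to an integer score."
    )
    response = judge_client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": rubric + "\n\n" + transcript}],
    )
    scores = json.loads(response.choices[0].message.content)
    scores["total"] = sum(scores[d] for d in DIMENSIONS)  # out of 30
    return scores
```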
i. Construct Validity Note
What CoachBench measures: This benchmark measures how five LLMs perform on coaching-style communication tasks, as evaluated by DeepSeek-V3 across six dimensions derived from ICF Core Competencies.
What CoachBench does NOT measure:
- Objective "coaching quality" — no such ground truth exists
- Alignment with human coaching standards
- Actual coaching outcomes (e.g., client behavior change)
- Model "personality" or inherent coaching ability
The construct validity challenge: There is no objective benchmark for "good coaching." DeepSeek-V3's judgment reflects its training, biases, and interpretation of the scoring criteria. If DeepSeek-V3's training favors certain response styles (e.g., more questions, less advice), it may systematically reward models that produce those styles.
What this means for interpretation: Results should be read as "relative model performance on DeepSeek-V3's criteria" not "which model is a better coach." The rankings are valid for comparing model behaviors. They are not valid for making claims about absolute coaching quality.
Future work: A more rigorous validation would compare LLM judge scores against human raters trained on ICF Core Competencies.
3. Scenarios
The benchmark uses 42 synthetic personal growth scenarios covering six categories: career transitions, relationship patterns, identity perception, decision making, habit formation, and motivation resistance. Each scenario presents a realistic challenge where coaching-style questioning matters.
3.1 Limitations and Scope
This study has important limitations that affect interpretation of results:
- LLM-based evaluation: Results reflect DeepSeek-V3's judgment of coaching-style communication, not objective coaching quality. Single-judge design introduces potential bias. A more rigorous approach would use multiple judges or human raters trained on ICF Core Competencies.
- Limited generalizability: 42 scenarios across 6 categories. Results may not apply to all coaching contexts, conversation types, or real-world client interactions.
- Simulated conversations: User responses are AI-generated by Qwen, not real humans. May not reflect authentic user behavior patterns or emotional nuance.
What this study shows: Relative model performance on DeepSeek-V3's criteria for coaching-style communication. Rankings are valid for comparing model behaviors. They are not valid for claims about absolute coaching quality or real-world coaching effectiveness.
4. Results
Five models were evaluated (Claude Sonnet 4.5, GPT-5.2 Chat, Gemini 3 Flash, Grok 4.1 Fast, and Mistral Large). Claude Sonnet 4.5 performed best overall, but all models struggled to maintain client agency, often defaulting to directive advice rather than staying in inquiry.
| Metric | Claude Sonnet 4.5 | GPT-5.2 Chat | Gemini 3 Flash | Grok 4.1 Fast | Mistral Large |
|---|---|---|---|---|---|
| Total (out of 30) | 16.5 | 14.7 | 11.6 | 10.9 | 12.1 |
5. Conclusion
This benchmark compares how leading LLMs perform on coaching-style communication tasks. Claude Sonnet 4.5 scored higher than others, particularly in active listening and question depth. All models showed strengths in reflective communication but struggled with maintaining client agency.
Practical Implications
High-scoring models (here, Claude Sonnet 4.5) are well-suited for:
- Self-reflection and journaling prompts
- Thought partnership for decision-making
- Learning coaching-style communication techniques
- Exploring ideas without receiving directive advice
When Human Coaching Remains Essential:
- Sustained personal development work
- Accountability partnerships
- Processing complex emotions
- Leadership development
- Life transitions requiring relational support
6. Extend Benchmark
- With system prompts: Add coaching instructions to models and compare to out-of-box behavior (an illustrative prompt is sketched after this list)
- Additional models: Test GPT-4o, Claude 4 Opus, or compare web vs. API versions
- Human evaluation: Replace LLM judge with human raters trained on ICF Core Competencies
- New scenarios: Add domain-specific scenarios (career coaching, relationships, etc.)
- Extended conversations: Test 5+ turn conversations to assess sustained inquiry
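For the first extension, a minimal sketch of what running "with system prompts" could look like, reusing the `coach_turn` and `simulate_user` helpers from the conversation-flow sketch above; the prompt wording is an illustrative example, not a validated coaching prompt.

```python
# Illustrative coaching system prompt; the wording is an example only.
COACH_SYSTEM_PROMPT = (
    "You are a coach working from the ICF Core Competencies. Favor open questions "
    "and reflections over advice, and keep the client in charge of the process."
)

def run_prompted_conversation(model_id: str, scenario: str) -> list[dict]:
    """Same 3-turn loop as the out-of-box setup, with a system prompt prepended."""
    messages = [
        {"role": "system", "content": COACH_SYSTEM_PROMPT},
        {"role": "user", "content": scenario},
    ]
    for turn in range(3):
        messages.append({"role": "assistant", "content": coach_turn(model_id, messages)})
        if turn < 2:
            messages.append({"role": "user", "content": simulate_user(messages)})
    return messages
```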
7. References
Model Links
- Claude Sonnet 4.5 — OpenRouter: openrouter.ai/anthropic/claude-sonnet-4.5
- GPT-5.2 Chat — OpenRouter: openrouter.ai/openai/gpt-5.2-chat
- Gemini 3 Flash Preview — OpenRouter: openrouter.ai/google/gemini-3-flash-preview
- Grok 4.1 Fast — OpenRouter: openrouter.ai/x-ai/grok-4.1-fast
- Mistral Large — OpenRouter: openrouter.ai/mistralai/mistral-large
- Qwen2.5-72B-Instruct — OpenRouter: openrouter.ai/qwen/qwen-2.5-72b-instruct
- DeepSeek-V3 — API: api.deepseek.com
Project Links
- CoachBench Repository — GitHub: github.com/shubhamVerma/coachbench
- Live Benchmark — shubhamverma.github.io/coachbench
Citation
BibTeX Format
@article{verma2026coachbench,
  title={CoachBench: Evaluating Reflective Questioning Quality in Large Language Models},
  author={Verma, Shubham},
  year={2026},
  url={https://shubhamverma.github.io/coachbench/},
  note={Benchmark evaluating 5 LLMs across 42 coaching scenarios using ICF Core Competencies framework}
}
Benchmark hosted at https://shubhamverma.github.io/coachbench/
Plain Text Format
Verma, Shubham. (2026). CoachBench: Evaluating Reflective Questioning Quality in Large Language Models. Retrieved from https://shubhamverma.github.io/coachbench/