1. Abstract
This benchmark evaluates how five leading large language models (Claude Sonnet 4.5, GPT-5.2 Chat, Gemini 3 Flash, Grok 4.1 Fast, and Mistral Large) perform on coaching-style communication tasks, as judged by DeepSeek-V3 across six dimensions. Using 42 scenarios and 3-turn conversations, this study measures relative model performance on reflective questioning and advice-giving patterns. Claude Sonnet 4.5 scored highest at 16.5/30, showing stronger inquiry patterns compared to others.
2. Experiment Setup
Overview
Pipeline: Each scenario follows this flow:
- Scenario: Personal growth challenge (e.g., career transition)
- Model Response 1: Model answers initial question
- User Response 1: AI-simulated user reply (Qwen)
- Model Response 2: Model responds to user
- User Response 2: AI-simulated user reply (Qwen)
- Model Response 3: Model's final response
- Evaluation: DeepSeek judges the full 3-turn conversation
Key design: Each conversation is evaluated 3 times (separate runs) to measure consistency.
a. Scenario Generation
Input Model: Qwen2.5-72B-Instruct (synthetic scenario generation via OpenRouter)
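The exact generation prompts are not reproduced in this report; the sketch below shows one plausible way to generate a scenario through OpenRouter's OpenAI-compatible API. The `generate_scenario` helper, its `category` argument, and the prompt wording are illustrative assumptions, not the benchmark's actual script.

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; the API key is read from the environment.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def generate_scenario(category: str) -> str:
    """Ask Qwen2.5-72B-Instruct for one synthetic personal-growth scenario.

    The prompt wording here is illustrative, not the benchmark's exact prompt.
    """
    response = client.chat.completions.create(
        model="qwen/qwen-2.5-72b-instruct",
        messages=[{
            "role": "user",
            "content": (
                f"Write a short first-person personal growth scenario about {category}. "
                "End with the question the person would bring to a coach."
            ),
        }],
    )
    return response.choices[0].message.content
```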
b. Conversation Flow
Each model engages in a 3-turn conversation per scenario:
- Turn 1: Model responds to initial scenario prompt
- Turn 2: Model responds to user follow-up (AI-generated by Qwen)
- Turn 3: Model responds to final user query (AI-generated by Qwen)
This structure mirrors real coaching conversations where the coach must build on earlier exchanges.
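A minimal sketch of this turn-taking loop follows, assuming the models are called through the OpenRouter API (as in the reference links) and reusing the `client` from the scenario-generation sketch above. The `coach_turn`, `simulate_user`, and `run_conversation` helpers are illustrative, not the benchmark's exact code.

```python
def coach_turn(model_id: str, messages: list[dict]) -> str:
    """One response from the model under test (temperature=0, per the setup)."""
    reply = client.chat.completions.create(
        model=model_id, messages=messages, temperature=0,
    )
    return reply.choices[0].message.content

def simulate_user(conversation: list[dict]) -> str:
    """Qwen plays the client and reacts to the coach's latest message."""
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    reply = client.chat.completions.create(
        model="qwen/qwen-2.5-72b-instruct",
        messages=[{
            "role": "user",
            "content": "You are the coaching client in the conversation below. "
                       "Reply briefly and realistically to the coach's last message.\n\n"
                       + transcript,
        }],
    )
    return reply.choices[0].message.content

def run_conversation(model_id: str, scenario: str) -> list[dict]:
    """Three model responses with two simulated user follow-ups in between."""
    messages = [{"role": "user", "content": scenario}]
    for turn in range(3):
        messages.append({"role": "assistant", "content": coach_turn(model_id, messages)})
        if turn < 2:
            messages.append({"role": "user", "content": simulate_user(messages)})
    return messages
```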
c. Model Evaluation
- Anthropic: Claude Sonnet 4.5 (Free Web via claude.ai)
- OpenAI: GPT-5.2 Chat (Free Web via chat.openai.com)
- Google: Gemini 3 Flash Preview (Free Web via gemini.google.com)
- xAI: Grok 4.1 Fast (Default via grok.com, X, mobile apps)
- Mistral AI: Mistral Large (powers Le Chat web/iOS/Android)
Methodology: Models tested without coaching-specific system prompts (out-of-box behavior).
d. Multi-Run Evaluations
- Purpose: Measure judge consistency and ensure reproducibility.
- Process: Each 3-turn conversation is presented to DeepSeek-V3 three separate times. Scores are averaged.
- Result: A standard deviation of ~0.3-0.4 points on the 30-point total indicates the judge is highly consistent.
- Note: This measures evaluation reliability, not model behavior. Models are run at temperature=0 (near-deterministic).
e. Independent Assessment
Evaluator Model: DeepSeek-V3 (independent judge via OpenRouter)
f. Reproducibility and Variance
- Why 3 runs: LLMs can vary slightly even at temperature=0. Multiple runs reveal how consistent the judge is.
- What std shows: We calculate std per scenario (across the 3 runs), then average across all 42 scenarios. A std of ~0.37 means re-running the evaluation would typically shift a score by only about ±0.37 points (see the sketch after this list).
- What it does NOT mean: This is NOT a confidence interval. This study shows relative ranking, not statistical significance.
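A minimal sketch of this aggregation, assuming one model's per-run totals are stored as a 42 × 3 array; the variable names and placeholder values are illustrative.

```python
import numpy as np

# totals[i, j] = total score (out of 30) for scenario i on judge run j,
# for a single model. Placeholder values stand in for real judge output.
totals = np.random.uniform(10, 20, size=(42, 3))

per_scenario_mean = totals.mean(axis=1)        # score per scenario, averaged over runs
per_scenario_std = totals.std(axis=1, ddof=1)  # judge spread per scenario (sample std)

model_score = per_scenario_mean.mean()   # the reported total for this model
judge_std = per_scenario_std.mean()      # the ~0.3-0.4 consistency figure in the text
```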
g. Scoring Framework
- Total Scenarios: 42 personal growth scenarios
- Scenario Categories: 6 categories (career transitions, relationship patterns, identity perception, decision making, habit formation, motivation resistance)
- Conversation Structure: 3-turn conversations with each model
- Total Evaluations: 630 (42 scenarios × 5 models × 3 runs)
- Evaluation Method: Independent assessment using DeepSeek-V3
- Scoring Framework: Based on ICF (International Coaching Federation) Core Competencies
- Focus: Reflective questioning vs advice-giving
h. ICF Competency Scoring Framework
Each conversation was scored across six dimensions (0-5 points each, total 30 points):
- Evokes Awareness: Helps clients discover insights through questioning
- Active Listening: Reflective and clarifying statements that show understanding
- Maintains Agency: Avoids directive advice, lets client drive the process
- Question Depth: Movement from surface questions to deeper inquiry
- Client-Centered: Focuses on client's perspective and experience
- Ethical Boundaries: Maintains appropriate professional scope
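The judge prompt itself is not reproduced in this report; the sketch below shows one plausible shape for it, assuming the DeepSeek API's OpenAI-compatible endpoint (as in the reference links) and a JSON reply with one 0-5 score per dimension. The rubric wording, dimension keys, and `judge_conversation` helper are illustrative assumptions.

```python
import json
import os
from openai import OpenAI

# DeepSeek's API is OpenAI-compatible; "deepseek-chat" serves DeepSeek-V3.
judge_client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

DIMENSIONS = [
    "evokes_awareness", "active_listening", "maintains_agency",
    "question_depth", "client_centered", "ethical_boundaries",
]

def judge_conversation(transcript: str) -> dict:
    """Score one 3-turn conversation on the six ICF-derived dimensions."""
    rubric = (
        "Score the coach in the conversation below on each dimension from 0 to 5: "
        + ", ".join(DIMENSIONS)
        + ". Reply with only a JSON object mapping each dimension to an integer score."
    )
    response = judge_client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": rubric + "\n\n" + transcript}],
    )
    scores = json.loads(response.choices[0].message.content)
    scores["total"] = sum(scores[d] for d in DIMENSIONS)  # out of 30
    return scores
```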
i. Construct Validity Note
What CoachBench measures: This benchmark measures how five LLMs perform on coaching-style communication tasks, as evaluated by DeepSeek-V3 across six dimensions derived from ICF Core Competencies.
What CoachBench does NOT measure:
- Objective "coaching quality" — no such ground truth exists
- Alignment with human coaching standards
- Actual coaching outcomes (e.g., client behavior change)
- Model "personality" or inherent coaching ability
The construct validity challenge: There is no objective benchmark for "good coaching." DeepSeek-V3's judgment reflects its training, biases, and interpretation of the scoring criteria. If DeepSeek-V3's training favors certain response styles (e.g., more questions, less advice), it may systematically reward models that produce those styles.
What this means for interpretation: Results should be read as "relative model performance on DeepSeek-V3's criteria" not "which model is a better coach." The rankings are valid for comparing model behaviors. They are not valid for making claims about absolute coaching quality.
Future work: A more rigorous validation would compare LLM judge scores against human raters trained on ICF Core Competencies.
3. Scenarios
The benchmark uses 42 synthetic personal growth scenarios covering six categories: career transitions, relationship patterns, identity perception, decision making, habit formation, and motivation resistance. Each scenario presents a realistic challenge where coaching-style questioning matters.
3.1 Limitations and Scope
This study has important limitations that affect interpretation of results:
- LLM-based evaluation: Results reflect DeepSeek-V3's judgment of coaching-style communication, not objective coaching quality. Single-judge design introduces potential bias. A more rigorous approach would use multiple judges or human raters trained on ICF Core Competencies.
- Limited generalizability: 42 scenarios across 6 categories. Results may not apply to all coaching contexts, conversation types, or real-world client interactions.
- Simulated conversations: User responses are AI-generated by Qwen, not real humans. May not reflect authentic user behavior patterns or emotional nuance.
What this study shows: Relative model performance on DeepSeek-V3's criteria for coaching-style communication. Rankings are valid for comparing model behaviors. They are not valid for claims about absolute coaching quality or real-world coaching effectiveness.
4. Results
Five models were evaluated (Claude Sonnet 4.5, GPT-5.2 Chat, Gemini 3 Flash, Grok 4.1 Fast, and Mistral Large). Claude Sonnet 4.5 performed best overall, but all models struggled to maintain client agency, often defaulting to directive advice rather than staying in inquiry.
| Metric | Claude Sonnet 4.5 | GPT-5.2 Chat | Gemini 3 Flash | Grok 4.1 Fast | Mistral Large |
|---|---|---|---|---|---|
| Total (out of 30) | 16.5 | 14.7 | 11.6 | 10.9 | 12.1 |
5. Conclusion
This benchmark compares how leading LLMs perform on coaching-style communication tasks. Claude Sonnet 4.5 scored higher than others, particularly in active listening and question depth. All models showed strengths in reflective communication but struggled with maintaining client agency.
Practical Implications
High-scoring models (here, Claude Sonnet 4.5) are well-suited for:
- Self-reflection and journaling prompts
- Thought partnership for decision-making
- Learning coaching-style communication techniques
- Exploring ideas without receiving directive advice
When Human Coaching Remains Essential:
- Sustained personal development work
- Accountability partnerships
- Processing complex emotions
- Leadership development
- Life transitions requiring relational support
6. Extend Benchmark
- With system prompts: Add coaching instructions to models and compare to out-of-box behavior (an illustrative prompt is sketched after this list)
- Additional models: Test GPT-4o, Claude 4 Opus, or compare web vs. API versions
- Human evaluation: Replace LLM judge with human raters trained on ICF Core Competencies
- New scenarios: Add domain-specific scenarios (career coaching, relationships, etc.)
- Extended conversations: Test 5+ turn conversations to assess sustained inquiry
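For the first extension, a minimal sketch of what running "with system prompts" could look like, reusing the `coach_turn` and `simulate_user` helpers from the conversation-flow sketch above; the prompt wording is an illustrative example, not a validated coaching prompt.

```python
# Illustrative coaching system prompt; the wording is an example only.
COACH_SYSTEM_PROMPT = (
    "You are a coach working from the ICF Core Competencies. Favor open questions "
    "and reflections over advice, and keep the client in charge of the process."
)

def run_prompted_conversation(model_id: str, scenario: str) -> list[dict]:
    """Same 3-turn loop as the out-of-box setup, with a system prompt prepended."""
    messages = [
        {"role": "system", "content": COACH_SYSTEM_PROMPT},
        {"role": "user", "content": scenario},
    ]
    for turn in range(3):
        messages.append({"role": "assistant", "content": coach_turn(model_id, messages)})
        if turn < 2:
            messages.append({"role": "user", "content": simulate_user(messages)})
    return messages
```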
7. References
Model Links
- Claude Sonnet 4.5 — OpenRouter: openrouter.ai/anthropic/claude-sonnet-4.5
- GPT-5.2 Chat — OpenRouter: openrouter.ai/openai/gpt-5.2-chat
- Gemini 3 Flash Preview — OpenRouter: openrouter.ai/google/gemini-3-flash-preview
- Grok 4.1 Fast — OpenRouter: openrouter.ai/x-ai/grok-4.1-fast
- Mistral Large — OpenRouter: openrouter.ai/mistralai/mistral-large
- Qwen2.5-72B-Instruct — OpenRouter: openrouter.ai/qwen/qwen-2.5-72b-instruct
- DeepSeek-V3 — API: api.deepseek.com
Project Links
- CoachBench Repository — GitHub: github.com/shubhamVerma/coachbench
- Live Benchmark — shubhamverma.github.io/coachbench
Citation
BibTeX Format
@article{verma2026coachbench,
  title={CoachBench: Evaluating Reflective Questioning Quality in Large Language Models},
  author={Verma, Shubham},
  year={2026},
  url={https://shubhamverma.github.io/coachbench/},
  note={Benchmark evaluating 5 LLMs across 42 coaching scenarios using ICF Core Competencies framework}
}
Benchmark hosted at https://shubhamverma.github.io/coachbench/
Plain Text Format
Verma, Shubham. (2026). CoachBench: Evaluating Reflective Questioning Quality in Large Language Models. Retrieved from https://shubhamverma.github.io/coachbench/