After three months of development, here's how I engineered an AI system that conducts realistic mock interviews. Candidates spend hours preparing for interviews without getting realistic practice, while companies struggle to assess technical skills effectively. After watching countless friends fumble through technical interviews despite being excellent developers, I decided to build something better.
The result? An AI-powered mock interview platform that conducts natural, contextual conversations with sub-10ms response times.
This is Part 1 of a 2-part series documenting the complete technical journey. Today, I'll walk you through the three core engines that power the platform: synthetic data generation, LLM fine-tuning, and production deployment.
The Challenge: Making AI Interviews Feel Human
Building an AI interviewer isn't just about connecting ChatGPT to a microphone. Real interviews are messy, unpredictable, and deeply contextual. Candidates ask for clarification, give rambling answers, try to deflect difficult questions, or sometimes don't know the answer at all.
The technical challenges were immense:
- No training data existed for realistic interview conversations at scale
- Generic language models lack the nuanced understanding of interview dynamics
- Real-time performance requirements demand sub-10ms response times
- Natural conversation flow requires understanding when to speak and when to listen
Engine 1: Synthetic Training Data Generation Using GPT-4
The Data Problem
Every AI system is only as good as its training data. For interview conversations, this data simply didn't exist. I needed thousands of realistic interview conversations covering:
- 25+ different job roles (frontend, backend, data science, product management)
- 4 experience levels (junior, mid, senior, lead)
- 13 candidate behavior patterns (hesitant, verbose, deflective, professional, etc.)
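A quick back-of-the-envelope check on what these dimensions imply (assuming roughly even coverage per combination, which is my assumption, not stated in the pipeline):

```python
# Rough combinatorics of the dataset dimensions described above
roles = 25       # distinct job roles
levels = 4       # junior, mid, senior, lead
behaviors = 13   # candidate behavior patterns

combinations = roles * levels * behaviors
print(combinations)        # 1300 unique (role, level, behavior) cells

# Spreading the final 15,000-conversation dataset across those cells:
per_cell = 15000 / combinations
print(round(per_cell, 1))  # ~11.5 conversations per combination
```

Roughly a dozen conversations per cell is enough variety that the model sees each behavior pattern expressed across many roles and seniority levels.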
Step 1: Comprehensive Role Matrix
TECH_ROLES = {
    "software_engineering": {
        "positions": [
            {
                "title": "Frontend Developer",
                "levels": ["Junior", "Mid", "Senior", "Lead"],
                "core_skills": [
                    "HTML5", "CSS3", "JavaScript", "TypeScript",
                    "React", "Angular", "Vue.js", "Responsive Design",
                    "Web Performance", "Browser APIs", "Testing"
                ]
            },
            {
                "title": "Backend Developer",
                "levels": ["Junior", "Mid", "Senior", "Lead"],
                "core_skills": [
                    "Java/Python/Node.js", "RESTful APIs",
                    "Database Design", "SQL", "System Architecture",
                    "Microservices", "Authentication", "Caching"
                ]
            },
            # ... 25+ more positions
        ]
    }
}
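Given a matrix like this, enumerating every (title, level) pair to drive generation is a simple nested walk. A minimal sketch with a two-position excerpt of the matrix (the helper name is mine, not from the original codebase):

```python
TECH_ROLES = {
    "software_engineering": {
        "positions": [
            {"title": "Frontend Developer",
             "levels": ["Junior", "Mid", "Senior", "Lead"],
             "core_skills": ["HTML5", "CSS3", "JavaScript"]},
            {"title": "Backend Developer",
             "levels": ["Junior", "Mid", "Senior", "Lead"],
             "core_skills": ["RESTful APIs", "SQL", "Caching"]},
        ]
    }
}

def iter_role_level_pairs(tech_roles):
    """Yield (title, level, core_skills) for every position/level combination."""
    for category in tech_roles.values():
        for position in category["positions"]:
            for level in position["levels"]:
                yield position["title"], level, position["core_skills"]

pairs = list(iter_role_level_pairs(TECH_ROLES))
print(len(pairs))  # 2 positions x 4 levels = 8 combinations
```

Crossing these pairs with the behavior patterns below yields the full generation job list.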
Step 2: Realistic Candidate Behaviors
The secret sauce was simulating 13 different candidate behavior patterns:
candidate_behaviors = [
    "Provide realistic responses based on skill level",
    "Occasionally ask for clarification to maintain realism",
    "Candidate doesn't know the answer - interviewer moves to next question",
    "Candidate gives wrong answers - interviewer doesn't correct",
    "Candidate gives very long answers to every question (4-5 lines)",
    "Candidate asks for clarification at every question multiple times",
    "Candidate tries to get answers from interviewer - stop immediately",
    "Candidate deviates from interview - pull them back",
    "Candidate asks to reschedule - stop the interview",
    # ... plus 4 more patterns (13 total)
]
Step 3: Structured Interview Generation
def generate_interview(role, skills, level, candidate_name, behavior_pattern):
    prompt = f'''
    Generate a realistic interview between Toshi (interviewer) and {candidate_name}.
    Role: {role} ({level})
    Skills: {skills}
    Behavior: {behavior_pattern}
    Structure:
    - Initial Introduction (1 exchange)
    - 3 Technical Questions
    - 2 Problem Solving Questions
    - 2 Coding Questions (for engineering roles)
    - 1 Behavioral Question
    - Professional Closing
    Format: ###Recruiter: and ###Candidate:
    '''
    return prompt  # sent to GPT-4's chat completions API
Real Examples from the Dataset
Here's what GPT-4 generated for a "hesitant candidate" scenario. This is exactly how real nervous candidates behave: asking for clarification, seeking validation, and breaking down complex questions.
###Recruiter: Hello! I'm Toshi. Can you explain the difference between
let, const, and var in JavaScript?
###Candidate: Hi Toshi, um — could you clarify what you mean?
Are you asking about scope, or how they behave differently?
###Recruiter: I'm asking about scope behavior, hoisting differences,
and when you would use each one.
###Candidate: Okay so var has function scope, let and const have block
scope. Const can't be reassigned. Is that the level of detail you need?
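Because every generated conversation follows the ###Recruiter: / ###Candidate: format, turns can be parsed back out deterministically when preparing training examples. A minimal parser sketch (my own helper, not from the original pipeline):

```python
import re

def parse_transcript(text):
    """Split a ###Recruiter:/###Candidate: transcript into (role, utterance) turns."""
    # Capture each role marker and everything up to the next marker or end of text
    pattern = re.compile(
        r"###(Recruiter|Candidate):\s*(.*?)(?=###(?:Recruiter|Candidate):|\Z)",
        re.S,
    )
    return [(role, " ".join(utterance.split()))  # normalize internal whitespace
            for role, utterance in pattern.findall(text)]

sample = """###Recruiter: Hello! I'm Toshi. Can you explain let vs var?
###Candidate: Hi Toshi, um, could you clarify what you mean?"""

turns = parse_transcript(sample)
print(turns[0][0], "->", turns[0][1])
```

A structured turn list like this is what gets mapped onto the chat template during fine-tuning.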
Final Dataset Results
The final dataset comprised 15,000 conversations spanning 25+ job roles, 4 experience levels, and 13 candidate behavior patterns.
Engine 2: LLaMA 3.1 Fine-Tuning for Interview Excellence
Why Fine-Tuning Was Essential
Generic language models, even powerful ones like LLaMA 3.1, don't understand the nuanced dynamics of professional interviews. They need to learn when to probe deeper vs. when to move on, how to handle difficult candidates professionally, and how to ask relevant follow-up questions based on candidate responses.
The Fine-Tuning Setup
training_config = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "dataset_size": "15k_conversations",
    "learning_rate": 2e-5,
    "batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "max_seq_length": 2048,
    # LoRA for efficient fine-tuning
    "lora_config": {
        "r": 16,
        "lora_alpha": 32,
        "target_modules": [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ]
    }
}
Why Completion-Only Training Was Essential
DataCollatorForCompletionOnlyLM was crucial for training quality through loss masking. Without it, the model learns to predict everything — including candidate responses — creating a confused model that doesn't know its role.
# Loss masking in action:
"<|start_header_id|>interviewer<|end_header_id|>" # ← MASKED (loss = 0)
"Let's start with a technical question..." # ← TRAINED (loss calculated)
"<|eot_id|>" # ← MASKED (loss = 0)
# What gets completely masked:
"<|start_header_id|>candidate<|end_header_id|>" # ← MASKED (loss = 0)
"I think the answer is..." # ← MASKED (loss = 0)
- Role Clarity: Model learns it's the interviewer, not the candidate
- Efficient Training: Loss masking focuses compute on relevant tokens only
- Better Responses: Focuses learning on professional interview conduct
- Faster Convergence: More efficient training with clearer objectives
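Conceptually, the collator builds a labels sequence in which masked positions are set to -100, the ignore index in PyTorch cross-entropy, so they contribute zero loss. A simplified pure-Python sketch of that idea (real token IDs replaced with words for readability; the marker strings are illustrative, not the actual Llama special tokens):

```python
IGNORE_INDEX = -100  # positions with this label contribute zero loss

def mask_non_interviewer(tokens, response_marker="interviewer:"):
    """Label only tokens that follow the interviewer marker; mask everything else.
    Training turns off again at each candidate marker."""
    labels = []
    training = False
    for tok in tokens:
        if tok == response_marker:
            training = True
            labels.append(IGNORE_INDEX)  # the marker itself stays masked
            continue
        if tok == "candidate:":
            training = False
        labels.append(tok if training else IGNORE_INDEX)
    return labels

tokens = ["interviewer:", "Let's", "start", "candidate:", "I", "think"]
print(mask_non_interviewer(tokens))
# [-100, "Let's", 'start', -100, -100, -100]
```

In the real pipeline, DataCollatorForCompletionOnlyLM does this over token IDs using the chat template's header tokens as markers.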
Engine 3: Production Deployment with 6ms Response Times
For natural conversations, latency is everything. Humans expect responses within 200–300ms total, meaning language model inference needed to be sub-10ms.
4-Bit Quantization Strategy
import torch
from transformers import BitsAndBytesConfig
from vllm import LLM

class OptimizedInterviewLLM:
    def __init__(self):
        # NF4 4-bit settings (used when loading the checkpoint via transformers)
        self.quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4"
        )
        # vLLM applies bitsandbytes 4-bit quantization internally
        self.llm = LLM(
            model="./interview-llama-3.1-fine-tuned",
            quantization="bitsandbytes",
            gpu_memory_utilization=0.85,
            max_model_len=2048
        )
Streaming Response Optimization
For even better UX, streaming responses were implemented with punctuation-based splitting — users hear responses starting in ~50ms instead of waiting for complete generation.
def split_for_speech(text_stream):
    """Split response on punctuation for natural TTS flow"""
    buffer = ""
    for token in text_stream:
        buffer += token
        if token.endswith(('.', '!', '?', ',', ';')):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()
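Feeding the splitter a simulated token stream shows the chunking behavior (the function is restated here so the example is self-contained):

```python
def split_for_speech(text_stream):
    """Split response on punctuation for natural TTS flow."""
    buffer = ""
    for token in text_stream:
        buffer += token
        if token.endswith(('.', '!', '?', ',', ';')):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

# Tokens as an LLM might stream them, punctuation attached to word pieces
tokens = ["Great", " answer,", " let's", " go", " deeper.", " Ready", "?"]
chunks = list(split_for_speech(tokens))
print(chunks)
# ['Great answer,', "let's go deeper.", 'Ready?']
```

Each yielded chunk can be handed to the TTS engine immediately, which is what lets speech start while the model is still generating.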
Addressing the Elephant: "But What About Unlimited Candidate Behaviors?"
A question I get frequently: "You hardcoded 13 behavior patterns — doesn't this mean your AI can only handle these specific cases?"
Short answer: No. The 13 patterns were training methodology, not production limitations.
Think of it like training a doctor on different patient types — they don't become limited to only treating those specific cases. The fine-tuned model learned universal interviewing principles that apply to any candidate behavior, including completely novel combinations.
The training patterns taught interviewing principles, not scripts — adaptive communication based on real-time candidate needs, contextual awareness of conversation flow, emotional intelligence for diverse personality types, and when to probe deeper vs. when to move forward.
Lessons Learned & Key Insights
- Synthetic Data Quality Matters More Than Quantity — 15,000 high-quality diverse conversations outperformed 100,000 generic ones.
- LoRA Makes Fine-Tuning Accessible — Training only 1.04% of parameters dramatically reduced costs while maintaining quality.
- Streaming + Punctuation Splitting = Magic — Real-time conversation is about perceived responsiveness, not just raw speed.
- AWS G5.xlarge is the Sweet Spot — Perfect balance of GPU power (A10G), cost efficiency, and memory capacity.
- Interview Context is Everything — Generic LLMs can't conduct good interviews without domain-specific fine-tuning.