When Nobody's Listening, Everyone's Overpromising
The Issue
Here is a universal truth about sales teams: the moment you stop listening to their calls, they start freelancing the pitch.
I was working at an edtech startup - the kind that sells upskilling and career-transition courses to working professionals. The sales motion was high-volume outbound: reps called prospective learners, walked them through course curricula, handled objections about pricing and time commitment, and tried to close enrollments. Think hundreds of calls per day across a growing team.
The problem was visibility. Leadership had none. Were reps following the pitch framework? Were they accurately describing placement support, or quietly guaranteeing six-figure salaries? Were they being dismissive to prospects who pushed back on pricing? Nobody knew. The only feedback loop was conversion numbers - a lagging indicator that tells you what happened but never why. A rep could close well and still be overpromising outcomes that would blow up as refund requests three months later.
The irony was rich: a company selling education had no way to learn from its own conversations.
The Goal
Build an end-to-end pipeline that automatically: (1) transcribes every sales call, (2) scores call quality using AI against a weighted rubric covering behavioral skills, pitch coverage, and negative indicators, (3) stores structured summaries linked to call records, and (4) routes insights to the right team leads via Slack - all with zero manual review.
The Solution
The sales operation was pure outbound: reps dialing working professionals who had expressed interest in career-transition programs and closing enrollments over the phone. On a busy day, the team pushed 300+ calls across dozens of reps. The quality gap between a top closer and someone winging the pitch could mean the difference between a satisfied learner and a refund request six weeks later. Here is how the pipeline caught that gap in real time.
Step 1: Capture and Transcribe. When a call recording lands in S3, a BullMQ job (createCallTranscription) fires. The worker downloads the audio, pipes it through ffmpeg at 1.5x speed to cut API costs and processing time, then sends it to OpenAI's transcription endpoint. The result comes back as WebVTT with speaker diarization. We store both the speed-adjusted original and a corrected VTT where timestamps are multiplied back by the speed factor so they align with the actual recording.
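The speed-up step can be sketched roughly as follows. This is a minimal illustration, not the production worker: the function name, file paths, and default speed are assumptions, but ffmpeg's atempo filter is the standard way to change tempo without shifting pitch (it supports factors between 0.5 and 2.0, so 1.5 needs no chaining).

```typescript
// Sketch: build the ffmpeg argument list for speeding up call audio
// before transcription. Function name and paths are illustrative.
function buildSpeedUpArgs(
  inputPath: string,
  outputPath: string,
  speed = 1.5,
): string[] {
  return [
    "-i", inputPath,                 // source recording downloaded from S3
    "-filter:a", `atempo=${speed}`,  // speed up audio without shifting pitch
    "-vn",                           // drop any video stream, audio only
    "-y",                            // overwrite output if it exists
    outputPath,
  ];
}

// A worker would then run something like:
//   spawn("ffmpeg", buildSpeedUpArgs(localPath, speedyPath));
```

Keeping the argument construction in a pure function makes the speed factor a single source of truth, which matters later when the same factor is used to correct the VTT timestamps.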
Step 2: AI-Powered Scoring. A second BullMQ job (createCallSummary) picks up the transcription and runs it through a multi-pass GPT evaluation. This was not a single-prompt affair. We designed separate prompts for:
- Behavioral scoring (6 weighted categories: greeting/tone at 10%, needs assessment at 20%, problem-solving at 25%, communication style at 15%, engagement at 10%, closure at 20%)
- Pitch coverage (did the rep cover community benefits, course details, career support - each with sub-question weights)
- Negative indicators (overpromising placements, guaranteeing salary hikes, rude remarks, pushy enrollment pressure)
- Overall summary, feedback, and promises made
These categories were not arbitrary. In edtech course sales, the biggest risk is not a lost deal -- it is a closed deal built on false expectations. A rep who skips needs assessment might enroll a working parent into a full-time bootcamp they cannot attend. A rep who scores low on problem-solving is probably steamrolling objections instead of addressing them. And negative indicators like salary guarantees are not just bad selling -- in the Indian edtech space, they are the kind of claims that trigger consumer complaints and regulatory scrutiny.
Each sub-question produces a score. Weighted aggregation rolls up into section scores, then into an overall score. The prompt engineering required structured JSON output with evidence quotes from the transcript - not just "the agent overpromised" but the exact line they said it.
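The roll-up described above can be sketched as two small pure functions. The section weights come straight from the rubric in this article; the score scale and data shapes are illustrative assumptions, not the production schema.

```typescript
// A scored sub-question: score on some fixed scale, weight as a fraction.
interface SubScore { score: number; weight: number }

// Roll weighted sub-question scores up into a single section score.
function sectionScore(subs: SubScore[]): number {
  const totalWeight = subs.reduce((sum, q) => sum + q.weight, 0);
  const weighted = subs.reduce((sum, q) => sum + q.score * q.weight, 0);
  return weighted / totalWeight;
}

// Rubric weights from the article: greeting 10%, needs assessment 20%,
// problem-solving 25%, communication 15%, engagement 10%, closure 20%.
const SECTION_WEIGHTS: Record<string, number> = {
  greeting: 0.10,
  needsAssessment: 0.20,
  problemSolving: 0.25,
  communication: 0.15,
  engagement: 0.10,
  closure: 0.20,
};

// Roll section scores into the overall call score.
function overallScore(sections: Record<string, number>): number {
  let total = 0;
  for (const [name, weight] of Object.entries(SECTION_WEIGHTS)) {
    total += (sections[name] ?? 0) * weight;
  }
  return total;
}
```

Doing the arithmetic in code rather than asking GPT to compute the weighted totals keeps the model responsible only for per-question judgments, which is the part it is actually good at.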
Step 3: Storage. Summaries are persisted both as JSON in an adminUserCallSummary table linked to the call log and as text files uploaded to S3. This gives the admin dashboard structured data to query and raw files for audit trails.
Step 4: Slack Distribution. Each sales rep gets a private Slack channel named call-summary-{agent-name}-{uid}. Channel membership is derived from the org hierarchy - the rep plus every ancestor in their reporting chain. When a summary is generated, it is posted to the rep's channel. If overpromises or red flags are detected, a separate generic call evaluation alert fires.
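The channel-naming scheme above can be sketched as a small slugging helper. The naming pattern follows the article; the exact sanitization rules here are assumptions, shaped by Slack's constraint that channel names be lowercase, at most 80 characters, and limited to letters, numbers, and hyphens.

```typescript
// Sketch: derive the per-rep Slack channel name from the pattern
// call-summary-{agent-name}-{uid}. Slugging rules are illustrative.
function callSummaryChannel(agentName: string, uid: string): string {
  const slug = agentName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")   // collapse spaces/punctuation to hyphens
    .replace(/^-+|-+$/g, "");      // trim leading/trailing hyphens
  return `call-summary-${slug}-${uid}`.slice(0, 80);
}
```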
For a company selling high-ticket upskilling courses -- where a single enrollment runs into lakhs -- every misrepresented promise carried real financial risk. One rep overpromising placement rates across a dozen calls could snowball into a cohort of angry learners and a wave of refund requests. The Slack routing meant that the moment a rep veered off-script, their team lead knew about it before the next dial.
Architecture
At a glance: a call recording lands in S3, a BullMQ job transcribes it through OpenAI, a second job scores the transcript with the multi-pass GPT prompts, results persist to the database and S3, and summaries fan out to Slack channels derived from the org hierarchy.
Complexities Faced
Async transcription at scale. Calls take minutes to transcribe. You cannot block the request that logged the call. BullMQ gave us fire-and-forget semantics with retry logic, but coordinating the two-stage pipeline (transcribe first, then summarize) meant the summary worker had to gracefully handle missing transcriptions.
Prompt engineering for consistent structured output. Getting GPT to return valid JSON with weighted scores and not hallucinate evidence was the hardest part. Early prompts produced inconsistent keys, invented transcript quotes, or scored things it was told to ignore. The fix was extremely detailed example outputs in the prompt itself and using temperature: 0. We iterated the prompts across at least three PRs before they stabilized.
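Beyond prompt discipline, a defensive parse on the response side catches what slips through. The sketch below shows two checks that pair naturally with evidence-quote prompting: required keys exist with the right types, and every evidence quote actually appears in the transcript, which rejects hallucinated citations outright. The response shape is illustrative, not the production schema.

```typescript
// Illustrative shape for one scored rubric item returned by the model.
interface ScoredItem { question: string; score: number; evidence: string }

// Defensively parse a model response: throw on invalid JSON, missing
// or mistyped keys, or evidence quotes not present in the transcript.
function parseScoringResponse(raw: string, transcript: string): ScoredItem[] {
  const data = JSON.parse(raw);   // throws on invalid JSON
  if (!Array.isArray(data)) throw new Error("expected an array of items");
  return data.map((item, i) => {
    if (typeof item.question !== "string" ||
        typeof item.score !== "number" ||
        typeof item.evidence !== "string") {
      throw new Error(`item ${i}: missing or mistyped keys`);
    }
    if (item.evidence && !transcript.includes(item.evidence)) {
      throw new Error(`item ${i}: evidence quote not found in transcript`);
    }
    return item as ScoredItem;
  });
}
```

A failed parse can simply requeue the job; with temperature 0 and detailed exemplars in the prompt, a retry usually converges.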
Speed-adjusted VTT timestamps. Transcribing at 1.5x speed saves ~33% on API costs, but the resulting timestamps are wrong for playback against the original recording. We wrote an adjustWebvttTimestamps utility that parses every timestamp line and multiplies by the speed factor. Sounds simple until you are debugging off-by-one millisecond rounding errors in hour-long calls.
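The core of that utility looks roughly like this. A minimal sketch, assuming standard WebVTT cue timings of the form `HH:MM:SS.mmm`: each timestamp is converted to milliseconds, rescaled by the speed factor (events in 1.5x audio occur at 1/1.5 of their real time, so multiplying maps them back), and reformatted.

```typescript
// Matches WebVTT timestamps like "00:01:23.456" anywhere in the file.
const CUE = /(\d{2}):(\d{2}):(\d{2})\.(\d{3})/g;

// Rescale every cue timestamp by the playback speed so cues align
// with the original (un-sped-up) recording.
function adjustWebvttTimestamps(vtt: string, speed: number): string {
  return vtt.replace(CUE, (_, h, m, s, ms) => {
    const totalMs = ((+h * 60 + +m) * 60 + +s) * 1000 + +ms;
    const scaled = Math.round(totalMs * speed);   // 1.5x time -> real time
    const hh = Math.floor(scaled / 3_600_000);
    const mm = Math.floor((scaled % 3_600_000) / 60_000);
    const ss = Math.floor((scaled % 60_000) / 1000);
    const mmm = scaled % 1000;
    const pad = (n: number, width = 2) => String(n).padStart(width, "0");
    return `${pad(hh)}:${pad(mm)}:${pad(ss)}.${pad(mmm, 3)}`;
  });
}
```

Rounding once per timestamp (rather than carrying fractional milliseconds across fields) is what keeps the off-by-one errors from compounding over an hour-long call.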
Org-hierarchy-aware Slack routing. Slack channels needed to include not just the rep but every manager up the chain. We used an adminUserClosure table (a closure table pattern for tree hierarchies) to resolve ancestors, then synced channel membership on every summary post.
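The ancestor resolution is what makes the closure table pattern attractive: because the table stores one row per (ancestor, descendant) pair, "everyone above this rep" is a single filter rather than a recursive walk. A sketch under assumed row shapes (the real adminUserClosure columns may differ):

```typescript
// One row per (ancestor, descendant) pair; depth 0 is the self-link.
interface ClosureRow { ancestorId: number; descendantId: number; depth: number }

// Everyone in the rep's reporting chain (excluding the rep), nearest first.
function reportingChain(rows: ClosureRow[], repId: number): number[] {
  return rows
    .filter(r => r.descendantId === repId && r.depth > 0)
    .sort((a, b) => a.depth - b.depth)
    .map(r => r.ancestorId);
}

// Slack channel membership: the rep plus every ancestor up the chain.
function channelMembers(rows: ClosureRow[], repId: number): number[] {
  return [repId, ...reportingChain(rows, repId)];
}
```

In production this filter is a single indexed WHERE clause on the closure table; the in-memory version above just makes the shape of the query visible.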
92 files changed. The initial PR touched models, migrations, BullMQ workers, GPT service, Slack service, cron jobs, route handlers, and utility functions. Coordinating schema changes with code that depends on them, across that many files, was an exercise in careful ordering and not breaking production mid-deploy.
What I Learned
Prompt engineering is software engineering. Treat prompts like code: version them, test them against real transcripts, and expect to iterate. The weighted scoring prompts went through three major revisions before producing reliable output.
BullMQ is the right abstraction for AI pipelines. Named jobs in a single queue with a switch-case worker pattern kept things manageable without over-engineering into separate microservices.
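The pattern is small enough to show whole. A sketch, not the production worker: the job names follow the article, everything else is illustrative, and the stub handlers stand in for the real transcription and scoring logic (a real BullMQ processor would be async).

```typescript
// Minimal shape of a queued job for dispatch purposes.
type JobLike = { name: string; data: unknown };

// One queue, named jobs, one switch: the whole pipeline's dispatcher.
function handleJob(job: JobLike): string {
  switch (job.name) {
    case "createCallTranscription":
      return "transcribed";   // download audio, run ffmpeg, call the API
    case "createCallSummary":
      return "summarized";    // run the multi-pass GPT evaluation
    default:
      throw new Error(`unknown job: ${job.name}`);
  }
}

// With real BullMQ, this dispatcher becomes the worker's processor:
//   new Worker("calls", async (job) => handleJob(job), { connection });
```

Two named jobs in one queue share retry policy, concurrency limits, and monitoring, and the default branch surfaces misrouted jobs instead of silently dropping them.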
Ship the transcription first, score later. We rolled out transcription-only for two weeks before enabling AI scoring. This let us validate audio quality, catch edge cases (silent calls, voicemails), and build confidence before layering on GPT costs.
Speed-adjusted transcription is a legitimate cost optimization. Processing audio at 1.5x speed with a timestamp correction pass saved real money at scale, but only works when you own the VTT parsing layer end-to-end.