- Video: Google Veo 3.1 fast
- Voice: Synthesized by Veo (generic AI voice)
- Consent: None required, no real voice cloned
- Length: ~70 seconds (Veo's pace, naturalistic pauses)
Over the past 24 hours we built a working prototype that turns one of Dr. Lam's existing photos and a sample of his voice into a fully AI-generated short-form FAQ video. Below are the two finished V1 videos for review, followed by a side-by-side comparison of the tools we'd use to scale this to production, the costs involved, the team skills required, and the decisions we need from leadership to move forward.
Both videos use the same script (four common patient FAQs about hair transplants), the same source photo of Dr. Lam, and the same 9:16 vertical format for Reels / Shorts / TikTok. The only thing that changes between them is the voice source. Watch both end-to-end and tell us which one sits closer to your publishable bar.
V1 was the proof of concept. To run this weekly at ~100 videos per month, four numbers shift dramatically.
Nominal subscription cost across the three production bundles. ~50s of finished output per video.
Nominal cost only — does not include legal review, governance, content moderation, or operational overhead. See "fully loaded cost" further below.
Each tool scored on the five dimensions that drive our outcome. Five dots = leader on that dimension; fewer = trade-off.
Photo + audio in, finished talking-head out. One API call per video.
Train a persistent doctor "character" once; reuse across many future videos with cinematic motion controls.
General image-to-video. Built V1 cleanly but identity drift and cost rule it out as the primary V2 tool.
V1 required 6 stages because Veo and the voice clone are separate systems we had to glue together. The recommended V2 path collapses to 3 stages — Hedra takes a photo and an audio file and returns a finished video.
Operator overhead drops from ~10 minutes per video (monitor 11 Veo polls, retry failures, trim, stitch) to ~2 minutes (one API call, one preview, one upload).
The prototype proves the concept. Scaling to ~100 videos per month requires picking production-grade tools at each of three layers. We evaluated 12 video-generation services, 11 talking-head specialists, and 13 voice-cloning tools. The clear winners for our exact use case (real doctor, single photo, vertical short-form, ~60 seconds):
| Layer | Recommended tool | Why | Cost |
|---|---|---|---|
| Talking-head specialist | Hedra Character-3 | Photo + audio file → 60s vertical talking-head in a single API call. Best lip-sync; designed for exactly this use case. | $0.94–$2.25 per finished minute |
| Voice clone | ElevenLabs IVC | Industry leader for fidelity from a 90s–3min sample. Built for commercial use, consent-first. | $22 / month at our volume |
| Video gen (creative variety) | HiggsField Soul ID + Lipsync | Trains a persistent "character" of Dr. Lam once, reuses across all future videos. Cinematic motion controls. | ~$5–$9 per video |
| Video gen (value) | Kling 2.x Pro (via fal.ai) | Strong general video model, native audio support, very cheap. Best alternative if Hedra falls short on a specific shoot. | ~$0.17 / sec ≈ $8 / video |
| Video gen (value) | Seedance 2.0 (ByteDance, via fal.ai) | Excellent prompt adherence and motion quality; popular for short-form. Sits alongside Kling as the cheap-and-cheerful tier. | ~$0.18–$0.30 / sec |
| Video gen (current baseline) | Google Veo 3.1 fast | What we used for V1. Works for one-off clips; problems begin at production scale. See below. | ~$15–$35 / video at our scene structure |
So Veo stays in the kit as a fallback for non-likeness shots, but Hedra is the right primary tool for the talking-head workload that drives most of the volume.
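The per-video figures in the table follow from the per-second and per-minute rates once a video length is fixed. A minimal sketch of that arithmetic, assuming the ~50 seconds of finished output used elsewhere in this document; the function names are illustrative, not part of any vendor SDK:

```python
# Assumed length of one finished video, in seconds (the document's "~50s" figure).
VIDEO_SECONDS = 50


def per_video_from_second_rate(rate_per_second: float, seconds: int = VIDEO_SECONDS) -> float:
    """Cost of one video when the vendor bills per second of output."""
    return round(rate_per_second * seconds, 2)


def per_video_from_minute_rate(rate_per_minute: float, seconds: int = VIDEO_SECONDS) -> float:
    """Cost of one video when the vendor bills per finished minute."""
    return round(rate_per_minute * seconds / 60, 2)


# Rates copied from the table above.
kling = per_video_from_second_rate(0.17)
seedance = (per_video_from_second_rate(0.18), per_video_from_second_rate(0.30))
hedra = (per_video_from_minute_rate(0.94), per_video_from_minute_rate(2.25))

print(f"Kling:    ${kling:.2f} per video")
print(f"Seedance: ${seedance[0]:.2f} to ${seedance[1]:.2f} per video")
print(f"Hedra:    ${hedra[0]:.2f} to ${hedra[1]:.2f} per video")
```

At 50 seconds, Kling comes to about $8.50 per video, in line with the table's "≈ $8", and Hedra stays under $2 per video across its quoted range.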
Before approving any tool, watch a few real outputs from its own channel. The links below open each vendor's homepage and their public demo reels.
| Vendor | What it does | Homepage | Demos |
|---|---|---|---|
| Hedra | Photo + audio → talking-head video (our top pick) | hedra.com | Showcase · YouTube |
| ElevenLabs | Voice cloning + TTS (our top pick for the voice layer) | elevenlabs.io | Voice samples · YouTube channel |
| HiggsField | Soul ID character training + Lipsync module + DoP cinematic motion | higgsfield.ai | Showcase · YouTube |
| Kling AI | General image-to-video + native audio. Strong character consistency. | kling.ai | YouTube demos · fal.ai page |
| Seedance / Seedream | ByteDance's video model. Often paired with Seedream image gen. | seed.bytedance.com | fal.ai page · YouTube demos |
| Google Veo 3.1 | Our V1 baseline. General image-to-video with native audio. | ai.google.dev | DeepMind page · DeepMind YouTube |
| Runway Gen-4 | General video model. Strong creative tooling, no native audio. | runwayml.com | Research demos · YouTube channel |
| Luma Ray 3 | Image-to-video. Cinematic looks but face drift on portraits. | lumalabs.ai | YouTube channel |
| HeyGen | Enterprise avatar / talking-head, more expensive than Hedra. | heygen.com | Templates · YouTube |
| D-ID | Mature talking-head API; ~3× our recommended cost. | d-id.com | YouTube channel |
Some YouTube links point to a search query rather than a fixed channel ID, because vendor channel names change and a search query keeps resolving to current demos.
Each bundle is end-to-end (video + voice) and ready to ship. Pick one based on priorities.
- Bundle 1: Hedra Character-3 for the video, ElevenLabs for the voice.
- Bundle 2: HiggsField (Soul ID + Speak + cinematic shots) + ElevenLabs.
- Bundle 3 (pilot): Build a pluggable layer; run the same script through Hedra, HiggsField, and Veo. Decide on data.
Suggested approach: Start in Bundle 3 (pilot) and collapse to Bundle 1 (Hedra + ElevenLabs) once data confirms it. Total elapsed time to production: 2–3 weeks, gated almost entirely on the consent paperwork — not engineering.
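One way to picture the pluggable layer is a single render function that tries providers in order and falls back when one fails. The sketch below is an assumption about shape only: `RenderRequest`, `render_with_failover`, and the two stand-in providers are hypothetical names, and no real vendor API is called.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RenderRequest:
    """Inputs every provider receives (hypothetical shape)."""
    photo_path: str
    audio_path: str
    aspect_ratio: str = "9:16"


# A provider is any callable that takes a request and returns an output path or URL.
Provider = Callable[[RenderRequest], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def render_with_failover(
    request: RenderRequest, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, output)."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(request)
        except Exception as exc:  # any provider failure triggers the next one
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for illustration; real adapters would wrap each vendor's API.
def fake_hedra(request: RenderRequest) -> str:
    raise TimeoutError("simulated outage")


def fake_higgsfield(request: RenderRequest) -> str:
    return f"out/higgsfield_{request.aspect_ratio.replace(':', 'x')}.mp4"


used, output = render_with_failover(
    RenderRequest(photo_path="lam.jpg", audio_path="faq.wav"),
    [("hedra", fake_hedra), ("higgsfield", fake_higgsfield)],
)
print(used, output)
```

Because each vendor sits behind the same callable signature, the pilot can run one script through every provider, and production can reorder or drop providers without touching the calling code.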
Below: tools mapped against the skills they require. Green dots = required skill. Our existing in-house stack already covers every recommended tool — no outside hires needed. The only "skill wall" path is full open-source, which trades $0 in licensing for ~20–40 hours of setup and ongoing maintenance.
Skills assessed per tool: Python, REST APIs, prompt design, video utilities (ffmpeg), ML setup (PyTorch / HF), cloud admin.
| Tool | In-house ready? |
|---|---|
| Hedra Character-3 | Yes |
| ElevenLabs | Yes |
| HiggsField (Soul ID + Speak) | Yes |
| Veo 3.1 (current) | Yes |
| HeyGen / D-ID (talking-head SaaS) | Yes |
| Open-source models (LivePortrait, MuseTalk, F5-TTS, etc.) | Maintainable but slow |
| Synthesia / Tavus / Argil | Disqualified (consent video required) |
Bottom line on staffing. Bundle 1 or Bundle 3 = roughly one focused engineering day to set up, ~2 minutes of operator time per finished video. No new hires.
Assumes ~50 seconds of finished output per video.
| Path | Monthly fixed | Per video | Monthly total | Annual |
|---|---|---|---|---|
| Hedra + ElevenLabs (Bundle 1) | $82 | ~$1.10 overage | ~$115 | ~$1,380 |
| Hedra + ElevenLabs + Veo fallback (Bundle 3 final) | ~$135 | ~$1.10 | ~$165 | ~$1,980 |
| HiggsField + ElevenLabs (Bundle 2) | $22 | ~$6.00 | ~$620 | ~$7,460 |
| Current V1 baseline (Veo + F5-TTS) | $0 | ~$8–$20 | ~$750–$2,000 | ~$9k–$24k |
| Full open-source path | $0 | $0 | $0 | $0 (but ~20h/mo of engineering time) |
Subscription cost is the smallest line. An honest fully loaded total also includes legal review, governance overhead, social-platform management, analytics tooling, content moderation, brand monitoring, insurance, drift testing, and vendor-continuity reserves.
| Cost layer | Monthly estimate | Notes |
|---|---|---|
| Nominal subscriptions (Bundle 1) | ~$115 | Hedra Pro $60 + ElevenLabs Creator $22 + overage |
| Outside legal counsel (ongoing) | $500–$1,500 | Quarterly disclosure review, Texas medical-board posture, FTC monitoring |
| Dr. Lam's script-review time | $400–$800 | ~2 hrs/week at surgeon hourly rate, per-batch script approval |
| Social platform management | $500–$1,500 | Scheduling, captions per platform, community + comment moderation |
| Analytics + brand monitoring | $200–$500 | Conversion tracking, sentiment monitoring, link-attribution tooling |
| Insurance review / uplift | $100–$400 | Confirm malpractice and general liability cover AI marketing |
| Quarterly drift + regression testing | $100–$300 | Engineering time to verify face/voice fidelity hasn't drifted with vendor updates |
| Engineering carry (refactor, vendor swaps, incidents) | $500–$1,000 | ~5–10 hrs/mo at internal cost |
| Total fully loaded | ~$2.4k–$6k / month | ~$29k–$72k / year |
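The total row can be checked by summing the low and high ends of each cost layer. A short sketch using only the figures from the table above; the dictionary name and layout are illustrative:

```python
# (low, high) monthly estimates in USD, copied from the table above.
COST_LAYERS = {
    "Nominal subscriptions (Bundle 1)": (115, 115),
    "Outside legal counsel (ongoing)": (500, 1500),
    "Dr. Lam's script-review time": (400, 800),
    "Social platform management": (500, 1500),
    "Analytics + brand monitoring": (200, 500),
    "Insurance review / uplift": (100, 400),
    "Quarterly drift + regression testing": (100, 300),
    "Engineering carry": (500, 1000),
}

monthly_low = sum(low for low, _ in COST_LAYERS.values())
monthly_high = sum(high for _, high in COST_LAYERS.values())
annual_low = monthly_low * 12
annual_high = monthly_high * 12

print(f"Monthly: ${monthly_low:,} to ${monthly_high:,}")
print(f"Annual:  ${annual_low:,} to ${annual_high:,}")
```

The layers sum to $2,415 and $6,115 per month, which matches the rounded ~$2.4k–$6k. Annualizing the unrounded figures gives roughly $29.0k and $73.4k; the table's ~$72k upper bound corresponds to annualizing the rounded $6k.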
Honest comparison vs. the do-nothing baseline. Today HairTX produces talking-head video the traditional way (planned shoots, manual editing). The AI pipeline isn't competing with $0 — it's competing with that. A traditional surgeon-led short-form pipeline runs ~$4k–$10k/mo all-in (videographer, editor, scheduling, multiple cuts per shoot day). The AI pipeline's real edge is throughput-at-fixed-cost, not raw price.
The cost argument alone doesn't justify a new program. These are the metrics the team will report on at the 90-day decision point, and the kill criteria if the program fails to land.
| Metric | Definition | Pilot target | Kill threshold |
|---|---|---|---|
| Cost per qualified consult | Loaded program cost ÷ consults attributed to the videos | ≤ existing channel CAC | > 2× existing CAC |
| View → consult conversion | Click-throughs from video CTA that book a consult | ≥ 0.3% per video at 30 days | < 0.05% sustained over 6 weeks |
| Engagement vs. existing posts | Median engagement rate across V2 videos vs. last 90 days of organic | Within 30% of organic median | < 50% of organic median (algorithm suppression signal) |
| Time-to-publish per video | Script draft → published, batch median | < 3 business days | > 7 business days sustained |
| Identity / voice fidelity rubric pass-rate | Internal QA against approval rubric per batch | ≥ 95% pass at first review | < 80% (drift / regression) |
| Patient or staff complaints attributable to AI content | Comments, DMs, calls flagging the AI nature | 0 critical incidents | Any incident requiring video takedown |
| Compliance audit | FTC / Texas medical-board / platform disclosure rules met on every video | 100% pass | Any video out of compliance |
Decision at 90 days: expand, hold, or kill — based on which targets cleared, not on subjective preference. Pilot data goes to leadership in a structured memo.
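The targets and kill thresholds can be encoded so the 90-day call is mechanical. The sketch below makes several assumptions that leadership would need to confirm: any breached kill threshold means "kill", all targets met means "expand", anything else means "hold"; "within 30% of organic median" is read as at least 70% of it; and complaints are simplified to a count of takedown incidents. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    name: str
    meets_target: Callable[[float], bool]
    hits_kill: Callable[[float], bool]


GATES = [
    # Program cost per consult divided by existing channel CAC.
    Gate("cac_ratio", lambda r: r <= 1.0, lambda r: r > 2.0),
    Gate("view_to_consult_pct", lambda p: p >= 0.3, lambda p: p < 0.05),
    # Median engagement as a percentage of the organic median (assumed reading).
    Gate("engagement_vs_organic_pct", lambda p: p >= 70, lambda p: p < 50),
    Gate("time_to_publish_days", lambda d: d < 3, lambda d: d > 7),
    Gate("fidelity_pass_rate_pct", lambda p: p >= 95, lambda p: p < 80),
    Gate("takedown_incidents", lambda n: n == 0, lambda n: n > 0),
    Gate("compliance_pass_pct", lambda p: p == 100, lambda p: p < 100),
]


def decide(metrics: dict[str, float]) -> str:
    """Return 'kill', 'expand', or 'hold' under the assumed decision rule."""
    if any(gate.hits_kill(metrics[gate.name]) for gate in GATES):
        return "kill"
    if all(gate.meets_target(metrics[gate.name]) for gate in GATES):
        return "expand"
    return "hold"


pilot = {
    "cac_ratio": 0.9,
    "view_to_consult_pct": 0.4,
    "engagement_vs_organic_pct": 85,
    "time_to_publish_days": 2,
    "fidelity_pass_rate_pct": 90,  # above the kill line, below target
    "takedown_incidents": 0,
    "compliance_pass_pct": 100,
}
print(decide(pilot))
```

Writing the thresholds down as data keeps the expand/hold/kill decision tied to the numbers rather than to whoever presents them.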
The four-team review identified governance gaps as the biggest risk to this program, ahead of any engineering risk. These are not optional polish items.
| Pre-condition | Owner | Status |
|---|---|---|
| Scoped voice + likeness release signed by Dr. Lam | Practice / NKP | Not started |
| Outside counsel review of release + disclosure plan (TX-licensed, healthcare-marketing experience) | Practice | Not started |
| Script-approval workflow with Dr. Lam in the loop (per-batch sign-off, 24-hr turnaround SLA) | NKP + Practice | Drafted, not implemented |
| Kill-switch SOP — written, tested procedure to pull any video from all platforms within 4 hours | NKP | Not started |
| On-frame "AI-generated" disclosure label spec (font, position, contrast, persistence) | NKP | Not started |
| Per-batch audit ledger (prompt, model, version, output URL, approval timestamps) | NKP | Spec drafted |
| Insurance confirmation that practice malpractice + general liability cover AI marketing content | Practice | Not started |
| Comment moderation policy and on-call rotation | Practice | Not started |
| Quarterly vendor drift QA (face fidelity, voice fidelity, prompt regression suite) | NKP | Not started |
| Documented vendor failover plan — Hedra down or price-spiked, switch to HiggsField in < 24h | NKP | Pluggable layer in design |
Caption-only "AI-assisted" disclosure is legally insufficient under FTC May 2026 endorsement guidance and Texas SB 1188 (synthetic-media disclosure for licensed-physician statements). Every video that publishes externally must pass all six layers below.
| # | Layer | Specifics |
|---|---|---|
| 1 | On-frame burn-in label | "AI-generated representation. Approved by Dr. Sam Lam." Persistent, high-contrast, lower-third position |
| 2 | In-script acknowledgement | First 3 seconds of every video: Dr. Lam's cloned voice acknowledges the AI assistance |
| 3 | Platform-native toggles | TikTok "AI-generated content" toggle on; Meta AI-info label on; YouTube synthetic-media disclosure on |
| 4 | Caption boilerplate | "AI-assisted video. Script reviewed and approved by Dr. Sam Lam. For a consultation, link in bio." |
| 5 | Prompt-level guard-rails | Script template forbids outcome claims, synthetic testimonials, before/after stills not separately verified by Dr. Lam |
| 6 | Audit ledger entry | Per-video record of prompt, model + version, render timestamp, approval signature, disclosure status — retained 7 years |
This stack is designed to satisfy current FTC endorsement guidance, Texas SB 1188, and Meta / TikTok / YouTube platform-specific AI policies as of May 2026. Counsel to review and re-confirm quarterly.
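The audit ledger entry in layer 6 maps naturally onto a small record type. The following is a sketch only: the field names, the `disclosure_checks` keys, and the JSON layout are assumptions for illustration, and the 7-year retention comes from the table above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

RETENTION_YEARS = 7  # retention period stated in the disclosure table


@dataclass(frozen=True)
class LedgerEntry:
    """One per-video audit record (hypothetical field names)."""
    video_id: str
    prompt: str
    model: str
    model_version: str
    rendered_at: datetime
    approved_by: str
    approved_at: datetime
    disclosure_checks: dict  # layer name -> passed (True/False)

    def fully_disclosed(self) -> bool:
        """True only when every recorded disclosure check passed."""
        return bool(self.disclosure_checks) and all(self.disclosure_checks.values())

    def retain_until(self) -> datetime:
        """Earliest date the record may be deleted."""
        target_year = self.rendered_at.year + RETENTION_YEARS
        try:
            return self.rendered_at.replace(year=target_year)
        except ValueError:
            # Rendered on 29 Feb and the target year is not a leap year.
            return self.rendered_at.replace(year=target_year, day=28)

    def to_json(self) -> str:
        # datetimes are not JSON-serializable, so emit them as ISO 8601 strings.
        return json.dumps(asdict(self), default=lambda v: v.isoformat(), sort_keys=True)


entry = LedgerEntry(
    video_id="faq-001",
    prompt="Four common patient FAQs",
    model="example-model",
    model_version="example-version",
    rendered_at=datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc),
    approved_by="Dr. Sam Lam",
    approved_at=datetime(2026, 5, 2, 9, 30, tzinfo=timezone.utc),
    disclosure_checks={"on_frame_label": True, "platform_toggles": True},
)
print(entry.to_json())
print(entry.retain_until().date())
```

A frozen dataclass keeps each record immutable once written, which suits an append-only ledger; where the records are stored is left open here.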
The social-media review found that 2026 platform algorithms (TikTok in particular) actively suppress AI-avatar + AI-voiceover content of the V1 shape, with documented effects on reach and shares. The right volume target depends on whether we optimize for cost per video or for actual distribution.
- Option X (100/mo, all AI): Original target. Maximum throughput, lowest cost-per-video.
- Option Y (40/mo, 70% real footage): Real surgeon footage (procedure room, before/after stills with Dr. Lam, day-in-the-life, myth-busts) drives the calendar. AI talking-head is used for the ~30% FAQ slice where it adds genuine value.
- Option Z (decide post-pilot): Run the 30-day bounded pilot (~10 videos, single platform). Use the actual engagement data to set the V2 volume target.
Decision needed at the leadership meeting: Option X, Y, or Z. The engineering pipeline supports any of the three with no rework.
Roughly $130 nominal for the first month across the three pilot accounts (Hedra Pro $60, ElevenLabs Creator $22, HiggsField trial ~$50). Loaded cost during pilot ~$3–5k including legal review. Cancel any vendor that loses the bake-off.
See the AI Disclosure section above. Required for legal posture under FTC May 2026 and Texas SB 1188. Caption-only disclosure is insufficient.
Outside counsel drafts and reviews the release. Dr. Lam signs and records 2–3 min of clean voice in a quiet room. Legal track runs in parallel to engineering.
Option X (100/mo all AI), Option Y (40/mo with 70% real footage), or Option Z (decide post-pilot). See "Open decision" section above. The engineering pipeline supports any of the three.
Cost-per-consult target, conversion target, kill criteria. See "What success looks like" section. Numbers in the table are defaults — leadership should set the actual targets.
Script-approval signoff (Dr. Lam), comment moderation, kill-switch authority, vendor relationship owner. See "Governance pre-conditions" section.
Pluggable provider layer in the existing codebase. No vendor commitments yet beyond month-1 trials.
Same FAQ script through all three tools. Side-by-side videos delivered to leadership for the data-driven call.
Production runbook: one script per procedure, batch render, hand-off to social scheduler. Weekly cadence becomes routine.
docs/SPRINT_LOG.md
| Risk | Likelihood | Severity | Mitigation |
|---|---|---|---|
| Hedra quality doesn't hold for Dr. Lam specifically | Low | Medium | Pilot 1 video before committing to a month; cancel within trial window if poor. |
| Voice clone audio fidelity below expectations | Medium | Low | ElevenLabs supports re-clone in same voice slot; re-record sample if needed. |
| Vendor concentration (Bundle 1 = 2 vendors) | Medium | Medium | Pluggable provider layer; HiggsField + Kling pre-wired as failover paths; documented vendor failover SOP. |
| Vendor acquired, shut down, or 5× price spike | Low–Medium | High | Same mitigation as the vendor-concentration row. The Bundle 3 framing keeps two providers warm at all times. |
| Model drift (face / voice fidelity shifts as vendor updates models) | Medium | Medium | Quarterly drift-QA against a frozen reference set. Roll back to last known-good config if drift detected. |
| Platform algorithmic suppression of AI content (TikTok 2026 policy) | Medium–High | Medium | Depends on volume / mix decision. Option Y (40/mo, 70% real footage) is the strategist's mitigation. Option Z defers via pilot. |
| Patient confusion or backlash at AI-generated doctor content | Medium | High | Full six-layer disclosure stack. Real signed consent. On-frame label. Active comment moderation. |
| Medical board scrutiny (TX) — AI doctor making health statements | Low–Medium | High | Outside counsel review of release + disclosure plan before launch. Script template forbids outcome / treatment claims. |
| Malpractice exposure if AI says something Dr. Lam wouldn't | Low | High | Mandatory per-batch script approval by Dr. Lam. Prompt-level guard-rails. Insurance confirmation pre-launch. |
| Regulatory shift (FTC, state medical board, platform policy) mid-rollout | Medium | Medium | Quarterly counsel review. Pluggable layer allows fast pivots. Pilot phase tests one platform before scaling. |
| Comment moderation overload — hostile or confused comments at scale | Medium | Medium | Documented moderation policy. On-call rotation. Pre-prepared response templates. |
| Hidden ops costs erode the cost advantage | Medium | Medium | Fully-loaded cost table presented to leadership up front. 90-day decision gate uses loaded TCO, not nominal. |
| Staff / physician morale (real doctors being AI-replicated) | Low–Medium | Low | Frame as augmenting Dr. Lam's reach, not replacing him. Engage clinical team in approval workflow. |