GhostLine: When Your Red Team Needs a Voice That Doesn't Sound Like a Robot

Hook

The weakest link in enterprise security isn’t the firewall—it’s the person who answers the phone. GhostLine turns voice phishing from an art form requiring improvisational talent into a repeatable, auditable science that runs on autopilot.

Context

Traditional vishing campaigns have a scaling problem. Email phishing can target thousands simultaneously with templated messages, but voice-based social engineering has always required a human operator with acting skills, cultural knowledge, and the ability to think on their feet. A single red team engagement might include dozens of phone calls to test different departments, and documenting those conversations for client reports means manual note-taking while maintaining a convincing persona. The result? Most security assessments skip voice testing entirely, leaving organizations blind to their telephonic attack surface.

GhostLine emerged from this gap—a Python framework that orchestrates AI services into a fully autonomous vishing platform. Feed it a target phone number and a playbook describing your social engineering scenario (IT helpdesk, HR verification, vendor credential harvesting), and it conducts the entire conversation using a cloned voice that sounds human, responds intelligently to objections, and logs every interaction as cryptographically hashed evidence. It’s the difference between a penetration tester manually calling 20 employees over two days versus launching 200 calls overnight with complete transcripts ready for the final report.

Technical Insight

At its core, GhostLine is an async orchestration engine that coordinates four external services (Twilio for telephony; Deepgram, OpenAI, and ElevenLabs for the AI work) into a coherent telephone conversation. The architecture deserves unpacking because it reveals thoughtful design choices that balance operational security with practical deployment constraints. When you initiate a call, a FastAPI server spins up a WebSocket endpoint and exposes it through an ngrok tunnel. This tunnel becomes the bridge between Twilio’s PSTN network and your local machine: Twilio streams raw μ-law audio packets to your ngrok URL, which forwards them to three concurrent coroutines that handle the conversation loop.
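
To make the audio leg concrete, here is a small, self-contained sketch of what the inbound leg has to deal with before transcription: unpacking a Twilio media frame (a JSON message whose media.payload field is base64-encoded μ-law audio) and decoding a μ-law byte to linear PCM per G.711. Function names are illustrative, not GhostLine’s actual API.

```python
import base64
import json

def decode_twilio_media(frame_json: str) -> bytes:
    """Extract the raw mu-law bytes from one Twilio media WebSocket frame."""
    frame = json.loads(frame_json)
    return base64.b64decode(frame["media"]["payload"])

def ulaw_to_pcm16(byte: int) -> int:
    """Decode one 8-bit G.711 mu-law sample to signed 16-bit linear PCM."""
    byte = ~byte & 0xFF                     # mu-law bytes are stored inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude
```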

The pump_in coroutine receives audio chunks from Twilio via WebSocket, accumulates them into buffers, and ships them to Deepgram’s streaming API for real-time transcription. Once Deepgram returns text, the orchestrator sends it to OpenAI’s GPT models along with the conversation history and a system prompt derived from your YAML playbook. Here’s where the 12-stage persuasion engine comes into play. The playbook defines stages such as “establish_rapport,” “create_urgency,” and “request_credentials,” each with specific instructions for the LLM. The code tracks which stage you’re in and adjusts the prompt accordingly:

# Simplified from the stage transition logic
def get_system_prompt(current_stage, playbook):
    stage_config = playbook['stages'][current_stage]
    return f"""
    You are {playbook['persona']['name']}, a {playbook['persona']['role']}.
    Current objective: {stage_config['objective']}
    Tactics: {', '.join(stage_config['tactics'])}
    
    If the target {stage_config['success_condition']}, 
    respond with [STAGE_COMPLETE] to advance.
    
    Maintain these behavioral constraints:
    - {stage_config['tone']}
    - Never break character
    - Log any credential disclosures immediately
    """

Once GPT generates a response, the pump_out coroutine sends that text to ElevenLabs for voice synthesis. ElevenLabs returns audio data that gets base64-encoded and pushed back through the WebSocket to Twilio, which plays it to the target over the phone line. The third coroutine handles silence detection—if neither party speaks for a configurable threshold, it triggers a gentle prompt or graceful call termination.
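
The concurrency shape of that loop can be sketched with plain asyncio queues standing in for the network legs; the real coroutines speak WebSockets and HTTP to Twilio, Deepgram, OpenAI, and ElevenLabs, but the producer/consumer structure is the same. All names here are illustrative, not GhostLine’s actual code.

```python
import asyncio

async def pump_in(audio_q, text_q, transcribe):
    # Inbound leg: Twilio audio chunk -> transcript (Deepgram stands in as `transcribe`).
    while (chunk := await audio_q.get()) is not None:
        await text_q.put(await transcribe(chunk))
    await text_q.put(None)  # propagate end-of-call downstream

async def pump_out(text_q, replies, respond, synthesize):
    # Outbound leg: transcript -> LLM reply -> synthesized audio back to the caller.
    while (text := await text_q.get()) is not None:
        replies.append(await synthesize(await respond(text)))

async def run_call(chunks):
    audio_q, text_q, replies = asyncio.Queue(), asyncio.Queue(), []
    for c in chunks:
        await audio_q.put(c)
    await audio_q.put(None)  # sentinel: the call has ended

    async def transcribe(chunk):   # stand-in for Deepgram streaming STT
        return f"said:{chunk}"
    async def respond(text):       # stand-in for the GPT conversation turn
        return f"reply-to({text})"
    async def synthesize(text):    # stand-in for ElevenLabs TTS
        return f"audio[{text}]"

    await asyncio.gather(
        pump_in(audio_q, text_q, transcribe),
        pump_out(text_q, replies, respond, synthesize),
    )
    return replies
```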

The YAML playbook system is where operational flexibility lives. Instead of hardcoding social engineering scripts into Python, operators define conversation flows declaratively:

playbook_name: "IT Password Reset"
persona:
  name: "David Chen"
  role: "IT Support Technician"
  voice_id: "elevenlabs_voice_id_here"

stages:
  - id: establish_rapport
    objective: "Build trust by referencing legitimate internal systems"
    tactics:
      - "Mention ServiceNow ticket number from reconnaissance"
      - "Use insider terminology (VPN, Okta, specific software)"
    success_condition: "target confirms identity or asks how to help"
    
  - id: create_urgency
    objective: "Manufacture time pressure"
    tactics:
      - "Claim account will be locked in 30 minutes"
      - "Reference 'policy compliance deadline'"
    success_condition: "target expresses willingness to comply quickly"
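
The [STAGE_COMPLETE] marker from the prompt above implies a small state machine on the operator side: scan each LLM reply for the token, advance to the next stage, and strip the token before the text reaches synthesis. A plausible sketch (the real transition logic may differ):

```python
STAGE_TOKEN = "[STAGE_COMPLETE]"

def advance_stage(stage_ids, current_id, llm_reply):
    """Return the stage id to use for the next conversation turn."""
    if STAGE_TOKEN not in llm_reply:
        return current_id
    i = stage_ids.index(current_id)
    return stage_ids[min(i + 1, len(stage_ids) - 1)]  # clamp at the final stage

def strip_control_tokens(llm_reply):
    """Remove internal markers so they are never spoken aloud."""
    return llm_reply.replace(STAGE_TOKEN, "").strip()

stages = ["establish_rapport", "create_urgency", "request_credentials"]
nxt = advance_stage(stages, "establish_rapport", "Happy to help. [STAGE_COMPLETE]")
```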

The evidence chain is where GhostLine shows its enterprise pedigree. Every transcript snippet, stage transition, and extracted credential gets SHA-256 hashed before insertion into SQLite. The schema includes timestamps, audio file paths, and success/failure flags that map cleanly to red team report templates. This isn’t just logging—it’s creating legally defensible proof that specific information was disclosed during authorized testing, complete with exact timestamps if disputes arise about what was said.
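
A minimal sketch of that evidence chain, assuming a flat SQLite schema (GhostLine’s actual schema also carries audio file paths and success/failure flags):

```python
import hashlib
import sqlite3
import time

def log_evidence(db, call_id, stage, kind, content):
    """Hash every artifact before insertion so later tampering is detectable."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    db.execute(
        "INSERT INTO evidence (call_id, stage, kind, content, sha256, ts) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (call_id, stage, kind, content, digest, time.time()),
    )
    return digest

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE evidence ("
    "call_id TEXT, stage TEXT, kind TEXT, content TEXT, sha256 TEXT, ts REAL)"
)
digest = log_evidence(db, "call-001", "request_credentials",
                      "transcript", "target read out a one-time passcode")
```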

The ngrok tunnel is both clever and controversial. On one hand, it solves the deployment problem—no need to configure firewall rules, obtain static IPs, or set up cloud infrastructure just to test a vishing scenario. Twilio needs a public webhook, and ngrok provides one instantly. On the other, it introduces a third-party proxy into your operational chain, and the free tier imposes connection limits that make large-scale campaigns impractical. The documentation suggests ngrok alternatives (PageKite, localhost.run, Tailscale Funnel), but the core architectural assumption remains: you’re running this from a laptop or on-prem machine, not containerized in AWS.

Gotcha

The cost structure will surprise you. A single 10-minute call involves approximately: $0.013/min for Twilio voice ($0.13 total), $0.0043/min for Deepgram streaming transcription ($0.043), ~$0.002 for OpenAI API calls depending on GPT-4 usage, and $0.18-0.30 for ElevenLabs voice synthesis depending on character count. That’s roughly $0.35-0.50 per call before accounting for ngrok Pro if you need custom domains or simultaneous tunnels. A 100-call campaign costs $35-50 in API fees alone—manageable for billable red team engagements, but this isn’t something you casually experiment with on weekends.
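
Those numbers are easy to sanity-check with a back-of-the-envelope estimator. The rates below are the ones quoted above; treat every one as an assumption, since all four vendors reprice regularly, and ElevenLabs is modeled here per 1,000 synthesized characters as a rough stand-in for its character-based billing.

```python
def estimate_call_cost(minutes, synth_chars,
                       twilio_per_min=0.013,     # Twilio voice, USD per minute
                       deepgram_per_min=0.0043,  # Deepgram streaming STT, per minute
                       openai_flat=0.002,        # rough per-call LLM spend
                       eleven_per_1k=0.24):      # ElevenLabs TTS, per 1k characters
    """Back-of-the-envelope USD cost for one call; all rates are assumptions."""
    return (minutes * (twilio_per_min + deepgram_per_min)
            + openai_flat
            + synth_chars / 1000 * eleven_per_1k)

cost = estimate_call_cost(minutes=10, synth_chars=1000)
```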

Latency is the other operational constraint that isn’t obvious from the README. Your audio path looks like: Target phone → Twilio edge → ngrok server → your laptop → Deepgram → OpenAI → ElevenLabs → reverse the chain. Even with optimized coroutines, expect 2-4 seconds between when the target finishes speaking and when they hear the AI response. Sophisticated targets—especially those with security awareness training—may notice the unnatural pause pattern. The project lacks jitter injection or filler words (“um,” “let me check”) that would mask processing delays, so conversations can feel stilted compared to human attackers who interject naturally.

Verdict

Use if: You’re conducting authorized security assessments with documented Rules of Engagement, have budget for API services, and need repeatable vishing campaigns with audit-quality logging for client deliverables. GhostLine excels when you’re testing 20+ targets with similar scenarios and want transcripts auto-generated for your report. It’s also ideal when your team lacks voice acting talent but has Python skills: the YAML playbook system lets you iterate on social engineering tactics without touching code.

Skip if: You’re exploring voice AI for legitimate telephony automation (customer service bots, appointment reminders); the persuasion engine and evasion-oriented architecture are purpose-built for adversarial use cases and will complicate regulatory compliance. Also skip if you need sub-second response latency or can’t justify the per-call API costs. For learning environments or one-off tests, the setup overhead (five different API accounts, ngrok configuration, voice sample recording) outweighs the benefit. Finally, avoid GhostLine if you’re in a jurisdiction where even authorized vishing requires special legal frameworks; the tool’s effectiveness makes it high-stakes, and misuse carries serious consequences.