I Built an AI-Powered Bug Bounty Automation System
Why I chose multi-agent architecture over monolithic scanners, and how evidence-gated progression keeps findings honest. Part 1 of 5.

I ran my first automated vulnerability scan three months ago. Found 47 “critical” vulnerabilities. Submitted 12 reports.
Every single one was a false positive.
That specific embarrassment—of confidently submitting garbage to a program that now knows my name—still stings when I think about it. Traditional scanners generate noise. They don’t think. They pattern match and hope something sticks.
Building AI-powered bug bounty automation requires multi-agent architecture where specialized agents handle reconnaissance, testing, validation, and reporting independently. The key innovation isn’t automation itself—it’s evidence-gated progression where findings must reach 0.85+ confidence through validated proof-of-concept execution before ever reaching human review. This prevents the false positive flood that destroys researcher reputation.
Why Did I Choose Multi-Agent Over Monolithic Scanners?
Monolithic scanners are brittle. Hit a rate limit? Everything stops. Encounter a CAPTCHA? Dead. One endpoint times out? The whole queue backs up.
I built something different—a 4-tier agent system where each agent operates independently:
Recon Agents run passive discovery in parallel. Subdomain enumeration via certificate transparency. Technology fingerprinting with httpx. JavaScript analysis for hidden endpoints. GraphQL introspection. They feed assets into a shared database but never block each other.
Testing Agents take those assets and probe for vulnerabilities. IDOR testing with multi-account replay. XSS payload injection. SQL injection patterns. SSRF with metadata service probing. Maximum 4 concurrent agents to avoid rate limiting—but each recovers independently if throttled.
Validation Agent is the gatekeeper. Every finding goes through proof-of-concept execution before advancing. This is where I learned the hard lesson: detection is not exploitation. More on this in part 2 of this series.
Reporter Agent generates platform-specific reports only for validated findings. CVSS scoring, reproducible PoC code, evidence attachments. Different formatters for HackerOne, Intigriti, Bugcrowd—covered in part 4.
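To make "never block each other" concrete, here's a minimal sketch of the recon phase. The ReconAgent interface and saveAssets callback are my assumptions, not the actual implementation:

```typescript
// Sketch: recon agents run in parallel and feed a shared asset store.
// Promise.allSettled means one failing agent never blocks the others.
// The ReconAgent interface and saveAssets callback are illustrative.
interface Asset {
  host: string;
  source: string; // e.g. 'cert-transparency', 'js-analysis'
}

interface ReconAgent {
  name: string;
  discover(target: string): Promise<Asset[]>;
}

async function runReconPhase(
  agents: ReconAgent[],
  target: string,
  saveAssets: (assets: Asset[]) => Promise<void>,
): Promise<void> {
  const results = await Promise.allSettled(
    agents.map((agent) => agent.discover(target)),
  );

  for (const result of results) {
    if (result.status === 'fulfilled') {
      await saveAssets(result.value); // feed the shared database
    }
    // Rejected agents are logged and recovered independently elsewhere.
  }
}
```

Promise.allSettled is the design choice doing the work here: a rejected recon agent gets recorded and retried later instead of stalling the whole queue.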
Without me realizing it, I was building a system that mirrors how good human researchers actually work—parallel reconnaissance, focused testing, ruthless validation, careful reporting.
[!TIP] The orchestrator (Claude Opus 4.5) coordinates all agents but doesn’t do the work itself. It distributes tasks, manages budgets, detects failures, and persists session state. Like a project manager who never touches code.
How Does Evidence-Gated Progression Actually Work?
Every finding has a confidence score from 0.0 to 1.0. But here’s what makes this different from typical severity ratings—the score changes as evidence accumulates.
A finding starts at maybe 0.3 when first detected. The Testing Agent found something that looks like reflected XSS. Could be real. Could be the payload appearing in an error message (harmless).
Then Validation runs. PoC execution in a sandboxed environment. Response diff analysis comparing baseline vs. vulnerable responses. False positive signature matching.
If the PoC executes successfully and the response actually demonstrates exploitation (not just reflection)—confidence jumps to 0.85+. Now it’s queued for human review.
If PoC fails? Confidence drops. Maybe to 0.4. Still logged for weekly batch review, but not wasting human attention.
Finding Lifecycle:
Discovered (0.3) → Validating (varies) → Reviewed (human) → Submitted/Dismissed
(confidence may increase or DECREASE at the Validating step)
I originally thought validation would only increase confidence. Well, it's more like… validation is the adversary. It's trying to disprove your finding. Surviving that adversarial process is what makes a finding credible.
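In code, the validation step adjusts confidence roughly like this. A sketch where the 0.85 and 0.4 figures come from the paragraphs above, and everything else (field names, function shape) is assumed:

```typescript
// Sketch of evidence-gated confidence adjustment. The 0.85 and 0.4 figures
// come from the article; the field names and function shape are assumptions.
interface ValidationEvidence {
  pocExecuted: boolean;        // the proof-of-concept actually ran
  exploitationShown: boolean;  // response diff shows impact, not just reflection
}

function adjustConfidence(current: number, evidence: ValidationEvidence): number {
  if (evidence.pocExecuted && evidence.exploitationShown) {
    // Validated exploitation pushes the finding into the human-review range.
    return Math.max(current, 0.85);
  }
  // Failed or inconclusive PoC: drop confidence, keep for weekly batch review.
  return Math.min(current, 0.4);
}
```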
What Is SQLite RAG and Why Does It Matter?
RAG stands for Retrieval-Augmented Generation. But forget the buzzword—here’s what it actually does:
The system remembers. Not just what vulnerabilities it found, but what worked, what failed, and why.
When I start testing a new target running Laravel, the system queries its knowledge base: “What techniques succeeded on Laravel targets before? What false positive patterns should I avoid?”
It retrieves relevant context and adjusts strategy accordingly.
The database has 13 tables but three matter most:
| Table | Purpose |
|---|---|
| knowledge_base | Semantic embeddings of past findings and techniques |
| false_positive_signatures | Known patterns that look like vulns but aren't |
| failure_patterns | Recovery strategies for different error types |
That last table connects directly to part 3 of this series—failure-driven learning is where the system actually gets smarter over time.
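To make the retrieval step concrete, here's a minimal sketch of that Laravel lookup, assuming better-sqlite3 and illustrative column names. The real knowledge_base uses semantic embeddings; I've replaced that with a plain technology filter for brevity:

```typescript
// Sketch of the retrieval step before testing a new target. The real
// knowledge_base uses semantic embeddings; here that's replaced with a plain
// technology filter. Database path and column names are assumptions.
import Database from 'better-sqlite3';

const db = new Database('bounty.db');

function getPriorKnowledge(technology: string) {
  const successfulTechniques = db
    .prepare(
      `SELECT technique, outcome FROM knowledge_base
        WHERE technology = ? AND outcome = 'success'`,
    )
    .all(technology);

  const knownFalsePositives = db
    .prepare(
      `SELECT pattern, reason FROM false_positive_signatures
        WHERE technology = ? OR technology IS NULL`,
    )
    .all(technology);

  return { successfulTechniques, knownFalsePositives };
}

// e.g. getPriorKnowledge('laravel') before the Testing Agents start probing
```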
How Does the Orchestrator Coordinate Everything?
Claude Opus 4.5 runs the show. But it’s constrained by design.
The orchestrator handles:
- Task distribution: Which agents work on which assets
- Budget management: API call limits per platform, token usage tracking
- Failure detection: When an agent hits errors, classify and recover
- Session persistence: Checkpoint every 5 minutes for crash recovery
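Stitched together, the orchestrator's outer loop looks roughly like this. A sketch under my assumptions; the task shape, callbacks, and budget fields are illustrative, not the actual code:

```typescript
// Hedged sketch of the orchestrator's outer loop. Task/agent shapes, budget
// fields, and callbacks are assumptions for illustration.
interface Task {
  agent: string; // which agent should handle it
  asset: string; // which asset to work on
}

interface OrchestratorState {
  apiCallsUsed: number;
  apiCallBudget: number;
  pendingTasks: Task[];
}

async function orchestrate(
  state: OrchestratorState,
  runTask: (task: Task) => Promise<void>,
  classifyFailure: (task: Task, err: unknown) => void,
  checkpoint: (state: OrchestratorState) => Promise<void>,
): Promise<void> {
  let lastCheckpoint = Date.now();

  while (state.pendingTasks.length > 0) {
    // Budget management: stop distributing work once the platform cap is hit.
    if (state.apiCallsUsed >= state.apiCallBudget) break;

    const task = state.pendingTasks.shift()!;
    try {
      await runTask(task);        // task distribution
      state.apiCallsUsed += 1;
    } catch (err) {
      classifyFailure(task, err); // failure detection & recovery
    }

    // Session persistence: checkpoint every 5 minutes.
    if (Date.now() - lastCheckpoint > 5 * 60 * 1000) {
      await checkpoint(state);
      lastCheckpoint = Date.now();
    }
  }
}
```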
Here’s the key pattern—the BaseAgent abstraction:
```typescript
abstract class BaseAgent {
  protected config: AgentConfig;

  abstract execute(params: Record<string, unknown>): Promise<AgentResult>;

  protected async withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> { /* ... */ }
}
```

Every specialized agent inherits this contract. Testing agents, recon agents, validation—all share consistent timeout handling, error propagation, and rate limit enforcement.
I love constraints like this. They prevent the AI from getting creative in ways that break things.
[!WARNING] Without the BaseAgent contract, each agent would handle errors differently. Some might retry infinitely. Some might swallow errors silently. The abstraction enforces consistency across a system that could easily become chaotic.
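For example, the shared timeout guard behind withTimeout could be implemented with Promise.race. A minimal sketch, with the error message being my assumption:

```typescript
// One plausible implementation of the shared timeout guard, using
// Promise.race. The error message and cleanup details are assumptions.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Agent timed out after ${ms}ms`)),
      ms,
    );
  });

  try {
    // Whichever settles first wins; the timer is cleaned up either way.
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```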
How Do You Resume After a Crash?
Session persistence was born from pain.
I was 6 hours into scanning a large program. Found 3 promising leads. System crashed because my laptop went to sleep.
Lost everything.
Now the system saves state every 5 minutes:
```typescript
const checkpoint = {
  sessionId,
  timestamp,
  phase,
  progress,
  discoveredAssets: [/* ... */],
  findings: [/* ... */],
};

db.insert('context_snapshots', checkpoint);
```

Resume with pnpm run start --resume session-id and you're back exactly where you left off.
The database persists everything: programs, assets, findings, sessions, failure logs. Even if the application crashes, the SQLite file survives.
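The resume path is then just a matter of loading the latest snapshot. A hedged sketch assuming better-sqlite3, JSON-serialized snapshots, and illustrative column names for context_snapshots:

```typescript
// Sketch of the resume path: load the most recent checkpoint for a session
// and rehydrate state. Column names and JSON storage are assumptions.
import Database from 'better-sqlite3';

const db = new Database('bounty.db');

interface Checkpoint {
  sessionId: string;
  phase: string;
  progress: number;
  discoveredAssets: unknown[];
  findings: unknown[];
}

function resumeSession(sessionId: string): Checkpoint | undefined {
  const row = db
    .prepare(
      `SELECT data FROM context_snapshots
        WHERE session_id = ?
        ORDER BY timestamp DESC
        LIMIT 1`,
    )
    .get(sessionId) as { data: string } | undefined;

  return row ? (JSON.parse(row.data) as Checkpoint) : undefined;
}
```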
What’s the Real Code Pattern?
The finding lifecycle is a state machine:
States:
- new (just detected)
- validating (PoC execution)
- reviewed (human decision)
- submitted (sent to platform)
- dismissed (false positive)
Transitions:
- new → validating (automatic)
- validating → validating (confidence adjustment)
- validating → reviewed (0.70+ confidence)
- reviewed → submitted (human approval)
- reviewed → dismissed (human rejection)

Confidence isn't binary. A finding can bounce between states, gaining or losing credibility based on evidence.
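Encoded as an allowed-transitions table with a confidence guard, that might look like this. A sketch: the 0.70 threshold comes from the transitions above, the rest is illustrative:

```typescript
// Sketch of the finding state machine: an allowed-transitions table plus a
// confidence guard. Type and function names are illustrative.
type FindingState = 'new' | 'validating' | 'reviewed' | 'submitted' | 'dismissed';

const ALLOWED_TRANSITIONS: Record<FindingState, FindingState[]> = {
  new: ['validating'],
  validating: ['validating', 'reviewed'],
  reviewed: ['submitted', 'dismissed'],
  submitted: [],
  dismissed: [],
};

const REVIEW_THRESHOLD = 0.7;

function transition(
  current: FindingState,
  next: FindingState,
  confidence: number,
): FindingState {
  if (!ALLOWED_TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} → ${next}`);
  }
  // Evidence gate: a finding only reaches human review past the threshold.
  if (next === 'reviewed' && confidence < REVIEW_THRESHOLD) {
    return 'validating';
  }
  return next;
}
```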
I hated state machines in CS classes. But I need them here. The complexity they handle—partial validation, human-in-the-loop gates, platform-specific submission—would be chaos without them.
Where Does This Series Go Next?
This is part 1 of a 5-part series on building bug bounty automation:
- Architecture & Multi-Agent Design (you are here)
- From Detection to Proof: Validation & False Positives
- Failure-Driven Learning: Auto-Recovery Patterns
- One Tool, Three Platforms: Multi-Platform Integration
- Human-in-the-Loop: The Ethics of Security Automation
Next up: why response diff analysis beats payload detection, and how the validation agent reduced my false positive rate from “embarrassing” to “acceptable.”
Maybe the goal isn’t to automate bug bounty hunting. Maybe it’s to automate the parts that don’t require judgment—so human attention goes where it actually matters.