
Failure-Driven Learning: Auto-Recovery in Security Tools

How my bug bounty automation learns from rate limits, bans, and crashes to get smarter over time. Part 3 of 5.

Chudi Nnorukam · Dec 20, 2025 · 7 min read

My testing agent hit a rate limit at 2 AM. It retried immediately. Got rate limited again. Retried. Rate limited. Retried faster.

By the time I woke up, my IP was banned from the target’s entire infrastructure.

That specific frustration—of a system that worked against itself, making things worse with every “fix”—taught me that failure handling isn’t optional. It’s the difference between a tool and a weapon aimed at yourself.

Failure-driven learning in security automation requires classifying errors into distinct categories and applying specific recovery strategies. Rate limits need exponential backoff. Bans need immediate halt and human alert. Timeouts need reduced parallelism. The system must learn from recurring failures to prevent future damage and improve recovery over time.


What Are the 6 Failure Categories?

Every error gets classified. No generic “try again” logic.

| Category | Detection Pattern | Recovery Strategy |
| --- | --- | --- |
| Rate Limit | HTTP 429, “too many requests” | Exponential backoff (2x, max 1 hr) |
| Ban Detected | CAPTCHA, IP block, consecutive 403 | Immediate halt + human alert |
| Auth Error | 401, expired token, invalid session | Credential refresh + retry (3 max) |
| Timeout | No response > 30 seconds | Reduce parallelism + extend timeout |
| Scope Violation | Testing out-of-scope domain | Remove from queue + blacklist |
| False Positive | Validation rejection | Log pattern + update signatures |

Each category has specific recovery logic. The failure detector classifies first, then routes to the right handler.
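
As a rough sketch of that classify-then-route step (the category names mirror the table; the detection heuristics, `classifyFailure`, and the handler bodies here are illustrative placeholders, not the production logic):

type FailureCategory =
  | "rate_limit" | "ban_detected" | "auth_error"
  | "timeout" | "scope_violation" | "false_positive";

interface HttpResult {
  status?: number;
  body?: string;
  timedOutAfterMs?: number;
}

// Classify first. Ban checks run before rate limit checks: highest priority.
function classifyFailure(res: HttpResult): FailureCategory | null {
  if (res.body && /captcha|your ip has been banned|access denied permanently/i.test(res.body)) {
    return "ban_detected";
  }
  if (res.status === 429 || /too many requests/i.test(res.body ?? "")) return "rate_limit";
  if (res.status === 401) return "auth_error";
  if ((res.timedOutAfterMs ?? 0) > 30_000) return "timeout";
  return null; // nothing this sketch recognizes
}

// ...then route to the category-specific recovery handler.
const handlers: Record<FailureCategory, () => Promise<void>> = {
  rate_limit: async () => { /* exponential backoff: 2x, max 1 hour */ },
  ban_detected: async () => { /* immediate halt + human alert */ },
  auth_error: async () => { /* credential refresh, retry up to 3 times */ },
  timeout: async () => { /* reduce parallelism + extend timeout */ },
  scope_violation: async () => { /* remove from queue + blacklist */ },
  false_positive: async () => { /* log pattern + update signatures */ },
};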

In part 1, I explained how agents operate independently. This matters for failure recovery—when one agent gets rate limited, others continue. The failure is isolated.


How Does Exponential Backoff Actually Work?

Simple concept, careful implementation:

Attempt 1: Fail → Wait 30s
Attempt 2: Fail → Wait 60s (2x)
Attempt 3: Fail → Wait 120s (2x)
Attempt 4: Fail → Wait 240s (2x)
...
Maximum: 1 hour wait

The multiplier is 2x. The ceiling is 1 hour. Why a ceiling? Because some rate limits reset faster than exponential would suggest. Waiting 4 hours when the limit resets in 15 minutes wastes time.

class RateLimiter {
  private baseDelay = 30000; // 30 seconds
  private multiplier = 2;
  private maxDelay = 3600000; // 1 hour

  // Delay (ms) before the given retry attempt, 1-indexed, capped at maxDelay.
  getDelay(attemptNumber: number): number {
    const delay = this.baseDelay * Math.pow(this.multiplier, attemptNumber - 1);
    return Math.min(delay, this.maxDelay);
  }
}
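
Plugging a few attempt numbers into the class above reproduces the schedule:

const limiter = new RateLimiter();
limiter.getDelay(1); // 30000 ms (30s)
limiter.getDelay(2); // 60000 ms (60s)
limiter.getDelay(5); // 480000 ms (8 min)
limiter.getDelay(8); // 3600000 ms (3840000 uncapped, held at the 1-hour ceiling)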

I originally set no ceiling: exponential forever. I trusted the math. But the math doesn’t know that HackerOne resets rate limits every 15 minutes. Context matters.

[!TIP] Token bucket rate limiting works better for proactive throttling. Refill tokens at a steady rate (e.g., 10/second), consume on each request. When bucket empties, wait. Smoother than reactive exponential backoff.
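
A minimal token-bucket sketch for comparison (the capacity and refill rate here are illustrative, not tuned values):

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity = 10, private refillPerSecond = 10) {
    this.tokens = capacity;
  }

  // Wait until a token is available, then consume it.
  async take(): Promise<void> {
    for (;;) {
      const now = Date.now();
      // Refill at a steady rate, never beyond capacity.
      this.tokens = Math.min(
        this.capacity,
        this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSecond
      );
      this.lastRefill = now;
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      // Bucket empty: sleep roughly long enough for one token to refill.
      await new Promise((r) => setTimeout(r, 1000 / this.refillPerSecond));
    }
  }
}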


What Triggers Ban Detection?

Bans are different from rate limits. Rate limits say “slow down.” Bans say “go away.”

Detection patterns:

  1. CAPTCHA challenge - Response body contains CAPTCHA JavaScript, reCAPTCHA, hCaptcha, or a Cloudflare challenge page. The system cannot solve these automatically.
  2. IP block response - Consistent 403 or 503 from all endpoints, usually with WAF headers indicating a permanent block.
  3. Consecutive failures - 5+ requests in a row fail with the same error. Likely systematic rejection, not a transient issue.
  4. Block patterns in body - "Your IP has been banned", "Access denied permanently", "Contact security team". Explicit rejection.

When ban detected:

  1. Immediate halt - All agents stop testing this target
  2. Human alert - Notification sent (Slack, email, database flag)
  3. Session preserved - State saved so human can investigate
  4. Never auto-resume - Human must explicitly approve continuation
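
A rough sketch of that halt path; the agent pool, notifier, and session store here are placeholder interfaces, not the actual services:

interface BanEvent {
  target: string;
  pattern: string;   // what matched: CAPTCHA page, WAF header, block text
  sessionId: string;
}

async function handleBan(event: BanEvent, deps: {
  stopAllAgents: (target: string) => Promise<void>;
  notifyHuman: (msg: string) => Promise<void>;
  saveSession: (sessionId: string) => Promise<void>;
}): Promise<void> {
  await deps.stopAllAgents(event.target);   // 1. immediate halt
  await deps.notifyHuman(                   // 2. human alert
    `Ban detected on ${event.target}: ${event.pattern}. Session ${event.sessionId} preserved.`
  );
  await deps.saveSession(event.sessionId);  // 3. session preserved
  // 4. never auto-resume: there is intentionally no retry or resume call here
}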

I’ve been banned once. It happened because my failure detection was checking for rate limits but not bans. The scanner kept hammering while the target escalated from rate limit → temporary block → permanent ban.

Now ban detection has highest priority. It runs before rate limit checks.

[!WARNING] A ban from a bug bounty program can affect your reputation. Programs talk to each other. Getting permanently blocked from one target for aggressive scanning could impact your standing elsewhere. The automation must respect this.


How Does the Failure Patterns Database Work?

Recurring failures teach patterns:

// failure_patterns table schema
interface FailurePattern {
  pattern_id: string;        // Primary key
  error_signature: string;   // regex or exact match
  category: string;          // rate_limit, ban_detected, etc.
  recovery_strategy: string; // JSON config for recovery
  occurrences: number;       // how many times seen
  last_seen: Date;
  target_specific: boolean;  // applies to specific target or all
}

When a new error arrives:

  1. Check if it matches existing pattern
  2. If match found, apply learned recovery strategy
  3. If no match, use default recovery for that category
  4. After recovery, log this occurrence
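
A sketch of that lookup against the failure_patterns table above (the db helpers and the defaults map are placeholders):

async function resolveRecovery(
  errorText: string,
  category: string,
  db: {
    listPatterns: (category: string) => Promise<FailurePattern[]>;
    bumpOccurrence: (patternId: string) => Promise<void>;
  },
  defaults: Record<string, string>
): Promise<string> {
  const patterns = await db.listPatterns(category);
  // Treats each error_signature as a regex; exact-match signatures also work.
  const match = patterns.find((p) =>
    new RegExp(p.error_signature, "i").test(errorText)
  );

  if (match) {
    await db.bumpOccurrence(match.pattern_id); // step 4: log this occurrence
    return match.recovery_strategy;            // learned strategy (JSON config)
  }
  return defaults[category];                   // default recovery for the category
}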

Over time, the system learns:

  • “Target X rate limits after 50 requests per minute” → Proactively throttle to 40
  • “This WAF pattern means temporary block, wait 10 minutes” → Auto-resume after delay
  • “This error always precedes a ban” → Halt immediately, don’t wait for ban confirmation

The validation false positive signatures from part 2 use the same pattern database. Failures during validation teach what responses indicate “not a vulnerability” vs. “just an error.”


When Does the System Escalate to Humans?

Automation can’t solve everything. Escalation rules:

Immediate escalation:

  • Ban detected (any severity)
  • Scope violation detected
  • Critical system error (database corruption, etc.)

Threshold escalation:

  • Same error category 5+ times in 5 minutes
  • Auth errors not resolved after 3 credential refreshes
  • Timeout persists after reducing to minimum parallelism

Never escalate:

  • First occurrence of rate limit (handled automatically)
  • Single timeout (transient network issue)
  • False positive detection (just learning, not blocking)

The escalation notification includes:

  • Error category and pattern
  • What recovery was attempted
  • Current session state (so human can resume)
  • Suggested manual action
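
A sketch of those rules and that payload as code (the field names are illustrative; the thresholds come from the rules above):

interface EscalationTicket {
  category: string;          // error category
  pattern: string;           // matched error pattern
  recoveryAttempted: string; // what recovery was attempted
  sessionState: unknown;     // current session state, so a human can resume
  suggestedAction: string;   // suggested manual action
}

function shouldEscalate(f: {
  category: string;
  occurrencesLast5Min: number;
  authRefreshesTried: number;
  atMinimumParallelism: boolean;
}): boolean {
  // Immediate escalation (a critical-system-error check would also go here)
  if (["ban_detected", "scope_violation"].includes(f.category)) return true;
  // Threshold escalation
  if (f.occurrencesLast5Min >= 5) return true;
  if (f.category === "auth_error" && f.authRefreshesTried >= 3) return true;
  if (f.category === "timeout" && f.atMinimumParallelism) return true;
  // Everything else stays automatic
  return false;
}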

I hated adding escalation logic. It felt like admitting failure. But I needed it. Without escalation, the system either gives up too easily (abandoning valid targets) or pushes too hard (getting banned). Human judgment bridges the gap.


What’s the Recovery-Oriented Error Handling Pattern?

Traditional error handling:

try {
  await scanTarget(target);
} catch (error) {
  throw error; // Propagate up, let someone else deal with it
}

Recovery-oriented handling:

async function scanWithRecovery(target: Target): Promise<void> {
  const lastResponse = await scanTarget(target); // run the scan step
  const error = await detectError(lastResponse);

  if (!error) return; // No error, continue

  const signal = classifyError(error); // Returns FailureSignal

  const strategy = getRecoveryStrategy(signal);

  await executeRecovery(strategy, target);

  // Recovery might mean: wait, retry, refresh creds, or halt
}

Errors don’t propagate—they trigger recovery flows. The system assumes errors are normal and plans for them.

Error occurs → Classify (which category?) → Check failure_patterns (known issue?) → Apply recovery strategy → Log for learning → Continue or escalate

How Does This Connect to Session Persistence?

In part 1, I described session checkpointing. Failure recovery depends on it.

When recovery requires waiting (exponential backoff, ban cooldown), the session saves state and sleeps. When it wakes:

// On resume after failure-induced pause
const checkpoint = db.get('context_snapshots', sessionId);
const failureState = db.get('failures', sessionId);

// Check if recovery period passed
if (failureState.recoveryUntil > Date.now()) {
  // Still waiting, sleep more
  await sleep(failureState.recoveryUntil - Date.now());
}

// Resume from checkpoint
await resumeSession(checkpoint);

The system can be killed during backoff and resume correctly. No lost state, no duplicate requests, no wondering “where was I?”
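
The snippet above is the read side; the write side is symmetric. A minimal sketch, assuming a db.put counterpart to the db.get used above and a checkpointSession helper:

// On a failure that requires a cooldown (backoff wait or ban review)
async function recordRecoveryWait(sessionId: string, waitMs: number): Promise<void> {
  await checkpointSession(sessionId);   // save state before sleeping
  db.put('failures', sessionId, {
    recoveryUntil: Date.now() + waitMs, // the resume path checks this on wake
    recordedAt: Date.now(),
  });
}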


What Happens After Repeated False Positives?

False positives are a special failure category. They don’t need exponential backoff—they need pattern learning.

When validation rejects a finding:

  1. Extract the pattern that triggered detection
  2. Extract the pattern that caused rejection
  3. Add to false_positive_signatures database
  4. Adjust Testing Agent’s detection threshold for similar patterns
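
A sketch of that flow; the signature store interface here is hypothetical, standing in for the false_positive_signatures table:

interface FalsePositiveSignature {
  detectionPattern: string; // what the Testing Agent originally flagged
  rejectionPattern: string; // why validation rejected it
}

async function learnFalsePositive(
  finding: { detectionPattern: string; rejectionReason: string },
  store: {
    upsertSignature: (sig: FalsePositiveSignature) => Promise<void>;
    raiseDetectionThreshold: (detectionPattern: string) => Promise<void>;
  }
): Promise<void> {
  // Steps 1-3: extract both patterns and persist them as a signature
  await store.upsertSignature({
    detectionPattern: finding.detectionPattern,
    rejectionPattern: finding.rejectionReason,
  });
  // Step 4: make the Testing Agent less eager to report similar patterns
  await store.raiseDetectionThreshold(finding.detectionPattern);
}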

Over time:

  • “Reflected input in error messages → false positive” becomes a signature
  • Testing Agent learns to not report these as findings at all
  • Validation workload decreases
  • Human review queue gets cleaner

This connects to human-in-the-loop design in part 5. Human feedback on false positives feeds the learning system. Every rejection teaches.


What’s the Actual Failure Recovery Rate?

Before failure-driven learning:

  • ~30% of scans interrupted by unhandled errors
  • Manual intervention needed 2-3 times per target
  • Bans happened monthly (yes, really)
  • No pattern learning—same mistakes repeated

After implementation:

  • ~5% of scans need human intervention
  • Automatic recovery handles rate limits, timeouts, auth refreshes
  • Zero bans in 6 months (knock on wood)
  • Pattern database has 200+ learned signatures

The system still fails. But it fails gracefully. It preserves state, notifies humans, and learns for next time.


Where Does This Series Go Next?

This is part 3 of a 5-part series on building bug bounty automation:

  1. Architecture & Multi-Agent Design
  2. From Detection to Proof: Validation & False Positives
  3. Failure-Driven Learning: Auto-Recovery Patterns (you are here)
  4. One Tool, Three Platforms: Multi-Platform Integration
  5. Human-in-the-Loop: The Ethics of Security Automation

Next up: how one system handles three different bug bounty platforms with their own APIs, report formats, and quirks.


Maybe failure isn’t the opposite of success. Maybe it’s the input data for getting smarter—every rate limit, every timeout, every ban teaching the system what not to do next time.

Written by Chudi Nnorukam

I design and deploy agent-based AI automation systems that eliminate manual workflows, scale content, and power recursive learning. Specializing in micro-SaaS tools, content automation, and high-performance web applications.