Serverless PDF Processing: Why unpdf Beats pdf-parse on Vercel

The technical story of debugging PDF processing failures on Vercel and why unpdf is the serverless-compatible solution that actually works.

Chudi Nnorukam
Dec 28, 2025 3 min read

It was 2 AM. StatementSync was ready to deploy. I pushed to Vercel and watched the build fail.

Error: Cannot find module 'canvas'
    at Function.Module._resolveFilename

Canvas? I’m processing PDFs, not drawing graphics. Three hours later, I learned why pdf-parse breaks on serverless.

The Problem

pdf-parse is the go-to library for PDF text extraction in Node.js:

import fs from 'fs';
import pdf from 'pdf-parse';

// Read the PDF from disk and extract its text
const dataBuffer = fs.readFileSync('statement.pdf');
const data = await pdf(dataBuffer);
console.log(data.text);

Works perfectly locally. Crashes spectacularly on Vercel.

Why It Fails

pdf-parse depends on pdfjs-dist, the prebuilt npm distribution of Mozilla’s PDF.js. pdfjs-dist declares optional dependencies:

{
  "optionalDependencies": {
    "canvas": "^2.x",
    "node-fetch": "^2.x"
  }
}

Canvas is a native module that requires:

  • Python
  • node-gyp
  • C++ build tools

Vercel’s serverless runtime doesn’t have these. The build either:

  1. Fails outright with missing module errors
  2. Succeeds but crashes at runtime with segfaults

The Debugging Journey

Attempt 1: Exclude Canvas

“Just mark canvas as external,” Stack Overflow said.

// next.config.js
module.exports = {
  webpack: (config) => {
    config.externals = [...(config.externals || []), 'canvas'];
    return config;
  },
};

Result: Different error.

Error: Could not load the "canvas" module

pdfjs-dist tries to load canvas at runtime, not just build time.

Attempt 2: Legacy Build

“Use pdf-parse legacy mode,” another answer suggested.

const pdf = require('pdf-parse/lib/pdf-parse');

Result: Still fails. The dependency chain remains.

Attempt 3: pdfjs-dist Directly

“Skip pdf-parse, use pdfjs-dist with worker disabled.”

import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = '';

const pdf = await pdfjsLib.getDocument({ data: buffer }).promise;

Result: Works locally, memory errors on Vercel.

Vercel functions have a 1GB memory limit, and pdfjs-dist’s memory usage is unpredictable with large PDFs.

The Solution: unpdf

After three hours, I found unpdf:

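import { getDocumentProxy, extractText } from 'unpdf';

// unpdf's documented entry point is getDocumentProxy, which takes a Uint8Array
const pdf = await getDocumentProxy(new Uint8Array(buffer));
const { text } = await extractText(pdf, { mergePages: true });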

Result: Works. First try.

Why unpdf Works

unpdf is built specifically for serverless:

Feature | pdf-parse | unpdf
--- | --- | ---
Native deps | Yes (canvas) | No
Vercel compatible | No | Yes
Edge runtime | No | Yes
Bundle size | Large | Small
Memory usage | Unpredictable | Controlled

The library ships a serverless build of PDF.js, which is pure JavaScript with no native modules. That means no build-time compilation and no attempt to load canvas at runtime.

Implementation

Here’s the complete pattern for serverless PDF processing:

import { getDocumentProxy, extractText } from 'unpdf';

interface Transaction {
  date: string;
  description: string;
  amount: number;
  type: 'debit' | 'credit';
}

async function processPdf(buffer: Buffer): Promise<Transaction[]> {
  // Load PDF (PDF.js expects a Uint8Array rather than a Node Buffer)
  const pdf = await getDocumentProxy(new Uint8Array(buffer));

  try {
    // Extract text: one string per page, joined with newlines
    const { text } = await extractText(pdf, { mergePages: false });

    // Parse transactions (pattern-based for bank statements)
    return parseTransactions(text.join('\n'));
  } finally {
    // Cleanup, even if extraction or parsing throws
    await pdf.destroy();
  }
}

function parseTransactions(text: string): Transaction[] {
  // Bank-specific parsing patterns
  const lines = text.split('\n');
  const transactions: Transaction[] = [];

  for (const line of lines) {
    // MM/DD date, free-text description, then a signed dollar amount
    const match = line.match(/(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/);
    if (match) {
      transactions.push({
        date: match[1],
        description: match[2].trim(),
        amount: parseFloat(match[3].replace(/[$,]/g, '')),
        type: match[3].startsWith('-') ? 'debit' : 'credit'
      });
    }
  }

  return transactions;
}
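
To see what the line pattern accepts and rejects, here is a small standalone sketch. The sample statement text is made up for illustration (real layouts vary by bank), and parseSample is a hypothetical helper that applies the pattern the parser above is meant to implement.

```typescript
// Pattern: MM/DD date, free-text description, signed dollar amount.
const LINE_PATTERN = /(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/;

interface ParsedLine {
  date: string;
  description: string;
  amount: number;
}

// Hypothetical sample text; not taken from a real statement.
const sampleText = [
  'ACME BANK STATEMENT',
  '01/05 COFFEE SHOP -$4.50',
  '01/07 PAYROLL DEPOSIT $2,150.00',
  'Page 1 of 2',
].join('\n');

function parseSample(text: string): ParsedLine[] {
  const rows: ParsedLine[] = [];
  for (const line of text.split('\n')) {
    const match = line.match(LINE_PATTERN);
    if (!match) continue; // headers and footers simply do not match
    rows.push({
      date: match[1],
      description: match[2].trim(),
      // Strip "$" and thousands separators before converting to a number
      amount: parseFloat(match[3].replace(/[$,]/g, '')),
    });
  }
  return rows;
}

const parsed = parseSample(sampleText);
console.log(parsed);
```

Only the two transaction lines match; the title and page footer are skipped because they contain no MM/DD date followed by a dollar amount.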

Performance

On Vercel’s free tier (1GB memory, 10s timeout):

PDF Size | Processing Time | Memory Used
--- | --- | ---
1 page | 1-2 seconds | ~100MB
5 pages | 3-4 seconds | ~200MB
10 pages | 5-6 seconds | ~350MB
20 pages | 8-9 seconds | ~500MB

Comfortable margins for typical bank statements (1-5 pages).
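
One way to act on these numbers is to check a document's page count against the limits before processing it. The sketch below is only an illustration: it copies the upper bounds from the measurements above, and the fitsBudget helper and its 0.8 safety factor are assumptions of mine, not part of StatementSync.

```typescript
interface Measurement {
  pages: number;
  seconds: number;
  memoryMb: number;
}

// Upper bounds copied from the measurements above.
const MEASUREMENTS: Measurement[] = [
  { pages: 1, seconds: 2, memoryMb: 100 },
  { pages: 5, seconds: 4, memoryMb: 200 },
  { pages: 10, seconds: 6, memoryMb: 350 },
  { pages: 20, seconds: 9, memoryMb: 500 },
];

const TIMEOUT_SECONDS = 10; // free-tier timeout quoted above
const MEMORY_LIMIT_MB = 1024; // 1GB

// Smallest measured row that covers `pages`, or undefined past the measured range.
function nearestMeasurement(pages: number): Measurement | undefined {
  return MEASUREMENTS.find((m) => pages <= m.pages);
}

// Hypothetical guard: true only if the measured cost stays under a fraction of both limits.
function fitsBudget(pages: number, safetyFactor = 0.8): boolean {
  const m = nearestMeasurement(pages);
  if (!m) return false; // no data, so do not assume it fits
  return (
    m.seconds <= TIMEOUT_SECONDS * safetyFactor &&
    m.memoryMb <= MEMORY_LIMIT_MB * safetyFactor
  );
}

console.log(fitsBudget(5), fitsBudget(20));
```

With a 0.8 safety factor, a 5-page statement passes while a 20-page one does not, because 8-9 seconds leaves too little headroom under a 10-second timeout.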

Pattern-Based vs LLM Extraction

For structured documents like bank statements, pattern-based extraction beats LLM extraction on cost and speed at nearly the same accuracy:

Approach | Accuracy | Cost | Speed
--- | --- | --- | ---
Pattern-based | 99% | $0 | 3-5s
LLM (GPT-4) | 99.5% | $0.01-0.05 | 10-30s
OCR + LLM | 95% | $0.02-0.08 | 15-45s

For StatementSync processing 1000 statements/month:

  • Pattern-based: $0
  • LLM: $10-50/month

The 0.5% accuracy difference doesn’t justify the cost for this use case.
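
The monthly figures are just the per-statement cost range multiplied by volume. A minimal sketch of that arithmetic, using the per-statement costs quoted above (the monthlyCost helper and CostRange type are hypothetical names for illustration):

```typescript
interface CostRange {
  low: number; // dollars
  high: number; // dollars
}

// Scale a per-statement cost range to a monthly volume.
function monthlyCost(statementsPerMonth: number, perStatement: CostRange): CostRange {
  return {
    low: statementsPerMonth * perStatement.low,
    high: statementsPerMonth * perStatement.high,
  };
}

const patternBased = monthlyCost(1000, { low: 0, high: 0 });
const llm = monthlyCost(1000, { low: 0.01, high: 0.05 });

console.log(`Pattern-based: $${patternBased.low.toFixed(2)}`);
console.log(`LLM: $${llm.low.toFixed(2)}-$${llm.high.toFixed(2)}`);
```

Plugging in a different volume shows how quickly the gap grows, since the pattern-based cost stays flat while the LLM cost scales linearly.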

When to Use What

Use unpdf when:

  • Deploying to Vercel, Netlify, or Cloudflare
  • Processing structured documents (statements, invoices)
  • Need low memory footprint
  • Running on edge runtimes

Use pdf-parse when:

  • Running on traditional servers (EC2, DigitalOcean)
  • Need advanced PDF features (annotations, forms)
  • Have native build tools available

Use LLM extraction when:

  • Documents are unstructured or variable
  • Accuracy is more important than cost
  • Processing low volumes

The Lesson

The right library matters more than clever workarounds. I spent 3 hours trying to make pdf-parse work on serverless. unpdf worked in 10 minutes.

If you’re building PDF processing for serverless, start with unpdf. Save yourself the 2 AM debugging.


Written by Chudi Nnorukam

I design and deploy agent-based AI automation systems that eliminate manual workflows, scale content, and power recursive learning. Specializing in micro-SaaS tools, content automation, and high-performance web applications.

Related: From Pain Point to MVP: StatementSync in One Week | Portfolio: StatementSync