Serverless PDF Processing: Why unpdf Beats pdf-parse on Vercel
The technical story of debugging PDF processing failures on Vercel and why unpdf is the serverless-compatible solution that actually works.

It was 2 AM. StatementSync was ready to deploy. I pushed to Vercel and watched the build fail.
```
Error: Cannot find module 'canvas'
    at Function.Module._resolveFilename
```

Canvas? I’m processing PDFs, not drawing graphics. Three hours later, I learned why pdf-parse breaks on serverless.
The Problem
pdf-parse is the go-to library for PDF text extraction in Node.js:
```javascript
import fs from 'node:fs';
import pdf from 'pdf-parse';

const dataBuffer = fs.readFileSync('statement.pdf');
const data = await pdf(dataBuffer);
console.log(data.text);
```

Works perfectly locally. Crashes spectacularly on Vercel.
Why It Fails
pdf-parse depends on pdfjs-dist, the npm distribution of Mozilla’s PDF.js. pdfjs-dist declares optional dependencies:

```json
{
  "optionalDependencies": {
    "canvas": "^2.x",
    "node-fetch": "^2.x"
  }
}
```

canvas is a native module that requires:
- Python
- node-gyp
- C++ build tools
Vercel’s serverless runtime doesn’t have these. The build either:
- Fails outright with missing module errors
- Succeeds but crashes at runtime with segfaults
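You can see the mismatch directly by probing whether a module resolves in the current runtime. A small sketch (the probe filename passed to `createRequire` is arbitrary; `canvas` only resolves if the package and its build tools are actually present):

```typescript
// Sketch: check whether an optional native dependency resolves in the
// current runtime, before any library tries to load it for you.
import { createRequire } from 'node:module';
import path from 'node:path';

// createRequire just needs a path to anchor module resolution; the file
// does not have to exist.
const nodeRequire = createRequire(path.join(process.cwd(), 'probe.js'));

function canResolve(moduleName: string): boolean {
  try {
    nodeRequire.resolve(moduleName);
    return true;
  } catch {
    return false;
  }
}

console.log(`canvas resolvable: ${canResolve('canvas')}`);
```

Run locally this may print `true`; in a serverless bundle without native build tools it prints `false`, which is exactly the gap that turns into a deploy-time or runtime crash.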
The Debugging Journey
Attempt 1: Exclude Canvas
“Just mark canvas as external,” Stack Overflow said.
```javascript
// next.config.js
module.exports = {
  webpack: (config) => {
    config.externals = [...(config.externals || []), 'canvas'];
    return config;
  },
};
```

Result: different error.

```
Error: Could not load the "canvas" module
```

pdfjs-dist tries to load canvas at runtime, not just at build time.
Attempt 2: Legacy Build
“Use pdf-parse legacy mode,” another answer suggested.
```javascript
const pdf = require('pdf-parse/lib/pdf-parse');
```

Result: still fails. The dependency chain remains.
Attempt 3: pdfjs-dist Directly
“Skip pdf-parse, use pdfjs-dist with worker disabled.”
```javascript
import * as pdfjsLib from 'pdfjs-dist';

pdfjsLib.GlobalWorkerOptions.workerSrc = '';
const pdf = await pdfjsLib.getDocument({ data: buffer }).promise;
```

Result: works locally, memory errors on Vercel.

Vercel functions have a 1 GB memory limit, and pdfjs-dist’s memory usage is unpredictable with large PDFs.
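Whichever parser you end up with, it helps to fail fast on inputs that would blow past those limits. A minimal pre-flight check as a sketch; the 10 MB cap is an assumed budget for bank statements, not a measured limit:

```typescript
// Sketch: reject oversized or malformed uploads before handing them to a
// PDF parser, so a runaway document fails fast instead of exhausting memory.
// The 10 MB cap is an assumption for typical bank statements.
const MAX_PDF_BYTES = 10 * 1024 * 1024;

function assertParseable(buffer: Buffer): void {
  if (buffer.length === 0) {
    throw new Error('Empty upload');
  }
  if (buffer.length > MAX_PDF_BYTES) {
    const mb = (buffer.length / 1024 / 1024).toFixed(1);
    throw new Error(`PDF is ${mb} MB; limit is 10 MB`);
  }
  // Cheap sanity check: every PDF starts with the "%PDF-" magic bytes.
  if (!buffer.subarray(0, 5).equals(Buffer.from('%PDF-'))) {
    throw new Error('Not a PDF file');
  }
}
```

This costs nothing and turns the worst failure mode (a function killed mid-invocation) into a clean 4xx response.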
The Solution: unpdf
After three hours, I found unpdf:

```javascript
import { extractText, getDocumentProxy } from 'unpdf';

const pdf = await getDocumentProxy(new Uint8Array(buffer));
const { text } = await extractText(pdf, { mergePages: true });
```

Result: works. First try.
Why unpdf Works
unpdf is built specifically for serverless:
| Feature | pdf-parse | unpdf |
|---|---|---|
| Native deps | Yes (canvas) | No |
| Vercel compatible | No | Yes |
| Edge runtime | No | Yes |
| Bundle size | Large | Small |
| Memory usage | Unpredictable | Controlled |
unpdf ships a serverless-ready build of PDF.js with the native canvas dependency stripped out, so the parser is pure JavaScript: no build-time compilation, no runtime module loading issues.
Implementation
Here’s the complete pattern for serverless PDF processing:
```typescript
import { extractText, getDocumentProxy } from 'unpdf';

interface Transaction {
  date: string;
  description: string;
  amount: number;
  type: 'debit' | 'credit';
}

async function processPdf(buffer: Buffer): Promise<Transaction[]> {
  // Load the PDF
  const pdf = await getDocumentProxy(new Uint8Array(buffer));

  // Extract text from every page as one string
  const { text } = await extractText(pdf, { mergePages: true });

  // Parse transactions (pattern-based for bank statements)
  const transactions = parseTransactions(text);

  // Release the document's resources
  pdf.destroy();

  return transactions;
}

function parseTransactions(text: string): Transaction[] {
  // Bank-specific parsing patterns
  const lines = text.split('\n');
  const transactions: Transaction[] = [];

  for (const line of lines) {
    const match = line.match(/(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/);
    if (match) {
      transactions.push({
        date: match[1],
        description: match[2].trim(),
        amount: parseFloat(match[3].replace(/[$,]/g, '')),
        type: match[3].startsWith('-') ? 'debit' : 'credit',
      });
    }
  }

  return transactions;
}
```

Performance
On Vercel’s free tier (1GB memory, 10s timeout):
| PDF Size | Processing Time | Memory Used |
|---|---|---|
| 1 page | 1-2 seconds | ~100MB |
| 5 pages | 3-4 seconds | ~200MB |
| 10 pages | 5-6 seconds | ~350MB |
| 20 pages | 8-9 seconds | ~500MB |
Comfortable margins for typical bank statements (1-5 pages).
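Given the 10 s ceiling, it is also worth bounding the extraction step yourself rather than letting the platform kill the function mid-response. A generic sketch; `withTimeout` is a hypothetical helper, not part of unpdf:

```typescript
// Sketch: bound any async step so it fails cleanly before Vercel's 10 s
// ceiling. `withTimeout` is a generic helper, not part of unpdf.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer to avoid leaks.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: give extraction 8 s, leaving headroom for parsing and the response.
// const { text } = await withTimeout(extractText(pdf, { mergePages: true }), 8_000, 'extractText');
```

A timeout error you threw yourself can be logged and returned as a useful response; a platform-level kill cannot.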
Pattern-Based vs LLM Extraction
For structured documents like bank statements, pattern-based extraction beats LLM extraction:
| Approach | Accuracy | Cost | Speed |
|---|---|---|---|
| Pattern-based | 99% | $0 | 3-5s |
| LLM (GPT-4) | 99.5% | $0.01-0.05 | 10-30s |
| OCR + LLM | 95% | $0.02-0.08 | 15-45s |
For StatementSync processing 1000 statements/month:
- Pattern-based: $0
- LLM: $10-50/month
The 0.5% accuracy difference doesn’t justify the cost for this use case.
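For context, “pattern-based” here means little more than a regex and a few string operations. A self-contained sketch of the statement pattern used above; the sample line is invented:

```typescript
// Sketch: one statement line in, one structured transaction out.
// The date/amount pattern matches lines like "MM/DD  DESCRIPTION  -$1,234.56".
const LINE = /(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/;

function parseLine(line: string) {
  const m = line.match(LINE);
  if (!m) return null; // not a transaction line
  return {
    date: m[1],
    description: m[2].trim(),
    amount: parseFloat(m[3].replace(/[$,]/g, '')), // strip "$" and ","
    type: m[3].startsWith('-') ? ('debit' as const) : ('credit' as const),
  };
}

console.log(parseLine('03/14  COFFEE SHOP  -$4.50'));
// → { date: '03/14', description: 'COFFEE SHOP', amount: -4.5, type: 'debit' }
```

No tokens, no network calls, no latency beyond the regex engine. That is the whole cost side of the comparison.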
When to Use What
Use unpdf when:
- Deploying to Vercel, Netlify, or Cloudflare
- Processing structured documents (statements, invoices)
- Need low memory footprint
- Running on edge runtimes
Use pdf-parse when:
- Running on traditional servers (EC2, DigitalOcean)
- Need advanced PDF features (annotations, forms)
- Have native build tools available
Use LLM extraction when:
- Documents are unstructured or variable
- Accuracy is more important than cost
- Processing low volumes
The Lesson
The right library matters more than clever workarounds. I spent 3 hours trying to make pdf-parse work on serverless. unpdf worked in 10 minutes.
If you’re building PDF processing for serverless, start with unpdf. Save yourself the 2 AM debugging.
Related: From Pain Point to MVP: StatementSync in One Week | Portfolio: StatementSync