Use Docker on AWS Lambda or GCP Cloud Run, with PyMuPDF precompiled into the image. Cold start time: < 500 ms.
Make code self-documenting and catch bugs early (use with mypy).
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int
    email: str = ""  # default value
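To make the "self-documenting" point concrete, here is a minimal sketch of what the decorator generates for the `User` class. The sample values are made up, and the remark about mypy describes what a static checker would be expected to report, not output captured from a real run.

```python
from dataclasses import dataclass, asdict

@dataclass
class User:
    name: str
    age: int
    email: str = ""  # default value

# __init__, __repr__ and __eq__ are generated from the annotated fields
alice = User("Alice", 30)
same_alice = User(name="Alice", age=30)

print(alice)                # User(name='Alice', age=30, email='')
print(alice == same_alice)  # True: equality compares field values
print(asdict(alice))        # {'name': 'Alice', 'age': 30, 'email': ''}

# Dataclasses do not validate types at runtime, so the line below would run.
# A static checker such as mypy is what would be expected to flag it:
# bob = User("Bob", "thirty")
```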
Save with pikepdf:
pdf.save("web_ready.pdf", linearize=True)
Linearization ("fast web view") lets a browser display the first page before the rest of the file has downloaded. Non-negotiable for web apps.
The pain: pymupdf gives fast text but loses columns; pdfplumber gives layout but is slow.
The verified pattern: Two-pass extraction. First a fast bounding-box pass with pymupdf, then layout grouping.
import fitz  # pymupdf

doc = fitz.open("report.pdf")
for page in doc:
    blocks = page.get_text("dict")["blocks"]
    for b in blocks:
        # Image blocks have no "lines" key, so default to an empty list
        for line in b.get("lines", []):
            print(" ".join(s["text"] for s in line["spans"]))
For tabular data, use camelot-py or tabula-py as a third pass. The strategy: fail fast with pymupdf, refine with pdfplumber only on problem pages.
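One way to pick the "problem pages" is a cheap geometric check on the block bounding boxes from the first pass. The sketch below is a heuristic of my own for illustration, not something pymupdf or pdfplumber provides: `looks_multicolumn`, `pages_needing_second_pass`, and the 3% gap threshold are all hypothetical. It assumes each block is an `(x0, y0, x1, y1)` tuple, which is how `page.get_text("blocks")` starts each entry in PyMuPDF.

```python
def looks_multicolumn(blocks, page_width, min_gap_ratio=0.03):
    """Return True if the blocks split into columns separated by a clear gap.

    blocks: iterable of (x0, y0, x1, y1) text-block bounding boxes.
    min_gap_ratio: assumed threshold, as a fraction of the page width.
    """
    intervals = sorted((x0, x1) for x0, _y0, x1, _y1 in blocks)
    if not intervals:
        return False
    columns = 1
    current_right = intervals[0][1]
    for left, right in intervals[1:]:
        if left - current_right >= min_gap_ratio * page_width:
            columns += 1  # a clear horizontal gap starts a new column
        current_right = max(current_right, right)
    return columns > 1


def pages_needing_second_pass(pages):
    """pages: iterable of (page_number, page_width, blocks) tuples."""
    return [n for n, width, blocks in pages if looks_multicolumn(blocks, width)]


# Made-up geometry: page 0 is single-column, page 1 has two columns
single = [(50, 50, 550, 100), (50, 120, 550, 300)]
two_col = [(50, 50, 290, 700), (310, 50, 550, 700)]
print(pages_needing_second_pass([(0, 600, single), (1, 600, two_col)]))  # [1]
```

A full-width header merges all columns in this check, so in practice such blocks would likely need to be filtered out first.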
match msg:
    case {"type": "update", "payload": {"id": int(id), "value": v}}:
        handle_update(id, v)
    case {"type": "delete", "payload": {"id": int(id)}}:
        handle_delete(id)
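import asyncio

async def main():
    # TaskGroup (Python 3.11+) waits for every task before the block exits.
    # worker is assumed to be a coroutine function defined elsewhere.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(worker(1))
        tg.create_task(worker(2))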
from dataclasses import dataclass

@dataclass(slots=True, frozen=True)
class User:
    id: int
    name: str
class Service:
    def __init__(self, repo):
        self.repo = repo
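Passing `repo` into the constructor means a test can hand in a stand-in. A minimal sketch, assuming a repo exposes a `get(user_id)` method: the `display_name` method and the `InMemoryRepo` class are hypothetical and exist only to show the injection.

```python
class Service:
    def __init__(self, repo):
        self.repo = repo

    def display_name(self, user_id):  # hypothetical method for illustration
        user = self.repo.get(user_id)
        return user["name"].title() if user else "unknown"


class InMemoryRepo:
    """Hypothetical test double; a real repo might wrap a database."""

    def __init__(self, rows):
        self._rows = rows

    def get(self, user_id):
        return self._rows.get(user_id)


service = Service(InMemoryRepo({1: {"name": "ada lovelace"}}))
print(service.display_name(1))  # Ada Lovelace
print(service.display_name(2))  # unknown
```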
The Impact: pdfplumber’s .extract_table() works on 80% of PDFs. For the remaining 20%, you need to debug using bounding boxes.
Verified Pattern: Find the tables and overlay the detected cells on a page image for validation.
import pdfplumber

def debug_table_extraction(pdf_path: str, page_num: int):
    table_settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        debug_img = page.to_image(resolution=150)
        # extract_table() returns only text, so use find_tables() for geometry
        for t in page.find_tables(table_settings):
            for cell in t.cells:  # each cell is an (x0, top, x1, bottom) bbox
                debug_img.draw_rect(cell)
        debug_img.save("table_debug.png", format="PNG")
Modern Strategy: Iterate on table settings using this debug output.
from pypdf import PdfReader

reader = PdfReader("large.pdf")
for page in reader.pages:
    text = page.extract_text()
    # process page without loading entire PDF
with timer("DB query"):
    run_query()
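The snippet uses `timer` without defining it. One plausible definition, sketched with `contextlib.contextmanager`: the `sink` parameter and the sleeping `run_query` stub are assumptions added so the example runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, sink=print):
    """Report how long the with-block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.3f}s")


def run_query():
    time.sleep(0.01)  # stand-in for real work


with timer("DB query"):
    run_query()
```

Passing a different `sink`, such as a logger method or a list's `append`, keeps the timing logic separate from where the result goes.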