Abdullahi Mohamed // Engineering Home Base

In my last post, I talked about building a thread-safe bridge to run Python from Go. Now, I want to show a real use case: PDF Extraction.

If you've ever tried parsing complex PDFs in Go, you know it can be a bit hit-or-miss. On the other hand, Python has PyMuPDF (fitz), which is basically the gold standard for speed and accuracy. Instead of settling for a less mature Go library, I used my bridge to "borrow" PyMuPDF.

The Implementation

I created a pdfextractor package. It embeds a small Python script directly into the Go binary using //go:embed. When the Go code starts, it spawns the Python worker and passes the PDF bytes over the bridge.

Here is the Go side:

package pdfextractor

import (
	_ "embed"
	"errors"
	"io"
	"log"
	"github.com/Abdallemo/bridge" // My bridge package
)

//go:embed script.py
var pdfScript []byte

type Extractor struct {
	py BridgeCaller
}

func New() (*Extractor, error) {
	// Start bridge with pymupdf dependency using 'uv'
	b, err := bridge.New(pdfScript, 500, "--with", "pymupdf")
	if err != nil {
		return nil, err
	}
	return &Extractor{py: b}, nil
}

func (e *Extractor) Extract(r io.Reader) (Extract, error) {
	pdfBytes, err := io.ReadAll(r)
	if err != nil {
		return Extract{}, err
	}

	req := ExtractorRequest{
		Type: "bytes",
		Size: len(pdfBytes),
	}

	var resp ExtractorResponse
	// Send bytes directly over the bridge!
	if err := e.py.Call(req, &resp, pdfBytes); err != nil {
		return Extract{}, err
	}

	if resp.Error != "" {
		return Extract{}, errors.New(resp.Error)
	}

	return Extract{
		Content:    resp.Text,
		TotalPages: resp.TotalPages,
	}, nil
}

The "Borrowed" Python Script

The Python script is surprisingly simple. It just uses the runtime.py I showed in the last post to listen for requests, parses the PDF bytes using fitz, and returns the text as JSON.

from typing import Any, BinaryIO, Dict
import fitz
from runtime import run

def handler(req: Dict[str, Any], stdin_buffer: BinaryIO) -> Dict[str, Any]:
    if req["type"] == "bytes":
        size = req["size"]
        pdf_bytes = stdin_buffer.read(size)
    else:
        raise ValueError("invalid request type")

    # PyMuPDF doing the heavy lifting
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    count = doc.page_count
    text = "".join(str(page.get_text()) for page in doc)
    doc.close()

    return {"text": text, "total_pages": count}

if __name__ == "__main__":
    run(handler)

Why I like this setup

Because the Python script is embedded in the Go binary, the deployment is still just a single executable (provided uv is installed on the machine).

I don't have to worry about Python's global interpreter lock (GIL) for the whole app, and I get the rock-solid accuracy of PyMuPDF. If the Python process ever leaks memory (which it often does with large PDFs), the bridge just kills it and restarts it after 500 tasks.

This setup has been a lifesaver for my RAG pipeline. Next, I'll talk about how I actually chunk all this extracted text using a recursive splitter I ported to Go.