
Sahaf: Converting PDF and EPUB to Markdown for AI

Published on March 14, 2026 · 7 min read

Tags: markdown, sahaf, pdf, llm, ocr, open-source, books


I love reading. Always have. I tend to gravitate toward books on obscure, "shelved" topics. Turkish sources are extremely limited when it comes to digital availability, so at some point I shifted to English and a whole different world opened up.

That came with a new problem though. Not understanding :).

Some of the books I found had heavy terminology; some were just old, over a hundred years old. The English in those pages isn't modern. Dense, archaic, heavy. It's not even a vocabulary issue really; you need to decode the era the text belongs to. I never had the time to sit through all of them.

And then there are the ones completely inaccessible to me. Russian, Indonesian, Arabic. Sitting on my shelf, beautiful but unreadable. I kept them all, hoping maybe someday.

The AI Moment

When LLMs became usable, I pulled those books off the shelf immediately. I could feed a text to an AI, ask it to explain, translate, summarize, cross-reference. Suddenly those untouchable books had a door. I started experimenting: copying text, pasting into chat windows, asking for translations and commentary.

It worked. Kind of.

The problem was the format. Most of my books existed as scanned PDFs. Some were digital PDFs with selectable text, but the formatting was a mess. And even when the text was somewhat clean, feeding large amounts of it to an AI caused a different problem: the model would lose track of what it was doing. Too much text at once and the AI starts forgetting context, mixing things up, giving shallow answers. It's not just about fitting text into a window. It's about giving the AI something it can actually focus on.

Why Markdown

I discovered that Markdown is one of the cleanest formats for LLM consumption. No heavy formatting overhead, no proprietary structure, just text with minimal markup. Headings, paragraphs, emphasis. That's all you need for a book.

If I could convert my books to clean Markdown, I could feed specific chapters to an AI without drowning it in noise. I could get translations that preserved the structure of the original. I could search and cross-reference across books. And I could keep everything local, organized, readable by both humans and machines.

That's when I decided to build Sahaf.

The Name

A sahaf is a secondhand bookshop in Turkish. The old kind. Cramped, dusty, stacked floor to ceiling with books nobody else wants anymore. Browsing one feels like a treasure hunt :). You walk in not knowing what you'll find. The owner knows every spine by memory.

It felt right. This tool takes old, forgotten books and gives them a second life in a format the modern world can work with. A digital sahaf.

Sahaf started with old books, but it's not limited to that.

Any PDF or EPUB you want as clean Markdown, it handles. Academic papers, technical docs, ebooks. If the format is getting in the way of working with AI, Sahaf gets rid of it.

How It Works

You drag a PDF or EPUB into the browser. Sahaf figures out what kind of file it is: digital text, scanned images, or a mix of both. You pick a page range if you want (say, just chapter 3), hit convert, and get clean Markdown back. Optionally, you can split the output into parts that respect heading and paragraph boundaries, so you never get a chunk that cuts off mid-sentence.
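The page-range step is simple enough to sketch. Here's a hypothetical helper (`parse_page_range` is an illustrative name, not Sahaf's actual code) showing how a spec like "3,5,10-12" might map to a concrete page list:

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Parse a range spec like "12-48" or "3,5,10-12" into a sorted page list."""
    pages = set()
    for part in spec.replace(" ", "").split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    # Clamp to the document's actual page count
    return sorted(p for p in pages if 1 <= p <= total_pages)

print(parse_page_range("3,5,10-12", 11))  # → [3, 5, 10, 11]
```

Pages outside the document are silently dropped rather than raising, which is the friendlier behavior for a drag-and-drop UI.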

One detail I like: when a PDF contains images (diagrams, illustrations, charts), Sahaf extracts them and embeds them directly into the Markdown as base64. The output is fully self-contained. No separate image folder, no broken links. One .md file, everything inside.
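The base64 embedding idea fits in one small function. A minimal sketch using the standard Markdown data-URI form; `embed_image` is an illustrative name, not Sahaf's API:

```python
import base64

def embed_image(data: bytes, mime: str = "image/png", alt: str = "figure") -> str:
    """Return a Markdown image tag with the bytes inlined as a base64 data URI,
    so the .md file stays self-contained (no separate image folder)."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"![{alt}](data:{mime};base64,{b64})"

# Any extracted image bytes drop straight into the Markdown output:
tag = embed_image(b"\x89PNG...", alt="diagram")
```

The trade-off is file size: base64 inflates the image bytes by roughly a third, but in exchange there are no broken links when the file moves.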

EPUB support came later and turned out to work surprisingly well. It's lightweight, needs no GPU, and converts almost instantly.

Everything runs on your machine. No cloud APIs, no uploads to third-party servers. Your books stay yours.

Why Marker

The conversion engine under the hood is Marker. It benchmarks at 95.67% accuracy on mixed documents, and it bundles Surya OCR, which supports 90+ languages out of the box. That last part was critical for me. Turkish, English, Arabic, Russian. I needed something that handled all of them without language-specific configuration, ran locally, and produced clean output. Marker checked every box.

The Splitting Problem

This was the feature I didn't know I needed until I started actually using the tool. Converting a book to Markdown is step one. But if the output is a single 200-page file, you still can't work with it effectively. Feed the whole thing to an AI and it loses focus. Important details get buried, answers get vague, context bleeds between sections.

So I built a smart splitter. You tell it "split into 5 parts" and it finds natural boundaries: headings, horizontal rules, paragraph breaks. No mid-sentence splits, no orphaned sections.
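A boundary-respecting splitter can be sketched roughly like this. It's a simplified stand-in for Sahaf's actual splitter (`smart_split` is my name for it): only headings, horizontal rules, and blank lines count as legal cut points, so no chunk ever starts or ends mid-paragraph:

```python
def smart_split(markdown: str, n_parts: int) -> list[str]:
    """Split Markdown into roughly n_parts chunks, cutting only at natural
    boundaries: heading lines, horizontal rules, or blank lines."""
    lines = markdown.splitlines()
    # Indices where a cut is allowed (a heading opens the next chunk)
    boundaries = [i for i, ln in enumerate(lines)
                  if ln.startswith("#") or ln.strip() in ("---", "***", "")]
    target = len(lines) / n_parts  # ideal chunk length in lines
    cuts, next_cut = [], target
    for b in boundaries:
        # Cut at the first legal boundary past each target point
        if b >= next_cut and len(cuts) < n_parts - 1:
            cuts.append(b)
            next_cut += target
    chunks, prev = [], 0
    for c in cuts:
        chunks.append("\n".join(lines[prev:c]).strip())
        prev = c
    chunks.append("\n".join(lines[prev:]).strip())
    return [c for c in chunks if c]
```

If no boundary falls near a target point, the chunk simply runs long; a sentence is never cut to hit an exact size.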

This turned out to be the feature that made the whole workflow practical. Convert a book, split it into chapter-sized chunks, work through them one by one with an AI. Translate, annotate, discuss. It even changed the way I read.

Local-First, On Purpose

I could have wired this up to a cloud OCR service and probably gotten slightly better accuracy on edge cases. I chose not to, for three reasons.

Privacy. Some of the books I work with are personal. Rare manuscripts, documents I don't want sitting on a server somewhere.

Cost. Cloud OCR charges per page. When you're processing entire books, that adds up fast. Sahaf costs nothing to run after the initial model download (~2-3GB).

Independence. Services shut down, APIs change, pricing shifts. A local tool works as long as your hardware does.

What It Can't Do (Yet)

I'll be honest about the limitations.

GPU dependency for PDFs. Marker's OCR models are heavy (around 2-3GB download on first run). On a CPU-only machine, converting a 27-page scanned PDF took over an hour on my i5 with 40GB RAM. With a CUDA GPU, the same file takes minutes. EPUB conversion is instant regardless, but if you're working with scanned PDFs and don't have a GPU, it's going to be slow.

It's a prototype. The core works and I use it, but it's not polished software. The UI is functional, not fancy. There are edge cases with unusual PDF structures that don't convert cleanly. I keep developing it when I find the time.

No batch processing. One book at a time. If you want to convert an entire library, you'll need to do them sequentially.

What's Next

One thing I want to add is built-in translation. Right now, you convert a book to Markdown and then take it to an AI separately for translation. I'd like to close that loop inside Sahaf itself: convert, split, translate, all in one flow.

The Bigger Picture

Sahaf started as a personal tool to solve a personal problem. I had books I couldn't read, and I found a way to make them readable. But it connects to something larger that I think about a lot: the gap between knowledge that exists and knowledge that's accessible.

There are millions of books in languages most people can't read, in formats most tools can't parse, sitting in libraries and personal collections around the world. The combination of OCR and LLMs has the potential to unlock all of that. Not by replacing the act of reading, but by removing the barriers that prevent it.

Sahaf is my small contribution to that. A local, open source tool that turns old books into something the modern world can work with. A digital secondhand bookshop, open 24/7, no dust required.


Sahaf is open source under GPL-3.0. You can find it on GitHub.

If you find it useful, a star on the repo goes a long way. And if you have suggestions or run into issues, feel free to open an issue or reach out (email, Telegram, X, GitHub).
