Google Docs Archivist OCR: a free tool for turning vintage scans into clean, usable text

I’ve been heads-down relaunching VintageReveries, but I also shipped a little gift to fellow archivists and curious tinkerers: Google Docs Archivist OCR, a simple Google Apps Script that connects Google Drive + Google Docs + Gemini to batch-OCR folders of images with formatting you can actually use.

Repo: https://github.com/HelloJessicaM/Google-Docs-Archivist-OCR


Why I built it

Between 2011 and 2014 I scanned a lot of magazines and ephemera. The OCR from that era… was not great. This month I revisited my archive with a fresh workflow and a new scanner (hi, IRIScan 7 Pro) and wanted something:

  • fast enough to process an entire folder,
  • smart enough to detect columns/tables, and
  • simple enough to run inside Google Docs where I already edit.

Gemini 2.5 Flash handles the heavy lifting; Apps Script glues it together.
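
For the curious, here is roughly what one page's request to Gemini can look like. This is a minimal sketch, not the code from Code.gs: the function name buildOcrRequest and the prompt wording are made up for illustration, and the request shape assumes the public Gemini REST endpoint (generateContent with an inline base64 image). It only builds the request, so it runs anywhere JavaScript does.

```javascript
// Illustrative sketch only: not copied from Code.gs.
// Assumption: the public Gemini REST API (v1beta, generateContent).
const GEMINI_MODEL = 'gemini-2.5-flash';

// Hypothetical helper: builds the URL and fetch options for one JPEG page.
// base64Jpeg is the image bytes encoded as base64; apiKey comes from AI Studio.
function buildOcrRequest(base64Jpeg, apiKey) {
  const url = 'https://generativelanguage.googleapis.com/v1beta/models/' +
    GEMINI_MODEL + ':generateContent';

  const body = {
    contents: [{
      parts: [
        // Made-up prompt wording; tune it for your own scans.
        { text: 'Transcribe this scanned page. Follow reading order across ' +
                'columns and write tables as pipe-delimited rows.' },
        { inline_data: { mime_type: 'image/jpeg', data: base64Jpeg } }
      ]
    }]
  };

  // Option names match what UrlFetchApp.fetch(url, options) accepts.
  const options = {
    method: 'post',
    contentType: 'application/json',
    headers: { 'x-goog-api-key': apiKey },
    payload: JSON.stringify(body),
    muteHttpExceptions: true // lets the caller inspect error codes such as 429
  };

  return { url: url, options: options };
}
```

In Apps Script, the base64 string would typically come from Utilities.base64Encode(file.getBlob().getBytes()), and the request would be sent with UrlFetchApp.fetch(req.url, req.options).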


What it does (in plain English)

  • Batch OCR: Point it at a Google Drive folder of JPEGs; it processes the set in one click.
  • Format aware: Detects columns and tables; outputs tables as pipe-formatted rows (| col1 | col2 |) so you can paste directly into Sheets/CSV.
  • Hands-on inside Docs: Adds a custom menu (Archivist Tools) so you never leave the document.
  • Ordering fix: Latest update processes files strictly in page order (page1.jpg, page2.jpg, …) for predictable output.
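
If you would rather have real CSV than pipe rows, the conversion is small. The sketch below is a hypothetical helper (pipeRowsToCsv is not part of the repo) and assumes each table row sits on its own line, starts and ends with a pipe, and has no literal pipe inside a cell.

```javascript
// Hypothetical helper, not part of the repo.
// Quotes a CSV field only when it contains a comma, quote, or newline.
function csvEscape(value) {
  return /[",\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

// Converts pipe-formatted rows ("| col1 | col2 |") into CSV text.
// Lines that are not pipe rows are ignored, as are Markdown-style
// separator rows such as "|---|---|".
function pipeRowsToCsv(text) {
  return text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 1 && line.startsWith('|') && line.endsWith('|'))
    .filter(line => !(/^[|\s:-]+$/.test(line) && line.includes('-')))
    .map(line => line
      .slice(1, -1)            // drop the outer pipes
      .split('|')              // assumption: no literal pipes inside cells
      .map(cell => csvEscape(cell.trim()))
      .join(','))
    .join('\n');
}

const sample = '| Item | Price |\n|---|---|\n| Hat, felt | $2.50 |\nnot a table';
const csv = pipeRowsToCsv(sample);
// csv is:
// Item,Price
// "Hat, felt",$2.50
```

Cells with commas get quoted, so a price list full of "Hat, felt" style entries still lands in the right columns when you import it into Sheets.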

How to use it (5 minutes)

  1. Open a new Google Doc → Extensions → Apps Script.
  2. Paste the code from Code.gs in the repo.
  3. Grab a free API key in Google AI Studio and paste it where the script says PASTE_YOUR_API_KEY_HERE.
  4. In Apps Script, click Services (+) → add Drive API.
  5. Save, return to the Doc, refresh.
  6. Use the new menu 🌟 Archivist Tools → Run Image-to-Text OCR and paste a Drive folder link.
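
A note on step 6: a pasted link has to be turned into a folder ID before Drive can open it. The sketch below shows one way to do that. extractFolderId is a hypothetical name, not something from the repo, and it assumes the two link shapes I see most often (…/folders/<id> and …?id=<id>).

```javascript
// Hypothetical helper, not part of the repo.
// Pulls the folder ID out of a pasted Google Drive folder link.
// Assumption: the link looks like .../folders/<id> or ...?id=<id>.
// Returns null when no ID can be found.
function extractFolderId(link) {
  const match =
    /\/folders\/([A-Za-z0-9_-]+)/.exec(link) ||
    /[?&]id=([A-Za-z0-9_-]+)/.exec(link);
  return match ? match[1] : null;
}

const folderId = extractFolderId(
  'https://drive.google.com/drive/folders/1AbC_dEf-123?usp=sharing');
// folderId is '1AbC_dEf-123'
```

In Apps Script, that ID can then be handed to DriveApp.getFolderById(folderId) to list the images inside.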

Tip: Name your files sequentially if they aren’t already. The tool will follow page order.
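
Why naming matters: a plain alphabetical sort puts page10.jpg before page2.jpg. The sketch below shows one way to sort by page number instead. It is an illustration, not the repo's actual ordering logic; the helper names are hypothetical, and it assumes the page number is the last run of digits in the file name.

```javascript
// Hypothetical helpers, not the repo's actual ordering code.
// Assumption: the page number is the last run of digits before the extension.
function pageNumber(name) {
  const stem = name.replace(/\.[^.]+$/, '');       // drop ".jpg" and similar
  const match = /(\d+)(?!.*\d)/.exec(stem);        // last run of digits
  return match ? parseInt(match[1], 10) : Number.POSITIVE_INFINITY;
}

// Returns a new array sorted by page number; names without a number go last,
// and ties fall back to alphabetical order.
function sortByPageOrder(names) {
  return names.slice().sort(
    (a, b) => pageNumber(a) - pageNumber(b) || a.localeCompare(b));
}

const files = ['page10.jpg', 'page2.jpg', 'page1.jpg'];
const alphabetical = files.slice().sort();
// alphabetical is ['page1.jpg', 'page10.jpg', 'page2.jpg']
const inPageOrder = sortByPageOrder(files);
// inPageOrder is ['page1.jpg', 'page2.jpg', 'page10.jpg']
```

Zero-padding your file names (page001.jpg, page002.jpg) sidesteps the problem entirely, which is why sequential naming is the safest habit.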


[Image: promotional graphic for Google Docs Archivist OCR, showing a scanner, Google Docs, and Gemini AI icons, with examples of scanned vintage documents and OCR results.]

Real-world use: 1920s magazines at scale

I stress-tested the script on two living projects:

The tool ripped through 100+ pages in an hour or two, producing clean text blocks and pipe-tables I could drop into Sheets. It’s night-and-day compared with my early-2010s OCR: fewer hallucinations, better column handling, and far less cleanup.


Who this helps

  • Archivists & librarians turning scans into searchable text.
  • Resellers & historians extracting ads, captions, and indexes for research.
  • Bloggers & editors who want ready-to-edit copy instead of raw images.

If youโ€™ve got a box of scans and an afternoon, this will save you many hours.


Roadmap / ideas I’m exploring

  • Optional per-page Google Docs output (one Doc per image).
  • A “preserve layout” mode for magazine spreads.
  • Rate-limit helper & progress meter for very large folders.
  • One-click CSV export for detected tables.

If any of that would be useful, open an issue or star the repo so I know where to focus.


Practical notes

  • Privacy & cost: You can stay within Gemini’s free tier for moderate batches; large runs may require throttling.
  • Accuracy: Vintage typefaces, skewed scans, and low contrast will still need light human cleanup, but you’ll be editing instead of retyping.
  • File hygiene: Square up scans, crop borders, and keep a consistent DPI for best results.
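
On throttling: the usual pattern is to wait and retry when the API answers with HTTP 429 (rate limited), doubling the wait each time. Here is a minimal sketch of that idea. callWithBackoff is a hypothetical helper, not something the script ships with, and the default of 5 attempts starting at 2 seconds is an arbitrary choice rather than a tested recommendation.

```javascript
// Hypothetical helper, not part of the repo.
// Retries `call` with exponential backoff while it reports HTTP 429.
//   call  - function returning an object with a numeric `status` field
//   sleep - function that pauses for the given milliseconds; injected so the
//           same code works with Utilities.sleep in Apps Script
// Assumption: 5 attempts and a 2-second base delay are illustrative defaults.
function callWithBackoff(call, sleep, maxAttempts = 5, baseMs = 2000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = call();
    if (result.status !== 429) return result;      // not rate limited: done
    if (attempt < maxAttempts - 1) {
      sleep(baseMs * Math.pow(2, attempt));        // 2s, 4s, 8s, ...
    }
  }
  throw new Error('Still rate limited after ' + maxAttempts + ' attempts');
}
```

In Apps Script, call could wrap UrlFetchApp.fetch with muteHttpExceptions set to true and return { status: response.getResponseCode(), body: response.getContentText() }, with ms => Utilities.sleep(ms) passed as sleep.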

Try it, break it, tell me

This is a small tool, but it unlocks big momentum: instead of typing, I’m architecting. And that leaves me more time to list vintage, write, and keep building the AI-assisted systems that make my one-woman operation move.


Postscript: how this ties into my broader work

I’m actively building a fashion dictionary knowledge base for an AI listing assistant. Clean OCR is the foundation. The better my source text, the smarter my tooling gets. In the meantime, this script lets me revisit, and finally use, the scans I made a decade ago, with much cleaner results.
