OCRmyPDF 8.1.0

OCRmyPDF adds an invisible text layer to PDF documents by passing them through the Tesseract OCR engine. The output is a PDF/A file with a selectable but invisible text layer placed over the scanned page images, which makes the document searchable and suitable for archiving.
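
As a rough illustration of the workflow described above, the following Python sketch assembles and optionally runs an `ocrmypdf` command line. It assumes the `ocrmypdf` executable is installed and on the PATH; the helper names `build_ocrmypdf_command` and `run_ocr` and the file names are hypothetical and are not part of OCRmyPDF itself.

```python
import shutil
import subprocess
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str, language: str = "eng") -> List[str]:
    """Assemble an ocrmypdf command line that requests PDF/A output."""
    return [
        "ocrmypdf",
        "--output-type", "pdfa",  # request PDF/A output
        "-l", language,           # Tesseract language code
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf: str, output_pdf: str, language: str = "eng") -> int:
    """Run ocrmypdf if it is available and return its exit code."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    result = subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, language))
    return result.returncode


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")))
```

Building the argument list separately from running it keeps the command easy to inspect or log before any file is processed.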

Tags: pdf, ocr, scanning
License: GNU GPLv3
State: stable

Recent Releases

8.1.0 11 Feb 2019 06:05 minor feature: Docs: Clarify ArchLinux edition is in AUR. Add --unpaper-args. Docs: remove reference to --skip-repair since the argument was removed. Adjust the docker pull command for webservice. Unpaper-args: add test case and harden feature. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Merge 'feature/unpaper-args'. Docs: --unpaper-args. --clean-final implies --clean. Fuzz. Webservice: add an optional config and larger upload limit. Be os.nice()-r. If --tesseract-timeout 0, say nothing when we time out. Docs: more unpaper details. Activate black precommit. Exception on traversing corrupt ToC entries. When weave handoff occurs with no OCR font present. v8.1.0 release notes.
8.0.1 18 Jan 2019 03:17 minor feature: Ensure XObjects with no subtype don't cause an exception. Docs: Explain intermediate files. Docs: Update some install procedures for v8 changes. v8.0.1 notes.
8.0.0 07 Jan 2019 06:25 minor feature: docs: Ghostscript PDF/A XMP metadata loss; ocrmypdf-webservice. New template. Readme: more media. docs: try to readthedocs. Travis: remove Brewfile. v7.4.1 release notes. leptonica.py: exception on certain types of barcode failures. Detect when metadata is dropped during PDF/A conversion. Drop support for Python 3.5. Drop support for Tesseract 3. Generate test cache. Reformat with black. Sort imports with isort. Remove always-false Tess v3 tests. pdfa: remove a pile of deprecated code. Make pdfminer.six optional. travis: Convert to f-strings where it makes sense. pikepdf: version bump. Delinting. Add fish completions. use pikepdf 0.10.2. Prevent Ghostscript from generating invalid XMP metadata. Bump pikepdf version, point to release notes. v8.0.0 release notes.
7.4.0 16 Dec 2018 13:45 minor feature: leptonica: delete file junkpixt.png if created. Support using --force-ocr and --threshold or --mask-barcodes together. comment in layout.py. Add webapp stuff. webapp docker: Build from polyglot. Rename webapp to webservice. pdfa: replace PDF/A checking with pikepdf implementation. Deprecate encode/decode_pdf_date and remap to pikepdf version. Remove more libxmp dependencies. pdfinfo: FutureWarning. setup: suppress XMLParser() warning - defusedxml related. Replace Ghostscript DOCINFO and.25 metadata date regression. regression on Ghostscript path. Refactor pipeline to make PDF/A conversion a separate step. Don't open encrypted files, even if password is empty. Merge branches 'feature/newer-pike' and 'feature/webapp'. Update webservice.py with separate license. Rename to polyglot.dockerfile. Require pikepdf 0.9.0. pikepdf 0.9.0. reqs/main.txt for pikepdf 0.9.0. Require pikepdf 0.9.1. pdfinfo: tolerate PDFs that overflow and underflow the graphics stack. v7.4.0 release notes.
7.3.1 17 Nov 2018 15:45 minor feature: Docs build. Add ReadTheDocs yml so we can build with Py3.6. Detailed page analysis enabled at wrong time. Name2unicode ignoring certain markers. 'del draw' exception. Erasure of undetectable barcodes. Leptonica: make threshold functions more flexible. Pdfminer: detect TrueType fonts with no valid encoding information. More argument checking. Test case: true type font without Unicode mapping. Add test case for Type3 fonts with no Unicode mapping. Barcodes error handling. Unsupported operand Decimal, float. v7.3.1 release notes.
7.3.0 13 Nov 2018 22:05 minor feature: optimize: error in Py3.5. Create deenvvar to override Creator or Producer. Adjust for pikepdf API change. Use Ghostscript for text region detection. Remove other references to PyMuPDF. Remove obsolete _naive_find_text. Remove fitz from Travis. Ghostscript, PDF/A: support pathlib. PEP8 docstring convention misuse in a few places. Rename _optimize to optimize.py. Remove helpers.universal_open(). Replace several uses of str(path) with fspath(path). Remove special of TypeError from ruffus. Remove qpdf.merge. several pylint errors and warnings. Cleanup unused imports. tesseract.get_orientation: removed unused language parameter. pipeline: search_window variable not actually used. Cleanup some cases where log was lazy and should be. Trailing whitespace. leptonica: variables defined on class outside __init__. pdfa: function using closure when it shouldn't. Disable a pylint. Reactivate two tests that weren't using their fixtures properly. Regenerate test cache. recent versions of tesseract not registering as textonly_pdf. Ignore whether or not textonly_pdf was used in cache. optimize: use new pikepdf api for objgen. optimize: skip incremental images if any. Use newer pikepdf API for objgen. Merge branch 'test/ignore-masks'. Add Python 3.7 support. leptonica remove_colormap was replaced with a no-op at some point. Replace all Pix.read with Pix.open. Compress test images more heavily. test resources naming inconsistency. optimize: PNGs that were reduced to 1-bit being inverted. Add test case to ensure mono is not inverted. Optimize some of our bigger test files. Update test cache with naming rule change. Hopefully workaround Py3.5 marshal error. installation for Python 3.7. Improve release notes. Make jpeg/png quality tunable args. Update macOS Brewfile. Upgrade to Py3.7 locally and resolve a few. Don't use --optimize in test since jbig2enc is no
6.2.5 30 Oct 2018 21:45 minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x. Ghostscript: in strict ASCII implementation. Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. Backport blacklist of Ghostscript 9.24. v6.2.4 release notes. Disable failing test for tess 4.0rc1. Remove macOS from testing entirely. Drop support for PyMuPDF. v6.2.5 notes. travis.yml.
7.0.5 15 Sep 2018 03:16 minor feature: Docs: hyperlinking of jbig2 page (again) and cleanup release notes. Updating Arch Linux installation. Rst formatting in release notes. Pdfinfo: remove some dead code. Leptonica: update comments. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF: docs. Tests: Migrate metadata tests to pikepdf. Ghostscript: for 9.24 having jpeg passthrough available. Ghostscript: no need to specify ProcessColorModel when ColorConversio. Work around invalid TOC entries. Work around loss of Unicode DOCINFO in Ghostscript 9.24+. Check for and reject Adobe LiveCycle Designer PDFs. Pikepdf version for Travis. v7.0.5 release notes.
7.0.4 25 Aug 2018 07:05 minor feature: Error in optimize.py on PNGs at -O2. Try setuptools_scm_git_archive again. Docs: mention pikepdf install more clearly. Require pikepdf 0.3.2. v7.0.4 notes.
7.0.3 14 Aug 2018 03:16 minor feature: Optimize: Use new pikepdf Object.write API. Docs: links to JBIG2 encoder page. Require pikepdf 0.3.1. Remove pikepdf 0.3 compatibility shims since 0.3.1 is now required.
7.0.2 06 Aug 2018 05:25 minor feature: release notes typos. pipeline: remove unused function. Add intensive (optional) rotation test. ghostscript: never use autorotatepages. pipeline: revise logic of rotations to pages with nonzero /Rotate. Explain pytest --runslow. Update pinned requirements. Travis: use xenial for Python 3.7. Regroup installation page content around platforms. docs: Describe PDF optimization. Draw preview image at full resolution. Notes for v7.0.2. travis.yml syntax.
6.2.3 02 Aug 2018 06:05 minor feature: Discard alpha channel when triaging images. Revert previous commit and reject input images with alpha channel.
6.2.2 15 Jul 2018 03:19 minor feature: Ignore masks when deciding what color to rasterize at. Backport Python 3.7 for ruffus 2.7.0 from ocrmypdf v7.0.0. Cherrypick Python 3.7 documentation updates from v7.0.0. a comment about Tesseract behavior in certain versions. Cherrypick warning about --user-words not having any effect. main: do better parameter validation. Tests: Add ability to disable use of cache. Tests: Speed up a slow test (cherry-picked from v7). Travis: modernize with v7.0.0 updates. problem iterating ruffus exceptions and rotate-pages-threshold pa. ocrmypdf.exec: trap FileNotFoundError too. Skip locale check on Python 3.7. Update release notes for v6.2.2. Travis: v6 build failures. Travis: nevermind xenial, then.
7.0.0 11 Jul 2018 07:25 minor feature: Ignore masks when deciding what color to rasterize at. Remove gpg. Add wiki link to template. recent versions of tesseract not registering as textonly_pdf. v6.2.1 release notes. Use qpdf 8.0.2 backport, force old pytest-timeout to build. Merge branch 'test/ignore-masks'. : doesn't work when installed in non-Unicode path. path error on Py3.5. Remove dependency on private fork of ruffus, change to official 2.7. Remove ruffus 2.6.3 exception special casing. Update release notes. Update readme. Declare certain APIs public. typo introduced in. Merge branch 'develop' (7.0.0) into master.
6.2.1 25 Jun 2018 05:25 minor feature: Remove gpg. Add wiki link to template. recent versions of tesseract not registering as textonly_pdf. v6.2.1 release notes. Use qpdf 8.0.2 backport, force old pytest-timeout to build.
7.0.0rc1 07 Jun 2018 17:45 minor feature: Use python-xmp-toolkit for xmp check. Optimize: use tempdir for cmdline invocation. Suppress some spurious tesseract errors. Optimize: error in Py3.5.
6.2.0 07 May 2018 16:45 minor feature: Use more standard __version__ rather than PILLOW_VERSION. Add support for PDF/A-3. helpers: missing call to complain(). Don't suppress error message from config_notfound. helpers.py again. Add gpg key to template. test_pageinfo: remove duplicate import. --remove-background error on PDFs with colormapped images. Expand size growth reasons to other arguments that trigger transcoding. Update Dockerfile for Ubuntu 18.04. Add 18.04 update procedure. XMP validation with /CreationDate. Merge branch 'feature/pdfa3'. v6.2.0 Release notes. v6.2.0. failure to prevent use of Ghostscript on /UserUnit files. Trap PDF/A-3 errors on old Ghostscript.
6.1.5 03 May 2018 22:00 minor feature:
3.0 14 Sep 2015 17:45 minor feature: bump to v3.0 and move repos. Test case: No longer using JHOVE. Move to my repo: github.com/fritz-hh = jbarlow83.
3.0-rc9 31 Aug 2015 01:45 minor feature: Throw exception if iccprofiles not found instead of returning None. unpaper: support paletted files by conversion instead of bailing. Use png256 raster device when possible. Prevent running validation on missing file after an exception is thrown. Add test cases for additional image formats. ghostscript: quiet startup on rasterize. Bump version to -rc9.