如何用 Python 中的 OCR 技术读取 PDF 内容

Python 是当今世界最受欢迎的编程语言之一。我们可以用它来分析数据，但数据并不总是以所需的格式提供。在这种情况下，我们可以将文件的格式从 pdf、jpg 转换为文本(。txt)格式，以便以更好的方式分析数据。有许多库可用于执行此类任务。

我们可以使用 Python 的 PyPDF2 模块来执行转换。pdf 文件转换为文本格式。使用该模块时，我们面临的主要缺点是编码方案。PDF 文档文件可以包含多种编码，如 Unicode、ASCII、UTF-8 等。因此，将 PDF 文件转换为文本可能会因为编码模式而导致数据丢失。

在本教程中，我们将学习如何读取 PDF 文件的内容并将其存储在文本文件中。txt)格式，采用“OCR”方法。

首先，我们必须将 PDF 文档文件的页面转换成图像，然后，我们将使用 OCR 从图像中读取内容并将其存储在文本中()。txt)格式文件。

所需模块:

在本教程中，我们必须使用给定的命令安装以下模块:

电池:


!pip3 install PIL

中小企业:-


!pip3 install pytesseract

pdf2image: -


!pip3 install pdf2image

宇宙魔方-ocr: -


!pip3 install tesseract-ocr

(为此用户应该拥有微软 Visual C++ 14.0，使用“Visual Studio 构建工具”获取:https://visualstudio.microsoft.com/downloads/)

第一部分:

第一部分将处理转换我的 PDF 页面为图像文件。PDF 文件的每一页都将存储为图像文件，图像的名称将存储为:


PDF page no. 1: page_no_1.jpg
PDF page no. 2: page_no_2.jpg
PDF page no. 3: page_no_3.jpg
PDF page no. 4: page_no_4.jpg
.
.
PDF page no. n: page_no_n.jpg

第二部分:

第二部分将处理从图像文件中识别文本，将其分类到文本文件中。txt”格式。在这里，我们将处理图像文件，将其转换为文本内容。一旦我们将文本作为字符串变量，我们就可以开始处理文本了。txt)文件。比如，在很多 PFD 文件中，我们可以看到，当一行完整，但最后一个字不能完全写在同一行，那么，在最后加一个连字符，这个字就延续到下一行。例如:


This is an example to show the above explanation of the wo-
rd which cannot be written entirely in the same line and is conti-
nued in the next line.

对于这样的单词，我们将做一个基本的预处理，将连字符和下面的行转换成一个完整的单词。当我们完成预处理后，这些文本将被分类到一个单独的文本文件中。

代码:


from PIL import Image as img
import pytesseract as PT
import sys
from pdf2image import convert_from_path as CFP
import os
# Importing the pdf file
PDF_file_1 = "exp.pdf"
pages_1 = CFP(PDF_file1, 9)

# Now, we will create a counter for storing images of each page of PDF to image
image_counter1 = 1

# Iterating through all the pages of the pdf file stored above
for page in pages_1:

    # We will Declare the  filename for each page of PDF file as JPG file
    # For each page, the filename will be:
    # PDF page no. 1: Page_no_1.jpg
    # PDF page no.2: Page_no_2.jpg
    # PDF page no. 3: Page_no_3.jpg
    # PDF page no. 4: Page_no_4.jpg
    # .... and so on..
    # PDF page n: page_n.jpg
    filename1 = "Page_no_" + str(image_counter) + " .jpg"

    # Now, we will save the image of the page in system
    page.save(filename1, 'JPEG')

    # Then, we will increase the counter for updating filenames
    image_counter1 = image_counter1 + 1

'''
Part #2 - Recognize the text content from the image files by using OCR
'''
# Variable for getting the count of the total number of pages
filelimit1 = image_counter1 - 1

# then, we will create a text file for writing the output
out_file1 = "output_text.txt"

# Now, we will open the output file in append mode so that all contents of the # images will be added in the same output file.
f_1 = open(out_file1, "a")

# Iterating from 1 to total number of pages
for K in range(1, filelimit1 + 1):

    # Now, we will set filename for recognizing text from images
    # Again, these files will be:
    # Page_no_1.jpg
    # Page_no_2.jpg
    # Page_no_3.jpg
    # ....
    # page_no_n.jpg
    filename1 = "Page_no_" + str(K) + " .jpg"

    # Here, we will write a code for recognizing the text as a string variable in an image file by using the pytesserct module
    text1 = str(((PT.image_to_string (Image.open (filename1)))))

    # : The recognized text will be stored in variable text
    # : Any string variable processing may be applied to text content
    # : Here, basic formatting will be done:-

    text1 = text1.replace('-\n', '')    

    # At last, we will write the processed text into the file.
    f_1.write(text1)

# Closing the file after writing all the text content.
f_1.close()

输出:

输入 PDF 文件:

输出文本文件:

我们可以看到，PDF 文件的页面是如何转换成图像的。然后这些图像被读取，它们的内容被写入文本文件。

使用 OCR 方法的优点:

用户可以避免基于文本的转换，这种转换会因编码方案而导致数据丢失。
OCR 模块还可以识别 PDF 文件中的手写内容。
用户还可以通过使用 OCR 模块来修改仅识别 PDF 的特定页面。
预处理可以大量完成，因为它以可变的形式获得文本。

使用 OCR 方法的缺点:

磁盘存储器用于在本地系统中存储图像文件。但是，这些文件占用的空间很少。
OCR 的使用不能保证 100%的准确性。而在计算机上创建的 PDF 文件文档具有很高的准确性。
OCR 模块可以识别手写内容，但准确性取决于许多因素，如页面的颜色、笔迹的清晰度等。

结论

在本教程中，我们讨论了如何使用 Python 中的 OCR 来读取 PDF 文件的内容。

Windows 软件

Linux 软件

Mac 软件

安卓软件

各类文章

如何用 Python 中的 OCR 技术读取 PDF 内容

所需模块:

第一部分:

第二部分:

代码:

使用 OCR 方法的优点:

使用 OCR 方法的缺点:

结论

如何用 Python 中的 OCR 技术读取 PDF 内容

Python图片转PDF的实现

JavaTesseract： OCR识别技术在Java中的应

Python pytesseract 全能OCR库

python读取pdf文件尺寸,python读取pdf内容

如何复制网站上不能被复制的内容

PDF复制出来的文字乱码

python技巧笔记（python自学笔记）

使用Python技术在Android上实现PDF阅读功能

python解析pdf文档,python的pdf库

关于python学习第四次笔记的信息

最新一代OCR文字识别技术——EasyOCR，令你的应用更智

python笔记第六天,python第六周笔记

Python读取PDF文件

python逆向工程笔记（Python逆向工程）

python标准库的pdf文档,python中文手册 pdf

使用Python Pytesseract进行OCR识别

python的用法笔记本（笔记本学python）

PDF提取表格全面解析

python基础学习整理笔记,Python课堂笔记

Windows 软件

Linux 软件

Mac 软件

安卓软件

各类文章

如何用 Python 中的 OCR 技术读取 PDF 内容

所需模块:

第一部分:

第二部分:

代码:

使用 OCR 方法的优点:

使用 OCR 方法的缺点:

结论

如何用 Python 中的 OCR 技术读取 PDF 内容

Python图片转PDF的实现

JavaTesseract： OCR识别技术在Java中的应

Python pytesseract 全能OCR库

python读取pdf文件尺寸,python读取pdf内容

如何复制网站上不能被复制的内容

PDF复制出来的文字乱码

python技巧笔记（python自学笔记）

使用Python技术在Android上实现PDF阅读功能

python解析pdf文档,python的pdf库

关于python学习第四次笔记的信息

最新一代OCR文字识别技术——EasyOCR，令你的应用更智

python笔记第六天,python第六周笔记

Python读取PDF文件

python逆向工程笔记（Python逆向工程）

python标准库的pdf文档,python中文手册 pdf

使用Python Pytesseract进行OCR识别

python的用法笔记本（笔记本学python）

PDF提取表格全面解析

python基础学习整理笔记,Python课堂笔记

人机检测，请谅解