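I. CRNN Overview
CRNN (Convolutional Recurrent Neural Network) combines a convolutional neural network (CNN, used here without fully connected layers) with a recurrent neural network (RNN). It is mainly used for scene text recognition (STR), that is, reading text from natural images. CRNN pairs the CNN's ability to extract high-level image features with the RNN's ability to capture context along a sequence, which is why it performs well on text recognition.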
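II. CRNN Structure
A CRNN consists of three parts: convolutional layers, recurrent layers, and a transcription layer.
1. Convolutional layers
The convolutional layers extract features from the raw image. This part is typically a stack of several convolution and pooling layers: the convolutions extract features, while pooling shrinks the feature map, which lowers the computational cost and gives some invariance to small spatial shifts. Pooling also acts as a simple form of feature selection, keeping the strongest responses and discarding the rest. The block below combines a convolution, batch normalization, and an activation into one reusable unit.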
import torch.nn as nn
import torch

class Conv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, dilation=1, groups=1,
                 norm_layer=None, activation_layer=None, bias=True):
        super(Conv, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias=bias)
        self.bn = norm_layer(out_channels)
        self.act = activation_layer

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.act(x)
        return x
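To see how pooling shrinks a 32 x 100 input into a short feature sequence, the size after each layer can be worked out by hand. The sketch below applies the standard output-size formula for convolution and pooling. The layer list is an illustrative assumption (3x3 convolutions with padding 1, and five poolings of which only the first two halve the width), not the configuration of any particular CRNN implementation.

```python
def out_size(size, kernel, stride, padding=0, dilation=1):
    """Output length along one axis for a convolution or pooling layer."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Each entry: (name, (kernel_h, kernel_w), (stride_h, stride_w), (pad_h, pad_w)).
# This layout is assumed for illustration only.
CONV = ("conv3x3", (3, 3), (1, 1), (1, 1))
POOL_2X2 = ("pool2x2", (2, 2), (2, 2), (0, 0))
POOL_2X1 = ("pool2x1", (2, 1), (2, 1), (0, 0))
LAYERS = [CONV, POOL_2X2, CONV, POOL_2X2, CONV, POOL_2X1, CONV, POOL_2X1, CONV, POOL_2X1]


def trace(height, width, layers):
    """Print the feature-map size after every layer and return the final size."""
    for name, kernel, stride, pad in layers:
        height = out_size(height, kernel[0], stride[0], pad[0])
        width = out_size(width, kernel[1], stride[1], pad[1])
        print(f"{name}: {height} x {width}")
    return height, width


final_h, final_w = trace(32, 100, LAYERS)
print(final_h, final_w)  # 1 25
```

With this layout the height collapses to 1 and the width becomes 25, so each of the 25 remaining columns can serve as one time step for the recurrent layers.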
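2. Recurrent layers
The recurrent layers process the feature sequence. Because text is a sequence, the model needs a component that can capture sequential information. An RNN (recurrent neural network) does this: its state at each step depends on both the previous state and the current input. The bidirectional LSTM below reads the sequence in both directions and projects every time step to nOut dimensions.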
class BidirectionalLSTM(nn.Module):
    def __init__(self, nIn, nHidden, nOut):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(nIn, nHidden, bidirectional=True)
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input):
        recurrent, _ = self.rnn(input)
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h)
        output = self.embedding(t_rec)  # [T * b, nOut]
        output = output.view(T, b, -1)
        return output
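The `nHidden * 2` input size of the linear layer comes from the bidirectional LSTM concatenating its forward and backward states at every time step. A small shape check illustrates this; the sequence length and batch size are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, b, n_in, n_hidden = 25, 2, 512, 256  # T and b are illustrative values

rnn = nn.LSTM(n_in, n_hidden, bidirectional=True)
x = torch.randn(T, b, n_in)        # time-major input: [T, b, n_in]
out, (h_n, c_n) = rnn(x)

print(out.shape)  # [T, b, 2 * n_hidden]: forward and backward states side by side
print(h_n.shape)  # [2, b, n_hidden]: final state of each direction

# The forward direction finishes at the last time step,
# the backward direction finishes at the first one.
forward_last = out[-1, :, :n_hidden]
backward_last = out[0, :, n_hidden:]
print(torch.allclose(forward_last, h_n[0]), torch.allclose(backward_last, h_n[1]))
```

Every time step therefore sees context from both the left and the right of its position, which is useful when a character is ambiguous on its own.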
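3. Transcription layer
The transcription layer turns the features into text. Specifically, it transcribes the feature sequence produced by the convolutional and recurrent layers into a character sequence. This can be done with CTC (Connectionist Temporal Classification).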
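class Transcription(nn.Module):
    def __init__(self, n_class):
        super(Transcription, self).__init__()
        # 512 matches the output width of a bidirectional LSTM with 256 hidden units
        self.fc = nn.Linear(512, n_class)

    def forward(self, x):
        # x: [T, b, 512]. nn.Linear acts on the last dimension, so the time and
        # batch dimensions are kept. Flattening with view(T, -1) would merge the
        # batch into the features and only work for b == 1.
        x = self.fc(x)  # [T, b, n_class] per-step class scores (logits)
        return x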
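The training code in section IV instantiates `CRNN(n_channel, n_hidden, n_class)`, a class whose definition does not appear in this article. The following is a minimal sketch of how the three parts could be wired together, assuming a 32-pixel-high input. The class name `SketchCRNN`, the channel widths, and the pooling layout are assumptions for illustration, not the author's model.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out, pool):
    """3x3 convolution + batch norm + ReLU, followed by max pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=pool, stride=pool),
    )


class SketchCRNN(nn.Module):
    """Hypothetical CRNN: conv stack -> bidirectional LSTM -> linear classifier."""

    def __init__(self, n_channel, n_hidden, n_class):
        super().__init__()
        # Five poolings halve the height 32 -> 1; only the first two halve the width.
        self.cnn = nn.Sequential(
            conv_block(n_channel, 64, (2, 2)),
            conv_block(64, 128, (2, 2)),
            conv_block(128, 256, (2, 1)),
            conv_block(256, 512, (2, 1)),
            conv_block(512, 512, (2, 1)),
        )
        self.rnn = nn.LSTM(512, n_hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_class)

    def forward(self, x):
        feat = self.cnn(x)                      # [b, 512, 1, W']
        assert feat.size(2) == 1, "this layout expects an input height of 32"
        seq = feat.squeeze(2).permute(2, 0, 1)  # [W', b, 512]: one step per column
        rec, _ = self.rnn(seq)                  # [W', b, 2 * n_hidden]
        return self.fc(rec)                     # [W', b, n_class] logits


model = SketchCRNN(n_channel=1, n_hidden=256, n_class=37)
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 32, 100))  # batch of two blank 32 x 100 images
print(logits.shape)  # torch.Size([25, 2, 37])
```

The output is time-major, `[T, b, n_class]`, which is the layout `nn.CTCLoss` expects once a log-softmax over the class dimension has been applied.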
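III. CRNN Parameter Settings
The CRNN parameters are set as follows:
n_class = 37 # 26 letters + 10 digits + 1 extra class (the CTC blank)
input_height = 32 # image height
n_channel = 1 # number of image channels; 1 for grayscale images
n_hidden = 256 # number of hidden units in the recurrent layers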
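One way to arrive at 37 classes is 26 lowercase letters plus 10 digits plus one blank index for CTC. A minimal sketch of such a character table follows; the alphabet, the choice of index 0 for the blank, and the helper name `encode` are assumptions for illustration.

```python
import string

BLANK = 0  # assumed: index 0 is reserved for the CTC blank
ALPHABET = string.ascii_lowercase + string.digits  # 26 letters + 10 digits

# Character indices start at 1 so that they never collide with the blank.
char_to_id = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
id_to_char = {i: ch for ch, i in char_to_id.items()}
n_class = len(ALPHABET) + 1  # +1 for the blank


def encode(text):
    """Map a string to a list of class indices (case-insensitive)."""
    return [char_to_id[ch] for ch in text.lower()]


print(n_class)         # 37
print(encode("ab12"))  # [1, 2, 28, 29]
```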
IV. CRNN Training
Training a CRNN requires a training set and a validation set, with the data fed to the network in batches of a chosen batch size.
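from torchvision import transforms, datasets
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([
    transforms.Grayscale(),                  # convert color images to grayscale
    transforms.Resize((input_height, 100)),  # set the height to 32 and squeeze the width to 100
    transforms.ToTensor(),                   # convert the image to a Tensor
])
train_dataset = datasets.ImageFolder(root="./train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_dataset = datasets.ImageFolder(root="./test", transform=transform)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)  # evaluation data need not be shuffled

# CRNN is the full model built from the three parts above; it is expected to
# return per-step class scores of shape [T, b, n_class].
crnn = CRNN(n_channel, n_hidden, n_class).to(device)
optimizer = torch.optim.Adam(crnn.parameters(), lr=0.0001)
loss_fn = nn.CTCLoss(blank=0)
num_epoch = 20
for epoch in range(num_epoch):
    train_loss = 0.0
    for idx, (image, label) in enumerate(train_loader):
        image = image.to(device)
        # ImageFolder yields one class index per image (its sub-folder), so each
        # target is a length-1 sequence here. Indices are shifted by 1 because
        # index 0 is reserved for the CTC blank. Real text labels need a Dataset
        # that returns encoded character sequences and their lengths.
        target = (label + 1).view(-1, 1).to(device)
        target_size = torch.ones(target.size(0), dtype=torch.long)
        output = crnn(image).log_softmax(2)  # nn.CTCLoss expects log-probabilities
        output_size = torch.IntTensor([output.size(0)] * output.size(1))
        loss = loss_fn(output, target, output_size, target_size)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print("Epoch: ", epoch, "Loss: ", train_loss / len(train_loader))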
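For real text labels, `nn.CTCLoss` needs the length of every prediction sequence and of every label sequence in the batch. The sketch below shows the expected shapes, with random scores standing in for the network output. The sizes and the two example label sequences are made up, and index 0 is assumed to be the blank.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, b, n_class = 25, 2, 37  # time steps, batch size, classes (index 0 = blank)

logits = torch.randn(T, b, n_class, requires_grad=True)  # stand-in for the CRNN output
log_probs = logits.log_softmax(2)  # CTCLoss expects log-probabilities

# Two labels of different length, padded to the same width with 0.
targets = torch.tensor([[8, 5, 12, 12, 15],
                        [3, 1, 20, 0, 0]], dtype=torch.long)
target_lengths = torch.tensor([5, 3], dtype=torch.long)  # true length of each label
input_lengths = torch.full((b,), T, dtype=torch.long)    # every sample uses all T steps

loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
loss = loss_fn(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```

Padding values beyond each entry of `target_lengths` are ignored by the loss, so labels of different lengths can share one batch.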
V. CRNN Recognition
Given an input image, the CRNN produces the corresponding text. The code is as follows:
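from PIL import Image

image_path = "./test/1.png"
image = Image.open(image_path)
image = transform(image).unsqueeze(0)  # add a batch dimension: [1, 1, 32, 100]
image = image.to(device)
crnn.eval()
with torch.no_grad():  # no gradients are needed for inference
    output = crnn(image)  # [T, 1, n_class]
output_argmax = output.argmax(dim=2).squeeze(1)  # most likely class per time step: [T]
# convert_to_text and id_to_char come from outside this snippet; the helper is
# expected to collapse repeated indices, drop blanks, and map indices to characters.
predicted_sentence = convert_to_text(output_argmax, id_to_char)
print("Predicted sentence: ", predicted_sentence)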
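A helper such as `convert_to_text` typically performs greedy CTC decoding: collapse consecutive repeats, then drop blanks. A minimal sketch follows, assuming index 0 is the blank; the function name `ctc_greedy_decode` and the toy character table are hypothetical.

```python
def ctc_greedy_decode(indices, id_to_char, blank=0):
    """Collapse repeated indices, remove blanks, and map the rest to characters."""
    chars = []
    previous = blank
    for idx in indices:
        idx = int(idx)  # also accepts 0-dim tensors
        if idx != blank and idx != previous:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


id_to_char = {1: "c", 2: "a", 3: "t", 4: "o"}  # toy table for the example
path = [0, 1, 1, 0, 4, 4, 0, 4, 3, 3, 0]       # per-step argmax indices
print(ctc_greedy_decode(path, id_to_char))     # coot
```

The blank between the two runs of index 4 is what lets the decoder keep both letters of a doubled character instead of merging them into one.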