Python深度学习12——Keras实现self-attention中文文本情感分类(详解)

Keras封装性比较高,现在的注意力机制都是用pytorch较为多。但是使用函数API也可以实现,Keras处理文本并且转化为词向量也很方便。

本文使用了一个外卖评价的数据集,标签是0和1,1代表好评,0代表差评。并且构建了12种模型,即 MLP,1DCNN,RNN,GRU,LSTM,   CNN+LSTM,TextCNN,BiLSTM, Attention, BiLSTM+Attention,BiGRU+Attention,Attention*3(3个注意力层堆叠)

大家也可以在此基础上参考改进,组合出更好的模型。(需要数据集和停用词可以留言)

本文的注释算是我写博客最详细的一篇了。


中文数据预处理 

由于中文不像英文中间有空白可以直接划分词语,需要依靠jieba库切词,然后把没有用的标点符号,或者是“了”,‘的’,‘也’,‘就’,‘很’…..等等没有用的虚词去掉。这就需要一个停用词库,大家可以网上找常用的停用词文本,也可以留言找博主要。我这有一个比较全的停用词,我还有一个简化版的。本次使用的是简化版的停用词。

首先看数据长这样

导入包和数据,读取停用词,用jieba库划分词汇并处理

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['KaiTi']  #指定默认字体 SimHei黑体
plt.rcParams['axes.unicode_minus'] = False   #解决保存图像是负号'
import jieba
stop_list  = pd.read_csv("stopwords_简略版.txt",index_col=False,quoting=3,
                         sep="\t",names=['stopword'], encoding='utf-8')


#Jieba分词函数
def txt_cut(juzi):
    lis=[w for w in jieba.lcut(juzi) if w not in stop_list.values]
    return " ".join(lis)

df=pd.read_excel('外卖.xlsx')
data=pd.DataFrame()
data['label']=df['label']
data['cutword']=df['review'].astype('str').apply(txt_cut)
data

词汇切割好了,得到如下结果

 查看标签y的分布

data['label'].value_counts().plot(kind='bar')

 负面评价0有将近8000个,正面评价1有4000个,不平衡,划分训练测试集时要分层抽样。

下面将文本变为数组,利用Keras里面的Tokenizer类实现,首先将词汇都索引化。这里有个参数num_words=6000很重要,意思是选择6000个词汇作为索引字典,也就是这个模型里面最多只有6000个词。

from os import listdir
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
# 将文件分割成单字, 建立词索引字典     
tok = Tokenizer(num_words=6000)
tok.fit_on_texts(data['cutword'].values)
print("样本数 : ", tok.document_count)

print({k: tok.word_index[k] for k in list(tok.word_index)[:10]})

 由于每个评论的词汇长度不一样,我们训练时需要弄成一样长的张量(多剪少补),需要确定这个词汇最大长度为多少,也就是max_words参数,这个是循环神经网络的时间步的长度,也是注意力机制的维度。如果max_words过小则很多语句的信息损失了,而max_words过大数据矩阵又会过于稀疏,并且计算量过大。我们查看一下X的长度的分布频率:

# 建立训练和测试数据集 
X= tok.texts_to_sequences(data['cutword'].values)
#查看x的长度的分布
length=[]
for i in X:
    length.append(len(i))
v_c=pd.Series(length).value_counts()
print(v_c[v_c>20])   #频率大于20才展现
v_c[v_c>20].plot(kind='bar',figsize=(12,5))

可以看出绝大多数的句子单词长度不超过10….长度为5的评论是最多的,本次选择max_words=20,将句子都裁剪为长为20 的向量。并取出y

# 将序列数据填充成相同长度 
X= sequence.pad_sequences(X, maxlen=20)
Y=data['label'].values
print("X.shape: ", X.shape)
print("Y.shape: ", Y.shape)
#X=np.array(X)
#Y=np.array(Y)

然后划分训练测试集,查看形状: 

X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=0)
X_train.shape,X_test.shape,Y_train.shape, Y_test.shape

将y进行独立热编码,并且保留原始的测试集y_test,方便后面做评价。查看x和y前3个

Y_test_original=Y_test.copy()
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)

print(X_train[:3])
print(Y_test[:3])

 


开始构建神经网络

由于Keras里面没有封装好的注意力层,需要我们自己定义一个:

#自定义注意力层
from keras import initializers, constraints,activations,regularizers
from keras import backend as K
from keras.layers import Layer
class Attention(Layer):
    #返回值:返回的不是attention权重,而是每个timestep乘以权重后相加得到的向量。
    #输入:输入是rnn的timesteps,也是最长输入序列的长度
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),initializer=self.init,name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),initializer='zero', name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None     ## 后面的层不需要mask了,所以这里可以直接返回none

    def call(self, x, mask=None):
        features_dim = self.features_dim    ## 这里应该是 step_dim是我们指定的参数,它等于input_shape[1],也就是rnn的timesteps
        step_dim = self.step_dim
        
        # 输入和参数分别reshape再点乘后,tensor.shape变成了(batch_size*timesteps, 1),之后每个batch要分开进行归一化
         # 所以应该有 eij = K.reshape(..., (-1, timesteps))

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b        
        eij = K.tanh(eij)    #RNN一般默认激活函数为tanh, 对attention来说激活函数差别不大,因为要做softmax
        a = K.exp(eij)
        if mask is not None:    ## 如果前面的层有mask,那么后面这些被mask掉的timestep肯定是不能参与计算输出的,也就是将他们attention权重设为0
            a *= K.cast(mask, K.floatx())   ## cast是做类型转换,keras计算时会检查类型,可能是因为用gpu的原因

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)      # a = K.expand_dims(a, axis=-1) , axis默认为-1, 表示在最后扩充一个维度。比如shape = (3,)变成 (3, 1)
        ## 此时a.shape = (batch_size, timesteps, 1), x.shape = (batch_size, timesteps, units)
        weighted_input = x * a    
        # weighted_input的shape为 (batch_size, timesteps, units), 每个timestep的输出向量已经乘上了该timestep的权重
        # weighted_input在axis=1上取和,返回值的shape为 (batch_size, 1, units)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):    ## 返回的结果是c,其shape为 (batch_size, units)
        return input_shape[0],  self.features_dim

别管这个类多复杂…..不用看,后面直接当成函数用就行。

下面导入Keras里面的常用的神经网络层,定义一些参数

from keras.preprocessing import sequence
from keras.models import Sequential,Model
from keras.layers import Dense,Input, Dropout, Embedding, Flatten,MaxPooling1D,Conv1D,SimpleRNN,LSTM,GRU,Multiply
from keras.layers import Bidirectional,Activation,BatchNormalization
from keras.layers.merge import concatenate
seed = 10
np.random.seed(seed)  # 指定随机数种子  
#单词索引的最大个数6000,单句话最大长度20
top_words=6000  
max_words=20
num_labels=2  #2分类

下面构建模型函数,这个函数较为复杂,因为是12个模型一起定义的,方便代码的复用。但每个模型对应的那一块都写的很清楚:

def build_model(top_words=top_words,max_words=max_words,num_labels=num_labels,mode='LSTM',hidden_dim=[32]):
    if mode=='RNN':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(SimpleRNN(32))  
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='MLP':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))  
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(LSTM(32))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='GRU':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(GRU(32))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='CNN':        #一维卷积
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Conv1D(filters=32, kernel_size=3, padding="same",activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='CNN+LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))    
        model.add(Conv1D(filters=32, kernel_size=3, padding="same",activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(LSTM(64))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='BiLSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Bidirectional(LSTM(64)))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation='softmax'))
    #下面的网络采用Funcional API实现
    elif mode=='TextCNN':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        ## 词嵌入使用预训练的词向量
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        ## 词窗大小分别为3,4,5
        cnn1 = Conv1D(32, 3, padding='same', strides = 1, activation='relu')(layer)
        cnn1 = MaxPooling1D(pool_size=2)(cnn1)
        cnn2 = Conv1D(32, 4, padding='same', strides = 1, activation='relu')(layer)
        cnn2 = MaxPooling1D(pool_size=2)(cnn2)
        cnn3 = Conv1D(32, 5, padding='same', strides = 1, activation='relu')(layer)
        cnn3 = MaxPooling1D(pool_size=2)(cnn3)
        # 合并三个模型的输出向量
        cnn = concatenate([cnn1,cnn2,cnn3], axis=-1)
        flat = Flatten()(cnn) 
        drop = Dropout(0.2)(flat)
        main_output = Dense(num_labels, activation='softmax')(drop)
        model = Model(inputs=inputs, outputs=main_output)
        
    elif mode=='Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(64)(attention_mul) #原始的全连接
        fla=Flatten()(mlp)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)  
    elif mode=='Attention*3':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(32,activation='relu')(attention_mul) 
        attention_probs = Dense(32, activation='softmax', name='attention_vec1')(mlp)
        attention_mul =  Multiply()([mlp, attention_probs])
        mlp2 = Dense(32,activation='relu')(attention_mul) 
        attention_probs = Dense(32, activation='softmax', name='attention_vec2')(mlp2)
        attention_mul =  Multiply()([mlp2, attention_probs])
        mlp3 = Dense(32,activation='relu')(attention_mul)           
        fla=Flatten()(mlp3)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)      
        
    elif mode=='BiLSTM+Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(layer)  #参数保持维度3
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(bilstm)
        layer = Dense(256, activation='relu')(bilstm)
        layer = Dropout(0.2)(layer)
        ## 注意力机制 
        attention = Attention(step_dim=max_words)(layer)
        layer = Dense(128, activation='relu')(attention)
        output = Dense(num_labels, activation='softmax')(layer)
        model = Model(inputs=inputs, outputs=output)  
        
    elif mode=='BiGRU+Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(64,activation='relu')(attention_mul) #原始的全连接
        #bat=BatchNormalization()(mlp)
        #act=Activation('relu')
        gru=Bidirectional(GRU(32))(mlp)
        mlp = Dense(16,activation='relu')(gru)
        output = Dense(num_labels, activation='softmax')(mlp)
        model = Model(inputs=[inputs], outputs=output) 
        
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

 前几个简单的单一模型使用的是搭积木一样最简单的定义方式。后面复杂一点的模型都是使用的Functional API实现的。

 下面再定义损失和精度的图,和混淆矩阵指标等等评价体系的函数

#定义损失和精度的图,和混淆矩阵指标等等
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
def plot_loss(history):
    # 显示训练和验证损失图表
    plt.subplots(1,2,figsize=(10,3))
    plt.subplot(121)
    loss = history.history["loss"]
    epochs = range(1, len(loss)+1)
    val_loss = history.history["val_loss"]
    plt.plot(epochs, loss, "bo", label="Training Loss")
    plt.plot(epochs, val_loss, "r", label="Validation Loss")
    plt.title("Training and Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()  
    plt.subplot(122)
    acc = history.history["accuracy"]
    val_acc = history.history["val_accuracy"]
    plt.plot(epochs, acc, "b-", label="Training Acc")
    plt.plot(epochs, val_acc, "r--", label="Validation Acc")
    plt.title("Training and Validation Accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.show()
def plot_confusion_matrix(model,X_test,Y_test_original):
    #预测概率
    prob=model.predict(X_test) 
    #预测类别
    pred=np.argmax(prob,axis=1)
    #数据透视表,混淆矩阵
    table = pd.crosstab(Y_test_original, pred, rownames=['Actual'], colnames=['Predicted'])
    #print(table)
    sns.heatmap(table,cmap='Blues',fmt='.20g', annot=True)
    plt.tight_layout()
    plt.show()
    #计算混淆矩阵的各项指标
    print(classification_report(Y_test_original, pred))
    #科恩Kappa指标
    print('科恩Kappa'+str(cohen_kappa_score(Y_test_original, pred)))

 定义训练函数

#定义训练函数
def train_fuc(max_words=max_words,mode='BiLSTM+Attention',batch_size=32,epochs=10,hidden_dim=[32],show_loss=True,show_confusion_matrix=True):
    #构建模型
    model=build_model(max_words=max_words,mode=mode)
    print(model.summary())
    history=model.fit(X_train, Y_train,batch_size=batch_size,epochs=epochs,validation_split=0.2, verbose=1)
    print('————————————训练完毕————————————')
    # 评估模型
    loss, accuracy = model.evaluate(X_test, Y_test)
    print("测试数据集的准确度 = {:.4f}".format(accuracy))
    
    if show_loss:
        plot_loss(history)
    if show_confusion_matrix:
        plot_confusion_matrix(model=model,X_test=X_test,Y_test_original=Y_test_original)

设定一些参数

top_words=6000
max_words=20
batch_size=32
epochs=4
show_confusion_matrix=True
show_loss=True
mode='MLP'  

训练轮数为4,比较少,因为这个数据集少,而且太简单了,每个句子很短,所以前面单一模型很容易过拟合,就只训练个4轮,也能节约时间。


下面开始一个个模型去训练并且评价:

MLP

train_fuc(mode='MLP',batch_size=batch_size,epochs=epochs)

 如图,给出了训练每一轮的损失精度,和验证集的损失精度。并且画图,然后测试集的精度,画出的混淆矩阵,计算了混淆矩阵的一些指标,还有科恩系数。MLP测试集精度为0.8795


1DCNN

#下面模型都是接受三维数据输入,先把X变个形状
X_train= X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test= X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
train_fuc(mode='CNN',batch_size=batch_size,epochs=epochs)

也差不多,精度为0.8882


RNN

model='RNN' 
train_fuc(mode=model,epochs=epochs)

结果类似,不展示那么多了,测试集精度为0.8912


LSTM

train_fuc(mode='LSTM',epochs=epochs)

结果类似,不展示了,测试集精度为0.8966  (目前来看最高)


GRU

train_fuc(mode='GRU',epochs=epochs)

测试集精度为0.8912


CNN+LSTM

train_fuc(mode='CNN+LSTM',epochs=epochs)

测试集精度为0.8916


BiLSTM 

train_fuc(mode='BiLSTM',epochs=epochs)
测试数据集的准确度 0.8816

TextCNN

train_fuc(mode='TextCNN',epochs=30)

这里加大了训练轮数,因为下面的模型都开始比较复杂,不容易过拟合,而且需要更多的训练轮数

测试集精度为0.8474


 Attention

train_fuc(mode='Attention',epochs=100)

测试集精度为0.8207

BiLSTM+Attention

train_fuc(mode='BiLSTM+Attention',epochs=30)

测试集精度0.8236


BiGRU+Attention

train_fuc(mode='BiGRU+Attention',epochs=100)

测试集精度0.8607


  Attention*3

train_fuc(mode='Attention*3',epochs=50)

测试集精度0.8057


很明显,加了注意力机制的模型训练更加不容易过拟合。单一的循环网络才四轮就会过拟合,而注意力机制同时需要的训练轮数也更多,可以看到验证集精度一直在上升,损失一直在下降。

虽然最后整体的测试集的准确率不如前面的单一网络,但我猜测这应该是训练轮数不够和数据量过小的原因。
这个外卖的数据集实在是太短了,比较简单,而且样本量也不大。

而且和Transform比起来,这里的注意力机制没有采用残差连接,批量归一化等技巧,没有使用编码解码器,也没有堆叠很多层(Transform有18个注意力层)

以后可以在更复杂,更多的数据集上进行测试和训练注意力机制,把网络做大做深一点,多调参尝试,当然前提是需要有更多的计算资源(买台好电脑)…..
 

物联沃分享整理
物联沃-IOTWORD物联网 » Python深度学习12——Keras实现self-attention中文文本情感分类(详解)

发表评论