AI Is Nothing Special: Build Your Own Chatbot (ChatRobot) with the Python 3 Deep Learning Libraries Keras/TensorFlow

Published: December 29, 2020

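The idea of a chatbot (ChatRobot) is familiar to all of us. Maybe you have flirted with Siri out of sheer boredom, or bantered with XiaoAI (小爱同学) in your spare time. Either way, we have to admit that artificial intelligence is now part of everyday life. Plenty of bots already offer third-party APIs: Microsoft XiaoIce, Turing Robot, Tencent chat, Qingyunke, and so on. Whenever we like, we can plug one into an app or a web application. But how do these services actually work underneath? And without a network connection, could we do what the TV series Westworld depicts, where a robot needs only a locally stored "mind sphere" to talk with humans? If you want to be more than a "package caller" who only glues libraries together, join us on this journey: we will use the deep learning libraries Keras/TensorFlow to build a local chatbot of our own that depends on no third-party API and no network.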

First, install the dependencies (pandas is included because the script imports it later):

pip3 install tensorflow
pip3 install keras
pip3 install nltk
pip3 install pandas

Then create the script test_bot.py and import the required libraries:

import nltk
import ssl
from nltk.stem.lancaster import LancasterStemmer

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import pandas as pd
import pickle
import random

# stemmer used to reduce each word to its root
stemmer = LancasterStemmer()

There is a pitfall here: the natural language toolkit NLTK will raise an error:

Resource punkt not found

Normally, adding one line that calls the downloader is enough:

import nltk
nltk.download('punkt')

However, because of network restrictions the Python downloader often fails, so we take a detour and download the archive manually:

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip

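After unzipping, place the folder under nltk_data\tokenizers in your user directory, which is one of the locations NLTK searches:

C:\Users\liuyue\nltk_data\tokenizers\punkt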
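To confirm that the archive ended up where NLTK looks for it, a quick check like the one below may help. It is only a sketch: it assumes the default per-user data directory, `nltk_data` under the home folder, and the helper name `expected_punkt_dir` is made up for illustration.

```python
from pathlib import Path


def expected_punkt_dir(home=None):
    """Return the folder where the unzipped punkt data is assumed to live."""
    base = Path(home) if home is not None else Path.home()
    return base / "nltk_data" / "tokenizers" / "punkt"


if __name__ == "__main__":
    target = expected_punkt_dir()
    status = "found" if target.is_dir() else "missing"
    print(f"{target}: {status}")
```

If the script reports "missing", the folder was probably unzipped one level too high or too low.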

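OK, back to the main topic. The biggest challenge in building a chatbot is classifying the user's input and recognizing the person's actual intent (classical machine learning could handle this, but it is too much work, so I took the lazy route and used deep learning with Keras). The second challenge is keeping context, that is, analyzing and tracking what has already been said. In that situation we usually do not need to classify the user's intent at all; we simply treat the user's input as the answer to the question the chatbot just asked. For the first challenge, we use the Keras deep learning library to build a classification model.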

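The chatbot's intents and the patterns it needs to learn are defined in one simple variable. There is no need for a corpus that runs to terabytes. We know that anyone who plays with chatbots without a corpus gets laughed at, but our goal is only to build one specific chatbot for one specific context. The classification model is therefore built on a small vocabulary, and it will only recognize the small set of patterns provided for training.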

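Put plainly, machine learning means repeatedly teaching a machine to do one or a few things correctly. During training you keep demonstrating what is right, and you hope the machine learns to generalize from it. This time we are not teaching it many things, just one, to test how it responds. Isn't that a bit like training your dog at home? Except that the dog cannot chat with you.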

Here is a simple example of the intent data variable. If you like, you can extend it without limit from a corpus:

intents = {"intents": [
        {"tag": "打招呼",
         "patterns": ["你好", "您好", "请问", "有人吗", "师傅", "不好意思", "美女", "帅哥", "靓妹", "hi"],
         "responses": ["您好", "又是您啊", "吃了么您内", "您有事吗"],
         "context": [""]
         },
        {"tag": "告别",
         "patterns": ["再见", "拜拜", "88", "回见", "回头见"],
         "responses": ["再见", "一路顺风", "下次见", "拜拜了您内"],
         "context": [""]
         },
    ]
}

As you can see, I added two context tags, 打招呼 (greeting) and 告别 (farewell), each with the user input patterns and the bot's responses.
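Extending the variable is just a matter of appending more dictionaries of the same shape. The following is a minimal sketch, assuming the structure shown above; the helper `add_intent` and the sample 感谢 (thanks) tag are illustrative and not part of the original script.

```python
def add_intent(intents, tag, patterns, responses):
    """Append a new intent, or extend an existing one that has the same tag."""
    for intent in intents["intents"]:
        if intent["tag"] == tag:
            # keep the existing order and skip duplicates
            for p in patterns:
                if p not in intent["patterns"]:
                    intent["patterns"].append(p)
            for r in responses:
                if r not in intent["responses"]:
                    intent["responses"].append(r)
            return intents
    intents["intents"].append(
        {"tag": tag, "patterns": list(patterns),
         "responses": list(responses), "context": [""]}
    )
    return intents


# small stand-in for the intents variable
intents = {"intents": [
    {"tag": "告别", "patterns": ["再见"], "responses": ["再见"], "context": [""]},
]}
add_intent(intents, "感谢", ["谢谢", "多谢"], ["不客气"])
add_intent(intents, "告别", ["拜拜", "再见"], ["下次见"])
print([i["tag"] for i in intents["intents"]])
```

Any new tag only takes effect after the vocabulary is rebuilt and the model is retrained.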

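Before training the classification model, we first need to build the vocabulary from the patterns. Each word is stemmed to a common root, which helps match more variations of user input.

# containers used below; the original listing does not show them,
# so these initial values are assumptions
words = []
classes = []
documents = []
ignore_words = ['?']

for intent in intents['intents']:
    for pattern in intent['patterns']:
        # tokenize each word in the sentence
        w = nltk.word_tokenize(pattern)
        # add to our words list
        words.extend(w)
        # add to documents in our corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# stem and lower-case every word, then remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))

classes = sorted(list(set(classes)))

print(len(classes), "语境", classes)
print(len(words), "词数", words)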

Output (the listing shows 14 words and omits "hi", so it appears to come from a run without that pattern):

2 语境 ['告别', '打招呼']
14 词数 ['88', '不好意思', '你好', '再见', '回头见', '回见', '帅哥', '师傅', '您好', '拜拜', '有人吗', '美女', '请问', '靓妹']

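Training does not operate on the words themselves, because words mean nothing to a machine. This is also a trap that many Chinese word-segmentation libraries fall into: the machine does not actually understand whether your input is English or Chinese. We only need to convert each English word or Chinese phrase into a bag of words, an array of 0/1 values. The array length equals the vocabulary size, and a position is set to 1 when the word at that position occurs in the current pattern.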

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)

# training set, bag of words for each sentence
for doc in documents:
    # stem the tokenized words of this pattern
    pattern_words = [stemmer.stem(word.lower()) for word in doc[0]]

    # 1 if the vocabulary word occurs in the pattern, else 0
    bag = [1 if w in pattern_words else 0 for w in words]

    # one-hot output: 1 at the index of this pattern's tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])

random.shuffle(training)
# bag and output_row differ in length, so an object array is needed
training = np.array(training, dtype=object)

train_x = list(training[:, 0])
train_y = list(training[:, 1])

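Now we train on the data. The model is built with Keras and has three layers. Although the dataset is small, the output is still a multi-class array, which is what identifies the encoded intent. A softmax activation produces that output: one probability per class, trained against one-hot targets such as [1,0,0,...,0] that encode the intent.

model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# the original call was SGD(lr=0.01, decay=1e-6, ...); newer Keras releases
# expect learning_rate and no longer accept decay
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)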
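For readers who have not met softmax before, the sketch below shows in plain Python what the last layer computes: raw scores go in, probabilities that sum to 1 come out. The input scores are invented for illustration and do not come from the trained model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical raw scores for the two classes
probs = softmax([2.0, 0.0])
print(probs)
print(sum(probs))
```

The class with the larger raw score always receives the larger probability, which is why the highest entry can be read as the predicted intent.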

This runs training for 200 epochs with a batch size of 5. Because my sample is tiny, 100 epochs would also do; that is not the point here.
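As a rough feel for what those two numbers mean, the arithmetic below works out how many weight updates the run performs. It assumes the 14 training samples reported in the training log ("14/14").

```python
import math

samples = 14      # training patterns, as reported in the log
batch_size = 5
epochs = 200

# batches of 5, 5 and 4 samples make up one epoch
batches_per_epoch = math.ceil(samples / batch_size)
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)
```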

Training starts (the log below is cut off after epoch 62):

Epoch 1/200
14/14 [==============================] - 0s 32ms/step - loss: 0.7305 - acc: 0.5000
Epoch 2/200
14/14 [==============================] - 0s 391us/step - loss: 0.7458 - acc: 0.4286
Epoch 3/200
14/14 [==============================] - 0s 390us/step - loss: 0.7086 - acc: 0.3571
Epoch 4/200
14/14 [==============================] - 0s 395us/step - loss: 0.6941 - acc: 0.6429
Epoch 5/200
14/14 [==============================] - 0s 426us/step - loss: 0.6358 - acc: 0.7143
Epoch 6/200
14/14 [==============================] - 0s 356us/step - loss: 0.6287 - acc: 0.5714
Epoch 7/200
14/14 [==============================] - 0s 366us/step - loss: 0.6457 - acc: 0.6429
Epoch 8/200
14/14 [==============================] - 0s 899us/step - loss: 0.6336 - acc: 0.6429
Epoch 9/200
14/14 [==============================] - 0s 464us/step - loss: 0.5815 - acc: 0.6429
Epoch 10/200
14/14 [==============================] - 0s 408us/step - loss: 0.5895 - acc: 0.6429
Epoch 11/200
14/14 [==============================] - 0s 548us/step - loss: 0.6050 - acc: 0.6429
Epoch 12/200
14/14 [==============================] - 0s 468us/step - loss: 0.6254 - acc: 0.6429
Epoch 13/200
14/14 [==============================] - 0s 388us/step - loss: 0.4990 - acc: 0.7857
Epoch 14/200
14/14 [==============================] - 0s 392us/step - loss: 0.5880 - acc: 0.7143
Epoch 15/200
14/14 [==============================] - 0s 370us/step - loss: 0.5118 - acc: 0.8571
Epoch 16/200
14/14 [==============================] - 0s 457us/step - loss: 0.5579 - acc: 0.7143
Epoch 17/200
14/14 [==============================] - 0s 432us/step - loss: 0.4535 - acc: 0.7857
Epoch 18/200
14/14 [==============================] - 0s 357us/step - loss: 0.4367 - acc: 0.7857
Epoch 19/200
14/14 [==============================] - 0s 384us/step - loss: 0.4751 - acc: 0.7857
Epoch 20/200
14/14 [==============================] - 0s 346us/step - loss: 0.4404 - acc: 0.9286
Epoch 21/200
14/14 [==============================] - 0s 500us/step - loss: 0.4325 - acc: 0.8571
Epoch 22/200
14/14 [==============================] - 0s 400us/step - loss: 0.4104 - acc: 0.9286
Epoch 23/200
14/14 [==============================] - 0s 738us/step - loss: 0.4296 - acc: 0.7857
Epoch 24/200
14/14 [==============================] - 0s 387us/step - loss: 0.3706 - acc: 0.9286
Epoch 25/200
14/14 [==============================] - 0s 430us/step - loss: 0.4213 - acc: 0.8571
Epoch 26/200
14/14 [==============================] - 0s 351us/step - loss: 0.2867 - acc: 1.0000
Epoch 27/200
14/14 [==============================] - 0s 3ms/step - loss: 0.2903 - acc: 1.0000
Epoch 28/200
14/14 [==============================] - 0s 366us/step - loss: 0.3010 - acc: 0.9286
Epoch 29/200
14/14 [==============================] - 0s 404us/step - loss: 0.2466 - acc: 0.9286
Epoch 30/200
14/14 [==============================] - 0s 428us/step - loss: 0.3035 - acc: 0.7857
Epoch 31/200
14/14 [==============================] - 0s 407us/step - loss: 0.2075 - acc: 1.0000
Epoch 32/200
14/14 [==============================] - 0s 457us/step - loss: 0.2167 - acc: 0.9286
Epoch 33/200
14/14 [==============================] - 0s 613us/step - loss: 0.1266 - acc: 1.0000
Epoch 34/200
14/14 [==============================] - 0s 534us/step - loss: 0.2906 - acc: 0.9286
Epoch 35/200
14/14 [==============================] - 0s 463us/step - loss: 0.2560 - acc: 0.9286
Epoch 36/200
14/14 [==============================] - 0s 500us/step - loss: 0.1686 - acc: 1.0000
Epoch 37/200
14/14 [==============================] - 0s 387us/step - loss: 0.0922 - acc: 1.0000
Epoch 38/200
14/14 [==============================] - 0s 430us/step - loss: 0.1620 - acc: 1.0000
Epoch 39/200
14/14 [==============================] - 0s 371us/step - loss: 0.1104 - acc: 1.0000
Epoch 40/200
14/14 [==============================] - 0s 488us/step - loss: 0.1330 - acc: 1.0000
Epoch 41/200
14/14 [==============================] - 0s 381us/step - loss: 0.1322 - acc: 1.0000
Epoch 42/200
14/14 [==============================] - 0s 462us/step - loss: 0.0575 - acc: 1.0000
Epoch 43/200
14/14 [==============================] - 0s 1ms/step - loss: 0.1137 - acc: 1.0000
Epoch 44/200
14/14 [==============================] - 0s 450us/step - loss: 0.0245 - acc: 1.0000
Epoch 45/200
14/14 [==============================] - 0s 470us/step - loss: 0.1824 - acc: 1.0000
Epoch 46/200
14/14 [==============================] - 0s 444us/step - loss: 0.0822 - acc: 1.0000
Epoch 47/200
14/14 [==============================] - 0s 436us/step - loss: 0.0939 - acc: 1.0000
Epoch 48/200
14/14 [==============================] - 0s 396us/step - loss: 0.0288 - acc: 1.0000
Epoch 49/200
14/14 [==============================] - 0s 580us/step - loss: 0.1367 - acc: 0.9286
Epoch 50/200
14/14 [==============================] - 0s 351us/step - loss: 0.0363 - acc: 1.0000
Epoch 51/200
14/14 [==============================] - 0s 379us/step - loss: 0.0272 - acc: 1.0000
Epoch 52/200
14/14 [==============================] - 0s 358us/step - loss: 0.0712 - acc: 1.0000
Epoch 53/200
14/14 [==============================] - 0s 4ms/step - loss: 0.0426 - acc: 1.0000
Epoch 54/200
14/14 [==============================] - 0s 370us/step - loss: 0.0430 - acc: 1.0000
Epoch 55/200
14/14 [==============================] - 0s 368us/step - loss: 0.0292 - acc: 1.0000
Epoch 56/200
14/14 [==============================] - 0s 494us/step - loss: 0.0777 - acc: 1.0000
Epoch 57/200
14/14 [==============================] - 0s 356us/step - loss: 0.0496 - acc: 1.0000
Epoch 58/200
14/14 [==============================] - 0s 427us/step - loss: 0.1485 - acc: 1.0000
Epoch 59/200
14/14 [==============================] - 0s 381us/step - loss: 0.1006 - acc: 1.0000
Epoch 60/200
14/14 [==============================] - 0s 421us/step - loss: 0.0183 - acc: 1.0000
Epoch 61/200
14/14 [==============================] - 0s 344us/step - loss: 0.0788 - acc: 0.9286
Epoch 62/200
14/14 [==============================] - 0s 529us/step - loss: 0.0176 - acc: 1.0000

OK, after 200 epochs the model is trained. Now we declare two functions that convert a sentence into a bag of words:

def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words


def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - one slot per vocabulary word
    bag = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print("found in bag: %s" % w)
    return np.array(bag)

Test it to see whether an input hits the bag:

p = bow("你好", words)
print(p)

Return value:

found in bag: 你好
[0 0 1 0 0 0 0 0 0 0 0 0 0 0]

Clearly the match succeeded: the word is in the bag.
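One consequence worth keeping in mind: an input that is not in the vocabulary produces an all-zero vector, so the classifier has nothing to go on. The sketch below reproduces the lookup in plain Python, without NLTK or stemming, and treats the whole input as a single token, which is what the vocabulary listing suggests happens for short Chinese phrases. The name `simple_bow` and the shortened vocabulary are illustrative.

```python
def simple_bow(tokens, vocabulary):
    """Mark each vocabulary position with 1 if that word occurs in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# shortened vocabulary for the demo
vocabulary = ['88', '不好意思', '你好', '再见']

print(simple_bow(['你好'], vocabulary))     # known word
print(simple_bow(['早上好'], vocabulary))   # unknown word
```

Adding more patterns per tag is the simplest way to reduce how often this happens.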

Before packaging the model, we can use model.predict to classify user input and return the intent according to the computed probabilities (several intents may be returned, sorted by probability in descending order):

def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    # pass a plain NumPy array so predict receives a (1, vocabulary size) batch
    results = model.predict(input_data.to_numpy())[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i, r] for i, r in enumerate(results) if r > ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))
    # return list of (intent, probability) tuples
    return return_list
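The filtering and sorting step can be tried in isolation, without a trained model. In this sketch the probabilities are invented and `rank_intents` is an illustrative name; only the 0.25 threshold and the class list come from the article.

```python
ERROR_THRESHOLD = 0.25


def rank_intents(probabilities, classes, threshold=ERROR_THRESHOLD):
    """Keep classes whose probability exceeds the threshold, best first."""
    kept = [(classes[i], p) for i, p in enumerate(probabilities) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


classes = ['告别', '打招呼']
print(rank_intents([0.10, 0.90], classes))   # one confident intent
print(rank_intents([0.45, 0.55], classes))   # two candidates, best first
```

With only two classes at least one probability is always 0.5 or higher, so the list is never empty. With four or more classes every probability can fall at or below 0.25, and callers should be ready for an empty result.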

Test it:

print(classify_local('您好'))

Return value:

found in bag: 您好
[('打招呼', '0.999913')]

Test again:

print(classify_local('88'))

Return value:

found in bag: 88
[('告别', '0.9995449')]

Perfect: both inputs match the expected context tags, 打招呼 and 告别. If you like, run a few more tests and refine the model.

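Once testing is done, we can package the trained model so that we do not have to train it before every call:

# save the architecture as JSON
json_file = model.to_json()
with open('v3ucn.json', "w") as file:
    file.write(json_file)

# save the trained weights
model.save_weights('./v3ucn.h5f')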

The model is stored as two files: the architecture (json) and the weights (h5f). Keep both; we will need them shortly.
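The saved model alone is not enough to answer requests later: the vocabulary and class list built during training are also needed to turn text into a bag of words and to map an output index back to a tag. The script imports pickle but the listings never use it, so the following is only one possible way to persist them. The file name `v3ucn_vocab.pkl` and both helper names are assumptions.

```python
import os
import pickle
import tempfile


def save_vocab(path, words, classes):
    """Write the vocabulary and class list to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump({"words": words, "classes": classes}, f)


def load_vocab(path):
    """Read the vocabulary and class list back."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["words"], data["classes"]


# demo in a temporary directory with a shortened vocabulary
demo_path = os.path.join(tempfile.mkdtemp(), "v3ucn_vocab.pkl")
save_vocab(demo_path, ['88', '你好'], ['告别', '打招呼'])
loaded_words, loaded_classes = load_vocab(demo_path)
print(loaded_words, loaded_classes)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.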

Next, let's build an API for the chatbot. We use FastAPI, a very popular framework at the moment. After placing the model files in the project directory, write main.py:

import random

import pandas as pd
import uvicorn
from fastapi import FastAPI
from keras.models import model_from_json

# Assumption: bow, words, classes and intents are importable from the training
# script. Importing test_bot re-runs its top-level code (training included),
# so in practice you may prefer to persist them, e.g. with pickle.
from test_bot import bow, words, classes, intents

app = FastAPI()

# load the architecture (json) and the weights once at startup
with open("./v3ucn.json", "r") as file:
    model = model_from_json(file.read())
model.load_weights("./v3ucn.h5f")


def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    results = model.predict(input_data.to_numpy())[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i, r] for i, r in enumerate(results) if r > ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))
    # return list of (intent, probability) tuples
    return return_list


@app.get('/')
async def root(word: str = None):
    a = ""
    # guard against a missing query parameter or an empty result list
    wordlist = classify_local(word) if word else []
    if wordlist:
        for intent in intents['intents']:
            if intent['tag'] == wordlist[0][0]:
                a = random.choice(intent['responses'])
    return {'message': a}


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
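The reply-selection logic in the endpoint can also be exercised on its own. This is a sketch under assumptions: `pick_response` and the fallback text are hypothetical, and the ranked list mimics the (tag, probability) tuples that `classify_local` returns.

```python
import random


def pick_response(intents, ranked, default="..."):
    """Return a random response for the top-ranked tag, or the default."""
    if not ranked:
        return default
    top_tag = ranked[0][0]
    for intent in intents["intents"]:
        if intent["tag"] == top_tag:
            return random.choice(intent["responses"])
    return default


# small stand-in for the intents variable
intents = {"intents": [
    {"tag": "告别", "patterns": ["再见"], "responses": ["再见", "下次见"], "context": [""]},
]}
print(pick_response(intents, [("告别", "0.99")]))
print(pick_response(intents, [], default="抱歉，我没听懂"))
```

Returning a fixed fallback keeps the API from raising an error when nothing passes the threshold.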

This part of the code:

from keras.models import model_from_json

with open("./v3ucn.json", "r") as file:
    model = model_from_json(file.read())
model.load_weights("./v3ucn.h5f")

loads the model we just trained. Then start the service:

uvicorn main:app --reload

The result looks like this (the screenshot is not included in this copy):
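To try the endpoint yourself, the request URL could be built as in the sketch below. It assumes uvicorn's default address 127.0.0.1:8000 and the `word` query parameter declared in main.py; `build_url` is an illustrative helper. Chinese text has to be percent-encoded, which `urlencode` takes care of.

```python
from urllib.parse import urlencode


def build_url(word, host="127.0.0.1", port=8000):
    """Build the GET URL for the root endpoint with the word query parameter."""
    query = urlencode({"word": word})
    return f"http://{host}:{port}/?{query}"


print(build_url("你好"))
```

Opening the printed URL in a browser while the service is running should return a JSON object with a `message` field.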

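Conclusion: without a doubt, technology changes life. A chatbot lets us hear a gentle voice even when we have no companion by our side, and I believe that before long a smiling, elegant "Ex Machina" (机械姬) will keep us company under the cool breeze and the bright moon.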

Reposted from 刘悦的技术博客 (Liu Yue's tech blog): https://v3u.cn/aid178

Original link: http://xie.infoq.cn/article/6b33be58f924cf4e2b4e99a16. Please contact the author before reposting.
