
AI Is Nothing Special: Build Your Own Chatbot (ChatRobot) with the Python 3 Deep Learning Libraries Keras/TensorFlow

Published: December 29, 2020

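The idea of a chatbot (ChatRobot) is familiar to all of us. Perhaps you have flirted with Siri out of boredom, or chatted with XiaoAi (小爱同学) in your spare time; either way, we have to admit that artificial intelligence is already part of daily life. Countless bots on the market expose third-party APIs: Microsoft XiaoIce, Turing Robot, Tencent Chat, Qingyunke and so on, and we can plug them into an app or a web application whenever we like. But how are these services implemented underneath? And without network access, could a robot talk to humans using nothing but a locally stored "mind sphere", as depicted in the TV series Westworld? If you are not content with being a mere "library caller", join us: this time we use the deep learning libraries Keras/TensorFlow to build our own local chatbot, with no dependence on third-party interfaces or the network.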


First, install the dependencies (pandas is included because the script below imports it):


pip3 install tensorflow
pip3 install keras
pip3 install nltk
pip3 install pandas
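
Before going further, it can help to confirm that the packages are actually importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, not part of any of the libraries above.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level module names imported by the script in this article.
    print(missing_modules(["tensorflow", "keras", "nltk", "numpy", "pandas"]))
```

An empty list means every module was found; any name that is printed still needs to be installed.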


Next, create the script test_bot.py and import the required libraries:


import nltk
import ssl
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import pandas as pd
import pickle
import random


There is a pitfall here: the natural language library NLTK raises an error:


Resource punkt not found


Normally, adding one line of downloader code is enough:


import nltk
nltk.download('punkt')


However, because of network access restrictions, the Python downloader often cannot fetch the file, so we take a detour and download the archive by hand:


https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip


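After extracting it, place it in the nltk_data\tokenizers folder under your user directory, which is one of the locations NLTK searches:


C:\Users\liuyue\nltk_data\tokenizers\punkt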
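
The manual extraction step can also be scripted. Here is a minimal sketch using only the standard library, assuming punkt.zip has already been downloaded and contains a top-level punkt folder. `install_punkt` and its parameters are hypothetical names chosen for illustration.

```python
import zipfile
from pathlib import Path


def install_punkt(zip_path, nltk_data_dir=None):
    """Extract a downloaded punkt.zip into <nltk_data>/tokenizers/."""
    # Assumption: default to ~/nltk_data, a location NLTK searches.
    base = Path(nltk_data_dir) if nltk_data_dir else Path.home() / "nltk_data"
    target = base / "tokenizers"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds a top-level "punkt/" folder.
        archive.extractall(target)
    return target / "punkt"
```

Calling `install_punkt("punkt.zip")` would leave the tokenizer data in the tokenizers/punkt folder under nltk_data in the user directory.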


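OK, back to the topic. The main challenge in building a chatbot is classifying the user's input and recognizing the correct human intent (this could be solved with classical machine learning, but that is more involved, so I took the lazy route and used deep learning with Keras). The second challenge is keeping context, that is, analyzing and tracking the conversation. In that situation we usually do not need to classify the user's intent at all: the input can simply be treated as the answer to a question the chatbot asked. This article deals with the first challenge, so we use the Keras deep learning library to build a classification model.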


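The chatbot's intents and the patterns it has to learn are defined in one simple variable; there is no need for a corpus running to terabytes. People who build bots without a corpus tend to get laughed at, but our goal is only a specific chatbot for one specific context. The classification model is therefore built on a small vocabulary and can recognize only the small set of patterns supplied for training.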


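Put plainly, machine learning means repeatedly teaching a machine to do one or a few things correctly: during training you keep demonstrating the right behavior and hope the machine learns to generalize. This time we teach it very little, just enough to test how it responds. Is that not a bit like training your dog at home? Except that the dog cannot chat with you.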


Here is a simple example of the intent data variable; if you like, you can extend it without limit using a corpus:


intents = {"intents": [
        {"tag": "打招呼",
         "patterns": ["你好", "您好", "请问", "有人吗", "师傅", "不好意思", "美女", "帅哥", "靓妹", "hi"],
         "responses": ["您好", "又是您啊", "吃了么您内", "您有事吗"],
         "context": [""]
        },
        {"tag": "告别",
         "patterns": ["再见", "拜拜", "88", "回见", "回头见"],
         "responses": ["再见", "一路顺风", "下次见", "拜拜了您内"],
         "context": [""]
        },
    ]
}


As you can see, I defined two context tags, greeting (打招呼) and farewell (告别), each with user input patterns and bot responses.


Before training the classification model, we need to build the vocabulary from the patterns. Each word is stemmed to obtain a common root, which helps match more variations of user input.


words = []
classes = []
documents = []
# not defined in the original listing; '?' is an assumed value
ignore_words = ['?']

for intent in intents['intents']:
    for pattern in intent['patterns']:
        # tokenize each word in the sentence
        w = nltk.word_tokenize(pattern)
        # add to our words list
        words.extend(w)
        # add to documents in our corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# stem and lowercase each word, then remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))

classes = sorted(list(set(classes)))

print (len(classes), "语境", classes)

print (len(words), "词数", words)


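Output, where 语境 labels the context classes and 词数 the vocabulary size (this recorded run lists 14 words and does not include the "hi" pattern from the listing above, which would add a 15th):


2 语境 ['告别', '打招呼']
14 词数 ['88', '不好意思', '你好', '再见', '回头见', '回见', '帅哥', '师傅', '您好', '拜拜', '有人吗', '美女', '请问', '靓妹']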
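
As the output shows, every Chinese pattern ends up as a single vocabulary entry, so matching works on whole phrases. To make the vocabulary step easier to follow without NLTK, here is a rough pure-Python sketch. It assumes whitespace splitting as a stand-in for `nltk.word_tokenize` and leaves out stemming; `build_vocabulary` and `sample` are hypothetical names.

```python
def build_vocabulary(intents, ignore_words=("?",)):
    """Collect the sorted vocabulary, sorted classes and (tokens, tag) documents."""
    words, classes, documents = [], [], []
    for intent in intents["intents"]:
        for pattern in intent["patterns"]:
            # Stand-in tokenizer (assumption): lowercase, then split on whitespace.
            tokens = pattern.lower().split()
            words.extend(t for t in tokens if t not in ignore_words)
            documents.append((tokens, intent["tag"]))
        if intent["tag"] not in classes:
            classes.append(intent["tag"])
    return sorted(set(words)), sorted(classes), documents


sample = {"intents": [
    {"tag": "打招呼", "patterns": ["你好", "您好"]},
    {"tag": "告别", "patterns": ["再见", "拜拜"]},
]}
words, classes, documents = build_vocabulary(sample)
print(words)    # ['你好', '再见', '您好', '拜拜']
print(classes)  # ['告别', '打招呼']
```

Sorting is by Unicode code point, which is why 告别 comes before 打招呼, just as in the recorded output.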


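Training does not operate on the words themselves, because words mean nothing to a machine; this is a trap that many Chinese word segmentation libraries fall into. The machine does not know whether the input is English or Chinese. All we need to do is convert the words into a bag of words, an array of 0s and 1s. The array length equals the vocabulary size, and a position is set to 1 when the vocabulary word at that position occurs in the current pattern.


# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # tokenized words of the pattern, stemmed like the vocabulary
    pattern_words = doc[0]
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    # 1 if the vocabulary word occurs in the pattern, else 0
    for w in words:
        bag.append(1 if w in pattern_words else 0)

    # one-hot output: 1 for the current tag, 0 for every other tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])

random.shuffle(training)
# dtype=object because each row pairs two lists of different lengths
training = np.array(training, dtype=object)

train_x = list(training[:,0])
train_y = list(training[:,1])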
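
To see the encoding in isolation, here is a small pure-Python sketch of the two conversions used above: a bag of words for the input and a one-hot row for the label. The function names and the four-word vocabulary are made up for illustration.

```python
def bag_of_words(tokens, vocabulary):
    """Return a 0/1 list marking which vocabulary entries occur in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def one_hot(label, classes):
    """Return a 0/1 list with a single 1 at the position of label."""
    return [1 if c == label else 0 for c in classes]


vocabulary = ["88", "你好", "再见", "您好"]
classes = ["告别", "打招呼"]
print(bag_of_words(["你好"], vocabulary))  # [0, 1, 0, 0]
print(one_hot("打招呼", classes))           # [0, 1]
```

An input made only of unknown words gives an all-zero bag, which is worth keeping in mind when testing the bot with phrases it was never shown.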


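Now we train on the data. The model is built with Keras and has three layers. Because the data set is small, the classification output is a multi-class array, which helps identify the encoded intent. A softmax activation produces this output: one probability per class, trained against one-hot targets such as [1,0,0,...,0] that encode the intent.


model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# SGD with Nesterov momentum; newer Keras releases name the first argument learning_rate instead of lr
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)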
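
For readers who want to see what the softmax output layer and the categorical_crossentropy loss compute, here is a minimal pure-Python sketch. It illustrates the formulas only and is not how Keras implements them internally; the function names are my own.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t)


probs = softmax([0.0, 0.0])  # a model with no preference between two classes
print(probs)                 # [0.5, 0.5]
print(categorical_crossentropy([1, 0], probs))
```

With two classes and no preference the loss is ln 2, roughly 0.693, which is close to the loss values in the first epochs of the training log in this article.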


Training runs for 200 epochs with a batch size of 5. Because my sample data is small, 100 epochs would also do; that is not the point here.
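
To make the relationship between samples, batch size and epochs concrete, here is a tiny sketch. It assumes 14 training patterns, the sample count that appears as 14/14 in the training log; `batches` is a hypothetical helper.

```python
def batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


samples = list(range(14))  # stand-ins for 14 training patterns
sizes = [len(b) for b in batches(samples, 5)]
print(sizes)               # [5, 5, 4]
print(len(sizes) * 200)    # 600 weight updates over 200 epochs
```

So each epoch performs three weight updates, and the last batch is smaller than the others.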


Start training:


Epoch 1/200
14/14 [==============================] - 0s 32ms/step - loss: 0.7305 - acc: 0.5000
Epoch 2/200
14/14 [==============================] - 0s 391us/step - loss: 0.7458 - acc: 0.4286
Epoch 3/200
14/14 [==============================] - 0s 390us/step - loss: 0.7086 - acc: 0.3571
Epoch 4/200
14/14 [==============================] - 0s 395us/step - loss: 0.6941 - acc: 0.6429
Epoch 5/200
14/14 [==============================] - 0s 426us/step - loss: 0.6358 - acc: 0.7143
Epoch 6/200
14/14 [==============================] - 0s 356us/step - loss: 0.6287 - acc: 0.5714
Epoch 7/200
14/14 [==============================] - 0s 366us/step - loss: 0.6457 - acc: 0.6429
Epoch 8/200
14/14 [==============================] - 0s 899us/step - loss: 0.6336 - acc: 0.6429
Epoch 9/200
14/14 [==============================] - 0s 464us/step - loss: 0.5815 - acc: 0.6429
Epoch 10/200
14/14 [==============================] - 0s 408us/step - loss: 0.5895 - acc: 0.6429
Epoch 11/200
14/14 [==============================] - 0s 548us/step - loss: 0.6050 - acc: 0.6429
Epoch 12/200
14/14 [==============================] - 0s 468us/step - loss: 0.6254 - acc: 0.6429
Epoch 13/200
14/14 [==============================] - 0s 388us/step - loss: 0.4990 - acc: 0.7857
Epoch 14/200
14/14 [==============================] - 0s 392us/step - loss: 0.5880 - acc: 0.7143
Epoch 15/200
14/14 [==============================] - 0s 370us/step - loss: 0.5118 - acc: 0.8571
Epoch 16/200
14/14 [==============================] - 0s 457us/step - loss: 0.5579 - acc: 0.7143
Epoch 17/200
14/14 [==============================] - 0s 432us/step - loss: 0.4535 - acc: 0.7857
Epoch 18/200
14/14 [==============================] - 0s 357us/step - loss: 0.4367 - acc: 0.7857
Epoch 19/200
14/14 [==============================] - 0s 384us/step - loss: 0.4751 - acc: 0.7857
Epoch 20/200
14/14 [==============================] - 0s 346us/step - loss: 0.4404 - acc: 0.9286
Epoch 21/200
14/14 [==============================] - 0s 500us/step - loss: 0.4325 - acc: 0.8571
Epoch 22/200
14/14 [==============================] - 0s 400us/step - loss: 0.4104 - acc: 0.9286
Epoch 23/200
14/14 [==============================] - 0s 738us/step - loss: 0.4296 - acc: 0.7857
Epoch 24/200
14/14 [==============================] - 0s 387us/step - loss: 0.3706 - acc: 0.9286
Epoch 25/200
14/14 [==============================] - 0s 430us/step - loss: 0.4213 - acc: 0.8571
Epoch 26/200
14/14 [==============================] - 0s 351us/step - loss: 0.2867 - acc: 1.0000
Epoch 27/200
14/14 [==============================] - 0s 3ms/step - loss: 0.2903 - acc: 1.0000
Epoch 28/200
14/14 [==============================] - 0s 366us/step - loss: 0.3010 - acc: 0.9286
Epoch 29/200
14/14 [==============================] - 0s 404us/step - loss: 0.2466 - acc: 0.9286
Epoch 30/200
14/14 [==============================] - 0s 428us/step - loss: 0.3035 - acc: 0.7857
Epoch 31/200
14/14 [==============================] - 0s 407us/step - loss: 0.2075 - acc: 1.0000
Epoch 32/200
14/14 [==============================] - 0s 457us/step - loss: 0.2167 - acc: 0.9286
Epoch 33/200
14/14 [==============================] - 0s 613us/step - loss: 0.1266 - acc: 1.0000
Epoch 34/200
14/14 [==============================] - 0s 534us/step - loss: 0.2906 - acc: 0.9286
Epoch 35/200
14/14 [==============================] - 0s 463us/step - loss: 0.2560 - acc: 0.9286
Epoch 36/200
14/14 [==============================] - 0s 500us/step - loss: 0.1686 - acc: 1.0000
Epoch 37/200
14/14 [==============================] - 0s 387us/step - loss: 0.0922 - acc: 1.0000
Epoch 38/200
14/14 [==============================] - 0s 430us/step - loss: 0.1620 - acc: 1.0000
Epoch 39/200
14/14 [==============================] - 0s 371us/step - loss: 0.1104 - acc: 1.0000
Epoch 40/200
14/14 [==============================] - 0s 488us/step - loss: 0.1330 - acc: 1.0000
Epoch 41/200
14/14 [==============================] - 0s 381us/step - loss: 0.1322 - acc: 1.0000
Epoch 42/200
14/14 [==============================] - 0s 462us/step - loss: 0.0575 - acc: 1.0000
Epoch 43/200
14/14 [==============================] - 0s 1ms/step - loss: 0.1137 - acc: 1.0000
Epoch 44/200
14/14 [==============================] - 0s 450us/step - loss: 0.0245 - acc: 1.0000
Epoch 45/200
14/14 [==============================] - 0s 470us/step - loss: 0.1824 - acc: 1.0000
Epoch 46/200
14/14 [==============================] - 0s 444us/step - loss: 0.0822 - acc: 1.0000
Epoch 47/200
14/14 [==============================] - 0s 436us/step - loss: 0.0939 - acc: 1.0000
Epoch 48/200
14/14 [==============================] - 0s 396us/step - loss: 0.0288 - acc: 1.0000
Epoch 49/200
14/14 [==============================] - 0s 580us/step - loss: 0.1367 - acc: 0.9286
Epoch 50/200
14/14 [==============================] - 0s 351us/step - loss: 0.0363 - acc: 1.0000
Epoch 51/200
14/14 [==============================] - 0s 379us/step - loss: 0.0272 - acc: 1.0000
Epoch 52/200
14/14 [==============================] - 0s 358us/step - loss: 0.0712 - acc: 1.0000
Epoch 53/200
14/14 [==============================] - 0s 4ms/step - loss: 0.0426 - acc: 1.0000
Epoch 54/200
14/14 [==============================] - 0s 370us/step - loss: 0.0430 - acc: 1.0000
Epoch 55/200
14/14 [==============================] - 0s 368us/step - loss: 0.0292 - acc: 1.0000
Epoch 56/200
14/14 [==============================] - 0s 494us/step - loss: 0.0777 - acc: 1.0000
Epoch 57/200
14/14 [==============================] - 0s 356us/step - loss: 0.0496 - acc: 1.0000
Epoch 58/200
14/14 [==============================] - 0s 427us/step - loss: 0.1485 - acc: 1.0000
Epoch 59/200
14/14 [==============================] - 0s 381us/step - loss: 0.1006 - acc: 1.0000
Epoch 60/200
14/14 [==============================] - 0s 421us/step - loss: 0.0183 - acc: 1.0000
Epoch 61/200
14/14 [==============================] - 0s 344us/step - loss: 0.0788 - acc: 0.9286
Epoch 62/200
14/14 [==============================] - 0s 529us/step - loss: 0.0176 - acc: 1.0000


OK, after 200 epochs the model is trained. Now declare the functions that convert a sentence into a bag of words:


def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words

def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))


Test it to see whether the input hits the bag of words:


p = bow("你好", words)
print (p)


Return value:


found in bag: 你好
[0 0 1 0 0 0 0 0 0 0 0 0 0 0]


The match clearly succeeded: the word is in the bag.


Before packaging the model, we can use the model.predict function to classify user input and return the user's intent based on the computed probabilities (several intents may be returned, in descending order of probability):


def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    results = model.predict([input_data])[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))

    # return a list of (intent, probability) tuples
    return return_list
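
The filtering and sorting inside classify_local can be tried without a trained model. Below is a small sketch that applies the same threshold logic to a hand-written probability list. `rank_intents` is a hypothetical name, the third class is made up for the example, and unlike classify_local it keeps the probabilities as numbers instead of strings.

```python
def rank_intents(probabilities, classes, threshold=0.25):
    """Keep classes whose probability exceeds threshold, most likely first."""
    kept = [(classes[i], p) for i, p in enumerate(probabilities) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


classes = ["告别", "打招呼", "问路"]  # "问路" is a made-up third intent
print(rank_intents([0.30, 0.65, 0.05], classes))
# [('打招呼', 0.65), ('告别', 0.3)]
```

With only two classes, at least one probability is always 0.5 or higher, so the result is never empty. With many classes it can be, which the calling code should be prepared for.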


Test it:


print(classify_local('您好'))


Return value:


found in bag: 您好
[('打招呼', '0.999913')]


Test again:


print(classify_local('88'))


Return value:


found in bag: 88
[('告别', '0.9995449')]


Perfect: the greeting and farewell tags are both matched correctly. If you like, test a few more inputs and refine the model.


Once testing is done, we can package the trained model so that it does not have to be retrained before every call:


json_file = model.to_json()
with open('v3ucn.json', "w") as file:
    file.write(json_file)

model.save_weights('./v3ucn.h5f')


The model is saved as two files: the model definition (json) and the weights (h5f). Keep both; we will need them shortly.
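
The two model files do not contain the vocabulary or the class list, yet bow() and classify_local() need both at prediction time. One option, sketched here as an assumption rather than as part of the original script, is to store them with pickle (which the script already imports). The file name v3ucn.pkl and both helper names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_metadata(path, words, classes):
    """Store the vocabulary and class list next to the model files."""
    with open(path, "wb") as handle:
        pickle.dump({"words": words, "classes": classes}, handle)


def load_metadata(path):
    """Load the vocabulary and class list saved by save_metadata."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    return data["words"], data["classes"]


# Round trip in a temporary directory; "v3ucn.pkl" is a hypothetical name.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "v3ucn.pkl"
    save_metadata(target, ["88", "你好"], ["告别", "打招呼"])
    print(load_metadata(target))
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.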


Next, we build an API for the chatbot using FastAPI, a very popular framework at the moment. After placing the model files in the project directory, write main.py:


import random
import pandas as pd
import uvicorn
from fastapi import FastAPI
from keras.models import model_from_json

# NOTE: intents, words, classes and bow() are the objects defined in test_bot.py;
# they must also be defined or loaded in this file, which does not repeat them.

app = FastAPI()

# load json and create model once at startup,
# so that classify_local() can see the module-level `model`
file = open("./v3ucn.json", 'r')
model_json = file.read()
file.close()
model = model_from_json(model_json)
model.load_weights("./v3ucn.h5f")


def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    results = model.predict([input_data])[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))

    # return a list of (intent, probability) tuples
    return return_list


@app.get('/')
async def root(word: str = None):
    wordlist = classify_local(word)
    a = ""
    # pick a random response for the most likely intent
    for intent in intents['intents']:
        if intent['tag'] == wordlist[0][0]:
            a = random.choice(intent['responses'])

    return {'message':a}


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
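
The reply step in main.py assumes that the classifier always returns at least one intent. A slightly more defensive variant is sketched below: it returns a fallback sentence when nothing was classified or the tag is unknown. `choose_response`, the fallback text and the trimmed-down `sample` data are all hypothetical.

```python
import random


def choose_response(ranked, intents, fallback="对不起,我没听懂"):
    """Pick a random response for the top-ranked tag, or return fallback."""
    # The default fallback means "Sorry, I did not understand".
    if not ranked:
        return fallback
    top_tag = ranked[0][0]
    for intent in intents["intents"]:
        if intent["tag"] == top_tag:
            return random.choice(intent["responses"])
    return fallback


sample = {"intents": [
    {"tag": "告别", "responses": ["再见", "拜拜"]},
]}
print(choose_response([("告别", "0.99")], sample))
print(choose_response([], sample))
```

The `ranked` argument has the same shape as the list returned by classify_local, so the function could be dropped into the route handler in place of the loop.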


In main.py, the lines:


from keras.models import model_from_json
file = open("./v3ucn.json", 'r')
model_json = file.read()
file.close()
model = model_from_json(model_json)
model.load_weights("./v3ucn.h5f")


load the model we just trained. Then start the service:


uvicorn main:app --reload
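
Once the service is running, it can be queried from Python as well as from a browser. The sketch below uses only the standard library and assumes the default address http://127.0.0.1:8000/ and the `word` query parameter from main.py. `build_url` and `ask` are hypothetical helper names, and `ask` only works while the server is up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(word, base="http://127.0.0.1:8000/"):
    """Return the request URL with word percent-encoded as a query parameter."""
    return base + "?" + urlencode({"word": word})


def ask(word):
    """Send word to the running service and return the reply text."""
    with urlopen(build_url(word)) as response:
        return json.loads(response.read().decode("utf-8"))["message"]


print(build_url("你好"))
# http://127.0.0.1:8000/?word=%E4%BD%A0%E5%A5%BD
```

Chinese input has to be percent-encoded in the URL, which `urlencode` takes care of.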


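The result looks like this: a request such as http://127.0.0.1:8000/?word=你好 returns a JSON object of the form {"message": "..."} containing one of the greeting responses.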



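Conclusion: there is no doubt that technology changes life. A chatbot lets us hear pleasant conversation even when we have no company, and I believe that in the not-too-distant future a smiling, elegantly dressed "Ex Machina" will keep us company under the cool breeze and the bright moon.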


Reposted from 「刘悦的技术博客」 (Liu Yue's Tech Blog): https://v3u.cn/aid178

