
How Machine Learning Powers Epidemic Visualization: COVID-19 Data Analysis and Prediction in Practice

Author: 是Dream呀
  • July 27, 2022


Preface: This article walks you through crawling COVID-19 data for 11 countries and for China's 31 provinces (autonomous regions and municipalities) covering 2022.01.01-2022.06.19. We then use a machine learning model to predict the national confirmed, death, and recovery counts for each day from 2022.6.20 to 2022.6.30, produce epidemic visualizations, and compute the final coefficient of determination R²!


I. Problem Statement

The tasks are as follows:

1. Crawl the COVID-19 data for the 11 countries (China, the USA, Brazil, India, Russia, France, the UK, Turkey, Argentina, Colombia, and Japan) and for China's 31 provinces (autonomous regions and municipalities) over 2022.01.01-2022.06.19. If you are not familiar with web scraping, you can use the data provided in the data folder: confirmedCount, curedCount, and deadCount hold the Chinese provincial data, while world_confirmedCount, world_curedCount, and world_deadCount hold the crawled data for the 11 countries.
2. Based on the crawled or provided data, visualize the confirmed cases, deaths, and recoveries on the most recent date (2022.06.19) along two dimensions, the 11 countries above and China's regions (e.g., as bar or pie charts).
3. Use a machine learning model to predict the national confirmed, death, and recovery counts for each day from 2022.6.20 to 2022.6.30.
4. The true counts for 2022.6.20-2022.6.30 will be published on 2022.7.1; compute the coefficient of determination R² (defined below) against them, and use it as the project's final score.
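
For reference, R² here is the coefficient of determination as computed by sklearn's r2_score: for true values $y_i$ and predictions $\hat{y}_i$,

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad \bar{y} = \frac{1}{n}\sum_i y_i.$$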

II. Model and Algorithm

For the model we chose the LSTM. The LSTM is an excellent variant of the RNN: it inherits most of the RNN's properties and is well suited to the large amount of sequential data in this task. A Long Short-Term Memory (LSTM) network is a special kind of RNN that can learn long-term dependencies. Structurally, an LSTM is not very different from a baseline RNN, but it uses different functions to compute the hidden state. The LSTM's "memory" units are called cells; you can think of a cell as a black box whose inputs are the previous state h and the current input x. The cells decide which earlier information and states to keep and which to erase, and in practice this mechanism turns out to preserve relevant information from far back in the sequence. The repeating module of an LSTM contains four interacting layers, three sigmoid layers and one tanh layer, which interact in a very particular way. Carefully designed structures called "gates" remove information from or add information to the cell state: each gate pairs a sigmoid layer with a pointwise multiplication, where an output of 0 means "let nothing through" and 1 means "let everything through". This lets the network learn which data to forget and which to keep, yielding the data we actually train on; this matters especially for the death-count series. Repeated training on the dataset then produces our final prediction plots and results. The standard gate equations are listed below for reference.
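
The three sigmoid layers and one tanh layer mentioned above correspond to the standard LSTM update equations ($\sigma$ is the sigmoid function, $\odot$ the pointwise product):

$$\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell-state update} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state}
\end{aligned}$$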

III. Experimental Setup

1. Crawling country and province data

Using requests and json, we crawl the COVID-19 data for the 11 countries (China, the USA, Brazil, India, Russia, France, the UK, Turkey, Argentina, Colombia, and Japan) and for China's 31 provinces (autonomous regions and municipalities) over 2022.01.01-2022.06.19, and import it into our platform. The page embeds the statistics as a JavaScript literal inside a try/catch block, which is why the code below slices the string between '[{' and '}catch'.


1.1 Domestic data:

Code:

# @Time : 2022/6/30 19:20
# @Author : 徐以鹏
# @File : 国内数据.py
import requests
from lxml import etree
import json
import pandas as pd
import numpy as np

# Set the request headers and URL, then locate the embedded data with XPath
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'
req = requests.get(url, headers=headers)
req.encoding = "utf-8"
html = etree.HTML(req.text)
req_data = html.xpath("//*[@id='getAreaStat']/text()")
# print(req_data)

# Extract the JSON string embedded in the page script
req_str = req_data[0] + ''
data_str = req_str[req_str.find('[{'):req_str.find('}catch')]
data_json = json.loads(data_str)  # parse into JSON

# Collect each province's JSON address; provinceShortName holds the province abbreviation
countryJson = []  # list of [name, statisticsData URL] pairs
for i in data_json:
    countryJson.append([i['provinceShortName'], i['statisticsData']])
# countryJson now holds every province and its corresponding JSON address

# Download each JSON file
for i in countryJson:
    countryName = i[0]
    jsonAddress = i[1]
    print(countryName, jsonAddress)
    try:
        r = requests.get(jsonAddress, headers=headers)
        r.raise_for_status()
        r.encoding = "utf-8"  # guard against garbled characters
        CountryDataJson = json.loads(r.text)
        toWriteFilePath = countryName + '.json'
        with open(toWriteFilePath, 'w', encoding='UTF-8') as file:
            json.dump(CountryDataJson, file, ensure_ascii=False)
        print(countryName + " downloaded")
    except:
        print(countryName + " download failed!")

# Load Shandong's data to build empty DataFrame templates with the date columns
file = '山东.json'
with open(file, 'r') as f:
    data = json.load(f)
    data_list = data['data']
    confirmed = pd.DataFrame([])
    data_d = pd.DataFrame(data_list)
    data_d.set_index('dateId', inplace=True)
    data_u = data_d[data_d.index >= 20220101]
    data_u = data_u[data_u.index <= 20220619]
    data_u = data_u.T
    data_u.drop(data_u.index, inplace=True)
    confirmed = data_u.drop(index=data_u.index)
    cured = data_u.drop(index=data_u.index)
    dead = data_u.drop(index=data_u.index)

# Consolidate every province's data
nameList = []
for i in countryJson:
    nameList.append(i[0])
for i in nameList:
    file = i + '.json'
    with open(file, 'r') as f:
        data = json.load(f)
        data_list = data['data']
        data_df = pd.DataFrame(data_list)
        data_df.set_index('dateId', inplace=True)
        data_df = data_df[['confirmedCount', 'curedCount', 'deadCount']]
        data_ult = data_df[data_df.index >= 20220101]
        data_ult = data_ult[data_ult.index <= 20220619]
        data_ult = data_ult.replace(0, np.nan)  # treat zero-valued gaps as missing
        data_ult.bfill(inplace=True)            # backfill the gaps
        data_ult.to_csv('各省数据/' + i + '.csv')
        data_confirmed = data_ult[['confirmedCount']].T
        data_cured = data_ult[['curedCount']].T
        data_dead = data_ult[['deadCount']].T
        confirmed = pd.concat([confirmed, data_confirmed])
        cured = pd.concat([cured, data_cured])
        dead = pd.concat([dead, data_dead])
confirmed.index = nameList
cured.index = nameList
dead.index = nameList
confirmed.to_csv('各省数据/' + 'China_confirmeCount' + '.csv')
cured.to_csv('各省数据/' + 'China_curedCount' + '.csv')
dead.to_csv('各省数据/' + 'China_deadCount' + '.csv')

Results: (screenshot omitted)

1.2 International data:

Code:

# @Time : 2022/6/30 9:59
# @Author : 徐以鹏
# @File : 国外数据.py
import requests
from lxml import etree
import json
import pandas as pd
import numpy as np

# Set the request headers and URL, then locate the embedded data with XPath
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'
req = requests.get(url, headers=headers)
req.encoding = "utf-8"
html = etree.HTML(req.text)
req_data = html.xpath("//*[@id='getListByCountryTypeService2true']/text()")

# Extract the JSON string embedded in the page script
req_str = req_data[0] + ''
data_str = req_str[req_str.find('[{'):req_str.find('}catch')]
data_json = json.loads(data_str)  # parse into JSON

# Collect each country's JSON address
countryJson = []  # list of [name, statisticsData URL] pairs
for i in data_json:
    # China, USA, Brazil, India, Russia, UK, France, Turkey, Argentina, Colombia, Japan
    if i["provinceName"] in ['中国', '美国', '巴西', '印度', '俄罗斯', '英国', '法国', '土耳其', '阿根廷', '哥伦比亚', '日本']:
        countryJson.append([i['provinceName'], i['statisticsData']])
# countryJson now holds the eleven required countries and their JSON addresses

# Download each JSON file
for i in countryJson:
    countryName = i[0]
    jsonAddress = i[1]
    print(countryName, jsonAddress)
    try:
        r = requests.get(jsonAddress, headers=headers)
        r.raise_for_status()
        r.encoding = "utf-8"  # guard against garbled characters
        CountryDataJson = json.loads(r.text)
        toWriteFilePath = '' + countryName + '.json'
        with open(toWriteFilePath, 'w', encoding='UTF-8') as file:
            json.dump(CountryDataJson, file, ensure_ascii=False)
        print(countryName + " downloaded")
    except:
        print(countryName + " download failed!")

# Load China's data to build empty DataFrame templates with the date columns
file = '中国.json'
with open(file, 'r') as f:
    data = json.load(f)
    data_list = data['data']
    confirmed = pd.DataFrame([])
    data_d = pd.DataFrame(data_list)
    data_d.set_index('dateId', inplace=True)
    data_u = data_d[data_d.index >= 20220101]
    data_u = data_u[data_u.index <= 20220619]
    data_u = data_u.T
    data_u.drop(data_u.index, inplace=True)
    confirmed = data_u.drop(index=data_u.index)
    cured = data_u.drop(index=data_u.index)
    dead = data_u.drop(index=data_u.index)

# Consolidate the world data
nameList = ['中国', '美国', '巴西', '印度', '俄罗斯', '英国', '法国', '土耳其', '阿根廷', '哥伦比亚', '日本']
for i in nameList:
    file = i + '.json'
    with open(file, 'r') as f:
        data = json.load(f)
        data_list = data['data']
        data_df = pd.DataFrame(data_list)
        data_df.set_index('dateId', inplace=True)
        data_df = data_df[['confirmedCount', 'curedCount', 'deadCount']]
        data_ult = data_df[data_df.index >= 20220101]
        data_ult = data_ult[data_ult.index <= 20220619]
        data_ult = data_ult.replace(0, np.nan)  # treat zero-valued gaps as missing
        data_ult.bfill(inplace=True)            # backfill the gaps
        data_ult.to_csv('各国数据/' + i + '.csv')
        data_confirmed = data_ult[['confirmedCount']].T
        data_cured = data_ult[['curedCount']].T
        data_dead = data_ult[['deadCount']].T
        confirmed = pd.concat([confirmed, data_confirmed])
        cured = pd.concat([cured, data_cured])
        dead = pd.concat([dead, data_dead])
confirmed.index = nameList
cured.index = nameList
dead.index = nameList
confirmed.to_csv('各国数据/' + 'World_confirmeCount' + '.csv')
cured.to_csv('各国数据/' + 'World_curedCount' + '.csv')
dead.to_csv('各国数据/' + 'World_deadCount' + '.csv')

Results: (screenshot omitted)

2. Visualization

Using the crawled data and the pandas, numpy, and matplotlib libraries, we turn the data into visualizations.

3. Prediction

Because the three metrics share the same structure, we only need to build the pipeline for one of them and can then apply it to the other two (rerunning the script with predict_name changed). One caveat: part of the death-count data we obtained contains gaps, so simple interpolation is needed first to recover the true series. Finally, we train the LSTM on each of the three series and predict with it, plot the epidemic trend, obtain a coefficient of determination R² for each series, and average the three values for the final result. A minimal sketch of the gap handling and differencing follows.
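
The sketch below illustrates that preprocessing on a small hypothetical deadCount series (the numbers are made up for illustration): the zero-as-missing backfill mirrors the replace/bfill calls in the crawler scripts above, and the LSTM is trained on the differenced daily increments.

import numpy as np
import pandas as pd

# Hypothetical cumulative death counts with zero-valued gaps (illustrative data only)
dead = pd.Series([100, 0, 104, 106, 0, 111], name='deadCount')

# Treat zeros as missing and backfill them, as the crawler scripts above do
dead = dead.replace(0, np.nan).bfill()

# First-difference the cumulative series into daily increments,
# the quantity the LSTM is actually trained on
daily = dead.diff().fillna(0)
print(daily.tolist())  # [0.0, 4.0, 0.0, 2.0, 5.0, 0.0]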

4. LSTM model code:

# @Time : 2022/6/30 12:01
# @Author : 徐以鹏
# @File : 预测.py
import numpy as np
import matplotlib.pyplot as plt
import paddle
import pandas as pd

path = "/home/aistudio/work/"
Data = pd.read_csv(path + '中国.csv', index_col='dateId', parse_dates=['dateId'])  # load the data
Data.head()
predict_name = 'confirmedCount'  # the column we want to predict
training = Data[predict_name][:'20220619'].values  # training data runs through June 19
test = Data[predict_name]['20220620':].values      # test data starts on June 20
trainlist, testlist = [0], [0]  # hold the first-differenced (daily increment) series
for i in range(1, len(training)):
    trainlist.append(training[i] - training[i - 1])
for j in range(1, len(test)):
    testlist.append(test[j] - test[j - 1])
# convert the lists to numpy arrays
training = np.array(trainlist)
test = np.array(testlist)
# min-max scaling parameters taken from the training set
mintrain = training.min()
maxtrain = training.max()
train_set_range = maxtrain - mintrain

def my_MinMaxScaler(data):
    return (data - mintrain) / train_set_range

def reverse_min_max_scaler(a_num):
    return a_num * train_set_range + mintrain

normalized_train_set = my_MinMaxScaler(training)
normalized_test_set = my_MinMaxScaler(test)
normalized_train_set = normalized_train_set.astype('float32')

# Define the MyDataset class with the transform we need
class MyDataset(paddle.io.Dataset):
    def __init__(self, normalized_train_set):
        super(MyDataset, self).__init__()
        self.train_set_data_X = []
        self.train_set_data_Y = []
        self.transform(normalized_train_set)

    def transform(self, data):
        # sliding window: the previous 60 days predict the next day
        for i in range(60, len(data)):
            self.train_set_data_X.append(np.array(data[i - 60:i].reshape(-1, 1)))
            self.train_set_data_Y.append(np.array(data[i]))

    def __getitem__(self, index):
        data = self.train_set_data_X[index]
        label = self.train_set_data_Y[index]
        return data, label

    def __len__(self):
        return len(self.train_set_data_X)

dataSet = MyDataset(normalized_train_set)
trainLoader = paddle.io.DataLoader(dataSet, batch_size=200, shuffle=False)

# We use an LSTM because a plain RNN would suffer from vanishing gradients during training
class StockNet(paddle.nn.Layer):
    def __init__(self):
        super(StockNet, self).__init__()
        self.lstm = paddle.nn.LSTM(input_size=1, hidden_size=50, num_layers=4, dropout=0.2, time_major=False)
        self.fc = paddle.nn.Linear(in_features=50, out_features=1)

    def forward(self, inputs):
        outputs, final_states = self.lstm(inputs)
        y = self.fc(final_states[0][3])  # hidden state of the last (4th) LSTM layer
        return y

model = StockNet()
optimstic = paddle.optimizer.RMSProp(parameters=model.parameters(), learning_rate=0.01)
lossFN = paddle.nn.MSELoss()
epochs = 1000
for epoch in range(epochs):
    for batch_id, data in enumerate(trainLoader()):
        x_data = data[0]
        y_data = data[1]
        predicts = model(x_data)
        loss = lossFN(predicts, y_data.reshape((-1, 1)))
        loss.backward()
        optimstic.step()
        optimstic.clear_grad()

# Build test windows from the last 60 training points followed by the test set
tmpInput = np.hstack((normalized_train_set[-60:], normalized_test_set))
tmpInput = tmpInput.astype('float32')
testData = MyDataset(tmpInput)
testLoader = paddle.io.DataLoader(testData, batch_size=len(testData), drop_last=False)
model.train()
testResult = None
for batch_id, data in enumerate(testLoader()):
    x_data = data[0]
    predicts = model(x_data)
    testResult = predicts.reshape((-1,))
trainResult = None
for batch_id, data in enumerate(trainLoader()):
    trainData = data[0]
    trainPredicts = model(trainData)
    trainResult = trainPredicts.reshape((-1,))
testPredicts = reverse_min_max_scaler(testResult.detach().numpy())
trainPredicts = reverse_min_max_scaler(trainResult.detach().numpy())
# Undo the first difference: accumulate the predicted daily increments onto the base counts
realtrain_predict, realtest_predict = [], []
temptrain, temptest = Data[predict_name]['20220101'], Data[predict_name]['20220620']
for i in trainPredicts:
    temptrain += i
    realtrain_predict.append(temptrain)
for j in testPredicts:
    temptest += j
    realtest_predict.append(temptest)

# Plot predictions against the real values
def plot_predictions(test, predicted):
    plt.plot(test, color='red', label='Real value')
    plt.plot(predicted, color='black', label='Predicted value')
    plt.title('confirmedCount_Prediction')
    plt.xlabel('Days')
    plt.ylabel('People')
    plt.legend()
    plt.show()

plot_predictions(Data[predict_name]['20220620':].values, realtest_predict)

# Compute the R2 score of the predictions
from sklearn.metrics import r2_score
confirmedCount = r2_score(Data[predict_name]['20220620':].values, realtest_predict[:11])
print(confirmedCount)

IV. COVID-19 Visualization

1. Confirmed cases by country

plt.xlabel("confirmedCount")plt.barh(Country,last_confirmedCount2)


2. Recoveries by country

plt.xlabel("curedCount")plt.barh(Country,last_curedCount2)


3. Deaths by country

plt.xlabel("deadCount")plt.barh(Country,last_deadCount2)


4. Pie chart of total confirmed cases by province

plt.figure(figsize=(10, 10))
plt.pie(last_confirmedCount1, radius=1.5, shadow=True, autopct='%1.1f%%')
plt.legend(Province, loc='best')


5. Recoveries by province

plt.figure(figsize=(10, 10))
plt.pie(last_curedCount1, radius=1.5, shadow=True, autopct='%1.1f%%')
plt.legend(Province, loc='best')


V. Epidemic Data Prediction

1. National confirmed cases, 2022.6.20-2022.6.30:

# Plot predictions against the real values
def plot_predictions(test, predicted):
    plt.plot(test, color='red', label='Real value')
    plt.plot(predicted, color='black', label='Predicted value')
    plt.title('confirmedCount_Prediction')
    plt.xlabel('Days')
    plt.ylabel('People')
    plt.legend()
    plt.show()

plot_predictions(Data[predict_name]['20220620':].values, realtest_predict)

# Compute the R2 score of the predictions
from sklearn.metrics import r2_score
confirmedCount = r2_score(Data[predict_name]['20220620':].values, realtest_predict[:11])
print(confirmedCount)



r2=0.9425994713615403

2. National deaths, 2022.6.20-2022.6.30:

# Plot predictions against the real values
def plot_predictions(test, predicted):
    plt.plot(test, color='red', label='Real value')
    plt.plot(predicted, color='black', label='Predicted value')
    plt.title('deadCount_Prediction')
    plt.xlabel('Days')
    plt.ylabel('People')
    plt.legend()
    plt.show()

plot_predictions(Data[predict_name]['20220620':].values, realtest_predict)

# Compute the R2 score of the predictions
from sklearn.metrics import r2_score
deadCount = r2_score(Data[predict_name]['20220620':].values, realtest_predict[:11])
print(deadCount)



r2=0.9741672899742679

3. National recoveries, 2022.6.20-2022.6.30:

# Plot predictions against the real values
def plot_predictions(test, predicted):
    plt.plot(test, color='red', label='Real value')
    plt.plot(predicted, color='black', label='Predicted value')
    plt.title('curedCount_Prediction')
    plt.xlabel('Days')
    plt.ylabel('People')
    plt.legend()
    plt.show()

plot_predictions(Data[predict_name]['20220620':].values, realtest_predict)

# Compute the R2 score of the predictions
from sklearn.metrics import r2_score
curedCount = r2_score(Data[predict_name]['20220620':].values, realtest_predict[:11])
print(curedCount)



r2=0.9819537632078106

4. Average of the three R² values

# The final score is the mean of the three R2 values
from sklearn.metrics import r2_score
confirmedCount = r2_score(Data[predict_name]['20220620':].values, realtest_predict[:11])
print(confirmedCount)
print((confirmedCount + curedCount + deadCount) / 3)


r2 = (confirmedCount + curedCount + deadCount) / 3 = (0.9426 + 0.9742 + 0.9820) / 3 ≈ 0.96624

VI. Results and Summary

Our final predictions reached an R² of 0.96624, indicating an excellent model fit: the model can accurately forecast future confirmed, death, and recovery counts. This result owes much to the LSTM model. Like other RNNs, an LSTM handles time-series tasks better than a CNN; at the same time, it resolves the RNN's long-term dependency problem and alleviates the vanishing-gradient problem that arises during backpropagation through time. The LSTM is an excellent RNN variant that inherits most of the RNN's properties while curing the vanishing gradients caused by repeated shrinking during gradient backpropagation. Its drawback is a comparatively complex structure, so training takes longer than for a CNN; for this problem, however, the LSTM predicts accurately and gives us a clear picture of how the epidemic trend evolves.

VII. Full Code

1. Crawler:

The domestic and international crawler scripts are the same as those listed in Sections 1.1 and 1.2 of Part III above.

2. Visualization:

import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import datetime
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import datasets, linear_model
%matplotlib inline

Country = ['China', 'USA', 'Brazil', 'India', 'Russia', 'Britain', 'France', 'Turkey', 'Argentina', 'Columbia', 'Japan']
Province = ['Hubei', 'Zhejiang', 'Guangdong', 'Henan', 'Hunan', 'Anhui', 'Chongqing', 'Jiangxi', 'Shangdong', 'Sichuan', 'Jiangsu', 'Beijing', 'Shanghai', 'Fujian', 'Guangxi', 'Yunnan', 'Shanxi3', 'Hebei', 'Henan', 'Heilongjiang', 'Liaoning', 'Shanxi1', 'Tiajin', 'Gansu', 'InnerMongolia', 'Ningxia', 'Xinjiang', 'Jilin', 'Guizhou', 'Hongkong', 'Taiwan', 'Qinghai', 'Macao', 'Xizang']
# Province = ["Shanghai","Yunnan","Neimeng","Beijing","Taiwan","Jilin","Sichuan","Tianjin","Ningxia","Anhui","Shandong","Shanxi","Guangdong","Guangxi","Xinjiang","Jiangsu","Jiangxi","Hebei","Henan","Zhejiang","Hainan","Hubei","Hunan","Macao","Gansu","Fujian","Xizang","Guizhou","Liaoning","Chongqing","Shaanxi","Qinghai","Hong Kong","Heilongjiang"]
path = "/home/aistudio/work/"
FileName = ["confirmedCount", "curedCount", "deadCount", "world_confirmedCount", "world_curedCount", "world_deadCount"]
Data = []
for i in FileName:
    data = pd.read_csv(path + i + ".csv").loc[:, "20220101":"20220619"]
    # while data.isnull().values.any():
    #     data = data.fillna(method='ffill', axis=1)
    Data.append(np.array(data))
last_confirmedCount1 = [data[-1] for data in Data[0]]
last_curedCount1 = [data[-1] for data in Data[1]]
last_deadCount1 = [data[-1] for data in Data[2]]
last_confirmedCount2 = [data[-1] for data in Data[3]]
last_curedCount2 = [data[-1] for data in Data[4]]
last_deadCount2 = [data[-1] for data in Data[5]]

plt.xlabel("confirmedCount")
plt.barh(Country, last_confirmedCount2)

plt.xlabel("curedCount")
plt.barh(Country, last_curedCount2)

plt.xlabel("deadCount")
plt.barh(Country, last_deadCount2)

plt.figure(figsize=(10, 10))
plt.pie(last_confirmedCount1, radius=1.5, shadow=True, autopct='%1.1f%%')
plt.legend(Province, loc='best')

plt.figure(figsize=(10, 10))
plt.pie(last_curedCount1, radius=1.5, shadow=True, autopct='%1.1f%%')
plt.legend(Province, loc='best')

3. Prediction:

# @Time : 2022/7/3 12:01
# @Author : 是Dream呀
# @File : 预测.py
import numpy as np
import matplotlib.pyplot as plt
import paddle
import pandas as pd

path = "/home/aistudio/work/"
Data = pd.read_csv(path + '中国.csv', index_col='dateId', parse_dates=['dateId'])  # load the data
Data.head()
predict_name = 'confirmedCount'  # the column we want to predict
training = Data[predict_name][:'20220619'].values  # training data runs through June 19
test = Data[predict_name]['20220620':].values      # test data starts on June 20
trainlist, testlist = [0], [0]  # hold the first-differenced (daily increment) series
for i in range(1, len(training)):
    trainlist.append(training[i] - training[i - 1])
for j in range(1, len(test)):
    testlist.append(test[j] - test[j - 1])
# convert the lists to numpy arrays
training = np.array(trainlist)
test = np.array(testlist)
# min-max scaling parameters taken from the training set
mintrain = training.min()
maxtrain = training.max()
train_set_range = maxtrain - mintrain

def my_MinMaxScaler(data):
    return (data - mintrain) / train_set_range

def reverse_min_max_scaler(a_num):
    return a_num * train_set_range + mintrain

normalized_train_set = my_MinMaxScaler(training)
normalized_test_set = my_MinMaxScaler(test)
normalized_train_set = normalized_train_set.astype('float32')

# Define the MyDataset class with the transform we need
class MyDataset(paddle.io.Dataset):
    def __init__(self, normalized_train_set):
        super(MyDataset, self).__init__()
        self.train_set_data_X = []
        self.train_set_data_Y = []
        self.transform(normalized_train_set)

    def __len__(self):
        return len(self.train_set_data_X)

    def transform(self, data):
        # sliding window: the previous 60 days predict the next day
        for i in range(60, len(data)):
            self.train_set_data_X.append(np.array(data[i - 60:i].reshape(-1, 1)))
            self.train_set_data_Y.append(np.array(data[i]))

    def __getitem__(self, index):
        data = self.train_set_data_X[index]
        label = self.train_set_data_Y[index]
        return data, label

dataSet = MyDataset(normalized_train_set)
trainLoader = paddle.io.DataLoader(dataSet, batch_size=200, shuffle=False)

# We use an LSTM because a plain RNN would suffer from vanishing gradients during training
class StockNet(paddle.nn.Layer):
    def __init__(self):
        super(StockNet, self).__init__()
        self.lstm = paddle.nn.LSTM(input_size=1, hidden_size=50, num_layers=4, dropout=0.2, time_major=False)
        self.fc = paddle.nn.Linear(in_features=50, out_features=1)

    def forward(self, inputs):
        outputs, final_states = self.lstm(inputs)
        y = self.fc(final_states[0][3])  # hidden state of the last (4th) LSTM layer
        return y

model = StockNet()
optimstic = paddle.optimizer.RMSProp(parameters=model.parameters(), learning_rate=0.01)
lossFN = paddle.nn.MSELoss()
epochs = 1000
for epoch in range(epochs):
    for batch_id, data in enumerate(trainLoader()):
        x_data = data[0]
        y_data = data[1]
        predicts = model(x_data)
        loss = lossFN(predicts, y_data.reshape((-1, 1)))
        loss.backward()
        optimstic.step()
        optimstic.clear_grad()

# Build test windows from the last 60 training points followed by the test set
tmpInput = np.hstack((normalized_train_set[-60:], normalized_test_set))
tmpInput = tmpInput.astype('float32')
testData = MyDataset(tmpInput)
testLoader = paddle.io.DataLoader(testData, batch_size=len(testData), drop_last=False)
model.train()
testResult = None
for batch_id, data in enumerate(testLoader()):
    x_data = data[0]
    predicts = model(x_data)
    testResult = predicts.reshape((-1,))
trainResult = None
for batch_id, data in enumerate(trainLoader()):
    trainData = data[0]
    trainPredicts = model(trainData)
    trainResult = trainPredicts.reshape((-1,))
testPredicts = reverse_min_max_scaler(testResult.detach().numpy())
trainPredicts = reverse_min_max_scaler(trainResult.detach().numpy())
# Undo the first difference: accumulate the predicted daily increments onto the base counts
realtrain_predict, realtest_predict = [], []
temptrain, temptest = Data[predict_name]['20220101'], Data[predict_name]['20220620']
for i in trainPredicts:
    temptrain += i
    realtrain_predict.append(temptrain)
for j in testPredicts:
    temptest += j
    realtest_predict.append(temptest)

# Plot predictions against the real values
def plot_predictions(test, predicted):
    plt.plot(test, color='red', label='Real')
    plt.plot(predicted, color='black', label='Predicted')
    plt.title('confirmedCount_Prediction')
    plt.xlabel('Days')
    plt.ylabel('People')
    plt.legend()
    plt.show()

plot_predictions(Data[predict_name]['20220620':].values, realtest_predict)

# Compute the R2 for confirmedCount; curedCount and deadCount come from the
# corresponding runs with predict_name set to those columns
from sklearn.metrics import r2_score
confirmedCount = r2_score(Data[predict_name]['20220620':].values, realtest_predict[:11])
print(confirmedCount)
print((confirmedCount + curedCount + deadCount) / 3)


Finally, if you have any questions, feel free to send me a private message!


💕💕💕 That's all I have to share with you today; see you next time! ✨✨✨ 🍻🍻🍻 If you liked this, don't hold back on the like, favorite, and follow~



