TensorFlow 2 Fashion-MNIST Image Classification (Part 1)

Published: December 10, 2020
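1. Dataset Introduction
Fashion-MNIST is an image dataset designed as a replacement for the MNIST handwritten-digit dataset. It is provided by the research division of Zalando, a German fashion technology company, and contains front-view images of 70,000 different products from 10 categories.
Fashion-MNIST has exactly the same size, format, and training/test split as the original MNIST: 60,000 training and 10,000 test samples, each a 28x28 grayscale image. You can use it directly to benchmark machine learning and deep learning algorithms without changing any code that already works with MNIST.
The official introduction to the dataset is available at:
https://github.com/zalandoresearch/fashion-mnist
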
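2. Model Training
TensorFlow version: 2.2.0
2.1 Data Loading
The data can be loaded over the network by calling the loader bundled with the tensorflow package, or it can be downloaded in advance and read from local files.

The code is as follows:
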
# First import the required packages and print their version information
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import gzip
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(sys.version_info)
for module in mpl, np, pd, sklearn, tf, keras:
    print(module.__name__, module.__version__)
    
# Load the data over the network
fashion_mnist = keras.datasets.fashion_mnist
(x_train_all, y_train_all), (x_test, y_test) = fashion_mnist.load_data()
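Once downloaded, the local copy of the dataset consists of four files: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, and t10k-labels-idx1-ubyte.gz.

A slightly modified version of the original load_data function can then read these files: simply replace the remote download URL with the directory that holds your local files. Here that directory is data.
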
def load_data():
  """Loads the Fashion-MNIST dataset.
  This is a dataset of 60,000 28x28 grayscale images of 10 fashion categories,
  along with a test set of 10,000 images. This dataset can be used as
  a drop-in replacement for MNIST. The class labels are:
  | Label | Description |
  |:-----:|-------------|
  |   0   | T-shirt/top |
  |   1   | Trouser     |
  |   2   | Pullover    |
  |   3   | Dress       |
  |   4   | Coat        |
  |   5   | Sandal      |
  |   6   | Shirt       |
  |   7   | Sneaker     |
  |   8   | Bag         |
  |   9   | Ankle boot  |
  Returns:
      Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
      **x_train, x_test**: uint8 arrays of grayscale image data with shape
        (num_samples, 28, 28).
      **y_train, y_test**: uint8 arrays of labels (integers in range 0-9)
        with shape (num_samples,).
  License:
      The copyright for Fashion-MNIST is held by Zalando SE.
      Fashion-MNIST is licensed under the [MIT license](
      https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE).
  """
  dirname = os.path.join('datasets', 'fashion-mnist')
  # The data has already been downloaded; point base at the local directory
  base = 'data/'
  # base = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
  files = [
      'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
      't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
  ]
  paths = [base + f_name for f_name in files]
  # for fname in files:
  #   paths.append(get_file(fname, origin=base + fname, cache_subdir=dirname))
  with gzip.open(paths[0], 'rb') as lbpath:
    y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)
  with gzip.open(paths[1], 'rb') as imgpath:
    x_train = np.frombuffer(
        imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
  with gzip.open(paths[2], 'rb') as lbpath:
    y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)
  with gzip.open(paths[3], 'rb') as imgpath:
    x_test = np.frombuffer(
        imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
  return (x_train, y_train), (x_test, y_test)
# Read the data from the local files
(x_train_all, y_train_all), (x_test, y_test) = load_data()
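The offsets 8 and 16 passed to np.frombuffer in load_data skip the IDX file headers: a label file starts with a 4-byte magic number and a 4-byte item count, and an image file additionally stores 4-byte row and column counts. The sketch below illustrates this on a tiny synthetic dataset. The helper write_idx, the demo file names, and the temporary directory are assumptions made for illustration and are not part of the original article or of the real dataset.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def write_idx(path, magic, dims, payload):
    """Hypothetical helper: write a gzip-compressed IDX file consisting of a
    big-endian magic number, one big-endian 32-bit size per dimension, and
    the raw uint8 payload."""
    header = struct.pack(">I", magic)
    header += b"".join(struct.pack(">I", d) for d in dims)
    with gzip.open(path, "wb") as f:
        f.write(header + payload.astype(np.uint8).tobytes())


# Tiny synthetic dataset: 3 images of 28x28 pixels with labels 0, 1, 2.
labels = np.array([0, 1, 2], dtype=np.uint8)
images = (np.arange(3 * 28 * 28) % 256).reshape(3, 28, 28)

tmp_dir = tempfile.mkdtemp()
label_path = os.path.join(tmp_dir, "demo-labels-idx1-ubyte.gz")
image_path = os.path.join(tmp_dir, "demo-images-idx3-ubyte.gz")
write_idx(label_path, 0x00000801, [3], labels)          # header: 4 + 4 = 8 bytes
write_idx(image_path, 0x00000803, [3, 28, 28], images)  # header: 4 + 3 * 4 = 16 bytes

# Read the files back the same way load_data does.
with gzip.open(label_path, "rb") as f:
    y_demo = np.frombuffer(f.read(), np.uint8, offset=8)
with gzip.open(image_path, "rb") as f:
    x_demo = np.frombuffer(f.read(), np.uint8, offset=16).reshape(len(y_demo), 28, 28)

print(y_demo, x_demo.shape)
```

If the offsets were omitted, the header bytes would be read as labels or pixels and the reshape would fail or shift every image.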
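2.2 Data Splitting
The dataset ships with 60,000 training and 10,000 test samples. We split the training set further into train and valid: the first 5,000 training samples form the validation set, the remaining 55,000 form the training set, and the original test set is reserved for the final evaluation of the model.

# Split the original training data into train and valid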
x_valid, x_train = x_train_all[:5000], x_train_all[5000:]
y_valid, y_train = y_train_all[:5000], y_train_all[5000:]
print(x_valid.shape, y_valid.shape)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
The resulting split is:

(5000, 28, 28) (5000,)
(55000, 28, 28) (55000,)
(10000, 28, 28) (10000,)

Visualize one of the samples:

def show_single_image(img_dir):
    plt.imshow(img_dir, cmap="binary")
    plt.show()
show_single_image(x_train[0])
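[Figure: the first training sample rendered as a grayscale image]

To look at more samples, we can write a helper function that shows a grid of images.
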
def show_imgs(n_rows, n_cols, x_data, y_data, class_names):
    """
    Show the first n_rows * n_cols sample images from the dataset,
    arranged in a grid with the given number of rows and columns.
    """
    assert len(x_data) == len(y_data)
    assert n_rows * n_cols <= len(x_data)
    plt.figure(figsize = (n_cols * 1.4, n_rows * 1.6))
    for row in range(n_rows):
        for col in range(n_cols):
            index = n_cols * row + col
            plt.subplot(n_rows, n_cols, index + 1)
            plt.imshow(x_data[index], cmap="binary", interpolation = "nearest")
            plt.axis('off')
            plt.title(class_names[y_data[index]])
    plt.show()
﻿
# Class names, indexed by label
class_names = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
# Show 15 sample images
show_imgs(3, 5, x_train, y_train, class_names)
Here we display 15 sample images:
[Figure: a 3 x 5 grid of the first 15 training samples, each titled with its class name]

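2.3 Model Construction
We build a Sequential model with the keras package integrated into TensorFlow 2: keras.models.Sequential()
Sequential offers a very simple way to build a neural network: add the layers one at a time with add, or pass them all at once as a list.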
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))
# Alternatively, pass all layers to the constructor as a list
"""
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation='relu'),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
"""
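# Notes on the activation functions
# relu: y = max(0, x)
# softmax: turns a vector into a probability distribution. For x = [x1, x2, x3],
#          y = [e^x1/sum, e^x2/sum, e^x3/sum], sum = e^x1 + e^x2 + e^x3
# Why "sparse": each label y is a class index rather than a one-hot vector.
# Because y is a single integer here, we use sparse_categorical_crossentropy;
# if y were a one-hot vector, we would use categorical_crossentropy instead.
# Compile the model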
model.compile(loss="sparse_categorical_crossentropy",
              optimizer = "sgd",
              metrics = ["accuracy"])
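The activation functions and the two cross-entropy variants mentioned in the comments can be sketched in plain NumPy. This is only an illustration of the formulas, not the Keras implementation; the function names below are chosen for the example, and the losses are returned per sample (Keras averages them over the batch).

```python
import numpy as np


def relu(x):
    # relu: y = max(0, x), applied element-wise.
    return np.maximum(0.0, x)


def softmax(x):
    # Subtracting the row maximum improves numerical stability and does not
    # change the result.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def sparse_categorical_crossentropy(y_index, probs):
    # Labels are integer class indices, e.g. 2.
    return -np.log(probs[np.arange(len(y_index)), y_index])


def categorical_crossentropy(y_one_hot, probs):
    # Labels are one-hot vectors, e.g. [0, 0, 1].
    return -np.sum(y_one_hot * np.log(probs), axis=-1)


logits = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]])
probs = softmax(logits)

y_index = np.array([2, 0])          # labels as class indices
y_one_hot = np.eye(3)[y_index]      # the same labels as one-hot vectors

loss_sparse = sparse_categorical_crossentropy(y_index, probs)
loss_dense = categorical_crossentropy(y_one_hot, probs)

print(probs.sum(axis=-1))           # each row sums to 1
print(loss_sparse, loss_dense)      # both variants give the same values
```

The two losses agree because the one-hot vector simply selects the probability of the true class; the sparse variant avoids building that vector.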
With the steps above, the model has been built and compiled. We can now inspect it.

# List the layers of the model
model.layers
# Output
[<tensorflow.python.keras.layers.core.Flatten at 0x7f4e6d11b438>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f4e6d11beb8>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f4e6d11b588>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f4e6d90bfd0>]
﻿
# Structure of the model, including the parameter counts
model.summary()
# Output
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_2 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_7 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_8 (Dense)              (None, 10)                1010      
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
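The shapes of the tensors at each layer are computed as follows:

[None, 784] * W + b -> [None, 300], with W.shape = [784, 300] and b.shape = [300]
The input has shape [None, 28, 28]; after Flatten it becomes [None, 784]. The next layer is a fully connected layer with 300 units, so from XW + b = y the weight matrix W must have shape [784, 300] and the bias b must have shape [300]. The shapes of the parameters in the remaining fully connected layers follow in the same way.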
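The same reasoning gives the parameter counts reported by model.summary(). A small sketch, using only the layer sizes from the model above:

```python
# Layer sizes taken from the model above: 784 -> 300 -> 100 -> 10.
layer_sizes = [28 * 28, 300, 100, 10]

param_counts = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # W has shape [n_in, n_out]
    biases = n_out           # b has shape [n_out]
    param_counts.append(weights + biases)

total_params = sum(param_counts)
print(param_counts, total_params)  # [235500, 30100, 1010] 266610
```

The Flatten layer only reshapes its input, so it contributes no parameters.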
﻿
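Next comes training, using the fit function.
epochs sets how many passes over the training data are made; one epoch runs all of x_train through the model once. batch_size is the number of samples fed to the model at a time, here 64. A larger batch_size uses more memory, so choose it to suit your machine.
history = model.fit(x_train, y_train, epochs=10, batch_size=64, validation_data=(x_valid, y_valid))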
The training log is as follows:
﻿
Epoch 1/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0986 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 2/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0991 - val_loss: 2.3029 - val_accuracy: 0.0914
Epoch 3/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0985 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 4/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0988 - val_loss: 2.3029 - val_accuracy: 0.0914
Epoch 5/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.1001 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 6/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0986 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 7/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.1008 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 8/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0984 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 9/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.0973 - val_loss: 2.3028 - val_accuracy: 0.0914
Epoch 10/10
860/860 [==============================] - 3s 3ms/step - loss: 2.3026 - accuracy: 0.1003 - val_loss: 2.3028 - val_accuracy: 0.0914
Because batch_size is 64 and the training set has 55,000 samples, each epoch feeds 860 batches to the model.
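The figure of 860 can be checked with a short calculation. The sketch assumes that the final, partial batch is still processed, which is consistent with the step count shown in the log:

```python
import math

n_train = 55000   # size of x_train after the train/valid split
batch_size = 64

# 55000 / 64 = 859.375, so a final, smaller batch is needed.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch_size)  # 860 24
```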
﻿
The returned history object can be inspected for its type and its contents:

# Check the type
type(history)
# Metrics recorded for each epoch
history.history
# Output:
{'accuracy': [0.09905454516410828,
  0.10116363316774368,
  0.09749090671539307,
  0.09958181530237198,
  0.09843636304140091,
  0.09821818023920059,
  0.09734545648097992,
  0.1000545471906662,
  0.09998181462287903,
  0.0989818200469017],
 'loss': [2.302921772003174,
  2.302874803543091,
  2.302971839904785,
  2.3028604984283447,
  2.3029799461364746,
  2.3029377460479736,
  2.3030099868774414,
  2.302891731262207,
  2.3029096126556396,
  2.3028926849365234],
 'val_accuracy': [0.09860000014305115,
  0.09759999811649323,
  0.10019999742507935,
  0.09139999747276306,
  0.09799999743700027,
  0.1111999973654747,
  0.09799999743700027,
  0.10019999742507935,
  0.10019999742507935,
  0.10080000013113022],
 'val_loss': [2.302537679672241,
  2.3035120964050293,
  2.303284168243408,
  2.303611993789673,
  2.3037195205688477,
  2.30253267288208,
  2.3035061359405518,
  2.30228328704834,
  2.302734851837158,
  2.3030896186828613]}
Finally, plot how the metrics returned by training change over the epochs.
﻿
def plot_learning_curves(history):
    pd.DataFrame(history.history).plot(figsize=(8, 5))
    plt.grid(True)
    plt.gca().set_ylim(0, 1)
    plt.show()
plot_learning_curves(history)
[Figure: learning curves plotted from history.history]

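The final result shows that the accuracy is very low, and the training log shows that the metrics barely change from one epoch to the next: the model is not learning.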
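The logged values are what one would expect from a model that predicts every class with equal probability. The sketch below computes that chance-level baseline for 10 classes, assuming the classes are roughly balanced:

```python
import math

num_classes = 10

# A model that assigns probability 1/10 to every class has a cross-entropy
# of -ln(1/10) = ln(10) per sample, whatever the true label is.
uniform_prob = 1.0 / num_classes
chance_loss = -math.log(uniform_prob)

# With balanced classes, any fixed or random guess is right about 1 time in 10.
chance_accuracy = 1.0 / num_classes

print(round(chance_loss, 4), chance_accuracy)  # 2.3026 0.1
```

A loss of about 2.3026 and an accuracy near 0.1 therefore indicate that the network's output carries no information about the input.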
Let us check the range of the values in the dataset:
﻿
print(np.max(x_train), np.min(x_train))
# Output:
255 0
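The grayscale values range from 0 to 255, so the inputs span a wide range. Each input is 28*28, which is 784 values after flattening. In machine learning, different features often have different scales and units, which can distort the results of an analysis. To remove these scale effects, the data is standardized so that the features become comparable and lie in the same order of magnitude. The most typical form of this is normalization: in short, the data is preprocessed so that it falls within a fixed range such as [0, 1] or [-1, 1], which reduces the harmful effect of outlier samples.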
﻿
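Normalization serves three purposes:
(1) It prevents input variables with different physical meanings and units from being treated unequally.
(2) Backpropagation networks often use the sigmoid function as the transfer function; normalization keeps the absolute value of the net input from growing so large that the neuron outputs saturate.
(3) It ensures that small values in the output data are not swamped by large ones.
The training run above shows that our model did not fit the data, so the input data needs to be normalized. The next part builds the model with normalized inputs.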
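As a rough preview, two common ways of normalizing image data are sketched below on synthetic values. This is an illustration only and is not necessarily the method used in the next part; x_demo is a made-up stand-in for x_train.

```python
import numpy as np

# Synthetic stand-in for x_train: uint8 images with values in [0, 255].
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

# Option 1: min-max scaling to the range [0, 1].
x_scaled = x_demo.astype(np.float32) / 255.0

# Option 2: standardization to zero mean and unit variance. On real data the
# mean and standard deviation should come from the training set only and
# then be reused for the validation and test sets.
mean = x_demo.mean()
std = x_demo.std()
x_standardized = (x_demo.astype(np.float32) - mean) / std

print(x_scaled.min(), x_scaled.max())
print(x_standardized.mean(), x_standardized.std())
```

Either option keeps the inputs in a small range around zero, which is the property the three points above call for.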
﻿