Build multi-class classification models with Amazon Redshift ML

December 24, 2021

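Amazon Redshift ML simplifies machine learning (ML) operations by letting you create and train ML models from data in Amazon Redshift with simple SQL statements. You can use Amazon Redshift ML to solve binary classification, multi-class classification, and regression problems, and you can use techniques such as AutoML or XGBoost directly.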

Amazon Redshift ML

https://aws.amazon.com/redshift/features/redshift-ml/

Amazon Redshift

http://aws.amazon.com/redshift

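This post is part of a series on Amazon Redshift ML. For more information about building regression models with Amazon Redshift ML, see Build regression models with Amazon Redshift ML.

Build regression models with Amazon Redshift ML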

https://aws.amazon.com/blogs/machine-learning/build-regression-models-with-amazon-redshift-ml/

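You can use Amazon Redshift ML to automate data preparation, preprocessing, and selection of the problem type, as described in this blog post. We assume that you know your data well and know which problem type best fits your use case. This post focuses on creating a model in Amazon Redshift with the multi-class classification problem type, which covers problems with at least three classes. For example, you can predict whether a transaction will be fraudulent, failed, or successful; whether a customer will stay active for 3, 6, 9, or 12 months; or whether a news item should be tagged as sports, world news, or business.

Blog post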

https://aws.amazon.com/blogs/big-data/create-train-and-deploy-machine-learning-models-in-amazon-redshift-using-sql-with-amazon-redshift-ml/

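Prerequisites

As a prerequisite for implementing this solution, you need to set up an Amazon Redshift cluster with the machine learning (ML) feature enabled. For the preparation steps, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML: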

https://aws.amazon.com/blogs/big-data/create-train-and-deploy-machine-learning-models-in-amazon-redshift-using-sql-with-amazon-redshift-ml/

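Use case

In our use case, we want to identify the most active customers for a special customer loyalty program. We use Amazon Redshift ML and a multi-class classification model to predict how many months a customer will be active over a 13-month period. This translates into up to 13 possible classes, which makes the problem a good fit for multi-class classification. Customers predicted to stay active for 7 months or longer are the target group for the special customer loyalty program.

Input raw data

To prepare the raw data for this model, we populate the ecommerce_sales table in Amazon Redshift with the public dataset E-Commerce Sales Forecast, which contains sales data from a UK-based online retailer.

E-Commerce Sales Forecast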

https://www.kaggle.com/allunia/e-commerce-sales-forecast

Run the following statements to load the data into Amazon Redshift:

CREATE TABLE IF NOT EXISTS ecommerce_sales (
    invoiceno    VARCHAR(30),
    stockcode    VARCHAR(30),
    description  VARCHAR(60),
    quantity     DOUBLE PRECISION,
    invoicedate  VARCHAR(30),
    unitprice    DOUBLE PRECISION,
    customerid   BIGINT,
    country      VARCHAR(25)
);

-- Load the tab-delimited file, skipping the header row.
-- MAXERROR 100 lets the load continue past up to 100 rejected rows.
COPY ecommerce_sales
FROM 's3://redshift-ml-multiclass/ecommerce_data.txt'
IAM_ROLE '<<your-amazon-redshift-sagemaker-iam-role-arn>>'
DELIMITER '\t'
IGNOREHEADER 1
REGION 'us-east-1'
MAXERROR 100;

To reproduce this script in your environment, replace your-amazon-redshift-sagemaker-iam-role-arn with the Amazon Identity and Access Management (Amazon IAM) role ARN for your Amazon Redshift cluster.
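
Because the COPY command above sets maxerror 100, the load can succeed even when some rows are rejected. The following is a minimal sketch for checking what was skipped. It assumes your user can read the STL_LOAD_ERRORS system table; the filename filter is only an illustration based on the S3 path used above.

```sql
-- Inspect rows that COPY rejected for this file, most recent first.
select
  starttime,
  filename,
  line_number,
  colname,
  err_reason
from stl_load_errors
where filename like '%ecommerce_data.txt%'
order by starttime desc
limit 20;

-- Confirm how many rows actually landed in the target table.
select count(*) as loaded_rows
from ecommerce_sales;
```

If the first query returns rows, compare err_reason with the column definitions of ecommerce_sales before you continue.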

Amazon Identity and Access Management

http://aws.amazon.com/iam

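Data preparation for the ML model

Now that our dataset is loaded, we can optionally split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Amazon Redshift ML Autopilot automatically splits the data into training and validation sets, but splitting it here as well lets you verify the model's accuracy yourself. In addition, we calculate the number of months each customer was active, because we want the model to predict that value on new data. We use the random function in our SQL statements to split the data. See the following code: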

create table ecommerce_sales_data as (
  select
    t1.stockcode,
    t1.description,
    t1.invoicedate,
    t1.customerid,
    t1.country,
    t1.sales_amt,
    -- random bucket used later to split rows into the three sets
    cast(random() * 100 as int) as data_group_id
  from
    (
      select
        stockcode,
        description,
        invoicedate,
        customerid,
        country,
        sum(quantity * unitprice) as sales_amt
      from
        ecommerce_sales
      group by
        1, 2, 3, 4, 5
    ) t1
);

Training set

create table ecommerce_sales_training as (
  select
    a.customerid,
    a.country,
    a.stockcode,
    a.description,
    a.invoicedate,
    a.sales_amt,
    (b.nbr_months_active) as nbr_months_active
  from
    ecommerce_sales_data a
    inner join (
      -- count the distinct year-month values in which each customer has a sale
      select
        customerid,
        count(
          distinct(
            DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
              DATE_PART(mon, cast(invoicedate as date)),
              2,
              '00'
            )
          )
        ) as nbr_months_active
      from
        ecommerce_sales_data
      group by
        1
    ) b on a.customerid = b.customerid
  where
    a.data_group_id < 80
);
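
The use case allows for up to 13 classes, so it can help to look at how the target is distributed before training. The following sketch uses only the ecommerce_sales_training table and columns defined above; the column aliases are chosen here for illustration.

```sql
-- One row per target class: how many customers and training rows fall in it.
select
  nbr_months_active,
  count(distinct customerid) as customers,
  count(*) as training_rows
from ecommerce_sales_training
group by 1
order by 1;
```

Classes with very few rows are worth noting, because an overall accuracy figure can hide weak predictions on rare classes.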

Validation set

create table ecommerce_sales_validation as (
  select
    a.customerid,
    a.country,
    a.stockcode,
    a.description,
    a.invoicedate,
    a.sales_amt,
    (b.nbr_months_active) as nbr_months_active
  from
    ecommerce_sales_data a
    inner join (
      -- same per-customer month count as in the training set
      select
        customerid,
        count(
          distinct(
            DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
              DATE_PART(mon, cast(invoicedate as date)),
              2,
              '00'
            )
          )
        ) as nbr_months_active
      from
        ecommerce_sales_data
      group by
        1
    ) b on a.customerid = b.customerid
  where
    -- BETWEEN is inclusive, so this selects buckets 80 through 90
    a.data_group_id between 80 and 90
);

Prediction set

create table ecommerce_sales_prediction as (
  select
    customerid,
    country,
    stockcode,
    description,
    invoicedate,
    sales_amt
  from
    ecommerce_sales_data
  where
    data_group_id > 90
);
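
The split is random, so the three tables will only approximate the 80/10/10 target. Two details in the statements above also shift the proportions slightly: the validation filter is inclusive on both ends (buckets 80 through 90), and the inner join on customerid in the training and validation sets drops any rows whose customerid is null, while the prediction set keeps them. The following is a quick sanity check using only the tables created above.

```sql
-- Row counts per split; compare against the intended 80/10/10 proportions.
select 'training' as data_split, count(*) as row_count
from ecommerce_sales_training
union all
select 'validation' as data_split, count(*) as row_count
from ecommerce_sales_validation
union all
select 'prediction' as data_split, count(*) as row_count
from ecommerce_sales_prediction;
```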

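Create the model in Amazon Redshift

Now that we have created our training and validation datasets, we can use the create model statement in Amazon Redshift to create our ML model with the Multiclass_Classification problem type. We specify the problem type and let AutoML take care of everything else. In this model, the target we want to predict is nbr_months_active. Amazon SageMaker creates a function, predict_customer_activity, which we use to run inference in Amazon Redshift. See the following code: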

create model ecommerce_customer_activity
from (
  select
    customerid,
    country,
    stockcode,
    description,
    invoicedate,
    sales_amt,
    nbr_months_active
  from ecommerce_sales_training
)
TARGET nbr_months_active
FUNCTION predict_customer_activity
IAM_ROLE '<<your-amazon-redshift-sagemaker-iam-role-arn>>'
problem_type MULTICLASS_CLASSIFICATION
SETTINGS (
  S3_BUCKET '<<your-amazon-s3-bucket-name>>',
  -- keep the training artifacts in the bucket, which helps with troubleshooting
  S3_GARBAGE_COLLECT OFF
);

To reproduce this script in your environment, replace your-amazon-redshift-sagemaker-iam-role-arn with the Amazon IAM role ARN for your cluster.

create model

https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html

Amazon SageMaker

https://aws.amazon.com/sagemaker/

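Validate predictions

In this step, we evaluate the accuracy of our ML model against our validation data.

When the model is created, Amazon SageMaker Autopilot automatically splits the input data into training and validation sets and selects the model with the best objective metric, which is then deployed in the Amazon Redshift cluster. You can use the show model statement in your cluster to view various metrics, including the accuracy score. If you don't specify an objective explicitly, Amazon SageMaker automatically uses accuracy as the objective. See the following code: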

Show model ecommerce_customer_activity;

Amazon SageMaker Autopilot

https://aws.amazon.com/sagemaker/autopilot/

As the show model output indicates, our model has an accuracy of 0.996580.

Let's run the following SQL inference query against our validation data:

select
  cast(sum(t1.match) as decimal(7,2)) as predicted_matches,
  cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches,
  cast(sum(t1.match + t1.nonmatch) as decimal(7,2)) as total_predictions,
  predicted_matches / total_predictions as pct_accuracy
from (
  select
    customerid,
    country,
    stockcode,
    description,
    invoicedate,
    sales_amt,
    nbr_months_active,
    predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active,
    -- flag each row as a correct or incorrect prediction
    case when nbr_months_active = predicted_months_active then 1 else 0 end as match,
    case when nbr_months_active <> predicted_months_active then 1 else 0 end as nonmatch
  from ecommerce_sales_validation
) t1;

We can see that the prediction accuracy on our validation dataset is 99.74%, which is in line with the accuracy reported by show model.
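
A single overall accuracy figure does not show whether some classes are predicted less reliably than others. The following sketch breaks the same comparison down by actual class. It reuses the predict_customer_activity function and the ecommerce_sales_validation table from above; the output column names are illustrative.

```sql
-- Matches per actual class; divide matches by total_rows for per-class accuracy.
select
  nbr_months_active,
  count(*) as total_rows,
  sum(case when nbr_months_active = predicted_months_active then 1 else 0 end) as matches
from (
  select
    nbr_months_active,
    predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_validation
) t1
group by 1
order by 1;
```

This matters most for the classes at 7 months and above, because those drive the loyalty program decision.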

Now let's run a query to see which customers qualify for our customer loyalty program, using at least 7 active months as the criterion:

select
  customerid,
  predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
from ecommerce_sales_prediction
where predicted_months_active >= 7
group by 1, 2
limit 10;

The following table shows our output.
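
The prediction function is evaluated once per row of ecommerce_sales_prediction, so a customer with several rows can receive more than one predicted value. The following is one possible way to summarize the predictions per customer before deciding on eligibility. It is a sketch rather than the article's method, and the choice to qualify a customer on the highest prediction is an assumption made here for illustration.

```sql
-- Per-customer summary of row-level predictions.
select
  customerid,
  min(predicted_months_active) as min_predicted_months,
  max(predicted_months_active) as max_predicted_months,
  count(*) as scored_rows
from (
  select
    customerid,
    predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_prediction
) t1
group by 1
having max(predicted_months_active) >= 7
order by 1
limit 10;
```

A stricter rule would replace max with min in the having clause, so that every scored row for the customer must reach the 7-month threshold.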

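Troubleshooting

Although the Create Model statement in Amazon Redshift automatically starts the Amazon SageMaker Autopilot process to build, train, and tune the best ML model and deploy it in Amazon Redshift, you can view the intermediate steps performed along the way, which can also help you troubleshoot if something goes wrong. You can also retrieve the AutoML Job Name from the output of the show model command.

When you create the model, you need to set an Amazon Simple Storage Service (Amazon S3) bucket name as the value of the s3_bucket parameter. This bucket is used to share training data and artifacts between Amazon Redshift and Amazon SageMaker. Amazon Redshift creates a subfolder in this bucket to hold the training data. When training is complete, it deletes the subfolder and its contents unless you set the s3_garbage_collect parameter to off, which can be useful for troubleshooting. For more information, see CREATE MODEL.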