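Section 6.4 of 『Rによる機械学習』第2版 (the Japanese edition of Machine Learning with R, 2nd edition; hereafter "the textbook") explains model trees built with the M5' (M5 Prime) algorithm via the `RWeka::M5P` function. When the sample code is run, however, `RWeka::M5P` returns results that differ both from those printed in the textbook and from those obtained by running Weka directly.
This page therefore introduces an alternative to `RWeka::M5P`: modeling with the Cubist algorithm.
Note: This material is a reworked version of the supplementary material for the 6th meeting of the データ分析勉強会 (Data Analysis Study Group) in fiscal year 2019.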
Packages and Datasets
In addition to the standard packages of R version 3.6.1 (2019-07-05), this page uses the following additional packages.
Package | Version | Description |
---|---|---|
Cubist | 0.2.2 | Rule- And Instance-Based Regression Modeling |
ggplot2 | 3.2.1 | Create Elegant Data Visualisations Using the Grammar of Graphics |
tidyverse | 1.2.1 | Easily Install and Load the ‘Tidyverse’ |
This page also uses the following dataset.
Dataset | Package | Version | Description |
---|---|---|---|
wine | NA | NA | dataspelunking/MLwR, GitHub |
Data Overview
The data is the Portuguese white wine quality data introduced in the textbook.
wine
## # A tibble: 4,898 x 12
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.7 0.62 0.24 1.1 0.039
## 2 5.7 0.22 0.2 16 0.044
## 3 5.9 0.19 0.26 7.4 0.034
## 4 5.3 0.47 0.1 1.3 0.036
## 5 6.4 0.290 0.21 9.65 0.041
## 6 7 0.14 0.41 0.9 0.037
## 7 7.9 0.12 0.49 5.2 0.049
## 8 6.6 0.38 0.28 2.8 0.043
## 9 7 0.16 0.3 2.6 0.043
## 10 6.5 0.37 0.33 3.9 0.027
## # … with 4,888 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## # total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <int>
All features are numeric and there are no missing values. The target variable, the quality score (`quality`), appears to be an integer-valued interval scale.
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
fixed.acidity | 0 | 1 | 6.85 | 0.84 | 3.80 | 6.30 | 6.80 | 7.30 | 14.20 | ▁▇▁▁▁ |
volatile.acidity | 0 | 1 | 0.28 | 0.10 | 0.08 | 0.21 | 0.26 | 0.32 | 1.10 | ▇▅▁▁▁ |
citric.acid | 0 | 1 | 0.33 | 0.12 | 0.00 | 0.27 | 0.32 | 0.39 | 1.66 | ▇▆▁▁▁ |
residual.sugar | 0 | 1 | 6.39 | 5.07 | 0.60 | 1.70 | 5.20 | 9.90 | 65.80 | ▇▁▁▁▁ |
chlorides | 0 | 1 | 0.05 | 0.02 | 0.01 | 0.04 | 0.04 | 0.05 | 0.35 | ▇▁▁▁▁ |
free.sulfur.dioxide | 0 | 1 | 35.31 | 17.01 | 2.00 | 23.00 | 34.00 | 46.00 | 289.00 | ▇▁▁▁▁ |
total.sulfur.dioxide | 0 | 1 | 138.36 | 42.50 | 9.00 | 108.00 | 134.00 | 167.00 | 440.00 | ▂▇▂▁▁ |
density | 0 | 1 | 0.99 | 0.00 | 0.99 | 0.99 | 0.99 | 1.00 | 1.04 | ▇▂▁▁▁ |
pH | 0 | 1 | 3.19 | 0.15 | 2.72 | 3.09 | 3.18 | 3.28 | 3.82 | ▁▇▇▂▁ |
sulphates | 0 | 1 | 0.49 | 0.11 | 0.22 | 0.41 | 0.47 | 0.55 | 1.08 | ▃▇▂▁▁ |
alcohol | 0 | 1 | 10.51 | 1.23 | 8.00 | 9.50 | 10.40 | 11.40 | 14.20 | ▃▇▆▃▁ |
quality | 0 | 1 | 5.88 | 0.89 | 3.00 | 5.00 | 6.00 | 6.00 | 9.00 | ▁▅▇▃▁ |
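The two observations above (every column is numeric, nothing is missing) can also be checked without a summary table. A minimal base-R sketch, using a small hypothetical data frame `toy` as a stand-in for `wine`:

```r
# Hypothetical stand-in for `wine`; replace with the real data frame.
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1),
  pH      = c(3.00, 3.30, 3.26),
  quality = c(6L, 6L, 5L)
)

# TRUE for every column that is numeric (integer columns count as numeric).
all_numeric <- vapply(toy, is.numeric, logical(1))

# Number of missing values in each column.
n_missing <- colSums(is.na(toy))

all(all_numeric)  # are all columns numeric?
sum(n_missing)    # total number of missing values
```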
Splitting the Data
The `wine` dataset is split into a training set and a test set, using the same split as the textbook.
wine_train <- wine[1:3750, ]
wine_train
## # A tibble: 3,750 x 12
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.7 0.62 0.24 1.1 0.039
## 2 5.7 0.22 0.2 16 0.044
## 3 5.9 0.19 0.26 7.4 0.034
## 4 5.3 0.47 0.1 1.3 0.036
## 5 6.4 0.290 0.21 9.65 0.041
## 6 7 0.14 0.41 0.9 0.037
## 7 7.9 0.12 0.49 5.2 0.049
## 8 6.6 0.38 0.28 2.8 0.043
## 9 7 0.16 0.3 2.6 0.043
## 10 6.5 0.37 0.33 3.9 0.027
## # … with 3,740 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## # total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <int>
wine_test <- wine[3751:4898, ]
wine_test
## # A tibble: 1,148 x 12
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 0.33 0.28 5.7 0.033
## 2 7.4 0.39 0.23 7 0.033
## 3 6.9 0.14 0.38 1 0.041
## 4 6.5 0.18 0.290 1.7 0.035
## 5 6.8 0.28 0.44 11.5 0.04
## 6 7.3 0.4 0.28 6.5 0.037
## 7 6.1 0.32 0.33 10.7 0.036
## 8 6.8 0.35 0.44 6.5 0.056
## 9 6 0.28 0.27 15.5 0.036
## 10 6.3 0.24 0.290 13.7 0.035
## # … with 1,138 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## # total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <int>
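Splitting by row position, as above, only yields a fair test set when the rows are already in random order. The following sketch wraps the same idea in a reusable function and adds a shuffled variant for data whose order is not random. The names `split_by_position` and `toy` are hypothetical and do not come from the textbook.

```r
# Split a data frame into the first `n_train` rows and the remaining rows.
# With shuffle = TRUE the row order is randomized first.
split_by_position <- function(data, n_train, shuffle = FALSE, seed = NULL) {
  stopifnot(n_train >= 1, n_train < nrow(data))
  idx <- seq_len(nrow(data))
  if (shuffle) {
    if (!is.null(seed)) set.seed(seed)
    idx <- sample(idx)
  }
  list(
    train = data[idx[seq_len(n_train)], , drop = FALSE],
    test  = data[idx[(n_train + 1):nrow(data)], , drop = FALSE]
  )
}

# Toy data standing in for `wine`.
toy <- data.frame(id = 1:10, quality = c(5, 6, 6, 7, 5, 6, 8, 5, 6, 7))

parts    <- split_by_position(toy, n_train = 7)
shuffled <- split_by_position(toy, n_train = 7, shuffle = TRUE, seed = 1)

nrow(parts$train)  # rows in the training part
nrow(parts$test)   # rows in the test part
```

For the data on this page, `split_by_position(wine, 3750)` would correspond to the positional split shown above.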
Package Cubist
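Cubist is an algorithm that extends the M5 family of model trees, to which M5' (M5 Prime) belongs. A C implementation (Cubist GPL C code) is published on the RuleQuest website. Max Kuhn (RStudio, Inc.), the author of the `caret` package, built the `Cubist` package for R on top of this code.
The Cubist algorithm itself is not explained here; please look it up on Google Scholar or a similar service.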
Installing the Package
The `Cubist` package is available on CRAN, so it can be installed as follows.
install.packages("Cubist")
Once installation has finished, load the `Cubist` package.
library(Cubist)
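Modeling
Now that everything is in place for modeling with the `Cubist` package, we first fit a model on the training data. Note that, unlike the `RWeka::M5P` function, `Cubist::cubist` does not accept a formula.
Fitting the Model
Pass the training data without the target variable to argument `x`, and the target variable to argument `y`.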
m_wine <- Cubist::cubist(x = wine_train[, -12], y = wine_train[, 12])
m_wine
##
## Call:
## cubist.default(x = wine_train[, -12], y = wine_train[, 12])
##
## Number of samples: 3750
## Number of predictors: 11
##
## Number of committees: 1
## Number of rules: 25
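Selecting columns by position (`-12` and `12`) depends on the column order. The same `x`/`y` preparation can be sketched by column name instead, shown here on a hypothetical data frame `toy`. Note that `[[` returns a plain vector, whereas `[, 12]` on a tibble returns a one-column tibble.

```r
# Hypothetical stand-in for `wine_train`.
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1),
  pH      = c(3.00, 3.30, 3.26),
  quality = c(6L, 6L, 5L)
)

target <- "quality"

# Predictors: every column except the target, kept as a data frame.
x <- toy[, setdiff(names(toy), target), drop = FALSE]

# Outcome: a plain numeric vector.
y <- toy[[target]]

names(x)
is.numeric(y)
```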
Use the `summary` function to see the details of the fitted model.
summary(m_wine)
##
## Call:
## cubist.default(x = wine_train[, -12], y = wine_train[, 12])
##
##
## Cubist [Release 2.07 GPL Edition] Sun Nov 24 14:42:54 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 3750 cases (12 attributes) from undefined.data
##
## Model:
##
## Rule 1: [21 cases, mean 5.0, range 4 to 6, est err 0.5]
##
## if
## free.sulfur.dioxide > 30
## total.sulfur.dioxide > 195
## total.sulfur.dioxide <= 235
## sulphates > 0.64
## alcohol > 9.1
## then
## outcome = 573.6 + 0.0478 total.sulfur.dioxide - 573 density
## - 0.788 alcohol + 0.186 residual.sugar - 4.73 volatile.acidity
##
## Rule 2: [28 cases, mean 5.0, range 4 to 8, est err 0.7]
##
## if
## volatile.acidity > 0.31
## citric.acid <= 0.36
## residual.sugar <= 1.45
## total.sulfur.dioxide <= 97
## alcohol > 9.1
## then
## outcome = 168.2 + 4.75 citric.acid + 0.0123 total.sulfur.dioxide
## - 170 density + 0.057 residual.sugar - 6.4 chlorides + 0.84 pH
## + 0.14 fixed.acidity
##
## Rule 3: [171 cases, mean 5.1, range 3 to 6, est err 0.3]
##
## if
## volatile.acidity > 0.205
## chlorides <= 0.054
## density <= 0.99839
## alcohol <= 9.1
## then
## outcome = 147.4 - 144 density + 0.08 residual.sugar + 0.117 alcohol
## - 0.87 volatile.acidity - 0.09 pH - 0.01 fixed.acidity
##
## Rule 4: [37 cases, mean 5.3, range 3 to 6, est err 0.5]
##
## if
## free.sulfur.dioxide > 30
## total.sulfur.dioxide > 235
## alcohol > 9.1
## then
## outcome = 19.5 - 0.013 total.sulfur.dioxide - 2.7 volatile.acidity
## - 10 density + 0.005 residual.sugar + 0.008 alcohol
##
## Rule 5: [64 cases, mean 5.3, range 5 to 6, est err 0.3]
##
## if
## volatile.acidity > 0.205
## residual.sugar > 17.85
## then
## outcome = -23.6 + 0.233 alcohol - 5.2 chlorides - 0.75 citric.acid
## + 28 density - 0.81 volatile.acidity - 0.19 pH
## - 0.002 residual.sugar
##
## Rule 6: [56 cases, mean 5.3, range 4 to 7, est err 0.6]
##
## if
## fixed.acidity <= 7.1
## volatile.acidity > 0.205
## chlorides > 0.054
## density <= 0.99839
## alcohol <= 9.1
## then
## outcome = 40.6 + 0.374 alcohol - 1.62 volatile.acidity
## + 0.026 residual.sugar - 38 density - 0.21 pH
## - 0.01 fixed.acidity
##
## Rule 7: [337 cases, mean 5.3, range 3 to 7, est err 0.4]
##
## if
## fixed.acidity <= 7.8
## volatile.acidity > 0.305
## chlorides <= 0.09
## free.sulfur.dioxide <= 82.5
## total.sulfur.dioxide > 130
## total.sulfur.dioxide <= 235
## sulphates <= 0.64
## alcohol <= 10.4
## then
## outcome = -32.1 + 0.233 alcohol - 9.7 chlorides
## + 0.0038 total.sulfur.dioxide - 0.0081 free.sulfur.dioxide
## + 35 density + 0.81 volatile.acidity
##
## Rule 8: [30 cases, mean 5.5, range 3 to 7, est err 0.5]
##
## if
## fixed.acidity > 7.1
## volatile.acidity > 0.205
## chlorides > 0.054
## density <= 0.99839
## alcohol <= 9.1
## then
## outcome = 244 - 1.56 fixed.acidity - 228 density
## + 0.0252 free.sulfur.dioxide - 7.3 chlorides
## - 0.19 volatile.acidity + 0.003 residual.sugar
##
## Rule 9: [98 cases, mean 5.5, range 4 to 8, est err 0.5]
##
## if
## volatile.acidity > 0.155
## chlorides > 0.09
## total.sulfur.dioxide <= 235
## sulphates <= 0.64
## then
## outcome = 55.9 - 3.85 volatile.acidity - 52 density
## + 0.023 residual.sugar + 0.092 alcohol + 0.35 pH
## + 0.05 fixed.acidity + 0.3 sulphates
## + 0.001 free.sulfur.dioxide
##
## Rule 10: [446 cases, mean 5.6, range 4 to 8, est err 0.5]
##
## if
## fixed.acidity <= 7.8
## volatile.acidity > 0.155
## volatile.acidity <= 0.305
## chlorides <= 0.09
## free.sulfur.dioxide <= 82.5
## total.sulfur.dioxide > 130
## total.sulfur.dioxide <= 235
## sulphates <= 0.64
## alcohol > 9.1
## alcohol <= 10.4
## then
## outcome = 15.1 + 0.35 alcohol - 3.09 volatile.acidity - 14.7 chlorides
## + 1.16 sulphates - 0.0022 total.sulfur.dioxide
## + 0.11 fixed.acidity + 0.45 pH + 0.5 citric.acid - 14 density
## + 0.006 residual.sugar
##
## Rule 11: [31 cases, mean 5.6, range 3 to 8, est err 0.8]
##
## if
## volatile.acidity > 0.31
## citric.acid > 0.36
## free.sulfur.dioxide <= 30
## total.sulfur.dioxide <= 97
## then
## outcome = 3.2 + 0.0584 total.sulfur.dioxide + 7.77 volatile.acidity
## + 0.328 alcohol - 9 density + 0.003 residual.sugar
##
## Rule 12: [20 cases, mean 5.7, range 3 to 8, est err 0.9]
##
## if
## free.sulfur.dioxide > 82.5
## total.sulfur.dioxide <= 235
## sulphates <= 0.64
## alcohol > 9.1
## then
## outcome = -8.9 + 109.3 chlorides + 0.948 alcohol
##
## Rule 13: [331 cases, mean 5.8, range 4 to 8, est err 0.5]
##
## if
## volatile.acidity > 0.31
## free.sulfur.dioxide <= 30
## total.sulfur.dioxide > 97
## alcohol > 9.1
## then
## outcome = 89.8 + 0.0234 free.sulfur.dioxide + 0.324 alcohol
## + 0.07 residual.sugar - 90 density - 1.47 volatile.acidity
## + 0.48 pH
##
## Rule 14: [116 cases, mean 5.8, range 3 to 8, est err 0.6]
##
## if
## fixed.acidity > 7.8
## volatile.acidity > 0.155
## free.sulfur.dioxide > 30
## total.sulfur.dioxide > 130
## total.sulfur.dioxide <= 235
## sulphates <= 0.64
## alcohol > 9.1
## then
## outcome = 6 + 0.346 alcohol - 0.41 fixed.acidity - 1.69 volatile.acidity
## - 2.9 chlorides + 0.19 sulphates + 0.07 pH
##
## Rule 15: [115 cases, mean 5.8, range 4 to 7, est err 0.5]
##
## if
## volatile.acidity > 0.205
## residual.sugar <= 17.85
## density > 0.99839
## alcohol <= 9.1
## then
## outcome = -110.2 + 120 density - 3.46 volatile.acidity - 0.97 pH
## - 0.022 residual.sugar + 0.088 alcohol - 0.6 citric.acid
## - 0.01 fixed.acidity
##
## Rule 16: [986 cases, mean 5.9, range 3 to 9, est err 0.6]
##
## if
## volatile.acidity <= 0.31
## free.sulfur.dioxide <= 30
## alcohol > 9.1
## then
## outcome = 280.4 - 282 density + 0.128 residual.sugar
## + 0.0264 free.sulfur.dioxide - 3 volatile.acidity + 1.2 pH
## + 0.65 citric.acid + 0.09 fixed.acidity + 0.56 sulphates
## + 0.015 alcohol
##
## Rule 17: [49 cases, mean 6.0, range 5 to 8, est err 0.5]
##
## if
## volatile.acidity > 0.155
## residual.sugar > 8.8
## free.sulfur.dioxide > 30
## total.sulfur.dioxide <= 130
## pH <= 3.26
## alcohol > 9.1
## then
## outcome = 173.5 - 169 density + 0.055 alcohol + 0.38 sulphates
## + 0.002 residual.sugar
##
## Rule 18: [114 cases, mean 6.1, range 3 to 9, est err 0.6]
##
## if
## volatile.acidity > 0.31
## citric.acid <= 0.36
## residual.sugar > 1.45
## total.sulfur.dioxide <= 97
## alcohol > 9.1
## then
## outcome = 302.3 - 305 density + 0.0128 total.sulfur.dioxide
## + 0.096 residual.sugar + 1.94 citric.acid + 1.05 pH
## + 0.17 fixed.acidity - 6.7 chlorides
## + 0.0022 free.sulfur.dioxide - 0.21 volatile.acidity
## + 0.013 alcohol + 0.09 sulphates
##
## Rule 19: [145 cases, mean 6.1, range 5 to 8, est err 0.6]
##
## if
## volatile.acidity > 0.155
## free.sulfur.dioxide > 30
## total.sulfur.dioxide <= 195
## sulphates > 0.64
## then
## outcome = 206 - 209 density + 0.069 residual.sugar + 0.38 fixed.acidity
## + 2.79 sulphates + 0.0155 free.sulfur.dioxide
## - 0.0051 total.sulfur.dioxide - 1.71 citric.acid + 1.04 pH
##
## Rule 20: [555 cases, mean 6.1, range 3 to 9, est err 0.6]
##
## if
## total.sulfur.dioxide > 130
## total.sulfur.dioxide <= 235
## sulphates <= 0.64
## alcohol > 10.4
## then
## outcome = 108 + 0.276 alcohol - 109 density + 0.05 residual.sugar
## + 0.77 pH - 1.02 volatile.acidity - 4.2 chlorides
## + 0.78 sulphates + 0.08 fixed.acidity
## + 0.0016 free.sulfur.dioxide - 0.0003 total.sulfur.dioxide
##
## Rule 21: [73 cases, mean 6.2, range 4 to 8, est err 0.4]
##
## if
## volatile.acidity > 0.155
## citric.acid <= 0.28
## residual.sugar <= 8.8
## free.sulfur.dioxide > 30
## total.sulfur.dioxide <= 130
## pH <= 3.26
## sulphates <= 0.64
## alcohol > 9.1
## then
## outcome = 4.2 + 0.147 residual.sugar + 0.47 alcohol + 3.75 sulphates
## - 2.5 volatile.acidity - 5 density
##
## Rule 22: [244 cases, mean 6.3, range 4 to 8, est err 0.6]
##
## if
## citric.acid > 0.28
## residual.sugar <= 8.8
## free.sulfur.dioxide > 30
## total.sulfur.dioxide <= 130
## pH <= 3.26
## then
## outcome = 40.1 + 0.278 alcohol + 1.3 sulphates - 39 density
## + 0.017 residual.sugar + 0.001 total.sulfur.dioxide + 0.17 pH
## + 0.03 fixed.acidity
##
## Rule 23: [106 cases, mean 6.3, range 4 to 8, est err 0.6]
##
## if
## volatile.acidity <= 0.155
## free.sulfur.dioxide > 30
## then
## outcome = 139.1 - 138 density + 0.058 residual.sugar + 0.71 pH
## + 0.92 sulphates + 0.11 fixed.acidity - 0.73 volatile.acidity
## + 0.055 alcohol - 0.0012 total.sulfur.dioxide
## + 0.0007 free.sulfur.dioxide
##
## Rule 24: [137 cases, mean 6.5, range 4 to 9, est err 0.6]
##
## if
## volatile.acidity > 0.155
## free.sulfur.dioxide > 30
## total.sulfur.dioxide <= 130
## pH > 3.26
## sulphates <= 0.64
## alcohol > 9.1
## then
## outcome = 114.2 + 0.0142 total.sulfur.dioxide - 107 density
## - 11.8 chlorides - 1.57 pH + 0.124 alcohol + 1.21 sulphates
## + 1.16 volatile.acidity + 0.021 residual.sugar
## + 0.04 fixed.acidity
##
## Rule 25: [92 cases, mean 6.5, range 4 to 8, est err 0.6]
##
## if
## volatile.acidity <= 0.205
## alcohol <= 9.1
## then
## outcome = -200.7 + 210 density + 5.88 volatile.acidity + 23.9 chlorides
## - 2.83 citric.acid - 1.17 pH
##
##
## Evaluation on training data (3750 cases):
##
## Average |error| 0.5
## Relative |error| 0.67
## Correlation coefficient 0.66
##
##
## Attribute usage:
## Conds Model
##
## 84% 93% alcohol
## 80% 89% volatile.acidity
## 70% 61% free.sulfur.dioxide
## 63% 50% total.sulfur.dioxide
## 44% 70% sulphates
## 26% 44% chlorides
## 22% 76% fixed.acidity
## 16% 87% residual.sugar
## 11% 86% pH
## 11% 45% citric.acid
## 8% 97% density
##
##
## Time: 0.3 secs
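Each rule pairs a set of conditions with a linear model. As a worked illustration, Rule 12 from the output above can be written as two plain R functions. This is a sketch of a single rule only: as far as I understand, Cubist combines the predictions of all rules that a case satisfies, so the value below is not necessarily the model's final prediction. The input values are invented for the example.

```r
# Rule 12: conditions copied from the summary output.
rule12_applies <- function(d) {
  d$free.sulfur.dioxide > 82.5 &
    d$total.sulfur.dioxide <= 235 &
    d$sulphates <= 0.64 &
    d$alcohol > 9.1
}

# Rule 12: linear model copied from the summary output.
rule12_predict <- function(d) {
  -8.9 + 109.3 * d$chlorides + 0.948 * d$alcohol
}

# Hypothetical wine (values invented for this example).
w <- data.frame(
  free.sulfur.dioxide  = 90,
  total.sulfur.dioxide = 200,
  sulphates            = 0.5,
  alcohol              = 10,
  chlorides            = 0.04
)

rule12_applies(w)  # TRUE: all four conditions hold
rule12_predict(w)  # -8.9 + 109.3 * 0.04 + 0.948 * 10 = 4.952
```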
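Accessing Model Data
When there are many rules, reading them through the `summary` function is cumbersome. Individual pieces of information can instead be inspected through the components of the model object, as shown below.
Rules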
m_wine$splits
## # A tibble: 119 x 8
## committee rule variable dir value category type percentile
## <dbl> <dbl> <fct> <fct> <dbl> <fct> <chr> <dbl>
## 1 1 1 sulphates > 0.64 "" type2 0.910
## 2 1 1 total.sulfur.dioxide > 195 "" type2 0.902
## 3 1 1 total.sulfur.dioxide <= 235 "" type2 0.985
## 4 1 1 alcohol > 9.1 "" type2 0.135
## 5 1 1 free.sulfur.dioxide > 30 "" type2 0.423
## 6 1 2 volatile.acidity > 0.31 "" type2 0.719
## 7 1 2 residual.sugar <= 1.45 "" type2 0.176
## 8 1 2 total.sulfur.dioxide <= 97 "" type2 0.16
## 9 1 2 citric.acid <= 0.36 "" type2 0.693
## 10 1 2 alcohol > 9.1 "" type2 0.135
## # … with 109 more rows
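The `splits` table holds one row per condition, so the conditions of a rule can be reassembled into readable text. The sketch below uses a hand-copied data frame with the first five rows shown above (only the columns needed here); with a fitted model, `m_wine$splits` would take its place. The helper name `rule_text` is hypothetical.

```r
# First five rows of the splits table, copied by hand from the output above.
splits <- data.frame(
  rule     = c(1, 1, 1, 1, 1),
  variable = c("sulphates", "total.sulfur.dioxide", "total.sulfur.dioxide",
               "alcohol", "free.sulfur.dioxide"),
  dir      = c(">", ">", "<=", ">", ">"),
  value    = c(0.64, 195, 235, 9.1, 30),
  stringsAsFactors = FALSE
)

# Join all conditions of one rule into a single string.
rule_text <- function(splits, rule_id) {
  s <- splits[splits$rule == rule_id, ]
  paste(s$variable, s$dir, s$value, collapse = " & ")
}

rule_text(splits, 1)
```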
Regression Equation of Each Rule (Coefficient Table)
m_wine$coefficients
## # A tibble: 25 x 14
## `(Intercept)` fixed.acidity volatile.acidity citric.acid residual.sugar
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 574. NA -4.73 NA 0.186
## 2 168. 0.14 NA 4.75 0.057
## 3 147. -0.01 -0.87 NA 0.08
## 4 19.5 NA -2.7 NA 0.005
## 5 -23.6 NA -0.81 -0.75 -0.002
## 6 40.6 -0.01 -1.62 NA 0.026
## 7 -32.1 NA 0.81 NA NA
## 8 244 -1.56 -0.19 NA 0.003
## 9 55.9 0.05 -3.85 NA 0.023
## 10 15.1 0.11 -3.09 0.5 0.006
## # … with 15 more rows, and 9 more variables: chlorides <dbl>,
## # free.sulfur.dioxide <dbl>, total.sulfur.dioxide <dbl>, density <dbl>,
## # pH <dbl>, sulphates <dbl>, alcohol <dbl>, committee <chr>, rule <chr>
Usage of Each Feature (Attribute Usage)
m_wine$usage
## # A tibble: 11 x 3
## Conditions Model Variable
## <dbl> <dbl> <chr>
## 1 84 93 alcohol
## 2 80 89 volatile.acidity
## 3 70 61 free.sulfur.dioxide
## 4 63 50 total.sulfur.dioxide
## 5 44 70 sulphates
## 6 26 44 chlorides
## 7 22 76 fixed.acidity
## 8 16 87 residual.sugar
## 9 11 86 pH
## 10 11 45 citric.acid
## 11 8 97 density
Prediction with the Fitted Model
Use the `predict` function to make predictions.
p_wine <- predict(m_wine, wine_test[, -12])
p_wine
## [1] 6.659258 5.844499 6.364941 6.064602 5.660200 6.807702 6.049500 5.046499
## [9] 6.860379 5.976668 5.020800 5.463120 5.834602 6.708495 5.657901 5.607023
## [17] 6.403296 6.095160 5.482300 6.087520 5.721400 5.780000 5.556600 5.302051
## [25] 5.987817 6.208598 5.110150 4.920499 6.025604 5.617798 5.297199 5.883048
## [33] 4.927701 6.809091 5.538096 5.976700 5.064700 6.087401 6.248298 6.091101
## [41] 5.876072 5.885598 6.005801 6.296202 5.801429 4.889294 5.720280 6.224800
## [49] 5.334001 6.198160 6.394821 5.870597 6.690900 6.439704 6.658896 5.541197
## [57] 6.083459 6.643502 5.122796 5.963795 5.415951 5.508580 5.770601 5.856198
## [65] 5.208998 4.560801 5.687980 5.630600 5.290480 6.036205 6.822050 5.707800
## [73] 4.954700 5.018999 5.122999 6.128472 7.081603 5.566800 5.604201 4.660500
## [81] 5.156402 5.547500 6.774819 6.474690 5.695000 5.627905 5.654605 6.872820
## [89] 6.883808 6.065700 5.688950 6.337118 5.287049 6.414652 5.921400 5.314601
## [97] 5.060899 6.755995 6.193338 5.357949 6.319558 6.728499 7.078003 5.624503
## [105] 6.085704 5.606606 5.505495 6.198494 5.165200 5.861100 6.292290 6.365801
## [113] 5.232293 6.209298 5.640262 6.291306 5.423500 5.821182 5.965500 5.849319
## [121] 5.776904 6.073902 5.205908 5.904900 6.489145 7.393163 5.351200 5.819599
## [129] 3.964800 5.898099 5.915450 5.912299 6.062420 4.995601 5.681108 6.255775
## [137] 5.002101 6.124820 6.118924 6.033196 6.025000 5.609650 5.253720 5.981349
## [145] 5.725399 5.479798 6.141408 6.684260 4.877749 5.228301 6.015901 5.235496
## [153] 4.935503 4.986503 6.741941 6.457150 6.008400 6.187693 5.916946 5.907000
## [161] 6.073363 5.063838 5.765705 6.391996 5.771399 6.162497 5.989305 4.982701
## [169] 6.141501 4.979099 4.656431 5.990803 5.757955 5.079001 6.141650 5.252283
## [177] 4.832384 5.745753 6.376420 5.024202 5.188597 5.251720 5.095301 5.657477
## [185] 6.534503 5.895433 5.082615 6.148260 5.120520 6.247309 5.139350 6.264918
## [193] 5.212293 6.524900 6.205099 7.332300 5.887001 6.465540 6.656764 7.332300
## [201] 6.597508 5.757850 6.712619 5.364055 5.921300 6.234360 6.057098 6.013013
## [209] 6.027441 6.203901 5.049798 5.095304 5.046350 6.212729 4.927251 5.860797
## [217] 5.072300 4.951750 5.286199 5.509000 5.929882 6.522807 5.342800 5.139350
## [225] 6.971497 5.954150 5.959404 6.170881 6.362580 5.807500 6.179539 6.349300
## [233] 6.676600 5.912284 6.115821 4.864610 6.196860 5.602254 6.046200 6.150400
## [241] 5.015887 5.883500 5.419900 5.684118 5.230598 6.137882 6.751918 5.832820
## [249] 6.586703 5.971000 6.043880 5.057950 5.330704 6.098418 5.630600 6.005701
## [257] 6.180904 6.216792 6.596248 6.241800 5.815797 6.176993 5.831659 5.567500
## [265] 6.463100 5.569903 5.892400 5.907351 5.593300 5.942508 6.092402 6.028201
## [273] 6.664892 6.148402 5.814898 6.028577 4.956450 6.452750 6.110602 6.083180
## [281] 6.128197 6.333541 5.323302 5.650138 5.480005 5.645607 4.560197 5.975801
## [289] 5.803627 5.408300 5.658998 5.159903 6.119817 5.181401 6.701000 6.237262
## [297] 6.678019 5.792400 6.007006 6.071520 6.645163 6.892098 5.753151 5.936296
## [305] 4.297103 5.536800 6.096602 5.856400 5.938519 5.464200 5.984100 5.744174
## [313] 5.691799 6.323923 6.108502 6.100698 3.910699 6.206235 4.274251 6.650401
## [321] 5.571393 5.376600 5.870100 5.038400 6.066453 6.658896 5.036100 5.827404
## [329] 5.936980 6.728167 5.837597 6.037111 6.305277 4.868566 4.842704 5.627882
## [337] 6.708495 6.231677 5.323839 6.256697 5.535230 5.418820 6.485672 4.916881
## [345] 6.240296 5.236900 4.448452 5.938003 5.261999 5.906497 5.291998 6.847675
## [353] 5.310699 5.982279 6.413097 6.441307 5.554101 5.407737 5.547800 6.361557
## [361] 5.349100 5.963501 6.576892 6.070291 6.700603 5.995495 5.604640 5.259476
## [369] 5.323304 6.213867 5.706500 6.114465 5.297700 6.089847 6.094703 6.342673
## [377] 5.156851 5.523335 4.828249 6.147403 6.176899 5.999249 5.499171 6.066799
## [385] 4.516747 6.029399 5.115850 6.247022 5.909700 5.256500 6.502111 5.897400
## [393] 6.012151 5.605100 6.080605 6.935369 6.192259 6.135900 5.761906 5.166643
## [401] 6.259996 5.885508 6.058300 5.808098 5.683499 5.773451 5.851030 5.761700
## [409] 5.491820 6.821696 6.708198 5.582720 5.523416 6.379098 6.442420 6.163950
## [417] 6.232393 6.435319 6.003997 5.910900 6.979352 5.251299 5.224201 4.971149
## [425] 5.943795 6.273510 6.182400 5.421406 6.055199 6.225900 5.619095 5.580100
## [433] 5.072899 5.039600 6.988250 4.858197 6.526549 6.301540 6.135341 6.950603
## [441] 5.349396 5.402660 5.253720 5.559599 6.522439 6.600161 6.247309 5.748979
## [449] 5.726102 6.659905 5.300299 6.422603 5.383270 5.211496 5.834547 5.850740
## [457] 5.243101 5.187803 5.899003 5.932400 6.048882 5.783200 5.882154 4.953900
## [465] 5.743400 5.621701 6.015601 5.025400 6.376256 6.721801 5.536900 5.937649
## [473] 5.935579 5.509000 6.371346 5.489001 6.717050 6.445799 4.869594 6.409091
## [481] 6.216101 5.169299 5.892896 5.967098 5.758295 5.931001 5.541380 5.824501
## [489] 6.024201 4.928300 6.728167 5.466300 5.531082 5.231508 6.129747 4.852400
## [497] 5.325600 5.114250 5.873619 5.848603 5.359402 6.651520 6.600161 6.340298
## [505] 5.657901 6.486323 5.824195 6.035903 4.768300 6.750499 5.068100 7.356731
## [513] 6.144232 6.335707 5.831659 6.307021 5.043900 6.272141 5.307784 5.442504
## [521] 5.549200 5.116440 6.277101 5.983872 5.133107 5.137702 6.140602 6.440979
## [529] 5.837600 5.948350 6.395962 5.738405 7.170300 6.580151 5.907097 5.414400
## [537] 6.526397 5.439866 5.660093 6.860140 6.364600 5.262201 5.952605 5.653254
## [545] 4.309250 6.340956 6.308275 5.656240 6.066873 6.222020 6.003820 5.870505
## [553] 6.469800 5.672006 5.907351 5.069899 6.493000 5.677155 5.672200 6.569219
## [561] 6.128472 6.016199 5.147099 3.676699 6.182357 6.042942 5.948553 5.344000
## [569] 6.376256 5.057950 6.478027 6.240959 5.670448 5.667202 5.328745 6.087617
## [577] 5.935579 5.316500 6.540391 5.172602 6.301834 4.889302 5.379850 5.901001
## [585] 5.924298 7.078003 6.129747 5.362700 5.324201 4.704400 5.191002 6.100194
## [593] 5.658400 5.950022 6.145100 7.013124 5.047002 5.181499 5.112749 6.263801
## [601] 5.948708 6.625402 5.800001 5.873619 6.016357 5.495920 6.164698 4.591502
## [609] 5.231795 5.219400 5.508300 5.161800 5.296561 6.331502 5.051400 6.728100
## [617] 5.144301 6.100800 5.317331 5.976200 5.151300 5.547793 5.209200 6.315360
## [625] 5.038843 6.166399 6.373600 5.645900 6.474690 5.981101 5.962847 6.855156
## [633] 6.655737 5.811997 6.096885 6.823597 6.099460 6.527140 6.008697 6.462603
## [641] 5.614000 4.782200 5.677100 6.662760 5.586205 6.073691 5.578401 6.091897
## [649] 6.497147 5.152400 5.755108 7.078003 6.927901 3.964200 5.593407 6.019049
## [657] 5.997200 4.837394 5.608840 5.975662 5.963501 4.535502 5.796707 6.106650
## [665] 6.321381 4.751598 6.635900 4.694601 5.412158 5.098599 6.192700 5.599420
## [673] 5.355300 7.007197 5.095301 6.469934 3.708900 5.333811 5.812408 5.607581
## [681] 5.051307 5.580100 6.676600 7.111102 6.441307 5.435600 6.771202 6.521250
## [689] 5.404095 5.348401 5.529400 5.598180 5.570900 5.882154 6.003796 6.527601
## [697] 5.992297 5.682400 5.251720 5.924298 5.128600 5.434704 5.860197 5.901001
## [705] 6.777926 6.444737 6.598104 6.010632 5.643216 5.640380 5.990102 6.452179
## [713] 6.496585 5.243500 6.651401 5.765705 5.419794 6.823323 6.249052 5.419900
## [721] 5.219894 4.908301 5.137600 4.858149 6.370603 5.123077 6.452652 5.730199
## [729] 6.730374 5.892404 4.892200 6.983693 7.081603 6.043904 6.422603 5.915564
## [737] 6.933154 6.454346 4.961099 6.523101 6.234146 5.003200 6.381001 6.421600
## [745] 5.753093 5.706400 6.026586 5.569399 6.618472 6.612689 6.255288 6.276650
## [753] 6.785680 5.360600 6.738299 5.054649 6.013905 6.122164 6.860497 5.267700
## [761] 5.080204 4.006301 5.722740 6.134439 6.384849 5.887002 6.822464 5.213199
## [769] 6.342673 5.789155 4.932601 5.946682 5.640700 6.658001 5.688440 4.998600
## [777] 5.658998 5.594408 6.009100 6.347199 5.212293 5.954800 5.311260 6.062822
## [785] 5.847400 5.295650 6.424400 5.325399 6.299280 5.594965 6.664892 6.802005
## [793] 5.680900 6.371505 6.274363 5.410723 6.471258 6.096185 5.934600 5.422312
## [801] 6.347235 5.557700 6.488057 6.280504 6.238417 6.325500 6.714278 5.236900
## [809] 5.950259 6.036836 5.739399 5.177100 6.839464 6.518151 6.896598 5.368948
## [817] 6.681100 5.587250 5.098599 5.884451 6.191597 6.411686 5.966864 5.237672
## [825] 6.611582 6.730998 5.594965 5.150899 5.596600 5.998895 5.831404 5.892896
## [833] 5.452000 6.316099 5.284101 4.871100 5.456760 6.821696 4.896349 6.379602
## [841] 6.345400 4.841600 4.927104 6.381400 5.985653 5.936400 6.259349 5.395317
## [849] 5.271300 5.834900 5.269400 5.175349 5.972446 5.880502 5.818419 6.126471
## [857] 6.052407 5.906500 6.149899 6.008697 6.288797 5.936213 6.112603 5.787706
## [865] 6.404240 5.967800 5.977800 5.905400 6.983219 6.612360 5.264749 6.166399
## [873] 6.028720 6.201801 6.227925 5.087525 6.379597 6.331771 5.409699 4.956450
## [881] 5.075200 5.805893 6.200500 6.140201 4.988507 5.121900 6.639302 5.570400
## [889] 4.852941 6.434403 6.444723 5.334400 5.957267 5.139017 4.647099 5.804302
## [897] 6.160837 5.272501 5.949199 5.915900 6.746295 5.790048 5.805321 6.049702
## [905] 5.990456 5.214297 5.813200 6.635599 5.474801 5.838800 5.192300 6.102972
## [913] 6.164450 5.402800 6.489343 5.938499 6.075129 5.892404 5.450095 5.906848
## [921] 5.985205 6.867101 5.300700 6.725308 5.730448 5.942903 5.697400 4.849400
## [929] 6.557100 5.575480 5.885598 6.135599 5.341598 6.159297 5.006684 5.837597
## [937] 5.361000 6.066897 5.912284 4.971900 4.834399 6.067904 5.747300 6.534945
## [945] 5.938679 6.942694 5.801600 5.737494 6.813920 5.577561 6.411686 6.524357
## [953] 6.152420 6.171782 4.872449 4.988001 6.056532 5.105325 6.058192 5.930802
## [961] 5.075200 6.361120 6.913600 6.424500 6.550105 6.387105 5.665657 5.610754
## [969] 5.440700 6.186013 6.199804 5.175601 5.571393 5.189250 6.204291 5.144301
## [977] 6.628181 6.133702 5.711499 6.140602 5.536100 5.428000 4.985499 5.148216
## [985] 7.017704 6.393398 5.868763 6.009105 6.303993 5.465498 5.560008 5.338681
## [993] 5.015858 5.858720 5.303250 6.520600 5.535399 6.350692 6.021299 6.726400
## [1001] 5.688440 6.037721 5.193101 6.243627 5.604008 5.442997 6.516210 6.143981
## [1009] 6.231800 6.508193 4.997323 6.003203 5.527500 6.451500 6.418379 5.935300
## [1017] 5.375111 7.014143 6.279066 5.577229 5.256101 6.072423 6.661403 4.871700
## [1025] 5.426480 5.773600 5.733600 6.635796 6.728998 6.532901 5.991600 6.566572
## [1033] 5.195301 5.703500 6.276598 5.323304 5.008101 5.046496 6.494785 5.697639
## [1041] 5.958222 6.270421 6.272800 5.720975 5.317889 5.299702 5.147400 6.999498
## [1049] 4.840000 5.440100 5.843600 7.003475 5.113297 6.075560 5.570420 5.371040
## [1057] 5.111301 5.105325 6.016800 6.047621 6.338500 4.907949 6.033002 5.091800
## [1065] 5.924298 6.043957 4.961500 6.186100 5.758401 6.542100 5.767504 6.885174
## [1073] 5.010180 5.184148 5.868901 6.190796 5.804400 5.362992 6.230501 6.942694
## [1081] 4.825000 6.896598 5.638008 6.860379 5.736123 5.608393 5.693018 5.976668
## [1089] 6.108000 5.331249 5.007751 5.564540 5.342896 5.891451 6.017900 4.951750
## [1097] 6.319636 5.986501 6.100916 6.170881 6.364350 6.032498 5.573849 6.436730
## [1105] 5.370203 5.863500 6.276055 5.570537 5.079001 5.301136 5.433360 6.019905
## [1113] 5.495700 5.783196 5.422740 5.094501 5.980088 6.637400 6.057098 5.633207
## [1121] 5.344675 6.361120 6.339003 5.940458 5.285795 5.338698 5.987340 4.671700
## [1129] 5.602696 5.863298 5.696306 6.034294 6.003604 6.508814 5.782903 5.892601
## [1137] 6.858105 5.968100 6.629999 6.496694 6.088439 5.462301 6.108502 4.997090
## [1145] 6.415817 6.066836 5.814192 4.668099
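The calls above use the defaults: a single committee and no instance-based adjustment. The `Cubist` package also provides a `committees` argument in `cubist()` and a `neighbors` argument in `predict()`. The following sketch exercises both on synthetic data. The data, the variable names, and the chosen values (5 committees, 5 neighbors) are arbitrary assumptions, and whether they would improve the wine model was not tested here. The code is guarded so that it only runs when `Cubist` is installed.

```r
if (requireNamespace("Cubist", quietly = TRUE)) {
  set.seed(1)

  # Synthetic data: a piecewise-linear outcome with a little noise.
  n <- 200
  x <- data.frame(a = runif(n), b = runif(n))
  y <- ifelse(x$a > 0.5, 2 + 3 * x$b, 5 - x$b) + rnorm(n, sd = 0.1)

  train <- 1:150
  test  <- 151:200

  # Default model: one committee.
  m_single <- Cubist::cubist(x = x[train, ], y = y[train])

  # Boosting-like ensemble of five rule sets.
  m_committee <- Cubist::cubist(x = x[train, ], y = y[train], committees = 5)

  # Rule-based predictions only.
  p_rules <- predict(m_single, x[test, ])

  # Predictions adjusted using the 5 nearest training cases.
  p_neighbors <- predict(m_committee, x[test, ], neighbors = 5)

  # Mean absolute error of each variant on the held-out rows.
  mean(abs(y[test] - p_rules))
  mean(abs(y[test] - p_neighbors))
}
```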
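Model Evaluation
We evaluate the model using the test predictions (`p_wine`). To make the data easier to work with, the test data (`wine_test`) and the predictions (`p_wine`) are combined. Because the target quality score (`quality`) is an integer, we also create a feature that converts the predictions to integers. Note that `as.integer()` truncates the fractional part rather than rounding to the nearest integer, so `pred_int` is systematically lower than `pred`.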
result <- dplyr::bind_cols(wine_test, pred = p_wine) %>%
dplyr::mutate(pred_int = as.integer(pred))
result
## # A tibble: 1,148 x 14
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 0.33 0.28 5.7 0.033
## 2 7.4 0.39 0.23 7 0.033
## 3 6.9 0.14 0.38 1 0.041
## 4 6.5 0.18 0.290 1.7 0.035
## 5 6.8 0.28 0.44 11.5 0.04
## 6 7.3 0.4 0.28 6.5 0.037
## 7 6.1 0.32 0.33 10.7 0.036
## 8 6.8 0.35 0.44 6.5 0.056
## 9 6 0.28 0.27 15.5 0.036
## 10 6.3 0.24 0.290 13.7 0.035
## # … with 1,138 more rows, and 9 more variables: free.sulfur.dioxide <dbl>,
## # total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <int>, pred <dbl>, pred_int <int>
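`as.integer()` drops the fractional part, so a prediction of 5.84 becomes 5, not 6. A short sketch comparing truncation with rounding to the nearest integer, using the first five predictions printed earlier:

```r
# First five values of p_wine, copied from the output above.
pred <- c(6.659258, 5.844499, 6.364941, 6.064602, 5.660200)

truncated <- as.integer(pred)         # 6 5 6 6 5
rounded   <- as.integer(round(pred))  # 7 6 6 6 6

mean(truncated)  # lower than mean(pred)
mean(rounded)
```

If nearest-integer scores are what is wanted, `round()` would be the function to use; the results shown on this page are based on truncation.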
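Comparing Summaries
The fitted model never predicts a score of 8 or higher: the largest prediction is about 7.39, whereas the observed scores go up to 9.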
summary(result$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.901 6.000 9.000
summary(result$pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.677 5.416 5.906 5.848 6.238 7.393
summary(result$pred_int)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.359 6.000 7.000
Comparing Correlation Coefficients
cor(result$quality, result$pred)
## [1] 0.6201015
cor(result$quality, result$pred_int)
## [1] 0.537246
Comparing Mean Absolute Errors
mean(abs(result$quality - result$pred))
## [1] 0.5339725
mean(abs(result$quality - result$pred_int))
## [1] 0.6689895
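A mean absolute error is easier to interpret when it is set against a naive baseline that always predicts the same value. The sketch below wraps the calculation in a helper and compares both, using made-up numbers. The names `mae`, `actual`, and `predicted` are illustrative, and the baseline uses the mean of `actual` itself for brevity; in practice the training-set mean would be the natural choice.

```r
# Mean absolute error between observed and predicted values.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up observed scores and predictions.
actual    <- c(5, 6, 6, 7, 5)
predicted <- c(5.4, 5.9, 6.3, 6.2, 5.1)

mae_model    <- mae(actual, predicted)     # 0.34
mae_baseline <- mae(actual, mean(actual))  # 0.64

mae_model < mae_baseline  # TRUE when the model beats the constant baseline
```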
Distribution of Predicted Values
hist(result$quality)
hist(result$pred)
hist(result$pred_int)
Predicted vs. Observed Plot (For Reference)
ggplot2::ggplot(result, ggplot2::aes(x = quality, y = pred)) +
ggplot2::geom_abline(slope = 1, colour = "red", linetype = "dotted") +
ggplot2::geom_point()
ggplot2::ggplot(result, ggplot2::aes(x = quality, y = pred_int)) +
ggplot2::geom_abline(slope = 1, colour = "red", linetype = "dotted") +
ggplot2::geom_point()
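Because `quality` takes only a few integer values, many points in these plots overlap. A cross-tabulation of observed scores against integer predictions shows the counts directly. The sketch below uses made-up vectors standing in for `result$quality` and `result$pred`, and rounds to the nearest integer (an assumption; the `pred_int` column above is truncated instead).

```r
# Made-up observed scores and predictions.
quality <- c(5L, 6L, 6L, 7L, 5L, 6L)
pred    <- c(5.4, 5.9, 6.6, 6.2, 5.1, 5.3)

# Rows: observed score, columns: prediction rounded to the nearest integer.
tab <- table(observed = quality, predicted = round(pred))
tab

# Share of cases where the rounded prediction equals the observed score.
agreement <- mean(quality == round(pred))
agreement
```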
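Summary
I could not determine why the `RWeka::M5P` function used in the textbook's sample code fails to reproduce the textbook's results. If you want to do M5'-style modeling, use the `Cubist` package, which provides Cubist, an extension of that family of algorithms.
The textbook uses a model tree to model a subjective target variable (the quality score). Because the target appears to be an integer-valued interval scale, however, I consider prediction with a model tree (regression equations) to be ill suited to it.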