A modeling function to replace RWeka::M5P
Text Update: 11/24, 2019 (JST)

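Section 6.4 of 『Rによる機械学習』 (2nd edition; hereafter "the textbook") explains model trees built with the M5' (M5 Prime) algorithm through the RWeka::M5P function. However, when the sample code is run, RWeka::M5P produces results that differ from those printed in the textbook and from those obtained by running Weka itself.
This page therefore introduces an alternative to RWeka::M5P: modeling with the Cubist algorithm.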

Note: this document is a reworked version of the supplementary material for the 6th session (FY2019) of データ分析勉強会 (Data Analysis Study Group).

Packages and Datasets

In addition to the standard packages of R version 3.6.1 (2019-07-05), this page uses the following additional packages.
 

Package    Version  Description
Cubist     0.2.2    Rule- And Instance-Based Regression Modeling
ggplot2    3.2.1    Create Elegant Data Visualisations Using the Grammar of Graphics
tidyverse  1.2.1    Easily Install and Load the ‘Tidyverse’

 
This page also uses the following dataset.
 

Dataset  Package  Version  Description
wine     NA       NA       dataspelunking/MLwR, GitHub

 

Data overview

The data used here is the Portuguese white wine quality data introduced in the textbook.
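The loading step is not shown on this page. A minimal sketch, assuming the white wine CSV from the dataspelunking/MLwR repository has already been saved locally (the file name `whitewines.csv` and the helper name `load_wine` are assumptions for illustration, not taken from the original):

```r
# Hypothetical helper: read a local copy of the white wine data.
# The default path is an assumption; point it at wherever the CSV was saved.
load_wine <- function(path = "whitewines.csv") {
  wine <- utils::read.csv(path, stringsAsFactors = FALSE)
  # The code on this page selects the target by position (last column),
  # so check that the last column really is `quality`.
  stopifnot(identical(utils::tail(names(wine), 1), "quality"))
  wine
}

# wine <- load_wine("whitewines.csv")
```

Checking the column position up front makes a later `wine_train[, 12]` style selection less likely to silently pick the wrong column.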

wine
## # A tibble: 4,898 x 12
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           6.7            0.62         0.24           1.1      0.039
##  2           5.7            0.22         0.2           16        0.044
##  3           5.9            0.19         0.26           7.4      0.034
##  4           5.3            0.47         0.1            1.3      0.036
##  5           6.4            0.290        0.21           9.65     0.041
##  6           7              0.14         0.41           0.9      0.037
##  7           7.9            0.12         0.49           5.2      0.049
##  8           6.6            0.38         0.28           2.8      0.043
##  9           7              0.16         0.3            2.6      0.043
## 10           6.5            0.37         0.33           3.9      0.027
## # … with 4,888 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <int>

All features are numeric and none has missing values. The target variable, the quality score (quality), appears to be an integer-valued interval scale.

Variable type: numeric

skim_variable         n_missing  complete_rate    mean     sd    p0     p25     p50     p75    p100  hist
fixed.acidity                 0              1    6.85   0.84  3.80    6.30    6.80    7.30   14.20  ▁▇▁▁▁
volatile.acidity              0              1    0.28   0.10  0.08    0.21    0.26    0.32    1.10  ▇▅▁▁▁
citric.acid                   0              1    0.33   0.12  0.00    0.27    0.32    0.39    1.66  ▇▆▁▁▁
residual.sugar                0              1    6.39   5.07  0.60    1.70    5.20    9.90   65.80  ▇▁▁▁▁
chlorides                     0              1    0.05   0.02  0.01    0.04    0.04    0.05    0.35  ▇▁▁▁▁
free.sulfur.dioxide           0              1   35.31  17.01  2.00   23.00   34.00   46.00  289.00  ▇▁▁▁▁
total.sulfur.dioxide          0              1  138.36  42.50  9.00  108.00  134.00  167.00  440.00  ▂▇▂▁▁
density                       0              1    0.99   0.00  0.99    0.99    0.99    1.00    1.04  ▇▂▁▁▁
pH                            0              1    3.19   0.15  2.72    3.09    3.18    3.28    3.82  ▁▇▇▂▁
sulphates                     0              1    0.49   0.11  0.22    0.41    0.47    0.55    1.08  ▃▇▂▁▁
alcohol                       0              1   10.51   1.23  8.00    9.50   10.40   11.40   14.20  ▃▇▆▃▁
quality                       0              1    5.88   0.89  3.00    5.00    6.00    6.00    9.00  ▁▅▇▃▁

 

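Splitting the data

Split the wine dataset into a training set and a test set. The split follows the textbook: the first 3,750 rows are used for training and the remaining 1,148 rows for testing.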

wine_train <- wine[1:3750, ]
wine_train
## # A tibble: 3,750 x 12
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           6.7            0.62         0.24           1.1      0.039
##  2           5.7            0.22         0.2           16        0.044
##  3           5.9            0.19         0.26           7.4      0.034
##  4           5.3            0.47         0.1            1.3      0.036
##  5           6.4            0.290        0.21           9.65     0.041
##  6           7              0.14         0.41           0.9      0.037
##  7           7.9            0.12         0.49           5.2      0.049
##  8           6.6            0.38         0.28           2.8      0.043
##  9           7              0.16         0.3            2.6      0.043
## 10           6.5            0.37         0.33           3.9      0.027
## # … with 3,740 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <int>
wine_test <- wine[3751:4898, ]
wine_test
## # A tibble: 1,148 x 12
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7               0.33       0.28             5.7     0.033
##  2           7.4             0.39       0.23             7       0.033
##  3           6.9             0.14       0.38             1       0.041
##  4           6.5             0.18       0.290            1.7     0.035
##  5           6.8             0.28       0.44            11.5     0.04 
##  6           7.3             0.4        0.28             6.5     0.037
##  7           6.1             0.32       0.33            10.7     0.036
##  8           6.8             0.35       0.44             6.5     0.056
##  9           6               0.28       0.27            15.5     0.036
## 10           6.3             0.24       0.290           13.7     0.035
## # … with 1,138 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <int>

 

Package Cubist

The Cubist algorithm is an extension of the M5' (M5 Prime) algorithm. A C implementation of it (Cubist GPL C code) is published on the RuleQuest site. Based on this code, Max Kuhn (RStudio, Inc.), the author of the caret package, created the Cubist package for R.
The Cubist algorithm itself is not explained here; please look it up on Google Scholar or a similar service.

 

Installing the package

The Cubist package is available on CRAN, so install it as follows.

install.packages("Cubist")

Once the installation has finished, load the Cubist package.

library(Cubist)
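When the page is re-run, reinstalling every time is unnecessary. A small sketch of a common idiom that installs only when the package is missing (the helper name `ensure_package` is an assumption for illustration):

```r
# Hypothetical helper: install a package only when it is not already
# available, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# ensure_package("Cubist")
```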

 

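Modeling

Now that everything is ready for modeling with the Cubist package, first build a model from the training data. Note that, unlike RWeka::M5P, cubist() does not accept a model specified as a formula.

Building the model

Pass the training data without the target variable to the argument x, and the target variable to the argument y.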

m_wine <- Cubist::cubist(x = wine_train[, -12], y = wine_train[, 12])
m_wine
## 
## Call:
## cubist.default(x = wine_train[, -12], y = wine_train[, 12])
## 
## Number of samples: 3750 
## Number of predictors: 11 
## 
## Number of committees: 1 
## Number of rules: 25
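Because the model is specified through x and y rather than a formula, a thin wrapper can recover the formula style. The following is a minimal sketch using base R's model.frame; the helper name `formula_to_xy` is an assumption for illustration and is not part of the Cubist package. Note that model.frame drops rows with missing values by default.

```r
# Hypothetical helper: turn a formula plus data into the x / y pair
# that Cubist::cubist() expects.
formula_to_xy <- function(formula, data) {
  mf <- stats::model.frame(formula, data = data)
  list(
    x = mf[, -1, drop = FALSE],                 # predictors (response is column 1)
    y = as.numeric(stats::model.response(mf))   # numeric outcome vector
  )
}

# Usage sketch (requires the Cubist package and wine_train):
# xy <- formula_to_xy(quality ~ ., wine_train)
# m  <- Cubist::cubist(x = xy$x, y = xy$y)
```

Selecting the target by name in the formula also avoids depending on its column position.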

Use the summary function to see the details of the trained model.

summary(m_wine)
## 
## Call:
## cubist.default(x = wine_train[, -12], y = wine_train[, 12])
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun Nov 24 14:42:54 2019
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 3750 cases (12 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [21 cases, mean 5.0, range 4 to 6, est err 0.5]
## 
##     if
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide > 195
##  total.sulfur.dioxide <= 235
##  sulphates > 0.64
##  alcohol > 9.1
##     then
##  outcome = 573.6 + 0.0478 total.sulfur.dioxide - 573 density
##            - 0.788 alcohol + 0.186 residual.sugar - 4.73 volatile.acidity
## 
##   Rule 2: [28 cases, mean 5.0, range 4 to 8, est err 0.7]
## 
##     if
##  volatile.acidity > 0.31
##  citric.acid <= 0.36
##  residual.sugar <= 1.45
##  total.sulfur.dioxide <= 97
##  alcohol > 9.1
##     then
##  outcome = 168.2 + 4.75 citric.acid + 0.0123 total.sulfur.dioxide
##            - 170 density + 0.057 residual.sugar - 6.4 chlorides + 0.84 pH
##            + 0.14 fixed.acidity
## 
##   Rule 3: [171 cases, mean 5.1, range 3 to 6, est err 0.3]
## 
##     if
##  volatile.acidity > 0.205
##  chlorides <= 0.054
##  density <= 0.99839
##  alcohol <= 9.1
##     then
##  outcome = 147.4 - 144 density + 0.08 residual.sugar + 0.117 alcohol
##            - 0.87 volatile.acidity - 0.09 pH - 0.01 fixed.acidity
## 
##   Rule 4: [37 cases, mean 5.3, range 3 to 6, est err 0.5]
## 
##     if
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide > 235
##  alcohol > 9.1
##     then
##  outcome = 19.5 - 0.013 total.sulfur.dioxide - 2.7 volatile.acidity
##            - 10 density + 0.005 residual.sugar + 0.008 alcohol
## 
##   Rule 5: [64 cases, mean 5.3, range 5 to 6, est err 0.3]
## 
##     if
##  volatile.acidity > 0.205
##  residual.sugar > 17.85
##     then
##  outcome = -23.6 + 0.233 alcohol - 5.2 chlorides - 0.75 citric.acid
##            + 28 density - 0.81 volatile.acidity - 0.19 pH
##            - 0.002 residual.sugar
## 
##   Rule 6: [56 cases, mean 5.3, range 4 to 7, est err 0.6]
## 
##     if
##  fixed.acidity <= 7.1
##  volatile.acidity > 0.205
##  chlorides > 0.054
##  density <= 0.99839
##  alcohol <= 9.1
##     then
##  outcome = 40.6 + 0.374 alcohol - 1.62 volatile.acidity
##            + 0.026 residual.sugar - 38 density - 0.21 pH
##            - 0.01 fixed.acidity
## 
##   Rule 7: [337 cases, mean 5.3, range 3 to 7, est err 0.4]
## 
##     if
##  fixed.acidity <= 7.8
##  volatile.acidity > 0.305
##  chlorides <= 0.09
##  free.sulfur.dioxide <= 82.5
##  total.sulfur.dioxide > 130
##  total.sulfur.dioxide <= 235
##  sulphates <= 0.64
##  alcohol <= 10.4
##     then
##  outcome = -32.1 + 0.233 alcohol - 9.7 chlorides
##            + 0.0038 total.sulfur.dioxide - 0.0081 free.sulfur.dioxide
##            + 35 density + 0.81 volatile.acidity
## 
##   Rule 8: [30 cases, mean 5.5, range 3 to 7, est err 0.5]
## 
##     if
##  fixed.acidity > 7.1
##  volatile.acidity > 0.205
##  chlorides > 0.054
##  density <= 0.99839
##  alcohol <= 9.1
##     then
##  outcome = 244 - 1.56 fixed.acidity - 228 density
##            + 0.0252 free.sulfur.dioxide - 7.3 chlorides
##            - 0.19 volatile.acidity + 0.003 residual.sugar
## 
##   Rule 9: [98 cases, mean 5.5, range 4 to 8, est err 0.5]
## 
##     if
##  volatile.acidity > 0.155
##  chlorides > 0.09
##  total.sulfur.dioxide <= 235
##  sulphates <= 0.64
##     then
##  outcome = 55.9 - 3.85 volatile.acidity - 52 density
##            + 0.023 residual.sugar + 0.092 alcohol + 0.35 pH
##            + 0.05 fixed.acidity + 0.3 sulphates
##            + 0.001 free.sulfur.dioxide
## 
##   Rule 10: [446 cases, mean 5.6, range 4 to 8, est err 0.5]
## 
##     if
##  fixed.acidity <= 7.8
##  volatile.acidity > 0.155
##  volatile.acidity <= 0.305
##  chlorides <= 0.09
##  free.sulfur.dioxide <= 82.5
##  total.sulfur.dioxide > 130
##  total.sulfur.dioxide <= 235
##  sulphates <= 0.64
##  alcohol > 9.1
##  alcohol <= 10.4
##     then
##  outcome = 15.1 + 0.35 alcohol - 3.09 volatile.acidity - 14.7 chlorides
##            + 1.16 sulphates - 0.0022 total.sulfur.dioxide
##            + 0.11 fixed.acidity + 0.45 pH + 0.5 citric.acid - 14 density
##            + 0.006 residual.sugar
## 
##   Rule 11: [31 cases, mean 5.6, range 3 to 8, est err 0.8]
## 
##     if
##  volatile.acidity > 0.31
##  citric.acid > 0.36
##  free.sulfur.dioxide <= 30
##  total.sulfur.dioxide <= 97
##     then
##  outcome = 3.2 + 0.0584 total.sulfur.dioxide + 7.77 volatile.acidity
##            + 0.328 alcohol - 9 density + 0.003 residual.sugar
## 
##   Rule 12: [20 cases, mean 5.7, range 3 to 8, est err 0.9]
## 
##     if
##  free.sulfur.dioxide > 82.5
##  total.sulfur.dioxide <= 235
##  sulphates <= 0.64
##  alcohol > 9.1
##     then
##  outcome = -8.9 + 109.3 chlorides + 0.948 alcohol
## 
##   Rule 13: [331 cases, mean 5.8, range 4 to 8, est err 0.5]
## 
##     if
##  volatile.acidity > 0.31
##  free.sulfur.dioxide <= 30
##  total.sulfur.dioxide > 97
##  alcohol > 9.1
##     then
##  outcome = 89.8 + 0.0234 free.sulfur.dioxide + 0.324 alcohol
##            + 0.07 residual.sugar - 90 density - 1.47 volatile.acidity
##            + 0.48 pH
## 
##   Rule 14: [116 cases, mean 5.8, range 3 to 8, est err 0.6]
## 
##     if
##  fixed.acidity > 7.8
##  volatile.acidity > 0.155
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide > 130
##  total.sulfur.dioxide <= 235
##  sulphates <= 0.64
##  alcohol > 9.1
##     then
##  outcome = 6 + 0.346 alcohol - 0.41 fixed.acidity - 1.69 volatile.acidity
##            - 2.9 chlorides + 0.19 sulphates + 0.07 pH
## 
##   Rule 15: [115 cases, mean 5.8, range 4 to 7, est err 0.5]
## 
##     if
##  volatile.acidity > 0.205
##  residual.sugar <= 17.85
##  density > 0.99839
##  alcohol <= 9.1
##     then
##  outcome = -110.2 + 120 density - 3.46 volatile.acidity - 0.97 pH
##            - 0.022 residual.sugar + 0.088 alcohol - 0.6 citric.acid
##            - 0.01 fixed.acidity
## 
##   Rule 16: [986 cases, mean 5.9, range 3 to 9, est err 0.6]
## 
##     if
##  volatile.acidity <= 0.31
##  free.sulfur.dioxide <= 30
##  alcohol > 9.1
##     then
##  outcome = 280.4 - 282 density + 0.128 residual.sugar
##            + 0.0264 free.sulfur.dioxide - 3 volatile.acidity + 1.2 pH
##            + 0.65 citric.acid + 0.09 fixed.acidity + 0.56 sulphates
##            + 0.015 alcohol
## 
##   Rule 17: [49 cases, mean 6.0, range 5 to 8, est err 0.5]
## 
##     if
##  volatile.acidity > 0.155
##  residual.sugar > 8.8
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide <= 130
##  pH <= 3.26
##  alcohol > 9.1
##     then
##  outcome = 173.5 - 169 density + 0.055 alcohol + 0.38 sulphates
##            + 0.002 residual.sugar
## 
##   Rule 18: [114 cases, mean 6.1, range 3 to 9, est err 0.6]
## 
##     if
##  volatile.acidity > 0.31
##  citric.acid <= 0.36
##  residual.sugar > 1.45
##  total.sulfur.dioxide <= 97
##  alcohol > 9.1
##     then
##  outcome = 302.3 - 305 density + 0.0128 total.sulfur.dioxide
##            + 0.096 residual.sugar + 1.94 citric.acid + 1.05 pH
##            + 0.17 fixed.acidity - 6.7 chlorides
##            + 0.0022 free.sulfur.dioxide - 0.21 volatile.acidity
##            + 0.013 alcohol + 0.09 sulphates
## 
##   Rule 19: [145 cases, mean 6.1, range 5 to 8, est err 0.6]
## 
##     if
##  volatile.acidity > 0.155
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide <= 195
##  sulphates > 0.64
##     then
##  outcome = 206 - 209 density + 0.069 residual.sugar + 0.38 fixed.acidity
##            + 2.79 sulphates + 0.0155 free.sulfur.dioxide
##            - 0.0051 total.sulfur.dioxide - 1.71 citric.acid + 1.04 pH
## 
##   Rule 20: [555 cases, mean 6.1, range 3 to 9, est err 0.6]
## 
##     if
##  total.sulfur.dioxide > 130
##  total.sulfur.dioxide <= 235
##  sulphates <= 0.64
##  alcohol > 10.4
##     then
##  outcome = 108 + 0.276 alcohol - 109 density + 0.05 residual.sugar
##            + 0.77 pH - 1.02 volatile.acidity - 4.2 chlorides
##            + 0.78 sulphates + 0.08 fixed.acidity
##            + 0.0016 free.sulfur.dioxide - 0.0003 total.sulfur.dioxide
## 
##   Rule 21: [73 cases, mean 6.2, range 4 to 8, est err 0.4]
## 
##     if
##  volatile.acidity > 0.155
##  citric.acid <= 0.28
##  residual.sugar <= 8.8
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide <= 130
##  pH <= 3.26
##  sulphates <= 0.64
##  alcohol > 9.1
##     then
##  outcome = 4.2 + 0.147 residual.sugar + 0.47 alcohol + 3.75 sulphates
##            - 2.5 volatile.acidity - 5 density
## 
##   Rule 22: [244 cases, mean 6.3, range 4 to 8, est err 0.6]
## 
##     if
##  citric.acid > 0.28
##  residual.sugar <= 8.8
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide <= 130
##  pH <= 3.26
##     then
##  outcome = 40.1 + 0.278 alcohol + 1.3 sulphates - 39 density
##            + 0.017 residual.sugar + 0.001 total.sulfur.dioxide + 0.17 pH
##            + 0.03 fixed.acidity
## 
##   Rule 23: [106 cases, mean 6.3, range 4 to 8, est err 0.6]
## 
##     if
##  volatile.acidity <= 0.155
##  free.sulfur.dioxide > 30
##     then
##  outcome = 139.1 - 138 density + 0.058 residual.sugar + 0.71 pH
##            + 0.92 sulphates + 0.11 fixed.acidity - 0.73 volatile.acidity
##            + 0.055 alcohol - 0.0012 total.sulfur.dioxide
##            + 0.0007 free.sulfur.dioxide
## 
##   Rule 24: [137 cases, mean 6.5, range 4 to 9, est err 0.6]
## 
##     if
##  volatile.acidity > 0.155
##  free.sulfur.dioxide > 30
##  total.sulfur.dioxide <= 130
##  pH > 3.26
##  sulphates <= 0.64
##  alcohol > 9.1
##     then
##  outcome = 114.2 + 0.0142 total.sulfur.dioxide - 107 density
##            - 11.8 chlorides - 1.57 pH + 0.124 alcohol + 1.21 sulphates
##            + 1.16 volatile.acidity + 0.021 residual.sugar
##            + 0.04 fixed.acidity
## 
##   Rule 25: [92 cases, mean 6.5, range 4 to 8, est err 0.6]
## 
##     if
##  volatile.acidity <= 0.205
##  alcohol <= 9.1
##     then
##  outcome = -200.7 + 210 density + 5.88 volatile.acidity + 23.9 chlorides
##            - 2.83 citric.acid - 1.17 pH
## 
## 
## Evaluation on training data (3750 cases):
## 
##     Average  |error|                0.5
##     Relative |error|               0.67
##     Correlation coefficient        0.66
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##     84%    93%    alcohol
##     80%    89%    volatile.acidity
##     70%    61%    free.sulfur.dioxide
##     63%    50%    total.sulfur.dioxide
##     44%    70%    sulphates
##     26%    44%    chlorides
##     22%    76%    fixed.acidity
##     16%    87%    residual.sugar
##     11%    86%    pH
##     11%    45%    citric.acid
##      8%    97%    density
## 
## 
## Time: 0.3 secs

 

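Accessing model components

When there are many rules, reading them through summary is laborious. Individual pieces of information can instead be inspected by referring to the elements of the model object, as shown in the following subsections.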

 

Rules

m_wine$splits
## # A tibble: 119 x 8
##    committee  rule variable             dir    value category type  percentile
##        <dbl> <dbl> <fct>                <fct>  <dbl> <fct>    <chr>      <dbl>
##  1         1     1 sulphates            >       0.64 ""       type2      0.910
##  2         1     1 total.sulfur.dioxide >     195    ""       type2      0.902
##  3         1     1 total.sulfur.dioxide <=    235    ""       type2      0.985
##  4         1     1 alcohol              >       9.1  ""       type2      0.135
##  5         1     1 free.sulfur.dioxide  >      30    ""       type2      0.423
##  6         1     2 volatile.acidity     >       0.31 ""       type2      0.719
##  7         1     2 residual.sugar       <=      1.45 ""       type2      0.176
##  8         1     2 total.sulfur.dioxide <=     97    ""       type2      0.16 
##  9         1     2 citric.acid          <=      0.36 ""       type2      0.693
## 10         1     2 alcohol              >       9.1  ""       type2      0.135
## # … with 109 more rows
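To read the conditions of a single rule, the table can be filtered by rule number. A minimal sketch, assuming a data frame with the committee, rule, variable, dir and value columns shown above and continuous splits only (the helper name `rule_conditions` is an assumption for illustration):

```r
# Hypothetical helper: format the conditions of one rule as readable strings.
rule_conditions <- function(splits, rule_id, committee_id = 1) {
  s <- splits[splits$committee == committee_id & splits$rule == rule_id, ]
  paste(as.character(s$variable), trimws(as.character(s$dir)), s$value)
}

# rule_conditions(m_wine$splits, 1)
```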

 

Regression equation of each rule (coefficient table)

m_wine$coefficients
## # A tibble: 25 x 14
##    `(Intercept)` fixed.acidity volatile.acidity citric.acid residual.sugar
##            <dbl>         <dbl>            <dbl>       <dbl>          <dbl>
##  1         574.          NA               -4.73       NA             0.186
##  2         168.           0.14            NA           4.75          0.057
##  3         147.          -0.01            -0.87       NA             0.08 
##  4          19.5         NA               -2.7        NA             0.005
##  5         -23.6         NA               -0.81       -0.75         -0.002
##  6          40.6         -0.01            -1.62       NA             0.026
##  7         -32.1         NA                0.81       NA            NA    
##  8         244           -1.56            -0.19       NA             0.003
##  9          55.9          0.05            -3.85       NA             0.023
## 10          15.1          0.11            -3.09        0.5           0.006
## # … with 15 more rows, and 9 more variables: chlorides <dbl>,
## #   free.sulfur.dioxide <dbl>, total.sulfur.dioxide <dbl>, density <dbl>,
## #   pH <dbl>, sulphates <dbl>, alcohol <dbl>, committee <chr>, rule <chr>
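Each row of this table is one linear equation, where NA means the feature does not appear in that rule. The sketch below evaluates a single row by hand, using Rule 12 from the summary output as the example; the helper name `apply_rule` and the input values are assumptions for illustration. predict() may combine several matching rules and apply further adjustments, so this is meant only for reading one equation, not for reproducing predictions.

```r
# Hypothetical helper: evaluate one row of a coefficient table on new data.
# `coef_row` is a one-row data frame with "(Intercept)" plus one column per
# feature; NA means the feature is not used in that rule's equation.
apply_rule <- function(coef_row, newdata) {
  features <- intersect(names(coef_row), names(newdata))
  b <- unlist(coef_row[1, features])
  b[is.na(b)] <- 0   # unused features contribute nothing
  as.numeric(coef_row[["(Intercept)"]][1] +
    as.matrix(newdata[, features, drop = FALSE]) %*% b)
}

# Rule 12 from the summary: outcome = -8.9 + 109.3 chlorides + 0.948 alcohol
rule12 <- data.frame(`(Intercept)` = -8.9, chlorides = 109.3, alcohol = 0.948,
                     pH = NA_real_, check.names = FALSE)

# Made-up input values: -8.9 + 109.3 * 0.05 + 0.948 * 10 = 6.045
apply_rule(rule12, data.frame(chlorides = 0.05, alcohol = 10, pH = 3.2))
```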

 

Usage rate of each feature (attribute usage)

m_wine$usage
## # A tibble: 11 x 3
##    Conditions Model Variable            
##         <dbl> <dbl> <chr>               
##  1         84    93 alcohol             
##  2         80    89 volatile.acidity    
##  3         70    61 free.sulfur.dioxide 
##  4         63    50 total.sulfur.dioxide
##  5         44    70 sulphates           
##  6         26    44 chlorides           
##  7         22    76 fixed.acidity       
##  8         16    87 residual.sugar      
##  9         11    86 pH                  
## 10         11    45 citric.acid         
## 11          8    97 density

 

Prediction with the trained model

Use the predict function to make predictions.

p_wine <- predict(m_wine, wine_test[, -12])
p_wine
##    [1] 6.659258 5.844499 6.364941 6.064602 5.660200 6.807702 6.049500 5.046499
##    [9] 6.860379 5.976668 5.020800 5.463120 5.834602 6.708495 5.657901 5.607023
##   [17] 6.403296 6.095160 5.482300 6.087520 5.721400 5.780000 5.556600 5.302051
##   [25] 5.987817 6.208598 5.110150 4.920499 6.025604 5.617798 5.297199 5.883048
##   [33] 4.927701 6.809091 5.538096 5.976700 5.064700 6.087401 6.248298 6.091101
##   [41] 5.876072 5.885598 6.005801 6.296202 5.801429 4.889294 5.720280 6.224800
##   [49] 5.334001 6.198160 6.394821 5.870597 6.690900 6.439704 6.658896 5.541197
##   [57] 6.083459 6.643502 5.122796 5.963795 5.415951 5.508580 5.770601 5.856198
##   [65] 5.208998 4.560801 5.687980 5.630600 5.290480 6.036205 6.822050 5.707800
##   [73] 4.954700 5.018999 5.122999 6.128472 7.081603 5.566800 5.604201 4.660500
##   [81] 5.156402 5.547500 6.774819 6.474690 5.695000 5.627905 5.654605 6.872820
##   [89] 6.883808 6.065700 5.688950 6.337118 5.287049 6.414652 5.921400 5.314601
##   [97] 5.060899 6.755995 6.193338 5.357949 6.319558 6.728499 7.078003 5.624503
##  [105] 6.085704 5.606606 5.505495 6.198494 5.165200 5.861100 6.292290 6.365801
##  [113] 5.232293 6.209298 5.640262 6.291306 5.423500 5.821182 5.965500 5.849319
##  [121] 5.776904 6.073902 5.205908 5.904900 6.489145 7.393163 5.351200 5.819599
##  [129] 3.964800 5.898099 5.915450 5.912299 6.062420 4.995601 5.681108 6.255775
##  [137] 5.002101 6.124820 6.118924 6.033196 6.025000 5.609650 5.253720 5.981349
##  [145] 5.725399 5.479798 6.141408 6.684260 4.877749 5.228301 6.015901 5.235496
##  [153] 4.935503 4.986503 6.741941 6.457150 6.008400 6.187693 5.916946 5.907000
##  [161] 6.073363 5.063838 5.765705 6.391996 5.771399 6.162497 5.989305 4.982701
##  [169] 6.141501 4.979099 4.656431 5.990803 5.757955 5.079001 6.141650 5.252283
##  [177] 4.832384 5.745753 6.376420 5.024202 5.188597 5.251720 5.095301 5.657477
##  [185] 6.534503 5.895433 5.082615 6.148260 5.120520 6.247309 5.139350 6.264918
##  [193] 5.212293 6.524900 6.205099 7.332300 5.887001 6.465540 6.656764 7.332300
##  [201] 6.597508 5.757850 6.712619 5.364055 5.921300 6.234360 6.057098 6.013013
##  [209] 6.027441 6.203901 5.049798 5.095304 5.046350 6.212729 4.927251 5.860797
##  [217] 5.072300 4.951750 5.286199 5.509000 5.929882 6.522807 5.342800 5.139350
##  [225] 6.971497 5.954150 5.959404 6.170881 6.362580 5.807500 6.179539 6.349300
##  [233] 6.676600 5.912284 6.115821 4.864610 6.196860 5.602254 6.046200 6.150400
##  [241] 5.015887 5.883500 5.419900 5.684118 5.230598 6.137882 6.751918 5.832820
##  [249] 6.586703 5.971000 6.043880 5.057950 5.330704 6.098418 5.630600 6.005701
##  [257] 6.180904 6.216792 6.596248 6.241800 5.815797 6.176993 5.831659 5.567500
##  [265] 6.463100 5.569903 5.892400 5.907351 5.593300 5.942508 6.092402 6.028201
##  [273] 6.664892 6.148402 5.814898 6.028577 4.956450 6.452750 6.110602 6.083180
##  [281] 6.128197 6.333541 5.323302 5.650138 5.480005 5.645607 4.560197 5.975801
##  [289] 5.803627 5.408300 5.658998 5.159903 6.119817 5.181401 6.701000 6.237262
##  [297] 6.678019 5.792400 6.007006 6.071520 6.645163 6.892098 5.753151 5.936296
##  [305] 4.297103 5.536800 6.096602 5.856400 5.938519 5.464200 5.984100 5.744174
##  [313] 5.691799 6.323923 6.108502 6.100698 3.910699 6.206235 4.274251 6.650401
##  [321] 5.571393 5.376600 5.870100 5.038400 6.066453 6.658896 5.036100 5.827404
##  [329] 5.936980 6.728167 5.837597 6.037111 6.305277 4.868566 4.842704 5.627882
##  [337] 6.708495 6.231677 5.323839 6.256697 5.535230 5.418820 6.485672 4.916881
##  [345] 6.240296 5.236900 4.448452 5.938003 5.261999 5.906497 5.291998 6.847675
##  [353] 5.310699 5.982279 6.413097 6.441307 5.554101 5.407737 5.547800 6.361557
##  [361] 5.349100 5.963501 6.576892 6.070291 6.700603 5.995495 5.604640 5.259476
##  [369] 5.323304 6.213867 5.706500 6.114465 5.297700 6.089847 6.094703 6.342673
##  [377] 5.156851 5.523335 4.828249 6.147403 6.176899 5.999249 5.499171 6.066799
##  [385] 4.516747 6.029399 5.115850 6.247022 5.909700 5.256500 6.502111 5.897400
##  [393] 6.012151 5.605100 6.080605 6.935369 6.192259 6.135900 5.761906 5.166643
##  [401] 6.259996 5.885508 6.058300 5.808098 5.683499 5.773451 5.851030 5.761700
##  [409] 5.491820 6.821696 6.708198 5.582720 5.523416 6.379098 6.442420 6.163950
##  [417] 6.232393 6.435319 6.003997 5.910900 6.979352 5.251299 5.224201 4.971149
##  [425] 5.943795 6.273510 6.182400 5.421406 6.055199 6.225900 5.619095 5.580100
##  [433] 5.072899 5.039600 6.988250 4.858197 6.526549 6.301540 6.135341 6.950603
##  [441] 5.349396 5.402660 5.253720 5.559599 6.522439 6.600161 6.247309 5.748979
##  [449] 5.726102 6.659905 5.300299 6.422603 5.383270 5.211496 5.834547 5.850740
##  [457] 5.243101 5.187803 5.899003 5.932400 6.048882 5.783200 5.882154 4.953900
##  [465] 5.743400 5.621701 6.015601 5.025400 6.376256 6.721801 5.536900 5.937649
##  [473] 5.935579 5.509000 6.371346 5.489001 6.717050 6.445799 4.869594 6.409091
##  [481] 6.216101 5.169299 5.892896 5.967098 5.758295 5.931001 5.541380 5.824501
##  [489] 6.024201 4.928300 6.728167 5.466300 5.531082 5.231508 6.129747 4.852400
##  [497] 5.325600 5.114250 5.873619 5.848603 5.359402 6.651520 6.600161 6.340298
##  [505] 5.657901 6.486323 5.824195 6.035903 4.768300 6.750499 5.068100 7.356731
##  [513] 6.144232 6.335707 5.831659 6.307021 5.043900 6.272141 5.307784 5.442504
##  [521] 5.549200 5.116440 6.277101 5.983872 5.133107 5.137702 6.140602 6.440979
##  [529] 5.837600 5.948350 6.395962 5.738405 7.170300 6.580151 5.907097 5.414400
##  [537] 6.526397 5.439866 5.660093 6.860140 6.364600 5.262201 5.952605 5.653254
##  [545] 4.309250 6.340956 6.308275 5.656240 6.066873 6.222020 6.003820 5.870505
##  [553] 6.469800 5.672006 5.907351 5.069899 6.493000 5.677155 5.672200 6.569219
##  [561] 6.128472 6.016199 5.147099 3.676699 6.182357 6.042942 5.948553 5.344000
##  [569] 6.376256 5.057950 6.478027 6.240959 5.670448 5.667202 5.328745 6.087617
##  [577] 5.935579 5.316500 6.540391 5.172602 6.301834 4.889302 5.379850 5.901001
##  [585] 5.924298 7.078003 6.129747 5.362700 5.324201 4.704400 5.191002 6.100194
##  [593] 5.658400 5.950022 6.145100 7.013124 5.047002 5.181499 5.112749 6.263801
##  [601] 5.948708 6.625402 5.800001 5.873619 6.016357 5.495920 6.164698 4.591502
##  [609] 5.231795 5.219400 5.508300 5.161800 5.296561 6.331502 5.051400 6.728100
##  [617] 5.144301 6.100800 5.317331 5.976200 5.151300 5.547793 5.209200 6.315360
##  [625] 5.038843 6.166399 6.373600 5.645900 6.474690 5.981101 5.962847 6.855156
##  [633] 6.655737 5.811997 6.096885 6.823597 6.099460 6.527140 6.008697 6.462603
##  [641] 5.614000 4.782200 5.677100 6.662760 5.586205 6.073691 5.578401 6.091897
##  [649] 6.497147 5.152400 5.755108 7.078003 6.927901 3.964200 5.593407 6.019049
##  [657] 5.997200 4.837394 5.608840 5.975662 5.963501 4.535502 5.796707 6.106650
##  [665] 6.321381 4.751598 6.635900 4.694601 5.412158 5.098599 6.192700 5.599420
##  [673] 5.355300 7.007197 5.095301 6.469934 3.708900 5.333811 5.812408 5.607581
##  [681] 5.051307 5.580100 6.676600 7.111102 6.441307 5.435600 6.771202 6.521250
##  [689] 5.404095 5.348401 5.529400 5.598180 5.570900 5.882154 6.003796 6.527601
##  [697] 5.992297 5.682400 5.251720 5.924298 5.128600 5.434704 5.860197 5.901001
##  [705] 6.777926 6.444737 6.598104 6.010632 5.643216 5.640380 5.990102 6.452179
##  [713] 6.496585 5.243500 6.651401 5.765705 5.419794 6.823323 6.249052 5.419900
##  [721] 5.219894 4.908301 5.137600 4.858149 6.370603 5.123077 6.452652 5.730199
##  [729] 6.730374 5.892404 4.892200 6.983693 7.081603 6.043904 6.422603 5.915564
##  [737] 6.933154 6.454346 4.961099 6.523101 6.234146 5.003200 6.381001 6.421600
##  [745] 5.753093 5.706400 6.026586 5.569399 6.618472 6.612689 6.255288 6.276650
##  [753] 6.785680 5.360600 6.738299 5.054649 6.013905 6.122164 6.860497 5.267700
##  [761] 5.080204 4.006301 5.722740 6.134439 6.384849 5.887002 6.822464 5.213199
##  [769] 6.342673 5.789155 4.932601 5.946682 5.640700 6.658001 5.688440 4.998600
##  [777] 5.658998 5.594408 6.009100 6.347199 5.212293 5.954800 5.311260 6.062822
##  [785] 5.847400 5.295650 6.424400 5.325399 6.299280 5.594965 6.664892 6.802005
##  [793] 5.680900 6.371505 6.274363 5.410723 6.471258 6.096185 5.934600 5.422312
##  [801] 6.347235 5.557700 6.488057 6.280504 6.238417 6.325500 6.714278 5.236900
##  [809] 5.950259 6.036836 5.739399 5.177100 6.839464 6.518151 6.896598 5.368948
##  [817] 6.681100 5.587250 5.098599 5.884451 6.191597 6.411686 5.966864 5.237672
##  [825] 6.611582 6.730998 5.594965 5.150899 5.596600 5.998895 5.831404 5.892896
##  [833] 5.452000 6.316099 5.284101 4.871100 5.456760 6.821696 4.896349 6.379602
##  [841] 6.345400 4.841600 4.927104 6.381400 5.985653 5.936400 6.259349 5.395317
##  [849] 5.271300 5.834900 5.269400 5.175349 5.972446 5.880502 5.818419 6.126471
##  [857] 6.052407 5.906500 6.149899 6.008697 6.288797 5.936213 6.112603 5.787706
##  [865] 6.404240 5.967800 5.977800 5.905400 6.983219 6.612360 5.264749 6.166399
##  [873] 6.028720 6.201801 6.227925 5.087525 6.379597 6.331771 5.409699 4.956450
##  [881] 5.075200 5.805893 6.200500 6.140201 4.988507 5.121900 6.639302 5.570400
##  [889] 4.852941 6.434403 6.444723 5.334400 5.957267 5.139017 4.647099 5.804302
##  [897] 6.160837 5.272501 5.949199 5.915900 6.746295 5.790048 5.805321 6.049702
##  [905] 5.990456 5.214297 5.813200 6.635599 5.474801 5.838800 5.192300 6.102972
##  [913] 6.164450 5.402800 6.489343 5.938499 6.075129 5.892404 5.450095 5.906848
##  [921] 5.985205 6.867101 5.300700 6.725308 5.730448 5.942903 5.697400 4.849400
##  [929] 6.557100 5.575480 5.885598 6.135599 5.341598 6.159297 5.006684 5.837597
##  [937] 5.361000 6.066897 5.912284 4.971900 4.834399 6.067904 5.747300 6.534945
##  [945] 5.938679 6.942694 5.801600 5.737494 6.813920 5.577561 6.411686 6.524357
##  [953] 6.152420 6.171782 4.872449 4.988001 6.056532 5.105325 6.058192 5.930802
##  [961] 5.075200 6.361120 6.913600 6.424500 6.550105 6.387105 5.665657 5.610754
##  [969] 5.440700 6.186013 6.199804 5.175601 5.571393 5.189250 6.204291 5.144301
##  [977] 6.628181 6.133702 5.711499 6.140602 5.536100 5.428000 4.985499 5.148216
##  [985] 7.017704 6.393398 5.868763 6.009105 6.303993 5.465498 5.560008 5.338681
##  [993] 5.015858 5.858720 5.303250 6.520600 5.535399 6.350692 6.021299 6.726400
## [1001] 5.688440 6.037721 5.193101 6.243627 5.604008 5.442997 6.516210 6.143981
## [1009] 6.231800 6.508193 4.997323 6.003203 5.527500 6.451500 6.418379 5.935300
## [1017] 5.375111 7.014143 6.279066 5.577229 5.256101 6.072423 6.661403 4.871700
## [1025] 5.426480 5.773600 5.733600 6.635796 6.728998 6.532901 5.991600 6.566572
## [1033] 5.195301 5.703500 6.276598 5.323304 5.008101 5.046496 6.494785 5.697639
## [1041] 5.958222 6.270421 6.272800 5.720975 5.317889 5.299702 5.147400 6.999498
## [1049] 4.840000 5.440100 5.843600 7.003475 5.113297 6.075560 5.570420 5.371040
## [1057] 5.111301 5.105325 6.016800 6.047621 6.338500 4.907949 6.033002 5.091800
## [1065] 5.924298 6.043957 4.961500 6.186100 5.758401 6.542100 5.767504 6.885174
## [1073] 5.010180 5.184148 5.868901 6.190796 5.804400 5.362992 6.230501 6.942694
## [1081] 4.825000 6.896598 5.638008 6.860379 5.736123 5.608393 5.693018 5.976668
## [1089] 6.108000 5.331249 5.007751 5.564540 5.342896 5.891451 6.017900 4.951750
## [1097] 6.319636 5.986501 6.100916 6.170881 6.364350 6.032498 5.573849 6.436730
## [1105] 5.370203 5.863500 6.276055 5.570537 5.079001 5.301136 5.433360 6.019905
## [1113] 5.495700 5.783196 5.422740 5.094501 5.980088 6.637400 6.057098 5.633207
## [1121] 5.344675 6.361120 6.339003 5.940458 5.285795 5.338698 5.987340 4.671700
## [1129] 5.602696 5.863298 5.696306 6.034294 6.003604 6.508814 5.782903 5.892601
## [1137] 6.858105 5.968100 6.629999 6.496694 6.088439 5.462301 6.108502 4.997090
## [1145] 6.415817 6.066836 5.814192 4.668099

 

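Model evaluation

Evaluate the model using the predictions (p_wine). To make the data easier to handle, bind the test data (wine_test) and the predictions (p_wine) together. Because the target quality score (quality) is an integer, also create a feature that converts the predictions to integers. Note that as.integer truncates the fractional part rather than rounding to the nearest integer, so pred_int is never larger than pred.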

result <- dplyr::bind_cols(wine_test, pred = p_wine) %>% 
  dplyr::mutate(pred_int = as.integer(pred))
result
## # A tibble: 1,148 x 14
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7               0.33       0.28             5.7     0.033
##  2           7.4             0.39       0.23             7       0.033
##  3           6.9             0.14       0.38             1       0.041
##  4           6.5             0.18       0.290            1.7     0.035
##  5           6.8             0.28       0.44            11.5     0.04 
##  6           7.3             0.4        0.28             6.5     0.037
##  7           6.1             0.32       0.33            10.7     0.036
##  8           6.8             0.35       0.44             6.5     0.056
##  9           6               0.28       0.27            15.5     0.036
## 10           6.3             0.24       0.290           13.7     0.035
## # … with 1,138 more rows, and 9 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <int>, pred <dbl>, pred_int <int>
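The difference between truncating and rounding is easy to see on a few values. The numbers below are made up for illustration and are not taken from the wine model:

```r
pred <- c(5.2, 5.9, 6.4, 6.7)   # made-up predictions, not from the wine model

as.integer(pred)          # truncates toward zero: 5 5 6 6
as.integer(round(pred))   # rounds to nearest:     5 6 6 7
```

If nearest-integer scores are wanted, `as.integer(round(pred))` would be the usual choice; the outputs shown on this page were produced with plain `as.integer(pred)`.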

 

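Comparing summaries

The model never predicts a score of 8 or higher: the largest prediction is about 7.39, while the observed maximum is 9.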

summary(result$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.901   6.000   9.000
summary(result$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.677   5.416   5.906   5.848   6.238   7.393
summary(result$pred_int)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.359   6.000   7.000

 

Comparing correlation coefficients

cor(result$quality, result$pred)
## [1] 0.6201015
cor(result$quality, result$pred_int)
## [1] 0.537246

 

Comparing mean absolute error

mean(abs(result$quality - result$pred))
## [1] 0.5339725
mean(abs(result$quality - result$pred_int))
## [1] 0.6689895
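A mean absolute error is easier to judge against a naive baseline that always predicts the training mean. A minimal sketch, with the function names `mae` and `baseline_mae` chosen here for illustration:

```r
# Mean absolute error between observed and predicted values.
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Baseline: always predict the mean of the training target.
baseline_mae <- function(train_y, test_y) {
  mae(test_y, mean(train_y))
}

# mae(result$quality, result$pred)
# baseline_mae(wine_train$quality, wine_test$quality)
```

A model whose error is not clearly below the baseline adds little over predicting the average score.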

 

Distribution of predictions

hist(result$quality)

hist(result$pred)

hist(result$pred_int)

 

Predicted vs. observed plot (for reference)

ggplot2::ggplot(result, ggplot2::aes(x = quality, y = pred)) + 
  ggplot2::geom_abline(slope = 1, colour = "red", linetype = "dotted") +
  ggplot2::geom_point()

ggplot2::ggplot(result, ggplot2::aes(x = quality, y = pred_int)) + 
  ggplot2::geom_abline(slope = 1, colour = "red", linetype = "dotted") +
  ggplot2::geom_point()

 

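Summary

I could not determine why the RWeka::M5P function used in the textbook's sample code fails to reproduce the textbook's results. If you want to do M5'-style modeling, use the Cubist package, which provides Cubist, an algorithm that extends M5'.
The textbook uses a model tree to model a subjective target variable (the quality score). However, since the target can be regarded as an integer-valued interval scale, I consider prediction with a model tree (regression equations) to be a poor fit for it.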

 

 
Enjoy!  

Advice and corrections regarding this blog are welcome via データ分析勉強会 or GitHub.

CC BY-NC-SA 4.0 , Sampo Suzuki