`vignettes/07-glm.Rmd`

```
suppressPackageStartupMessages(library(disk.frame))
if (interactive()) {
  setup_disk.frame()
} else {
  # only use 1 worker to pass CRAN checks
  setup_disk.frame(1)
}
#> The number of workers available for disk.frame is 1
```

In this article, we will assume you are familiar with Generalized Linear Models (GLMs). You are also expected to have a basic working knowledge of `{disk.frame}`; see the `{disk.frame}` Quick Start.

One can fit a GLM using the `glm` function. For example,

`m = glm(dist ~ speed, data = cars)`

would fit a linear model on the data `cars` with `dist` as the target and `speed` as the explanatory variable. You can inspect the results of the model fit using

```
summary(m)
#> 
#> Call:
#> glm(formula = dist ~ speed, data = cars)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -29.069   -9.525   -2.272    9.215   43.201  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
#> speed         3.9324     0.4155   9.464 1.49e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 236.5317)
#> 
#>     Null deviance: 32539  on 49  degrees of freedom
#> Residual deviance: 11354  on 48  degrees of freedom
#> AIC: 419.16
#> 
#> Number of Fisher Scoring iterations: 2
```

or if you have broom installed

```
broom::tidy(m)
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
#> 2 speed           3.93     0.416      9.46 1.49e-12
```
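
If you prefer to avoid extra dependencies, the parameter estimates can also be extracted with base R's `coef()` on the same in-memory fit:

```r
# Extract just the coefficients of the in-memory glm fit, no extra packages
m = glm(dist ~ speed, data = cars)
coef(m)
```

This returns a named numeric vector with the `(Intercept)` and `speed` estimates shown in the summary above.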

With `{disk.frame}`, you can fit a GLM using the `dfglm` function, where the `df` stands for `disk.frame`, of course!

```
cars.df = as.disk.frame(cars)
m = dfglm(dist ~ speed, cars.df)
#> Loading required namespace: biglm
summary(m)
#> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
#> Sample size =  50 
#>                 Coef     (95%      CI)     SE      p
#> (Intercept) -17.5791 -31.0960  -4.0622 6.7584 0.0093
#> speed         3.9324   3.1014   4.7634 0.4155 0.0000

majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed = TRUE)[[1]][1])
if ((majorv > 3) || ((majorv == 3) && (minorv >= 6))) {
  broom::tidy(m)
} else {
  # broom doesn't work on R < 3.6 because biglm does not work there
}
```

The syntax didn’t change at all! You are able to enjoy the benefits of `disk.frame` when dealing with larger-than-RAM data.

Logistic regression is one of the most commonly deployed machine learning (ML) models. It is often used to build binary classification models.

```
iris.df = as.disk.frame(iris)

# fit a logistic regression model to predict Species == "versicolor"
# using all other variables
all_terms_except_species = setdiff(names(iris.df), "Species")
formula_rhs = paste0(all_terms_except_species, collapse = "+")
formula = as.formula(paste("Species == 'versicolor' ~ ", formula_rhs))

iris_model = dfglm(formula, data = iris.df, family = binomial())
summary(iris_model)
#> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
#> Sample size =  150 
#>                 Coef    (95%      CI)     SE      p
#> (Intercept)   7.3785  2.3799 12.3771 2.4993 0.0032
#> Sepal.Length -0.2454 -1.5445  1.0538 0.6496 0.7056
#> Sepal.Width  -2.7966 -4.3637 -1.2295 0.7835 0.0004
#> Petal.Length  1.3136 -0.0539  2.6812 0.6838 0.0547
#> Petal.Width  -2.7783 -5.1246 -0.4321 1.1731 0.0179

majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed = TRUE)[[1]][1])
if ((majorv > 3) || ((majorv == 3) && (minorv >= 6))) {
  broom::tidy(iris_model)
} else {
  # broom doesn't work on R < 3.6 because biglm does not work there
}
```
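
For comparison, when the data fits in RAM the same logistic regression can be fitted with base R's `glm()`; `dfglm` mirrors this interface for on-disk data:

```r
# In-memory equivalent of the dfglm call above, using base R only;
# glm() accepts a logical expression as the response, just like dfglm
fit = glm(Species == "versicolor" ~ Sepal.Length + Sepal.Width +
            Petal.Length + Petal.Width,
          data = iris, family = binomial())
coef(fit)
```

The coefficients agree with the `dfglm`/`bigglm` estimates above, since both fit the same model.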

The arguments to the `dfglm` function are the same as the arguments to `biglm::bigglm`, which are in turn based on those of the `glm` function. Please check their documentation for other argument options.

disk.frame uses `{biglm}` and `{speedglm}` as the backends for GLMs. Unfortunately, neither package is developed on an open-source platform, so it is more difficult to contribute to them by making bug fixes and submitting bug reports, and bugs are likely to persist. There is an active effort in `disk.frame` to look for alternatives. Examples of avenues to explore include tighter integration with Keras, H2O, or Julia's OnlineStats.jl for model-fitting purposes.

Another package for larger-than-RAM GLM fitting, `{bigFastlm}`, has been taken off CRAN; it is, however, maintained on GitHub.

Currently, parallel processing of GLM fitting is not possible with `{disk.frame}`.