suppressPackageStartupMessages(library(disk.frame))

if(interactive()) {
  setup_disk.frame()
} else {
  # only use 1 worker to pass CRAN check
  setup_disk.frame(1)
}
#> The number of workers available for disk.frame is 1

# GLMs

### Prerequisites

This article assumes you are familiar with Generalized Linear Models (GLMs) and have a basic working knowledge of {disk.frame}. If you are new to {disk.frame}, see the {disk.frame} Quick Start first.

## Introduction

In base R, you fit a GLM with the glm function. For example,

m = glm(dist ~ speed, data = cars)

fits a linear model on the cars data, with dist as the target and speed as the explanatory variable. Because no family is specified, glm uses its default Gaussian family with an identity link. You can inspect the results of the model fit using

summary(m)
#>
#> Call:
#> glm(formula = dist ~ speed, data = cars)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -29.069   -9.525   -2.272    9.215   43.201
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -17.5791     6.7584  -2.601   0.0123 *
#> speed         3.9324     0.4155   9.464 1.49e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 236.5317)
#>
#>     Null deviance: 32539  on 49  degrees of freedom
#> Residual deviance: 11354  on 48  degrees of freedom
#> AIC: 419.16
#>
#> Number of Fisher Scoring iterations: 2

or, if you have {broom} installed,

broom::tidy(m)
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
#> 2 speed           3.93     0.416      9.46 1.49e-12
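
Since the glm call above does not specify a family, it uses the default Gaussian family, so it should reproduce an ordinary least-squares fit. The following is a minimal sketch of that check using only base R; the names m_glm, m_lm and same_coefs are introduced here for illustration and are not part of the original example.

```r
# Fit the same model twice: once with glm (default family) and once with lm
m_glm <- glm(dist ~ speed, data = cars)
m_lm <- lm(dist ~ speed, data = cars)

# Compare the estimated coefficients up to numerical tolerance
same_coefs <- isTRUE(all.equal(coef(m_glm), coef(m_lm)))
print(same_coefs)

# The family and link that glm actually used
print(family(m_glm))
```

If the two fits agree, same_coefs is TRUE, which confirms that the example so far is a plain linear regression expressed as a GLM.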

With {disk.frame}, you can fit a GLM using the dfglm function, where the df stands for disk.frame, of course!

cars.df = as.disk.frame(cars)

m = dfglm(dist ~ speed, cars.df)

summary(m)
#> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
#> Sample size =  50
#>                 Coef     (95%     CI)     SE      p
#> (Intercept) -17.5791 -31.0960 -4.0622 6.7584 0.0093
#> speed         3.9324   3.1014  4.7634 0.4155 0.0000

majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed=TRUE)[[1]][1])

# run broom::tidy on R 3.6 or later (including R 4.x and above)
if((majorv > 3) || (majorv == 3 && minorv >= 6)) {
  broom::tidy(m)
} else {
  # broom doesn't work on R versions before 3.6 because biglm does not work there
}
#> # A tibble: 2 x 4
#>   term        estimate std.error  p.value
#>   <chr>          <dbl>     <dbl>    <dbl>
#> 1 (Intercept)   -17.6      6.76  9.29e- 3
#> 2 speed           3.93     0.416 2.96e-21

The model syntax didn't change at all: only the function name and the data argument differ. The same formula therefore carries over when you use disk.frame to work with larger-than-RAM data.
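
The header of the dfglm summary shows that the fit is delegated to biglm::bigglm with a streaming data function. The sketch below illustrates that general idea on the in-memory cars data: bigglm accepts a function of one argument, reset, that returns the next chunk of data or NULL when no chunks remain. It assumes the {biglm} package is installed. The helper make_stream and the five-way split are hypothetical and only for illustration; this is not the streaming function that {disk.frame} uses internally.

```r
library(biglm)

# Split cars (50 rows) into 5 chunks of 10 rows to mimic on-disk chunks
chunks <- split(cars, rep(1:5, each = 10))

# Hypothetical helper: returns a function that hands out one chunk per call
make_stream <- function(chunks) {
  i <- 0
  function(reset = FALSE) {
    if (reset) {
      i <<- 0          # rewind to the first chunk
      return(NULL)
    }
    i <<- i + 1
    if (i > length(chunks)) NULL else chunks[[i]]
  }
}

# Fit the model chunk by chunk, never needing all rows at once
m_stream <- bigglm(dist ~ speed, data = make_stream(chunks), family = gaussian())

# Reference fit on the full data for comparison
m_ref <- lm(dist ~ speed, data = cars)
print(coef(m_stream))
print(coef(m_ref))
```

Only one chunk needs to be in memory at a time, which is the reason this style of fitting suits data that does not fit in RAM.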

## Logistic regression

Logistic regression is one of the most commonly deployed machine learning (ML) models. It is often used to build binary classification models, as in the following example:

iris.df = as.disk.frame(iris)

# fit a logistic regression model to predict Species == "versicolor" using all other variables
all_terms_except_species = setdiff(names(iris.df), "Species")
formula_rhs = paste0(all_terms_except_species, collapse = "+")

formula = as.formula(paste("Species == 'versicolor' ~ ", formula_rhs))

iris_model = dfglm(formula , data = iris.df, family=binomial())

# iris_model = dfglm(Species == "setosa" ~ , data = iris.df, family=binomial())

summary(iris_model)
#> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
#> Sample size =  150
#>                 Coef    (95%     CI)     SE      p
#> (Intercept)   7.3785  2.3799 12.3771 2.4993 0.0032
#> Sepal.Length -0.2454 -1.5445  1.0538 0.6496 0.7056
#> Sepal.Width  -2.7966 -4.3637 -1.2295 0.7835 0.0004
#> Petal.Length  1.3136 -0.0539  2.6812 0.6838 0.0547
#> Petal.Width  -2.7783 -5.1246 -0.4321 1.1731 0.0179

majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed=TRUE)[[1]][1])

# run broom::tidy on R 3.6 or later (including R 4.x and above)
if((majorv > 3) || (majorv == 3 && minorv >= 6)) {
  broom::tidy(iris_model)
} else {
  # broom doesn't work on R versions before 3.6 because biglm does not work there
}
#> # A tibble: 5 x 4
#>   term         estimate std.error  p.value
#>   <chr>           <dbl>     <dbl>    <dbl>
#> 1 (Intercept)     7.38      2.50  0.00315
#> 2 Sepal.Length   -0.245     0.650 0.706
#> 3 Sepal.Width    -2.80      0.784 0.000358
#> 4 Petal.Length    1.31      0.684 0.0547
#> 5 Petal.Width    -2.78      1.17  0.0179
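
The coefficients above are on the log-odds scale. The following sketch shows one way to interpret and use such a model as a binary classifier. It refits the same formula in memory with base R's glm rather than dfglm, so it illustrates the technique and does not demonstrate {disk.frame} functionality. The 0.5 cut-off and the names iris_glm, odds_ratios, p_hat and confusion are assumptions made for this example.

```r
# In-memory version of the versicolor model, using base R only
iris_glm <- glm(
  Species == "versicolor" ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris,
  family = binomial()
)

# Exponentiating log-odds coefficients gives odds ratios
odds_ratios <- exp(coef(iris_glm))
print(odds_ratios)

# Predicted probability that each flower is versicolor
p_hat <- predict(iris_glm, type = "response")

# Turn probabilities into class labels with an illustrative 0.5 threshold
confusion <- table(
  actual = iris$Species == "versicolor",
  predicted = p_hat > 0.5
)
print(confusion)
```

The threshold is a modelling choice: lowering it flags more flowers as versicolor at the cost of more false positives.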

The arguments to the dfglm function are the same as the arguments to biglm::bigglm, which in turn are based on those of the glm function. Please check their documentation for the other available arguments.