`vignettes/07-glm.Rmd`

    suppressPackageStartupMessages(library(disk.frame))

    if (interactive()) {
      setup_disk.frame()
    } else {
      # only use 1 worker to pass CRAN check
      setup_disk.frame(1)
    }
    #> The number of workers available for disk.frame is 1

In this article, we assume you are familiar with Generalized Linear Models (GLMs). You are also expected to have a basic working knowledge of `{disk.frame}`; see the `{disk.frame}` Quick Start.

One can fit a GLM using the `glm` function. For example,

    m = glm(dist ~ speed, data = cars)

fits a linear model (a GLM with the default Gaussian family) on the `cars` data with `dist` as the target and `speed` as the explanatory variable. You can inspect the results of the model fit using

    summary(m)
    #>
    #> Call:
    #> glm(formula = dist ~ speed, data = cars)
    #>
    #> Deviance Residuals:
    #>     Min       1Q   Median       3Q      Max
    #> -29.069   -9.525   -2.272    9.215   43.201
    #>
    #> Coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)
    #> (Intercept) -17.5791     6.7584  -2.601   0.0123 *
    #> speed         3.9324     0.4155   9.464 1.49e-12 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #>
    #> (Dispersion parameter for gaussian family taken to be 236.5317)
    #>
    #>     Null deviance: 32539  on 49  degrees of freedom
    #> Residual deviance: 11354  on 48  degrees of freedom
    #> AIC: 419.16
    #>
    #> Number of Fisher Scoring iterations: 2

or, if you have `{broom}` installed,

    broom::tidy(m)
    #> # A tibble: 2 x 5
    #>   term        estimate std.error statistic  p.value
    #>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    #> 1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
    #> 2 speed           3.93     0.416      9.46 1.49e-12
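If `{broom}` is not available, base R accessors return the same numbers. The following is a minimal sketch that uses only base R; the `new_speeds` data frame is made up for illustration.

```r
# Fit the same model as above; gaussian() is glm's default family
m <- glm(dist ~ speed, data = cars, family = gaussian())

# Named vector of coefficient estimates
coef(m)

# Matrix of estimates, standard errors, t values and p values
coef(summary(m))

# Predicted stopping distance at a few speeds (hypothetical new data)
new_speeds <- data.frame(speed = c(10, 20))
predict(m, newdata = new_speeds)
```

`coef(summary(m))` is convenient when you want the coefficient table as a plain matrix rather than printed text.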

With `{disk.frame}`, you can fit a GLM using the `dfglm` function, where the `df` stands for `disk.frame`, of course!

    cars.df = as.disk.frame(cars)

    m = dfglm(dist ~ speed, cars.df)
    #> Loading required namespace: biglm

    summary(m)
    #> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
    #> Sample size =  50
    #>                 Coef     (95%      CI)     SE      p
    #> (Intercept) -17.5791 -31.0960  -4.0622 6.7584 0.0093
    #> speed         3.9324   3.1014   4.7634 0.4155 0.0000

    majorv = as.integer(version$major)
    minorv = as.integer(strsplit(version$minor, ".", fixed=TRUE)[[1]][1])

    if((majorv == 3) & (minorv >= 6)) {
      broom::tidy(m)
    } else {
      # broom doesn't work in version < R3.6 because biglm does not work
    }
    #> NULL

The formula syntax didn't change at all! Only the function name differs, so you can enjoy the benefits of `disk.frame` when dealing with larger-than-RAM data.
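To give some intuition for how a model can be fitted without holding all of the data in memory at once, the sketch below fits the same linear model by accumulating small summary matrices one chunk at a time. It is illustrative only: the helper name `fit_lm_in_chunks` and the chunk size are made up for this example, the chunks are sliced from an in-memory data frame rather than read from disk, and this is not how `{biglm}` or `dfglm` are implemented internally.

```r
# Hypothetical helper: ordinary least squares computed chunk by chunk.
# Only the small matrices X'X and X'y are kept between chunks.
fit_lm_in_chunks <- function(formula, data, chunk_size = 10) {
  starts <- seq(1, nrow(data), by = chunk_size)
  xtx <- NULL
  xty <- NULL
  for (s in starts) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    mf <- model.frame(formula, data[rows, , drop = FALSE])
    x <- model.matrix(formula, mf)  # design matrix for this chunk
    y <- model.response(mf)         # response values for this chunk
    if (is.null(xtx)) {
      xtx <- crossprod(x)
      xty <- crossprod(x, y)
    } else {
      xtx <- xtx + crossprod(x)
      xty <- xty + crossprod(x, y)
    }
  }
  # Solve the normal equations (X'X) beta = X'y
  drop(solve(xtx, xty))
}

chunked <- fit_lm_in_chunks(dist ~ speed, cars)
in_memory <- coef(lm(dist ~ speed, data = cars))

# Compare the chunked estimates with the usual in-memory fit
all.equal(chunked, in_memory)
```

The point of the design is that memory use depends on the number of model terms, not on the number of rows, which is what makes chunk-wise fitting attractive for larger-than-RAM data.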

Logistic regression is one of the most commonly deployed machine learning (ML) models. It is often used to build binary classification models, as in this example:

    iris.df = as.disk.frame(iris)

    # fit a logistic regression model to predict Species == "versicolor" using all other variables
    all_terms_except_species = setdiff(names(iris.df), "Species")
    formula_rhs = paste0(all_terms_except_species, collapse = "+")

    formula = as.formula(paste("Species == 'versicolor' ~ ", formula_rhs))

    iris_model = dfglm(formula, data = iris.df, family = binomial())

    summary(iris_model)
    #> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
    #> Sample size =  150
    #>                 Coef    (95%     CI)     SE      p
    #> (Intercept)   7.3785  2.3799 12.3771 2.4993 0.0032
    #> Sepal.Length -0.2454 -1.5445  1.0538 0.6496 0.7056
    #> Sepal.Width  -2.7966 -4.3637 -1.2295 0.7835 0.0004
    #> Petal.Length  1.3136 -0.0539  2.6812 0.6838 0.0547
    #> Petal.Width  -2.7783 -5.1246 -0.4321 1.1731 0.0179

    majorv = as.integer(version$major)
    minorv = as.integer(strsplit(version$minor, ".", fixed=TRUE)[[1]][1])

    if((majorv == 3) & (minorv >= 6)) {
      broom::tidy(iris_model)
    } else {
      # broom doesn't work in version < R3.6 because biglm does not work
    }
    #> NULL
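For a data set that fits in memory, such as `iris`, the same model can be fitted with base R's `glm` as a sanity check; the estimates should be close to those reported by `dfglm`. This is a sketch using base R only. The name `glm_model` is chosen for this example, and `I()` is used so that the comparison on the left-hand side is evaluated as a logical response.

```r
# Response: TRUE when the flower is versicolor, FALSE otherwise
glm_model <- glm(
  I(Species == "versicolor") ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris,
  family = binomial()
)

# Estimates on the log-odds scale
round(coef(glm_model), 4)

# Fitted probabilities of being versicolor for the first few rows
head(predict(glm_model, type = "response"))
```

Comparing against an in-memory fit on a small sample is a cheap way to validate a model specification before running it on the full data.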

The arguments to the `dfglm` function are the same as the arguments to `biglm::bigglm`, which are in turn based on the `glm` function. Please check their documentation for other argument options.

`{disk.frame}` uses `{biglm}` and `{speedglm}` as the backends for GLMs. Unfortunately, neither package is managed on an open-source platform, so it is more difficult to contribute to them by making bug fixes and submitting bug reports. As a result, bugs are likely to persist. There is an active effort in `disk.frame` to look for alternatives. Examples of avenues to explore include tighter integration with `{keras}`, h2o, or Julia's OnlineStats.jl for model fitting.

Another package for larger-than-RAM GLM fitting, `{bigFastlm}`, has been taken off CRAN; it is managed on GitHub.

Currently, parallel processing of GLM fits is not possible with `{disk.frame}`.