vignettes/07-glm.Rmd
suppressPackageStartupMessages(library(disk.frame))
if (interactive()) {
  setup_disk.frame()
} else {
  # only use 1 worker to pass CRAN check
  setup_disk.frame(1)
}
#> The number of workers available for disk.frame is 1
In this article, we will assume you are familiar with Generalized Linear Models (GLMs). You are also expected to have basic working knowledge of {disk.frame}; see the {disk.frame} Quick Start.
One can fit a GLM using the glm function. For example,
m = glm(dist ~ speed, data = cars)
would fit a linear model on the cars data with dist as the target and speed as the explanatory variable. You can inspect the results of the model fit using
summary(m)
#>
#> Call:
#> glm(formula = dist ~ speed, data = cars)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -29.069 -9.525 -2.272 9.215 43.201
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -17.5791 6.7584 -2.601 0.0123 *
#> speed 3.9324 0.4155 9.464 1.49e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 236.5317)
#>
#> Null deviance: 32539 on 49 degrees of freedom
#> Residual deviance: 11354 on 48 degrees of freedom
#> AIC: 419.16
#>
#> Number of Fisher Scoring iterations: 2
or, if you have the broom package installed,
broom::tidy(m)
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -17.6 6.76 -2.60 1.23e- 2
#> 2 speed 3.93 0.416 9.46 1.49e-12
With {disk.frame}, you can fit a GLM using the dfglm function, where the df stands for disk.frame, of course!
cars.df = as.disk.frame(cars)
m = dfglm(dist ~ speed, cars.df)
#> Loading required namespace: biglm
summary(m)
#> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
#> Sample size = 50
#> Coef (95% CI) SE p
#> (Intercept) -17.5791 -31.0960 -4.0622 6.7584 0.0093
#> speed 3.9324 3.1014 4.7634 0.4155 0.0000
majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed = TRUE)[[1]][1])
# broom::tidy on biglm fits only works from R 3.6 onwards
if (majorv > 3 || (majorv == 3 && minorv >= 6)) {
  broom::tidy(m)
} else {
  # broom doesn't work on R < 3.6 because biglm does not work there
}
#> NULL
The syntax didn't change at all! You can enjoy the benefits of {disk.frame} when dealing with larger-than-RAM data.
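As a sketch of that workflow, the chunk below converts a hypothetical large CSV file to a disk.frame with csv_to_disk.frame() and fits the same model. The file name big_data.csv, the output directory, and the dist/speed columns are placeholders for illustration, not part of this vignette's data.
# a minimal sketch: big_data.csv and its columns are hypothetical placeholders
# convert the CSV to a disk.frame chunk-by-chunk instead of loading it all into RAM
big.df = csv_to_disk.frame("big_data.csv", outdir = "big_data.df")

# the model-fitting syntax is identical to the in-memory case
m_big = dfglm(dist ~ speed, big.df)
summary(m_big)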
Logistic regression is one of the most commonly deployed machine learning (ML) models. It is often used to build binary classification models.
iris.df = as.disk.frame(iris)
# fit a logistic regression model to predict Species == "versicolor" using all other variables
all_terms_except_species = setdiff(names(iris.df), "Species")
formula_rhs = paste0(all_terms_except_species, collapse = "+")
formula = as.formula(paste("Species == 'versicolor' ~ ", formula_rhs))
iris_model = dfglm(formula , data = iris.df, family=binomial())
summary(iris_model)
#> Large data regression model: biglm::bigglm(formula, data = streaming_fn, ...)
#> Sample size = 150
#> Coef (95% CI) SE p
#> (Intercept) 7.3785 2.3799 12.3771 2.4993 0.0032
#> Sepal.Length -0.2454 -1.5445 1.0538 0.6496 0.7056
#> Sepal.Width -2.7966 -4.3637 -1.2295 0.7835 0.0004
#> Petal.Length 1.3136 -0.0539 2.6812 0.6838 0.0547
#> Petal.Width -2.7783 -5.1246 -0.4321 1.1731 0.0179
majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed = TRUE)[[1]][1])
# broom::tidy on biglm fits only works from R 3.6 onwards
if (majorv > 3 || (majorv == 3 && minorv >= 6)) {
  broom::tidy(iris_model)
} else {
  # broom doesn't work on R < 3.6 because biglm does not work there
}
#> NULL
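To score new observations with the fitted model, one option (a sketch, not the only way) is to extract the coefficients from the bigglm object with coef() and apply the inverse logit yourself with base R's plogis(). collect(iris.df) is used here only because iris comfortably fits in RAM; for genuinely large data you would score chunk by chunk instead.
# a sketch: compute predicted probabilities manually from the fitted coefficients
cf = coef(iris_model)              # named coefficient vector from the bigglm fit
newdata = collect(iris.df)         # small data, so bringing it into RAM is fine here
X = model.matrix(
  ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = newdata
)
eta = drop(X %*% cf[colnames(X)])  # linear predictor
prob_versicolor = plogis(eta)      # inverse logit
head(prob_versicolor)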
The arguments to the dfglm function are the same as the arguments to biglm::bigglm, which are in turn based on those of the glm function. Please check their documentation for other argument options.
disk.frame uses {biglm} and {speedglm} as the backends for GLMs. Unfortunately, neither package is managed on an open-source platform, so it is more difficult to contribute to them by making bug fixes and submitting bug reports, and bugs are likely to persist. There is an active effort on disk.frame's part to look for alternatives. Examples of avenues to explore include tighter integration with keras, h2o, or Julia's OnlineStats.jl for model fitting.
Another package for larger-than-RAM GLM fitting, {bigFastlm}, has been taken off CRAN, although it is still managed on GitHub.
Currently, parallel processing of GLM fitting is not possible with {disk.frame}.