Fits GLMs using `speedglm` or `biglm`. The returned object is exactly what those functions return. This is a convenience wrapper.

dfglm(formula, data, ..., glm_backend = c("biglm", "speedglm", "biglmm"))

Arguments

formula

A model formula

data

See Details below. Method dispatch is on this argument

...

Additional arguments

glm_backend

Which package to use for fitting GLMs. The default is "biglm", which has known issues with factor levels if different levels are present in different chunks. The "speedglm" option is more robust, but does not implement `predict`, so a model fitted with it cannot be used for prediction.

Value

An object of class bigglm when the "biglm" backend is used; otherwise, the object returned by the chosen backend.

Details

The data argument may be a function, a data frame, or a SQLiteConnection or RODBC connection object.

When it is a function, the function must take a single argument, reset. When this argument is FALSE, the function returns a data frame with the next chunk of data, or NULL if no more data are available. When reset=TRUE, it indicates that the data should be reread from the beginning by subsequent calls. The chunks need not be the same size or in the same order when the data are reread, but the same data must be provided in total. The bigglm.data.frame method gives an example of how such a function might be written.
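The reset contract described above can be sketched in base R as follows. This is a minimal illustration, not the package's own code: `make_chunk_fn` and `chunk_size` are hypothetical names, and the source is assumed to be an in-memory data frame. The final call to `biglm::bigglm` is guarded, so it runs only if biglm is installed.

```r
# Hypothetical helper: wrap a data frame as a chunked data source
# that follows the reset = TRUE / reset = FALSE contract.
make_chunk_fn <- function(df, chunk_size = 10) {
  pos <- 0  # number of rows already handed out
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0          # start again from the first row
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)  # no more data
    rows <- seq(pos + 1, min(pos + chunk_size, nrow(df)))
    pos <<- pos + length(rows)
    df[rows, , drop = FALSE]
  }
}

next_chunk <- make_chunk_fn(cars, chunk_size = 20)

# Assumption: biglm is installed; otherwise this step is skipped.
if (requireNamespace("biglm", quietly = TRUE)) {
  fit <- biglm::bigglm(dist ~ speed, data = next_chunk)
}
```

A closure keeps the read position private to the function, so each fitting pass can rewind it with reset=TRUE without any global state.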

The model formula must not contain any data-dependent terms, as these will not be consistent when updated. Factors are permitted, but the levels of the factor must be the same across all data chunks (empty factor levels are ok). Offsets are allowed (since version 0.8).
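One way to satisfy the factor-level requirement is to recode each chunk against a shared level set before fitting. The sketch below uses base R only; `all_levels`, `fix_levels`, and the two toy chunks are illustrative assumptions, not part of the package.

```r
# Assumption: the full set of levels is known before chunking.
all_levels <- c("a", "b", "c")

# Two illustrative chunks; neither contains every level.
chunk1 <- data.frame(g = c("a", "b"), y = c(1, 2), stringsAsFactors = FALSE)
chunk2 <- data.frame(g = c("b", "c"), y = c(3, 4), stringsAsFactors = FALSE)

# Hypothetical helper: recode `g` against the shared level set,
# leaving unused levels empty rather than dropping them.
fix_levels <- function(chunk) {
  chunk$g <- factor(chunk$g, levels = all_levels)
  chunk
}

chunk1 <- fix_levels(chunk1)
chunk2 <- fix_levels(chunk2)

# Check that both chunks produce design matrices with the same columns.
same_cols <- identical(colnames(model.matrix(y ~ g, chunk1)),
                       colnames(model.matrix(y ~ g, chunk2)))
```

Because empty levels are permitted, each chunk keeps a column for every level even when some levels do not occur in it.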

The SQLiteConnection and RODBC methods load only the variables needed for the model, not the whole table. The code in the SQLiteConnection method should work for other DBI connections, but I do not have any of these to check it with.

References

Algorithm AS274. Applied Statistics (1992), Vol. 41, No. 2.

See also

Other Machine Learning (ML): make_glm_streaming_fn()

Examples

cars.df = as.disk.frame(cars)
m = dfglm(dist ~ speed, data = cars.df)
#> Loading required namespace: biglm

# can use normal R functions
# Only works in R >= 3.6
majorv = as.integer(version$major)
minorv = as.integer(strsplit(version$minor, ".", fixed=TRUE)[[1]][1])
if(((majorv == 3) && (minorv >= 6)) || (majorv > 3)) {
  summary(m)
  predict(m, get_chunk(cars.df, 1))
  predict(m, collect(cars.df))
  # can use broom to tidy up the returned info
  broom::tidy(m)
}
#> # A tibble: 2 x 4
#>   term        estimate std.error  p.value
#>   <chr>          <dbl>     <dbl>    <dbl>
#> 1 (Intercept)   -17.6      6.76  9.29e- 3
#> 2 speed           3.93     0.416 2.96e-21

# clean up
delete(cars.df)