| Overview {Design} | R Documentation |
Design does regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit.
Design is a collection of about 180 functions that assist and streamline modeling, especially for biostatistical and epidemiologic applications. It also contains new functions for binary and ordinal logistic regression models and the Buckley-James multiple regression model for right-censored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models. Design works with almost any regression model, but it was especially written to work with logistic regression, Cox regression, accelerated failure time models, ordinary linear models, and the Buckley-James model. You should install the Hmisc library before using Design, as a few of Design's options use Hmisc functions, and Hmisc has several functions useful for data analysis (especially data reduction and imputation).
To make use of automatic typesetting features you must
have LaTeX or one of its variants installed.
Some aspects of Design (e.g., latex) will not work correctly if
options(contrasts=) other than c("contr.treatment",
"contr.poly") are used.
Design relies on a wealth of survival analysis functions written by Terry Therneau of Mayo Clinic. Front-ends have been written for several of Therneau's functions, and other functions have been slightly modified.
Design was motivated by the following needs:
predict without newdata or when using
resid
f <- lrm(y ~ log(cholesterol)+age) plot(f, cholesterol=NA) # cholesterol on x-axis, default range
summary(fit, age=c(30,50),
sex="female") -> odds ratios for logistic model, relative survival time
for accelerated failure time survival models
pmin(x^2-3,10) refer to factor with legal S-name
x
na.delete in Hmisc)
Design will work with a wide variety of fitting functions, but it is meant especially for the following:
| Function | Purpose | Related S |
| Functions | ||
ols | Ordinary least squares linear model | lm |
lrm | Binary and ordinal logistic regression | glm |
| model | cr.setup |
|
psm | Accelerated failure time parametric | survreg |
| survival model | ||
cph | Cox proportional hazards regression | coxph |
bj | Buckley-James censored least squares | survreg |
| linear model | ||
glmD | Version of glm for use with Design | |
glsD | Version of gls for use with Design |
The following generic functions work with fits with Design in effect:
| Function | Purpose | Related |
| Functions | ||
print | Print parameters and statistics of fit | |
coef | Fitted regression coefficients | |
formula | Formula used in the fit | |
specs | Detailed specifications of fit | |
robcov | Robust covariance matrix estimates | |
bootcov | Bootstrap covariance matrix estimates | |
summary | Summary of effects of predictors | |
plot.summary | Plot continuously shaded confidence | |
| bars for results of summary | ||
anova | Wald tests of most meaningful hypotheses | |
contrast | General contrasts, C.L., tests | |
plot.anova | Depict results of anova graphically | dotchart |
plot | Plot effects of predictors | |
gendata | Generate data frame with predictor | expand.grid |
| combinations (optionally interactively) | ||
predict | Obtain predicted values or design matrix | |
fastbw | Fast backward step-down variable | step |
| selection | ||
residuals | Residuals, influence statistics from fit | |
(or resid) | ||
which.influence | Which observations are overly | residuals |
| influential | ||
sensuc | Sensitivity of one binary predictor in | |
| lrm and cph models to an unmeasured | ||
| binary confounder | ||
latex | LaTeX representation of fitted | |
model or anova or summary table | ||
Function | S function analytic representation | Function.transcan |
| of a fitted regression model (X*Beta) | ||
hazard | S function analytic representation | rcspline.restate |
of a fitted hazard function (for psm) | ||
Survival | S function analytic representation of | |
fitted survival function (for psm,cph) | ||
Quantile | S function analytic representation of | |
| fitted function for quantiles of | ||
survival time (for psm, cph) | ||
nomogram | Draws a nomogram for the fitted model | latex, plot |
survest | Estimate survival probabilities | survfit |
(for psm, cph) | ||
survplot | Plot survival curves (psm, cph) | plot.survfit |
validate | Validate indexes of model fit using | val.prob |
| resampling | ||
calibrate | Estimate calibration curve for model | |
| using resampling | ||
vif | Variance inflation factors for a fit | |
naresid | Bring elements corresponding to missing | |
| data back into predictions and residuals | ||
naprint | Print summary of missing values | |
pentrace | Find optimum penality for penalized MLE | |
effective.df | Print effective d.f. for each type of | |
| variable in model, for penalized fit or | ||
| pentrace result | ||
rm.impute | Impute repeated measures data with | transcan, |
| non-random dropout | fit.mult.impute |
|
| experimental, non-functional |
The following programs demonstrate how the pieces of
the Design package work together. A (usually)
one-time call to the function datadist requires a
pass at the entire data frame to store distribution
summaries for potential predictor variables. These
summaries contain (by default) the .25 and .75
quantiles of continuous variables (for estimating
effects such as odds ratios), the 10th smallest and
10th largest values (or .1 and .9 quantiles for small
n) for plotting ranges for estimated curves, and the
total range. For discrete numeric variables (those
having <=10 unique values), the list of unique values
is also stored. Such summaries are used by the
summary.Design, plot.Design, and nomogram.Design
functions. You may save time and defer running
datadist. In that case, the distribution summary
is not stored with the fit object, but it can be
gathered before running summary or plot.
d <- datadist(my.data.frame) # or datadist(x1,x2)
options(datadist="d") # omit this or use options(datadist=NULL)
# if not run datadist yet
cf <- ols(y ~ x1 * x2)
anova(f)
fastbw(f)
predict(f, newdata)
In the Examples section there are three detailed examples using a
fitting function
designed to be used with Design, lrm (logistic
regression model). In Detailed Example 1 we
create 3 predictor variables and a two binary response
on 500 subjects. For the first binary response, dz,
the true model involves only sex and age, and there is
a nonlinear interaction between the two because the log
odds is a truncated linear relationship in age for
females and a quadratic function for males. For the
second binary outcome, dz.bp, the true population model
also involves systolic blood pressure (sys.bp) through
a truncated linear relationship. First, nonparametric
estimation of relationships is done using the Hmisc
library's plsmo function which uses lowess with outlier
detection turned off for binary responses. Then
parametric modeling is done using restricted cubic
splines. This modeling does not assume that we know
the true transformations for age or sys.bp but that
these transformations are smooth (which is not actually
the case in the population).
For Detailed Example 2, suppose that a
categorical variable treat has values "a", "b", and
"c", an ordinal variable num.diseases has values
0,1,2,3,4, and that there are two continuous variables,
age and cholesterol. age is fitted with a restricted
cubic spline, while cholesterol is transformed using
the transformation log(cholesterol - 10). Cholesterol
is missing on three subjects, and we impute these using
the overall median cholesterol. We wish to allow for
interaction between treat and cholesterol. The
following S program will fit a logistic model,
test all effects in the design, estimate effects, and
plot estimated transformations. The fit for
num.diseases really considers the variable to be a
5-level categorical variable. The only difference is
that a 3 d.f. test of linearity is done to assess
whether the variable can be re-modeled "asis". Here
we also show statements to attach the Design library
and store predictor characteristics from datadist.
Detailed Example 3 shows some of the survival
analysis capabilities of Design related to the Cox
proportional hazards model. We simulate data for 2000
subjects with 2 predictors, age and sex. In the true
population model, the log hazard function is linear in
age and there is no age x sex interaction. In the
analysis below we do not make use of the linearity in
age. Design makes use of many of Terry Therneau's
survival functions that are builtin to S.
The following is a typical sequence of steps that
would be used with Design in conjunction with the Hmisc
transcan function to do single imputation of all NAs in the
predictors (multiple imputation would be better but would be
harder to do in the context of bootstrap model validation),
fit a model, do backward stepdown to reduce the number of
predictors in the model (with all the severe problems this can
entail), and use the bootstrap to validate this stepwise model,
repeating the variable selection for each re-sample. Here we
take a short cut as the imputation is not repeated within the
bootstrap.
In what follows we (atypically) have only 3 candidate predictors. In practice be sure to have the validate and calibrate functions operate on a model fit that contains all predictors that were involved in previous analyses that used the response variable. Here the imputation is necessary because backward stepdown would otherwise delete observations missing on any candidate variable.
Note that you would have to define x1, x2, x3, y to run
the following code.
xt <- transcan(~ x1 + x2 + x3, imputed=TRUE)
impute(xt) # imputes any NAs in x1, x2, x3
# Now fit original full model on filled-in data
f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=TRUE, y=TRUE) #x,y allow boot.
fastbw(f)
# derives stepdown model (using default stopping rule)
validate(f, B=100, bw=TRUE) # repeats fastbw 100 times
cal <- calibrate(f, B=100, bw=TRUE) # also repeats fastbw
plot(cal)
y ~ age + age^2.
In S you need to connect related variables using
a function which produces a matrix, such as pol or
rcs.
This allows effect estimates (e.g., hazard ratios)
to be computed as well as multiple d.f. tests of
association.
poly or strata inside formulas used in
Design. Use pol and strat instead.
anova can't do its
job.
y ~ log(cell.count+1),
which will allow cell.count to appear on
x-axes. You can get fancier, e.g.,
y ~ rcs(log(cell.count+1),4) to fit a restricted
cubic spline with 4 knots in log(cell.count+1).
For more complex transformations do something
like f <- function(x) {... various 'if' statements, etc.log(pmin(x,50000)+1)}fit1 <- lrm(death ~ f(cell.count))fit2 <- lrm(death ~ rcs(f(cell.count),4))}
$ inside variable names used in formulas.
Either attach data frames or use data=.
datadist. Try to use it
at the top of your program so that all model fits
can automatically take advantage if its
distributional summaries for the predictors.
validate or calibrate models which were
reduced by dropping "insignificant" predictors.
Proper bootstrap or cross-validation must repeat
any variable selection steps for each re-sample.
Therefore, validate or calibrate models
which contain all candidate predictors, and if
you must reduce models, specify the option
bw=TRUE to validate or calibrate.
If you are using any of Design's survival analysis functions, create a
file called .Rprofile in your working directory that contains the
line library(survival). That way, survival will move down the
search list as Hmisc and Design are attached during your session. This
will allow Hmisc and Design to override some of the survival function such as
survfit.
Since the Design library has a .First.lib function,
that function will be executed by the library
command, to dynamically load the .o or .obj files. You
may want to create a .First function such as
.First <- {
options(na.action = "na.delete")
# gives more info than na.omit
library(Hmisc)
library(Design)
invisible()
}
latex.Design
transcan for
imputation
latex anova or summary table Function Function.transcan hazard rcspline.restate psm) Survival psm,cph) Quantile psm, cph) nomogram latex, plot survest survfit psm, cph) survplot validate calibrate vif naresid naprint pentrace effective.df rm.impute transcan, fit.mult.impute
The following programs demonstrate how the pieces of
the Design package work together. A (usually)
one-time call to the function datadist requires a
pass at the entire data frame to store distribution
summaries for potential predictor variables. These
summaries contain (by default) the .25 and .75
quantiles of continuous variables (for estimating
effects such as odds ratios), the 10th smallest and
10th largest values (or .1 and .9 quantiles for small
n) for plotting ranges for estimated curves, and the
total range. For discrete numeric variables (those
having <=10 unique values), the list of unique values
is also stored. Such summaries are used by the
summary.Design, plot.Design, and nomogram.Design
functions. You may save time and defer running
datadist. In that case, the distribution summary
is not stored with the fit object, but it can be
gathered before running summary or plot.
d <- datadist(my.data.frame) # or datadist(x1,x2)
options(datadist="d") # omit this or use options(datadist=NULL)
# if not run datadist yet
cf <- ols(y ~ x1 * x2)
anova(f)
fastbw(f)
predict(f, newdata)
In the Examples section there are three detailed examples using a
fitting function
designed to be used with Design, lrm (logistic
regression model). In Detailed Example 1 we
create 3 predictor variables and a two binary response
on 500 subjects. For the first binary response, dz,
the true model involves only sex and age, and there is
a nonlinear interaction between the two because the log
odds is a truncated linear relationship in age for
females and a quadratic function for males. For the
second binary outcome, dz.bp, the true population model
also involves systolic blood pressure (sys.bp) through
a truncated linear relationship. First, nonparametric
estimation of relationships is done using the Hmisc
library's plsmo function which uses lowess with outlier
detection turned off for binary responses. Then
parametric modeling is done using restricted cubic
splines. This modeling does not assume that we know
the true transformations for age or sys.bp but that
these transformations are smooth (which is not actually
the case in the population).
For Detailed Example 2, suppose that a
categorical variable treat has values "a", "b", and
"c", an ordinal variable num.diseases has values
0,1,2,3,4, and that there are two continuous variables,
age and cholesterol. age is fitted with a restricted
cubic spline, while cholesterol is transformed using
the transformation log(cholesterol - 10). Cholesterol
is missing on three subjects, and we impute these using
the overall median cholesterol. We wish to allow for
interaction between treat and cholesterol. The
following S program will fit a logistic model,
test all effects in the design, estimate effects, and
plot estimated transformations. The fit for
num.diseases really considers the variable to be a
5-level categorical variable. The only difference is
that a 3 d.f. test of linearity is done to assess
whether the variable can be re-modeled "asis". Here
we also show statements to attach the Design library
and store predictor characteristics from datadist.
Detailed Example 3 shows some of the survival
analysis capabilities of Design related to the Cox
proportional hazards model. We simulate data for 2000
subjects with 2 predictors, age and sex. In the true
population model, the log hazard function is linear in
age and there is no age x sex interaction. In the
analysis below we do not make use of the linearity in
age. Design makes use of many of Terry Therneau's
survival functions that are builtin to S.
The following is a typical sequence of steps that
would be used with Design in conjunction with the Hmisc
transcan function to do single imputation of all NAs in the
predictors (multiple imputation would be better but would be
harder to do in the context of bootstrap model validation),
fit a model, do backward stepdown to reduce the number of
predictors in the model (with all the severe problems this can
entail), and use the bootstrap to validate this stepwise model,
repeating the variable selection for each re-sample. Here we
take a short cut as the imputation is not repeated within the
bootstrap.
In what follows we (atypically) have only 3 candidate predictors. In practice be sure to have the validate and calibrate functions operate on a model fit that contains all predictors that were involved in previous analyses that used the response variable. Here the imputation is necessary because backward stepdown would otherwise delete observations missing on any candidate variable.
Note that you would have to define x1, x2, x3, y to run
the following code.
xt <- transcan(~ x1 + x2 + x3, imputed=TRUE)
impute(xt) # imputes any NAs in x1, x2, x3
# Now fit original full model on filled-in data
f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=TRUE, y=TRUE) #x,y allow boot.
fastbw(f)
# derives stepdown model (using default stopping rule)
validate(f, B=100, bw=TRUE) # repeats fastbw 100 times
cal <- calibrate(f, B=100, bw=TRUE) # also repeats fastbw
plot(cal)
y ~ age + age^2.
In S you need to connect related variables using
a function which produces a matrix, such as pol or
rcs.
This allows effect estimates (e.g., hazard ratios)
to be computed as well as multiple d.f. tests of
association.
poly or strata inside formulas used in
Design. Use pol and strat instead.
anova can't do its
job.
y ~ log(cell.count+1),
which will allow cell.count to appear on
x-axes. You can get fancier, e.g.,
y ~ rcs(log(cell.count+1),4) to fit a restricted
cubic spline with 4 knots in log(cell.count+1).
For more complex transformations do something
like f <- function(x) {... various 'if' statements, etc.log(pmin(x,50000)+1)}fit1 <- lrm(death ~ f(cell.count))fit2 <- lrm(death ~ rcs(f(cell.count),4))}
$ inside variable names used in formulas.
Either attach data frames or use data=.
datadist. Try to use it
at the top of your program so that all model fits
can automatically take advantage if its
distributional summaries for the predictors.
validate or calibrate models which were
reduced by dropping "insignificant" predictors.
Proper bootstrap or cross-validation must repeat
any variable selection steps for each re-sample.
Therefore, validate or calibrate models
which contain all candidate predictors, and if
you must reduce models, specify the option
bw=TRUE to validate or calibrate.
If you are using any of Design's survival analysis functions, create a
file called .Rprofile in your working directory that contains the
line library(survival). That way, survival will move down the
search list as Hmisc and Design are attached during your session. This
will allow Hmisc and Design to override some of the survival function such as
survfit.
Since the Design library has a .First.lib function,
that function will be executed by the library
command, to dynamically load the .o or .obj files. You
may want to create a .First function such as
.First <- {
options(na.action = "na.delete")
# gives more info than na.omit
library(Hmisc)
library(Design)
invisible()
}
latex.Design
transcan for
imputation
latex anova or summary table Function Function.transcan hazard rcspline.restate psm) Survival psm,cph) Quantile psm, cph) nomogram latex, plot survest survfit psm, cph) survplot validate calibrate vif naresid naprint pentrace effective.df rm.impute transcan, fit.mult.impute
The following programs demonstrate how the pieces of
the Design package work together. A (usually)
one-time call to the function datadist requires a
pass at the entire data frame to store distribution
summaries for potential predictor variables. These
summaries contain (by default) the .25 and .75
quantiles of continuous variables (for estimating
effects such as odds ratios), the 10th smallest and
10th largest values (or .1 and .9 quantiles for small
n) for plotting ranges for estimated curves, and the
total range. For discrete numeric variables (those
having <=10 unique values), the list of unique values
is also stored. Such summaries are used by the
summary.Design, plot.Design, and nomogram.Design
functions. You may save time and defer running
datadist. In that case, the distribution summary
is not stored with the fit object, but it can be
gathered before running summary or plot.
d <- datadist(my.data.frame) # or datadist(x1,x2)
options(datadist="d") # omit this or use options(datadist=NULL)
# if not run datadist yet
cf <- ols(y ~ x1 * x2)
anova(f)
fastbw(f)
predict(f, newdata)
In the Examples section there are three detailed examples using a
fitting function
designed to be used with Design, lrm (logistic
regression model). In Detailed Example 1 we
create 3 predictor variables and a two binary response
on 500 subjects. For the first binary response, dz,
the true model involves only sex and age, and there is
a nonlinear interaction between the two because the log
odds is a truncated linear relationship in age for
females and a quadratic function for males. For the
second binary outcome, dz.bp, the true population model
also involves systolic blood pressure (sys.bp) through
a truncated linear relationship. First, nonparametric
estimation of relationships is done using the Hmisc
library's plsmo function which uses lowess with outlier
detection turned off for binary responses. Then
parametric modeling is done using restricted cubic
splines. This modeling does not assume that we know
the true transformations for age or sys.bp but that
these transformations are smooth (which is not actually
the case in the population).
For Detailed Example 2, suppose that a
categorical variable treat has values "a", "b", and
"c", an ordinal variable num.diseases has values
0,1,2,3,4, and that there are two continuous variables,
age and cholesterol. age is fitted with a restricted
cubic spline, while cholesterol is transformed using
the transformation log(cholesterol - 10). Cholesterol
is missing on three subjects, and we impute these using
the overall median cholesterol. We wish to allow for
interaction between treat and cholesterol. The
following S program will fit a logistic model,
test all effects in the design, estimate effects, and
plot estimated transformations. The fit for
num.diseases really considers the variable to be a
5-level categorical variable. The only difference is
that a 3 d.f. test of linearity is done to assess
whether the variable can be re-modeled "asis". Here
we also show