11 May: Provide an example of how simple linear regression could be used within your potential field of study for your dissertation.
Please make sure that it is your own work and not copied and pasted. Please read the study guide and watch out for spelling and grammar errors. Please use APA 7th edition.
Book Reference: Fox, J. (2017). Using the R Commander: A point-and-click interface for R. CRC Press. https://online.vitalsource.com/#/books/9781498741934
Provide an example of how simple linear regression could be used within your potential field of study for your dissertation. Please make sure you address the purpose of regression and the type of results you would obtain. Also please discuss the assumptions that need to be met to use this type of analysis. Your EOSA modules discuss this. Clearly identify the variables you are considering.
7.1 Linear Regression Models
As mentioned, linear least-squares regression is typically taken up in a basic statistics course. The normal linear regression model is written
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i = E(y_i) + \varepsilon_i \qquad (7.1)$$
where $y_i$ is the value of the response variable for the $i$th of $n$ independently sampled observations; $x_{1i}, x_{2i}, \ldots, x_{ki}$ are the values of $k$ explanatory variables; and the errors $\varepsilon_i$ are normally and independently distributed with 0 means and constant variance, $\varepsilon_i \sim \text{NID}(0, \sigma_\varepsilon^2)$. Both $y$ and the $x$s are numeric variables, and the model assumes that the average value $E(y)$ of $y$ is a linear function (that is, a simple weighted sum) of the $x$s. If there is just one $x$ (i.e., if $k = 1$), then Equation 7.1 is called the linear simple-regression model; if there is more than one $x$ ($k \ge 2$), then it is called the linear multiple-regression model.
The normal linear model is optimally estimated by the method of least squares, producing the fitted model
$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} + e_i = \hat{y}_i + e_i$$
where $\hat{y}_i$ is the fitted value and $e_i$ the residual for observation $i$. The least-squares criterion selects the values of the $b$s that minimize the sum of squared residuals, $\sum e_i^2$. The least-squares regression coefficients are easily computed, and, in addition to having desirable statistical properties under the model (such as efficiency and unbiasedness), statistical inference based on the least-squares estimates is very simple (see, e.g., the references given at the beginning of the chapter).
FIGURE 7.1: The Linear Regression dialog for Duncan’s occupational prestige data.
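The least-squares criterion can be checked numerically. The following is a minimal sketch on simulated data (the variable names, sample size, and true coefficients are illustrative assumptions, not taken from the text): it solves the normal equations directly and compares the result with R's lm function.

```r
# Simulated data; names and true coefficients are illustrative assumptions.
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Least-squares coefficients from the normal equations (X'X) b = X'y
X <- cbind(1, x1, x2)                 # column of 1s carries the intercept b0
b <- solve(t(X) %*% X, t(X) %*% y)

# The same model fit with lm()
fit <- lm(y ~ x1 + x2)

# Residuals and the minimized sum of squared residuals, computed both ways
e_hand   <- as.vector(y - X %*% b)
rss_hand <- sum(e_hand^2)
rss_lm   <- sum(residuals(fit)^2)

print(cbind(by_hand = as.vector(b), lm = unname(coef(fit))))
print(c(rss_hand = rss_hand, rss_lm = rss_lm))
```

In practice lm uses a QR decomposition rather than the normal equations, which is numerically safer; the direct solution is shown here only because it mirrors the criterion most transparently.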
The simplest way to fit a linear regression model in the R Commander is by the Linear Regression dialog. To illustrate, I’ll use Duncan’s occupational prestige data (introduced in Chapter 4). Duncan’s data set resides in the car package, and so I can read the data into the R Commander via Data > Data in packages > Read data from an attached package (see Section 4.2.4). Then selecting Statistics > Fit models > Linear regression produces the dialog in Figure 7.1. To complete the dialog, I click on prestige in the Response variable list, and Ctrl-click on education and income in the Explanatory variables list. Finally, pressing the OK button produces the output shown in Figure 7.2.
The commands generated by the Linear Regression dialog use the lm (linear model) function in R to fit the model, creating RegModel.1, and then summarize the model to produce printed output. The summary output includes information about the distribution of the residuals; coefficient estimates, their standard errors, t statistics for testing the null hypothesis that each population regression coefficient is 0, and the two-sided p-values for these tests; the standard deviation of the residuals (“residual standard error”) and residual degrees of freedom; the squared multiple correlation, $R^2$, for the model and $R^2$ adjusted for degrees of freedom; and the omnibus F test for the hypothesis that all population slope coefficients (here the coefficients of education and income) are 0 ($H_0$: $\beta_1 = \beta_2 = 0$, for the example).
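For readers working at the R console rather than through the menus, the commands are roughly of the following form. This sketch assumes that the carData package is installed; in recent releases of car, the Duncan data set is supplied by that companion package.

```r
# Assumption: the carData package (companion to car) is installed.
data("Duncan", package = "carData")

# Fit the regression of prestige on education and income, then summarize it
RegModel.1 <- lm(prestige ~ education + income, data = Duncan)
summary(RegModel.1)

# The fitted model object can be reused for further computations
coef(RegModel.1)          # coefficient estimates
df.residual(RegModel.1)   # residual degrees of freedom
```

Because RegModel.1 is an ordinary R object, functions such as coef, residuals, and confint can be applied to it after the fit.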
This is more or less standard least-squares regression output, similar to printed output produced by almost all statistical packages. What is unusual is that in addition to the printout in Figure 7.2, the R Commander creates and retains a linear model object on which I can perform further computations, as illustrated later in this chapter.
The Model button in the R Commander toolbar now reads RegModel.1, rather than <No active model>, as it did at the beginning of the session. Just as you can choose among data sets residing in memory (if there is more than one) by pressing the Data set button in the toolbar, you can similarly choose among statistical models (if there is more than one) by pressing the Model button. Equivalently, you can pick Models > Select active model from the R Commander menus. Moreover, the R Commander takes care of coordinating data sets and models, by associating each statistical model with the data set to which it is fit. Consequently, selecting a statistical model makes the data set to which it was fit the active data set, if that isn’t already the case.
FIGURE 7.2: Output from Duncan’s regression of occupational prestige on income and education, produced by the Linear Regression dialog.
The variable lists in the Linear Regression dialog in Figure 7.1 include only numeric variables. For example, the factor type (type of occupation) in Duncan’s data set, with levels “bc” (blue-collar), “wc” (white-collar), and “prof” (professional, technical, or managerial), doesn’t appear in either variable list. Moreover, the explanatory variables that are selected enter the model linearly and additively. The Linear Model dialog, described in the next section, is capable of fitting a much wider variety of regression models.
In completing the Linear Regression dialog in Figure 7.1, I left the name of the model at its default, RegModel.1. The R Commander generates unique model names automatically during a session, each time incrementing the model number (here 1).
I also left the Subset expression at its default, <all valid cases>. Had I instead entered type == "bc", for example, the regression model would have been fit only to blue-collar occupations. As in this example, the subset expression can be a logical expression, returning the value TRUE or FALSE for each case (see Section 4.4.2), a vector of case indices to include, or a negative vector of case indices to exclude. For example, 1:25 would include the first 25 occupations, while -c(6, 16) would exclude occupations 6 and 16. All of the statistical modeling dialogs in the R Commander allow subsets of cases to be specified in this manner.
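The three kinds of subset expression correspond to the subset argument of lm. The sketch below uses a small simulated data frame (occ, with made-up values standing in for an occupational data set of 45 cases) so that it runs without any add-on packages.

```r
# Simulated stand-in data; the values are made up for illustration only.
set.seed(1)
occ <- data.frame(
  type      = factor(rep(c("bc", "wc", "prof"), times = c(20, 10, 15))),
  education = runif(45, 5, 100),
  income    = runif(45, 5, 100)
)
occ$prestige <- 5 + 0.5 * occ$education + 0.6 * occ$income + rnorm(45, sd = 10)

# 1. Logical expression: keep only blue-collar occupations
m_bc <- lm(prestige ~ education + income, data = occ, subset = type == "bc")

# 2. Vector of case indices to include: the first 25 cases
m_first25 <- lm(prestige ~ education + income, data = occ, subset = 1:25)

# 3. Negative vector of case indices to exclude: drop cases 6 and 16
m_drop <- lm(prestige ~ education + income, data = occ, subset = -c(6, 16))

# Number of cases actually used in each fit
print(c(bc = nobs(m_bc), first25 = nobs(m_first25), dropped = nobs(m_drop)))
```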
7.2 Linear Models with Factors*
Like the Linear Regression dialog described in the preceding section, the Linear Model dialog can fit additive linear regression models, but it is much more flexible: The Linear Model dialog accommodates transformations of the response and explanatory variables, factors as well as numeric explanatory variables on the right-hand side of the regression model, nonlinear functions of explanatory variables expressed as polynomials and regression splines, and interactions among explanatory variables. All this is accomplished by allowing the user to specify the model as an R linear-model formula. Linear-model formulas in R are inherited from the S programming language (Chambers and Hastie, 1992), and are a version of notation for expressing linear models originally introduced by Wilkinson and Rogers (1973).
An R linear-model formula is of the general form response-variable ~ linear-predictor. The tilde (~) in a linear-model formula can be read as “is regressed on.” Thus, in this general form, the response variable is regressed on a linear predictor comprising the terms on the right-hand side of the model.
The left-hand side of the model formula, response-variable, is an R expression that evaluates to the numeric response variable in the model, and is usually simply the name of the response variable (for example, prestige in Duncan’s regression). You can, however, transform the response variable directly in the model formula (e.g., log10(income)) or compute the response as a more complex arithmetic expression (e.g., log(investment.income + hourly.wage.rate*hours.worked)).
The formulation of the linear predictor on the right-hand side of a model formula is more complex. What are normally arithmetic operators (+, -, *, /, and ^) in R expressions have special meanings in a model formula, as do the operators : (colon) and %in%. The numeral 1 (one) may be used to represent the regression constant (i.e., the intercept) in a model formula; this is usually unnecessary, however, because an intercept is included by default. A period (.) represents all of the variables in the data set with the exception of the response. Parentheses may be used for grouping, much as in an arithmetic expression.
In the large majority of cases, you’ll be able to formulate a model using only the operators + (interpreted as “and”) and * (interpreted as “crossed with”), and so I’ll emphasize these operators here. The meanings of these and the other model-formula operators are summarized and illustrated in Table 7.1. Especially on first reading, feel free to ignore everything in the table except +, :, and * (and : is rarely used directly).
A final formula subtlety: As I’ve explained, the arithmetic operators take on special meanings on the right-hand side of a linear-model formula. A consequence is that you can’t use these operators directly for arithmetic. For example, fitting the model savings ~ wages + interest + dividends estimates a separate regression coefficient for each of wages, interest, and dividends. Suppose, however, that you want to estimate a single coefficient for the sum of these variables, in effect setting the three coefficients equal to each other. The solution is to “protect” the + operator inside a call to the I (identity or inhibit) function, which simply returns its argument unchanged: savings ~ I(wages + interest + dividends). This formula works as desired because arithmetic operators like + have their usual meaning within a function call on the right-hand side of the formula. This implies, incidentally, that savings ~ log10(wages + interest + dividends) also works as intended, estimating a single coefficient for the log base 10 of the sum of wages, interest, and dividends.
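The difference between the unprotected and protected formulas shows up in the number of coefficients estimated. The following sketch uses simulated data (the data frame d and its values are assumptions for illustration; only the variable names follow the text).

```r
# Simulated data; the values are made up, only the variable names follow the text.
set.seed(42)
n <- 200
d <- data.frame(
  wages     = rexp(n, rate = 1 / 40),
  interest  = rexp(n, rate = 1 / 5),
  dividends = rexp(n, rate = 1 / 3)
)
d$savings <- 2 + 0.1 * (d$wages + d$interest + d$dividends) + rnorm(n)

# Unprotected +: a separate coefficient for each explanatory variable
m_sep <- lm(savings ~ wages + interest + dividends, data = d)

# Protected by I(): a single coefficient for the arithmetic sum
m_sum <- lm(savings ~ I(wages + interest + dividends), data = d)

# Inside any function call, + is ordinary arithmetic as well
m_log <- lm(savings ~ log10(wages + interest + dividends), data = d)

print(names(coef(m_sep)))
print(names(coef(m_sum)))
print(names(coef(m_log)))
```

The first model has an intercept and three slopes; the other two each have an intercept and a single slope.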
TABLE 7.1: Operators and other symbols used on the right-hand side of R linear-model formulas.

| Operator | Meaning | Example | Interpretation |
| --- | --- | --- | --- |
| `+` | and | `x1 + x2` | x1 and x2 |
| `:` | interaction | `x1:x2` | interaction of x1 and x2 |
| `*` | crossing | `x1*x2` | x1 crossed with x2 (i.e., `x1 + x2 + x1:x2`) |
| `-` | remove | `x1 - 1` | regression through the origin (for numeric x1) |
| `^k` | cross to order k | `(x1 + x2 + x3)^2` | same as `x1*x2 + x1*x3 + x2*x3` |
| `%in%` | nesting | `province %in% country` | province nested in country |
| `/` | nesting | `country/province` | same as `country + province %in% country` |

| Symbol | Meaning | Example | Interpretation |
| --- | --- | --- | --- |
| `1` | intercept | `x1 - 1` | suppress the intercept |
| `.` | everything but the response | `y ~ .` | regress y on everything else |
| `( )` | grouping | `x1*(x2 + x3)` | same as `x1*x2 + x1*x3` |

The symbols x1, x2, and x3 represent explanatory variables and could be either numeric or factors.
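One way to see how R expands a formula is to inspect its terms object, which needs no data. In this sketch, expand is a hypothetical helper name introduced only for illustration; terms and its "term.labels" and "intercept" attributes are standard R.

```r
# expand() is a hypothetical helper: it lists the terms R derives from a formula.
expand <- function(f) attr(terms(f), "term.labels")

print(expand(y ~ x1 * x2))            # crossing: main effects plus interaction
print(expand(y ~ (x1 + x2 + x3)^2))   # all terms up to two-way interactions
print(expand(y ~ x1 * (x2 + x3)))     # grouping with parentheses
print(expand(y ~ country / province)) # nesting

# The "intercept" attribute is 1 by default and 0 when the intercept is removed
print(attr(terms(y ~ x1), "intercept"))
print(attr(terms(y ~ x1 - 1), "intercept"))
```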
7.2.2 The Principle of Marginality
Introduced by Nelder (1977), the principle of marginality is a rule for formulating and interpreting linear (and similar) statistical models. According to the principle of marginality, if an interaction, say x1:x2, is included in a linear model, then so should the main effects, x1 and x2, that are marginal to (that is, lower-order relatives of) the interaction. Similarly, the lower-order interactions x1:x2, x1:x3, and x2:x3 are marginal to the three-way interaction x1:x2:x3. The regression constant (1 in an R model formula) is marginal to every other term in the model.
It is in most circumstances difficult in R to formulate models that violate the principle of marginality, and trying to do so can produce unintended results. For example, although it may appear that the model y ~ f*x - x - 1, where f is a factor and x is a numeric explanatory variable, violates the principle of marginality by removing the regression constant and x slope, the model that R actually fits includes a separate intercept and slope for each level of the factor f. Thus, the model y ~ f*x - x - 1 is equivalent to (i.e., an alternative parametrization of) y ~ f*x. It is almost always best to stay away from such unusual model formulas.
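The equivalence of the two parametrizations can be verified by comparing fitted values. This is a minimal sketch on simulated data; the factor levels "a", "b", and "c" and the generating coefficients are illustrative assumptions.

```r
# Simulated data with a three-level factor f and a numeric x (illustrative values).
set.seed(7)
n <- 90
f <- factor(rep(c("a", "b", "c"), each = 30))
x <- rnorm(n)
y <- c(1, 3, 5)[f] + c(0.5, 1, 2)[f] * x + rnorm(n, sd = 0.5)

m1 <- lm(y ~ f * x)           # conventional parametrization
m2 <- lm(y ~ f * x - x - 1)   # apparently violates marginality

# Both fits use the same number of coefficients and give the same fitted values
print(c(m1 = length(coef(m1)), m2 = length(coef(m2))))
print(all.equal(fitted(m1), fitted(m2)))

# In m2 the coefficients are level-specific intercepts and slopes
print(coef(m2))
```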
7.2.3 Examples Using the Canadian Occupational Prestige Data
For concreteness, I’ll formulate several linear models for the Canadian occupational prestige data (introduced in Section 4.2.3 and described in Table 4.2 on page 61), regressing prestige on income, education, women (gender composition), and type (type of occupation). The last variable is a factor (categorical variable) and so it cannot enter into the linear model directly. When a factor is included in a linear-model formula, R generates contrasts to represent the factor, one fewer than the number of levels of the factor. I’ll explain how this works in greater detail in Section 7.2.4, but the default in the R Commander (and R more generally) is to use 0/1 dummy-variable regressors, also called indicator variables.
A version of the Canadian occupational prestige data resides in the data frame Prestige in the car package, and it’s convenient to read the data into the R Commander from this source via Data > Data in packages > Read data from an attached package. Prestige replaces Duncan as the active data set.
Recall that 4 of the 102 occupations in the Prestige data set have missing values (NA) for occupational type. Because I will fit several regression models to the Prestige data, not all of which include type, I begin by filtering the data set for missing values, selecting Data > Active data set > Remove cases with missing data (as described in Section 4.5.2).
Moreover, the default alphabetical ordering of the levels of type (“bc”, “prof”, “wc”) is not the natural ordering, and so I also reorder the levels of this factor via Data > Manage variables in active data set > Reorder factor levels to “bc”, “wc”, “prof” (see Section 3.4). This last step isn’t strictly necessary, but it makes the data analysis easier to follow.
I first fit an additive dummy regression to the Canadian prestige data, employing the model formula prestige ∼ income + education + women + type. To do so, I select Statistics > Fit models > Linear model from the R Commander menus, producing the dialog box in Figure 7.3. The automatically supplied model name is LinearModel.2, reflecting the fact that I have already fit a statistical model in the session, RegModel.1 (in Section 7.1).
Most of the structure of the Linear Model dialog is common to statistical modeling dialogs in the R Commander. If the response text box to the left of the ~ in the model formula is empty, double-clicking on a variable name in the variable list box enters the name into the response box; thereafter, double-clicking on variable names enters the names into the right-hand side of the model formula, separated by +s (if no operator appears at the end of the partially completed formula). You can enter parentheses and operators like + and * into the formula using the toolbar in the dialog box. You can also type directly into the model-formula text boxes. In Figure 7.3, I simply double-clicked successively on prestige, education, income, women, and type. Clicking OK produces the output shown in Figure 7.4.
I already explained the general format of linear-model summary output in R. What’s new in Figure 7.4 is the way in which the factor type is handled in the linear model: Two dummy-variable regressors are automatically created for the three-level factor type. The first dummy regressor, labelled type[T.wc] in the output, is coded 1 when type is “wc” and 0 otherwise; the second dummy regressor, type[T.prof], is coded 1 when type is “prof” and 0 otherwise. The first level of type, “bc”, is therefore selected as the reference or baseline level, coded 0 for both dummy regressors.
Consequently, the intercept in the linear-model output is the intercept for the “bc” reference level of type, and the coefficients for the other levels give differences in the intercepts between each of these levels and the reference level. Because the slope coefficients for the numeric explanatory variables education, income, and women in this additive model do not vary by levels of type, the dummy-variable coefficients are also interpretable as the average difference between each other level and “bc” for any fixed values of education, income, and women.
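The dummy-regressor coding and the interpretation of its coefficients can be illustrated with base R alone. The sketch below uses simulated data (the values are assumptions) and base R's default contr.treatment, which labels the regressors typewc and typeprof rather than type[T.wc] and type[T.prof] as in the R Commander output.

```r
# Simulated data; base R labels the dummy regressors typewc and typeprof.
set.seed(11)
type <- factor(rep(c("bc", "wc", "prof"), each = 20),
               levels = c("bc", "wc", "prof"))
education <- runif(60, 6, 16)
prestige  <- 10 + c(0, 5, 15)[type] + 3 * education + rnorm(60, sd = 2)

fit <- lm(prestige ~ education + type)

# The model matrix shows the 0/1 coding: "bc" rows are 0 on both dummies
X <- model.matrix(fit)
print(unique(X[, c("typewc", "typeprof")]))

# At a fixed education level, the predicted gap between "wc" and "bc"
# equals the coefficient of the wc dummy regressor
nd  <- data.frame(education = 12,
                  type = factor(c("bc", "wc"), levels = levels(type)))
gap <- unname(diff(predict(fit, newdata = nd)))
print(c(gap = gap, coefficient = unname(coef(fit)["typewc"])))
```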
FIGURE 7.3: Linear Model dialog completed to fit an additive dummy-variable regression of prestige on the numeric explanatory variables education, income, and women, and the factor type.
To illustrate a structurally more complex, nonadditive model, I respecify the Canadian occupational prestige regression model to include interactions between type and education and between type and income, in the process removing women from the model; in the initial regression, the coefficient of women is small with a large p-value. The Linear Model dialog (not shown) reopens in its previous state, with the model name incremented to LinearModel.3. To fit the new model, I modify the formula to read prestige ~ type*education + type*income. Clicking OK produces the output in Figure 7.5.
With interactions in the model, there are different intercepts and slopes for each level of type. The intercept in the output, along with the coefficients for education and income, pertains to the baseline level “bc” of type. Other coefficients represent differences between each of the other levels and the baseline level. For example, type[T.wc] = -33.54 is the difference in intercepts between the “wc” and “bc” levels of type; similarly, the interaction coefficient type[T.wc]:education = 4.291 is the difference in education slopes between the “wc” and “bc” levels. The complexity of the coefficients makes it difficult to understand what the model says about the data; Section 7.6 shows how to visualize terms such as interactions in a complex linear model.
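To see how level-specific intercepts and slopes are recovered from such coefficients, the following sketch fits an interaction model with a single numeric explanatory variable to simulated data (values assumed; base R coefficient labels such as typewc are used). Adding the baseline coefficients to the “wc” differences reproduces a separate regression fit to the “wc” cases alone.

```r
# Simulated data with level-specific intercepts and slopes (illustrative values).
set.seed(21)
type <- factor(rep(c("bc", "wc", "prof"), each = 30),
               levels = c("bc", "wc", "prof"))
education <- runif(90, 6, 16)
prestige  <- c(5, -20, 10)[type] + c(2, 5, 4)[type] * education +
  rnorm(90, sd = 3)
d <- data.frame(prestige, education, type)

m_int <- lm(prestige ~ type * education, data = d)
b <- coef(m_int)

# Baseline ("bc") coefficients plus the "wc" differences
wc_intercept <- unname(b["(Intercept)"] + b["typewc"])
wc_slope     <- unname(b["education"] + b["typewc:education"])

# Separate regression for the "wc" cases only
m_wc <- lm(prestige ~ education, data = d, subset = type == "wc")

print(rbind(from_interaction_model = c(wc_intercept, wc_slope),
            separate_fit           = unname(coef(m_wc))))
```

The agreement is exact here because every coefficient is allowed to vary by level; it would not hold if some slope were constrained to be common across levels.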
FIGURE 7.4: Output for the linear model prestige ∼ income + education + women + type fit to the Prestige data.
FIGURE 7.5: Output for the linear model prestige ∼ type*education + type*income fit to the Prestige data.
TABLE 7.2: Contrast-regressor codings for type generated by contr.Treatment, contr.Sum, contr.poly, and contr.Helmert.

| Function | Contrast Names | Level “bc” | Level “wc” | Level “prof” |
| --- | --- | --- | --- | --- |
| contr.Treatment | type[T.wc] | 0 | 1 | 0 |
| | type[T.prof] | 0 | 0 | 1 |
| contr.Sum | type[S.wc] | 1 | 0 | −1 |
| | type[S.prof] | 0 | 1 | −1 |
| contr.poly | type.L | −1/√2 | 0 | 1/√2 |
| | type.Q | 1/√6 | −2/√6 | 1/√6 |
| contr.Helmert | type[H.1] | −1 | 1 | 0 |
| | type[H.2] | −1 | −1 | 2 |

7.2.4 Dummy Variables and Other Contrasts for Factors
By default in the R Commander, factors in linear-model formulas are represented by 0/1 dummy-variable regressors generated by the contr.Treatment function in the car package, picking the first level of a factor as the baseline level. This contrast coding, along with some other choices, is shown in Table 7.2, using the factor type in the Prestige data set as an example.
The function contr.Sum from the car package generates so-called “sigma-constrained” or “sum-to-zero” contrast regressors, as are used in traditional treatments of analysis of variance. The standard R function contr.poly generates orthogonal-polynomial contrasts (in this case, linear and quadratic terms for the three levels of type); in the R Commander, contr.poly is the default choice for ordered factors. Finally, contr.Helmert generates Helmert contrasts, which compare each level to the average of those preceding it.
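The codings themselves can be printed from the console. This sketch uses the base R functions contr.treatment, contr.sum, contr.poly, and contr.helmert as stand-ins; the assumption is that the capitalized car versions produce the same numeric codings and differ only in how the columns are labelled.

```r
# Base R contrast functions, used here as stand-ins for the car versions.
lev <- c("bc", "wc", "prof")

ct <- contr.treatment(lev)  # 0/1 dummy regressors, first level as baseline
cs <- contr.sum(lev)        # sum-to-zero contrasts, last level coded -1
cp <- contr.poly(3)         # orthogonal polynomials: linear (.L), quadratic (.Q)
ch <- contr.helmert(lev)    # each level versus the average of those before it

print(ct)
print(cs)
print(round(cp, 4))
print(ch)

# Columns of the sum-to-zero and Helmert codings each sum to zero
print(colSums(cs))
print(colSums(ch))
```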
Selecting Data > Manage variables in active data set > Define contrasts for a factor produces the dialog box on the left of Figure 7.6. The factor type is preselected in this dialog because it’s the only factor in the data set. You can use the radio butt