Provide an example of how simple linear regression could be used within your potential field of study for your dissertation.

Please make sure that it is your own work and not copied and pasted. Please read the study guide, and watch out for spelling and grammar errors. Please use APA 7th edition.

Book Reference: Fox, J. (2017). Using the R Commander: A point-and-click interface for R. CRC Press.

Provide an example of how simple linear regression could be used within your potential field of study for your dissertation. Please make sure you address the purpose of regression and the type of results you would obtain. Also please discuss the assumptions that need to be met to use this type of analysis. Your EOSA modules discuss this. Clearly identify the variables you are considering.

7.1 Linear Regression Models

As mentioned, linear least-squares regression is typically taken up in a basic statistics course. The normal linear regression model is written

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i \qquad (7.1)$$

where $y_i$ is the value of the response variable for the $i$th of $n$ independently sampled observations; $x_{1i}, x_{2i}, \ldots, x_{ki}$ are the values of $k$ explanatory variables; and the errors $\varepsilon_i$ are normally and independently distributed with 0 means and constant variance, $\varepsilon_i \sim \mathrm{NID}(0, \sigma_\varepsilon^2)$. Both y and the xs are numeric variables, and the model assumes that the average value E(y) of y is a linear function (that is, a simple weighted sum) of the xs. If there is just one x (i.e., if k = 1), then Equation 7.1 is called the linear simple regression model; if there is more than one x (k ≥ 2), then it is called the linear multiple regression model.

The normal linear model is optimally estimated by the method of least squares, producing the fitted model

$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} + e_i = \hat{y}_i + e_i$$

where $\hat{y}_i$ is the fitted value and $e_i$ the residual for observation $i$. The least-squares criterion selects the values of the bs that minimize the sum of squared residuals, $\sum e_i^2$. The least-squares regression coefficients are easily computed, and, in addition to having desirable statistical properties under the model (such as efficiency and unbiasedness), statistical inference based on the least-squares estimates is very simple (see, e.g., the references given at the beginning of the chapter).

FIGURE 7.1: The Linear Regression dialog for Duncan’s occupational prestige data.
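The least-squares criterion can be checked numerically for the simple-regression case (k = 1). The following is a minimal sketch on simulated data; the sample size, true coefficients, and variable names are assumptions made for illustration, not values from the text.

```r
# Simulated data for a simple regression (k = 1); all values are illustrative.
set.seed(123)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # errors drawn as NID(0, 1)

# Closed-form least-squares estimates for a single explanatory variable
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# The same estimates computed by lm()
fit <- lm(y ~ x)
coef(fit)
c(b0, b1)

# Residuals e_i = y_i - yhat_i; least squares minimizes sum(e^2)
e <- y - (b0 + b1 * x)
sum(e^2)
sum(residuals(fit)^2)
```

The two sets of coefficients and the two residual sums of squares should agree up to rounding error, which is one way to confirm that lm() is applying the criterion described above.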

The simplest way to fit a linear regression model in the R Commander is by the Linear Regression dialog. To illustrate, I’ll use Duncan’s occupational prestige data (introduced in Chapter 4). Duncan’s data set resides in the car package, and so I can read the data into the R Commander via Data > Data in packages > Read data from an attached package (see Section 4.2.4). Then selecting Statistics > Fit models > Linear regression produces the dialog in Figure 7.1. To complete the dialog, I click on prestige in the Response variable list, and Ctrl-click on education and income in the Explanatory variables list. Finally, pressing the OK button produces the output shown in Figure 7.2.

The commands generated by the Linear Regression dialog use the lm (linear model) function in R to fit the model, creating RegModel.1, and then summarize the model to produce printed output. The summary output includes information about the distribution of the residuals; coefficient estimates, their standard errors, t statistics for testing the null hypothesis that each population regression coefficient is 0, and the two-sided p-values for these tests; the standard deviation of the residuals (“residual standard error”) and residual degrees of freedom; the squared multiple correlation, R², for the model and R² adjusted for degrees of freedom; and the omnibus F test for the hypothesis that all population slope coefficients (here the coefficients of education and income) are 0 (H₀: β₁ = β₂ = 0, for the example).
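Outside the R Commander, the same fit can be sketched directly at the R prompt. The snippet below assumes that the carData package is installed; in recent versions of car, the Duncan data frame is supplied by carData, which is installed alongside car. The exact commands that the dialog writes may differ in formatting.

```r
# Assumption: the carData package is installed and provides the Duncan data frame.
data("Duncan", package = "carData")

# Regress prestige on education and income, using the model name from the text
RegModel.1 <- lm(prestige ~ education + income, data = Duncan)

# Residual summary, coefficient table, residual standard error,
# R-squared, adjusted R-squared, and the omnibus F test
summary(RegModel.1)
```

Because RegModel.1 is an ordinary lm object, functions such as coef(), residuals(), and confint() can be applied to it afterwards.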

This is more or less standard least-squares regression output, similar to printed output produced by almost all statistical packages. What is unusual is that in addition to the printout in Figure 7.2, the R Commander creates and retains a linear model object on which I can perform further computations, as illustrated later in this chapter.

The Model button in the R Commander toolbar now reads RegModel.1, rather than <No active model>, as it did at the beginning of the session. Just as you can choose among data sets residing in memory (if there is more than one) by pressing the Data set button in the toolbar, you can similarly choose among statistical models (if there is more than one) by pressing the Model button. Equivalently, you can pick Models > Select active model from the R Commander menus. Moreover, the R Commander takes care of coordinating data sets and models, by associating each statistical model with the data set to which it is fit. Consequently, selecting a statistical model makes the data set to which it was fit the active data set, if that isn’t already the case.


FIGURE 7.2: Output from Duncan’s regression of occupational prestige on income and education, produced by the Linear Regression dialog.

The variable lists in the Linear Regression dialog in Figure 7.1 include only numeric variables. For example, the factor type (type of occupation) in Duncan’s data set, with levels “bc” (blue-collar), “wc” (white-collar), and “prof” (professional, technical, or managerial), doesn’t appear in either variable list. Moreover, the explanatory variables that are selected enter the model linearly and additively. The Linear Model dialog, described in the next section, is capable of fitting a much wider variety of regression models.

In completing the Linear Regression dialog in Figure 7.1, I left the name of the model at its default, RegModel.1. The R Commander generates unique model names automatically during a session, each time incrementing the model number (here 1).

I also left the Subset expression at its default, <all valid cases>. Had I instead entered type == "bc", for example, the regression model would have been fit only to blue-collar occupations. As in this example, the subset expression can be a logical expression, returning the value TRUE or FALSE for each case (see Section 4.4.2), a vector of case indices to include, or a negative vector of case indices to exclude. For example, 1:25 would include the first 25 occupations, while -c(6, 16) would exclude occupations 6 and 16. All of the statistical modeling dialogs in the R Commander allow subsets of cases to be specified in this manner.
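The three kinds of subset expression correspond to the subset argument of lm(). The sketch below uses a made-up data frame, occ, as a stand-in for an occupational data set; its values and group sizes are assumptions for illustration only.

```r
# Hypothetical stand-in data: 45 cases and three occupation types
set.seed(1)
occ <- data.frame(
  education = runif(45, 5, 100),
  income    = runif(45, 5, 100),
  type      = factor(rep(c("bc", "wc", "prof"), times = c(21, 6, 18)))
)
occ$prestige <- 0.5 * occ$education + 0.6 * occ$income + rnorm(45, sd = 10)

# Logical expression: fit only to the blue-collar cases
m.bc <- lm(prestige ~ education + income, data = occ, subset = type == "bc")

# Vector of case indices: the first 25 cases
m.first25 <- lm(prestige ~ education + income, data = occ, subset = 1:25)

# Negative indices: every case except 6 and 16
m.drop <- lm(prestige ~ education + income, data = occ, subset = -c(6, 16))

# Number of cases actually used by each fit
sapply(list(bc = m.bc, first25 = m.first25, drop = m.drop), nobs)
```

Checking nobs() after fitting is a quick way to confirm that a subset expression selected the intended cases.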

7.2 Linear Models with Factors*

Like the Linear Regression dialog described in the preceding section, the Linear Model dialog can fit additive linear regression models, but it is much more flexible: The Linear Model dialog accommodates transformations of the response and explanatory variables, factors as well as numeric explanatory variables on the right-hand side of the regression model, nonlinear functions of explanatory variables expressed as polynomials and regression splines, and interactions among explanatory variables. All this is accomplished by allowing the user to specify the model as an R linear-model formula. Linear-model formulas in R are inherited from the S programming language (Chambers and Hastie, 1992), and are a version of notation for expressing linear models originally introduced by Wilkinson and Rogers (1973).

7.2.1 Linear-Model Formulas

An R linear-model formula is of the general form response-variable ~ linear-predictor. The tilde (~) in a linear-model formula can be read as “is regressed on.” Thus, in this general form, the response variable is regressed on a linear predictor comprising the terms in the right-hand side of the model.

The left-hand side of the model formula, response-variable, is an R expression that evaluates to the numeric response variable in the model, and is usually simply the name of the response variable (for example, prestige in Duncan’s regression). You can, however, transform the response variable directly in the model formula (e.g., log10(income)) or compute the response as a more complex arithmetic expression (e.g., log(investment.income + hourly.wage.rate*hours.worked)).

The formulation of the linear predictor on the right-hand side of a model formula is more complex. What are normally arithmetic operators (+, -, *, /, and ^) in R expressions have special meanings in a model formula, as do the operators : (colon) and %in%. The numeral 1 (one) may be used to represent the regression constant (i.e., the intercept) in a model formula; this is usually unnecessary, however, because an intercept is included by default. A period (.) represents all of the variables in the data set with the exception of the response. Parentheses may be used for grouping, much as in an arithmetic expression.

In the large majority of cases, you’ll be able to formulate a model using only the operators + (interpreted as “and”) and * (interpreted as “crossed with”), and so I’ll emphasize these operators here. The meanings of these and the other model-formula operators are summarized and illustrated in Table 7.1. Especially on first reading, feel free to ignore everything in the table except +, :, and * (and : is rarely used directly).

A final formula subtlety: As I’ve explained, the arithmetic operators take on special meanings on the right-hand side of a linear-model formula. A consequence is that you can’t use these operators directly for arithmetic. For example, fitting the model savings ~ wages + interest + dividends estimates a separate regression coefficient for each of wages, interest, and dividends. Suppose, however, that you want to estimate a single coefficient for the sum of these variables (in effect, setting the three coefficients equal to each other). The solution is to “protect” the + operator inside a call to the I (identity or inhibit) function, which simply returns its argument unchanged: savings ~ I(wages + interest + dividends). This formula works as desired because arithmetic operators like + have their usual meaning within a function call on the right-hand side of the formula, which implies, incidentally, that savings ~ log10(wages + interest + dividends) also works as intended, estimating a single coefficient for the log base 10 of the sum of wages, interest, and dividends.
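The effect of protecting arithmetic with I() shows up in the number of coefficients that are estimated. This is a small sketch on simulated data; the data frame d and its values are invented for illustration, and only the variable names come from the example in the text.

```r
# Made-up data using the variable names from the example in the text
set.seed(2)
d <- data.frame(
  wages     = runif(40, 20, 80),
  interest  = runif(40, 0, 10),
  dividends = runif(40, 0, 5)
)
d$savings <- 0.1 * (d$wages + d$interest + d$dividends) + rnorm(40)

# "+" as a formula operator: a separate coefficient for each variable
m.separate <- lm(savings ~ wages + interest + dividends, data = d)

# "+" protected by I(): one coefficient for the sum of the three variables
m.sum <- lm(savings ~ I(wages + interest + dividends), data = d)

# Arithmetic inside any function call is also ordinary arithmetic
m.log <- lm(savings ~ log10(wages + interest + dividends), data = d)

# Number of estimated coefficients, including the intercept
sapply(list(separate = m.separate, sum = m.sum, log = m.log),
       function(m) length(coef(m)))
```

Fitting the protected formula is equivalent to first computing the total as a new variable and then regressing savings on it, so I() mainly saves a data-management step.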

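TABLE 7.1: Operators and other symbols used on the right-hand side of R linear-model formulas.

| Operator | Meaning | Example | Interpretation |
|---|---|---|---|
| + | and | x1 + x2 | x1 and x2 |
| : | interaction | x1:x2 | interaction of x1 and x2 |
| * | crossing | x1*x2 | x1 crossed with x2 (i.e., x1 + x2 + x1:x2) |
| - | remove | x1 - 1 | regression through the origin (for numeric x1) |
| ^k | cross to order k | (x1 + x2 + x3)^2 | same as x1*x2 + x1*x3 + x2*x3 |
| %in% | nesting | province %in% country | province nested in country |
| / | nesting | country/province | same as country + province %in% country |

| Symbol | Meaning | Example | Interpretation |
|---|---|---|---|
| 1 | intercept | x1 - 1 | suppress the intercept |
| . | everything but the response | y ~ . | regress y on everything else |
| ( ) | grouping | x1*(x2 + x3) | same as x1*x2 + x1*x3 |

The symbols x1, x2, and x3 represent explanatory variables and could be either numeric or factors.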
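One way to see how R interprets the operators in Table 7.1 is to ask terms() for the expanded list of terms. The helper expand() below is a hypothetical convenience function written for this sketch, not part of R or the R Commander.

```r
# terms() expands a formula symbolically, so no data are needed here.
# expand() is a hypothetical helper that returns the expanded term labels.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ x1 * x2)             # crossing: main effects plus interaction
expand(y ~ (x1 + x2 + x3)^2)    # main effects and all two-way interactions
expand(y ~ x1 * (x2 + x3))      # grouping with parentheses
expand(y ~ country / province)  # nesting

# "- 1" removes the intercept: the "intercept" attribute changes from 1 to 0
attr(terms(y ~ x1), "intercept")
attr(terms(y ~ x1 - 1), "intercept")
```

Comparing the output for two formulas is a simple check on whether they specify the same set of terms before any model is fit.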

7.2.2 The Principle of Marginality

Introduced by Nelder (1977), the principle of marginality is a rule for formulating and interpreting linear (and similar) statistical models. According to the principle of marginality, if an interaction, say x1:x2, is included in a linear model, then so should the main effects, x1 and x2, that are marginal to (that is, lower-order relatives of) the interaction. Similarly, the lower-order interactions x1:x2, x1:x3, and x2:x3 are marginal to the three-way interaction x1:x2:x3. The regression constant (1 in an R model formula) is marginal to every other term in the model.

It is in most circumstances difficult in R to formulate models that violate the principle of marginality, and trying to do so can produce unintended results. For example, although it may appear that the model y ~ f*x - x - 1, where f is a factor and x is a numeric explanatory variable, violates the principle of marginality by removing the regression constant and x slope, the model that R actually fits includes a separate intercept and slope for each level of the factor f. Thus, the model y ~ f*x - x - 1 is equivalent to (i.e., an alternative parametrization of) y ~ f*x. It is almost always best to stay away from such unusual model formulas.
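The equivalence of the two parametrizations can be verified by comparing fitted values. The sketch below uses simulated data with a three-level factor; the data frame and its values are assumptions for illustration.

```r
# Simulated data: a three-level factor f and a numeric explanatory variable x
set.seed(3)
d <- data.frame(f = factor(rep(c("a", "b", "c"), each = 10)), x = runif(30))
d$y <- as.numeric(d$f) + 2 * d$x + rnorm(30, sd = 0.5)

m.usual <- lm(y ~ f * x, data = d)          # conventional parametrization
m.odd   <- lm(y ~ f * x - x - 1, data = d)  # apparently violates marginality

# m.odd has one intercept and one slope for each level of f
coef(m.odd)

# Both models produce the same fitted values
all.equal(fitted(m.usual), fitted(m.odd))
```

Both fits use six coefficients; they differ only in whether the coefficients are expressed as per-level values or as differences from a baseline level.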

7.2.3 Examples Using the Canadian Occupational Prestige Data

For concreteness, I’ll formulate several linear models for the Canadian occupational prestige data (introduced in Section 4.2.3 and described in Table 4.2 on page 61), regressing prestige on income, education, women (gender composition), and type (type of occupation). The last variable is a factor (categorical variable) and so it cannot enter into the linear model directly. When a factor is included in a linear-model formula, R generates contrasts to represent the factor, one fewer than the number of levels of the factor. I’ll explain how this works in greater detail in Section 7.2.4, but the default in the R Commander (and R more generally) is to use 0/1 dummy-variable regressors, also called indicator variables.

A version of the Canadian occupational prestige data resides in the data frame Prestige in the car package, and it’s convenient to read the data into the R Commander from this source via Data > Data in packages > Read data from an attached package. Prestige replaces Duncan as the active data set.

Recall that 4 of the 102 occupations in the Prestige data set have missing values (NA) for occupational type. Because I will fit several regression models to the Prestige data, not all of which include type, I begin by filtering the data set for missing values, selecting Data > Active data set > Remove cases with missing data (as described in Section 4.5.2).

Moreover, the default alphabetical ordering of the levels of type (“bc”, “prof”, “wc”) is not the natural ordering, and so I also reorder the levels of this factor via Data > Manage variables in active data set > Reorder factor levels to “bc”, “wc”, “prof” (see Section 3.4). This last step isn’t strictly necessary, but it makes the data analysis easier to follow.
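The two data-management steps just described have simple base-R counterparts. The sketch below uses a tiny made-up data frame, jobs, rather than the real Prestige data; the commands that the R Commander menus generate may be written differently.

```r
# Hypothetical miniature data set; the values are invented for illustration.
jobs <- data.frame(
  prestige = c(68.8, 35.2, 57.6, 23.2, 41.1, 72.6),
  type     = factor(c("prof", "bc", "wc", NA, "bc", "prof"))
)
levels(jobs$type)     # alphabetical by default: "bc" "prof" "wc"

# Remove cases with missing data
jobs <- na.omit(jobs)

# Reorder the factor levels so that "bc" comes first and "prof" last
jobs$type <- factor(jobs$type, levels = c("bc", "wc", "prof"))
levels(jobs$type)
nrow(jobs)
```

Because the first level of a factor becomes the baseline in dummy-variable coding, the reordering also determines which group the other levels are compared with.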

I first fit an additive dummy regression to the Canadian prestige data, employing the model formula prestige ∼ income + education + women + type. To do so, I select Statistics > Fit models > Linear model from the R Commander menus, producing the dialog box in Figure 7.3. The automatically supplied model name is LinearModel.2, reflecting the fact that I have already fit a statistical model in the session, RegModel.1 (in Section 7.1).

Most of the structure of the Linear Model dialog is common to statistical modeling dialogs in the R Commander. If the response text box to the left of the ~ in the model formula is empty, double-clicking on a variable name in the variable list box enters the name into the response box; thereafter, double-clicking on variable names enters the names into the right-hand side of the model formula, separated by +s (if no operator appears at the end of the partially completed formula). You can enter parentheses and operators like + and * into the formula using the toolbar in the dialog box. You can also type directly into the model-formula text boxes. In Figure 7.3, I simply double-clicked successively on prestige, education, income, women, and type. Clicking OK produces the output shown in Figure 7.4.

I already explained the general format of linear-model summary output in R. What’s new in Figure 7.4 is the way in which the factor type is handled in the linear model: Two dummy-variable regressors are automatically created for the three-level factor type. The first dummy regressor, labelled type[T.wc] in the output, is coded 1 when type is “wc” and 0 otherwise; the second dummy regressor, type[T.prof], is coded 1 when type is “prof” and 0 otherwise. The first level of type, “bc”, is therefore selected as the reference or baseline level, coded 0 for both dummy regressors.

Consequently, the intercept in the linear-model output is the intercept for the “bc” reference level of type, and the coefficients for the other levels give differences in the intercepts between each of these levels and the reference level. Because the slope coefficients for the numeric explanatory variables education, income, and women in this additive model do not vary by levels of type, the dummy-variable coefficients are also interpretable as the average difference between each other level and “bc” for any fixed values of education, income, and women.
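The dummy-variable coding can be inspected directly with model.matrix(). This sketch uses a short made-up factor and base R's default contr.treatment, which names the regressors typewc and typeprof; the R Commander labels the same regressors type[T.wc] and type[T.prof].

```r
# A short made-up factor with "bc" as the first (baseline) level
type <- factor(c("bc", "wc", "prof", "bc", "prof"),
               levels = c("bc", "wc", "prof"))

# Columns: the constant regressor plus one 0/1 dummy for each non-baseline level
X <- model.matrix(~ type)
X
```

Rows for the baseline level "bc" have 0 in both dummy columns, which is why the intercept in the fitted model pertains to that level.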


FIGURE 7.3: Linear Model dialog completed to fit an additive dummy-variable regression of prestige on the numeric explanatory variables education, income, and women, and the factor type.

To illustrate a structurally more complex, nonadditive model, I respecify the Canadian occupational prestige regression model to include interactions between type and education and between type and income, in the process removing women from the model: in the initial regression, the coefficient of women is small with a large p-value. The Linear Model dialog (not shown) reopens in its previous state, with the model name incremented to LinearModel.3. To fit the new model, I modify the formula to read prestige ~ type*education + type*income. Clicking OK produces the output in Figure 7.5.

With interactions in the model, there are different intercepts and slopes for each level of type. The intercept in the output, along with the coefficients for education and income, pertains to the baseline level “bc” of type. Other coefficients represent differences between each of the other levels and the baseline level. For example, type[T.wc] = -33.54 is the difference in intercepts between the “wc” and “bc” levels of type; similarly, the interaction coefficient type[T.wc]:education = 4.291 is the difference in education slopes between the “wc” and “bc” levels. The complexity of the coefficients makes it difficult to understand what the model says about the data; Section 7.6 shows how to visualize terms such as interactions in a complex linear model.
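The interpretation of interaction coefficients as differences from the baseline can be illustrated by recovering a level-specific slope. The sketch below uses simulated data with the same variable names as the example; the values are invented, and base R labels the coefficients typewc and typewc:education rather than type[T.wc] and type[T.wc]:education.

```r
# Simulated data with the variable names used in the text; values are made up.
set.seed(4)
n <- 90
d <- data.frame(
  type      = factor(rep(c("bc", "wc", "prof"), each = 30),
                     levels = c("bc", "wc", "prof")),
  education = runif(n, 6, 16),
  income    = runif(n, 1, 25)
)
d$prestige <- 10 + 3 * d$education + 0.8 * d$income + rnorm(n, sd = 5)

m.int <- lm(prestige ~ type * education + type * income, data = d)
b <- coef(m.int)

# Education slope for "wc" = baseline ("bc") slope + the "wc" difference
slope.wc <- b["education"] + b["typewc:education"]

# The same slope from a regression fit to the "wc" cases alone
m.wc <- lm(prestige ~ education + income, data = d, subset = type == "wc")
c(combined = unname(slope.wc), separate = unname(coef(m.wc)["education"]))
```

The two slopes agree because this model lets the intercept and both slopes vary freely across the levels of type, so it reproduces the coefficients of separate regressions within each level.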


FIGURE 7.4: Output for the linear model prestige ∼ income + education + women + type fit to the Prestige data.


FIGURE 7.5: Output for the linear model prestige ∼ type*education + type*income fit to the Prestige data.

TABLE 7.2: Contrast-regressor codings for type generated by contr.Treatment, contr.Sum, contr.poly, and contr.Helmert.

[Table body not recovered; its headings were “Levels of type” and “Contrast Names”.]
7.2.4 Dummy Variables and Other Contrasts for Factors

By default in the R Commander, factors in linear-model formulas are represented by 0/1 dummy-variable regressors generated by the contr.Treatment function in the car package, picking the first level of a factor as the baseline level. This contrast coding, along with some other choices, is shown in Table 7.2, using the factor type in the Prestige data set as an example.

The function contr.Sum from the car package generates so-called “sigma-constrained” or “sum-to-zero” contrast regressors, as are used in traditional treatments of analysis of variance. The standard R function contr.poly generates orthogonal-polynomial contrasts (in this case, linear and quadratic terms for the three levels of type); in the R Commander, contr.poly is the default choice for ordered factors. Finally, contr.Helmert generates Helmert contrasts, which compare each level to the average of those preceding it.
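The codings for a three-level factor can be generated with the base-R functions contr.treatment, contr.sum, contr.poly, and contr.helmert. This sketch assumes that the car functions contr.Treatment, contr.Sum, and contr.Helmert produce the same numeric codings as their lowercase base-R counterparts and differ in how the contrast columns are named.

```r
lev <- c("bc", "wc", "prof")   # levels of type, in the reordered sequence

contr.treatment(lev)   # 0/1 dummy regressors; the first level is the baseline
contr.sum(lev)         # sum-to-zero ("sigma-constrained") regressors
contr.poly(3)          # orthogonal-polynomial (linear and quadratic) contrasts
contr.helmert(lev)     # Helmert contrasts
```

In each matrix the rows correspond to the levels of the factor and the columns to the contrast regressors, so a three-level factor always yields two columns.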

Selecting Data > Manage variables in active data set > Define contrasts for a factor produces the dialog box on the left of Figure 7.6. The factor type is preselected in this dialog because it’s the only factor in the data set. You can use the radio butt
