In this exercise, students will conduct a One-Way ANOVA using SAS. The student will create a summary report of their analysis and create an informative graphic that is included in a business memorandum that provides a recommendation based on their analysis of the dataset.
Our research question is "Is there a difference in employee pay by location for our company?" You need to turn this question into a statistical question that you can answer using this data.
To help you get started, we have provided an additional video presentation and PowerPoint discussing the concept of ANOVA.
25 September 2020
From: Sydney Student
To: Professor
Subj: Enter a subject line (e.g., Analysis of pay by location for Widget Inc.)
· Null hypothesis – H0: µ1 = µ2 = µ3 = µ4 = µ5 = µ6 = µ7 = µ8 = µ9 = µ10
· Alternative hypothesis – HA: Not all group means are equal
2. This paragraph (or paragraphs) describes where the data came from (likely an internal database), the steps you took to explore the data, and whether the data is appropriate for conducting an ANOVA. Where is the data from? How did you filter the data? Examine the assumptions in the instructions, determine the tables and plots that are needed, and state whether each assumption appears to be met. Are the dependent and independent variables appropriate? Is the dependent variable normally distributed for each level of the categorical variable? Are there sufficient observations per level of the categorical variable? Reference the appropriate figures and tables to support your statements. Always provide a summary statistics table and discuss it (missing data and relevant statistics). The last sentence should state whether the data is appropriate for an ANOVA.
3. This paragraph discusses the results. Were the ANOVA results significant (report the F value and the p value from SAS)? Clearly state whether you rejected or failed to reject the null hypothesis. Are the means of any levels (values) of the categorical variable significantly different from each other? Again, use the appropriate tables, charts, and figures to back up your statements.
4. What is your recommendation based on this analysis? Is further analysis needed? In plain language, clearly state what you found (answer the research question) and your recommendation on what to do next. What is the impact of your recommendation for the company?
After the body of the memo, you should have several figures (summary statistics, boxplot of $/hr by location, ANOVA results, Levene’s results) and any other tables you feel are necessary to support your statements. Format the charts, tables, and figures for readability, title and number them appropriately, and refer to each chart, figure, and table in the text. If you do not refer to it in the text, it should not be in your memo. Similarly, if you refer to a figure, chart, or table in your text, ensure it is included in the memo.
DAX3 – ANOVA – Updated Fall 2020 (NOTE YOUR NUMBERS MAY BE SIMILAR BUT NOT EXACT)
A video presentation on ANOVA can be viewed by clicking on the following link: DAX3 ANOVA Video
A video walking you through DAX can be viewed by clicking on the following link: DAX3 Instructions Video
1. When you are ready to work on DAX 3, download the "DAX 3 Report Example" from the DAX 3 Assignment. Open
the example report in Word. Save the file as "LastName DAX 3" (e.g., Bohler DAX3.docx). Replace the paragraphs
marked in red with your own text, written in your own words, and change the font color to black.
2. Log into SAS Studio. You should already have the “IS3310.LaborF2020” dataset uploaded to SAS. You may have
given it a different name, but it is the same dataset you used for DAX1 and DAX2. If you cannot find it, redo DAX1.
3. Click on “Tasks and Utilities” in the navigation area on the left side of the window.
4. Next, click on the icon arrow next to "Tasks" and then icon arrow next to "Statistics."
5. In the menu that opens, select the "Summary Statistics" task by double-clicking it.
6. In the new tab that opens (labeled “Summary Statistics”), click on “DATA” (below the word “Settings”).
7. Using the “Select a Table” icon, locate the dataset you created in DAX 1. (IS3310.LaborF2020).
8. In the “Analysis variables” area, using the “Add Columns” icon (it looks like a big plus sign), select the dependent variable “$/HR.” In the “Classification variables” area, add the categorical variable “Location.” Click on the “ADDITIONAL ROLES” area arrow, and when it opens, click on the plus sign in the “Group analysis by” area and add the categorical variable “Position.” See Figure 1.
Figure 1 – Data Tab
Figure 2 – Options Tab
9. On the OPTIONS tab, under “Basic Statistics” you can deselect “Number of observations” and select “Number of missing values.” Click on “Additional Statistics” and select “Variance,” “Skewness,” and “Kurtosis.” See Figure 2.
10. Now click on the Run icon (it looks like a running person) or hit F3.
11. Remember to check the "LOG" for any "Errors" or "Warnings." If there are no errors or warnings, go to the next
step and review the results.
12. Look at the results of the operation. You can scroll down using the scroll bar on the right-hand side of the Results
window. Which of the charts, tables, and plots are helpful to your understanding of this dataset?
a. Looking at the first table, “Position 1,” we see that there are 34 observations (employees) at each of ten locations.
That meets our requirement to have at least 30 observations per group.
b. The means for each location go from a low of $12.32/hour at Location 3 to a high of $14.56/hour for Location 9. Is
that a significant difference?
c. The standard deviation runs from a low of $1.33/hour for Location 6 to a high of $1.92/hour for Location 7.
d. Remember that standard deviation is the square root of variance. Both measure the spread of the
data. In conducting the ANOVA, we compare the variance within each group to the
variance between the groups, which is where the name Analysis
of Variance (ANOVA) comes from.
e. Finally, for Position 1, for each Location, we see that the Skewness and Kurtosis measures are within our rules of
thumb of -1 to +1 for skewness and -2 to +2 for kurtosis. Please note that these are only rough guidelines for
determining if your data is normally distributed.
f. The second table, “Position 2,” also has employees at 10 locations, but there are only 3 or 4 employees at each
location. While you can run an ANOVA on groups that have fewer than 30 observations, the results may not be
accurate due to the distribution of your data. Indeed, we can see that the skewness and kurtosis measures exceed
the limits discussed earlier. If you do conduct an ANOVA on data that looks like this, there are additional items to
consider, which are outside the scope of this exercise.
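To make the measures in the Summary Statistics tables concrete, the following sketch recomputes them in Python for a single group. The wage values are made-up illustration data, not the course dataset, and the function names are hypothetical. The skewness and kurtosis formulas are the bias-adjusted sample versions, which are assumed here to correspond to what SAS reports by default. The rule-of-thumb check mirrors the -1 to +1 and -2 to +2 guidelines above.

```python
import math
import statistics

def describe(values):
    """Return n, mean, variance, standard deviation, skewness and excess kurtosis.

    Needs at least 4 observations because of the (n - 3) term in the kurtosis formula.
    """
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)      # sample variance (n - 1 denominator)
    std_dev = math.sqrt(variance)               # standard deviation = sqrt(variance)
    z = [(v - mean) / std_dev for v in values]  # standardized values
    skewness = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"n": n, "mean": mean, "variance": variance,
            "std_dev": std_dev, "skewness": skewness, "kurtosis": kurtosis}

def within_rules_of_thumb(stats):
    """Rough normality screen: skewness in [-1, 1] and kurtosis in [-2, 2]."""
    return -1 <= stats["skewness"] <= 1 and -2 <= stats["kurtosis"] <= 2

# Hypothetical hourly wages for one location (illustration only)
wages = [12.10, 13.40, 12.85, 14.20, 13.05, 12.60, 13.90, 13.30]
summary = describe(wages)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
print("Within rules of thumb:", within_rules_of_thumb(summary))
```

With only 3 or 4 observations per group, as in “Position 2,” these measures become unstable, which is one reason small groups are hard to assess.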
13. Download the results as an RTF (little icon with the W on it). In your Downloads folder, you should have a file
named "Summary Statistics-results.rtf." It will open in Word. If you have issues with the RTF download, view the
following (Video).
14. Save this file so that you can return to it later in the assignment.
15. Click on “Tasks” in the navigation area on the left side of the window.
16. Click on “Linear Models” in the navigation area on the left side of the window, under the “Tasks” list.
17. Next, double-click on the “One-Way ANOVA” task.
18. In the new tab that opens (labeled “One-Way ANOVA”), click on “DATA” (below the word “Settings”). (The IS3310.LaborF2020 table will likely already appear. If so, you may skip the next step.)
19. Using the “Select a Table” icon, locate the dataset from DAX 1.
20. Click on the Filter icon (it looks like a funnel).
21. In the dialog box that opens, type: Position = 1
22. Hit the blue “Apply” button below the text entry area.
Figure 3 – Filter Table Rows Dialogue Box
23. In the “ROLES” area, using the “Add Columns” icon (it looks like a big plus sign), select “$/HR” as your “Dependent
variable.” For your “Categorical variable” select the “Location” variable.
24. Click on the “OPTIONS" tab and confirm that under the “HOMOGENEITY OF VARIANCE” area, the “Levene” test is
selected, under “COMPARISONS” the “Tukey” Comparisons method is selected, and for “Significance level” 0.05 is
showing in the text box, since we determined that α = 0.05 for this analysis.
25. Now click on the Run icon (it looks like a running man) or hit F3.
26. Remember to check the "LOG" for any "Errors" or "Warnings." If there are no errors or warnings, go to the next
step and review the results. Tables and charts from the “RESULTS” that might be of interest to help you provide
evidence of your analysis include Figures 4 through 9 on the next page.
Figure 4 – ANOVA Results
Figure 5 – Levene’s Test Results
Figure 6 – Welch’s ANOVA results
Figure 7 – Boxplot
Figure 8 – Least Squares Means Adjustment for Multiple Comparisons: Tukey
Figure 9 – Location group means comparison
27. Note: The numbers in your results may differ. Download the results as an RTF (little icon with the W on it). In
your Downloads folder, you should have a file named "One-Way ANOVA-results.rtf." It will open in Word. You
can copy and paste tables and charts from this file, but you may need to edit them so that they are readable in
your report.
28. Save this file so that you can return to it later.
Editing the Memo
29. Edit the date, your instructor's name, and your name to have the correct information.
30. Follow the instructions in the DAX3 Example Report that are provided in red text to add additional text.
31. Remove all the red text instructions from the memo.
32. Save your report as a PDF file.
33. Upload your file to the DAX3 Assignment in Canvas.
DAX 3 – ANOVA
IS 3310
Analysis of Variance
Updated August 2020
Dr. Bohler
Topics
Descriptive vs. Inferential Statistical Method
Significance level and P values
Why use ANOVA?
Requirements & Assumptions for ANOVA
Hypothesis for ANOVA
Visualization of data distribution
ANOVA Results
DAX 3
Descriptive vs Inferential
Descriptive statistics – Describe, show, or summarize data in a simpler way to help us understand the distribution of the data.
We may look at the measures of central tendency (mean, mode, median) and measures of the spread (variance and standard deviation).
The values we obtain using descriptive statistics on a population are called parameters, while the values for central tendency and spread obtained for samples are called statistics.
Inferential statistics – Use a random sample of data from a population to describe and make inferences (guesses, predictions, relationships) about the population.
Our sample may not accurately represent the population, so our inferences about the population could be wrong.
We can control our chances of being wrong by setting a threshold of significance.
Significance Level (α) and P values
We use inferential statistics to infer relationships about our data.
We may make a guess about that relationship which we call a hypothesis.
Usually, the null hypothesis is that there is no significant difference between specified populations and that any observed difference is due to sampling or experimental error.
The P value is the probability of seeing a relationship at least as extreme as the one observed in our sample, assuming the null hypothesis is correct.
We compare the P value to the stated significance level (α).
If the P value is smaller than α, we conclude that the relationship we are observing in our data is statistically significant, and we reject the null hypothesis that there is no effect (or relationship) present in our data.
The significance level (alpha or α) is the probability of rejecting the null hypothesis when it is actually true (also known as a Type I error).
For general research like you will be doing, we usually set the level of significance to 5% (0.05).
This means we accept a 5% chance of concluding that a relationship exists when there is no actual relationship present.
Why ANOVA?
A one-way analysis of variance (ANOVA) is used to identify whether there are any statistically significant differences between the means of two or more independent (unrelated) groups.
If we only have two groups, we might use a t test instead.
However, if you repeated the t test for every pair of groups, you could get erroneous results, because each test carries its own chance of a Type I error and those chances compound.
ANOVA tests for differences between group means by comparing the variance between groups to the variance within groups. The ratio of the two is the F statistic, which is evaluated using the F distribution.
An ANOVA will only indicate whether a statistically significant difference between the group means may exist.
You must perform post hoc tests to determine which groups may be significantly different from each other.
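The compounding of errors from repeated t tests can be sketched numerically. The example below assumes, as a simplification, that every pairwise test is independent and run at the same α. The function name is hypothetical.

```python
from math import comb

def familywise_error_rate(num_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the pairwise tests are independent, which is a simplification.
    """
    num_tests = comb(num_groups, 2)          # every pair of groups gets one t test
    rate = 1 - (1 - alpha) ** num_tests      # 1 minus chance that no test errs
    return num_tests, rate

for groups in (2, 4, 10):
    tests, rate = familywise_error_rate(groups)
    print(f"{groups} groups: {tests} pairwise tests, "
          f"chance of at least one Type I error is about {rate:.2f}")
```

Under these assumptions, four groups need 6 pairwise tests and ten groups need 45, so the overall chance of a false positive rises far above 5%. This is the motivation for running a single ANOVA first.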
ANOVA Requirements & Assumptions
The dependent variable should be an interval- or ratio-level variable, in other words, a continuous variable (e.g., Age {0, 1, 2, … 120}). You cannot perform an ANOVA if you only have two categorical variables.
The independent variable should have two or more independent groups (e.g., College Class = {Freshman, Sophomore, Junior, Senior}). The categorical variable is College Class, which has four groups or categories.
The observations are independent. In other words, members of a group cannot be classified into more than one group. In our example, someone cannot be classified as a Freshman and a Sophomore at the same time. This is determined by the research study design and is a very important assumption for the one-way ANOVA.
The dependent variable is normally distributed for each category of the independent variable. We can easily check this using SAS. While ANOVA is robust in this regard, it works better when the number of data points is nearly the same for each group and there are at least 30 data points in each group.
Finally, for the ANOVA results to be valid, we test for homogeneity of variance, in other words, whether the variance of the data around each group mean is approximately the same. We will use the Levene statistic to test the null hypothesis that the group variances are approximately equal.
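As a rough illustration of these ideas, the sketch below computes a one-way ANOVA F statistic from raw data, then reuses it on the squared deviations from each group mean to get a Levene-type statistic. The data and function names are hypothetical, and the sketch stops at the F values. SAS additionally reports Pr > F, which requires the F distribution and is not reproduced here.

```python
from statistics import mean

def one_way_f(groups):
    """F statistic: between-group mean square divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)                      # number of groups
    n = len(all_values)                  # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)    # model degrees of freedom: k - 1
    ms_within = ss_within / (n - k)      # error degrees of freedom: n - k
    return ms_between / ms_within

def levene_f(groups):
    """Levene-type statistic: a one-way ANOVA on squared deviations from group means."""
    squared_deviations = [[(v - mean(g)) ** 2 for v in g] for g in groups]
    return one_way_f(squared_deviations)

# Hypothetical ages for three small groups (illustration only)
ages = [
    [18, 19, 19, 20, 18],
    [19, 20, 21, 20, 19],
    [21, 22, 20, 23, 22],
]
print(f"ANOVA F statistic:  {one_way_f(ages):.2f}")
print(f"Levene F statistic: {levene_f(ages):.2f}")
```

A large ANOVA F suggests the group means differ by more than the spread within groups would explain, while a small Levene F suggests the group variances are similar.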
ANOVA Hypotheses
For example, we are going to test if there is a statistically significant difference in the average age of a group of students, based on their College Class (Freshman, Sophomore, Junior, or Senior). We will set α to 0.05.
While we may think we know the answer to this question (maybe Freshmen are 18 years old on average, Sophomores are 19, etc.), we cannot assume that. Instead, we must assume that there is no statistically significant difference in the ages of students by College Class.
In plain language, our null hypothesis is that the average age of Freshmen is equal to the average age of Sophomores, Juniors, and Seniors. Mathematically, we would write it like this:
H0: µFreshman = µSophomore = µJunior = µSenior
Our alternative hypothesis would be that at least one of the groups has a different average age than the other groups, and we would write it as:
HA: Not all group means are equal.
If the results of the ANOVA are not significant, we “fail to reject the null hypothesis.” This does not prove that the null hypothesis is true. It means we do not have enough evidence to conclude that it is wrong. However, if the ANOVA results are significant, we say that we “reject the null hypothesis and accept the alternative hypothesis.” We must also include the results of the analysis to provide evidence.
Summary Statistics of example – Age
Analysis Variable: Age
| Class | Mean | Std Dev | Min | Max | N | Variance | Skewness {-1 to 1} | Kurtosis {-2 to 2} |
| Freshman | 19.22 | 1.08 | 17.00 | 21.00 | 37 | 1.17 | -0.041 | -0.467 |
| Junior | 20.84 | 1.19 | 19.00 | 23.00 | 31 | 1.41 | -0.052 | -0.879 |
| Senior | 22.09 | 1.23 | 20.00 | 25.00 | 32 | 1.51 | 0.704 | 0.519 |
| Sophomore | 20.33 | 1.34 | 18.00 | 23.00 | 34 | 1.80 | 0.082 | -1.082 |
Visualization – Histogram
Visualization – Boxplot
ANOVA Results – Example
Dependent variable: Interval or Ratio?
Independent variable: Has two or more categories?
Observations: Are they independent?
Dependent variable: Normally distributed in each group?
HoV: Is Levene’s Test significant (Pr>F is less than α)?
If no, fail to reject the null hypothesis: you have HoV.
If yes, you do not have HoV and must use another test, Welch’s ANOVA for example.
ANOVA results: Are they significant (Pr>F less than α)?
If no, fail to reject the null hypothesis.
If yes, reject the null hypothesis and accept the alternative hypothesis: the group means are different at a statistically significant level. (Remember α = 0.05)
However, which group is different?
Visually, looking at the boxplot, they all look different from each other, but statistically, SAS gives us a method to examine the difference between each group.
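The two Pr > F checks in this checklist can be written as a small decision helper. This is only a sketch of the logic. The function name is hypothetical, and the values passed in are the Pr > F figures from the College Class example tables in this section, with “<.0001” entered as 0.0001 as an upper bound.

```python
def check_results(levene_p, anova_p, alpha=0.05):
    """Apply the two Pr > F checks from the checklist.

    Returns (has_hov, reject_null):
      has_hov     - True when Levene's test is NOT significant (variances look similar)
      reject_null - True when the ANOVA IS significant (not all group means are equal)
    """
    has_hov = levene_p >= alpha
    reject_null = anova_p < alpha
    return has_hov, reject_null

# Pr > F values from the example: Levene = 0.516, ANOVA reported as <.0001
has_hov, reject_null = check_results(levene_p=0.516, anova_p=0.0001)

if has_hov:
    print("Levene's test is not significant: homogeneity of variance holds.")
else:
    print("Levene's test is significant: consider Welch's ANOVA instead.")

if reject_null:
    print("ANOVA is significant: reject the null hypothesis.")
else:
    print("ANOVA is not significant: fail to reject the null hypothesis.")
```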
ANOVA Results
| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Model | 3 | 146.40 | 48.80 | 33.28 | <.0001 |
| Error | 130 | 190.62 | 1.47 | | |
| Corrected Total | 133 | 337.02 | | | |
Levene's Test for Homogeneity of Age Variance
ANOVA of Squared Deviations from Group Means
| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Class | 3 | 6.67 | 2.22 | 0.76 | 0.516 |
| Error | 130 | 378.30 | 2.91 | | |
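The F values in both tables can be checked by hand: each mean square is a sum of squares divided by its degrees of freedom, and F is the model mean square divided by the error mean square. The short sketch below redoes that arithmetic with the numbers from the two tables. The function name is hypothetical.

```python
def f_from_table(ss_model, df_model, ss_error, df_error):
    """Return (model mean square, error mean square, F value) from table entries."""
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return ms_model, ms_error, ms_model / ms_error

# ANOVA table: 4 classes give 3 model DF; 134 students give 134 - 4 = 130 error DF
anova = f_from_table(ss_model=146.40, df_model=3, ss_error=190.62, df_error=130)

# Levene's test table (ANOVA of squared deviations from group means)
levene = f_from_table(ss_model=6.67, df_model=3, ss_error=378.30, df_error=130)

print("ANOVA : MS model = {:.2f}, MS error = {:.2f}, F = {:.2f}".format(*anova))
print("Levene: MS model = {:.2f}, MS error = {:.2f}, F = {:.2f}".format(*levene))
```

Both results agree with the tables to two decimal places (F of about 33.28 for the ANOVA and about 0.76 for Levene's test).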
Group Comparison (Tukey-Kramer): Example
| Class | Age LSMEAN | LSMEAN Number |
| Freshman | 19.22 | 1 |
| Junior | 20.84 | 2 |
| Senior | 22.09 | 3 |
| Sophomore | 20.32 | 4 |
Least Squares Means for effect Class
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: Age
| i/j | Freshman | Junior | Senior | Sophomore |
| Freshman | | <.0001 | <.0001 | 0.0011 |
| Junior | <.0001 | | 0.0004 | 0.321 |
| Senior | <.0001 | 0.0004 | | <.0001 |
| Sophomore | 0.0011 | 0.321 | <.0001 | |
Reviewing the results of the group comparisons, it looks like all the groups are different from each other except the Sophomore and Junior groups, which did not have a significant difference from each other with α = 0.05.
Our report to communicate our results should indicate that we met the requirements for an ANOVA, tested the assumptions, and obtained statistically significant results. We should report the values we obtained and whether we “failed to reject the null hypothesis” or “rejected the null hypothesis and accepted the alternative hypothesis.”
In this case, we reject the null hypothesis and accept the alternative hypothesis that at least one group mean is different from the other group means.
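Reading the comparison matrix pair by pair can be sketched as follows. The p values are copied from the Tukey-Kramer table above, with “<.0001” stored as 0.0001 as an upper bound for illustration. The dictionary layout and function name are hypothetical.

```python
ALPHA = 0.05

# Pr > |t| for each pair of classes, taken from the comparison table
p_values = {
    ("Freshman", "Junior"): 0.0001,
    ("Freshman", "Senior"): 0.0001,
    ("Freshman", "Sophomore"): 0.0011,
    ("Junior", "Senior"): 0.0004,
    ("Junior", "Sophomore"): 0.321,
    ("Senior", "Sophomore"): 0.0001,
}

def split_pairs(pairs, alpha=ALPHA):
    """Separate pairs into significantly different and not significantly different."""
    different = sorted(pair for pair, p in pairs.items() if p < alpha)
    not_different = sorted(pair for pair, p in pairs.items() if p >= alpha)
    return different, not_different

different, not_different = split_pairs(p_values)
print("Significantly different pairs:")
for a, b in different:
    print(f"  {a} vs {b} (p = {p_values[(a, b)]})")
print("Not significantly different:")
for a, b in not_different:
    print(f"  {a} vs {b} (p = {p_values[(a, b)]})")
```

Only the Junior and Sophomore pair lands in the second list, matching the reading of the table given above.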
DAX3: Is pay equal by location?
We were asked to determine if there is a significant difference in pay by location for our company.
Using the “Labor F 2020” dataset, we decide to conduct an ANOVA of pay by location.
Using the “Summary Statistics” we will first look at the number of observations by location for the different positions.
It looks like there are 340 employees (N Obs) in “Position 1” throughout the company, and 34 in “Position 2.” Since we need to have at least 30 observations per group for an ANOVA to provide meaningful results, we will focus our analysis only on “Position 1” for now.
Next, we will run the “Summary Statistics” task again to see if
