# Linear regression and ANOVA are the Same Analysis?!

| category Statistics  | tag Statistics

When I tried to find out the difference between ANOVA and linear Regression, I got this interesting post: Why ANOVA and Linear Regression are the Same Analysis.

The author used data employment.sav to show why.

I tried both ANOVA and linear Regression using the same data with R lm and aov/anova functions. Not surprisely, I got the same results.

Actually, the description of aov is “Fit an analysis of variance model by a call to lm for each stratum.”

employee.csv is the same data as employment.sav with csv format.

require(RCurl)

Loading required package: RCurl

myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/employee.csv",
ssl.verifypeer = FALSE)

id gender     bdate educ jobcat salary salbegin jobtime prevexp minority
1  1      m  2/3/1952   15      3  57000    27000      98     144        0
2  2      m 5/23/1958   16      1  40200    18750      98      36        0
3  3      f 7/26/1929   12      1  21450    12000      98     381        0
4  4      f 4/15/1947    8      1  21900    13200      98     190        0
5  5      m  2/9/1955   15      1  45000    21000      98     138        0
6  6      m 8/22/1958   15      1  32100    13500      98      67        0

mydata$jobcat = gsub(1, "Clerical", mydata$jobcat)
mydata$jobcat = gsub(2, "Custodial", mydata$jobcat)
mydata$jobcat = gsub(3, "Manager", mydata$jobcat)

library(plyr)
ddply(mydata, .(jobcat), summarize, mean_prevexp = mean(prevexp))

jobcat mean_prevexp
1  Clerical        85.04
2 Custodial       298.11
3   Manager        77.62

lm0 <- lm(prevexp ~ jobcat, data = mydata)
lm0


Call:
lm(formula = prevexp ~ jobcat, data = mydata)

Coefficients:
(Intercept)  jobcatCustodial    jobcatManager
85.04           213.07            -7.42


aov0 <- aov(prevexp ~ jobcat, data = mydata)
aov0

Call:
aov(formula = prevexp ~ jobcat, data = mydata)

Terms:
jobcat Residuals
Sum of Squares  1174907   3998900
Deg. of Freedom       2       471

Residual standard error: 92.14
Estimated effects may be unbalanced

summary(lm0)


Call:
lm(formula = prevexp ~ jobcat, data = mydata)

Residuals:
Min     1Q Median     3Q    Max
-154.1  -65.0  -32.0   36.7  391.0

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        85.04       4.84   17.58   <2e-16 ***
jobcatCustodial   213.07      18.38   11.59   <2e-16 ***
jobcatManager      -7.42      11.16   -0.67     0.51
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.1 on 471 degrees of freedom
Multiple R-squared:  0.227,	Adjusted R-squared:  0.224
F-statistic: 69.2 on 2 and 471 DF,  p-value: <2e-16

anova(lm0)

Analysis of Variance Table

Response: prevexp
Df  Sum Sq Mean Sq F value Pr(>F)
jobcat      2 1174907  587453    69.2 <2e-16 ***
Residuals 471 3998900    8490
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov0)

Df  Sum Sq Mean Sq F value Pr(>F)
jobcat        2 1174907  587453    69.2 <2e-16 ***
Residuals   471 3998900    8490
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


what is the difference between anova and aov in R?

• input format: anova use the object as input, that means you need input the lm(); but aov use the formula as input.
• output: output is slightly different as the examples demostrated.