Chapter 3: Regression Analysis
AP Statistics Standards
I. Exploring Data: (continued)
D. Exploring bivariate data

Analyzing patterns in scatterplots

Correlation and linearity

Least-squares regression line

Residual plots, outliers, and
influential points

Objectives 
Essential Question:
How can we establish and quantify
a cause and effect relationship between two variables? 
Chap 3: 2-Variable (Bivariate) Relationships
 Identify the response and explanatory variables from a plot.
 response: y-variable, dependent
 explanatory: x-variable, independent
 Identify positive and negative associations from scatter plots.
Note: an association does not establish
cause and effect
 Detect linear and nonlinear relationships
using scatter plots.
Note: ALWAYS make a scatter plot when analyzing bivariate data
 Judge the relative strength of a relationship
by the amount of scatter around the curve of best fit.
 Identify outliers on scatter plots.
 Within the expected range of X-values
 Outside the expected range of Y-values for a given X-value
 Identify "influential outliers" on scatter plots.
 Outside the expected range of X-values
 Note: in this region the expected range of Y-values is undefined

Make scatter plots using the TI-83 calculator and Excel.

State why any analysis of 2-variable (bivariate) data should always begin with a scatter plot, regardless of which tools are used to further analyze the data.
Note: in the ideal situation
all the data points would have equal influence and be uniformly distributed.
Homefun (formative/summative assessment): Exercises 1, 3, 5, 7 pp. 158-159
Relevance: Many scientific
constants and predictions are based on measurements of the slopes of lines.


Essential Question:
Why is it important to quantify
correlation instead of just estimating it by looking at a graph? 
Correlation
 State the meaning of correlation and how it is typically indicated.
 r = correlation coefficient; its range is from -1 to +1
 Strength: the closer the absolute value of r is to 1, the stronger the linear relationship
 Direction: the sign of r (positive or negative association)
 Assumes a linear relationship
 Calculate r using the formula:
r = [1 / (n - 1)] ∑ [(x_{i} - xbar) / s_{x}] [(y_{i} - ybar) / s_{y}]
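As a rough illustration of the correlation formula, the sketch below standardizes each x and y, multiplies the pairs, and divides the sum by n - 1. The function name `correlation` and the data values are made up for this example; they are not from the textbook exercises.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of products of standardized x and y, over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 4))  # prints 0.7746
```

A calculator or spreadsheet should report the same r for the same data; the point of the sketch is only to show where each piece of the formula goes.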

Be as one with the following facts about correlation:
 r-square is bulletproof
 adding a constant to either the y-variable or the x-variable or both has no effect on r-square or slope.
 multiplying either the y-variable or the x-variable or both by a nonzero constant has no effect on r-square
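The facts above can be checked numerically. This is a minimal sketch with made-up data; the helper `slope_and_r_squared` is a name invented for this example.

```python
from statistics import mean

def slope_and_r_squared(xs, ys):
    """Return (least-squares slope, r-square) for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

base_slope, base_r2 = slope_and_r_squared(xs, ys)

# Add a constant to both variables: slope and r-square should not change
shift_slope, shift_r2 = slope_and_r_squared([x + 10 for x in xs], [y - 3 for y in ys])

# Multiply each variable by a nonzero constant: r-square should not change
scale_slope, scale_r2 = slope_and_r_squared([2.54 * x for x in xs], [100 * y for y in ys])

print(base_r2, shift_r2, scale_r2)        # all three agree up to rounding
print(base_slope, shift_slope, scale_slope)  # the scaled slope differs
```

Note that the slope survives adding a constant but not multiplying by one, which is why the fact about multiplication mentions only r-square.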
Homefun (formative/summative assessment): Exercises 9, 15, 17 pp. 159-160

Essential Question:
Why would we need to find a
mathematical relationship between variables? Isn't correlation enough? 
Regression
 Explain the difference between correlation and regression.
correlation: denotes the strength of an association
regression: yields a mathematical model (regression equation) of the association.

Perform regression/correlation analysis with the TI-83 calculator and Excel spreadsheets.
 What type of error does least squares regression minimize?
 Interpret regression equations.
 Single: yhat = a + bx (b = slope, a = intercept)
 Multiple: yhat = a_{0} + a_{1}x_{1} + a_{2}x_{2} + ... + a_{n}x_{n}

Calculate ybar using a regression equation, given xbar.

Properly state the meaning of slope according to the official statistics definition. (p. 155)
For every increase of one in the x-variable, the predicted y increases by the slope
 Properly interpret the intercept.

example: (sales) = 50 (advertising dollars) + 87
 What are the sales with no advertising?
Answer: the intercept, or 87

Describe the region where a given regression equation will give a meaningful association: within the range of x-values.

Define and decry the use of extrapolation.
Extrapolation is the act of drawing a conclusion based on the regression line in a region significantly outside of the range of x-values. These conclusions can be highly misleading.
 example: (bushels tomatoes) = 2 (lb fertilizer) + 10, x-range 0 to 5
 If Bob puts 100 lb of fertilizer on his plants, how many bushels of tomatoes will he get?
Answer: zero; he kills his tomato plants.
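One way to keep the tomato example honest is to make the prediction refuse x-values outside the range the data covered. The sketch below assumes the example line and its x-range of 0 to 5; the function name `predict_bushels` is invented here.

```python
def predict_bushels(fertilizer_lb, x_min=0.0, x_max=5.0):
    """Predict bushels from the example line, but only inside the observed x-range."""
    if not (x_min <= fertilizer_lb <= x_max):
        raise ValueError(
            f"{fertilizer_lb} lb is outside the x-range {x_min} to {x_max}: "
            "that would be extrapolation"
        )
    return 2 * fertilizer_lb + 10

print(predict_bushels(3))  # prints 16

try:
    predict_bushels(100)  # the equation alone would claim 210 bushels
except ValueError as err:
    print(err)
```

The equation itself happily returns 210 bushels for 100 lb; only the range check encodes what Bob's plants already know.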

Be aware that the point (xbar, ybar) is in the
center of the regression line. ybar = b (xbar) + a
Homefun (formative/summative
assessment): Exercise 35, 37, 41 p.191
Essential Question:
What happens to the
regression analysis when we change units? 
 Solve problems using the following equations (b = slope, a = intercept):
b = r (s_{y} / s_{x})
a = ybar - b(xbar)
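As a small worked sketch of the two equations, the code below gets the slope from r and the two standard deviations, then the intercept from the means. The data are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation coefficient
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # intercept

print(f"yhat = {a:.2f} + {b:.2f} x")  # prints yhat = 2.20 + 0.60 x
print(a + b * x_bar, y_bar)           # the line passes through (xbar, ybar)
```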
Effect of each action on the regression line and on the summary statistics. The Slope and Intercept columns are derived from the above 2 equations and the other four columns; the s_{y}, s_{x}, ybar, and xbar columns are based on review information from previous chapters.

Action | Slope | Intercept | s_{y} | s_{x} | ybar | xbar
Multiply by constant = k
x-data points | multiply by 1/k | none | none | multiply by k | none | multiply by k
y-data points | multiply by k | multiply by k | multiply by k | none | multiply by k | none
both | none | multiply by k | multiply by k | multiply by k | multiply by k | multiply by k
Add a constant = k
x-data points | none | adds -bk | none | none | none | adds k
y-data points | none | adds k | none | none | adds k | none
both | none | adds (k - bk) | none | none | adds k | adds k
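Several of the slope and intercept entries can be spot-checked by refitting the line after each change. This is a sketch with made-up data and an arbitrary k = 3; the helper name `fit` is invented for the example.

```python
from statistics import mean

def fit(xs, ys):
    """Return (slope b, intercept a) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return b, y_bar - b * x_bar

# Hypothetical data and constant, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
k = 3

b, a = fit(xs, ys)
b_xmul, a_xmul = fit([k * x for x in xs], ys)                     # multiply x by k
b_ymul, a_ymul = fit(xs, [k * y for y in ys])                     # multiply y by k
b_xadd, a_xadd = fit([x + k for x in xs], ys)                     # add k to x
b_both, a_both = fit([x + k for x in xs], [y + k for y in ys])    # add k to both

print(b, b_xmul, b_ymul)       # slope, slope / k, slope * k
print(a, a_xadd, a_both)       # intercept, intercept - b*k, intercept + k - b*k
```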
Homefun (formative/summative assessment): Exercise 47 p. 192
Relevance: Regression and correlation are the mathematical tools on which much of the social sciences, as well as many business tools, are founded.

Essential Question:
What does R-Square really mean? 
The Meaning of R-Square
 State the meaning of SST and SSE. Use them to calculate R-square.

SST = ∑ (y_{i} - ybar)^{2}
SST (Sum of Squares Total) is a measure of the scatter or variability of the y-data points about the y-data's mean.

SSE = ∑ (y_{i} - yhat)^{2}
SSE (Sum of Squared Errors) is a measure of the scatter or variability of the y-data points about the regression line.
Remember, the x-data is assumed to be error free.

(SST - SSE) is a measure of the amount of variability in the y-data points explained by the regression line.

r^{2} = (SST - SSE) / SST is a measure of the fraction of the variation in the values of y that is accounted for by the regression line of y on x.
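The definitions of SST and SSE translate directly into a few lines of code. The sketch below uses made-up data and fits the least-squares line first, since SSE needs the yhat values.

```python
from statistics import mean

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line yhat = a + b x
x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
y_hats = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # scatter about the mean of y
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))   # scatter about the regression line
r_squared = (sst - sse) / sst

print(f"SST = {sst:.2f}, SSE = {sse:.2f}, r-square = {r_squared:.2f}")
# prints SST = 6.00, SSE = 2.40, r-square = 0.60
```

Here the line accounts for 60% of the variation in y; the remaining 40% is the scatter left over around the line.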
 Give the official interpretation of r-square (coefficient of determination).

Explain why care must be taken in using the official interpretation of r-squared. Remember, even when correlating data from random sources, r-squared can sometimes be reassuringly high.

Susceptible to outliers, especially influential outliers

Data points furthest from the center of the line have more influence. It's similar to a playground seesaw or teeter-totter: a person seated on the end will have more influence than a person seated close to the middle.

There may be no causative relationship between explanatory and response variables. A high r-square does not establish causation!

r-square applies to linear relationships. A low r-square value does not establish that there is no association. The association could be nonlinear.
Homefun (formative/summative assessment): Exercises 53, 59, 63 pp. 192-194
Relevance: Sometimes major political decisions are made or social theories proposed based on questionable evidence from correlation/regression analysis. It's difficult to evaluate this evidence without knowing something about the meaning of r-square. 
Stats Investigation (formative/summative assessment): Meaning of R-Square, time approx. 2 class periods (individual work) 
Purpose: Determine if a regression analysis using random numbers can yield an r-square value of 50% or more.
Instructions: Set up a regression analysis in Excel using integer x-values from 0 to 9. Use a random number from 0 to 10 for the y-values. Run this simulation 100 times. Calculate the average r-square and record the highest r-square value. Record the three highest r-square values obtained in the class.
Save the data sets from your 4 regression/correlation results with the highest R-square values. You will use them again at the end of the year.
Questions/Conclusions:
 Based on your data, does a high r-square value by itself indicate a meaningful association or causation?
 Is the random number generator used in this investigation truly random?
 Is it possible to get a high r-squared value merely from random events?
 What does it really mean when we say that r-square represents the fraction of the variation in the values of y that is "explained" by the least squares regression of y on x? Discuss things like the SSM and SSE.
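The investigation is meant to be done in Excel, but the same simulation can be sketched in a few lines of code for comparison. The version below assumes the random y-values are drawn uniformly from the interval 0 to 10 (the instructions do not say whether they should be integers), and the function names are invented for this example.

```python
import random
from statistics import mean

def r_squared(xs, ys):
    """r-square of the least-squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def simulate(runs=100, seed=None):
    """Regress random y-values on x = 0..9 repeatedly; return every r-square."""
    rng = random.Random(seed)
    xs = list(range(10))
    results = []
    for _ in range(runs):
        ys = [rng.uniform(0, 10) for _ in xs]  # assumption: continuous uniform values
        results.append(r_squared(xs, ys))
    return results

results = simulate(runs=100, seed=1)
print(f"average r-square: {mean(results):.3f}")
print(f"highest r-square: {max(results):.3f}")
```

Changing or removing the seed gives a different set of 100 runs, which is a quick way to see how much the highest r-square moves from one trial to the next.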



Essential Question:
Can a regression equation with a high R-square be inappropriate? 
Residuals
 Define what is meant by a residual.
Mathematically: resid = y_{i} - yhat
English: a residual is the difference between the measured y-value and the y-value predicted by the regression equation.

Calculate residuals using a TI-83 calculator.

State 2 ways to plot residuals.

State the major assumptions concerning distribution of y-data points about regression lines.

Y-data normally distributed: If y-data points were repeatedly gathered for a given x-value, the y-data would form a normal distribution with its mean corresponding to the yhat value calculated with the given x-value. Remember, y-values have random measurement errors in them. Repeated measurements of a y-value will not give the exact same number.

Uniform spread in y-data from one end of the line to the other: The spread in the above distribution would be the same for every possible x-value. (See objective 32 to estimate the size of the spread.)
 Interpret residual plot patterns.
Residual plot conclusion: either appropriate or inappropriate
 Random: appropriate
 Smiley or Frowning Face (Mr. R's terms): inappropriate
 Pattern in the scatter: inappropriate
Note: residual plots merely magnify the patterns that can be observed in a scatter plot. The horizontal line at zero on a residual plot represents the regression line. A person skilled at interpreting scatter plots will arrive at the same conclusions that can be drawn from a residual plot.
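To see how a high r-square and an inappropriate line can coexist, the sketch below fits a straight line to deliberately curved, made-up data (y = x squared) and prints the sign of each residual in x-order.

```python
from statistics import mean

# Hypothetical curved data: y = x^2 for x = 0..10
xs = list(range(11))
ys = [x ** 2 for x in xs]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx              # slope of the least-squares line
a = y_bar - b * x_bar      # intercept
r2 = sxy ** 2 / (sxx * syy)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"r-square = {r2:.3f}")  # about 0.93, which looks reassuring
print(" ".join("+" if e > 0 else "-" for e in residuals))
# positive at both ends, negative in the middle: the "smiley face" pattern
```

The r-square alone would pass this line as a good fit; the run of same-signed residuals is what gives the curve away.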
 Make residual plots using a TI-83.
 Store x-data in L1 and y-data in L2
 First perform the regression analysis for L1, L2
 Store the residuals in L3 (the calculator saves them in the list RESID after a regression)
 Create a scatter plot of L1 on the horizontal axis and L3 on the vertical axis

State the sum of the residuals.
zero

Correctly interpret the standard error of the least squares regression line. The standard error of the least squares regression line is related to the residuals as shown below and is a measure of the spread of the data around the regression line. It can be considered an estimate of the standard deviation of the normal distribution described in objective 28.
Most computer printouts will report a value for s. (see Minitab Output)
s = [ ∑ (y - yhat)^{2} / (n - 2) ]^{1/2}
s = [ ∑ (residual)^{2} / (n - 2) ]^{1/2}
s = [ SSE / (n - 2) ]^{1/2}
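The residuals, their sum, and the standard error s can all be computed from one fitted line. This is a minimal sketch with made-up data, following the formula s = [ SSE / (n - 2) ]^{1/2}.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Least-squares line yhat = a + b x
x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Residual = measured y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - 2))  # standard error of the regression line

print(f"sum of residuals = {sum(residuals):.6f}")  # zero, up to rounding error
print(f"s = {s:.4f}")                              # prints s = 0.8944
```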
Homefun (formative/summative assessment): prob. 46, 60, 61, 71, 73 pp. 192-196
Relevance: Even though
the world is largely nonlinear, parts of it can often be accurately described
with linear models. Knowing when a linear model is inappropriate is essential to
building effective models.
Various types of regression
models are used in everything from predicting grades on AP tests to computer
control of chemical plants.
Essential Question:
How can I make an "A" on
the test? 
Regression/Correlation Analysis Review
 Work the practice test.
 Review the objectives.
 Look over
free response problems
from previous years.
 Master the vocabulary (see example
below).
Summative Assessment: Test, Objectives 1-32 
Stats Investigation (formative/summative assessment): Determining if a Regression Equation is Appropriate, time approx. 1 class period (individual work) 
Purpose: Determine if a linear regression equation is
appropriate for two different situations.
Background: Commercial resistors follow Ohm's law while light bulbs, due to their high temperatures, do not. Ohm's law is as follows:
I = (1/R) V
Where: I = current, V = voltage and R = resistance.
Plotting I vs. V will theoretically yield a straight line passing through the origin.
Instructions: Set up a least squares linear regression analysis in Excel to find the association between current (response variable) and voltage (explanatory variable) for a commercial resistor and for a light bulb. Remember that this means a scatter plot as well as finding the slope, intercept, and R-square for the data. Set up the formulas needed to plot a residual plot and make such a plot for the two sets of data.
Questions/Conclusions:
 Based on your data, does a high r-square value by itself indicate a meaningful association or causation?
 Find the resistance value in ohms for the commercial resistor.
 Is a linear equation appropriate for the commercial resistor? How about the light bulb? Explain your answers.


