Chapter 3: Regression Analysis
AP Statistics Standards
I. Exploring Data: (continued)
D. Exploring bivariate data
- Analyzing patterns in scatterplots
- Correlation and linearity
- Least-squares regression line
- Residual plots, outliers, and influential points
Objectives |
| Essential Question:
How can we establish and quantify
a cause and effect relationship between two variables? |
2-Variable (Bivariate) Relationships
- Identify the response and explanatory variables
from a plot.
- response:
y-variable, dependent
- explanatory:
x-variable, independent
- Identify positive and negative associations
from scatter plots.
Note: an association does not establish
cause and effect
- Detect linear and non-linear relationships
using scatter plots.
Note: ALWAYS make a scatter plot when analyzing bivariate data
- Judge the relative strength of a relationship
by the amount of scatter around the curve of best fit.
- Identify outliers on scatter plots.
- Within the expected range of x-values
- Outside the expected range of y-values for a given x-value
- Identify "influential outliers" on scatter plots.
- Outside the expected range of x-values
- Note: in this region the expected range of y-values is undefined
- Make scatter plots using the TI-83 calculator and Excel.
- State why any analysis of 2 variable (bivariate) data
should always begin with a scatter plot regardless of which tools are used to
further analyze the data.
- identifies outliers
- reveals gaps and clusters in the data
- displays patterns such as linearity or
non-linearity
Note: in the ideal situation
all the data points would have equal influence and be uniformly distributed.
Homefun
(formative/summative assessment): 3.11, 3.13
Relevance: Many scientific
constants and predictions are based on measurements of the slopes of lines.
Activities |
- Lesson 1
- Key Concept: How to represent 2 variable data.
- Purpose: Lay the foundation for correlation and
regression.
Interactive Discussion:
Objectives--How to interpret scatter plots. How
to identify outliers.
| Essential Question:
Why is it important to quantify
correlation instead of just estimating it by looking at a graph? |
Correlation
- State
the meaning of correlation and how it is typically indicated.
- Strength
- Direction
- Linear Relationship
- r
- Be as one with the 7 facts about correlation on pp. 143-144.
- Calculate r using the formula (p. 140).
r = [1 / (n - 1)] Σ [(xi - xbar) / sx] [(yi - ybar) / sy]
- Note: r-square is bullet-proof
- Adding a constant to the y-variable, the x-variable, or both has no effect on r-square or slope.
- Multiplying the y-variable, the x-variable, or both by a nonzero constant has no effect on r-square.
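The formula for r and the "bullet-proof" note can be checked with a short Python sketch. The data values and the function name `correlation` are made up for illustration; they are not from the textbook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = [1 / (n - 1)] * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up illustrative data (an assumption, not textbook data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)

# Add 10 to every x and multiply every y by 3: r should not change.
r_transformed = correlation([x + 10 for x in xs], [3 * y for y in ys])

print(round(r, 4), round(r_transformed, 4))
print(round(r ** 2, 4), round(r_transformed ** 2, 4))
```

Multiplying by a negative constant would flip the sign of r, but r-square would still be unchanged.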
Homefun (formative/summative
assessment): prob. 3.29, 3.31
- Lesson 2
- Key Concept: Correlation
- strength of the relationship
- Purpose: Develop the
ability to evaluate the strength of the relationship.
Interactive Discussion: The dog barked and the tree fell down. Was there an association? Was there causation?
Individuals: Perform correlations on SAT data (p. 71) using a TI-83. (This will carry over into the regression section.)
| Essential Question:
Why would we need to find a
mathematical relationship between variables? Isn't correlation enough? |
Regression
- Explain
the difference between correlation and regression.
- Perform regression/correlation analysis with the TI-83 calculator and Excel spreadsheets.
- What type of error does least squares regression minimize?
- Error measured in
y-dimension (y = response variable)
- x-dimension (explanatory
variable) considered error-free
- Interpret regression equations.
- Single
yhat = a + bx (a = intercept, b = slope)
- Multiple
yhat = a0 + a1x1 + a2x2 + ... + anxn
- Calculate ybar using a regression equation, given xbar.
- Properly state the meaning of slope according to the official statistics definition (p. 155).
For every increase of one unit in the x-variable, the predicted y changes by the amount of the slope.
- Properly interpret the intercept.
- example: (sales) = 50 (advertising dollars) + 87
- What are the predicted sales with no advertising?
Answer: the intercept, or 87
- Describe the region where a given regression equation will give a meaningful association: within the range of the observed x-values.
- Define and decry the use of
extrapolation.
Extrapolation is the act of drawing a conclusion based on the regression line
in a region significantly outside of the range of x-values. These conclusions
can be highly misleading.
- example: (bushels tomatoes) = 2 (lb fertilizer) + 10
- x-range 0 to 5
- If Bob puts 100 lb of fertilizer on his plants, how many bushels of tomatoes will he get?
Answer: zero--he kills his tomato plants.
- Be aware that the point (xbar, ybar) always lies on the regression line, at the center of the data: ybar = b (xbar) + a
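The tomato example above can be turned into a small Python sketch that refuses to extrapolate. The function name `predict_bushels` and the choice to raise an error outside the x-range are assumptions made for illustration; the equation and the 0 to 5 range come from the example.

```python
def predict_bushels(fertilizer_lb, x_min=0.0, x_max=5.0):
    """Predict bushels with the tomato line, but only inside the x-range."""
    if not (x_min <= fertilizer_lb <= x_max):
        # Outside the observed x-values the line says nothing trustworthy.
        raise ValueError(
            f"{fertilizer_lb} lb is outside the x-range {x_min} to {x_max}: "
            "this would be extrapolation"
        )
    return 2 * fertilizer_lb + 10

print(predict_bushels(3))  # prints 16

try:
    predict_bushels(100)
except ValueError as err:
    print(err)
```

Blindly plugging 100 lb into the equation would give 210 bushels, which is exactly the kind of misleading conclusion the objective warns about.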
| Essential Question:
What happens to the
regression analysis when we change units? |
- Solve problems using the following equations (b = slope, a = intercept):
b = r (sy / sx)
a = ybar - b (xbar)
Effect of changing units on the regression line. The slope and intercept effects are derived from the above 2 equations and the sy, sx, ybar, and xbar columns, which are based on review information from previous chapters.
| Action | Slope | Intercept | sy | sx | ybar | xbar |
| Multiply by constant = k | | | | | | |
| x-data points | multiply by 1 / k | none | none | multiply by k | none | multiply by k |
| y-data points | multiply by k | multiply by k | multiply by k | none | multiply by k | none |
| both | none | multiply by k | multiply by k | multiply by k | multiply by k | multiply by k |
| Add a constant = k | | | | | | |
| x-data points | none | adds -bk | none | none | none | adds k |
| y-data points | none | adds k | none | none | adds k | none |
| both | none | adds (k - bk) | none | none | adds k | adds k |
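Two of the x-data rows in the table above can be verified numerically. The following Python sketch computes the slope and intercept from b = r (sy / sx) and a = ybar - b (xbar); the data, the constant k = 4, and the helper name `fit_line` are made up for illustration.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (b, a): b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return b, a

# Made-up illustrative data and constant (assumptions).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
k = 4

b, a = fit_line(xs, ys)
b_mult, a_mult = fit_line([k * x for x in xs], ys)  # multiply x-data by k
b_add, a_add = fit_line([x + k for x in xs], ys)    # add k to x-data

print("original:     slope", round(b, 4), "intercept", round(a, 4))
print("x times k:    slope", round(b_mult, 4), "intercept", round(a_mult, 4))
print("x plus k:     slope", round(b_add, 4), "intercept", round(a_add, 4))
```

With this data the slope is divided by k and the intercept is unchanged when x is multiplied by k, and the intercept drops by bk when k is added to x, matching the table rows.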
Homefun (formative/summative
assessment): prob. 3.33, 3.35
Relevance: Regression and correlation are the mathematical tools on which much of the social sciences and business analysis are founded.
- Lesson 3
- Key Concept: Regression -
finding the mathematical relationship between two variables.
- Purpose: Obtain and
understand regression equations.
Demo: Using
Fathom software, demonstrate the
reasoning process behind least squares regression analysis.
Interactive Discussion:
On objectives
2-person teams: See above
| Essential Question:
What does R-Square really
mean? |
The Meaning of R-Square
- State the meaning of SST and SSE. Use them to calculate
R-square.
- SST = ∑ (yi - ybar)^2
SST (Sum of Squares Total) is a measure of the scatter or variability of the y-data points about the y-data's mean.
- SSE = ∑ (yi - yhat)^2
SSE (Sum of Squares Error) is a measure of the scatter or variability of the y-data points about the regression line. Remember, the x-data is assumed to be error free.
- (SST - SSE) is a measure of the amount of variability in the y-data points explained by the regression line.
- r^2 = (SST - SSE) / SST is a measure of the fraction of the variability in the y-data points explained by the regression line.
- Give the official interpretation of r-square
(coefficient of determination).
- Use the proper magic words (p. 162)
- r-square evaluates the entire equation
- Explain why care must be taken in using the official
interpretation of r-squared. Remember, even when
correlating data from random sources, r-squared can sometimes be reassuringly
high.
- Susceptible to outliers,
especially influential outliers
- Data points furthest from the center of the line have
more influence. It's similar to a playground
see-saw or teeter-totter: a person seated on the end will have more
influence than a person seated close to the middle.
- There may be no causative relationship between explanatory
and response variables. A high r-square does not
establish causation!
- r-square applies to linear relationships. A low r-square value does not establish that there is no association. The association could be non-linear.
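The SST and SSE definitions above can be worked through in a few lines of Python. This is a minimal sketch on made-up data; the slope is computed as Sxy / Sxx, which is algebraically the same as b = r (sy / sx).

```python
from statistics import mean

# Made-up illustrative data (an assumption, not textbook data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
y_hat = [a + b * x for x in xs]

# SST: scatter of the y-data about its mean.
sst = sum((y - y_bar) ** 2 for y in ys)
# SSE: scatter of the y-data about the regression line.
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

r_squared = (sst - sse) / sst
print("SST =", round(sst, 4))
print("SSE =", round(sse, 4))
print("r-square =", round(r_squared, 4))
```

For this data SST is 6 and SSE is 2.4, so the line explains 3.6 / 6 = 0.6 of the variability in y.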
Homefun (formative/summative
assessment): prob. 3.43, 3.45
Relevance: Sometimes
major political decisions are made or social theories proposed based on
questionable evidence. It's impossible to evaluate this evidence without knowing
something about the meaning of r-square. |
- Lesson 4
- Key Concept: The use and
misuse of R-square
- Purpose: To understand
how R-square is often overused as a measure of regression
analysis "goodness".
Interactive Discussion:
Objectives
2-person teams: See Stats
investigation below
Stats
Investigation
(formative/summative assessment): Meaning of
R-Square - time approx 2 class periods (individual work) |
Purpose: Determine if a regression analysis using random
numbers can yield an r-square value of 50% or more.
Instructions: Set up a
regression analysis in Excel using integer x-values from 0
to 9. Use a random number from 0 to 10 for the y-values. Run
this simulation 100 times. Calculate the average r-square
and record the highest r-squared value. Record the three
highest r-square values obtained in the class.
Save the data sets from your 4 regression/correlation results with the highest R-square values. You will use them again at the end of the year.
Questions /Conclusions:
- Based on your data, does a high
r-square value by itself indicate a meaningful association
or causation?
- Is the random number generator used in this
investigation truly random?
- Is it possible to get a high r-squared value merely
from random events?
- What does it really mean when
we say that r-square
represents the fraction of the variation in the values of
y that is "explained" by the least squares regression of y
on x? Discuss things like the SSM and SSE.
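The investigation is meant to be done in Excel, but the same simulation can be sketched in Python for comparison. Assumptions here: the y-values are drawn as uniform real numbers from 0 to 10 with `random.uniform`, and a fixed seed is used only so that reruns give the same numbers. The helper name `r_squared` is hypothetical.

```python
import random

def r_squared(xs, ys):
    """r-square of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if syy == 0:
        return 0.0  # every y identical: treat as no fit (assumed convention)
    return sxy ** 2 / (sxx * syy)

random.seed(0)          # assumption: fixed seed for repeatability only
xs = list(range(10))    # integer x-values 0 to 9

results = []
for _ in range(100):    # 100 simulated data sets
    ys = [random.uniform(0, 10) for _ in xs]
    results.append(r_squared(xs, ys))

average = sum(results) / len(results)
highest = max(results)
print("average r-square:", round(average, 3))
print("highest r-square:", round(highest, 3))
print("runs at 50% or more:", sum(1 for v in results if v >= 0.5))
```

Changing or removing the seed gives a different set of 100 runs, which is closer to what each student will see in class.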
| Essential Question:
Can a regression equation with a
high R-square be inappropriate? |
Residuals
- Define what is meant by a residual.
Mathematically:
resid = yi - yhat
English:
a residual is the difference between the measured y-value and the
y-value predicted by the regression equation.
- Calculate residuals using a TI-83 calculator.
- State 2 ways to plot residuals.
- Residuals vs x
commonly used with straight line equation
- Residuals vs y
commonly used with multiple regression analysis
- State the major assumptions concerning
distribution of y-data points about regression lines.
- Y-data normally distributed:
If y-data points were repeatedly
gathered for a given x-value, the y-data would form a normal distribution
with its mean corresponding to the yhat value calculated with the given
x-value. Remember,
y-values have random measurement errors in them. Repeated measurements of a
y-value will not give the exact same number.
-
Uniform spread in y-data from one end of the
line to the other:
The spread in the above distribution would be the same for every possible
x-value.
- Interpret residual plot patterns.
Residual Plot conclusion:
either appropriate
or inappropriate
- Random--appropriate
- Smiley or Frowning Face (Mr. R's Terms)--inappropriate
- Pattern in the scatter--inappropriate
Note: residual plots merely magnify the patterns that can be observed in a scatter plot. The horizontal line at zero on a residual plot represents the regression line. A person skilled at interpreting scatter plots will arrive at the same conclusions that can be drawn from a residual plot.
- Make residual plots using a TI-83.
- Store x-data in L1 and y-data in L2
- First perform the regression analysis for L1, L2
- Store the residuals in L3
- Create a scatter plot of L1 on the horizontal axis and L3 on the vertical axis
- State the sum of the residuals.
Answer: zero
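The residual definition (resid = yi - yhat) and the zero-sum property can be checked with a short Python sketch. The data are made up for illustration; the residual list plays the role of L3 in the TI-83 steps.

```python
# Made-up illustrative data (an assumption, not textbook data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Residual = measured y minus the y predicted by the regression equation.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.2f}")

# Up to floating-point rounding, the residuals of a least-squares line sum to zero.
print("sum is zero:", abs(sum(residuals)) < 1e-9)
```

Plotting `residuals` against `xs` gives the residuals vs. x plot described above.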
Homefun (formative/summative
assessment): prob. 3.47, 3.61,
3.71
Relevance: Even though
the world is largely non-linear, parts of it can often be accurately described
with linear models. Knowing when a linear model is inappropriate is essential to
building effective models.
Various types of regression
models are used in everything from predicting grades on AP tests to computer
control of chemical plants.
| Essential Question:
How can I make an "A" on
the test? |
Regression/Correlation Analysis Review
- Work the practice test.
- Review the objectives.
- Look over
free response problems
from previous years.
- Master the vocabulary (see example
below).
Summative Assessment: Test--Objectives 1-32 |
- Lesson 5
- Key Concept: Residual
Plots.
- Purpose: Understand when
a given regression equation is appropriate.
Interactive Discussion:
Objectives.
Individual work:
Perform residual plots on TI-83
calculators and with Excel software.
Stats
Investigation
(formative/summative assessment): Determining if a Regression Equation is Appropriate - time approx 1 class period (individual work) |
Purpose: Determine if a linear regression equation is
appropriate for two different situations.
Background: Commercial resistors follow Ohm's law, while light bulbs, due to their high temperatures, do not. Ohm's law is as follows:
I = (1/R) V
Where: I = current, V = voltage, and R = resistance.
Plotting I vs. V will theoretically yield a straight line passing through the origin with slope 1/R.
Instructions: Set up a least
squares linear regression analysis in Excel to find the
association between current (response variable) and voltage
(explanatory variable) for a commercial resistor and for a
light bulb. Remember that this means a scatter plot as well
as finding the slope, intercept, and R-square for the data.
Set up the formulas needed to plot a residual plot and make
such a plot for the two sets of data.
Questions /Conclusions:
- Based on your data, does a high
r-square value by itself indicate a meaningful association
or causation?
- Find the resistance value in ohms for the commercial resistor.
- Is a linear equation appropriate for the commercial resistor? How about the light bulb? Explain your answers.
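For the resistor part of the investigation, the calculation might look like the following Python sketch. The voltage and current readings are HYPOTHETICAL values invented for illustration (roughly a 100-ohm resistor); they are not measured data, and your own lab readings should replace them.

```python
# HYPOTHETICAL readings, invented for illustration only (not measured data).
volts = [1.0, 2.0, 3.0, 4.0, 5.0]                # explanatory variable V
amps = [0.0101, 0.0199, 0.0302, 0.0398, 0.0500]  # response variable I

n = len(volts)
v_bar = sum(volts) / n
i_bar = sum(amps) / n

# Least-squares line: I = intercept + slope * V
slope = (sum((v - v_bar) * (i - i_bar) for v, i in zip(volts, amps))
         / sum((v - v_bar) ** 2 for v in volts))
intercept = i_bar - slope * v_bar

# Ohm's law says I = (1/R) V, so the slope estimates 1/R.
resistance = 1 / slope

# Residuals for the residual plot (plot these against volts).
residuals = [i - (intercept + slope * v) for v, i in zip(volts, amps)]

print("slope (1/R):", round(slope, 5))
print("intercept:", round(intercept, 5))
print("resistance in ohms:", round(resistance, 1))
print("residuals:", [round(e, 5) for e in residuals])
```

A light-bulb data set would go through the same steps; the question is whether its residual plot looks random or shows a pattern.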