Uploaded by dibz959

8.1 Residential Property Price Data Analysis

advertisement
GE2213
Understanding Uncertainty
and Statistical Reasoning
Regression Analysis
Case Study: Residential Property Price
Data Analysis
1
Outline
•
•
•
•
•
•
•
Overview of regression analysis
Simple linear regression model
Determination coefficient
Correlation coefficient
Multiple regression model
3 special regression models
Case study: Residential property price data analysis
2
Overview of Regression Analysis
• Input
• Dependent (or called response) variable, π‘Œ
• The variable we wish to estimate or predict
• Independent (or called explanatory) variable, 𝑋
• The variable used to connect (sometimes, also explain) the
dependent variable
• Output
• A linear function that allows us to
• Provide estimation: Estimate the average value of the dependent
variable based on the corresponding value of the independent
variable, usually in the form of a confidence interval
• Provide prediction: Predict the value of the dependent variable
based on the corresponding value of the independent variable,
usually in the form of a prediction interval (which must be wider
than the corresponding confidence interval)
3
Overview of Regression Analysis
• It is quite possible to develop a relation (e.g. linear,
quadratic) between X and Y that there is no causality (i.e.
no reason and effect) at all. The relationship exists owing
to casualty (i.e. happening by chance).
• e.g. A straight line may appear to provide a good model
for relating monthly output Y of a steel mill to the weight X
of computer printouts appearing on a manager's desk
during the month, but their causal relationship is tenuous.
A 3rd variable may have caused the change in both X and Y,
producing the relationship that we have observed.
• Causality can be inferred only when analysis uncovers
some plausible reasons for its existence. For instance,
workers work faster as the queue length increases, and so
a linear relationship is at least plausible.
4
Overview of Regression Analysis
Cont’d
e.g. two quadratic below
5
Simple Linear Regression Model
• A simple linear regression model consists of two components
• Regression line: A straight line describes the dependence of a
single variable 𝒀 on one variable 𝑿
• Random error: The unavoidable deviation is between the actual
value and the expected value
Population Intercept
Dependent
Variable
π’€π’Š =
Population Slope Coefficient
Independent
Variable
Random Error
𝜷𝟎 + 𝜷𝟏 π‘Ώπ’Š + πœΊπ’Š
Regression
Line
6
6
Simple Linear Regression Model
Cont’d
𝒀
π’€π’Š
π’†π’Š
π’€π’Š
π’€π’Š = π’ƒπŸŽ + π’ƒπŸ π‘Ώπ’Š
π‘Ώπ’Š
• 𝑏0 represents the sample intercept
• 𝑏1 represents the sample slope coefficient
𝑿
7
Simple Linear Regression Model
– Method of Least Squares
Cont’d
• 𝑏0 and 𝑏1 are estimated using the method of least squares,
which minimizes the Sum of Squared Errors (SSE)
𝑛
𝑛
𝑒𝑖2 =
𝑆𝑆𝐸 =
𝑖=1
𝑛
(π‘Œπ‘– − π‘Œπ‘– )2 =
𝑖=1
[π‘Œπ‘– − 𝑏0 + 𝑏1 𝑋𝑖 ]2
𝑖=1
8
Simple Linear Regression Model
– Method of Least Squares
Cont’d
• The formulae of 𝑏0 and 𝑏1 can be obtained by first partially
differentiating SSE with respect to 𝑏0 and 𝑏1 , and next solving
πœ•
𝑛
2
𝑒
𝑖
𝑖=1
πœ•π‘0
𝑛
= −2
𝑖=1
π‘Œπ‘– − 𝑏0 + 𝑏1 𝑋𝑖
=0
and
πœ•
𝑛
2
𝑖=1 𝑒𝑖
πœ•π‘1
𝑛
= −2
𝑋𝑖 π‘Œπ‘– − 𝑏0 + 𝑏1 𝑋𝑖
=0
𝑖=1
simultaneously
9
Simple Linear Regression Model
– Method of Least Squares
Cont’d
• The formulae of 𝑏0 and 𝑏1 are
𝑏0 = π‘Œ − 𝑏1 𝑋
and
𝑏1 =
𝑛
𝑖=1(𝑋𝑖 − 𝑋)(π‘Œπ‘– −
𝑛
2
(𝑋
−
𝑋)
𝑖
𝑖=1
π‘Œ)
10
Simple Linear Regression Model
– Determination Coefficient
Cont’d
• Measures the proportion of variation in π‘Œ that is removed by
the independent variable 𝑋, using the simple linear regression
model
• Sample determination coefficient
π‘Ÿ2
=
removed variation
total variation
=
𝑛
2
𝑖=1(π‘Œπ‘– −π‘Œ)
𝑛
2
𝑖=1(π‘Œπ‘– −π‘Œ)
• A unitless number between 0 and +1, inclusive
• “Magnitude” measures how good the regression model fits
• Closer to +1, the better the regression model fits
• Also known as goodness-of-fit measure
11
Simple Linear Regression Model
– Correlation Coefficient
Cont’d
• Describe the trend and strength of a linear relationship
between two numerical variables
• Sample correlation coefficient
either π‘Ÿ = − π‘Ÿ 2 or π‘Ÿ = + π‘Ÿ 2
• A unitless number between -1 and +1, inclusive
• “Sign” indicates the trend (down = negative r values / up =
positive r values) of a linear relationship
• “Magnitude” measures the strength of a linear relationship
• Closer to -1, the stronger negative linear relationship
• Closer to +1, the stronger positive linear relationship
• Closer to 0, no linear relationship
12
Scatter Plots against
Correlation Coefficients
𝒀
𝒀
𝒀
r = -1
𝑿
𝑿
r = -0.6
𝒀
Cont’d
r=0
𝑿
𝒀
r = +0.6
𝑿
r = +1
𝑿
13
Simple Linear Regression Model +
Correlation Coefficient in Calculator
• Data set
X
5
5
5
10
10
10
15
15
15
20
20
20
Y
1.6
2.2
1.4
1.9
2.4
2.6
2.3
2.7
2.8
2.6
2.9
3.1
14
Determination or Correlation
Coefficient – Example
• Data set: The sales of vitamin supplement and the
corresponding advertising expense
Month
1
2
3
4
5
6
7
8
9
10
Sales
(HK$’000)
963.500
893.000
1057.250
1183.250
1419.500
1547.750
1580.000
1071.500
1078.250
1122.500
Advertising Expense
(HK$’000)
37.427
40.850
41.431
44.842
51.788
63.760
63.572
44.686
48.959
50.056
15
Determination or Correlation
Coefficient – Example
Cont’d
• Scatter diagram
16
Determination or Correlation
Coefficient – Example
Cont’d
• Sample determination coefficient between sales of vitamin
supplement and advertising expense is
π‘Ÿ 2 = 0.8617 (you should use a hand calculator to check)
which indicates that 86.17% variation in π‘Œ removed by
the independent variable 𝑋, using the regression model
• Sample correlation coefficient between sales of vitamin
supplement and advertising expense is
π‘Ÿ = +0.9283 (you should use a hand calculator to check)
• There is a very strong positive linear relationship between
sales of vitamin supplement and advertising expense
17
Simple Linear Regression Model
– Example
Cont’d
• We wish to examine the linear dependency of the sales of
vitamin supplement on advertising expenses
• 10 monthly records were randomly selected
Month
1
2
3
4
5
6
7
8
9
10
Sales
(HK$’000)
963.500
893.000
1057.250
1183.250
1419.500
1547.750
1580.000
1071.500
1078.250
1122.500
Advertising Expense
(HK$’000)
37.427
40.850
41.431
44.842
51.788
63.760
63.572
44.686
48.959
50.056
• Find a straight line function that best fits the data
18
Simple Linear Regression Model
– Example
Cont’d
• The sample simple linear regression model
π‘Œ = −15.339 + 24.765𝑋 (you should use a hand calculator
to check)
where π‘Œ = sales (in HK$’000)
𝑋 = advertising expense (in HK$’000)
• For each HK$1000 (= 1 × 1000) increase in advertising expense,
the model estimates or predicts that the sales of vitamin
supplement increases by HK$24,765 (= 24.765 × 1000)
19
Simple Linear Regression Model
– Example
Cont’d
• What is the estimated or predicted amount of sales if the
advertising expense is HK$42,000?
π‘Œ = −15.339 + 24.765𝑋
= −15.339 + 24.765 42
= 1024.791 (in HK$’000, i.e. HK$1,024,791)
20
Simple Linear Regression Model
in Microsoft Excel
• Step 1: Add the “Analysis ToolPak”
• File οƒ  Options οƒ  Add-Ins οƒ  Click “Go” at the bottom οƒ  Check
“Analysis ToolPak” and click “OK”
• “Data Analysis” button can be found in the “Data” menu bar
21
Simple Linear Regression Model
in Microsoft Excel
Cont’d
• Step 2: Develop regression model
• Invoke the Data Analysis function, select “Regression” and click
“OK”
22
Simple Linear Regression Model
in Microsoft Excel
Cont’d
• Step 3: Input the necessary information
23
Simple Linear Regression Model
in Microsoft Excel
Cont’d
• Step 4: Obtain the output
|π‘Ÿ|
π‘Ÿ2
86.17% (about 86.2%) of the
variation in sales of vitamin
supplement can be removed
by the variability in advertising
expense
24
𝑏0 and 𝑏1
Multiple Regression Model
• A multiple regression model is to connect one dependent
variable with two or more independent variables in a linear
function
Population Intercept
Population Slope Coefficients
π’€π’Š = 𝜷𝟎 + 𝜷𝟏 π‘ΏπŸπ’Š + 𝜷𝟐 π‘ΏπŸπ’Š + β‹― + πœ·π‘² π‘Ώπ‘²π’Š + πœΊπ’Š
Dependent Variable
Independent Variables
Random Error
• 𝑏0 represents the sample intercept
• 𝑏1 , 𝑏2 … , 𝑏𝐾 represent the sample slope coefficients
25
Multiple Regression Model
Cont’d
π‘Œ
π’€π’Š
π’ƒπŸŽ
π’†π’Š
π’€π’Š
π’€π’Š = π’ƒπŸŽ + π’ƒπŸ π‘ΏπŸπ’Š + π’ƒπŸ π‘ΏπŸπ’Š
(𝑋1𝑖 , 𝑋2𝑖 )
𝑋2
𝑋1
26
Multiple Regression Model – Example
Cont’d
• In addition to advertising expense, it is believed that bonus
paid to salespersons affects the sales of vitamin supplement
• Fit a multiple regression model for estimating or predicting the
sales of vitamin supplement using both plausible independent
variables
Month
1
2
3
4
5
6
7
8
9
10
Sales
(HK$’000)
963.500
893.000
1057.250
1183.250
1419.500
1547.750
1580.000
1071.500
1078.250
1122.500
Advertising Expense
(HK$’000)
37.427
40.850
41.431
44.842
51.788
63.760
63.572
44.686
48.959
50.056
Bonus Paid
(HK$’000)
23.098
23.628
27.157
29.12
28.217
32.116
29.432
30.569
23.841
27.138
27
Multiple Regression Model – Example
Cont’d
• Excel output (see the Excel file)
88.6% of the variation in sales
of vitamin supplement can be
π‘Ÿ 2 removed by the two
independent variables included
in the regression model
28
π‘†π‘Žπ‘™π‘’π‘  = −279.603 + 21.216 𝐴𝑑𝑣 + 15.940(π΅π‘œπ‘›π‘’π‘ )
Multiple Regression Model – Example
Cont’d
π‘†π‘Žπ‘™π‘’π‘  = −279.603 + 21.216 𝐴𝑑𝑣 + 15.940(π΅π‘œπ‘›π‘’π‘ )
For each HK$1000 increase in
advertising expense, the
estimated or predicted sales of
vitamin supplement increased by
HK$21,216; holding bonus paid
to salespersons unchanged
For each HK$1000 increase in
bonus paid to salespersons, the
estimated or predicted sales of
vitamin supplement increased by
HK$15,940; holding advertising
expense unchanged
29
Multiple Regression Model
– Quick Exercises
Cont’d
• What is the estimated or predicted sales of vitamin
supplement if the advertising expense is HK$60,000 and
bonus to salespersons is HK$30,000?
π‘†π‘Žπ‘™π‘’π‘  = −279.603 + 21.216 𝐴𝑑𝑣 + 15.940(π΅π‘œπ‘›π‘’π‘ )
60
30
• In this month, the company has an extra HK$100,000
budget, how should the money be used? Advertising or
paying bonus to salespersons or both?
30
Special Regression Model –
Quadratic Model
• A regression model connecting one dependent variable with
two or more independent variables in a quadratic function
π‘Œπ‘– = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋1𝑖 2 + πœ€π‘–
• The second independent variable is the square of the first
independent variable
• It is suitable when a scatter diagram indicates a non-linear
relationship, with one of the following 4 shapes
𝒀
𝒀
𝒀
𝒀
31
2 > 0
π‘ΏπŸ
2 > 0
π‘ΏπŸ
2 < 0
π‘ΏπŸ
2 < 0
π‘ΏπŸ
Special Regression Model –
Dummy-Variable Model
Cont’d
• It is useful when categorical variable is used as independent
variable
HK
π‘Œ
Island
π‘Œπ‘– = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + πœ€π‘–
𝑏0 + 𝑏2
• For example
• π‘Œ = Sales of vitamin supplement (HK$’000)
• 𝑋1 = Advertising expense (HK$’000)
• 𝑋2 = Location of shop =
Kowloon
𝑏0
𝑋1
0 𝑖𝑓 πΎπ‘œπ‘€π‘™π‘œπ‘œπ‘›
1 𝑖𝑓 π»π‘œπ‘›π‘” πΎπ‘œπ‘›π‘” πΌπ‘ π‘™π‘Žπ‘›π‘‘
• For 𝑋2 = 0 (i.e. Kowloon)
π‘Œπ‘– = 𝑏0 + 𝑏1 𝑋1𝑖 + 𝑏2 0 = 𝑏0 + 𝑏1 𝑋1𝑖
• For 𝑋2 = 1 (i.e. Hong Kong Island)
π‘Œπ‘– = 𝑏0 + 𝑏1 𝑋1𝑖 + 𝑏2 1 = (𝑏0 +𝑏2 ) + 𝑏1 𝑋1𝑖
Same
slopes
32
Special Regression Model –
Interaction-Variable Model
Cont’d
• Interaction variable is created by multiplying two or more
independent variables together
• Usually, one of those independent variables is categorical
• It allows regression models for different categories to have
different intercepts and different slopes
π‘Œπ‘– = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝛽3 𝑋1𝑖 𝑋2𝑖 + πœ€π‘–
• Without interaction term (i.e. 𝛽3 = 0), effect of 𝑋1 (or 𝑋2 ) on π‘Œ
is measured by 𝛽1 (or 𝛽2 )
• With interaction term(i.e. 𝛽3 ≠ 0), effect of 𝑋1 (or 𝑋2 ) on π‘Œ is
measured by (𝛽1 +𝛽3 𝑋2 ) [or (𝛽2 +𝛽3 𝑋1 )] π‘Œ
𝑏0 + 𝑏2
33
𝑏0
𝑋1
Application of Regression Analysis
– Centa-City Index
• Why Property Price Indices?
• Investors and potential home-buyers require indicators to study
the current movement of property prices in Hong Kong
• The creation of the “Centa-City Index” aims to provide such
information to the general public as a source of reference on
trends in Hong Kong’s property market
• How are the Index constructed?
• Multiple regression analysis is used to determine the effect of
various attributes on property price
• Attributes are considered such as floor area, years of occupancy,
location, direction, view, floor level, and so on.
• Who originally developed the “Centa-City Index”?
• Jointly developed by Centaline Property Agency Limited and
Department of Management Sciences in City University of Hong
Kong
34
Application of Regression Analysis
– Centa-City Index
35
Application of Regression Analysis
– Centa-City Index
36
Application of Regression Analysis
– Centa-City Index
37
Application of Regression Analysis
– Centa-City Index
• More information
• http://www.cb.cityu.edu.hk/ms/work/hkcci/
• http://hk.centadata.com/cci/cci_e.htm
38
Case Study:
Residential Property
Price Data Analysis
39
Case Background
• A new ocean side complex consists of two adjacent and
connected 8-floor buildings
• The complex contains 200 flats of equal size (where 96 flats
are in Building 1 and 104 flats are in Building 2)
• 96 full ocean-view flats are those facing the ocean (e.g. 101,
103, 105, 107, 109 in Building 1; 113, 115, 117, 119, 121, 123,
125 in Building 2). 40 full ocean-view flats in Building 1 also
have a good view of the pool (e.g. 101, 103, 105, 107, 109).
• 8 partial ocean-view flats (i.e. 111, 211, 311, 411, 511, 611,
711, 811 in Building 1) are those having part of their ocean
view blocked by Building 2
• 96 bay-view flats are those facing the parking lot (e.g. 102,
104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124).
• The only elevator is located near the office and the game
room, all at the east end of Building 1
40
Layout of the Complex
41
Case Background
Cont’d
• The complex was completed during an economic recession;
thus, sales were slow
• The developer furnished many unsold flats and rented them
out
• Eventually, the developer stopped all leases and sold out
remaining flats by auction
42
Study Objective
• To investigate the relationship between auction price and
• Height of a flat (expressed by the floor number)
• Distance of a flat from the elevator (expressed by the number of
flats)
• Presence or absence of full ocean view
• Presence or absence of partial ocean view
• Presence or absence of furniture in an unsold flat
43
The Data
• 106 auctioned flats (see the Excel file) are examined using a
multiple regression model as follows:
• Auction price (π‘Œ, called 𝐏𝐫𝐒𝐜𝐞)
• in US $’00
• Height of a flat (𝑋1 , called Floor)
• Takes the floor number: 1 or 2 or … or 8
• Distance of a flat from the elevator (𝑋2 , called Distance)
• The distance is expressed by the number of flats
• An additional distance of two flats is added to a flat in Building 2
• Takes the number of flats: 1 or 2 or … or 6 in Building 1;
9 or 10 or … or 15 in Building 2
44
The Data
Cont’d
• Full ocean view (𝑋3 , called π•π’πžπ°)
• The presence or absence of full ocean view
• A dummy variable is used
• 1 if the flat has full ocean view; 0 if not
• Partial ocean view (𝑋4 , called End)
• The presence or absence of partial ocean view
• A dummy variable is used
• 1 if the flat is 111 or 211 or 311 or 411 or 511 or 611 or 711 or 811;
0 if not
• Furniture (𝑋5 , called Furnish)
• The presence or absence of furniture
• A dummy variable is used
• 1 if the flat has furniture; 0 if not
45
The Data
𝒀
:
:
π‘ΏπŸ
Cont’d
π‘ΏπŸ
46
Dummy Variables
π‘ΏπŸ‘ , π‘ΏπŸ’ , π‘ΏπŸ“
Data Analysis
Correlation coefficient between Auction price and Floor height,
π‘Ÿ = −0.320
47
Data Analysis
Cont’d
Correlation coefficient between Auction price and Distance from elevator,
π‘Ÿ = −0.089
48
Data Analysis
Cont’d
• Effect of full ocean view
View of Ocean
Sample Size
Mean
St. Dev.
0
36
173.944
8.802
1
70
201.000
13.847
• Effect of partial ocean view
End Unit
Sample Size
Mean
St. Dev.
0
99
192.141
18.221
1
7
187.143
10.351
Furniture
Sample Size
Mean
St. Dev.
0
61
189.508
14.880
1
45
194.933
20.942
• Effect of furniture
49
Regression Analysis – Model 1
• All five plausible independent variables are considered
π‘Ÿ2
π‘Œ = 177.703 − 0.715𝑋1 − 0.873𝑋2 +31.273𝑋3
−17.808𝑋4 + 9.984𝑋5
50
Regression Analysis – Model 2
Cont’d
• Effect of Height & Full Ocean View interaction (𝑋1 𝑋3 ) on auction price is added
π‘Ÿ2
π‘Œ = 152.212 + 3.225𝑋1 − 0.828𝑋2 + 60.054𝑋3
−18.651𝑋4 + 11.481𝑋5 − 4.724 𝑋1 𝑋3
51
Regression Analysis – Model 3
Cont’d
• Effect of Height might not be linearly related to Auction price;
hence, interaction variables (𝑋1 2 and 𝑋1 2 𝑋3 ) are included
π‘Ÿ2
π‘Œ = 132.378 + 10.211𝑋1 − 0.820𝑋2 + 86.733𝑋3
−19.178𝑋4 + 11.633𝑋5 − 15.303 𝑋1 𝑋3
−0.585𝑋1 2 + 0.958𝑋1 2 𝑋3
52
Conclusions and
Recommendations
• Which model will you choose?
• How the results can help in prediction?
• What additional factor(s) will you consider?
53
Download