Index
1. Abstract
2. Introduction
3. Literature review
4. Study design
4.1 Data Collection Energy Performance Certificate
4.2 Legal and Ethical Issues
5. Methodology and Results
6. Discussion
7. Summary
8. References
9. Appendices
Estimate the skewness of a distribution
Central tendency
Variance, Standard deviation, Standard error and Confidence interval
z-score
Normality test for large data bases
1. Abstract
This report focuses on the relationship between Energy Rating and Building Emission Rate in the city of London, comparing the years 2011 and 2015 to assess whether the EPC data support the hypothesis proposed for the study. The research question of the study is: Is there a correlation between energy rating and building emission rate?
Keywords: Energy Rating, Building Emission rate, London
2. Introduction
The study is driven by the need to understand how the energy rating works and how it can be improved. It compares the EPC data of two years, 2011 and 2015, both separately and combined. Descriptive statistics are used to find the outliers, which are then removed in Excel; the Spearman correlation coefficient is used to examine the relationship between the variables; inferential analysis is used to examine the behaviour of the data and test its normality; finally, a linear regression is fitted to observe how well a linear equation describes the data. The data set is non-domestic EPC data provided by the University of Southampton so that the study could take place.
3. Literature review
Global energy consumption keeps growing every year, and increasing energy efficiency has become one of the primary concerns of developed countries. As a result, the Energy Performance Certificate (EPC) was created to indicate the energy efficiency of a building. The assessments are banded from A to G, where A (or A+ for non-domestic properties) is the most efficient in terms of likely fuel costs and carbon dioxide emissions. An EPC is required whenever a building is newly constructed, sold or let to a new tenant. The purpose of an EPC is to show prospective tenants or buyers the energy efficiency of the building. [20]
In the United Kingdom, BREEAM was created; it is similar to LEED in the United States and is a scheme that rates the energy efficiency of buildings. As the author B. Mattoni expresses in the article “Critical review and methodological approach to evaluate the differences among international green building rating tools”, the “Building Research Establishment Environmental Assessment Method for buildings (BREEAM) is the oldest protocol. The most recent version was developed in 2016. Initially, it was based on the construction phase of individual new buildings, now it covers the entire life cycle of buildings, starting from the design stage, to in-use and retrofitting. This protocol takes into account and analyzes 9 different macro-areas, reaching a maximum achievable score dependent on the building uses. Each macro-area is composed by various credits. Furthermore, an additional 1% up to 10% for each “innovation credit” can be added to the final BREEAM score. Starting from the final score achieved, it is possible to earn one of the following rating levels: Unclassified (< 30 points), Pass (≥ 30 points), Good (≥ 45 points), Very good (≥ 55 points), Excellent (≥ 70 points) and Outstanding (≥85 points).” (B. Mattoni et al., 2017; Renewable and Sustainable Energy Reviews 82 (2018) 950–960) [19]
4. Study design
For the study design the research question is: “What is the relationship between Building Emission Rate and Energy Rating in the city of London, and how does it behave over time?” and the hypothesis is: “As the Energy Rating tends to increase the Building Emission Rate tends to decrease”. With these two, the study focuses on the years 2011 and 2015, comparing non-domestic buildings in the London area in both years to see the difference between the two years and the correlation between the two variables.
The reason for choosing London as the city for the study is that it is the biggest city in the United Kingdom, with 7.5 million people in its urban area and over 14 million in the metropolitan area, and it is divided into 33 districts [16], with 3204 buildings.
The data used in the study are the following:
Of the variables in the data, the ones used in the study are Energy Rating and Building Emission Rate, in the hope of finding a strong relationship between the two.
4.1 Data Collection Energy Performance Certificate
In the case of non-domestic EPCs, the data are collected by an assistant working under the supervision of the company that the assistant represents, which enables the company to produce EPCs for larger and more complex buildings and portfolios of buildings. However, the company needs to be in a position to verify the data and to supervise how and by whom it is collected. Having considered the situation, the source concludes that the advice on the use of data gatherers for non-domestic properties should be clarified. [18]
4.2 Legal and Ethical Issues
Section 1 of the Data Protection Act 1998 defines personal data “as any data that can be used to identify a living individual” [17]. For this study the data are gathered using the addresses of the buildings; as these are non-domestic buildings, they can be listed under a company rather than a person. Furthermore, the only field used in the study was the post town, leaving the address aside to avoid breaching the Act. In addition, all persons who help with the use and collection of the information will remain anonymous, so that no names are exposed to the community.
The risk assessment shows that the data collection does not represent a risk for any member of the field team carrying out the collection, and the data are kept secure in accordance with data protection law.
5. Methodology and Results
The first step of the study is to analyse the data from London for the years 2011 and 2015 and to see the tendency of Energy Rating and Building Emission Rate in each year separately, in order to observe the behaviour of the variables in the two periods of time.
Figure 1. Plot and histogram of Energy Rating and Building Emission Rate in London for the years 2011 and 2015, produced with RStudio
In these plots it can be observed that there are outliers that keep the data from following a normal tendency, affecting the plots and the results that can be obtained; outliers are “observations that are considered to be unusually far from the bulk of the data” [1]. Using RStudio we can identify the outliers, which are then removed manually in Excel; after removing them the data have a more normal tendency, but are still not normally distributed.
As shown in Figure x, there is a change in the tendency of the variables; although the data without the outliers are not normally distributed, the cleaned data can be studied for the relationship between the variables Energy Rating and Building Emission Rate.
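The report does not state which rule was used to decide what counts as an outlier. As a minimal sketch, assuming the common 1.5 × IQR rule (similar to the whisker rule that `boxplot()` applies by default), outliers could be flagged as follows. The function name `flag_outliers` and the sample values are hypothetical and are not taken from the EPC data.

```r
# Hypothetical helper: flag values lying more than `coef` interquartile
# ranges below the first quartile or above the third quartile.
flag_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Made-up ratings for illustration only (not the EPC data)
ratings <- c(60, 72, 74, 80, 87, 91, 95, 102, 110, 400)

is_out <- flag_outliers(ratings)
ratings[is_out]             # values flagged as outliers
clean <- ratings[!is_out]   # data with the flagged values removed
```

Flagging in R and filtering with a logical index keeps the removal reproducible, which a manual deletion in a spreadsheet does not.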
Figure 2. Plot and histogram of Energy Rating and Building Emission Rate in London for the years 2011 and 2015 with no outliers, produced with RStudio
Figure 3. Plot and histogram of Energy Rating and Building Emission Rate in London for the years 2011-2015 with no outliers, produced with RStudio
Figure 3 shows the final data used in the study, which include both years without the outliers. The data do not follow a normal tendency, as can be seen from the skewness, a measure of how symmetric the data are [2]. For Energy Rating the skewness is 0.6488178, which tells us that the distribution is skewed to the right; for Building Emission Rate the skewness is 0.947514, also skewed to the right.
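To make the sign of the skewness concrete, the following is a minimal sketch of one common moment-based definition, the third central moment divided by the second central moment raised to the power 3/2. The function `moment_skewness` and the sample vectors are illustrative assumptions; the package function used in the appendix may normalise slightly differently.

```r
# Moment-based skewness: m3 / m2^(3/2), using central moments
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Made-up samples for illustration only
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 10)  # long tail to the right
symmetric    <- c(1, 2, 3, 4, 5)

moment_skewness(right_tailed)  # positive: skewed to the right
moment_skewness(symmetric)     # zero: symmetric around the mean
```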
For large data bases the Shapiro-Wilk test is not possible, because the implementation in R only works for sample sizes between 3 and 5000; for this reason a Kolmogorov-Smirnov test was applied to each variable to assess its normality [3]. For Energy Rating we obtain D = 0.072754 (the maximum distance between the empirical distribution function of the sample and the normal distribution function) and p-value < 2.2e-16. Since the p-value is very small, we can conclude that Energy Rating was not sampled from a normal distribution. The alternative hypothesis is two-sided; this option is used to “specify the null hypothesis that the true distribution function of x is equal to, not less than or not greater than the hypothesized distribution function” [4]. For Building Emission Rate, D = 0.097093 and p-value < 2.2e-16, also with a two-sided alternative hypothesis, so the same conclusion applies. The small p-values are consistent with the skewed distributions that can be seen in Figure x. Because the data contain ties, R warns that the p-values of this test are only approximate.
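The warnings printed in the appendix indicate that the test was run as `ks.test(var, "pnorm", mean(var), sd(var))`, that is, against a normal distribution fitted to the sample. A minimal sketch of the same call on simulated data (the sample sizes and distributions below are assumptions chosen for illustration) shows how D and the p-value behave for a normal and for a right-skewed sample.

```r
set.seed(1)

# Simulated data for illustration only (not the EPC data)
normal_sample <- rnorm(2000, mean = 90, sd = 30)
skewed_sample <- rexp(2000, rate = 1 / 90)   # strongly right-skewed

# One-sample Kolmogorov-Smirnov test against a normal distribution
# whose mean and sd are estimated from the sample itself
ks_normal <- ks.test(normal_sample, "pnorm",
                     mean(normal_sample), sd(normal_sample))
ks_skewed <- ks.test(skewed_sample, "pnorm",
                     mean(skewed_sample), sd(skewed_sample))

ks_normal$statistic; ks_normal$p.value   # small D
ks_skewed$statistic; ks_skewed$p.value   # larger D, very small p-value
```

Because the mean and standard deviation are estimated from the same data, the reported p-value is only approximate; this is a known limitation of using the test in this way.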
Figure 4 Scatter graph of Energy rating and Building Emission Rate
The scatter graph shows the relationship between the variables. The relationship is positive: as the Energy Rating increases, the Building Emission Rate also tends to increase. The correlation test used here is the Spearman method. The value of rho is 0.4674271, which is a moderate correlation between the variables; the small p-value in this test means that the correlation is statistically different from zero, not that it is strong. [5]
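As a minimal sketch of what the Spearman method computes, the example below runs `cor.test()` on simulated data and confirms that rho equals the Pearson correlation of the ranks. The variable names and the simulated relationship are assumptions for illustration, not the EPC data.

```r
set.seed(42)

# Simulated data for illustration only
energy_rating <- round(runif(500, 10, 200))
emission_rate <- 30 + 0.67 * energy_rating + rnorm(500, sd = 44)

# exact = FALSE avoids the warning about exact p-values with ties
res <- cor.test(energy_rating, emission_rate,
                method = "spearman", exact = FALSE)
res$estimate   # rho: strength and direction of the monotonic relationship
res$p.value    # evidence that rho differs from zero, not a measure of strength

# Spearman's rho is the Pearson correlation computed on the ranks
rho_by_rank <- cor(rank(energy_rating), rank(emission_rate))
```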
Figure 5 Scatter graph of Energy rating and Building Emission Rate coloured by Energy Rating DESC and separated by year
This scatter graph shows how the buildings are distributed in each of the two years. From it we can observe that in 2011 there were more buildings in the data, and also more buildings with a high emission rate and low energy efficiency. In 2015 there are a number of buildings with a low energy rating and a low building emission rate, which is an anomaly with respect to the hypothesis proposed in this study. Multiple factors could explain this, such as where the building is located, what it is used for, when it was built and when it was entered in the database used in the study. Furthermore, the graph for 2015 shows fewer “G” type buildings than the graph for 2011, which suggests that buildings are adopting better performance strategies.
The analysis continues with the correlation of the variables separated by “ENERGY RATING DESC”. The Spearman correlation coefficient varies with the Energy Rating DESC label: the “A” label has the lowest value, -0.006, followed by the “E” label, and these two labels also have the largest p-values, showing that the correlation between the variables is low within them. The only label with a notable correlation is “B” (0.396), but this is still lower than the correlation for the whole data base shown in the previous part of the report. This means that adding another variable such as “ENERGYRATING_DESC” to the study can change the results significantly.
Energy Rating DESC | Correlation Coefficient (rho) | P Value |
A | -0.006 | 5.31683e-01 |
B | 0.396 | 2.553700e-21 |
C | 0.087 | 3.700949e-10 |
D | 0.116 | 2.597915e-11 |
E | 0.062 | 9.756159e-02 |
F | 0.135 | 2.258656e-04 |
G | 0.135 | 2.401233e-03 |
Table 1 Correlation Coefficients
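A per-label table of this kind can be sketched by splitting the data by label and running the Spearman test within each group. The data below are simulated, and the band limits passed to `cut()` are an assumption made for the example; only the column names follow the appendix. The sketch also illustrates why within-label correlations tend to be lower than the overall one: each label covers only a narrow range of Energy Rating.

```r
set.seed(7)
n <- 700

# Simulated data for illustration only (not the EPC data)
epc <- data.frame(ENERGY_RATING = round(runif(n, 0, 175)))
epc$BUILDING_EMISSION_RATE <- 30 + 0.67 * epc$ENERGY_RATING + rnorm(n, sd = 44)

# Assumed band limits for the labels A to G
epc$ENERGYRATING_DESC <- cut(epc$ENERGY_RATING,
                             breaks = c(-Inf, 25, 50, 75, 100, 125, 150, Inf),
                             labels = c("A", "B", "C", "D", "E", "F", "G"))

# Spearman correlation and p-value for one group of rows
spearman_row <- function(g) {
  test <- cor.test(g$ENERGY_RATING, g$BUILDING_EMISSION_RATE,
                   method = "spearman", exact = FALSE)
  data.frame(label = as.character(g$ENERGYRATING_DESC[1]),
             n = nrow(g),
             rho = unname(test$estimate),
             p_value = test$p.value)
}

by_label <- do.call(rbind, lapply(split(epc, epc$ENERGYRATING_DESC), spearman_row))
overall_rho <- cor(epc$ENERGY_RATING, epc$BUILDING_EMISSION_RATE, method = "spearman")

by_label      # one row per label
overall_rho   # correlation over the full range of ratings
```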
Figure 6 is an example of a scatterplot matrix, where the variables are written along the diagonal from top left to bottom right and each variable is then plotted against every other. The middle square of the first column is an individual scatterplot of Energy Rating and Building Emission Rate, with Energy Rating on the X-axis and Building Emission Rate on the Y-axis. The same plot is replicated in the second square from the left, but with the axes inverted. For these variables a small correlation can be seen, as the points follow a roughly linear pattern. In the plots that involve the time variable, the points fall evenly into columns or rows, suggesting that the data were collected in a regimented way.
Figure 6 Scatter Graph Matrix between Energy Rating, Building Emission Rate and Year
The next graph in the analysis is a “correlogram”. This type of graph shows the correlation between the variables using histograms, plots and a colour scale. In the graph a strong relationship (dark green) can be seen between the variables used in this study, “Energy Rating” and “Building Emission Rate”; the third variable, “YEAR”, has no relation with the other two and is included only for information. The graph therefore indicates that a correlation exists between the two study variables. [7]
Figure 7 Scatter Graph between Energy Rating, Building Emission Rate and Year
The t-test is a statistical test whose basic idea is to make an inference from a sample data set; here it is used in connection with the hypothesis “As the Energy Rating tends to increase the Building Emission Rate tends to decrease”. In the t-test, the null hypothesis is that the means of the two samples are equal, so the alternative hypothesis is that the difference between the means is not equal to zero. The objective of a hypothesis test is to decide, at a chosen confidence level, whether or not the null hypothesis can be rejected. Since the test compares the two variables, the confidence interval in this case gives the range of plausible values for the difference between their means. The t-test also produces the p-value, which is the probability of obtaining a difference at least as large as the one observed if the null hypothesis were true; the smaller it is, the more confidently we can reject the null hypothesis. [8]
Welch Two Sample t-test
t = 1.7605 df = 22375 p-value = 0.07834
Alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.1065724 1.9869077
Sample estimates: mean of x = 92.57784 mean of y = 91.63768
Based on the result it can be inferred that, at the 95% confidence level, there is no significant difference (p-value = 0.07834) between the two means; the null hypothesis that the two means are equal cannot be rejected because the p-value is larger than 0.05. The difference between the means can be as low as -0.1065 and as high as 1.98. The output also gives the estimates of the sample means, the t statistic (1.7605) and the degrees of freedom of the t-distribution (22375). [8]
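The reported interval and p-value can be checked from the other numbers in the output. The sketch below uses only the two sample means, the t statistic and the degrees of freedom given above; the variable names are illustrative. It recovers the standard error of the difference as (difference in means) / t and reproduces an interval of roughly -0.107 to 1.987.

```r
# Values copied from the Welch two-sample t-test output above
mean_x <- 92.57784   # mean of Energy Rating
mean_y <- 91.63768   # mean of Building Emission Rate
t_stat <- 1.7605
df     <- 22375

diff_means <- mean_x - mean_y            # about 0.940
se_diff    <- diff_means / t_stat        # standard error of the difference
margin     <- qt(0.975, df) * se_diff    # half-width of the 95% interval

ci      <- c(diff_means - margin, diff_means + margin)
p_value <- 2 * pt(-abs(t_stat), df)      # two-sided p-value

ci
p_value
```

The interval contains zero, which is the same statement as the p-value being above 0.05.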
The normal quantile-quantile (Q-Q) plot for Energy Rating and Building Emission Rate helps us understand whether the data came from a theoretical distribution; it is created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, the points should form a roughly straight line. Quantiles are the values below which a given fraction of the data fall; a Q-Q plot takes the sample data in ascending order and plots them against quantiles calculated from a theoretical distribution, which are placed on the X-axis. The first part of the analysis is “QQNORM”, which takes a vector of data and plots it in sorted order against quantiles from a standard normal distribution; with this it can be observed how our data compare with normally distributed data [9]. The second part is the Q-Q line, which by default passes through the first and third quartiles [10]. The third part is “QQPLOT”, a function that produces a quantile-quantile plot for any distribution; its two arguments are the variables being used, giving the distribution of one variable relative to the other [9]. The graphs show that the normal Q-Q plots are heavy tailed, meaning that compared with the normal distribution there is more data located at the extremes of the distribution and less data in the centre; in terms of quantiles, the first quantile is much less than the first theoretical quantile and the last quantile is greater than the last theoretical quantile. [11]
Figure 8 Histogram, and Quantile-Quantile graphs for the Energy Rating and Building Emission Rate
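The heavy-tail reading of a normal Q-Q plot can be sketched numerically. The example below uses a simulated heavy-tailed sample (an assumption for illustration, not the EPC data), takes the plotting coordinates from `qqnorm()` without drawing, and builds the reference line through the first and third quartiles in the way `qqline()` does by default.

```r
set.seed(3)

# Simulated heavy-tailed sample for illustration only
heavy_tailed <- rt(1000, df = 3)

# Coordinates of the normal Q-Q plot, without drawing it
qq <- qqnorm(heavy_tailed, plot.it = FALSE)

# Reference line through the first and third quartiles
y_q <- quantile(heavy_tailed, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope     <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])

# Distance of the most extreme points from the reference line
upper_tail_gap <- max(qq$y) - (intercept + slope * max(qq$x))
lower_tail_gap <- min(qq$y) - (intercept + slope * min(qq$x))

upper_tail_gap   # positive: largest value lies above the line
lower_tail_gap   # negative: smallest value lies below the line
```

Calling `qqnorm(heavy_tailed)` followed by `qqline(heavy_tailed)` draws the same picture, with both ends bending away from the line.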
The linear regression model describes the relationship between two variables x and y and can be expressed by the equation y = α + βx + ξ, where y = Building Emission Rate and x = Energy Rating; the formula can be written as: Building Emission Rate = α + β * Energy Rating + ξ [12]. The results of the regression were:
Residuals | ||||
Min | 1Q | Median | 3Q | Max |
-131.8 | -33.44 | -11.89 | 25.16 | 200.29 |
The residuals are the differences between the observed values of the response variable and the values predicted by the linear regression. For most regressions the residuals should look like a normal distribution when plotted; if the residuals are normally distributed, this indicates that the mean of the difference between the predictions and the actual values is close to 0. [13]
Coefficients | |||||
 | Estimate | Std. Error | t value | Pr(>|t|) | Significance Stars |
Intercept | 29.99194 | 1.13886 | 26.34 | < 2e-16 | *** |
Energy Rating | 0.66588 | 0.01155 | 57.65 | < 2e-16 | *** |
The Significance Stars are the asterisks for significance levels, with the number of asterisks displayed according to the p-value computed: *** for high significance and * for low significance. In this case, *** indicates that it is unlikely that there is no relationship between Building Emission Rate and Energy Rating [13]. The estimated coefficient is the value of the slope calculated by the regression; the intercept can be thought of as a slope that is always multiplied by 1. These numbers depend on the magnitude of the variable that is input into the regression [13]. The standard error measures the variability in the estimate of the coefficient; a lower standard error is better, but this number is relative to the value of the coefficient [13]. The t-value is the score that measures whether or not the coefficient for this variable is meaningful for the model [13].
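The t values in the coefficient table follow directly from the other two columns, since each t value is the estimate divided by its standard error. The sketch below redoes that arithmetic with the numbers reported above and the 12645 residual degrees of freedom; the data frame layout and the example rating of 100 are illustrative assumptions.

```r
# Estimates and standard errors copied from the coefficient table
coefs <- data.frame(term      = c("Intercept", "Energy Rating"),
                    estimate  = c(29.99194, 0.66588),
                    std_error = c(1.13886, 0.01155))

# t value = estimate / standard error
coefs$t_value <- coefs$estimate / coefs$std_error

# Two-sided p-value from the t distribution with 12645 degrees of freedom
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = 12645)
coefs

# Fitted Building Emission Rate for a hypothetical Energy Rating of 100
predicted <- coefs$estimate[1] + coefs$estimate[2] * 100
predicted
```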
Residual Standard Error | 44.09 | on | 12645 | degrees of freedom |
The Residual Standard Error is the standard deviation of the residuals; this number should be proportional to the quantiles of the residuals. The degrees of freedom are the difference between the number of observations in the sample and the number of coefficients estimated in the model (12647 - 2 = 12645) [13].
Multiple R-squared | 0.2081 | Adjusted R-squared | 0.2081 |
R-squared is the metric for evaluating the goodness of fit of the model; the closer it is to one, the better the fit. With a large data set and an R-squared of 0.2081, the model explains about 21% of the variance in Building Emission Rate, so the observations are widely scattered around the linear model.
F-statistic | 3324 | on | 1 | and | 12645 |
p-value | < 2.2e-16 |
The F-test takes the parameters of our model and compares it with a model that has fewer parameters. If the model with more parameters does not perform better than the model with fewer parameters, the F-test will have a high p-value; if the model with more parameters is better, the p-value will be low, as is the case for the p-value shown in the table. [13]
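For a regression with a single predictor, R-squared and the F statistic are both determined by the Pearson correlation and the sample size. As a check, the sketch below starts from the Pearson coefficient reported in the appendix (0.4562315) and the 12647 observations; the variable names are illustrative.

```r
r <- 0.4562315   # Pearson correlation reported in the appendix
n <- 12647       # number of observations

r_squared <- r^2                     # about 0.2081
df_resid  <- n - 2                   # residual degrees of freedom: 12645

# F statistic of a simple linear regression, and the slope's t value
f_stat  <- r_squared / (1 - r_squared) * df_resid
t_slope <- sqrt(f_stat)              # F = t^2 when there is one predictor

# Adjusted R-squared penalises the number of predictors (here 1)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

c(r_squared = r_squared, adj_r_squared = adj_r_squared,
  f_stat = f_stat, t_slope = t_slope)
```

The results agree with the R-squared of 0.2081, the F statistic of 3324 and the t value of 57.65 reported for Energy Rating.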
The graph obtained from the linear regression shows the fitted line and how the observations are distributed around it; looking at the graph, it can be understood why the R-squared, 0.2081, is far from 1.
Figure 9 Linear Regression Graph
Durbin Watson Test
This test measures the autocorrelation, or serial correlation, in the residuals from a regression analysis. The results obtained from the test were: lag: 1, autocorrelation: 0.9568517, D-W statistic: 0.08604671, p-value: 0. With a p-value of 0 the null hypothesis of no autocorrelation can be rejected, and with a D-W statistic close to 0 (a value near 2 would indicate no autocorrelation) it can be said that the residuals show strong positive autocorrelation. [14]
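The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and it is approximately 2 × (1 − lag-1 autocorrelation). The sketch below computes it in base R on simulated series (assumptions for illustration). It also applies the approximation to the reported numbers, taking 0.9568517 as the lag-1 autocorrelation and 0.08604671 as the D-W statistic, which is an assumption about how the output columns line up.

```r
# Durbin-Watson statistic for a vector of residuals e
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(11)

# Simulated residuals for illustration only
independent    <- rnorm(2000)
autocorrelated <- as.numeric(stats::filter(rnorm(2000), filter = 0.95,
                                           method = "recursive"))  # AR(1)

dw_ind  <- durbin_watson(independent)      # close to 2: no autocorrelation
dw_auto <- durbin_watson(autocorrelated)   # close to 0: strong positive autocorrelation

# Approximation DW ~ 2 * (1 - r1), with r1 the lag-1 autocorrelation
lag1      <- cor(autocorrelated[-1], autocorrelated[-length(autocorrelated)])
dw_approx <- 2 * (1 - lag1)

# Same approximation applied to the reported autocorrelation
dw_reported <- 2 * (1 - 0.9568517)         # about 0.086
```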
Figure 10 Spread Level Plot
This type of plot is used to examine the possible dependence of spread on level. The increasing trend in this plot, which says that the absolute residuals get larger as the fitted values do, would indicate a spread that is related to the mean. [15]
6. Discussion
After reviewing the methodology and the results explained above, the conclusion is that the hypothesis “As the energy rating tends to increase the Building emission rate tends to decrease” for the city of London in the years 2011 and 2015 is accepted, although some results, as can be seen in Table 1, do not support a strong correlation between the variables. In the summary of the results presented below, it can be seen that the p-values are small; this means that the relationship between the variables is statistically significant.
Regarding the difference between the years, Figure 5 shows the relationship between the variables and how in 2015 there are fewer buildings in the lower categories, while in 2011 the G category has more buildings, implying that from 2011 to 2015 buildings with better Energy Ratings have been constructed. In conclusion, the hypothesis can be said to be true, but certain factors are not taken into consideration, such as whether the buildings are the same in both years and the number of buildings taken from the data in each year; to make the study more reliable, a narrower data base would be needed.
 | Dependent Variable: Building Emission Rate |
Energy Rating | 0.666*** (0.643, 0.689) |
Constant | 29.992*** (27.760, 32.224) |
Observations | 12,647 |
R2 | 0.208 |
Adjusted R2 | 0.208 |
Residual Standard Error | 44.088 (df = 12645) |
F Statistic | 3,323.877*** (df = 1; 12645) |
Note: | * p<0.1; ** p<0.05; *** p<0.01 |
Table 2. Summary of the results of the data
7. Summary
The data analysed in the study are numerical. The data set also includes some categorical data (Energy Rating DESC), which was not part of the main analysis and was used only to show how the energy rating label can make the correlation coefficient of the study lower than expected.
The main limitation of the study is that not all the variables that can have a positive or negative effect on the results are taken into consideration, such as whether the same building appears twice in the data or whether the buildings from 2011 have been remodelled. In addition, not being able to see the buildings and take readings in the field, and working only from the data base, gives a limited understanding of what is happening; if there were a way to take the readings for the study and observe all the variables, the study would be more accurate. In future investigations these variables will be taken into consideration, making the study more reliable and showing how adding new variables changes the correlation between energy rating and building emission rate.
8. References
[1] Walpole, R., Myers, R., Myers, S. and Ye, K. (2017). Probability & Statistics for Engineers & Scientists. 9th ed. London: Prentice Hall, pp.1-812.
[2] Zaiontz, C. (2017). Symmetry, Skewness and Kurtosis | Real Statistics Using Excel. [Online] Real-statistics.com. Available at: http://www.real-statistics.com/descriptive-statistics/symmetry-skewness-kurtosis/ [Accessed 11 Dec. 2017].
[3] Itl.nist.gov. (2017). 1.3.5.16. Kolmogorov-Smirnov Goodness-of-Fit Test. [online] Available at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm [Accessed 11 Dec. 2017].
[4] R documentation for ks.test (accessed via RStudio Help), which notes that the two-sided one-sample distribution comes via Marsaglia, Tsang and Wang (2003). [Accessed 11 Dec. 2017].
[5] Support.minitab.com. (2017). A comparison of the Pearson and Spearman correlation methods – Minitab Express. [online] Available at: http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/a-comparison-of-the-pearson-and-spearman-correlation-methods/ [Accessed 11 Dec. 2017].
[6] Stat.ethz.ch. (2017). R: Scatterplot Matrices. [Online] Available at: https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pairs.html [Accessed 11 Dec. 2017].
[7] Support.minitab.com. (2017). Autocorrelation function (ACF) – Minitab. [online] Available at: https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/time-series/how-to/autocorrelation/interpret-the-results/autocorrelation-function-acf/ [Accessed 12 Dec. 2017].
[8] Scribbling …. (2017). Understanding t.test() in R. [online] Available at: https://suinotes.wordpress.com/2009/11/30/understanding-t-test-in-r/ [Accessed 12 Dec. 2017].
[9] Data.library.virginia.edu. (2017). Understanding Q-Q Plots | University of Virginia Library Research Data Services + Sciences. [online] Available at: http://data.library.virginia.edu/understanding-q-q-plots/ [Accessed 12 Dec. 2017].
[10] Astrostatistics.psu.edu. (2017). R: Quantile-Quantile Plots. [online] Available at: http://astrostatistics.psu.edu/su07/R/html/stats/html/qqnorm.html [Accessed 12 Dec. 2017].
[11] Seankross.com. (2017). A Q-Q Plot Dissection Kit. [online] Available at: http://seankross.com/2016/02/29/A-Q-Q-Plot-Dissection-Kit.html [Accessed 12 Dec. 2017].
[12] R-tutor.com. (2017). Simple Linear Regression | R Tutorial. [online] Available at: http://www.r-tutor.com/elementary-statistics/simple-linear-regression [Accessed 12 Dec. 2017].
[13] Blog.yhat.com. (2017). [online] Available at: http://blog.yhat.com/posts/r-lm-summary.html [Accessed 12 Dec. 2017].
[14] Stats.stackexchange.com. (2017). Interpretation of a Durbin-Watson test?. [online] Available at: https://stats.stackexchange.com/questions/59757/interpretation-of-a-durbin-watson-test [Accessed 12 Dec. 2017].
[15] Mgimond.github.io. (2017). Spread-level plots. [online] Available at: http://mgimond.github.io/ES218/Week07b.html [Accessed 12 Dec. 2017].
[16] Londoncitybreak.com. (2017). About London – Currency, travel advice and more. [online] Available at: https://www.londoncitybreak.com/general-information [Accessed 12 Dec. 2017].
[17] Gov.uk. (2017). Data protection – GOV.UK. [online] Available at: https://www.gov.uk/data-protection [Accessed 12 Dec. 2017].
[18] Gov.uk. (2017). [online] Available at: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/335766/non_domestic_minimum_building_energy_performance_standards_working_group.pdf [Accessed 12 Dec. 2017].
[19] Mattoni, B. et al. (2017). Critical review and methodological approach to evaluate the differences among international green building rating tools. Renewable and Sustainable Energy Reviews, 82 (2018), pp.950-960.
[20] Gov.uk. (2017). [online] Available at: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/634057/EPB_Statistics_Release_-_Q2_2017__final.pdf [Accessed 13 Dec. 2017].
9. Appendices
EPC LONDON COMPARISON
Setup
startTime <- Sys.time()
knitr::opts_chunk$set(echo = TRUE, eval=TRUE)
library(ggplot2)
## Warning: package ‘ggplot2’ was built under R version 3.4.3
library(data.table)
## Warning: package ‘data.table’ was built under R version 3.4.3
library(e1071)
## Warning: package ‘e1071’ was built under R version 3.4.3
library(modeest)
## Warning: package ‘modeest’ was built under R version 3.4.3
##
## This is package ‘modeest’ written by P. PONCET.
## For a complete list of functions, use ‘library(help = “modeest”)’ or ‘help.start()’.
##
## Attaching package: ‘modeest’
## The following object is masked from ‘package:e1071’:
##
## skewness
library(MASS)
library(rmarkdown)
## Warning: package ‘rmarkdown’ was built under R version 3.4.3
library(reshape2)
##
## Attaching package: ‘reshape2’
## The following objects are masked from ‘package:data.table’:
##
## dcast, melt
library(car)
## Warning: package ‘car’ was built under R version 3.4.3
library(matrixStats)
## Warning: package ‘matrixStats’ was built under R version 3.4.3
library(readr)
## Warning: package ‘readr’ was built under R version 3.4.3
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(car)
library(knitr)
## Warning: package ‘knitr’ was built under R version 3.4.3
library(htmltools)
## Warning: package ‘htmltools’ was built under R version 3.4.3
library(pgirmess)
## Warning: package ‘pgirmess’ was built under R version 3.4.3
EPCn3 <- fread("EPC_LONDON_COMPARISON.csv")
View(EPCn3)
head( EPCn3 )
tail( EPCn3 )
summary( EPCn3 )
par(mfrow = c(1,2))
plot(EPCn3$ENERGY_RATING, type="p", ylim=c(9,198), col=4, xlab="", ylab="ENERGY_RATING")
hist(EPCn3$ENERGY_RATING)
boxplot(EPCn3$ENERGY_RATING)
par(mfrow = c(1,2))
plot(EPCn3$BUILDING_EMISSION_RATE, type="p", ylim=c(0,262), col=4, xlab="", ylab="BUILDING_EMISSION_RATE")
hist(EPCn3$BUILDING_EMISSION_RATE)
boxplot(EPCn3$BUILDING_EMISSION_RATE)
Estimate the skewness of a distribution
skewness(EPCn3$ENERGY_RATING)
## [1] 0.6488178
## attr(,"method")
## [1] "moment"
kurtosis(EPCn3$ENERGY_RATING)
## [1] 0.2160727
skewness(EPCn3$BUILDING_EMISSION_RATE)
## [1] 0.947514
## attr(,"method")
## [1] "moment"
kurtosis(EPCn3$BUILDING_EMISSION_RATE)
## [1] 0.4436743
Central tendency
mean <- mean(EPCn3$ENERGY_RATING , na.rm = TRUE)
mean
## [1] 92.57784
median <- median( EPCn3$ENERGY_RATING , na.rm = TRUE)
median
## [1] 87
mode <- mlv(round( EPCn3$ENERGY_RATING ,0), method = "mfv", na.rm = TRUE)
mode
## Mode (most likely value): 74
## Bickel’s modal skewness: 0.3617459
## Call: mlv.default(x = round(EPCn3$ENERGY_RATING, 0), method = “mfv”, na.rm = TRUE)
mean <- mean(EPCn3$BUILDING_EMISSION_RATE , na.rm = TRUE)
mean
## [1] 91.63768
median <- median( EPCn3$BUILDING_EMISSION_RATE , na.rm = TRUE)
median
## [1] 80.17
mode <- mlv(round( EPCn3$BUILDING_EMISSION_RATE ,0), method = "mfv", na.rm = TRUE)
mode
## Mode (most likely value): 59
## Bickel’s modal skewness: 0.4199415
## Call: mlv.default(x = round(EPCn3$BUILDING_EMISSION_RATE, 0), method = “mfv”, na.rm = TRUE)
Variance, Standard deviation, Standard error and Confidence interval
samplesize <- length(EPCn3$ENERGY_RATING )
samplesize
## [1] 12647
variance <- var( EPCn3$ENERGY_RATING , na.rm = TRUE)
variance
## [1] 1152.257
standarddeviation <- sd(EPCn3$ENERGY_RATING , na.rm = TRUE)
standarddeviation
## [1] 33.94491
standarderror <- standarddeviation/sqrt(samplesize)
standarderror
## [1] 0.3018428
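# Note: qt(.618, ...) gives a much narrower interval than 95%;
# a 95% interval would use qt(.975, df = samplesize - 1)
marginoferror <- qt(.618, df=samplesize-1)*standarderror
marginoferror
## [1] 0.0906249
# Note: `mean` was last assigned from BUILDING_EMISSION_RATE above, so this
# interval is centred on 91.63768 rather than on the ENERGY_RATING mean (92.57784)
confidenceinterval <- c(mean-marginoferror, mean+marginoferror)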
confidenceinterval
## [1] 91.54705 91.72830
samplesize <- length(EPCn3$BUILDING_EMISSION_RATE )
samplesize
## [1] 12647
variance <- var( EPCn3$BUILDING_EMISSION_RATE , na.rm = TRUE)
variance
## [1] 2454.543
standarddeviation <- sd(EPCn3$BUILDING_EMISSION_RATE , na.rm = TRUE)
standarddeviation
## [1] 49.54334
standarderror <- standarddeviation/sqrt(samplesize)
standarderror
## [1] 0.4405463
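# Note: qt(.618, ...) gives a much narrower interval than 95%;
# a 95% interval would use qt(.975, df = samplesize - 1)
marginoferror <- qt(.618, df=samplesize-1)*standarderror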
marginoferror
## [1] 0.1322691
confidenceinterval <- c(mean-marginoferror, mean+marginoferror)
confidenceinterval
## [1] 91.50541 91.76995
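For comparison, the following is a minimal sketch of a conventional 95% confidence interval for the mean of ENERGY_RATING, computed from the summary values printed above (mean 92.57784, standard deviation 33.94491, n = 12647). The variable names are illustrative, and the 0.975 quantile is an assumption that a two-sided 95% interval was intended.

```r
# Summary values copied from the output above
n       <- 12647
mean_er <- 92.57784
sd_er   <- 33.94491

se     <- sd_er / sqrt(n)               # standard error of the mean
margin <- qt(0.975, df = n - 1) * se    # two-sided 95% margin of error

ci_95 <- c(mean_er - margin, mean_er + margin)
ci_95
```

Using separate names such as `mean_er` also avoids overwriting `mean` between the two variables.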
z-score
x <- 15 # enter the value of the recording
z <- (x-mean)/standarddeviation
z
## [1] -1.546881
Normality test for large data bases
ks.gof(EPCn3$ENERGY_RATING)
## Warning in ks.test(var, “pnorm”, mean(var), sd(var)): ties should not be
## present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: var
## D = 0.072754, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.gof(EPCn3$BUILDING_EMISSION_RATE)
## Warning in ks.test(var, “pnorm”, mean(var), sd(var)): ties should not be
## present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: var
## D = 0.097093, p-value < 2.2e-16
## alternative hypothesis: two-sided
plot(EPCn3$ENERGY_RATING, EPCn3$BUILDING_EMISSION_RATE)
myPlot <- ggplot(EPCn3) +
  geom_point(aes(x = ENERGY_RATING, y = BUILDING_EMISSION_RATE)) +
  labs(
    title = "Scattergraph of Building Emission Rate against Energy Rating",
    x = "Energy Rating",
    y = "Building Emission Rate"
  )
myPlot
Statistical Analysis
Now calculate the correlation coefficient:
cor(EPCn3$ENERGY_RATING, EPCn3$BUILDING_EMISSION_RATE, method="pearson")
## [1] 0.4562315
cor.test(EPCn3$ENERGY_RATING, EPCn3$BUILDING_EMISSION_RATE, method="pearson")
##
## Pearson’s product-moment correlation
##
## data: EPCn3$ENERGY_RATING and EPCn3$BUILDING_EMISSION_RATE
## t = 57.653, df = 12645, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4423200 0.4699235
## sample estimates:
## cor
## 0.4562315
cor.test(EPCn3$ENERGY_RATING, EPCn3$BUILDING_EMISSION_RATE, method="spearman")
## Warning in cor.test.default(EPCn3$ENERGY_RATING,
## EPCn3$BUILDING_EMISSION_RATE, : Cannot compute exact p-value with ties
##
## Spearman’s rank correlation rho
##
## data: EPCn3$ENERGY_RATING and EPCn3$BUILDING_EMISSION_RATE
## S = 1.7955e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4674271
myPlot <- ggplot(EPCn3) +
  geom_point(aes(x = ENERGY_RATING, y = BUILDING_EMISSION_RATE, colour = ENERGYRATING_DESC)) +
  labs(
    title = "Scattergraph of BUILDING EMISSION RATE against ENERGY RATING (coloured by ENERGY RATING DESC)",
    x = "ENERGY RATING",
    y = "BUILDING EMISSION RATE"
  )
# draw it
myPlot
myPlot <- ggplot(EPCn3) +
  geom_point(aes(x = ENERGY_RATING, y = BUILDING_EMISSION_RATE, colour = YEAR)) +
  labs(
    title = "Scattergraph of BUILDING EMISSION RATE against ENERGY RATING (coloured and faceted by YEAR)",
    x = "ENERGY RATING",
    y = "BUILDING EMISSION RATE"
  ) + facet_wrap(~YEAR)
# draw it
myPlot
myPlot <- ggplot(EPCn3) +
  geom_boxplot(aes(x = ENERGY_RATING, y = BUILDING_EMISSION_RATE, colour = ENERGYRATING_DESC)) +
  labs(
    title = "Boxplot of BUILDING EMISSION RATE against ENERGY RATING (coloured by ENERGY RATING DESC)",
    x = "ENERGY RATING",
    y = "BUILDING EMISSION RATE"
  )
# draw it
myPlot
myPlot <- ggplot(EPCn3) +
  geom_point(aes(x = ENERGY_RATING, y = BUILDING_EMISSION_RATE, colour = ENERGYRATING_DESC)) +
  labs(
    title = "Scattergraph of BUILDING EMISSION RATE against ENERGY RATING (faceted by ENERGY RATING DESC)",
    x = "ENERGY RATING",
    y = "BUILDING EMISSION RATE"
  ) + facet_wrap(~ENERGYRATING_DESC)
myPlot
EPCn3[, .(
  "Correlation Coef (r)" = round(cor(ENERGY_RATING, BUILDING_EMISSION_RATE, method = "spearman"), 3)
),
by = ENERGYRATING_DESC
]
## ENERGYRATING_DESC Correlation Coef (r)
## 1: A -0.006
## 2: B 0.396
## 3: C 0.087
## 4: D 0.116
## 5: E 0.062
## 6: F 0.135
## 7: G 0.133
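# note: cor.test() is called with its default method (Pearson), so the p values below
# belong to Pearson's r rather than to the Spearman coefficients shown beside them
EPCn3[, .(
  "Correlation Coef (r)" = round(cor(ENERGY_RATING, BUILDING_EMISSION_RATE, method = "spearman"), 3),
  "p value" = cor.test(ENERGY_RATING, BUILDING_EMISSION_RATE)[[3]]
),
by = ENERGYRATING_DESC
]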
## ENERGYRATING_DESC Correlation Coef (r) p value
## 1: A -0.006 5.316383e-01
## 2: B 0.396 2.553700e-21
## 3: C 0.087 3.700949e-10
## 4: D 0.116 2.597915e-11
## 5: E 0.062 9.756159e-02
## 6: F 0.135 2.258656e-04
## 7: G 0.133 2.401233e-03
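In the table above the coefficient is Spearman's rho while the p value comes from `cor.test()` with its default Pearson method. One way to keep the two consistent is to take both from the same `cor.test()` call. The sketch below does this per group in base R on simulated data; the data frame `demo`, its column names, and the three bands are assumptions for illustration and are not EPC records.

```r
# Simulated data standing in for rating, emission rate and band (illustration only)
set.seed(42)
demo <- data.frame(
  band   = rep(c("A", "B", "C"), each = 200),
  rating = runif(600, min = 0, max = 150)
)
demo$emission <- 30 + 0.6 * demo$rating + rnorm(600, sd = 40)

# For each band, take rho and its p value from one Spearman test
by_band <- do.call(rbind, lapply(split(demo, demo$band), function(d) {
  ct <- cor.test(d$rating, d$emission, method = "spearman", exact = FALSE)
  data.frame(
    band    = d$band[1],
    rho     = round(unname(ct$estimate), 3),
    p_value = ct$p.value
  )
}))
by_band
```

Setting `exact = FALSE` requests the asymptotic p value directly, which is what R falls back to when ties are present.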
myPlot <- ggplot(EPCn3) +
  geom_histogram(aes(x = ENERGY_RATING, colour = ENERGYRATING_DESC)) +
  labs(
    title = "Histogram of ENERGY RATING by ENERGYRATING DESC",
    x = "ENERGY RATING"
  ) +
  facet_wrap(~ENERGYRATING_DESC)
myPlot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EPCn3 <- EPCn3[,.(ENERGY_RATING,BUILDING_EMISSION_RATE)]
pairs(EPCn3)
library(corrgram)
## Warning: package ‘corrgram’ was built under R version 3.4.3
corrgram(EPCn3,
         lower.panel = panel.shade,
         upper.panel = panel.pts,
         diag.panel = panel.density,
         col.regions = colorRampPalette(c("darkgoldenrod4",
                                          "burlywood1",
                                          "darkkhaki",
                                          "darkgreen")
         )
)
t.test(EPCn3$ENERGY_RATING, EPCn3$BUILDING_EMISSION_RATE, paired=FALSE)
##
## Welch Two Sample t-test
##
## data: EPCn3$ENERGY_RATING and EPCn3$BUILDING_EMISSION_RATE
## t = 1.7605, df = 22375, p-value = 0.07834
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1065724 1.9869077
## sample estimates:
## mean of x mean of y
## 92.57784 91.63768
hist(EPCn3$ENERGY_RATING)
qqnorm(EPCn3$ENERGY_RATING)
qqline(EPCn3$ENERGY_RATING)
hist(EPCn3$ENERGY_RATING)
hist(EPCn3$BUILDING_EMISSION_RATE)
pairs(~ENERGY_RATING+BUILDING_EMISSION_RATE, upper.panel = NULL, labels=c("Energy Rating","BUILDING EMISSION RATE"), data = EPCn3, main= "Simple Scatterplot Matrix")
EPCn3Model1 <- lm(BUILDING_EMISSION_RATE ~ ENERGY_RATING, EPCn3)
# basic results
summary(EPCn3Model1)
##
## Call:
## lm(formula = BUILDING_EMISSION_RATE ~ ENERGY_RATING, data = EPCn3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131.80 -33.44 -11.89 25.16 200.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.99194 1.13886 26.34 <2e-16 ***
## ENERGY_RATING 0.66588 0.01155 57.65 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 44.09 on 12645 degrees of freedom
## Multiple R-squared: 0.2081, Adjusted R-squared: 0.2081
## F-statistic: 3324 on 1 and 12645 DF, p-value: < 2.2e-16
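The fitted line can be read as BUILDING_EMISSION_RATE = 29.99194 + 0.66588 × ENERGY_RATING. As a small worked illustration, the sketch below evaluates that line for a few energy ratings using the coefficients printed above; `predict_ber` is a hypothetical helper name, not a function from the original script.

```r
# Coefficients taken from the summary output above
intercept <- 29.99194
slope     <- 0.66588

# Hypothetical helper: fitted emission rate for a given energy rating
predict_ber <- function(energy_rating) {
  intercept + slope * energy_rating
}

# Fitted values for three example ratings
predict_ber(c(50, 100, 150))
```

With a residual standard error of 44.09 and an R-squared of about 0.21, individual buildings can lie far from these fitted values.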
str(summary(EPCn3Model1))
## List of 11
## $ call : language lm(formula = BUILDING_EMISSION_RATE ~ ENERGY_RATING, data = EPCn3)
## $ terms :Classes ‘terms’, ‘formula’ language BUILDING_EMISSION_RATE ~ ENERGY_RATING
## .. ..- attr(*, “variables”)= language list(BUILDING_EMISSION_RATE, ENERGY_RATING)
## .. ..- attr(*, “factors”)= int [1:2, 1] 0 1
## .. .. ..- attr(*, “dimnames”)=List of 2
## .. .. .. ..$ : chr [1:2] “BUILDING_EMISSION_RATE” “ENERGY_RATING”
## .. .. .. ..$ : chr “ENERGY_RATING”
## .. ..- attr(*, “term.labels”)= chr “ENERGY_RATING”
## .. ..- attr(*, “order”)= int 1
## .. ..- attr(*, “intercept”)= int 1
## .. ..- attr(*, “response”)= int 1
## .. ..- attr(*, “.Environment”)=<environment: R_GlobalEnv>
## .. ..- attr(*, “predvars”)= language list(BUILDING_EMISSION_RATE, ENERGY_RATING)
## .. ..- attr(*, “dataClasses”)= Named chr [1:2] “numeric” “numeric”
## .. .. ..- attr(*, “names”)= chr [1:2] “BUILDING_EMISSION_RATE” “ENERGY_RATING”
## $ residuals : Named num [1:12647] -26.1 -25.8 -34.6 -35.4 -13 …
## ..- attr(*, “names”)= chr [1:12647] “1” “2” “3” “4” …
## $ coefficients : num [1:2, 1:4] 29.9919 0.6659 1.1389 0.0115 26.3351 …
## ..- attr(*, “dimnames”)=List of 2
## .. ..$ : chr [1:2] “(Intercept)” “ENERGY_RATING”
## .. ..$ : chr [1:4] “Estimate” “Std. Error” “t value” “Pr(>|t|)”
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, “names”)= chr [1:2] “(Intercept)” “ENERGY_RATING”
## $ sigma : num 44.1
## $ df : int [1:3] 2 12645 2
## $ r.squared : num 0.208
## $ adj.r.squared: num 0.208
## $ fstatistic : Named num [1:3] 3324 1 12645
## ..- attr(*, “names”)= chr [1:3] “value” “numdf” “dendf”
## $ cov.unscaled : num [1:2, 1:2] 6.67e-04 -6.35e-06 -6.35e-06 6.86e-08
## ..- attr(*, “dimnames”)=List of 2
## .. ..$ : chr [1:2] “(Intercept)” “ENERGY_RATING”
## .. ..$ : chr [1:2] “(Intercept)” “ENERGY_RATING”
##  - attr(*, "class")= chr "summary.lm"
# base graphics visualisation of the model (scatterplot with the fitted line)
plot(EPCn3$ENERGY_RATING,EPCn3$BUILDING_EMISSION_RATE)
abline(EPCn3Model1, lwd=2)
hist(EPCn3$ENERGY_RATING)
qqnorm(EPCn3$ENERGY_RATING)
qqline(EPCn3$ENERGY_RATING, col = 2)
qqPlot(EPCn3Model1)
Rsquared <- summary(EPCn3Model1)$r.squared
AdjRsquared <- summary(EPCn3Model1)$adj.r.squared
Rsquared
## [1] 0.2081472
AdjRsquared
## [1] 0.2080846
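In a regression with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, which gives a quick consistency check between the correlation and regression outputs. A minimal sketch using the figures reported in this appendix (the object names are illustrative):

```r
# Pearson correlation reported by cor() above
r_pearson <- 0.4562315

# R-squared reported by summary(EPCn3Model1)
r_squared_reported <- 0.2081472

# Squared correlation, for comparison with the reported R-squared
r_pearson^2

# TRUE when the two agree up to rounding of the printed values
agrees <- isTRUE(all.equal(r_pearson^2, r_squared_reported, tolerance = 1e-6))
agrees
```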
durbinWatsonTest(EPCn3Model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.9568517 0.08604671 0
## Alternative hypothesis: rho != 0
ncvTest(EPCn3Model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 409.3046 Df = 1 p = 5.194014e-91
spreadLevelPlot(EPCn3Model1)
##
## Suggested power transformation: 0.2908974
sgmod <- stargazer(EPCn3Model1,
                   ci = TRUE,
                   single.row = TRUE,
                   type = "html")
HTML(sgmod)
                       Dependent variable:
                       BUILDING_EMISSION_RATE
ENERGY_RATING          0.666*** (0.643, 0.689)
Constant               29.992*** (27.760, 32.224)
Observations           12,647
R2                     0.208
Adjusted R2            0.208
Residual Std. Error    44.088 (df = 12645)
F Statistic            3,323.877*** (df = 1; 12645)
Note:                  *p<0.1; **p<0.05; ***p<0.01
Code last run at: 2017-12-11 18:37:17
Results saved to: filestore.soton.ac.uk/users/jdca1n17/mydocuments/DATA ANALISIS FINAL/Comparison/
Analysis completed in: 43.31 seconds using knitr in RStudio.