Supervised classification with interdependent variables to support targeted energy efficiency measures in the residential sector

Sodenkamp, Mariya; Kozlovskiy, Ilya; Staake, Thorsten

doi:10.1186/s40165-015-0018-2

Research
Open access
Published: 27 January 2016

Mariya Sodenkamp¹, Ilya Kozlovskiy¹ & Thorsten Staake¹,²

Decision Analytics volume 3, Article number: 1 (2016)

Abstract

This paper presents a supervised classification model in which the indicators of correlation between dependent and independent variables within each class are used to transform large-scale input data to a lower dimension without loss of recognition-relevant information. In the case study, we use the consumption data recorded by smart electricity meters of 4200 Irish dwellings along with half-hourly outdoor temperature to derive 12 household properties (such as type of heating, floor area, age of house, number of inhabitants, etc.). Survey data containing characteristics of 3500 households enables algorithm training. The results show that the presented model outperforms ordinary classifiers with regard to accuracy and temporal characteristics. The model allows incorporating any kind of data affecting energy consumption time series or, in the more general case, data affecting the class-dependent variable, while minimizing the risk of the curse of dimensionality. The gained information on household characteristics makes targeted energy-efficiency measures by utility companies and public bodies possible.

Background

Reducing energy consumption is the best sustainable long-term answer to the challenges associated with increasing demand for energy, fluctuating oil prices, uncertain energy supplies, and fears of global warming (European commission 2008; Wenig et al. 2015). Since the household sector represents around 30 % of the final global energy consumption (International Energy Agency 2014), customized energy efficiency measures can contribute significantly to the reduction of air pollution and carbon emissions, and to economic growth. There is a wide range of potential courses of action that can encourage energy efficiency in dwellings, including flexible pricing schemes, load shifting, and direct feedback mechanisms (Sodenkamp et al. 2015). The major challenge, however, is to choose appropriate energy efficiency measures when household profiles are unknown.

In recent years, several attempts have been made to recognize energy-consumption-related dwelling characteristics. In particular, unsupervised learning techniques group households with similar consumption patterns into clusters (Figueiredo et al. 2005; Sánchez et al. 2009). However, each cluster must then be interpreted by an expert. On the other hand, the existing supervised classification of private energy users relies upon the analysis of consumption curves and survey data (Beckel et al. 2013; Hopf et al. 2014; Sodenkamp et al. 2014). Here, the average prediction failure rate of the best classifier (support vector machine) exceeds 35 %. This low performance seems to stem from the fact that energy consumption is generally assessed in relation to a number of other relevant variables, such as economic, social, demographic and climatic indices, energy price, household characteristics, residents’ lifestyle, as well as cognitive variables, such as values, needs and attitudes (Beckel et al. 2013; Santos et al. 2014; Elias and Hatziargyriou 2009; Santin 2011; Xiong et al. 2014). In practice, one can include all prediction-relevant data, regardless of its intrinsic relationships, in a classification task in the form of features. This leads, however, to high spatio-temporal complexity and incurs a significant risk of the curse of dimensionality.

In predictive analytics, the classification (or discrimination) problem refers to the assignment of observations (objects, alternatives) into predefined unordered homogeneous classes (Zopounidis and Doumpos 2000; Carrizosa and Morales 2013). Supervised classification implies that the function mapping objects described by the data into categories is constructed from so-called training instances, i.e., data with respective class labels or rules. This is realized in a two-step process: first, a prediction model is built either from known class labels or from a set of rules; then, new data are classified automatically based on this model.

In practice, the problem of finding functions with good prediction accuracy and low spatio-temporal complexity is challenging. Performance of all classifiers depends to a large extent on the volume of the input variables and on interdependencies and redundancies within the data (Joachims 1998; Kotsiantis 2007). At this point, dimensionality reduction is critical to minimizing the classification error (Hanchuan et al. 2005).

Feature selection is the first group of dimensionality reduction methods that identify the most characterizing attributes of the observed data (Hanchuan et al. 2005). More general methods that create new features based on transformations or combinations of the original feature set are termed feature extraction algorithms (Jain and Zongker 1997). Indeed, by definition all dimensionality reduction methods result in some loss of information since the data is removed from the dataset. Hence it is of great importance to reduce the data in a way that preserves the important structures within the original data set (Johansson and Johansson 2009).

In environmental and energy studies, the need to analyze large amounts of multivariate data raises the fundamental problem: how to discover compact representations of interacting and high-dimensional systems (Roweis and Saul 2000).

The existing prediction methods that treat multi-dimensional data include multivariate classification based on statistical models (e.g., linear and quadratic discriminant analysis) (Fisher 1936; Smith 1946), preference disaggregation (such as UTADIS and PREFDIS) (Zopounidis and Doumpos 2000), criteria aggregation (e.g., ELECTRE Tri) (Yu 1992; Roy 1993; Mastrogiannis et al. 2009), and model development (e.g., regression analysis and decision rules) (Greco et al. 1999; Srinivasan and Shocker 1979; Flitman 1997), among others. The common pitfall of these approaches is that they treat observations independently and neglect important complexity issues. Correlation-based classification (Beidas and Weber 1995) considers the influence between different subsets of measurements (e.g., instead of using both subsets, the correlation between them is computed and used in their place), but uses the correlations as features only. The correlation-based classifier also does not consider the possibility that the correlation depends on the class labels.

In this work, we propose a supervised machine learning method called DID-Class (Dependent-independent data classification) that tackles magnitudes of interaction (correlation) among multiple classification-relevant variables. It establishes a classification machine (probabilistic regression) with embedded correlation indices and a single normalized dataset. We distinguish between dependent observations that are affected by the classes, and independent observations that are not affected by the classes but influence dependent variables. The motivation for such a concept is twofold: first, to enable simultaneous consideration of multiple factors that characterize an object of interest (i.e., energy consumption affected by economic and demographic indices, energy prices, climate, etc.); and second, to represent high-dimensional systems in a compact way, while minimizing loss of valuable information.

Our study of household classification is based on the half-hourly readings of smart electricity meters from 4200 Irish households collected within an 18-month period, survey data containing energy-efficiency-relevant dwelling information (type of heating, floor area, age of house, number of inhabitants, etc.), and weather figures in that region. The results indicate that DID-Class recognizes all household characteristics with better accuracy and temporal performance than the existing classifiers.

Thus, DID-Class is an effective and highly scalable mechanism that makes broad analysis of electricity consumption and classification of residential units possible. This opens new opportunities for the reasonable employment of targeted energy efficiency measures in the private sector.

The remainder of this paper is organized as follows. “Supervised classification with class-independent data” describes the developed dimensionality reduction and classification method. “Application of DID-Class to household classification based on smart electricity meter data and weather conditions” presents the application of the model to the data. The conclusions are in “Conclusion”.

Supervised classification with class-independent data

Problem definition and mathematical formulation

Supervised classification is a typical problem of predictive analytics. Given a training set $ \bar{N} = \left\{ {\left( {\bar{x}_{1} , y_{1} } \right), \ldots ,\left( {\bar{x}_{n} ,y_{n} } \right)} \right\} $ containing n ordered pairs of observations (measurements) $ \bar{x}_{i} $ with class labels $ y_{i} \in J $ and a test set $ \bar{M} = \left\{ {\bar{x}_{n + 1} , \ldots ,\bar{x}_{n + m} } \right\} $ of m unlabeled observations, the goal is to find class labels for the measurements in the test set $ \bar{M} $.

Let $ \bar{X} = \{ \bar{x}_{i} \} $ be a family of observations, and $ Y = \{ y_{i} \} $ be a family of associated true labels. Classification implies construction of a mapping function $ \bar{x}_{i} \mapsto f\left( {\bar{x}_{i} } \right) $ of the input, i.e. vector $ \bar{x}_{i} \in \bar{X} $, into the output, i.e. a class label $ y_{i} $.

A classification can be done either by a single assignment (only one class label $ y_{i} $ is assigned to one sample $ \bar{x}_{i} $), or as a probability distribution over the classes in $ J $. Algorithms of the latter kind are called probabilistic classifiers.

Some elements of vector $ \bar{x}_{i} $ can be differentiated, if they are not related to $ y_{i} $. Let $ S = \left\{ {s_{i} } \right\} $ be a subset of $ \bar{X} $ with $ s_{i} \subset \bar{x}_{i} $ that are statistically independent from $ Y $:

$$ P(y = j | s = z) = P(y = j), \forall j \in J, z \in \left\{ {s_{1} , \ldots ,s_{n} } \right\} $$

(1)

We define a family $ S $ of measurements $ s_{i} $ as independent. Simply put, observations $ s_{i} $ are not influenced by class labels $ y_{i} $. The remaining observations are called dependent, and are defined as $ X = \left\{ {x_{i} } \right\} $ with $ x_{i} = \{ z \in \bar{x}_{i} | z \notin s_{i} \} $.

Independent variables can influence the dependent ones and thus be relevant for prediction. Figure 1 shows an example. The two classes are represented by o and x. The x-axis shows the dependent variable, and the y-axis the independent variable. The objects in blue show that classification with only the dependent variable would be impossible. Using both the dependent and the independent data, however, it is possible to separate the two classes linearly. For each independent value there are two observations with different class labels (i.e., the class label does not depend directly on $ S $). The class labels do change with the values of $ X $, but only the combination of $ X $ and $ S $ makes the classes linearly separable.
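The situation described for Fig. 1 can be reproduced with a small toy data set. The following is only an illustrative sketch: the data values, the rule "x = s" versus "x = s + 1", and the helper `separable_by_threshold` are assumptions made for this example and are not taken from the figure.

```python
# Toy data (hypothetical): class "o" follows x = s, class "x" follows x = s + 1.
# Each tuple is (dependent value x, independent value s, class label y).
points = [(s + offset, s, label)
          for s in range(4)
          for offset, label in ((0.0, "o"), (1.0, "x"))]

def separable_by_threshold(values_a, values_b):
    """True if a single threshold on one axis separates the two groups."""
    return max(values_a) < min(values_b) or max(values_b) < min(values_a)

# Using the dependent variable alone: the value ranges of both classes overlap.
x_o = [x for x, s, y in points if y == "o"]
x_x = [x for x, s, y in points if y == "x"]
only_x = separable_by_threshold(x_o, x_x)

# Using x and s together: the linear combination x - s separates the classes.
proj_o = [x - s for x, s, y in points if y == "o"]
proj_x = [x - s for x, s, y in points if y == "x"]
with_s = separable_by_threshold(proj_o, proj_x)

print(only_x, with_s)  # False True
```

Every value of s occurs once per class, so s alone carries no class information, yet the line x - s = 0.5 separates the two classes.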

We implement the given notation of dependent and independent variables in our DID-Class prediction methodology.

A training set $ N $ of DID-Class takes the form of the set of three-tuples:

$$ N = \left\{ {\left( {x_{1} ,s_{1} ,y_{1} } \right), \ldots ,\left( {x_{n} ,s_{n} ,y_{n} } \right)} \right\}. $$

(2)

A test set $ M $ is then extended to the ordered pairs:

$$ M = \left\{ {\left( {x_{n + 1} ,s_{n + 1} } \right), \ldots ,\left( {x_{n + m} ,s_{n + m} } \right)} \right\}. $$

Figure 2 visualizes relationships between variables in a conventional classification and in the DID-Class graphical model.

The DID-class model

In this section we describe the proposed three-step algorithm in detail. In Fig. 3 we have attempted to capture the main aspects of the DID-Class model discussed in this section.

Step 1: Estimation of interdependencies between the datasets

DID-Class exploits the relationships between the input variables of the classification. Therefore, once the input datasets are available, it is necessary to test the underlying hypotheses about associations between the variables. Throughout this paper, the following two assumptions are made.

Assumption 1

The independent variables $ S $ are statistically independent of the class labels $ Y $, as defined by (1).

In other words, classification based on $ s_{i} $ is random.

Assumption 2

Independent variables $ S $ affect the dependent variables $ X $ and this influence can be measured or approximated.

Correlation coefficients can be found by solving a regression model of $ X $ expressed through regressors $ S $ and class labels $ Y $.

DID-class relies upon the multivariate generalized linear model, since this formulation provides a unifying framework for many commonly used statistical techniques (Dobson 2001):

$$ x_{i} = \mathop \sum \limits_{j \in J} d_{ij} f_{j} \left( {s_{i} } \right) + \varepsilon_{i} . $$

(3)

The error term $ \varepsilon_{i} $ captures all relevant variables that are not included in the model because they are not observed in the available datasets. $ d_{ij} $ are the dummy variables for the class labels:

$$ d_{ij} = \begin{cases} 1 & \text{for } y_{i} = j \\ 0 & \text{for } y_{i} \ne j \end{cases} $$

The most important assumption behind a regression approach is that an adequate regression model can be found (Zaki and Meira 2014). A proper interpretation of a linear regression analysis should include the checks of (1) how well the model fits the observed data, (2) prediction power, and (3) magnitude of relationships.

1. Model fit: Depending on the choice of forecasting model and problem specifics, any appropriate quality measure [e.g., coefficient of determination $ R^{2} $ or Akaike information criterion (D’Agostino 1986)] can be used to estimate the discrepancy between the observed and expected values.
2. Generalizing performance: The model should be able to classify observations of unknown group membership (class labels) into a known population (training instances) (Lamont and Connell 2008) without overfitting the training data.
3. Effect size: If the strength of relationships between the input variables is small, then the independent data can be ignored and application of DID-Class is not necessary. The effect size is estimated in relation to the distances between classes using appropriate indices (e.g., Pearson’s $ r $ or Cohen’s $ f^{2} $).
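Two of the indices named in this list can be computed in a few lines. The sketch below uses the standard definitions $ R^{2} = 1 - SS_{res}/SS_{tot} $ and Cohen's $ f^{2} = R^{2}/(1 - R^{2}) $; the observed and predicted values are made up for illustration, and the text does not state that the authors computed the indices in exactly this way.

```python
def r_squared(observed, predicted):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def cohens_f2(r2):
    """Cohen's f^2 effect size derived from R^2."""
    return r2 / (1.0 - r2)

# Hypothetical dependent measurements and the values a regression model predicts.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed, predicted)
f2 = cohens_f2(r2)
print(round(r2, 4), round(f2, 2))
```

A value of $ f^{2} $ close to zero would suggest that the independent data explain little of the dependent measurements, which is the case in which the list above recommends ignoring the independent data.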

The functions $ f_{j} $ describe how the dependent variables can be calculated from the independent ones and the class labels. These functions are used in later steps of the presented algorithm to normalize response measures and eliminate predictor variables. The unknown functions $ f_{j} $ are typically estimated with a maximum likelihood approach in an iterative weighted least-squares procedure, maximum quasi-likelihood, or Bayesian techniques (Nelder and Baker 1972; Radhakrishna and Rao 1995). Thus, a unique $ f_{j} $ is associated with each class $ j \in J $. A single relationship model can be built for the set of all classes by adding dummy variables for these classes.

If $ X $ linearly depends on $ S $, then (3) can be rewritten as follows:

$$ x_{i} = \mathop \sum \limits_{j \in J} d_{ij} (\alpha_{j} + \beta_{j} \times s_{i} ) + \varepsilon_{i} . $$

(4)

The coefficients $ \alpha_{j} $ and $ \beta_{j} $ can be calculated using the ordinary least squares method. If the relationships are not linear, then a more complex regression model can be used. For instance, for a polynomial dependency, terms with powers of $ S $ can be added on the right side of Eq. (4). In this case, networks with mutually dependent variables can be taken into account, as shown in Fig. 2.
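For the linear case of Eq. (4), the estimation can be sketched as follows. Because each class has its own pair $ (\alpha_{j}, \beta_{j}) $ and the dummy variables switch exactly one pair on per observation, ordinary least squares on Eq. (4) decouples into one simple regression per class. The sketch assumes scalar $ x_{i} $ and $ s_{i} $; the function names and the data are hypothetical.

```python
def fit_line(s_values, x_values):
    """Ordinary least squares for x = alpha + beta * s (scalar s and x)."""
    n = len(s_values)
    s_mean = sum(s_values) / n
    x_mean = sum(x_values) / n
    sxx = sum((s - s_mean) ** 2 for s in s_values)
    sxy = sum((s - s_mean) * (x - x_mean) for s, x in zip(s_values, x_values))
    beta = sxy / sxx
    alpha = x_mean - beta * s_mean
    return alpha, beta

def fit_per_class(samples):
    """samples: list of (x, s, y). Returns {class label: (alpha_j, beta_j)}."""
    by_class = {}
    for x, s, y in samples:
        s_vals, x_vals = by_class.setdefault(y, ([], []))
        s_vals.append(s)
        x_vals.append(x)
    return {y: fit_line(s_vals, x_vals) for y, (s_vals, x_vals) in by_class.items()}

# Hypothetical noise-free training data: class "A" follows x = 2 + 0.5 s,
# class "B" follows x = 5 - s.
s_grid = (0.0, 1.0, 2.0, 3.0)
samples = [(2.0 + 0.5 * s, s, "A") for s in s_grid] + \
          [(5.0 - 1.0 * s, s, "B") for s in s_grid]

coeffs = fit_per_class(samples)
print(coeffs)
```

For multivariate $ S $ or polynomial terms, the closed-form slope would have to be replaced by a general least-squares solver.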

Step 2: Integration of dependent and independent measurements

In order to take the correlation coefficients revealed in the previous step into consideration, and to transform the multivariate input data to a lower dimension without loss of classification-relevant information, we normalize the dependent variables with respect to the independent ones. Normalization means elimination of changes in the dependent measurements that occur due to shifts in the independent values, and transformation of $ X $ into $ X^{{\prime }} $.

Since the relationships of $ X $ and $ S $ are different for each class, the normalization is also class-dependent. Each measurement $ x_{i} $ in the training set $ N $ is normalized according to the corresponding class label $ y_{i} $.

Model (3) expresses regression for each single class label. $ f_{j} $ are used as the normalization functions. The normalized training set takes the following form:

$$ N^{\prime } = \left\{ {\left( {x_{1}^{{\prime }} ,y_{1} } \right), \ldots ,\left( {x_{n}^{{\prime }} ,y_{n} } \right)} \right\} $$

(5)

with

$$ x_{i}^{{\prime }} = x_{i} - f_{{y_{i} }} \left( {s_{i} } \right) + f_{{y_{i} }} \left( {s_{1} } \right). $$

Every $ x_{i}^{{\prime }} $ is the normalized representation of the dependent measurement $ x_{i} $. The term $ f_{{y_{i} }} \left( {s_{i} } \right) - f_{{y_{i} }} \left( {s_{1} } \right) $ describes the difference in the dependent variable expected between the independent states $ s_{i} $ and $ s_{1} $ under the chosen regression model. Here, $ s_{1} $ is the default state, and no normalization is needed for $ x_{i} $ with $ s_{i} = s_{1} $. As a result, there are $ a $ normalization functions for $ a $ class labels. Without loss of generality, any value can be chosen as the default value. However, a data-specific value may allow for better interpretation of the results.
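The class-dependent normalization of the training set can be illustrated with a minimal sketch. It assumes scalar measurements and linear functions $ f_{j}(s) = \alpha_{j} + \beta_{j} s $ whose coefficients are already known; the coefficient values, the default state `S_REF`, and the sample values are hypothetical.

```python
# Hypothetical class-wise linear models f_j(s) = alpha_j + beta_j * s from Step 1.
COEFFS = {"A": (2.0, 0.5), "B": (5.0, -1.0)}
S_REF = 0.0  # default state s_1, chosen arbitrarily for this example

def f(j, s):
    """Expected dependent value for class j at independent value s."""
    alpha, beta = COEFFS[j]
    return alpha + beta * s

def normalize(x, s, j):
    """x' = x - f_j(s) + f_j(s_1): remove the shift caused by s under class j."""
    return x - f(j, s) + f(j, S_REF)

# Training triples (x_i, s_i, y_i); each x_i is normalized with its own label.
train = [(3.6, 3.0, "A"), (2.1, 0.0, "A"), (1.9, 3.0, "B")]
normalized_train = [(normalize(x, s, y), y) for x, s, y in train]
print(normalized_train)
```

The second sample already sits at the default state, so its value is unchanged, while the other two are moved to the value they would be expected to have at `S_REF`.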

Figure 4 provides a simple example for the normalization, where dependent measurements are time series that must be classified between two categories. Additionally, there is a discrete independent variable with 3 possible values. In this case, the normalization functions can be computed as the difference of the time series.

Step 3: Classification

Once all classification-relevant input datasets are integrated into one normalized set, it can be used as an input for a probabilistic classifier (further referred to as $ C $) that returns a probability distribution over the set of class labels.

To predict the class of a new observation, it is necessary to normalize its value. The challenge, however, is to choose the appropriate normalization function $ f_{j} $ from the $ a $ functions constructed in Step 1. Since the class labels for the test data are unknown, there is no a priori knowledge of which normalization function should be used. It is possible to apply any $ f_{j} $, but the classification is more likely to be successful if the correct $ f_{j} $ is chosen (i.e., the test data belongs to the class from which the normalization function was derived). Therefore, DID-Class tests all functions for a new observation and chooses the solution with the highest probability for each individual class. To do so, the test-set measurement is transformed and classified $ a $ times. Finally, the averages of the resulting probabilities for each class are derived. The observation is assigned to the class with the highest resulting probability.

The prediction process is formally described below.

A normalized measurement is derived for each unlabeled element and each class. The test set takes the following form:

$$ \{ x_{i}^{{j^{{\prime }} }} \}_{{i \in \{ n + 1, \ldots ,n + m\} , \; j \in J}} $$

where

$$ x_{i}^{{j^{{\prime }} }} = x_{i} - f_{j} \left( {s_{i} } \right) + f_{j} \left( {s_{1} } \right). $$

This transformed test set is used as an input to the trained probabilistic classifier $ C $ that is chosen at the beginning of Step 3. Its output is a probability vector of $ a $ values for each normalized measurement. Thus, the following probability matrix is associated with each unlabeled observation:

$$ P^{i} = \left( {p_{kl}^{i} } \right)_{{\left\{ {k,l \in J} \right\}}} $$

where $ p_{kl}^{i} $ designates the probability that element $ x_{i} $ belongs to class $ l $ after normalization according to class $ k $. Prediction of the class label of $ x_{i}^{{j^{{\prime }} }} $ is naturally biased toward $ j $, but this is compensated by the prediction being done for all $ a $ classes. The aggregated probability for class $ l $ is $ P_{l}^{i} = \frac{1}{a} \sum_{k \in J} p_{kl}^{i} . $

This process results in a probabilistic classifier for the measurements from the test set. We can also label the measurements as belonging to the class with the greatest probability. The resulting figures $ P_{l}^{i} $ are biased probabilities of element $ i $ belonging to class $ l $, which means that DID-Class should be used as a non-probabilistic classifier. Figure 5 visualizes an example of the classification step for two class labels.
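The prediction procedure can be sketched end to end for scalar data. The text leaves the probabilistic classifier $ C $ open, so the sketch substitutes a deliberately simple stand-in (a softmax over negative squared distances to class centroids); this stand-in, the linear coefficients, and all data values are assumptions made for illustration only.

```python
import math

# Hypothetical class-wise linear models f_j(s) = alpha_j + beta_j * s from Step 1.
COEFFS = {"A": (2.0, 0.5), "B": (5.0, -1.0)}
S_REF = 0.0  # default state s_1

def f(j, s):
    alpha, beta = COEFFS[j]
    return alpha + beta * s

def normalize(x, s, j):
    """x^{j'} = x - f_j(s) + f_j(s_1)."""
    return x - f(j, s) + f(j, S_REF)

def train_centroids(train):
    """Stand-in for C: one centroid per class in the normalized space."""
    sums, counts = {}, {}
    for x, s, y in train:
        sums[y] = sums.get(y, 0.0) + normalize(x, s, y)
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x_norm):
    """Probability vector over the classes for one normalized measurement."""
    scores = {l: math.exp(-(x_norm - c) ** 2) for l, c in centroids.items()}
    total = sum(scores.values())
    return {l: v / total for l, v in scores.items()}

def did_class_predict(x, s, centroids):
    classes = sorted(COEFFS)
    # p[k][l]: probability of class l after normalizing with f_k.
    p = {k: predict_proba(centroids, normalize(x, s, k)) for k in classes}
    # Aggregated probability P_l: average over all a normalizations.
    aggregated = {l: sum(p[k][l] for k in classes) / len(classes) for l in classes}
    label = max(aggregated, key=aggregated.get)
    return label, aggregated

s_grid = (0.0, 1.0, 2.0, 3.0)
train = [(f("A", s), s, "A") for s in s_grid] + [(f("B", s), s, "B") for s in s_grid]
centroids = train_centroids(train)

# Unlabeled observation that lies close to the class "A" model at s = 1.
label, aggregated = did_class_predict(2.4, 1.0, centroids)
print(label, aggregated)
```

In this sketch the row obtained with the wrong normalization function is less decisive than the row obtained with the correct one, which illustrates why the averaged values are better read as a ranking of classes than as calibrated probabilities.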

Temporal complexity of classification depends on the dimension of included variables (Lim et al. 2000).

Classification training with DID-Class uses normalized set $ N^{{\prime }} $ with $ n $ variables of dimension $ d_{x} $. A classification without DID-Class would use the initial training set $ N $, with the dimension of $ d_{x} + d_{s} $, where $ d_{x} $ and $ d_{s} $ stand for the dimensions of variables in $ X $ and $ S $ respectively.

The prediction using DID-Class is done for $ a \times n $ variables with dimension $ d_{x} $. A classification without DID-Class would be done for only $ n $ variables, but with higher dimension, namely $ d_{x} + d_{s} . $ Hence, DID-Class can perform better or worse than a classifier that analyzes all data in its initial form (without correlation information), depending on the number of categories $ a $ and dimension of independent variables. Particularly, DID-class is more efficient compared to the algorithms where training complexity is higher than the prediction complexity, which is valid for the majority of commonly used classifiers (e.g., support vector machines (SVM), Adaboost, Random Forest, Linear Discriminant Analysis, etc.).

The proposed algorithm is described in Table 1.

Table 1 The algorithm of DID-Class


Verification of DID-class

In this section, we show that the proposed methodology yields linearly separable categories of objects, under the specified conditions.

Theorem 1

Let $ N $ be the training set of a classification problem as described by Eq. (2).

Further suppose that the model (3) for the dependency of $ X $ on $ S $ is known (i.e., the functions $ f_{j} $ are known in advance) and let $ \delta $ be an upper bound on the errors $ \varepsilon_{i} $ in the model:

$$ \delta = \max_{i} \parallel x_{i} - f_{{y_{i} }} \left( {s_{i} } \right)\parallel . $$

(6)

Let then $ N^{{\prime}} $ be the training set normalized with the correct functions (5).

If there exists an index $ l_{j} $ for each class label $ j \in J $ , such that the distance between the chosen normalized measurements is greater than $ 4\delta $

$$ \parallel x^{{\prime}}_{{l_{1} }} - x^{{\prime}}_{{l_{2} }} \parallel > 4\delta ,\quad \forall l_{1} , l_{2} \in \left\{ {l_{j} } \right\}_{j \in J} , l_{1} \ne l_{2} , $$

then the classes in the normalized training set are linearly separable.

Proof

All the normalized measurements of a single class $ j $ are contained in the $ 2\delta $-neighborhood of the normalized measurement $ x_{{l_{j} }}^{{\prime }} $ since, for $ y_{i} = j $:

$$ \begin{aligned} \parallel x_{i}^{\prime } - x_{{l_{j} }}^{\prime } \parallel \, & = \,\parallel x_{i} - f_{j} \left( {s_{i} } \right) + f_{j} \left( {s_{1} } \right) - x_{{l_{j} }} + f_{j} \left( {s_{{l_{j} }} } \right) - f_{j} \left( {s_{1} } \right)\parallel \\ & = \,\parallel x_{i} - f_{j} \left( {s_{i} } \right) - x_{{l_{j} }} + f_{j} \left( {s_{{l_{j} }} } \right)\parallel \\ & \le \,\parallel x_{i} - f_{j} \left( {s_{i} } \right)\parallel + \parallel x_{{l_{j} }} - f_{j} \left( {s_{{l_{j} }} } \right)\parallel \\ & \le \, \delta + \delta \\ & = 2\delta . \\ \end{aligned} $$

This means that every normalized measurement of a class $ j $ is contained in a convex compact ball of radius $ 2\delta $ centered on the normalized measurement $ x_{{l_{j} }}^{{\prime }} $. The balls of different classes are disjoint since the distance between their centers is greater than $ 4\delta $, and therefore the distance between the balls is greater than 0. Hence there exists a hyperplane separating any two classes. $ \square $
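The bound used in the proof can be checked numerically on a small example. The sketch below assumes scalar data, known linear functions $ f_{j} $, and a fixed list of bounded errors; all values are hypothetical, and in one dimension the separating hyperplane reduces to a threshold.

```python
def f(j, s):
    """Known class models (hypothetical): linear in a scalar s."""
    alpha, beta = {"A": (2.0, 0.5), "B": (5.0, -1.0)}[j]
    return alpha + beta * s

S_REF = 0.0                              # default state s_1
errors = [0.1, -0.2, 0.15, -0.05]        # bounded model errors epsilon_i

# Training triples (x_i, s_i, y_i) generated as x_i = f_{y_i}(s_i) + epsilon_i.
train = [(f(j, float(s)) + e, float(s), j)
         for j in ("A", "B")
         for s, e in zip(range(4), errors)]

# Upper bound on the model errors, Eq. (6).
delta = max(abs(x - f(y, s)) for x, s, y in train)

# Normalization with the correct functions, Eq. (5).
normalized = {j: [x - f(j, s) + f(j, S_REF) for x, s, y in train if y == j]
              for j in ("A", "B")}

# Largest distance between two normalized measurements of the same class.
max_spread = max(abs(u - v)
                 for vals in normalized.values() for u in vals for v in vals)

# Distance between the chosen reference measurements (first element per class).
center_gap = abs(normalized["A"][0] - normalized["B"][0])

# In one dimension, linear separability means the value ranges do not overlap.
separable = max(normalized["A"]) < min(normalized["B"])
print(delta, max_spread, center_gap, separable)
```

Here the reference measurements are more than $ 4\delta $ apart, the within-class spread stays below $ 2\delta $, and the two classes do not overlap after normalization.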

An analogous statement can be proven if the model is unknown, but the kind of dependency is known. The estimation of errors in this case is inherent to the model. Theorem 2 proves this statement for the case of linear dependency. Other regression models can be treated in a similar manner.

Theorem 2

Let $ N $ be the training set of a classification problem as described by Eq. (2).

Further suppose that the model (3) is unknown (i.e., the functions $ f_{j} $ have to be estimated based on the data), and let $ \delta $ be an upper bound on the error $ \varepsilon_{i} $ in (6).

If there exists an index $ l_{j} $ for each class label $ j \in J $ , such that the distance between the chosen normalized measurements is greater than $ 4\sqrt n \delta $

$$ \parallel x_{l_{1}}^{\prime} - x_{l_{2} }^{\prime} \parallel > 4 \sqrt n \delta , \quad \forall l_{1} , l_{2} \in \left\{ {l_{j} } \right\}_{j \in J}, \,\, l_{1} \ne l_{2} ,$$

then the classes in the normalized training set are linearly separable.

Proof

Let $ f_{j} ' $ be the estimated functions of the linear model. The sum of squared errors for the estimated model is bounded by the sum of squared errors for the actual model:

$$ \mathop \sum \limits_{i} \parallel x_{i} - f_{{y_{i} }}^{{\prime }} \left( {s_{i} } \right)\parallel^{2} \le \mathop \sum \limits_{i} \parallel x_{i} - f_{{y_{i} }} \left( {s_{i} } \right)\parallel^{2} \le n\delta^{2} . $$

Therefore we get an upper bound for a single error in the estimated model:

$$ \begin{aligned} \parallel x_{i} - f_{{y_{i} }}^{{\prime }} \left( {s_{i} } \right)\parallel^{2} & \le \mathop \sum \limits_{i} \parallel x_{i} - f_{{y_{i} }}^{{\prime }} \left( {s_{i} } \right)\parallel^{2} \le n\delta^{2} , \\ \parallel x_{i} - f_{{y_{i} }}^{{\prime }} \left( {s_{i} } \right)\parallel & \le \sqrt n \delta . \\ \end{aligned} $$

Analogously to the proof of Theorem 1, we can now show that each normalized measurement of the class $ j $ is contained in the $ 2\sqrt n \delta $-neighborhood of the normalized measurement $ x_{{l_{j} }}^{{\prime }} $:

$$ \begin{aligned} \parallel x_{i}^{\prime } - x_{{l_{j} }}^{\prime } \parallel & = \;\parallel x_{i} - f_{j}^{\prime } \left( {s_{i} } \right) + f_{j}^{\prime } \left( {s_{1} } \right) - x_{{l_{j} }} + f_{j}^{\prime } \left( {s_{{l_{j} }} } \right) - f_{j}^{\prime } \left( {s_{1} } \right)\parallel \\ & = \;\parallel x_{i} - f_{j}^{\prime } \left( {s_{i} } \right) - x_{{l_{j} }} + f_{j}^{\prime } \left( {s_{{l_{j} }} } \right)\parallel \\ & \le \;\parallel x_{i} - f_{j}^{\prime } \left( {s_{i} } \right)\parallel + \parallel x_{{l_{j} }} - f_{j}^{\prime } \left( {s_{{l_{j} }} } \right)\parallel \\ & \le \;\sqrt n \delta + \sqrt n \delta \\ & = \;2\sqrt n \delta . \\ \end{aligned} $$

This means that every normalized measurement of a class $ j $ is contained in a convex compact ball of radius $ 2\sqrt n \delta $ centered on the normalized measurement $ x_{{l_{j} }}^{{\prime }} $. The balls of different classes are disjoint since the distance between their centers is greater than $ 4\sqrt n \delta $, and therefore the distance between the balls is greater than 0. Hence there exists a hyperplane separating any two classes. $ \square $

We have shown that if (a) Assumptions 1 and 2 are satisfied, (b) the regression model describing the relations between the input variables is reasonable, and (c) different classes are far from each other, then the normalized training set yielded by DID-Class is a linearly separable set.

Application of DID-Class to household classification based on smart electricity meter data and weather conditions

In this section we present a classification of residential units based on smart electricity meter data and weather variables by DID-Class. Results indicate that DID-Class outperforms the best existing classifier with regard to the accuracy and temporal characteristics.

Data description

Our study is based on the following three samples.

(a) The power consumption data at 30-min granularity originates from the Irish Commission for Energy Regulation (CER) (ISSDA. Data from the commission for energy regulation 2014). It was gathered during a smart metering trial over a 76-week period in 2009–2010 and encompasses 4200 private dwellings. It is the dependent data set $ X $, according to the definition given in “Problem definition and mathematical formulation”.
(b) The respective customer survey data containing energy-efficiency-related attributes of households (such as type of heating, floor area, age of house, number of residents, employment, etc.). It is the data set of known object categories $ Y $, according to the definition given in “Problem definition and mathematical formulation”. For the classification problem we consider 12 different household properties, which are presented on the left-hand side of Table 2. The classification is made for each property individually. The properties can take different values (“class labels”) that are presented on the right-hand side of Table 2. For example, each household can be classified as either “electrical” or “not electrical” with respect to the property “type of cooking facility”. Continuous values are divided into discrete intervals [e.g., the property “age of building” is expressed by two alternative class labels, “old” (>30 years) and “new” (≤30 years)]. For the three household properties “age of house”, “floor area” and “number of bedrooms”, the discrete class labels were defined according to the training data (surveys). For the properties “number of residents” and “number of devices”, the classes were defined to have a roughly equal distribution of households (Beckel et al. 2013).
Table 2 The properties and their class labels
(c) Multivariate weather data, including outdoor temperature, wind speed, and precipitation at 30-min granularity in the investigated region, provided by the US National Climatic Data Center (NCDC 2014). It is an independent multivariate data set $ S $, where $ S_{1} = \text{outdoor temperature} $, $ S_{2} = \text{wind speed} $, and $ S_{3} = \text{precipitation} $.

In the current implementation, we assume that an observation refers to a 1-week data trace (including the weekend), because it represents a typical consumption cycle of inhabitants. One week of data at a 30-min granularity implies that an input trace contains 336 data samples for each variable.
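The figure of 336 samples follows from 7 days with 48 half-hourly readings each. A minimal sketch of cutting a long half-hourly series into weekly traces is given below; the function name and the decision to drop an incomplete trailing week are assumptions made for this example.

```python
SAMPLES_PER_DAY = 48                      # 24 h at 30-min granularity
SAMPLES_PER_WEEK = 7 * SAMPLES_PER_DAY    # 336 samples per trace

def weekly_traces(readings):
    """Split a half-hourly series into consecutive 1-week traces.

    An incomplete trailing week is dropped (an assumption of this sketch).
    """
    n_weeks = len(readings) // SAMPLES_PER_WEEK
    return [readings[w * SAMPLES_PER_WEEK:(w + 1) * SAMPLES_PER_WEEK]
            for w in range(n_weeks)]

# Hypothetical series covering two full weeks plus a few extra readings.
readings = [0.1 * i for i in range(700)]
traces = weekly_traces(readings)
print(len(traces), len(traces[0]))  # 2 336
```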

Since the CER data set does not contain any facts about household locations or about the geographical distribution of households, we calculated the average of the independent variables over all 25 weather stations in Ireland.
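Averaging over stations amounts to taking, for every time step, the mean of the values reported by all stations. The sketch below shows this for one variable with three hypothetical stations and made-up temperatures; it assumes that all station series are aligned in time and have no missing values.

```python
def average_over_stations(station_series):
    """Per-time-step mean of equally long, time-aligned station series."""
    n_stations = len(station_series)
    return [sum(values) / n_stations for values in zip(*station_series)]

# Hypothetical outdoor temperatures from three stations at three time steps.
temps = [
    [10.0, 11.0, 12.0],
    [12.0, 13.0, 14.0],
    [8.0, 9.0, 10.0],
]
mean_temp = average_over_stations(temps)
print(mean_temp)  # [10.0, 11.0, 12.0]
```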

Prediction results

In the present study, we split the input data into training and test cases in the proportion 80–20 %. The training instances are used to estimate the interdependencies between electricity consumption and outdoor temperature and then train the classifier. The test instances are then used to evaluate accuracy of the classification results.
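An 80/20 split of this kind can be sketched as follows. The text does not specify how the instances were assigned to the two sets, so the random shuffle, the fixed seed, and the use of integer identifiers as instances are assumptions of this example.

```python
import random

def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples reproducibly and cut them into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical instance identifiers.
instances = list(range(100))
train, test = train_test_split(instances)
print(len(train), len(test))  # 80 20
```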

Step 1: Influence estimation

First, we check if Assumptions 1 and 2 hold for the given variables.