
**Improving the ensemble RBF-EKF prediction model: using AdaBoost as a bridge**

**Introduction**

The focus of this research is to use the extended Kalman filter (EKF) to train radial basis function (RBF) networks, and to use AdaBoost to create a committee of classifiers and output the final classifier. It is our goal to accomplish the following objectives:

- RBF issue – application of the Kalman filter to train radial basis function networks with an alternative form of generator function [8].
- Kalman filter issues:
  - To improve the convergence of the Kalman filter by more intelligently initializing the training process [8] [11].
  - To determine the tuning parameters (P, Q and R) of the KF more effectively [8] [11].
- AdaBoost – to use AdaBoost as an ensemble algorithm that bridges the trained RBF–KF models to obtain the RBF–KF–AdaBoost model.
- Simulation and analysis – to test and analyse the performance of the RBF–KF–AdaBoost model using the IRIS, Cancer and Geophysical datasets.

**Radial Basis Function**

A radial basis function (RBF) network can be viewed as an alternative to the MLP neural network for non-linear modelling [1]. It uses radial basis functions as its activation functions and can be trained in many ways, unlike MLPs, which are typically trained with back-propagation algorithms [2]. It has a similar structure and configuration to the multi-layer perceptron network (MLPN): it is a feed-forward neural network with a three-layer structure [3], but unlike the MLPN it has only one hidden layer, which uses radial basis functions as activation functions [3]. The three layers are the input layer, the hidden layer and the output layer. The RBF network can therefore be viewed as a type of artificial neural network and can be used for supervised learning problems such as regression and classification. RBF networks are widely used in science and engineering tasks such as function approximation, curve fitting, time-series prediction and classification. The neurons in the RBF hidden layer typically contain Gaussian transfer functions, and the activation of the hidden units is a non-linear function of the distance between the input vector and the centre vector.

A review of the literature shows that the Kalman filter has been used extensively, and with promising results, to train neural network models, although challenges remain [4] [5] [6] [7]. Despite the efforts made in training RBF networks with the Kalman filter, a number of issues are yet to be resolved [8] [9] [10]. These involve the need to apply the Kalman filter to train RBF networks with alternative forms of generator function (instead of the current randomized vectors and zero-initialized weight matrix); the need to improve the convergence of the Kalman filter by intelligently initializing the training process; and the effective determination of the Kalman filter tuning parameters.

Firstly, this research is motivated by the need to address some of the current issues related to RBF-EKF models, and secondly by the aim of using AdaBoost as a technique for combining the RBF predictions obtained when training with the EKF. The overall objective of this research is therefore to improve the performance of RBF-KF models and to propose a new algorithm, RBF-KF-Boost. The algorithm will be based on the existing RBF-KF model, but AdaBoost will serve as a bridge that combines multiple predictions to obtain the final classifier.

**Review of RBF Network**

An RBF network has three layers: the input layer, the hidden layer, and the output layer. The neurons in the hidden layer are activated by a radial basis function. The hidden layer consists of an array of computing units called hidden nodes; each hidden node contains a centre vector $c$, a parameter vector of the same dimension as the input $x$. The activation of the hidden units is given by a non-linear function of the distance between the input vector and the centre vector. RBF networks use a two-stage training procedure, which has been shown to be faster than the methods used to train multi-layer perceptrons [1]. In the first stage the parameters of the basis functions are set so that they model the unconditional data density. The second stage determines the weights in the output layer, which can be posed as a quadratic optimization problem solvable by linear algebra methods.

**Theory and architecture of RBF**

In this section, we discuss the basic theory and architecture of RBF networks and the relevant equations that are to be implemented. A radial basis function network is an artificial neural network that uses radial basis functions as its activation functions; the output of the network is a linear combination of the radial basis functions of the inputs and the neuron parameters. Consider, for instance, an RBF network in which an $m$-dimensional input $x$ is passed directly to a hidden layer of $c$ neurons. Each of the $c$ hidden neurons applies an activation function that is a function of the Euclidean distance between the input and an $m$-dimensional prototype vector. Each hidden neuron contains its own prototype vector as a parameter, and the output of each hidden neuron is weighted and passed to the output layer. The outputs of the network therefore consist of sums of the weighted hidden-layer activations.

The input of an RBF network can be modelled as a vector of real numbers $x \in \mathbb{R}^n$, and the output of the network as a scalar function of the input vector, $y : \mathbb{R}^n \to \mathbb{R}$. The RBF network mapping can therefore be expressed in the following form:

$$y(x) = \sum_{j=1}^{M} w_j\, \phi(\lVert x - c_j \rVert; \sigma) \tag{1}$$

where:

- $M$ is the number of neurons in the hidden layer,
- $w_j$ is the weight of neuron $j$ in the linear output neuron,
- $c_j$ is the centre vector for neuron $j$.

In a typical RBF network, all inputs are connected to each hidden neuron. The norm used in the RBF is typically the Euclidean distance, and the radial basis function is normally taken to be a Gaussian, such that:

$$\phi(\lVert x - c_j \rVert) = \exp\left[-\beta \lVert x - c_j \rVert^2\right] \tag{2}$$

$$\phi(r) = e^{-r^2/\delta^2} \tag{3}$$

The Gaussian basis function is local to the centre vector in the sense that:

$$\lim_{\lVert x \rVert \to \infty} \phi(\lVert x - c_j \rVert) = 0 \tag{4}$$

Therefore, changing the parameters of one neuron has only a small effect for input values that are far away from the centre of that neuron. The parameters of the RBF network, i.e. $w_j$, $c_j$ and $\beta_j$, are determined in such a way that they optimize the fit between $\phi$ and the data. Figure 1 shows a schematic architecture of an RBF network.

Figure 1 Typical architecture of radial basis function network
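The mapping of Eq. 1 with the Gaussian basis of Eq. 2 can be sketched in a few lines; the centres, weights and width parameter below are arbitrary values chosen only for illustration:

```python
import numpy as np

def rbf_forward(x, centres, weights, beta):
    """Eq. 1 with the Gaussian basis of Eq. 2:
    y(x) = sum_j w_j * exp(-beta * ||x - c_j||^2)."""
    d2 = np.sum((centres - x) ** 2, axis=1)   # ||x - c_j||^2 for every centre
    phi = np.exp(-beta * d2)                  # hidden-layer activations
    return weights @ phi                      # linear output layer

# Example with two centres in R^2 and arbitrary weights.
centres = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([1.0, 2.0])
y = rbf_forward(np.array([0.0, 0.0]), centres, weights, beta=1.0)
```

At the first centre the activation is exactly 1, so the output is dominated by the corresponding weight, illustrating the locality property of Eq. 4.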

**Exact Interpolation vs Function Approximation**

**Exact interpolation** – Given a set of $N$ different $d$-dimensional input vectors $x^n$ and a corresponding set of one-dimensional targets $t^n$, it is possible to find a continuous function $h(x)$ such that:

$$h(x^n) = t^n, \qquad n = 1, 2, \ldots, N \tag{5}$$

This can be done by adopting a radial basis function approach in which a set of $N$ basis functions, one centred at each of the $N$ data points, is chosen as in Eq. 1. The problem then reduces to solving a set of $N$ linear equations for the unknown weights:

$$\begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1N} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2N} \\ \vdots & \vdots & & \vdots \\ \phi_{N1} & \phi_{N2} & \cdots & \phi_{NN} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix} \tag{6}$$

such that

$$\phi_{ij} = \phi(\lVert x^i - x^j \rVert), \qquad i, j = 1, 2, \ldots, N \tag{7}$$

Eq. 6 can therefore be written in the more compact form

$$\Phi w = t \tag{8}$$

It has been shown that the interpolation matrix in the above equation is non-singular, because there exists a large class of functions $\phi$, including Gaussians, inverse multiquadrics and thin-plate splines, for which the interpolation matrix $\Phi$ is non-singular provided the data points are distinct [12]. The weights can therefore be obtained using the inverse of the matrix $\Phi$ as:

$$w = \Phi^{-1} t \tag{9}$$

Substituting the weights obtained from Eq. 9 into Eq. 1 achieves exact interpolation: the function $y(x)$ represents a continuous differentiable surface passing through each data point. The generalization to a multidimensional target space, i.e. a mapping from the $d$-dimensional input space $x$ to a $k$-dimensional target space, is given similarly by:

$$h_k(x^n) = t_k^n, \qquad n = 1, 2, \ldots, N \tag{10}$$

where the $t_k^n$ are the components of the output vector $t^n$, and the $h_k(x)$ are obtained by linear superposition of the $N$ basis functions, as in the one-dimensional output case:

$$h_k(x) = \sum_{n} w_{kn}\, \phi(\lVert x - x^n \rVert) \tag{11}$$

The weight parameters are obtained in the form

$$w_{kn} = \sum_{j} (\Phi^{-1})_{nj}\, t_k^j \tag{12}$$

In Eq. 12 the same inverse matrix $\Phi^{-1}$ is used for each output function.
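The exact-interpolation construction of Eqs. 6-9 can be checked numerically: build the interpolation matrix $\Phi$ from the data, solve $\Phi w = t$, and verify that the fitted surface passes through every data point. A minimal sketch, using a Gaussian basis and arbitrary random data:

```python
import numpy as np

# Exact interpolation (Eqs. 6-9): one Gaussian basis function per data point.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))          # N = 5 distinct 2-d inputs
t = rng.standard_normal(5)               # one-dimensional targets

def gauss(r2, beta=1.0):
    return np.exp(-beta * r2)

# Interpolation matrix: Phi_ij = phi(||x^i - x^j||)  (Eq. 7)
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Phi = gauss(d2)

w = np.linalg.solve(Phi, t)              # w = Phi^{-1} t  (Eq. 9)

# The fitted function now passes through every data point exactly (Eq. 5).
t_hat = Phi @ w
```

Because the points are distinct, the Gaussian interpolation matrix is non-singular [12] and the linear solve succeeds.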

**RBF and Function Approximation**

In practice, strict interpolation is undesirable with noisy data, as interpolating every data point leads to overfitting and poor generalization. Interpolation also implies that the number of basis functions required equals the number of patterns in the learning dataset, which makes mapping a large dataset costly. To avoid this, the RBF model for function approximation and generalization is obtained by modifying the exact interpolation procedure [12]. The modifications give a smoother fit to the data using a reduced number of basis functions that depends on the complexity of the mapping function rather than on the size of the data. The network mapping can then be expressed as:

$$y_k(x) = \sum_{j=1}^{M} w_{kj}\, \phi_j(x) + w_{k0} \tag{13}$$

where the $\phi_j$ are the basis functions and the $w_{kj}$ are the output-layer weights. The bias weights can be absorbed into the summation by including an extra basis function $\phi_0$ whose activation is set to unity, so that Eq. 13 can be expressed as

$$y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x) \tag{14}$$

In matrix form this can be expressed as:

$$y(x) = W\phi \tag{15}$$

A Gaussian basis function takes the form

$$\phi_j(x) = \exp\!\left(-\frac{\lVert x - \mu_j \rVert^2}{2\sigma_j^2}\right) \tag{16}$$

**Training RBF Parameters**

The design and training of an RBF network as in Eq. 1 involves appropriate selection of the following parameters:

- the type of basis function $\phi$,
- the associated widths $\sigma$,
- the number of basis functions $M$,
- the centre locations $\mu_j$, and
- the weights $w_j$.

In many cases Gaussians or other bell-shaped functions with compact support are used; thin-plate splines have also been used successfully in other function approximation problems. In many scenarios the number of functions and their type are preselected, so training the RBF network involves determining three main parameters, namely *the centres, the widths and the weights, so as to minimize a suitable cost function*, which in most cases is a non-convex optimization problem. The training of an RBF network can either be supervised, where the prediction is compared with the expected output, or partly unsupervised, as in the two-stage training.

**Supervised Training**

It is possible to use a gradient descent technique to minimize the cost function and simultaneously update the training parameters; however, other training algorithms, including unsupervised ones, can also be used.

The sum-of-squares cost function to minimize can be expressed as

$$E = \sum_{n} E^n \tag{17}$$

such that:

$$E^n = \frac{1}{2} \sum_{k} \left( t_k^n - y_k(x^n) \right)^2 \tag{18}$$

where $t_k^n$ is the target value of output unit $k$, and the $x^n$ are the input vectors.

If Gaussian basis functions are used, minimizing the cost function yields the following parameter updates, obtained from Eq. 16:

$$\Delta w_{kj} = \eta_1 \left( t_k^n - y_k(x^n) \right) \phi_j(x^n) \tag{19}$$

$$\Delta \mu_j = \eta_2\, \phi_j(x^n)\, \frac{x^n - \mu_j}{\sigma_j^2} \sum_{k} \left( t_k^n - y_k(x^n) \right) w_{kj} \tag{20}$$

$$\Delta \sigma_j = \eta_3\, \phi_j(x^n)\, \frac{\lVert x^n - \mu_j \rVert^2}{\sigma_j^3} \sum_{k} \left( t_k^n - y_k(x^n) \right) w_{kj} \tag{21}$$

where $\eta_1$, $\eta_2$ and $\eta_3$ are the learning rates for the weights, the centre locations and the widths, respectively.
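The pattern-wise updates of Eqs. 19-21 can be sketched for a single-output Gaussian RBF network. The network size, training pair and learning rates below are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_phi(x, mu, sigma):
    """Gaussian basis activation of Eq. 16."""
    return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

def sgd_step(x, t, w, mu, sigma, eta=(0.1, 0.05, 0.05)):
    """One pattern-wise update of the weights, centres and widths
    (Eqs. 19-21) for a single-output Gaussian RBF network."""
    M = len(w)
    phi = np.array([gaussian_phi(x, mu[j], sigma[j]) for j in range(M)])
    err = t - w @ phi                                # (t^n - y(x^n))
    w_new = w + eta[0] * err * phi                                    # Eq. 19
    mu_new = np.array([mu[j] + eta[1] * phi[j] * (x - mu[j])
                       / sigma[j] ** 2 * err * w[j]
                       for j in range(M)])                            # Eq. 20
    sigma_new = np.array([sigma[j] + eta[2] * phi[j]
                          * np.sum((x - mu[j]) ** 2) / sigma[j] ** 3
                          * err * w[j] for j in range(M)])            # Eq. 21
    return w_new, mu_new, sigma_new

# Repeated steps on a single training pair drive the error down.
x, t = np.array([0.5, -0.2]), 1.0
w, mu, sg = np.zeros(2), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones(2)
for _ in range(50):
    w, mu, sg = sgd_step(x, t, w, mu, sg)
phi = np.array([gaussian_phi(x, mu[j], sg[j]) for j in range(2)])
final_err = abs(t - w @ phi)
```

Note that the centre and width updates vanish while the corresponding weight is zero, which is why a sensible initialization of the weights matters in practice.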

**Two-stage training**

It is possible to update the three sets of RBF parameters in Eqs. 19, 20 and 21 simultaneously. However, this is mainly suitable for non-stationary environments or on-line settings. In most cases involving static mappings, a better estimate of the parameters is obtained by decoupling the problem into two stages [1] [12]. This method of training offers an efficient batch-mode solution that improves the quality of the final results compared with training all the parameters simultaneously.

This method involves:

- Stage 1: determining the values of the centres $\mu_j$ and the widths $\sigma_j$. In this stage only the input values $\{x^n\}$ are used, so the learning is unsupervised.
- Stage 2: a supervised stage that uses the values of $\mu_j$ and $\sigma_j$ from stage 1 to determine the weights to the output units.

**Unsupervised training of RBF centres and widths**

The locations and widths of the localized basis functions can be determined by viewing them as representations of the input data density. The following methods can be used to determine the centres and the widths:

- Random subset selection
- Clustering algorithms
- Mixture models
- Width determination heuristics
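As an illustration of the first (unsupervised) stage, the sketch below picks centres by random subset selection and sets a shared width with a common heuristic ($\sigma = d_{\max}/\sqrt{2M}$, where $d_{\max}$ is the maximum inter-centre distance); the heuristic and the sizes are assumptions made for the example, not prescribed by the text above:

```python
import numpy as np

def choose_centres_and_width(X, M, seed=0):
    """Stage-1 (unsupervised) parameter selection: pick M centres by
    random subset selection and set a shared width from the maximum
    inter-centre distance (heuristic: sigma = d_max / sqrt(2*M))."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)
    centres = X[idx]
    # Pairwise distances between the chosen centres.
    d = np.sqrt(np.sum((centres[:, None] - centres[None, :]) ** 2, axis=-1))
    sigma = d.max() / np.sqrt(2.0 * M)
    return centres, sigma

X = np.random.default_rng(1).standard_normal((100, 3))
centres, sigma = choose_centres_and_width(X, M=10)
```

A clustering algorithm such as k-means could replace the random subset selection without changing the interface of this stage.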

**Batch training of the output layer weights**

Once the basis parameters have been determined, the transformation between the inputs and the outputs of the hidden units is fixed. The network is then equivalent to a single-layer network with linear output units, and the minimization of the error in Eq. 17 can be expressed as:

$$W^{T} = \Phi^{\dagger} T \tag{22}$$

where $(T)_{nk} = t_k^n$, $(\Phi)_{nj} = \phi_j(x^n)$, and $\Phi^{\dagger} = (\Phi^T \Phi)^{-1} \Phi^T$ denotes the pseudo-inverse of $\Phi$. Here $\Phi$ is the design matrix and $A = \Phi^T \Phi$ its (co)variance matrix. In practice, to avoid possible problems due to ill-conditioning of $\Phi$, singular value decomposition is used to obtain the weights by direct batch-mode matrix inversion.
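Eq. 22 can be evaluated directly with a pseudo-inverse. NumPy's `pinv` is SVD-based, which matches the remark above about guarding against ill-conditioning; the data and basis settings below are arbitrary example choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))            # training inputs
T = rng.standard_normal((50, 1))            # targets, (T)_nk = t_k^n
centres = X[:10]                            # basis parameters already fixed
sigma = 1.0

# Design matrix (Phi)_nj = phi_j(x^n) for Gaussian basis functions (Eq. 16).
d2 = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
Phi = np.exp(-d2 / (2.0 * sigma ** 2))

# Eq. 22: W^T = Phi^dagger T. np.linalg.pinv computes the pseudo-inverse
# via SVD, which is robust when Phi^T Phi is ill-conditioned.
W_T = np.linalg.pinv(Phi) @ T
```

At the least-squares solution the residual is orthogonal to the column space of $\Phi$, i.e. $\Phi^T(\Phi W^T - T) = 0$, which provides a simple correctness check.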

**Two-Stage Training vs Supervised Training**

Unsupervised learning leads to sub-optimal choices of the parameters, while supervised learning leads to optimal estimation of the centres and widths. However, the gradient descent used in supervised learning is a non-linear optimization technique and is computationally expensive. It has also been observed that if the basis functions are well localized, then only a few basis functions generate significant activation for any given input.

**Model Selection**

The primary goal of any network modelling is to model the statistical process that generates the data, rather than to find an exact fit to the training data. The emphasis is therefore on generalization, i.e. the performance of the network on data outside the training set. This can only be achieved as a trade-off between bias and variance, and the generalization error is a combination of the two. If a model is too simple it will have a high bias, i.e. on average it will differ significantly from the desired result. Likewise, if a model is too complex it will have a low bias but a high variance, and its results will be sensitive to the specific features of the training set. A balance between the bias and variance errors is therefore needed, which can be achieved by finding the right number of free parameters.

**RBF and Regularization**

Regularization is used in machine learning as a penalty added to the original cost function, both to incorporate additional information about the problem and to prevent overfitting. A variety of penalties have been studied and have been linked to weight decay [12]. Some of the regularization methods in use are, among others:

- Projection Matrix
- Cross-validation
- Ridge Regression
- Local Ridge Regression

**Normalized RBFNs**

When a normalizing factor is added to the basis function it gives a normalized radial basis function:

$$\tilde{\phi}_i(x) = \frac{\phi(\lVert x - \mu_i \rVert)}{\sum_{j=1}^{M} \phi(\lVert x - \mu_j \rVert)} \tag{23}$$

where $M$ is the total number of kernels. This form of equation arises in various settings, such as noisy-data interpolation and regression. Since the basis function activations are bounded between 0 and 1, they can be interpreted as probability values, which is especially useful in classification tasks.
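A small sketch of Eq. 23 shows why the normalized activations can be read as probabilities: they are non-negative and sum to one. The centres and width are arbitrary example values:

```python
import numpy as np

def normalized_rbf(x, centres, sigma=1.0):
    """Eq. 23: each Gaussian activation is divided by the sum of all
    kernel responses, so the outputs lie in (0, 1) and sum to one."""
    d2 = np.sum((centres - x) ** 2, axis=1)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return phi / phi.sum()

centres = np.array([[0.0], [1.0], [2.0]])
a = normalized_rbf(np.array([0.9]), centres)
```

The input 0.9 lies closest to the second centre, so the second normalized activation is the largest, behaving like a soft class membership.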

**Classification using radial basis function networks**

Empirical studies show that RBF networks have powerful function approximation capabilities, a good local structure and efficient training algorithms. They have therefore been used in a variety of medical, scientific and engineering classification tasks, including chaotic time-series prediction, speech pattern classification, image processing, medical diagnosis, nonlinear system identification, adaptive equalization in communication systems and nonlinear feature extraction. In an RBF network, data are separated into classes by placing localized kernels around each group, instead of the hyperplanes used in other algorithms such as the MLPNN, SVM and AdaBoost. Further, RBF networks share properties with other modelling methods such as function approximation, noisy interpolation and kernel regression [1]. It is therefore possible to model conditional densities for each class such that the sum of the basis functions forms a representation of the unconditional probability density of the input data. The class-conditional probability densities can be represented as:

$$p(x \mid C_k) = \sum_{j=1}^{M} p(x \mid j)\, P(j \mid C_k), \qquad k = 1, 2, \ldots, c \tag{24}$$

where $M$ is the number of density functions and $j$ is the index label. The unconditional density is obtained by summing Eq. 24 over all classes:

$$p(x) = \sum_{k=1}^{c} p(x \mid C_k)\, P(C_k) \tag{25}$$

$$= \sum_{j=1}^{M} p(x \mid j)\, P(j) \tag{26}$$

where

$$P(j) = \sum_{k=1}^{c} P(j \mid C_k)\, P(C_k) \tag{27}$$

Applying Bayes' theorem to Eq. 24 and Eq. 26 gives the posterior probabilities, which lead to a normalized RBFN:

$$P(C_k \mid x) = \sum_{j=1}^{M} w_{kj}\, \tilde{\phi}_j(x) \tag{28}$$

The basis functions $\tilde{\phi}_j$ are given by:

$$\tilde{\phi}_j(x) = \frac{p(x \mid j)\, P(j)}{\sum_{i=1}^{M} p(x \mid i)\, P(i)} = P(j \mid x) \tag{29}$$

The second-layer weights (i.e. the hidden-to-output weights) are given by:

$$w_{kj} = \frac{P(j \mid C_k)\, P(C_k)}{P(j)} = P(C_k \mid j) \tag{30}$$

**Kalman Filter Algorithm and RBF Optimization**

The Kalman filter is a mathematical method that estimates the state of a dynamic system from a series of noisy measurements and other inaccuracies affecting the modelled system. It minimizes the mean squared error and can be used to estimate the past, present and future states of a system, based on the known component of noise in the measurements and the known component of disturbance to the system. It uses Bayesian inference to estimate a joint probability distribution over the variables of interest. The concept of the Kalman filter can be represented as a block diagram, as shown in Figure 2 below:

Figure 2 Architecture of a Kalman Filter Algorithm

When modelling a linear dynamic system, the state $x_t$ can be represented mathematically as:

$$x_t = F_t x_{t-1} + B_t u_t + w_t \tag{31}$$

where $x_t$ is the state vector of the process at time $t$, $u_t$ is the vector containing the control input, $F_t$ is the state transition matrix applied to the system state at time $t-1$, $B_t$ is the control input matrix that applies the effect of the input, and $w_t$ is the vector containing the process noise terms for the state vector. The measurements $y_t$ can be modelled in the form:

$$y_t = H_t x_t + v_t \tag{32}$$

where $y_t$ is the vector of actual measurements of $x$ at time $t$, $H_t$ is the transformation matrix between the state vector and the measurement vector, and $v_t$ is the vector containing the associated measurement error.
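For a scalar state with no control input ($B_t u_t = 0$ and $F_t = H_t = 1$), the predict/update cycle implied by Eqs. 31-32 reduces to a few lines. The noise settings and data below are arbitrary illustrative choices:

```python
import numpy as np

def kalman_step(x_est, P, y, F=1.0, H=1.0, Q=1e-5, R=0.04):
    """One predict/update cycle of the scalar Kalman filter for
    Eqs. 31-32 with no control input (B u_t = 0)."""
    # Predict
    x_pred = F * x_est
    P_pred = F * P * F + Q
    # Update
    K = P_pred * H / (H * P_pred * H + R)       # Kalman gain
    x_new = x_pred + K * (y - H * x_pred)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new

# Estimate a constant state (true value 1.0) from noisy measurements.
rng = np.random.default_rng(3)
measurements = 1.0 + 0.2 * rng.standard_normal(200)
x_est, P = 0.0, 1.0
for y in measurements:
    x_est, P = kalman_step(x_est, P, y)
```

The estimate converges towards the true value while the error covariance $P$ shrinks, illustrating the minimum-mean-squared-error behaviour described above.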

**Kalman Filter as an optimization algorithm**

Recent reviews show that the Kalman filter algorithm has been used to train neural network models [4] [5] [6] [7] and in other system-estimation tasks with promising results, yet with various challenges [8]. Similarly, unscented Kalman filter models have been used to estimate the dynamic states of various systems [13] [14] [15] [16]. The main purpose of the Kalman filter algorithm is to minimize the mean squared error between the actual and estimated data, as in Eq. 18; in this respect it is analogous to the *chi-square* merit function used in least-squares fitting [17]. In this section the emphasis is on how the extended Kalman filter algorithm can be applied to minimize the errors when training a radial basis function network. Derivations and reviews of the EKF are widely available in the literature [18] [19] [20] [21].

The Kalman filter can be used to optimize the weight matrix and centres of an RBF network as a least-squares minimization problem. For a nonlinear finite-dimensional discrete-time system, the state and measurements can be modelled as:

$$\theta_{k+1} = f(\theta_k) + \omega_k \tag{33}$$

$$y_k = h(\theta_k) + v_k \tag{34}$$

where the vector $\theta_k$ is the state of the system at time $k$, $\omega_k$ is the process noise, $y_k$ is the observation vector, $v_k$ is the observation noise, and $f(\theta_k)$ and $h(\theta_k)$ are the nonlinear vector functions of the state and process respectively.

The system dynamic models $f(\theta_k)$ and $h(\theta_k)$ are assumed known, so the EKF can be used as the standard method of choice to achieve a recursive, approximate maximum-likelihood estimate of the state $\theta_k$ [18]. The state and output white noise terms $\omega_k$ and $v_k$ are uncorrelated (zero-mean, normally distributed variables) with covariance matrices $Q$ and $R$ respectively. Assuming that the covariances of the two noise models are stationary over time, they can be modelled as:

$$Q = E\left[\omega_k \omega_k^T\right] \tag{35}$$

$$R = E\left[v_k v_k^T\right] \tag{36}$$

$$\mathrm{MSE} = E\left[e_k e_k^T\right] = P_k \tag{37}$$

where $P_k$ is the error covariance matrix at time $k$.

Assuming that the functions in Eq. 33 and Eq. 34 are sufficiently smooth, they can be expanded around the estimate $\hat{\theta}_k$ using a first-order Taylor series:

$$f(\theta_k) \approx f(\hat{\theta}_k) + F_k\, (\theta_k - \hat{\theta}_k) \tag{38}$$

$$h(\theta_k) \approx h(\hat{\theta}_k) + H_k^T\, (\theta_k - \hat{\theta}_k) \tag{39}$$

where

$$F_k = \left. \frac{\partial f(\theta)}{\partial \theta} \right|_{\theta = \hat{\theta}_k} \tag{40}$$

$$H_k^T = \left. \frac{\partial h(\theta)}{\partial \theta} \right|_{\theta = \hat{\theta}_k} \tag{41}$$

Neglecting the higher-order terms of the Taylor series and substituting Eq. 38 and Eq. 39 into Eq. 33 and Eq. 34 respectively, the system can be approximated as:

$$\theta_{k+1} = F_k \theta_k + \omega_k + \phi_k \tag{42}$$

$$y_k = H_k^T \theta_k + v_k + \varphi_k \tag{43}$$

where $\phi_k$ and $\varphi_k$ collect the linearization residual terms.

The desired value $\hat{\theta}_k$ can be estimated through the recursion in [8] [21]:

$$\hat{\theta}_k = f(\hat{\theta}_{k-1}) + K_k \left( y_k - h(\hat{\theta}_{k-1}) \right) \tag{44}$$

$$K_k = P_k H_k \left( R + H_k^T P_k H_k \right)^{-1} \tag{45}$$

$$P_{k+1} = F_k \left( P_k - K_k H_k^T P_k \right) F_k^T + Q \tag{46}$$

where $K_k$ is the Kalman gain and $P_k$ is the covariance matrix of the estimation error.
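The recursion of Eqs. 44-46 can be illustrated on a scalar system with $f$ the identity and an example measurement function $h(\theta) = \theta^2$; the measurement model, data and noise settings are assumptions made only for this sketch:

```python
import numpy as np

def ekf_step(theta, P, y, h, h_prime, Q=1e-3, R=0.05):
    """Scalar EKF recursion of Eqs. 44-46 with f = identity (F = 1)."""
    Hk = h_prime(theta)                        # scalar Jacobian, Eq. 41
    K = P * Hk / (R + Hk * P * Hk)             # Eq. 45
    theta_new = theta + K * (y - h(theta))     # Eq. 44
    P_new = P - K * Hk * P + Q                 # Eq. 46 with F = 1
    return theta_new, P_new

# Recover theta = 2 from noisy observations of h(theta) = theta^2.
rng = np.random.default_rng(4)
true_theta = 2.0
theta, P = 1.0, 1.0
for _ in range(300):
    y = true_theta ** 2 + 0.1 * rng.standard_normal()
    theta, P = ekf_step(theta, P, y,
                        h=lambda t: t * t, h_prime=lambda t: 2.0 * t)
```

The small artificial process noise $Q$ keeps the gain from collapsing to zero, the same stabilizing role it plays when the EKF is used as a trainer later in this chapter.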

**Optimization of RBF using Kalman Filter**

Just as the EKF has been used to train neural networks and other algorithms, it can also be used to train a radial basis function network. In this section we describe the derivative functions of the RBF network, some of its properties, and how these can be integrated with the EKF for optimization purposes.

**Derivatives of the Radial Basis Function**

Where the basis functions are fixed and the weights are adaptable, as shown in Fig. 1, the derivative function of the network is a linear combination of the derivatives of the radial basis functions [22]. There are several main ways in which the function $g(\cdot)$ at the hidden layer can be represented [8] [22]. The common choices are:

1. The multiquadric function

$$g(v) = (v^2 + \beta^2)^{1/2} \tag{47}$$

2. The inverse multiquadric function

$$g(v) = (v^2 + \beta^2)^{-1/2} \tag{48}$$

3. The Gaussian function

$$g(v) = \exp\!\left(-\frac{v}{\beta^2}\right) \tag{49}$$

4. The thin-plate spline function

$$g(v) = v \log v \tag{50}$$

where $\beta$ is a real constant.

The spline in Eq. 50, among others, can be represented as a thin-plate spline, a spline with tension or a regularized spline [xxx]. Karayiannis [23] observed that since the RBF prototypes are generally interpreted as the centres of the receptive fields, the hidden-layer functions have the following properties:

- the response of a hidden neuron is always positive;
- the response of a hidden neuron becomes stronger as the input approaches the prototype;
- the response of a hidden neuron becomes more sensitive to the input as the input approaches the prototype.

Taking the above properties into consideration, the RBF hidden-layer function can be expressed as [8] [23]

$$g(v) = \left[ g_0(v) \right]^{\frac{1}{1-p}} \tag{51}$$

where $p$ is a real number and $g_0(v)$ is a generator function. It has been shown that if $p$ is greater than 1, then the generator function $g_0(v)$ should satisfy a number of conditions, as detailed in [8]. Karayiannis [23], in *Reformulated Radial Basis Neural Networks Trained by Gradient Descent*, shows that a generator function satisfying these conditions is the linear function:

$$g_0(v) = av + b \tag{52}$$

where $a > 0$ and $b \geq 0$. It has been observed that if $a = 1$ and $p = 3$, the hidden-layer function reduces to the inverse multiquadric form of Eq. 48.

**Optimization of RBF based on derivatives**

The RBF architecture in Fig. 1, with the hidden-layer function $g(\cdot)$ of the form in Eq. 51, can be expressed in matrix form as follows:

$$\hat{y} = \begin{pmatrix} w_{10} & w_{11} & \cdots & w_{1c} \\ w_{20} & w_{21} & \cdots & w_{2c} \\ \vdots & \vdots & & \vdots \\ w_{n0} & w_{n1} & \cdots & w_{nc} \end{pmatrix} \begin{pmatrix} 1 \\ g(\lVert x - v_1 \rVert^2) \\ \vdots \\ g(\lVert x - v_c \rVert^2) \end{pmatrix} \tag{53}$$

If we denote

$$\begin{pmatrix} w_{10} & w_{11} & \cdots & w_{1c} \\ w_{20} & w_{21} & \cdots & w_{2c} \\ \vdots & \vdots & & \vdots \\ w_{n0} & w_{n1} & \cdots & w_{nc} \end{pmatrix} = \begin{pmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_n^T \end{pmatrix} = W \tag{54}$$

and consider a training set of $M$ desired input-output pairs $\{x^i, y^i\}$, $i = 1, \ldots, M$, then we can write:

$$\begin{pmatrix} \hat{y}^1 & \cdots & \hat{y}^M \end{pmatrix} = W \begin{pmatrix} 1 & \cdots & 1 \\ g(\lVert x^1 - v_1 \rVert^2) & \cdots & g(\lVert x^M - v_1 \rVert^2) \\ \vdots & & \vdots \\ g(\lVert x^1 - v_c \rVert^2) & \cdots & g(\lVert x^M - v_c \rVert^2) \end{pmatrix} \tag{55}$$

Representing the right-hand side of Eq. 55 with the notation

$$h_{0k} = 1 \quad \text{for } k = 1, \ldots, M \tag{56}$$

$$h_{jk} = g(\lVert x^k - v_j \rVert^2) \quad \text{for } k = 1, \ldots, M,\; j = 1, \ldots, c \tag{57}$$

the hidden-layer activations form the matrix

$$\begin{pmatrix} h_{01} & \cdots & h_{0M} \\ h_{11} & \cdots & h_{1M} \\ \vdots & & \vdots \\ h_{c1} & \cdots & h_{cM} \end{pmatrix} = \begin{pmatrix} h^1 & \cdots & h^M \end{pmatrix} = H \tag{58}$$

Therefore, Eq. 55 can be represented as:

$$\hat{Y} = WH \tag{59}$$

When using gradient descent to minimize the training error, the error function can be defined as:

$$E = \frac{1}{2} \left\lVert Y - \hat{Y} \right\rVert_F^2 \tag{60}$$

where:

- $Y$ is the matrix of the expected (target) values for the RBF network, and
- $\lVert \cdot \rVert_F^2$ is the square of the *Frobenius norm* of the matrix, which equals the sum of the squares of the elements of the matrix.

The derivatives of $E$ with respect to the weights and the prototypes [23] can be expressed as

$$\frac{\partial E}{\partial w_i} = \sum_{k=1}^{M} \left( \hat{y}_{ik} - y_{ik} \right) h^k \quad \text{for } i = 1, \ldots, n \tag{61}$$

$$\frac{\partial E}{\partial v_j} = \sum_{k=1}^{M} 2\, g'(\lVert x^k - v_j \rVert^2)\, (x^k - v_j) \sum_{i=1}^{n} \left( y_{ik} - \hat{y}_{ik} \right) w_{ij} \quad \text{for } j = 1, \ldots, c \tag{62}$$

where $\hat{y}_{ik}$ is the element in the $i$th row and $k$th column of the matrix $\hat{Y}$ in Eq. 59, and $y_{ik}$ is the corresponding element of the matrix $Y$.

To optimize the RBF network with respect to the rows of the weight matrix $W$ and the prototypes $v_j$, we iteratively compute the partial derivatives in Eqs. 61 and 62 and perform the following updates:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} \quad \text{for } i = 1, \ldots, n \tag{63}$$

$$v_j \leftarrow v_j - \eta \frac{\partial E}{\partial v_j} \quad \text{for } j = 1, \ldots, c \tag{64}$$

where $\eta$ is the step size of the gradient descent method. The optimization stops when $w_i$ and $v_j$ reach a local minimum.
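The matrix formulation of Eqs. 59-64 can be sketched with the inverse multiquadric hidden function of Eq. 48 applied to squared distances; the data sizes and step size are arbitrary example choices:

```python
import numpy as np

# Batch gradient descent for Eqs. 59-64 using the inverse multiquadric
# hidden function g(v) = (v + beta^2)^(-1/2) applied to v = ||x - v_j||^2.
beta = 1.0
g = lambda v: (v + beta ** 2) ** -0.5
g_prime = lambda v: -0.5 * (v + beta ** 2) ** -1.5

rng = np.random.default_rng(5)
X = rng.standard_normal((2, 20))       # m=2 inputs, M=20 samples as columns
Y = rng.standard_normal((1, 20))       # n=1 output row
V = rng.standard_normal((2, 3))        # c=3 prototypes as columns
W = np.zeros((1, 4))                   # weight matrix incl. bias column w_{i0}

def forward(W, V, X):
    d2 = np.sum((X[:, None, :] - V[:, :, None]) ** 2, axis=0)  # (c, M)
    H = np.vstack([np.ones(X.shape[1]), g(d2)])                # Eq. 58
    return W @ H, H, d2                                        # Eq. 59

def error(W, V):
    Y_hat, _, _ = forward(W, V, X)
    return 0.5 * np.sum((Y - Y_hat) ** 2)                      # Eq. 60

e_init = error(W, V)
eta = 0.01
for _ in range(200):
    Y_hat, H, d2 = forward(W, V, X)
    W = W - eta * (Y_hat - Y) @ H.T                            # Eqs. 61, 63
    for j in range(V.shape[1]):
        s = np.sum((Y - Y_hat) * W[:, j + 1][:, None], axis=0) # sum_i (y-yhat)w_ij
        dE_dvj = np.sum(2.0 * g_prime(d2[j]) * s * (X - V[:, [j]]), axis=1)
        V[:, j] = V[:, j] - eta * dE_dvj                       # Eqs. 62, 64
e_final = error(W, V)
```

With a small step size the error in Eq. 60 decreases monotonically toward a local minimum, which is the stopping condition described above.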

**Optimization of RBF using Extended Kalman Filter**

Applying a similar approach to [8] [11] [24], we can view the optimization of the RBF weights $W$ and the prototypes $v_j$ as a weighted least-squares minimization problem, in which the error vector is the difference between the RBF outputs and the expected target values. Using the RBF network of Fig. 1 with $m$ inputs, $c$ prototypes and $n$ outputs, let $y$ represent the target vector for the RBF outputs, and let $h(\hat{\theta}_k)$ denote the actual outputs at the $k$th iteration of the optimization algorithm. Then $y$ and $h(\hat{\theta}_k)$ can be represented as:

$$y = \begin{pmatrix} y_{11} & \cdots & y_{1M} & \cdots & y_{n1} & \cdots & y_{nM} \end{pmatrix}^T \tag{65}$$

$$h(\hat{\theta}_k) = \begin{pmatrix} \hat{y}_{11} & \cdots & \hat{y}_{1M} & \cdots & \hat{y}_{n1} & \cdots & \hat{y}_{nM} \end{pmatrix}_k^T \tag{66}$$

where $n$ is the dimension of the RBF output and $M$ is the number of training samples.

The RBF optimization problem can be represented in Kalman filter form by letting the elements of the weight matrix $W$ and of the prototypes $v_j$ constitute the state of a nonlinear system whose output is the output of the RBF network. The state of the system can thus be represented as:

$$\theta = \begin{pmatrix} w_1^T & \cdots & w_n^T & v_1^T & \cdots & v_c^T \end{pmatrix}^T \tag{67}$$

In Eq. 67 the vector $\theta$ consists of all $n(c+1) + mc$ RBF parameters arranged in a linear array, and the nonlinear system to which the KF can be applied is:

$$\theta_{k+1} = \theta_k \tag{68}$$

$$y_k = h(\theta_k) \tag{69}$$

where $h(\theta_k)$ is the RBF nonlinear mapping between its input and output parameters. However, to make the filter algorithm stable, artificial process noise $\varphi_k$ and measurement noise $v_k$ are added to the model, as in [19] [8] [**xxx more citations**]. Eq. 68 and Eq. 69 then become:

$$\theta_{k+1} = \theta_k + \varphi_k \tag{70}$$

$$y_k = h(\theta_k) + v_k \tag{71}$$

In this form we can apply Eqs. 44-46 to Eq. 70 and Eq. 71, where:

- $f(\cdot)$ is the identity,
- $y_k$ is the target output of the RBF network,
- $h(\hat{\theta}_k)$ is the actual output of the RBF network given the RBF parameters at the $k$th iteration of the Kalman recursion,
- $H_k$ is the matrix of partial derivatives of the RBF output with respect to the RBF parameters, and
- the $Q$ and $R$ matrices are the tuning parameters, i.e. the covariance matrices of the artificial noise processes $\varphi_k$ and $v_k$ respectively.

It can be shown that the partial derivatives of the radial basis function output with respect to its parameters [8] [25] [22] can be represented as:

$$H_k = \begin{pmatrix} H_w \\ H_v \end{pmatrix} \tag{72}$$

where

$$H_w = \begin{pmatrix} H & 0 & \cdots & 0 \\ 0 & H & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & H \end{pmatrix} \tag{73}$$

and $H_v$ contains the derivatives with respect to the prototypes; its block corresponding to prototype $v_j$, output $i$ and training sample $k$ is

$$-2\, w_{ij}\, g'(\lVert x^k - v_j \rVert^2)\, (x^k - v_j) \tag{74}$$

where:

- $H$ is the $(c+1) \times M$ matrix of Eq. 58,
- $w_{ij}$ is the element in the $i$th row and $j$th column of the weight matrix $W$,
- $x^k$ is the $k$th input vector,
- $v_j$ is the $j$th prototype vector,
- $H_w$ is an $n(c+1) \times nM$ matrix (as in Eq. 73),
- $H_v$ is an $mc \times nM$ matrix (as in Eq. 74), and
- $H_k$ is an $\left( n(c+1) + mc \right) \times nM$ matrix.

As in [8] [11], it is now possible to execute the recursion of Eqs. 44-46 using the extended Kalman filter to determine the weight matrix $W$ and the prototypes $v_j$.
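As a numerical illustration, the sketch below packs a tiny RBF network's weights and prototypes into the state vector $\theta$ of Eq. 67 and runs the recursion of Eqs. 44-46, using a finite-difference Jacobian in place of the analytic $H_k$ of Eqs. 72-74 (a simplification made for the example; the network size, data and noise settings are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((5, 1))                 # M=5 one-dimensional inputs
y = np.sin(X[:, 0])                             # target outputs

def h(theta, c=2):
    """Stacked RBF outputs over all samples (Eq. 66); theta packs
    [w_0, w_1, ..., w_c, v_1, ..., v_c] as in Eq. 67."""
    w, V = theta[:c + 1], theta[c + 1:].reshape(c, 1)
    d2 = np.sum((X[:, None, :] - V[None, :, :]) ** 2, axis=-1)   # (M, c)
    Hmat = np.hstack([np.ones((len(X), 1)), np.exp(-d2)])
    return Hmat @ w

def jacobian(theta, eps=1e-6):
    """Finite-difference stand-in for the analytic H_k of Eqs. 72-74."""
    base = h(theta)
    J = np.zeros((len(base), len(theta)))
    for i in range(len(theta)):
        tp = theta.copy()
        tp[i] += eps
        J[:, i] = (h(tp) - base) / eps
    return J

theta = 0.1 * rng.standard_normal(5)            # initial parameter estimate
P = np.eye(5)
Q, R = 1e-4 * np.eye(5), 1e-2 * np.eye(len(X))

res_init = np.mean((y - h(theta)) ** 2)
for _ in range(30):                              # Kalman recursion, Eqs. 44-46
    J = jacobian(theta)                          # rows: d h / d theta
    K = P @ J.T @ np.linalg.inv(R + J @ P @ J.T)
    theta = theta + K @ (y - h(theta))
    P = P - K @ J @ P + Q
res_final = np.mean((y - h(theta)) ** 2)
```

Since the weights enter $h(\theta)$ linearly, the filter fits them quickly, while the prototypes are refined more gradually; the artificial noise $Q$ plays the stabilizing role described in Eq. 70.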

**AdaBoost as an ensemble technique: for combining trained classifiers**

AdaBoost is an ensemble, or meta-learning, method that can be used with other learning algorithms to generate strong classifiers out of weak classifiers. The idea behind AdaBoost is that a better algorithm can be built by combining multiple instances of a simple algorithm, where each instance is trained on the same training data but with different weights assigned to each example. AdaBoost iteratively trains several base classifiers, each paying more attention to the data misclassified in the previous round. At each iteration AdaBoost calls a simple learning algorithm that returns a classifier and assigns a weight coefficient to it: base classifiers with smaller error receive larger weights, and those with larger error receive smaller weights. AdaBoost then combines the base classifiers linearly into a weighted sum as the final output of the boosted classifier. Like other algorithms, AdaBoost has its drawbacks: it is sensitive to noisy data and outliers. However, it can be less susceptible to overfitting than other learning algorithms such as neural networks and SVMs. Many variants of the algorithm have been introduced over the past decades to address one problem or another; in this research, however, we consider only the original AdaBoost algorithm.

**Brief description of AdaBoost**

The derivation and theory of AdaBoost have been covered extensively in [26] [27] [28]. The description here follows Schapire [29]: assume we are given $M$ labelled training examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, where $x_i \in \mathbb{R}^m$ and the labels $y_i \in \{-1, +1\}$. On each iteration $t = 1, \ldots, T$, a distribution $D_t$ is computed over the $M$ training examples, and a given weak learner is applied to find a weak hypothesis $h_t : \mathbb{R}^m \to \{-1, +1\}$. The aim of the weak learner is to find a weak hypothesis with low weighted error $\varepsilon_t$ relative to $D_t$. The number of iterations determines the number of weak classifiers produced during training. The final classifier $H(x)$, as shown in Figs. 3 and 4, is computed as a weighted majority vote of the weak hypotheses $h_t$, where each hypothesis is assigned a weight $\alpha_t$. The final classifier is given by:

Hx=sign∑t=1Tαthtx | (75) |

The accuracy of the hypothesis is calculated as an error measure this is given by:

εt=Pri~Dthti≠ yi | (76) |

The weight of the hypothesis is a linear combination of all the hypotheses of the participating experts, it is given by:

αt= 12ln1-εtεt | (77) |

The distribution vector

Dt

is expressed as

Dt+1i=Dtiexp-αtyihtxi Zt | (78) |

Zt

is a normalization factor such that the weights sum up to 1, which makes

Dt+1

a normal distribution. The pseudocode for AdaBoost is shown in figure 2.
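Taken together, Eqs. 75-78 define a complete training loop. The following is a minimal sketch in Python (not the Matlab implementation planned in this research), using decision stumps as a stand-in weak learner; all names here (`stump_train`, `adaboost`, and so on) are illustrative assumptions rather than the project's own code.

```python
import numpy as np

def stump_train(X, y, D):
    """Exhaustively search one-feature thresholds for the weak
    hypothesis minimising the weighted error of Eq. 76."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(X, j, thr, sign):
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, T=10):
    n = len(y)
    D = np.full(n, 1.0 / n)              # initial uniform distribution D_1
    ensemble = []
    for t in range(T):
        err, j, thr, s = stump_train(X, y, D)
        err = max(err, 1e-10)            # guard against log(0) when err = 0
        alpha = 0.5 * np.log((1 - err) / err)   # Eq. 77
        h = stump_predict(X, j, thr, s)
        D = D * np.exp(-alpha * y * h)          # Eq. 78 numerator
        D = D / D.sum()                         # divide by Z_t to renormalise
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    # Eq. 75: H(x) = sign(sum_t alpha_t * h_t(x))
    agg = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(agg)
```

In the proposed model the stump learner would be replaced by an RBF network trained with the EKF; the boosting loop itself is unchanged.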

During training there is always a deviation between the predicted values and the expected values; this expected error is measured as a sum of squared errors. The expected error over the committee of $N$ members can be expressed as in Eq. 79:

$E_{err} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i(x) - h_i(x)\right)^2$ | (79) |
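As a small illustration of Eq. 79, a hypothetical helper (not part of the project code) averaging the squared deviation of each committee member's prediction from the target at a single input might look like:

```python
import numpy as np

def expected_error(y_x, member_preds):
    """Eq. 79: mean squared deviation of each of the N committee
    members' predictions h_i(x) from the target value y(x)."""
    member_preds = np.asarray(member_preds, dtype=float)
    return np.mean((y_x - member_preds) ** 2)
```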

**The Cost function/Weighting**

**Weighting** – AdaBoost uses a weighting function that enables each new classifier to focus on erroneous classifications. During its iterations AdaBoost sequentially trains several new classifiers and assigns each an output weight that, in principle, reflects the error made by that classifier. This enables new classifiers to pay more attention to data that were misclassified by previous classifiers.

**Training set selection** – After each round of training AdaBoost increases the weights on the misclassified examples, so that examples with higher weights receive more emphasis in the next iteration. The equation for the output weight update is shown in Eq. 77. After computing this weight, AdaBoost updates the training example weights using Eq. 78, which gives the update for the $i$th example at iteration $t$, where $D_t$ is a vector of weights with one weight per training example. The equation is evaluated for each training sample. The exponential loss function in the pseudocode scales each weight up by a factor of $e^{\alpha_t}$ for misclassified examples and down by $e^{-\alpha_t}$ for correctly classified ones.
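The reweighting step of Eq. 78 can be sketched as a small hypothetical helper, where `D`, `y` and `h` are the current weight vector, the labels and the weak-hypothesis outputs respectively:

```python
import numpy as np

def reweight(D, alpha, y, h):
    """Eq. 78: correctly classified examples (y*h = +1) are scaled by
    exp(-alpha), misclassified ones (y*h = -1) by exp(+alpha), then the
    whole vector is renormalised by Z_t so the weights sum to 1."""
    D = D * np.exp(-alpha * y * h)
    return D / D.sum()
```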

**Exponential loss function** – AdaBoost attempts to minimize an exponential loss function, which is an upper bound on the average training error, by performing a greedy coordinate descent on $H(x_i)$ [28] [30]. The exponential loss function that AdaBoost attempts to minimize can be expressed as:

$L(H) = \sum_{i=1}^{m} \exp\left(-y_i H(x_i)\right)$ | (80) |
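To see why Eq. 80 upper-bounds the number of training errors, note that every misclassified example has $y_i H(x_i) \leq 0$ and therefore contributes $\exp(-y_i H(x_i)) \geq 1$ to the sum. A short sketch (illustrative only):

```python
import numpy as np

def exponential_loss(y, H_x):
    """Eq. 80: L(H) = sum_i exp(-y_i * H(x_i)), where H_x holds the
    real-valued ensemble margins H(x_i) before the sign is taken."""
    return np.sum(np.exp(-np.asarray(y) * np.asarray(H_x)))
```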

Figure 3 – The AdaBoost pseudocode

Figure 4 – A typical ensemble model showing a committee of neural networks. Each classifier $h_i$ has an associated contribution $\alpha_i$.

**Bridging EKF-trained RBF networks with ensemble AdaBoost**

Even though RBF and neural networks have proved to be effective tools in many applications, there are situations where several networks are required to produce accurate results on complex tasks. One way of achieving this is to combine the predictions of multiple network models. AdaBoost, as a technique for training learning algorithms and combining their outputs, serves this purpose. Applying the AdaBoost algorithm will therefore create an ensemble of RBF-EKF networks. This produces a stronger classifier output from the committee of RBF networks trained by the EKF, and enables the model to handle classification tasks that a single RBF network cannot handle effectively.
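The final combination of a trained RBF-EKF committee follows the weighted majority vote of Eq. 75. The sketch below assumes each trained network is exposed as a callable returning $\pm 1$ predictions; the `members`/`alphas` interface is purely a hypothetical stand-in for the trained models:

```python
import numpy as np

def combine_committee(members, alphas, X):
    """Weighted majority vote (Eq. 75) over a committee of trained
    classifiers; each member maps an input batch to {-1, +1} labels
    and contributes with weight alpha_t from Eq. 77."""
    votes = sum(a * m(X) for a, m in zip(alphas, members))
    return np.sign(votes)
```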

- Diagrammatic representation of the proposed model
- Simulation results for Simon's RBF-EKF training [8] (Matlab IDE)
- Simulation results on the same datasets using AdaBoost with decision stumps and RBF (Matlab IDE)
- Simulation: AdaBoost + decision stumps (NN, RBF) in the Matlab IDE

- We intend to use the PSO algorithm to determine and optimize the initial parameters of the RBF neural network in order to reduce its error.
- AdaBoost will then be used to train committees of RBF weak predictors.
- The committee of weak predictors will be linearly combined to produce a strong predictor with better classification performance.

# References

[1] | I. Nabney, NETLAB Algorithms for Pattern Recognition, M. Singh, Ed., London: Springer, 2002. |

[2] | F. Schwenker, H. A. Kestler and G. Palm, “Three learning phases for radial-basis-function networks,” Neural Networks, vol. 14, pp. 439-458, 2000. |

[3] | R. Kruse, C. Borgelt, F. Klawonn, C. Moewes, M. Steinbrecher and P. Held, “Radial Basis Function Networks – Part of the series Texts in Computer Science,” in Computational Intelligence, pp. 83-103. |

[4] | A. Shareef, Y. Zhu, M. Musavi and B. Shen, “Comparison of MLP Neural Networks and Kalman Filter for Localization in Wireless Sensor Networks,” in 19th IASTED International Conference: Parallel and Distributed Computing and Systems, Cambridge, MA, 2007. |

[5] | J. Sum, C.-S. Leung, G. Young and W.-K. Kan, “On Kalman Filtering Method in Neural Network training and Pruning,” IEEE Transactions on Neural Networks, vol. 10, no. 1, pp. 161-166, 1999. |

[6] | “EKF Learning for Feedforward Neural Networks,” in 2003 European Control Conference (ECC), Cambridge, UK, 2003. |

[7] | A. Krok, “The Development of Kalman Filter Learning Technique for Artificial Neural Networks,” Journal of Telecommunications and Information Technology, pp. 16-21, 2013. |

[8] | D. Simon, “Training Radial Basis Neural Networks with the Extended Kalman Filter,” Neurocomputing, vol. 48, pp. 455-475, 2002. |

[9] | H. Kamath, A. Goswami, A. Kumar, R. Aithal and P. Singh, “RBF and BPNN Combi Model Based Filter Application for Maximum Power Point Tracker of PV Cell,” in International MultiConference of Engineers and Computer Scientists, Hong Kong, 2011. |

[10] | T. Kurban and E. Beşdok, “A Comparison of RBF Neural Network Training Algorithms for Inertial Sensor Based Terrain Classification,” vol. 9, pp. 6312-6329, 12 August 2009. |

[11] | A. N. Chernodub, “Training Neural Networks for classification using the Extended Kalman Filter: A comparative study,” Optical Memory and Neural Networks, vol. 23, no. 2, pp. 96-103, 2014. |

[12] | J. Ghosh and A. Nag, “Radial Basis Function Networks 2 – New Advances in Design,” in Radial Basis Function Network 2, Physica-Verlag, 2001. |

[13] | S. Papp, K. Gyorgy, A. Kelemen and L. Jakab-Farkas, “Applying the Extended and Unscented Kalman Filters for Nonlinear State Estimation,” in The 6th edition of the Interdisciplinarity in Engineering International Conference, 2012. |

[14] | A. Attarian, J. Batze, B.Matzuka and H. Tran, “Application of the Unscented Kalman Filtering to Parameter Estimation,” Mathematical Modeling and Validation in Physiology, vol. 2064, pp. 75-88, 11 September 2012. |

[15] | S. Ramadurai, S. Kosari, H. H. King, H. J. Chizeck and B. Hannaford, “Application of Unscented Kalman Filter to a Cable Driven Surgical Robot: A Simulation Study,” in IEEE International Conference on Robotics and Automation, Saint Paul, 2012. |

[16] | K. Dróżdż and K. Szabat, “Application of Unscented Kalman Filter in adaptive control structure of two-mass system,” in Power Electronics and Motion Control Conference (PEMC), 2016. |

[17] | T. Lacey, “Tutorial: The Kalman Filter,” [Online]. Available: http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf. [Accessed 17 April 2017]. |

[18] | E. Wan and R. van der Merwe, “The Unscented Kalman Filter for Nonlinear Estimation,” in Adaptive Systems for Signal Processing, Communications, and Control Symposium, pp. 153-158, 2000. |

[19] | S. Haykin, Adaptive Filter Theory, 3rd ed., Prentice-Hall, 1996. |

[20] | M. Ribeiro, “Kalman and Extended Kalman Filters: Concept, Derivation and Properties,” CiteSeer, 2004. |

[21] | B. Anderson and J. Moore, Optimal Filtering, Englewood Cliffs, NJ: Prentice-Hall, 1979. |

[22] | N. Mai-Duy and T. Tran-Cong, “Approximation of function and its derivatives using radial basis function networks,” Applied Mathematical Modelling 2, vol. 27, pp. 197-220, 2003. |

[23] | N. Karayiannis, “Reformulated radial basis neural networks trained by gradient descent,” Trans. Neural Networks, vol. 3, pp. 657-671, 1999. |

[24] | G. Puskorius and L. Feldkamp, “Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks,” IEEE Trans. Neural Networks, vol. 5, pp. 279-297, 1994. |

[25] | R. Schaback, “A Practical Guide to Radial Basis Functions,” [Online]. Available: http://num.math.uni-goettingen.de/schaback/teaching/sc.pdf. [Accessed 18 April 2017]. |

[26] | Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997. |

[27] | T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001. |

[28] | R. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, 2012. |

[29] | R. Schapire, “Explaining AdaBoost,” in Empirical Inference, Springer, 2013, pp. 37-52. |

[30] | J. Friedman, T. Hastie and R. Tibshirani, “Additive logistic regression: a statistical view of boosting,” The Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000. |