By Kroese B., van der Smagt P.

8. HOW GOOD ARE MULTI-LAYER FEED-FORWARD NETWORKS?

2. The number of learning samples. This determines how well the training samples represent the actual function.
3. The number of hidden units. This determines the 'expressive power' of the network. For 'smooth' functions only a few hidden units are needed; for wildly fluctuating functions more hidden units will be needed.

In the previous sections we discussed learning rules such as back-propagation and other gradient-based learning algorithms, and the problem of finding the minimum error.

Figure 2: The descent in weight space: a) for small learning rate; b) for large learning rate (note the oscillations); c) with large learning rate and momentum term added.

Learning per pattern. In this scheme a pattern p is applied, E^p is calculated, and the weights are adapted (p = 1, 2, ..., P). There is empirical evidence that this results in faster convergence. Care has to be taken, however, with the order in which the patterns are taught. For example, when the same sequence is used over and over again, the network may become focused on the first few patterns.
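The per-pattern scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the single linear unit y = w*x + b, the function name `train_per_pattern`, the toy data set, and all parameter values are assumptions chosen for the example. The pattern order is reshuffled every epoch so that the same sequence is never repeated, and a momentum term is added to each weight change.

```python
import random

def train_per_pattern(patterns, lr=0.05, momentum=0.5, epochs=200, seed=0):
    """Per-pattern gradient descent on a single linear unit y = w*x + b.

    patterns is a list of (x, target) pairs. The error for pattern p is
    E_p = 0.5 * (target - y)**2, and the weights are adapted after every
    pattern instead of after the whole set. (Illustrative setup only.)
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    dw_prev, db_prev = 0.0, 0.0
    order = list(patterns)
    for _ in range(epochs):
        rng.shuffle(order)  # do not present the same sequence every epoch
        for x, t in order:
            y = w * x + b
            err = t - y
            # weight change = gradient-descent step + momentum * previous change
            dw = lr * err * x + momentum * dw_prev
            db = lr * err + momentum * db_prev
            w += dw
            b += db
            dw_prev, db_prev = dw, db
    return w, b

# Hypothetical learning set sampled from the line y = 2x - 1 on [-1, 1].
patterns = [(k / 4.0, 2.0 * (k / 4.0) - 1.0) for k in range(-4, 5)]
w, b = train_per_pattern(patterns)
```

With this noise-free learning set the weights should approach w = 2 and b = -1. Removing the `rng.shuffle` call gives the fixed-sequence variant that the text warns about.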

When eq. (31) holds for two vectors $u_i$ and $u_{i+1}$ they are said to be conjugate. Now, starting at some point $p_0$, the first minimisation direction $u_0$ is taken equal to $g_0 = -\nabla f(p_0)$, resulting in a new point $p_1$. The coefficient $\gamma_i$ is given by

$$\gamma_i = \frac{g_{i+1}^T g_{i+1}}{g_i^T g_i} \quad \text{with} \quad g_k = -\nabla f|_{p_k} \text{ for all } k \geq 0. \qquad (33)$$

Next, calculate $p_{i+2} = p_{i+1} + \lambda_{i+1} u_{i+1}$, where $\lambda_{i+1}$ is chosen so as to minimise $f(p_{i+2})$. For details, see (Stoer & Bulirsch, 1980). The process described above is known as the Fletcher-Reeves method, but there are many variants which work more or less the same (Hestenes & Stiefel, 1952; Polak, 1971; Powell, 1977).
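The iteration above can be illustrated with a small Python sketch on a quadratic test function. Several parts are assumptions and not taken from the text: the function f(p) = 0.5 p^T A p - b^T p with a symmetric positive-definite matrix A, the name `fletcher_reeves_quadratic`, and the direction update u_{i+1} = g_{i+1} + γ_i u_i, which is the standard Fletcher-Reeves form but does not appear in the passage as it stands. For a quadratic, the step λ that minimises f along u has the closed form used below; a general f would need a numerical line search instead.

```python
def fletcher_reeves_quadratic(A, b, p0, steps):
    """Minimise f(p) = 0.5 * p^T A p - b^T p using Fletcher-Reeves directions.

    A is assumed symmetric positive definite (illustrative test problem).
    Follows the sign convention of the text: g_k = -grad f at p_k.
    """
    def matvec(M, v):
        return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

    def dot(a, c):
        return sum(x * y for x, y in zip(a, c))

    p = list(p0)
    g = [bi - api for bi, api in zip(b, matvec(A, p))]  # g_0 = -grad f(p_0) = b - A p_0
    u = list(g)                                         # u_0 = g_0
    for _ in range(steps):
        gg = dot(g, g)
        if gg < 1e-24:          # gradient is numerically zero: stop
            break
        Au = matvec(A, u)
        lam = dot(g, u) / dot(u, Au)                    # exact minimiser along u
        p = [pi + lam * ui for pi, ui in zip(p, u)]
        g_new = [gi - lam * aui for gi, aui in zip(g, Au)]
        gamma = dot(g_new, g_new) / gg                  # gamma_i as in eq. (33)
        u = [gn + gamma * ui for gn, ui in zip(g_new, u)]
        g = g_new
    return p

# Hypothetical 2-D example; the minimum satisfies A p = b.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
p_min = fletcher_reeves_quadratic(A, b, [2.0, 1.0], steps=2)
```

On an n-dimensional quadratic with exact line minimisation, conjugate directions reach the minimum in at most n steps (up to rounding), which is why two steps suffice in this two-dimensional example.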

