Knowledge-based Cascade-correlation: Varying the Size and Shape of Relevant Prior Knowledge

Thomas R. Shultz 1 and Francois Rivest 2

1 Department of Psychology and School of Computer Science, McGill University, Montreal, Quebec, Canada H3A 1B1

2 School of Computer Science, McGill University, Montreal, Quebec, Canada H3A 1B1

Summary. Artificial neural networks typically ignore the role of knowledge in learning by starting from random connection weights. A new algorithm, knowledge-based cascade-correlation (KBCC), finds, adapts, and uses its relevant knowledge to speed learning. We demonstrate its performance on small, clear problems involving decisions about whether a two-dimensional input falls within a nonlinear distribution of a particular size, shape, and location. Relevance of prior knowledge to a new target problem was implemented by systematic variations in the width of the distribution. The more relevant the prior inexact knowledge was, the more likely KBCC was to recruit it for solution of the target problem and the faster new learning proceeded.

Key words. Knowledge-based learning, transfer, cascade-correlation

1 Knowledge and Learning

Most learning in neural networks is done without the influence of previous knowledge, starting with random connection weights. In sharp contrast, when people learn, they make extensive use of their existing knowledge (Pazzani 1991; Wisniewski 1995). Learning with prior knowledge is responsible for the ease and speed with which people learn new material, and for occasional interference effects.

Cascade-correlation (CC) is a generative learning algorithm that learns not only by adjusting weights, but also by recruiting new hidden units into the network as needed in order to reduce error (Fahlman and Lebiere 1990). CC is faster than learning by the standard neural learning algorithm known as back-propagation (BP) and makes a better fit to a variety of psychological data on cognitive development than does BP (Buckingham and Shultz 2000).

In extending CC we devised a new algorithm, knowledge-based cascade-correlation (KBCC), that allows previously-learned source networks to compete with each other and with single hidden units for recruitment into a target network.

KBCC treats its existing networks like untrained single units, training weights on the inputs of source networks to increase the correlations of their outputs with the target network's error (Shultz and Rivest 2001). The best correlating candidate is installed into the target network, and the other candidates are discarded. Output weights from the new recruit are then adjusted to integrate it into a solution of the target problem. Previous work with KBCC has shown that it effectively recruits and uses its knowledge to speed learning. Preliminary experiments involved learning whether a pair of Cartesian coordinates fell within a particular geometric shape. Source networks varied in terms of translation or rotation of a target geometric shape. Generally, the more relevant the source knowledge was, the more likely it was to be recruited and the more it speeded up learning of a target problem (Shultz and Rivest 2000a, 2001). In this paper we describe a test of KBCC networks on similar problems in which the size of the geometric shape is varied.
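To make the recruitment step concrete, here is a minimal sketch, in Python with NumPy, of the scoring idea: a candidate (a single unit or a previously trained source network) is scored by how strongly its outputs covary with the target network's residual error. The function name, array shapes, and lack of normalization are our illustrative assumptions, not the authors' implementation.

import numpy as np

def candidate_score(candidate_out, residual_error):
    """Score one candidate for recruitment.
    candidate_out:  (n_patterns, n_candidate_outputs) activations of the candidate.
    residual_error: (n_patterns, n_network_outputs) target-network error per pattern.
    Returns the summed magnitude of covariance between candidate outputs and
    network error; candidate input weights are trained to maximize this."""
    v = candidate_out - candidate_out.mean(axis=0)    # center candidate activations
    e = residual_error - residual_error.mean(axis=0)  # center network errors
    return np.abs(v.T @ e).sum()

# The best-scoring candidate is installed in the target network; the rest
# are discarded.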

2 General Method

The input space was a square centered at the origin with sides of length 2. Target outputs specified that the output should be 0.5 if the input point was inside the shape and -0.5 if the point was outside the shape. Networks were trained with a set of 225 patterns forming a regular 15 x 15 grid covering the input space, including the boundary. We designed two experiments to assess the impact of source knowledge on the learning of a target task. In the first, we varied the relevance of a single source of knowledge to determine whether KBCC would learn faster with more relevant source knowledge. In the second, there were two sources of knowledge that varied in relevance to a new target problem, to determine whether KBCC would find and use the more relevant source. In both experiments, knowledge relevance was varied by differences in the width or shape of the two-dimensional geometric figures. The target figure in the second, knowledge-guided phase of learning was a rectangle, as were the figures in several of the knowledge conditions. Rectangles were always centered at (0, 0) in the input space and always had a height of 22/14. Twenty networks were run in each condition of each experiment.
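As an illustration of this setup, the sketch below (a minimal Python/NumPy rendering; the names are our assumptions) builds the 225-pattern training grid and its targets for a centered rectangle:

import numpy as np

# 15 x 15 grid over the square [-1, 1] x [-1, 1], including the boundary;
# grid spacing is 2/14. Height is fixed at 22/14; WIDTH varies by condition
# (6/14 narrow, 14/14 medium, 22/14 wide).
WIDTH, HEIGHT = 22 / 14, 22 / 14

xs = np.linspace(-1.0, 1.0, 15)
grid = np.array([(x, y) for x in xs for y in xs])   # 225 input patterns

def inside(points, width=WIDTH, height=HEIGHT):
    """True where a point falls within the centered, axis-aligned rectangle."""
    return (np.abs(points[:, 0]) <= width / 2) & (np.abs(points[:, 1]) <= height / 2)

targets = np.where(inside(grid), 0.5, -0.5)
print(grid.shape, int((targets == 0.5).sum()))      # (225, 2) 121

With these settings, 121 of the 225 grid points fall inside the wide (22/14) rectangle and 33 fall inside the narrow (6/14) one, matching the pattern counts reported in the results below.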

3 One Source of Knowledge

3.1 Method

In this experiment, several knowledge conditions varied the width of the source rectangle. Because scaling the width up vs. down did not produce the same results, we included conditions with either narrow or wide target rectangles. The various conditions, which also included irrelevant source knowledge in the form of a circle and no knowledge at all, are described in Table 1.

Table 1
Single-source Knowledge Conditions

Condition          Description                    Relation to wide target   Relation to narrow target
Narrow rectangle   Rectangle of width 6/14        Far relevant              Exact
Medium rectangle   Rectangle of width 14/14       Near relevant             Near relevant
Wide rectangle     Rectangle of width 22/14       Exact                     Far relevant
Circle             Center at (0, 0), radius 0.5   Irrelevant                Irrelevant
None               No knowledge                   None                      None

3.2 Results

A factorial ANOVA of the epochs required to learn when the narrow rectangle was the target yielded a main effect of knowledge condition, F(4, 95) = 103, p < .0001. Figure 1 shows the mean epochs to learn the target problem, with standard deviation bars and homogeneous subsets based on the LSD post hoc comparison method. Examination of the means indicates that relevant knowledge, regardless of distance from the target, enabled faster learning than did irrelevant knowledge or no knowledge at all. This suggests that learning speed when scaling down in width from a wider source is largely unaffected by the amount of scaling. The relatively few epochs required in target learning demonstrate that scaling down in width is fairly easy for these networks to learn.

A factorial ANOVA of the epochs required to learn when the wide rectangle was the target also produced a main effect of knowledge condition, F(4, 95) = 74, p < .0001, but with a different pattern of means. Figure 2 shows the mean epochs to learn, with standard deviation bars and homogeneous subsets, based on the LSD post hoc comparison method. Exact knowledge produced the fastest learning, followed by near relevant knowledge, far relevant knowledge, and finally by no knowledge and irrelevant knowledge. Thus, scaling up in width became more difficult with the amount of scaling that was required. In these conditions, irrelevant knowledge did not speed up learning, as compared to the no-knowledge control. Examination of source-learning results confirmed that narrow rectangles were easier to learn than wide rectangles, in terms of both the number of hidden units recruited and epochs to learn.

Fig. 1. Mean epochs to victory in the target phase of the narrow rectangle condition, with standard deviation bars and homogeneous subsets (adapted from Shultz and Rivest 2001, with permission of Taylor & Francis Ltd.)

Fig. 2. Mean epochs to victory in the target phase of the wide rectangle condition, with standard deviation bars and homogeneous subsets (adapted from Shultz and Rivest 2001, with permission of Taylor & Francis Ltd.)

Figure 3 shows output activation diagrams for a network learning a narrow target rectangle after recruiting near relevant, directly connected source knowledge. In such diagrams (Figures 3 and 4), darker points represent inputs inside the target shape and lighter points represent inputs outside of the target shape. A white background indicates test inputs classified as being inside the target shape, a black background indicates test inputs classified as being outside the target shape, and a gray background indicates test inputs whose classification is uncertain. These backgrounds are somewhat irregular because they are produced by testing the network on a fine grid of 220 x 220 input patterns. Whether learning a narrow or wide target, there was a strong resemblance between the shape of the source knowledge and the shape of the final solution. Interestingly, networks learned to classify all patterns as being outside of the target class during the first output phase. Because only 33 of the 225 training patterns fall within the target class when the target is a narrow rectangle, the best initial guess without nonlinear hidden units is that patterns fall outside the target class.
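A diagram of this kind can be produced roughly as follows. This is a sketch assuming a trained network callable `net` that maps input points to a signed output; the 220 x 220 grid matches the text, while the +/-0.25 uncertainty band is an illustrative choice of ours.

import numpy as np

def activation_diagram(net, n=220, band=0.25):
    """Probe the network on an n x n grid over the input space and paint each
    cell white (inside), black (outside), or gray (uncertain)."""
    xs = np.linspace(-1.0, 1.0, n)
    xx, yy = np.meshgrid(xs, xs)
    out = net(np.column_stack([xx.ravel(), yy.ravel()])).reshape(n, n)
    image = np.full((n, n), 0.5)   # gray: classification uncertain
    image[out >= band] = 1.0       # white: classified inside the target shape
    image[out <= -band] = 0.0      # black: classified outside the target shape
    return image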


Fig. 3a, b. Output activation diagrams for a network learning a narrow rectangle. a Near relevant source knowledge. b Final target solution at the end of the second output phase after recruiting this source knowledge.

In contrast, when learning a wide target rectangle, networks do the opposite; that is, they learn to classify all patterns as being inside of the target class during the first output phase. Because the majority of the training patterns (121 out of 225) fall within the target class when the target is a wide rectangle, the best initial guess without nonlinear hidden units is that patterns fall inside the target class. Figure 4 shows output activation diagrams for networks learning a wide target rectangle with near relevant source knowledge. Figure 4b shows classification of all patterns as being inside of the target class by the end of the first output phase.

Network error during target learning, whether scaling down to a narrow rectangle or scaling up to a wide rectangle, involves patterns near the four corners of the target rectangle. These corners are regions where the hyperplanes that the network is learning intersect. When scaling down to a narrow target rectangle, recruitment of a source network sharpens these corners, making target learning quite fast. When scaling up to a wide target rectangle, recruitment of a source network smooths these corners, thereby prolonging target learning while the corners are resharpened. When scaling up to a wide rectangle, the amount of corner smoothing and eventual resharpening increases with the degree of scaling. Because no additional corner sharpening is required when scaling down to a narrow rectangle, learning speed is relatively fast and does not vary with the degree of scaling. This explains why scaling down in width is faster to learn than scaling up in width.

Fig. 4a, b, c, d. Output activation diagrams for a network learning a wide rectangle. a Near relevant source knowledge. Target solutions at the end of the first (b), second (c), and third and final (d) output phases.

4 Two Sources of Knowledge

In this experiment, networks first learned two source tasks of different relevance to the target task. The various source knowledge conditions, their relations to the target, and the mean number of times each source network was recruited during input phases are presented in Table 2. The descriptions of the shapes associated with each condition were provided in Table 1. The two means in each row of Table 2 were compared with a t-test for paired samples. Exact knowledge was preferred over both relevant, p < .005, and irrelevant, p < .005, knowledge. Relevant knowledge was preferred over irrelevant knowledge only when scaling down to a narrower target rectangle, p < .001. The large number of recruited networks in the relevant vs. irrelevant, scaling-up condition reflects the relative difficulty of learning in this condition; the longer that learning continued, the more recruits were required.

The results of this two-source experiment fit with the analysis of the one-source experiment. Exact source knowledge was preferred over inexact source knowledge because exact knowledge made a nearly perfect match to the target problem. When scaling down to a narrow rectangle, relevant inexact source knowledge was preferred over irrelevant source knowledge because the recruitment sharpened the critical corners of the target figure, which was rectangular just like the relevant sources. In contrast, when scaling up to a wide rectangle, there was no advantage for relevant source knowledge because recruiting smoothed the critical target corners, thus requiring additional resharpening through further learning.
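For concreteness, a row-wise comparison of this kind can be run as below with a paired-samples t-test; the recruitment counts here are invented placeholders for the 20 networks in a condition, not the study's raw data.

import numpy as np
from scipy import stats

# Times each source was recruited, one entry per network (placeholder values).
exact_source    = np.array([1, 1, 2, 1, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 2, 1, 0, 1, 1, 2])
relevant_source = np.array([1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1])

t, p = stats.ttest_rel(exact_source, relevant_source)  # paired-samples t-test
print(f"t = {t:.2f}, p = {p:.4f}")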

Table 2
Dual-source Knowledge Conditions and Mean Networks Recruited (adapted from Shultz and Rivest 2001, with permission of Taylor & Francis Ltd.)

                                                     Mean networks recruited
Source knowledge           Relation to target        Narrow rectangle   Wide rectangle   Circle

Target: Narrow rectangle
Narrow & wide rectangles   Exact vs. Relevant        1.05               0.60             n/a
Narrow rectangle, circle   Exact vs. Irrelevant      1.00               n/a              0.45
Wide rectangle, circle     Relevant vs. Irrelevant   n/a                1.50             0.45

Target: Wide rectangle
Wide & narrow rectangles   Exact vs. Relevant        0.15               1.20             n/a
Wide rectangle, circle     Exact vs. Irrelevant      n/a                1.05             0.0
Narrow rectangle, circle   Relevant vs. Irrelevant   2.65               n/a              3.40

5 Discussion

As in previous work, the present results demonstrate that KBCC is able to find, adapt, and use its existing knowledge in the learning of a new problem, shortening the learning time. The more relevant the source knowledge is, the more likely it will be recruited for solution of a target problem and the faster that new learning will be. The fact that these results hold for a wide variety of input transformations (translation, rotation, and here, width changes) underscores the robustness of KBCC in finding knowledge along different dimensions of relevance to the target problem.

KBCC is similar in spirit to recent neural-network research on transfer, multitask learning, lifelong learning, knowledge insertion, modularity, and input recoding. However, unlike many of these other techniques, for which both the inputs and outputs of the source and target task must match precisely, KBCC can recruit any sort of differentiable function to use in a new task. The inputs and outputs of KBCC source networks can be arranged in different orders and frequencies and employ different coding techniques than in the target task. This wider range of recruitment objects offers considerably more power and flexibility than other knowledge-based learning systems provide. A direct comparison to multitask learning showed that KBCC was faster and more effective (Shultz and Rivest 2000b). KBCC has also been effective in large-scale, realistic domains such as vowel recognition (Rivest and Shultz 2002) and DNA segmentation (Thivierge and Shultz 2002).

References

Buckingham D, Shultz TR (2000) The developmental course of distance, time, and velocity concepts: A generative connectionist model. J Cog & Dev 1: 305-345

Fahlman SE, Lebiere C (1990) The cascade-correlation learning architecture. In: Touretzky DS (ed) Advances in neural information processing systems 2. Morgan Kaufmann, Los Altos CA, pp 524-532

Pazzani MJ (1991) Influence of prior knowledge on concept acquisition: Experimental and computational results. J Expt Psych: Learning, Mem, & Cog 17: 416-432

Rivest F, Shultz TR (2002) Application of knowledge-based cascade-correlation to vowel recognition. IEEE Internat World Congr on Comp Intell, pp. 53-58

Shultz TR, Rivest F (2000a) Knowledge-based cascade-correlation. Internat Joint Conf on Neural Networks Vol V. IEEE Computer Society Press, Los Alamitos CA, pp 641-646

Shultz TR, Rivest F (2000b) Using knowledge to speed learning: A comparison of knowledge-based cascade-correlation and multi-task learning. Seventeenth Internat Conf on Machine Learning. Morgan Kaufmann, San Francisco, pp 871-878

Shultz TR, Rivest F (2001) Knowledge-based cascade-correlation: Using knowledge to speed learning. Connect Sci 13: 43-72

Thivierge JP, Shultz TR (2002) Finding relevant knowledge: KBCC applied to DNA splice-junction determination. IEEE Internat World Congr on Comp Intell, pp 1401-1405

Wisniewski EJ (1995) Prior knowledge and functionally relevant features in concept learning. J Expt Psych: Learning, Mem, & Cog 21: 449-468
