A Framework for Cooperative Object Recognition
N. Oswald and P. Levi
Institute of Parallel and Distributed High-Performance Systems, Applied Computer Science - Image Understanding, 70565 Stuttgart, Germany

Abstract

This paper explores the problem of object recognition from multiple observers. The basic idea is to overcome the limitations of a single recognition module by integrating information from multiple sources. Each observer is capable of performing appearance-based object recognition, and through knowledge of their relative positions and orientations the observers can coordinate their hypotheses to make object recognition more robust. A framework is proposed for appearance-based object recognition using Canny edge maps that are normalized to be translation and scale invariant. Object matching is formulated as a non-parametric statistical similarity computation between two distribution functions, while information integration is performed in a Bayesian belief net framework. Such nets enable both a continuous and a cooperative consideration of recognition results. Experiments with two observers recognizing mobile robots show a significant improvement of the recognition results.

1 Introduction

The fast and robust recognition of objects is a central task in robotic applications in traffic, manufacturing, or services. In such applications, various objects with a priori unknown identity, position, and orientation are seen against changing backgrounds, with projective distortions and varying illumination conditions, from observers in arbitrary view positions. The design of a recognition system therefore has to avoid misinterpretations caused by these difficulties. Consequently, it seems likely that observation with distributed visual input from several cooperative observers will increase robustness in such applications.

Information integration is a kind of multi-sensor fusion with homogeneous data. In applications concerning multi-sensor fusion this process can typically be assigned to different levels of abstraction [8] [11]. Unlike [2], where cooperation is distinguished between the levels of sensing, processing, manipulation, behaviors, and agents, we consider different aims of cooperation and their prerequisites, and relate them to the levels of abstraction. At low-level processing, the aim of cooperation is to take advantage of the sensory information available in a team of observers. There are no prerequisites for a single observer. Cooperation obviously increases the knowledge of individual observers if they are not able to look at the required region themselves. Up to now, most of the literature concerning cooperation can be assigned to the intermediate level. The aim of cooperation at this level is the composition of object components, provided that a shared and overlapping field of view between the combined views exists. [14] describes a method to generate a 3D reconstruction of an unknown static scene by combining stereo and focus data. In [7] the 3D reconstruction is done in a dynamic manner by fusing data gained from stereo and optical flow computations. Another approach [1] builds a formal object description from vision and touch that is used as a basis for the recognition task. Recent approaches [12] [13] [4] that deal with multiple perspective interactive video aim at a 3D scene analysis with the goal of influencing the individual perspective view of an observer.
Cooperation at high-level processing assumes an object that is observed from different view positions and deals with the integration of object hypotheses. Such integration or fusion of data at a higher level is supported by only a few agent architectures [8]. Cooperation aims at this level are of great interest in multi-agent applications, like in [17], in order to build observers that operate autonomously by calculating their own estimations. Nevertheless, research with regard to this level can hardly be found.

The recognition scenario (Fig. 1) discussed in this article concentrates at high-level processing on cooperative recognition with several observers. In such a shared environment each observer calculates recognition results for a common target object.

Figure 1: Scenario of cooperative recognition

The object recognition receives relevant data in the form of areas of motion calculated by a tracking algorithm similar to the one proposed by [22]; a detailed description of the tracking is beyond the scope of this article. The correspondence between object and model data is established by a new statistical approach that supplies a list of hypotheses. The hypotheses of the individual observers are finally combined by the use of Bayesian belief nets. Experimental results will indicate the influence of cooperation on the quality of the hypotheses.

2 Statistical object recognition

There are mainly two approaches to object modelling, the appearance-based and the geometric one [9] [19]. The appearance-based approach has the advantage of not requiring a formal description. Thus, any kind of object can be modelled, but, in general, object hierarchies do not exist. On the other hand, hierarchies are the advantage of geometric object modelling; so far, however, this approach is limited to simple objects [15]. Thus, the approach to choose depends on the complexity of the objects. Because the robots in Fig. 1 are complex, the appearance-based approach was chosen.

Unlike most isolated applications, the recognition process in a natural environment has to cope with a number of difficulties. A target object can be seen from arbitrary viewing positions with different illumination conditions and varying background. Current approaches offer solutions to only parts of these requirements. In [16] an eigenspace method is proposed to identify various objects and their orientation, but this method requires a constant background. Based on that, [3] extended this approach to cope with partly occluded objects, but results have not yet been shown in a real scenario like the one in Fig. 1. [6] presents a recognition method using quasi-invariants based on segments; to analyse complex objects, however, a lot of effort has to be invested in the generation of segments. A Lambertian model is proposed by [5] to recognize more complex objects, but this method assumes fixed locations and cannot cope with background clutter.

2.1 Formal object descriptions

Starting from an intensity image I - received automatically from the tracking algorithm - that contains mainly the target object T, iconic features like gray values, edges, or corners can be determined to build a formal object description. Features carry geometric attributes indicating their location in image I.
So far we have found edges generated by the Canny filter to be useful features. Each object is formally described by a set $M$ of $n_M$ features:

$M = \{M_1, M_2, \ldots, M_{n_M}\}$ (1)

The geometric position of each feature $M_i$ in $I$ is determined by its edge coordinates $\vec{c}_i$ in x and y direction with $\vec{c}_i = (c_{x,i}, c_{y,i})$. Thus, a feature is defined as a tuple $M_i$, consisting of its edge coordinates $\vec{c}_i$ and optionally a set of further attributes $D_i$ that specify the feature additionally:

$M_i = \{\vec{c}_i, D_i\}, \quad i = 1, \ldots, n_M.$ (2)

Features are standardized by a transformation from $I$ into a scale and translation invariant space, the configuration space $\mathcal{K}$. For this geometric transformation the location of a feature $M_i$ is set in relation to all remaining feature locations $M_j$ with $j \neq i$ to obtain the relative location of $M_i$. Because $I$ is 2-dimensional, $2^2$ possible transformations into $\mathcal{K}$ exist. With the criterion $\alpha$, two arbitrary features $M_i$ and $M_j$ are marked with the $<$ relation along x or y as follows:

$\alpha_{\{x,y\},ij} = \begin{cases} 1 & c_{\{x,y\},i} < c_{\{x,y\},j} - \varepsilon_{\{x,y\}} \;\wedge\; i \neq j \\ 0 & \text{otherwise} \end{cases}$ (3)

$\varepsilon_{\{x,y\}}$ allows for an overlapping tolerance. With $\alpha_{\{x,y\},ij}$ the relative location of a feature $M_i$ with respect to the remaining features of $M \setminus \{M_i\}$ along x and y is given by

$b_{\{x,y\},i} = \frac{1}{n_M - 1} \sum_{j=1}^{n_M} \alpha_{\{x,y\},ij}, \quad n_M > 1$ (4)

For each feature $M_i$ the relative location yields a two-dimensional destination vector $\vec{b}_i = (b_{x,i}, b_{y,i})^T$ with values in the range $[0, 1]$. Thus the configuration space $\mathcal{K}$ is a normalized continuous space with the same range and dimension as $I$. The features transformed from $I$ to $\mathcal{K}$ build the configuration space representation of an object, $(\vec{b}_1, \ldots, \vec{b}_{n_M})$. Fig. 2 shows a sample transformation of a reference object $R$ and a target object $T$ into $\mathcal{K}$ using the Canny operator. As can easily be verified, the resulting configuration space is translation and scale invariant. Because only relative locations are considered, small disturbances or projective distortions appear compensated in $\mathcal{K}$.

Figure 2: Transformation of reference (left) and target object (right) via Canny edge images into configuration spaces $\mathcal{K}^R$ and $\mathcal{K}^T$
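To make the transformation of Eqs. (1)-(4) concrete, the following sketch computes the destination vectors $\vec{b}_i$ from a set of edge coordinates. It is only an illustration of the described normalization, not the original implementation; Python/NumPy and the function name are our own assumptions.

```python
import numpy as np

def configuration_space(coords, eps=0.0):
    """Map feature coordinates (n_M, 2) from image space into the
    translation- and scale-invariant configuration space of Sec. 2.1.

    coords : array of edge-pixel coordinates c_i = (c_x, c_y),
             e.g. the non-zero pixels of a Canny edge map
    eps    : overlap tolerance epsilon from Eq. (3)
    Returns an (n_M, 2) array of destination vectors b_i in [0, 1]^2.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    if n < 2:
        raise ValueError("Eq. (4) requires n_M > 1 features")
    b = np.empty_like(coords)
    for axis in range(2):                      # axis 0 -> x, axis 1 -> y
        c = coords[:, axis]
        # alpha_ij = 1 iff c_i < c_j - eps and i != j   (Eq. 3)
        # O(n_M^2) memory; adequate for a sketch
        alpha = (c[:, None] < c[None, :] - eps) & ~np.eye(n, dtype=bool)
        # relative location, normalized by n_M - 1      (Eq. 4)
        b[:, axis] = alpha.sum(axis=1) / (n - 1)
    return b
```

Because only the relative ordering of coordinates enters Eq. (4), translating or scaling the image leaves the resulting destination vectors unchanged, which is exactly the invariance exploited above.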
2.2 Generation of hypotheses

Because $n_M$ is very high for our objects, we consider all $\vec{b}_i$ in $\mathcal{K}$ as randomly located in the statistical sense. However, except for classification tasks, statistical methods are not yet widely used for recognition. [25] provides a recognition method based on a linear combination of normally distributed density functions of point features. In [10] a formal statistical object description based on parametric methods is introduced to identify and localize objects. In contrast, we compare the distribution function $F^R$ of a reference object with the distribution function $F^T$ of a target object by means of a nonparametric statistical test, so that no constraints are imposed on the distribution of the point features. This test calculates a similarity measure between the two distribution functions. A discrete distribution function $F$ with $\vec{z} \in [0, 1]^2$ is built from the destination vectors in $\mathcal{K}$:

$F(\vec{z}) = \begin{cases} 0 & \vec{z} < \vec{b}_1 \\ \vec{b}_i & \vec{b}_i \leq \vec{z} < \vec{b}_{i+1} \\ 1 & \vec{z} \geq \vec{b}_{n_M} \end{cases}$ (5)

In a number of experiments we found Kuiper's test useful [21]. This test is a variant of the well-known Kolmogorov-Smirnov test; in contrast to the latter, the maximum distance between the two distribution functions is calculated in both directions. The Kuiper statistic $D$ between the distributions of a target object $F^T$ and a reference object $F^R$ is calculated by

$D = \max_{\vec{z}}\{F^T(\vec{z}) - F^R(\vec{z})\} + \max_{\vec{z}}\{F^R(\vec{z}) - F^T(\vec{z})\}$ (6)

For a sample set of size $n_M$, the significance is calculated by

$Q(\lambda) = \begin{cases} \sum_{j=-\infty}^{\infty} (1 - 4j^2\lambda^2)\, e^{-2j^2\lambda^2} & \lambda > 0 \\ 0 & \text{otherwise} \end{cases}$ (7)

In [21] $\lambda$ is approximated depending on the sample size $m$ and the distance $D$ by

$\lambda = \left(\sqrt{m} + 0.155 + \frac{0.24}{\sqrt{m}}\right) \cdot D$ (8)

The number of samples $m$ is calculated from the number of features of the target object, $n_{M_T}$, and of the reference object, $n_{M_R}$, with

$m = \frac{n_{M_T}\, n_{M_R}}{n_{M_T} + n_{M_R}}$ (9)

The more two distribution functions resemble each other, the smaller $Q(\lambda)$ gets. Thus, $h = 1 - Q(\lambda)$ is used as the hypothesis of the match. According to the appearance-based approach, each reference object $R_i$ is modelled by a number of views $n_A$. Thus, for a statistical match between a target object $T$ and a reference object $R_i$, the configuration space $\mathcal{K}^T$ has to be compared with all configuration spaces $\mathcal{K}^{R_i}$. The comparison of $T$ with $R_i$ supplies a hypothesis vector $\vec{h}_{R_i} = (h_{R_i,1}, \ldots, h_{R_i,n_A})^T$ containing the calculated hypotheses of all matches between views of $R_i$ and $T$. The comparison of $T$ with $n_R$ reference objects leads to the resulting hypothesis set $H_i = \{\vec{h}_{R_1}, \ldots, \vec{h}_{R_{n_R}}\}$.

In an ideal scenario with varying distances, small variations in illumination and orientation, and identical background, as in [16], the results of this method are very robust. However, in real scenes with arbitrary viewing positions, variable background, and changing illumination conditions the quality of the segmentation depends on the tracking algorithm. Results are sometimes not exact, and data are either not relevant or missing. This may lead to misinterpretations. To overcome this problem it is useful to consider recognition results over a time period.
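A minimal sketch of the matching step of Eqs. (6)-(9) is given below. It is written for a one-dimensional marginal of the destination vectors, whereas the paper compares the two-dimensional distribution functions of Eq. (5); the truncation of the series in Eq. (7) and all names are our own choices.

```python
import numpy as np

def kuiper_hypothesis(b_target, b_reference, axis=0):
    """Match hypothesis h = 1 - Q(lambda) from Kuiper's test, Eqs. (6)-(9).

    b_target, b_reference : (n, 2) arrays of destination vectors b_i
    axis                  : coordinate of the marginal used for illustration
    """
    t = np.sort(np.asarray(b_target)[:, axis])
    r = np.sort(np.asarray(b_reference)[:, axis])
    grid = np.concatenate([t, r])                          # evaluation points z
    Ft = np.searchsorted(t, grid, side="right") / t.size   # empirical F^T
    Fr = np.searchsorted(r, grid, side="right") / r.size   # empirical F^R
    D = np.max(Ft - Fr) + np.max(Fr - Ft)                  # Kuiper distance, Eq. (6)

    m = t.size * r.size / (t.size + r.size)                # effective sample size, Eq. (9)
    lam = (np.sqrt(m) + 0.155 + 0.24 / np.sqrt(m)) * D     # Eq. (8)

    if lam <= 0:                                           # identical distributions
        return 1.0
    j = np.arange(-100, 101)                               # truncated series of Eq. (7)
    Q = np.sum((1.0 - 4.0 * j**2 * lam**2) * np.exp(-2.0 * j**2 * lam**2))
    return 1.0 - Q                                         # hypothesis h of the match
```

Calling such a routine once per stored view of a reference object $R_i$ yields the hypothesis vector $\vec{h}_{R_i}$, and repeating this for all $n_R$ reference objects yields the hypothesis set $H_i$ of one observer.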
3 Information integration

Various methods in the field of information integration have been presented and compared, e.g., in [24] [23]. In contrast to the traditional use of information integration, our scenario deals with changing object positions and orientations as well as with hypotheses of insufficient quality, and thus requires dynamic assignments. For these reasons we found the method of Bayesian belief nets [18] most useful, because belief nets are rather robust in applications with imprecision [20] [23] and react rather fast to changes [24]. In our application belief nets are used to verify the recognition results $H = (H_1, \ldots, H_{n_O})$ of up to $n_O$ observers. In the case of one observer the belief net estimates the current orientation of a target object $T$ in the scene based on $H_i$. In the case of several observers, the hypotheses of the $n_O$ involved participants are combined in the belief net to estimate the object's identity and orientation.

A belief net consists of a number of single belief nodes that represent the state of the net at a time step $t$. The design of a single belief node $Z$ is shown in Fig. 3. Each node $Z$ is connected with a predecessor node $A$ and a successor node $B$ that represent the states at time steps $t-1$ and $t+1$, respectively. To determine the orientation of a moving object only top-down propagation is essential. In the single node $Z$ the causal support $\pi(Z_i)$ is calculated from the propagated values $\pi_Z(A_j)$ of $A$ and the conditional probabilities $P(Z_i|A_j)$ with

$\pi(Z_i) = \sum_{j=1}^{n_A n_R} P(Z_i|A_j)\, \pi_Z(A_j)$ (10)

Figure 3: Single belief node Z

To get the conditional probabilities $P(Z_i|A_j)$, we use $n_A n_R \times n_A n_R$ aspect transition matrices (ATM), which can be different for each belief node:

$\begin{array}{c|ccccccc}
t_{k-1} \backslash t_k & R_{1,1} & \cdots & R_{1,n_A} & \cdots & R_{n_R,1} & \cdots & R_{n_R,n_A} \\ \hline
R_{1,1} & p_{1,1} & \cdots & p_{1,n_A} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & & \vdots & & \vdots \\
R_{1,n_A} & p_{n_A,1} & \cdots & p_{n_A,n_A} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & & \vdots & & \vdots \\
R_{n_R,1} & 0 & \cdots & 0 & \cdots & p_{1,1} & \cdots & p_{1,n_A} \\
\vdots & \vdots & & \vdots & & \vdots & & \vdots \\
R_{n_R,n_A} & 0 & \cdots & 0 & \cdots & p_{n_A,1} & \cdots & p_{n_A,n_A}
\end{array}$

These matrices describe a priori assumed dependencies between the $n_A$ aspects of the $n_R$ models at two consecutive time steps. Dependencies are usually only sensible between aspects of the same model; all other dependencies are set to zero. Each cell of the ATM indicates the probability $p_{u,v}$ that aspect $u$ of model $R_i$ (written $R_{i,u}$) at time $t_{k-1}$ will change to aspect $v$ of model $R_i$ at $t_k$ ($R_{i,v}$). All dependencies or assumptions can basically be chosen arbitrarily. Typical assumptions are Gaussian or uniform distributions applied to a limited range, because usually the observed object will move in a continuous manner. These distributions then determine the transition probability $p_{u,v}$ between selected or all aspects of $R_i$; all other transition probabilities within one model $R_i$ are set to 0. As a constraint, each row of an ATM has to sum to 1.

At each state, and thus at each time step, diagnostic support from up to $n_O$ observers can be processed. This is modelled in Fig. 3 by the nodes $E_i$ that supply the diagnostic support $\lambda(E_i)$ containing $H_i$. The collection of all observers' hypotheses is done by

$\lambda(Z_i) = \prod_{l} \lambda(E_l)$ (11)

The belief values of each belief node are calculated from causal and diagnostic support, normalized by $\alpha$:

$BEL(Z_i) = \alpha\, \lambda(Z_i)\, \pi(Z_i)$ (12)

From the current belief values the top-down propagation values $\pi_B(Z)$ are calculated by

$\pi_B(Z) = BEL(Z)$ (13)

Once a belief node reaches the value zero for one aspect, this value is propagated to all future belief nodes. Such a filtering effect is desired if a genuinely false aspect is concerned. Unfortunately, with insufficient input the object recognition will calculate low or zero probability values for correct aspects, and as a consequence the belief values become zero too. Such behavior is unwanted because it is not robust. To avoid misinterpretations, we extend equation (10) by a factor $o$,

$\pi(Z_i) = \sum_{j=1}^{n_A n_R} P(Z_i|A_j)\, (\pi_Z(A_j) + o_j)$ (14)

to smooth the causal support $\pi(Z_i)$. To keep the filtering effect, an aspect marking matrix similar to the ATM is used as a criterion for whether aspects or models are still valid.
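The following sketch illustrates one top-down update of a single belief node according to Eqs. (10)-(14). It is a simplified illustration under our own assumptions: the ATM is filled with a uniform transition window that wraps around the aspect circle, the smoothing term $o_j$ is a constant scalar, and the helper names are hypothetical.

```python
import numpy as np

def make_atm(n_r, n_a, spread=1):
    """Block-diagonal aspect transition matrix: within each model, aspect i
    may move to aspects i-spread ... i+spread (uniformly, wrapping around
    the aspect circle); transitions between models are 0.  Rows sum to 1."""
    block = np.zeros((n_a, n_a))
    for i in range(n_a):
        for d in range(-spread, spread + 1):
            block[i, (i + d) % n_a] = 1.0 / (2 * spread + 1)
    atm = np.zeros((n_r * n_a, n_r * n_a))
    for r in range(n_r):
        atm[r * n_a:(r + 1) * n_a, r * n_a:(r + 1) * n_a] = block
    return atm

def belief_step(atm, pi_prev, observer_hyps, o=0.0):
    """One top-down update of a belief node Z (Eqs. 10-14).

    pi_prev       : propagated values pi_Z(A_j) from the predecessor node
    observer_hyps : list of flattened hypothesis sets H_i, one per observer
    o             : smoothing factor preventing premature zero beliefs
    """
    pi = atm.T @ (pi_prev + o)                        # causal support, Eqs. (10)/(14)
    lam = np.prod(np.vstack(observer_hyps), axis=0)   # diagnostic support, Eq. (11)
    bel = lam * pi
    bel /= bel.sum()                                  # normalization alpha, Eq. (12)
    return bel                                        # pi_B(Z) = BEL(Z), Eq. (13)

# Example with hypothetical sizes: 5 reference objects, 24 aspects each.
atm = make_atm(n_r=5, n_a=24)
pi0 = np.full(5 * 24, 1.0 / (5 * 24))                 # uninformative initial state
h1 = np.random.rand(5 * 24)                           # stand-ins for Kuiper hypotheses
h2 = np.random.rand(5 * 24)
bel = belief_step(atm, pi0, [h1, h2])
```

Chaining such updates over successive time steps corresponds to the continuous consideration of recognition results; passing the hypothesis sets of several observers at one time step corresponds to the cooperative case.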
4 Experimental results

In order to examine the consequences of cooperation on the quality of recognition results we analyze a sample scenario in detail, with two observers looking at the same moving robot "Aramis". Fig. 5 shows part of that image sequence, containing 74 images of size 384 x 288, with the target object observed from the two viewing angles of E1 and E2. T is marked by convex hulls automatically generated by our tracking algorithm and serves as input I for the recognition. The orientation angle between E1 and E2 is approximately 190 degrees. To recognize T, both observers used a model database with five reference objects, each consisting of 24 aspects taken from fixed distances in 15-degree steps from 0 to 360 degrees. The recognition time measured on a SUN Enterprise amounts to 220 ms. In this experiment, the cooperative recognition enables the processing of two tasks, verification and localization (Fig. 4).

Figure 4: Belief nets for the verification task (left) and the localization task (right)

The aim of the verification is to improve a single observer's recognition results by combining H1 and H2. To do so, known positions of the observers are assumed and the hypotheses are transformed to a common coordinate system, usually either the one used by E1 or the one used by E2. In the localization, the suspected reference object is used to determine the relative spatial orientation between E1 and E2. This has to be done by using the difference of the recognition results of both observers. Obviously, the latter task requires precise values H1 and H2. To examine the consequences of information integration, cooperative results are compared with single and continuous results of both observers.

Fig. 6 (a) shows for each of the 74 images both the assignment of T to a reference object and the corresponding estimated view in degrees. For each time step only the maximum hypotheses of E1 and E2 are considered. The recognition results of the Kuiper test are quite different for E1 and E2: while H1 contains 24% misinterpretations, H2 contains 44%. The main reason for this lies in the quality of I, which depends on the given convex hull. For comparison, the results of the continuous consideration of T by E1 and E2 are shown in Fig. 6 (b) and (c). The aspect transition matrices were filled according to a uniform distribution in the interval [i-1, i+1] around aspect i, and o was set to 0.002. In (b) the respective estimated reference object is shown and in (c) the corresponding orientation. Note that in (c) only neighbouring hypotheses of the same reference object are connected by a line. As can be seen in (b), for example, the last four hypotheses of E1 indicate a wrong model estimation, where T is moving out of E1's field of view. Although the continuous results are significantly improved, still not all best matches are correct. Observer E1 calculates hypotheses with approximately 20% misinterpretations, either the wrong view or the wrong model. Observer E2 improves its recognition rate to about 29% misinterpretations.

Now these results are compared with the cooperative results. The verification is done according to the left picture in Fig. 4. The values of both observers E1 and E2 enter the belief net simultaneously at t. The hypotheses H2 are transformed to the camera coordinate system of observer E1. The result of the verification is indicated in Fig. 6 (d) by the square symbols. As can be seen, the integration of information from the two different sources E1 and E2 leads to an improvement of the estimated current orientation of T with no significant misinterpretations. Due to the chosen statistical recognition method, variations of 15 degrees from the real aspect are within the tolerance range.

Figure 5: Images #1, #30, and #60 with convex hulls of T observed from the viewpoints of E1 and E2

The time for calculations in this belief net amounts to 20 ms on a SUN Enterprise. Thus the total time for the verification task (without time for communication and tracking) is approx. 260 ms.
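The transformation of H2 into the camera coordinate system of E1 can be pictured as a cyclic shift of the aspect axis of the hypothesis set by the known relative orientation. The fragment below is our own illustration of this idea, reusing the helper functions sketched above; the rounding to the aspect raster and the sign of the shift depend on the chosen angle convention and are assumptions, not part of the paper.

```python
import numpy as np

def align_hypotheses(h, n_r, n_a, relative_angle_deg, step_deg=15.0):
    """Express one observer's hypotheses in the other observer's camera
    coordinate system by rotating the aspect index of every model by the
    known relative orientation (here rounded to the nearest aspect step)."""
    shift = int(round(relative_angle_deg / step_deg))
    h = np.asarray(h).reshape(n_r, n_a)
    return np.roll(h, shift, axis=1).reshape(-1)

# Hypothetical verification step: both observers' (aligned) hypotheses enter
# the same belief node, reusing belief_step() and make_atm() from above.
# h2_aligned = align_hypotheses(h2, n_r=5, n_a=24, relative_angle_deg=190)
# bel = belief_step(make_atm(5, 24), pi_prev, [h1, h2_aligned])
```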
The localization is done according to the right picture in Fig. 4. To obtain robust hypotheses, each observer first verifies its own hypotheses using belief net X and Y, respectively. To ensure that E1 and E2 suspect the same reference object, the identity of T is estimated by using the maximum belief values for each reference object. These values of BEL(X) and BEL(Y) enter a third belief net Z that estimates the identity of T. With o = 0 and the identity matrix as the 5 x 5 ATM, in which aspects are replaced by models, "Aramis" remains the only candidate reference object after t = 3 images. To determine the orientation angle between E1 and E2, the three highest-rated views in H1 and H2 are used. Such a procedure is necessary in the appearance-based approach due to ambiguous interpretations. The result of the localization is indicated in Fig. 6 (d) by the plus symbols. As long as T is completely visible to E1, the estimated orientation angle between E2 and E1 stays steady between 165 and 195 degrees. After the 70th image, where T moves out of I, the estimate degrades.

Figure 6: (a) Hypotheses resulting from the Kuiper test for E1 and E2; (b, c) model and aspect hypotheses resulting from the belief nets for E1 and E2; (d) results of verification and localization

5 Conclusions

We introduced a framework for cooperative recognition of objects in a multi-robot environment and showed that the use of integrated information from multiple sources increases the robustness of the recognition process. To identify complex objects in a real scenario an appearance-based, qualitative, nonparametric statistical matching algorithm was developed. Based on that, belief nets were proposed to integrate information from either a single observer or multiple observers. Cooperation can be used not only for verification but also for the localization of the spatial orientations of several observers. Time considerations showed that the whole recognition process is rather fast even with a higher number of models.

References

[1] P.K. Allen, Robotic Object Recognition Using Vision and Touch, Kluwer Academic Publishers, 1987.
[2] R. Bajcsy, From active perception to active cooperation - fundamental processes of intelligent behavior, GRASP Technical Report 398, 1995.
[3] H. Bischof, A. Leonardis, Robust recognition of scaled eigenimages through a hierarchical approach, Proc. of CVPR98, pp. 664-670, 1998.
[4] Q. Cai, J.K. Aggarwal, Tracking Human Motion Using Multiple Cameras, 13th ICPR, pp. C:68-72, 1996.
[5] R. Epstein et al., Learning Object Representations from Lighting Variations, Object Representation in Computer Vision II, pp. 179-199, Springer, 1996.
[6] P. Gros, Using quasi-invariants for automatic model building and object recognition: an overview, Object Representation in Computer Vision, pp. 65-75, Springer, 1994.
[7] E. Grosso et al., 3D Object Reconstruction Using Stereo and Motion, IEEE Trans. Systems, Man and Cybernetics 19(6), pp. 1465-1488, 1989.
[8] D. Hall, Mathematical Techniques in Multi-sensor Fusion, Artech House, 1992.
[9] M. Hebert et al., Object Representation in Computer Vision, Springer, 1994.
[10] J. Hornegger, H. Niemann, Statistical Learning, Localization, and Identification of Objects, 5th ICCV, pp. 914-919, 1995.
[11] S.S. Iyengar, L. Prasad, H. Min, Advances in Distributed Sensor Integration, Prentice Hall, 1995.
[12] R. Jain, K. Wakimoto, Multiple Perspective Interactive Video, Int. Conf. Multimedia Computing and Systems, pp. 202-211, 1995.
[13] A. Katkere, R. Jain, A Framework for Information Assimilation, Exploratory Vision: The Active Eye, pp. 241-256, Springer, 1996.
[14] E.P. Krotkov, Active Computer Vision by Cooperative Focus and Stereo, Springer, 1989.
[15] J. Mundy, An Experimental Comparison of Appearance and Geometric Model Based Recognition, Object Representation in Computer Vision II, pp. 247-269, Springer, 1996.
[16] H. Murase, S. Nayar, Visual Learning and Recognition of 3D Objects from Appearance, Int. Journal of Computer Vision 14, pp. 5-24, 1995.
[17] N. Oswald, P. Levi, Cooperative Vision in a Multi-Agent Architecture, LNCS 1310, pp. 709-716, Springer, 1997.
[18] J. Pearl, Distributed Revision of Composite Beliefs, Artificial Intelligence 33, pp. 173-215, 1987.
[19] J. Ponce, A. Zisserman, M. Hebert, Object Representation in Computer Vision II, Springer, 1996.
[20] M. Pradhan et al., The sensitivity of belief networks to imprecise probabilities: an experimental investigation, Artificial Intelligence 85, pp. 363-397, 1996.
[21] W.H. Press, Numerical Recipes in C, Cambridge University Press, 1992.
[22] S.M. Smith, ASSET-2: Real-time motion segmentation and object tracking, Journal of Real-Time Imaging 4(1), pp. 21-40, 1998.
[23] P. Walley, Measures of uncertainty in expert systems, Artificial Intelligence 83, pp. 1-58, 1996.
[24] E. Waltz, J. Llinas, Multisensor Data Fusion, Artech House, 1990.
[25] W.M. Wells III, Statistical Object Recognition, PhD thesis, MIT, 1993.