Object detection and classification are major challenges for robotic modules. Navigation, pick-and-place, and other robotic tasks depend on the ability to recognize objects.


Recent years have seen great progress in object detection, mainly because machine learning methods have become practical and efficient; new data representations and models have also contributed. Object detection algorithms used in robotics are expected to detect and classify all instances of an object type, when such instances exist, even under variations of position, orientation, scale, partial occlusion, and environmental conditions such as illumination intensity. Object detection is also the key to other machine vision functions, such as reconstructing the 3D scene, extracting additional information about an object (like face details), and tracking its motion across successive video frames. Robotic applications such as navigation and pick-and-place may require more elaborate information from images; in that case, additional image-capturing channels may be used. Self-navigating robots use a multi-camera setup, each camera facing a different direction, together with a set of additional image-generating sensors (such as lidar and radar). The computer vision system then fuses this data during or after the object detection stage.

For each object, the computer vision system provides the following information: localization (the object's position and orientation in the real world), type (which object was detected), and the motion of each object instance.
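As a concrete illustration, the three pieces of information listed above can be bundled into one record per detected object. The field names and units below are hypothetical, chosen only to make the structure explicit:

```python
from dataclasses import dataclass

# Hypothetical container for one detection result: localization
# (position + orientation), object type, and estimated motion.
@dataclass
class Detection:
    label: str                            # object type, e.g. "bolt"
    position: tuple                       # (x, y, z) in world coordinates, meters
    orientation: tuple                    # (roll, pitch, yaw) in radians
    velocity: tuple = (0.0, 0.0, 0.0)     # estimated motion per frame

d = Detection(label="bolt",
              position=(0.42, 0.10, 0.05),
              orientation=(0.0, 0.0, 1.57))
print(d.label, d.position)
```

A pick-and-place controller would consume such records directly: `position`/`orientation` drive the grasp, while `velocity` lets it lead a moving target.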

Classical object detection methods

Classical methods of object detection relied on template matching algorithms. Some used a structured matching process: object parts are recognized first, and a global match is then assembled from the partial matches. Statistical classifiers such as neural networks, AdaBoost, SVM, and naive Bayes were used to make recognition robust where variation existed.
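The core of template matching is sliding a small reference patch over the image and scoring the agreement at every position. A minimal sketch, using sum of squared differences as the (assumed) score on tiny integer grids:

```python
# Template-matching sketch (illustrative, not production code): slide the
# template over the image, score each offset with the sum of squared
# differences, and report the offset with the lowest score.

def match_template(image, template):
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            score = sum(
                (image[y + j][x + i] - template[j][i]) ** 2
                for j in range(th) for i in range(tw)
            )
            if score < best_score:
                best_score, best_pos = score, (x, y)
    return best_pos, best_score

image = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 8], [7, 9]]
print(match_template(image, template))   # ((1, 1), 0): exact match
```

Real systems use a normalized score (e.g. normalized cross-correlation) so that the match survives global brightness changes, which plain SSD does not.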

Humans are a special class among the objects robots interact with. Human faces are a distinctive part that helps robots identify these "objects". In addition, robots need to resolve the recognized human's motion, especially the parts with which the robot might interact, such as the hands.

Efficiency in such object detection algorithms can be obtained with multi-resolution models: initial recognition is performed at low resolution, while selected sub-images, where objects are estimated to be, are processed at high resolution. In video, the locations and dimensions of these sub-images can be estimated from frame to frame using motion estimation.
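The coarse-to-fine idea can be sketched in a few lines: downsample the image, locate the strongest response at low resolution, then re-search only the small full-resolution window that the coarse cell covers. The "response" here is simply pixel brightness, a stand-in for a real detector score:

```python
# Coarse-to-fine search sketch: find the peak in a downsampled image,
# then refine only inside the corresponding full-resolution window.

def downsample(img, f):
    # average-pool the image by a factor f in each dimension
    return [
        [sum(img[y*f + j][x*f + i] for j in range(f) for i in range(f)) / (f*f)
         for x in range(len(img[0]) // f)]
        for y in range(len(img) // f)
    ]

def argmax2d(img, x0=0, y0=0, x1=None, y1=None):
    x1 = len(img[0]) if x1 is None else x1
    y1 = len(img) if y1 is None else y1
    best = max((img[y][x], x, y) for y in range(y0, y1) for x in range(x0, x1))
    return best[1], best[2]

image = [[0] * 8 for _ in range(8)]
image[5][6] = 9                              # bright target pixel
coarse = downsample(image, 2)                # 4x4 low-resolution view
cx, cy = argmax2d(coarse)                    # coarse hit at cell (3, 2)
# refine only inside the 2x2 full-resolution window under that cell
fx, fy = argmax2d(image, cx*2, cy*2, cx*2 + 2, cy*2 + 2)
print((fx, fy))   # (6, 5)
```

The refinement step examines 4 pixels instead of 64; on real images the savings scale with the pyramid depth.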

The initial search for objects within an image can proceed along a few alternatives. These are invoked every few video frames, as frequently as the scene the robot faces may change. Efficiency is a key factor here as well. A generic frame search may be conducted, looking for "hints" of an object's existence; when such a hint is detected, a finer, more detailed recognition method is engaged. Naturally, hints from previous image frames, such as an object's estimated motion, can be combined with the other hints.
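One common motion-based hint is a constant-velocity prediction: given an object's last two positions, predict where it will appear next and run the detailed detector only in a window around that point. A minimal sketch (window size and units are arbitrary choices for illustration):

```python
# Hint-driven search sketch: predict the next position of a tracked object
# under a constant-velocity assumption and return a search window (ROI)
# around the prediction, so the detailed detector scans only that region.

def predict_roi(prev_pos, curr_pos, half_size=20):
    vx = curr_pos[0] - prev_pos[0]          # per-frame displacement
    vy = curr_pos[1] - prev_pos[1]
    px, py = curr_pos[0] + vx, curr_pos[1] + vy   # predicted next position
    return (px - half_size, py - half_size, px + half_size, py + half_size)

# object moved from (100, 50) to (110, 55) -> expect it near (120, 60)
print(predict_roi((100, 50), (110, 55)))   # (100, 40, 140, 80)
```

When the detector fails inside the predicted window, the system falls back to the generic full-frame hint search described above.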

Object detection methods for robotics equipment

Object detection methods used with robotic equipment can be classified by their machine vision performance (how well they recognize objects) and efficiency (how much time they need to "understand" an image).


The first group comprises boosted cascade classifiers (also called "coarse-to-fine" classifiers). They work by eliminating image segments that do not match a predefined object, usually drawing on a set of filters to evaluate the segment under test. Since the operations are sequenced from light to heavy, this approach is highly efficient.
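The cascade mechanism itself is simple: a candidate window passes through a chain of increasingly expensive stage tests and is rejected at the first failure, so the vast majority of non-object windows pay only for the cheapest checks. A sketch with toy stage functions (placeholders, not real Haar-feature evaluations):

```python
# Boosted-cascade sketch: stages are ordered from cheap to expensive, and a
# window is rejected as soon as any stage fails (early exit).

def make_cascade(stages):
    def classify(window):
        for stage in stages:
            if not stage(window):
                return False       # early rejection: later stages never run
        return True
    return classify

# toy stages operating on a flat list of pixel intensities
cheap  = lambda w: sum(w) / len(w) > 50          # mean-brightness check
medium = lambda w: max(w) - min(w) > 30          # contrast check
heavy  = lambda w: sorted(w)[len(w) // 2] > 60   # median check (costliest)

detect = make_cascade([cheap, medium, heavy])
print(detect([80, 90, 100, 40]))   # True: passes all three stages
print(detect([10, 12, 11, 9]))     # False: rejected at the first stage
```

In the classic Viola-Jones face detector, each stage is itself a boosted combination of simple rectangle filters, trained so that early stages reject most background windows at negligible cost.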

The second group consists of dictionary-based object detection algorithms, which check the presence (or absence) of a single class in the image. When the image environment is known (such as pedestrian or car traffic), the expected object can be given higher priority, yielding high detection efficiency (less search). Limitations arise with touching or partly occluded objects, where the derived position is not accurate.
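A minimal form of the dictionary idea is a bag-of-visual-words presence test: quantize local feature vectors against a small learned codebook and declare the class present when enough of its characteristic codewords appear. The codebook, class words, and threshold below are invented for illustration:

```python
# Dictionary-based detection sketch: assign each feature to its nearest
# codebook word, then count how many assignments fall on words that are
# typical of the target class.

def nearest_codeword(feature, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((f - c) ** 2
                                 for f, c in zip(feature, codebook[i])))

def class_present(features, codebook, class_words, min_hits=2):
    hits = sum(1 for f in features
               if nearest_codeword(f, codebook) in class_words)
    return hits >= min_hits

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # 3 visual words
class_words = {1, 2}                              # words typical of the class
features = [(0.9, 0.1), (0.1, 0.9), (0.05, 0.05)]
print(class_present(features, codebook, class_words))   # True: two hits
```

Note how this answers only "is the class present?" — the histogram of word counts discards where the features came from, which is exactly why localization degrades for touching or occluded objects.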

Methods in the third group are based on handling object parts. Each object is described as a set of measurable parts, and the part descriptors may use gradient magnitudes and orientations. Combining these descriptors with a coarse-to-fine approach can speed up the processing.
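A typical gradient-orientation part descriptor is a HOG-style histogram: compute finite-difference gradients over a patch and accumulate magnitude-weighted votes into orientation bins. The sketch below uses 4 bins for brevity (standard HOG uses 9):

```python
import math

# Part-descriptor sketch: histogram of gradient orientations for one patch,
# with each pixel voting into a bin weighted by its gradient magnitude.

def orientation_histogram(patch, bins=4):
    hist = [0.0] * bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # horizontal gradient
            gy = patch[y + 1][x] - patch[y - 1][x]   # vertical gradient
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % math.pi        # unsigned orientation
            b = min(int(ang / math.pi * bins), bins - 1)
            hist[b] += mag                            # magnitude-weighted vote
    return hist

# a vertical edge gives strong horizontal gradients -> all votes in bin 0
patch = [[0, 0, 10, 10]] * 4
print(orientation_histogram(patch))   # [40.0, 0.0, 0.0, 0.0]
```

Because the histogram summarizes a part compactly, cheap histogram comparisons can prune candidate locations before any expensive full-part matching runs, which is where the coarse-to-fine speed-up comes from.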

Convolutional neural network (CNN) algorithms form the fourth group. This group is the most capable today, handling many object classes simultaneously and classifying them accurately, so it is more reliable and efficient than the previous groups. Algorithms in this group learn object features rather than being programmed with them. Alongside this advantage of data-driven classifiers comes a disadvantage: a large amount of training data is needed to reach their full performance.
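The elementary operation a CNN stacks many times is a convolution followed by a non-linearity. The plain-Python sketch below shows what one such layer computes; in a real network the kernel weights are learned from data, whereas here a hand-written edge kernel stands in for a learned feature:

```python
# Minimal CNN building block: one 2D convolution (valid padding, stride 1)
# followed by a ReLU non-linearity.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            s = sum(image[y + j][x + i] * kernel[j][i]
                    for j in range(kh) for i in range(kw))
            row.append(max(0.0, s))      # ReLU: negative responses clamped
        out.append(row)
    return out

# a horizontal-edge kernel responds where intensity rises top to bottom
image  = [[0, 0, 0], [0, 0, 0], [9, 9, 9]]
kernel = [[-1.0, -1.0], [1.0, 1.0]]
print(conv2d(image, kernel))   # [[0.0, 0.0], [18.0, 18.0]]
```

Training replaces the hand-written kernel with thousands of learned ones, which is why these detectors need large datasets but generalize across object appearance far better than hand-engineered features.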

Algorithms in the fifth group are structured algorithms built from machine vision modules, each dedicated to a different kind of detected item: a module for objects, a module for features, a module for text, and so on. Each module's parameters are set by training. Algorithms of this group can form an abstract object detection machine.
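Architecturally, this amounts to independent, separately trained modules registered behind one common interface and run over the same frame. A sketch with trivial stand-in modules (the module names and toy logic are hypothetical):

```python
# Structured-pipeline sketch: each registered module analyses the frame
# independently, and the pipeline collects all results under one roof.

class Pipeline:
    def __init__(self):
        self.modules = {}

    def register(self, name, fn):
        self.modules[name] = fn          # fn: frame -> list of findings

    def run(self, frame):
        return {name: fn(frame) for name, fn in self.modules.items()}

p = Pipeline()
# toy "detectors" on a text frame, standing in for trained vision modules
p.register("objects", lambda f: ["cup"] if "cup" in f else [])
p.register("text",    lambda f: [w for w in f.split() if w.isupper()])
print(p.run("WARNING cup on table"))
# {'objects': ['cup'], 'text': ['WARNING']}
```

Because modules share only the frame-in, findings-out contract, each can be retrained or swapped without touching the rest of the machine.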

RSIP Vision has all the experience needed to select the most fitting of these solutions for your data. Talk to us about it today and you might save precious time and money.

