Visual analysis of videos and images can help solve the new-item problem by automatically assigning tags to video content. Fully automatic scene understanding and tagging is still beyond the capabilities of current computer vision techniques. However, filtering videos into predefined, recognizable categories helps automatic scene analysis recognize objects and their interactions within the right context.
Videos containing human action form one broad but still reasonably well-defined category. Recognition of human bodies and faces in images has become a routine procedure in computer vision, owing to trained classifiers beginning with the celebrated Viola-Jones detector and its derivatives. Assuming we have detected, with high probability, a video containing humans and their interactions, how do we then go about recognizing the human actions?
Human action recognition
In many cases, action recognition is performed on video clips that have already been pre-segmented. This bottom-up approach introduces difficulties, mainly because the information (features) extracted from individually segmented regions is often insufficient to recognize an action. Additional information must be incorporated by linking and correlating features across all segments and their spatiotemporal relationships to produce an action class score, a daunting task prone to cumulative errors. To avoid such difficulties, one can adopt a top-down approach in which segmentation and action recognition are performed jointly.
Multiclass Support Vector Machines (SVMs) can be trained to assign labels to detected actions in the temporal domain of a video. The timescale of actions can be taken into account by way of dynamic programming over the duration (number of frames) of each action; this information must be incorporated during the classifiers' training procedure. Once several classes of actions have been recognized in a given video (including a null class, i.e. no action), the main challenge is to make sense of these actions and place them in higher-order categories based on their spatiotemporal relationships. To this end, the score obtained for each class by the SVM classifier must be analyzed, and a globally optimal score reached: one that best separates the true action class from all others.
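The dynamic-programming idea above can be sketched as follows. Assume a function `segment_score(s, t, c)` that returns a trained SVM's decision value for labeling frames `[s, t)` with class `c` (here it is a stand-in, not a real trained model); dynamic programming then finds the segmentation and labeling with the globally optimal total score:

```python
# Sketch: joint temporal segmentation and labeling via dynamic programming.
# `segment_score` stands in for a trained multiclass SVM's decision value
# on a candidate segment; the scoring function itself is an assumption.
def best_segmentation(n_frames, classes, segment_score, min_len=1):
    """Return (best total score, list of (start, end, class) segments)."""
    NEG = float("-inf")
    best = [NEG] * (n_frames + 1)   # best[t]: optimal score for frames [0, t)
    back = [None] * (n_frames + 1)  # backpointer: (segment start, class)
    best[0] = 0.0
    for t in range(1, n_frames + 1):
        for s in range(0, t - min_len + 1):
            if best[s] == NEG:
                continue
            for c in classes:
                cand = best[s] + segment_score(s, t, c)
                if cand > best[t]:
                    best[t] = cand
                    back[t] = (s, c)
    # Trace back the optimal segment boundaries and their labels
    segs, t = [], n_frames
    while t > 0:
        s, c = back[t]
        segs.append((s, t, c))
        t = s
    return best[n_frames], segs[::-1]
```

With a toy score that rewards correct per-frame labels, a ten-frame clip whose first half is "wave" and second half is "null" recovers exactly that labeling; enforcing a minimum segment length (`min_len`) is one simple way to encode the expected timescale of actions.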
Joint action segmentation and recognition in videos requires a deep understanding of how actions are structured in video. In addition, classifier training and score optimization are delicate work that must be handled with care, especially for complex scenes, where the dimensionality of the training data can hinder successful convergence and learning. Joint segmentation and categorization of actions extends to other applications besides recommender systems. To name just a few: video surveillance, where prediction of human actions can feed a crime alert system; pedestrian detection and intent prediction in ADAS; and video summarization. RSIP Vision's consultants and engineers can support you in all these areas of work.