Video recommender systems rank videos based on user preferences, viewing history and similarity to other users. Video streaming and image hosting sites often receive hundreds or thousands of new items daily, which need to be ranked and categorized before they enter the recommendation cycle. Users often skip adding tags and categories at the time of upload, which weakens the predictive power of the recommender system. The difficulty of recommending an untagged item newly uploaded to a video hosting site is commonly known as ‘the new item problem’.
Visual analysis of videos and images can help solve the new item problem by automatically assigning tags based on their content. Complete automatic scene understanding and tagging is still beyond the capabilities of computer vision techniques. However, first filtering videos into predefined, recognizable categories helps automatic scene analysis recognize objects and their interactions within the right context.
Videos that include human action form one broad but still reasonably well-defined category. Detection of human bodies and faces in images has become a routine procedure in computer vision, owing to trained classifiers starting with the celebrated Viola-Jones detector and its derivatives. Assuming we have determined with high probability that a video contains humans and their interactions, how do we then go about recognizing the human actions?
Human action recognition
In many cases, action recognition is performed on video clips that have already been segmented. This bottom-up approach introduces difficulties, mainly because the information (features) extracted from individually segmented regions is often insufficient to recognize an action. Additional information has to be incorporated by linking and correlating features from all segments, together with their spatiotemporal relationships, to produce an action class score, a daunting task prone to cumulative errors. To avoid such difficulties, one can adopt a top-down approach in which segmentation and action recognition are performed jointly.
Combining human action segmentation and recognition in a top-down approach can be regarded as event detection in the temporal domain. To this end, multiple classifiers operating in parallel extract sequences of events detected in various regions of the video. Intelligent evaluation of the classifiers’ outputs makes it possible to predict, with some degree of certainty, the event being shown in the video. Of course, human actions can span timescales ranging from fractions of a second to many hours, and this should be accounted for in the predictions. For the sake of discussion, however, we’ll limit ourselves to actions that can be properly detected over a sequence of several tens of video frames.
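To make the idea of evaluating several classifiers' outputs concrete, here is a minimal sketch in Python. It assumes each classifier emits, for every frame, a score between 0 and 1 per event label; the function name fuse_window, the data layout and the simple averaging rule are illustrative assumptions, not a description of any particular system.

```python
from collections import defaultdict


def fuse_window(classifier_outputs, start, length):
    """Average per-frame scores from several classifiers over one window.

    classifier_outputs: one entry per classifier, each a list with one
    dict per frame mapping event label -> score in [0, 1] (assumed layout).
    Returns (best_label, mean_score, margin_over_runner_up).
    """
    totals = defaultdict(float)
    count = 0
    for frames in classifier_outputs:
        for frame_scores in frames[start:start + length]:
            count += 1
            for label, score in frame_scores.items():
                totals[label] += score
    if count == 0 or not totals:
        raise ValueError("window contains no scores")
    # A label missing from a frame simply contributes 0 for that frame.
    means = {label: total / count for label, total in totals.items()}
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_score, best_score - runner_up
```

The margin over the runner-up label serves here as a crude stand-in for the "degree of certainty" of the prediction; a small margin suggests the window is ambiguous and should not be tagged automatically.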
Multiclass Support Vector Machines (SVM) can be trained to assign labels to actions detected in the temporal domain of videos. The timescale of actions can be taken into account by means of dynamic programming over the duration (number of frames) of each action. Such information needs to be incorporated into the classifiers’ training procedure. Once several classes of actions have been recognized in a given video (including a null class, i.e. no action), the main challenge is to make sense of these actions and place them in higher-order categories based on their spatiotemporal relationships. To this end, the score that the SVM classifier assigns to each class needs to be analyzed. The goal is a globally optimal score, one which best separates the true action class from all others.
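The role of dynamic programming can be illustrated with a short Python sketch. It assumes that per-frame class scores are already available (for example, decision values from a trained multiclass SVM), that a segment's score is the sum of its frame scores for the chosen class, and that every action lasts between min_len and max_len frames. The function name segment_actions and this scoring rule are assumptions made for illustration only.

```python
def segment_actions(frame_scores, min_len, max_len):
    """Jointly segment and label a frame sequence by dynamic programming.

    frame_scores[t][c] is the score of class c at frame t (assumed to come
    from a classifier; one class index can serve as the null class).
    Every segment must last between min_len and max_len frames.
    Returns (total_score, [(start, end_exclusive, class_index), ...]).
    """
    n = len(frame_scores)
    if n == 0:
        return 0.0, []
    n_classes = len(frame_scores[0])
    # prefix[c][t] = sum of class-c scores over frames 0..t-1
    prefix = [[0.0] * (n + 1) for _ in range(n_classes)]
    for c in range(n_classes):
        for t in range(n):
            prefix[c][t + 1] = prefix[c][t] + frame_scores[t][c]

    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[t]: best total score for frames 0..t-1
    back = [None] * (n + 1)     # back[t]: (start, class) of the segment ending at t
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(min_len, min(max_len, end) + 1):
            start = end - length
            if best[start] == neg_inf:
                continue  # no valid segmentation of frames 0..start-1
            for c in range(n_classes):
                score = best[start] + prefix[c][end] - prefix[c][start]
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, c)
    if best[n] == neg_inf:
        raise ValueError("no segmentation satisfies the duration limits")

    # Walk the back-pointers from the last frame to recover the segments.
    segments = []
    end = n
    while end > 0:
        start, c = back[end]
        segments.append((start, end, c))
        end = start
    segments.reverse()
    return best[n], segments
```

Because the segment boundaries and the class labels are chosen in the same maximization, segmentation and recognition are solved jointly rather than one after the other. In a real system the duration limits and the scoring function would be learned during training rather than fixed by hand.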
Joint action segmentation and recognition in videos requires a deep understanding of the structure of actions. In addition, training the classifiers and optimizing their scores is delicate work that should be handled with care, especially for complex scenes in which the dimensionality of the training data can hinder successful convergence and learning. Joint segmentation and categorization of actions also has applications beyond recommender systems. To name just a few examples: video surveillance, in which prediction of human actions can feed a crime alert system; pedestrian detection and intent prediction in advanced driver assistance systems (ADAS); and video summarization. RSIP Vision’s consultants and engineers can support you in all these areas of work.