The explosion of data collection techniques and resources is a well-known phenomenon of our time. Human analysis of such large datasets is largely infeasible, and the vast amount of information can only be dealt with systematically and efficiently by algorithmic means. Over the last decade, the so-called ‘big data’ field of research has been enriched with various methods and techniques for performing such tasks. Together with the development of hardware that speeds up processing, tasks involving large amounts of data can now be handled even in real time.

One of the major obstacles of the big data era is the difficulty of making sense of the data at hand. For applications where the information being sought is well defined, the algorithms need be no more than near-routine categorization and retrieval techniques. But what if the patterns are unknown? How can one make sense of all the data?

To answer this question, unsupervised clustering algorithms have been proposed. Since a way of choosing the clustering criterion autonomously is still largely missing, these algorithms rely on a criterion defined beforehand. To this end, pattern discovery methods come into play. Detecting patterns in data is of high importance in many applications: for example, intrusion detection in Internet traffic, monitoring the movement of people traced through cellphone call logs, and making sense of gene transcription maps, protein interactions, money transfers and credit histories, as well as other artificial intelligence tasks.

Techniques for statistical pattern discovery

Techniques for pattern discovery must be able to detect patterns in noisy and incomplete data, which makes them essentially probabilistic, and to detect high-order patterns based on the interrelationships between data features. Classical techniques rely on detecting statistically significant features through frequency analysis of their occurrence in the data. To make a discovered pattern intuitive to human understanding, the relationships between the detected features must also be given.
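As a rough illustration of frequency-based detection, the sketch below counts how often pairs of feature values occur together and compares each count with what independence would predict. It is a minimal example under stated assumptions, not a description of any particular production method: the function name significant_pairs, the toy traffic records, and the 1.96 cut-off are all hypothetical choices, missing values are simply skipped, and no correction for multiple comparisons is applied.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def significant_pairs(records, threshold=1.96):
    """Return (item_a, item_b, z) for pairs that co-occur more than chance predicts."""
    n = len(records)
    single = Counter()  # occurrences of each (feature, value) item
    pair = Counter()    # co-occurrences of each pair of items
    for rec in records:
        # Missing values (None) are skipped rather than imputed.
        items = sorted((k, v) for k, v in rec.items() if v is not None)
        single.update(items)
        pair.update(combinations(items, 2))

    found = []
    for (a, b), observed in pair.items():
        p_a, p_b = single[a] / n, single[b] / n
        expected = n * p_a * p_b  # count expected if a and b were independent
        variance = expected * (1 - p_a) * (1 - p_b)
        if variance <= 0:
            continue  # an item present in every record carries no information
        z = (observed - expected) / sqrt(variance)  # adjusted residual
        if z > threshold:
            found.append((a, b, round(z, 2)))
    return sorted(found, key=lambda t: -t[2])


# Hypothetical toy data: two traffic profiles plus one incomplete record.
records = (
    [{"proto": "tcp", "flag": "syn"}] * 10
    + [{"proto": "udp", "flag": "none"}] * 10
    + [{"proto": "tcp", "flag": None}]
)
patterns = significant_pairs(records)
```

Each returned triple names two feature values and the adjusted residual of their co-occurrence count, so the relationship between the detected features is reported along with the features themselves.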

This requirement leads naturally to a tree (graph) representation of the data features for pattern discovery. Much as in a random forest, nodes of the graph hold decision rules that maximize the information gained from a statistically significant feature and its relationship with others. Statistical tests placed at each node allow us to prune the graph’s nodes until a decision is made regarding the set of features making up a pattern. Of course, this forces us to define an underlying basic statistical relationship between features, which is used in the hypothesis tests.
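One way to picture the tree with a test at each node is the sketch below. It grows patterns depth first, one feature value at a time, and prunes a branch whenever the new item is not significantly associated with the pattern built so far. Everything here is an assumption made for illustration: the helper names support, association_z and grow, the toy records, the independence baseline used as the underlying statistical relationship, and the 1.96 threshold.

```python
from math import sqrt


def support(records, items):
    """Number of records that contain every (feature, value) item."""
    return sum(1 for r in records if all(r.get(k) == v for k, v in items))


def association_z(records, pattern, item):
    """Adjusted residual for 'pattern and item occur together', versus independence."""
    n = len(records)
    n_pat = support(records, pattern)
    n_item = support(records, [item])
    n_both = support(records, list(pattern) + [item])
    expected = n_pat * n_item / n
    variance = expected * (1 - n_pat / n) * (1 - n_item / n)
    return (n_both - expected) / sqrt(variance) if variance > 0 else 0.0


def grow(records, candidates, pattern=(), threshold=1.96):
    """Depth-first growth of the pattern tree; returns the patterns at its leaves."""
    leaves = []
    for i, item in enumerate(candidates):
        if any(item[0] == k for k, _ in pattern):
            continue  # a pattern holds at most one value per feature
        if pattern and association_z(records, pattern, item) <= threshold:
            continue  # node test failed: prune this branch
        child = pattern + (item,)
        deeper = grow(records, candidates[i + 1:], child, threshold)
        leaves.extend(deeper if deeper else [child])
    return leaves


# Hypothetical toy data with two clean three-feature patterns.
records = (
    [{"proto": "tcp", "flag": "syn", "port": 22}] * 10
    + [{"proto": "udp", "flag": "none", "port": 53}] * 10
)
candidates = sorted({(k, v) for r in records for k, v in r.items()})
leaves = grow(records, candidates)
full = [p for p in leaves if len(p) == 3]  # keep only the deepest patterns
```

The leaves also include shorter sub-patterns and single items, since every candidate starts its own branch at the root; filtering by length, as in the last line, is one simple way to keep only the deepest patterns.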

Non-parametric tests can be incorporated into the decision function at each node, relieving us of the need to assign a model to describe the data. However, well-defined statistical models are useful when modeling a pattern of a rare event, or an event occurring in the heavy tails of a distribution representing the data. A pattern, that is, a tree representation of the interrelations between events, can then be assigned a weight and a probability; it can also be searched for iteratively as the database fills up, or offline at periodic intervals.
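A permutation test is one common non-parametric choice for such a node-level decision, since it makes no assumption about the distribution of the data. The sketch below is illustrative only: the function name permutation_p_value, the difference-in-means statistic, the number of permutations, and the toy samples are all assumptions rather than details taken from the text.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means of two samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical measurements of one feature with and without a candidate pattern.
with_pattern = [11, 12, 13, 14, 15, 16, 17, 18]
without_pattern = [1, 2, 3, 4, 5, 6, 7, 8]
p_separated = permutation_p_value(with_pattern, without_pattern)
p_identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
```

A node could keep a split when the p-value falls below a chosen level and prune it otherwise. For rare or heavy-tailed events, where few observations are available to permute, a fitted parametric model may be the more practical option, as noted above.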

The algorithm developer stands at the intersection of statistics, machine learning, signal processing and pattern recognition. We at RSIP Vision require our algorithm developers to possess broad interdisciplinary knowledge, the integration of which allows us to construct cutting-edge algorithms for many leading companies around the globe.
