Item title:
A lightweight approach to two-person interaction classification in sparse image sequences
A lightweight neural network-based approach to two-person interaction classification in sparse image sequences, based on the pre-detection of human skeletons in video frames, is proposed. The idea is to use an ensemble of "weak" pose classifiers, where each classifier is trained on a different time-phase of the same set of actions. Thus, unlike in typical ensemble classifiers, the expertise of the "weak" classifiers is distributed over time rather than over the feature domain. Each classifier is trained independently to classify time-indexed snapshots of a visual action, while the overall classification result is a weighted combination of their outputs. The training data requires no extra labeling effort, as individual frames are automatically assigned time indices. The use of pose classifiers for video classification is key to achieving a lightweight solution, as it limits the motion-based feature space in the deep encoding stage. Another important element is the exploitation of the semantics of the skeleton data, which turns the input data into reliable and powerful feature vectors. In other words, we avoid spending ANN resources on learning feature-related information that can already be extracted analytically from the skeleton data. An algorithm for merging-elimination and normalization of skeleton joints is developed. Our method is trained and tested on the interaction subset of the well-known NTU-RGB+D dataset, although only 2D skeleton information is used, as is typical in video analysis. The test results show that our method performs comparably to some of the best reported STM- and CNN-based classifiers for this dataset when they process sparse frame sequences, as we do. The recently proposed multistream Graph CNNs have shown superior results, but only when processing dense frame sequences. Given that skeleton estimation in every frame dominates the processing time and resources of the whole pipeline, the key to real-time interaction recognition is to limit the number of processed frames.
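To make the time-distributed ensemble idea concrete, the following is a minimal sketch of how per-time-phase "weak" pose classifiers could be combined by a weighted sum of their class scores. It is an illustration under stated assumptions, not the authors' implementation: the class name TimePhaseEnsemble, the uniform phase weights, the use of scikit-learn LogisticRegression as a stand-in pose classifier, and the synthetic skeleton features are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class TimePhaseEnsemble:
    """Weighted combination of 'weak' pose classifiers, one per time-phase (sketch)."""

    def __init__(self, phase_classifiers, phase_weights=None):
        self.classifiers = phase_classifiers          # one classifier per time-phase
        n = len(phase_classifiers)
        self.weights = (np.ones(n) / n if phase_weights is None
                        else np.asarray(phase_weights, dtype=float))

    def predict_proba(self, snapshots):
        # snapshots: one skeleton feature vector per time-phase of the sequence
        scores = np.stack([clf.predict_proba(x.reshape(1, -1))[0]
                           for clf, x in zip(self.classifiers, snapshots)])
        # weighted combination of the per-phase class distributions
        return np.average(scores, axis=0, weights=self.weights[:len(scores)])

    def predict(self, snapshots):
        return int(np.argmax(self.predict_proba(snapshots)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_phases, n_feat, n_classes = 5, 50, 11   # e.g. 11 interaction classes, values illustrative

    # Purely synthetic stand-in for time-indexed skeleton feature vectors.
    phase_clfs = []
    for _ in range(n_phases):
        X = rng.normal(size=(220, n_feat))
        y = np.tile(np.arange(n_classes), 20)  # every class present in each phase
        phase_clfs.append(LogisticRegression(max_iter=200).fit(X, y))

    ensemble = TimePhaseEnsemble(phase_clfs)
    sequence = [rng.normal(size=n_feat) for _ in range(n_phases)]  # one snapshot per phase
    print("predicted class:", ensemble.predict(sequence))
```

The design choice the sketch mirrors is that each phase classifier only ever sees single-frame skeleton features, so the sequence-level decision costs one forward pass per sampled frame plus a cheap weighted averaging step.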
1. Track 3: 4th International Workshop on Artificial Intelligence in Machine Vision and Graphics
2. Record developed with funds from MEiN, agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularization of science and promotion of sport (2022-2023).