Transformer-based Lightweight Architectures for Real-Time Open-Set Human Activity Recognition in Unconstrained Video Environments

Prakhar Kumar Agarwal, Partap Singh

Abstract

Human Activity Recognition (HAR) in videos has become a vital component of intelligent systems used in surveillance, healthcare, and smart environments. However, real-world HAR presents unique challenges, including the need for real-time performance, adaptability to previously unseen actions (open-set recognition), and robustness in unconstrained environments characterized by occlusion, clutter, and variable motion. Existing deep learning approaches, particularly convolutional and recurrent neural networks, often fall short due to their high computational demands and limited generalization capability. This paper proposes a novel Transformer-based Lightweight Architecture tailored for Real-Time Open-Set Human Activity Recognition in Unconstrained Video Environments. Our model integrates a dual-stream attention mechanism that captures fine-grained spatial and temporal dependencies while maintaining a low computational footprint through efficient transformer modules. To address open-set classification, we introduce an uncertainty-aware recognition module that dynamically distinguishes known from unknown actions using statistical embedding separation. The framework further incorporates a dynamic frame selection strategy to eliminate redundancy and enhance temporal saliency. Extensive experiments on benchmark datasets such as UCF101, HMDB51, NTU RGB+D, and Kinetics-400 demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy, open-set detection capability, and real-time efficiency. With a model size under 15 MB and latency below 30 ms/frame on edge devices, the proposed architecture is well-suited for deployment in resource-constrained, real-world HAR applications.
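The abstract's "uncertainty-aware recognition module" that "distinguishes known from unknown actions using statistical embedding separation" could, under one common reading, be realized by modeling each known class's embeddings as a Gaussian and rejecting test embeddings whose minimum Mahalanobis distance exceeds a calibrated threshold. The sketch below illustrates that idea only; the function names, the Gaussian-per-class assumption, and the threshold value are all hypothetical, not taken from the paper.

```python
import numpy as np

def fit_class_stats(embeddings_by_class):
    """Summarize each known class by its embedding mean and inverse covariance.

    All structure here is an illustrative assumption, not the paper's method:
    a small ridge term keeps the covariance invertible.
    """
    stats = {}
    for label, embs in embeddings_by_class.items():
        embs = np.asarray(embs, dtype=float)
        mu = embs.mean(axis=0)
        cov = np.cov(embs, rowvar=False) + 1e-3 * np.eye(embs.shape[1])
        stats[label] = (mu, np.linalg.inv(cov))
    return stats

def open_set_predict(z, stats, threshold):
    """Return (label, distance); label is 'unknown' past the threshold."""
    z = np.asarray(z, dtype=float)
    best_label, best_d = None, np.inf
    for label, (mu, cov_inv) in stats.items():
        diff = z - mu
        d = float(np.sqrt(diff @ cov_inv @ diff))  # Mahalanobis distance
        if d < best_d:
            best_label, best_d = label, d
    # Embeddings far from every known class are flagged as unknown actions
    return (best_label if best_d <= threshold else "unknown"), best_d
```

For example, with two tight embedding clusters for "walk" and "run", a point near the "walk" cluster is accepted as known, while a distant outlier is rejected as an unseen action.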
