Abstract
Vision–language models (VLMs) offer a promising route to fall detection for socially assistive robots in home and care settings, where timely recognition can trigger assistance or further verification by carers or interactive robots. Most vision-based fall detectors are supervised and require task-specific labelled data and/or robust pose estimation, which can be brittle under occlusion and viewpoint changes and costly to adapt across deployments. This paper investigates whether pretrained VLMs can enable data-free fall detection via zero-shot prompting, and how much a lightweight few-shot calibration step improves performance without requiring backbone tuning. We present (i) a zero-shot detector based on a balanced contrastive prompt bank, and (ii) a few-shot variant that trains only a linear classifier on frozen VLM embeddings. We evaluate these alongside three skeleton-based supervised baselines (2D CNN, 3D CNN, ViT) and a rule-based heuristic on a balanced test set of 40 single-person videos (20 fall, 20 non-fall), with identical windowing (32 frames, 50% overlap) and video-level aggregation (majority vote). The few-shot VLM achieves 100% accuracy, while the zero-shot VLM reaches 92.5% accuracy without fall-specific training data (3 false positives on non-fall videos). Skeleton-based baselines achieve 97.5–100% accuracy but require pose extraction, increasing pipeline complexity. These results suggest that pretrained VLMs can provide a practical perception trigger for robot-in-the-loop verification and escalation in assistive care, with zero-shot prompting achieving high recall at the cost of a small number of false alarms.
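The evaluation protocol summarised in the abstract (32-frame windows with 50% overlap, a contrastive prompt comparison per window, and a majority vote per video) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the mean-cosine scoring rule, the handling of videos shorter than one window, and the tie-break towards "non-fall" are all assumptions made for illustration; the abstract specifies only the window size, the overlap, and the use of majority voting. Embeddings are assumed to be precomputed by some frozen VLM encoder, which is outside the sketch.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def sliding_windows(num_frames: int, size: int = 32,
                    overlap: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    Size 32 and overlap 0.5 follow the abstract; returning a single
    short window for videos under `size` frames is an assumption.
    """
    stride = max(1, int(size * (1.0 - overlap)))
    if num_frames <= 0:
        return []
    if num_frames < size:
        return [(0, num_frames)]
    return [(s, s + size) for s in range(0, num_frames - size + 1, stride)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_window_label(window_emb: Sequence[float],
                           fall_prompt_embs: Sequence[Sequence[float]],
                           nonfall_prompt_embs: Sequence[Sequence[float]]) -> str:
    """Label one window by comparing mean similarity to each prompt group.

    Assumed scoring rule: mean cosine similarity per class, with the
    higher mean winning. The paper's actual rule may differ.
    """
    fall = sum(cosine(window_emb, p) for p in fall_prompt_embs) / len(fall_prompt_embs)
    nonfall = sum(cosine(window_emb, p) for p in nonfall_prompt_embs) / len(nonfall_prompt_embs)
    return "fall" if fall > nonfall else "non-fall"


def majority_vote(labels: Sequence[str], tie_label: str = "non-fall") -> str:
    """Aggregate window labels into one video label (tie-break is assumed)."""
    counts = Counter(labels)
    if counts["fall"] > counts["non-fall"]:
        return "fall"
    if counts["fall"] < counts["non-fall"]:
        return "non-fall"
    return tie_label
```

Under these assumptions, a 96-frame video yields five windows starting at frames 0, 16, 32, 48 and 64, and the video is labelled "fall" only if more than half of its windows are. Breaking ties towards "non-fall" trades recall for fewer false alarms; a deployment that prioritises recall could pass `tie_label="fall"` instead.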
| Original language | English |
|---|---|
| Title of host publication | Social Robotics + Art |
| Subtitle of host publication | 18th International Conference, ICSR+Art 2026, London, UK, Proceedings |
| Publication status | Accepted/In press - 19 Apr 2026 |
| Title | Vision–Language Model for Fall Detection in Socially Assistive Robotics: Zero-Shot Prompting and Few-Shot Calibration |