This paper was accepted at the Learning from Time Series for Health workshop at NeurIPS 2025.
Sensor data streams provide valuable information about activities and context for downstream applications, but integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time-series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. Evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion of modality-specific model outputs can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Furthermore, LLM-based fusion can enable model deployment without requiring additional memory and computation for targeted application-specific multimodal models.
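To make the late-fusion setup concrete, the sketch below shows one plausible way to serialize the outputs of two modality-specific classifiers into a zero-shot prompt for an LLM. The class names, score values, function names, and prompt wording are all illustrative assumptions, not the paper's actual pipeline; the LLM call itself is omitted.

```python
# Hedged sketch: per-class scores from an audio model and a motion model are
# serialized into a text prompt so a zero-shot LLM can act as the late-fusion
# step. All names, labels, and prompt wording here are hypothetical.

ACTIVITY_CLASSES = ["cooking", "cleaning", "playing basketball"]  # example subset

def top_k(scores: dict, k: int = 3) -> list:
    """Return the k highest-scoring (label, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

def build_fusion_prompt(audio_scores: dict, motion_scores: dict, classes: list) -> str:
    """Format both models' top predictions into a single fusion prompt."""
    lines = [
        "Fuse the predictions below into one activity label.",
        f"Choose exactly one of: {', '.join(classes)}.",
        "Audio model top predictions:",
    ]
    for label, p in top_k(audio_scores):
        lines.append(f"  {label}: {p:.2f}")
    lines.append("Motion model top predictions:")
    for label, p in top_k(motion_scores):
        lines.append(f"  {label}: {p:.2f}")
    lines.append("Answer with the single best label.")
    return "\n".join(lines)

prompt = build_fusion_prompt(
    {"cooking": 0.6, "cleaning": 0.3, "playing basketball": 0.1},
    {"cooking": 0.5, "playing basketball": 0.4, "cleaning": 0.1},
    ACTIVITY_CLASSES,
)
print(prompt)
```

Because fusion happens in text space, the two unimodal models never need a jointly trained embedding, which is the property the abstract highlights for settings with little aligned training data.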

