Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from [1] (which relies on acoustic features) and show that, overall, representation vectors from pre-trained FMs offer performance comparable to the baseline. Notably, HR estimation using the representations from the audio encoder of the in-house CLAP model outperforms the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.
- † University of North Carolina at Chapel Hill
- § Johns Hopkins University
- ‡ Work done while at Apple
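As a rough illustration of the layer-wise probing described above, the sketch below extracts one mean-pooled embedding per transformer layer from a pre-trained HuBERT model (one of the six FMs studied) via the HuggingFace transformers API, then fits a linear probe and scores it with MAE. The checkpoint choice, mean pooling, ridge probe, and the names `pcg_clips`, `hr_labels`, `layerwise_embeddings`, and `probe_layer` are assumptions for illustration, not the paper's exact pipeline.

```python
# Minimal sketch of layer-wise FM probing for HR estimation.
# Assumes the HuggingFace `transformers` API; pooling and the
# ridge-regression probe are illustrative choices, not the
# paper's exact setup.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

def layerwise_embeddings(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return one mean-pooled embedding per layer, shape (L+1, D)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (1, T, D) tensors, one per layer
    # (including the convolutional front-end output).
    return torch.stack(
        [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
    ).numpy()

# `pcg_clips` / `hr_labels` would be PCG recordings with reference
# heart-rate annotations (beats per minute); hypothetical stand-ins here.
def probe_layer(train_X, train_y, test_X, test_y) -> float:
    """Fit a linear probe on one layer's embeddings; return test MAE."""
    probe = Ridge(alpha=1.0).fit(train_X, train_y)
    return mean_absolute_error(test_y, probe.predict(test_X))
```

Sweeping `probe_layer` over each row index of the stacked embeddings then yields a per-layer MAE profile, which is the kind of quantity a layer-wise investigation would compare across FMs and against an acoustic-feature baseline.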