The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Consequently, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task is to predict the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to one another in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens a new trajectory for general self-supervised pretraining across diverse data types, from images to EEG signals, with potential applications in medical imaging, video, and audio.
- † University of Amsterdam
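To make the contrast with grid-based pretext tasks concrete, here is a minimal sketch of the core idea: sample patches at continuous (off-grid) positions and regress the continuous relative transformation between a pair of patches, rather than classifying a discrete grid index. The function names, the square-patch parameterization, and the (translation, log-scale) target are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

def sample_offgrid_patch(img_size, patch_size, rng=random):
    """Sample a square patch at a continuous, off-grid position.

    Returns (x, y, side): top-left corner and side length, all floats,
    so patches are not constrained to a fixed grid. (Illustrative only.)
    """
    scale = rng.uniform(0.5, 1.5)          # continuous patch scale
    side = patch_size * scale
    x = rng.uniform(0.0, img_size - side)  # continuous position
    y = rng.uniform(0.0, img_size - side)
    return (x, y, side)

def relative_transform(patch_a, patch_b):
    """Continuous relative transformation from patch A to patch B:
    translation normalized by A's side length, plus a log scale ratio.

    A model would regress this continuous target for patch pairs,
    unlike grid-based tasks that predict an absolute position index.
    """
    xa, ya, sa = patch_a
    xb, yb, sb = patch_b
    return ((xb - xa) / sa, (yb - ya) / sa, math.log(sb / sa))
```

For example, two same-sized patches offset by one patch width horizontally and half a width vertically yield the target `(1.0, 0.5, 0.0)`; the target varies smoothly as the patches move, which is what makes the positioning off-grid and continuous.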

