Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.
- † Stanford University
- ‡ Work done while at Apple

