Evaluating BLIP-2, LLaVA, and GPT-4V for Accessible Artwork Description Generation in Virtual Museums: A Comparative Study Based on WCAG-Aligned Evaluation and User Satisfaction
Keywords:
accessible image description, vision-language evaluation, virtual museum accessibility, WCAG-aligned assessment

Abstract
Virtual museums have expanded cultural access beyond physical boundaries, yet people with visual impairments remain excluded from artwork experiences due to insufficient image descriptions. While vision-language approaches offer the potential to automate accessible content generation, their effectiveness in art-specific contexts has not been rigorously assessed. This study presents a comparative empirical evaluation of three representative vision-language approaches available as of early 2024 (BLIP-2, LLaVA, and GPT-4V) for generating accessible artwork descriptions in virtual museum environments. Using a curated evaluation set of 250 artworks spanning six genres from the SemArt dataset, we compare descriptions produced under baseline and art-optimized prompt conditions. Evaluation combines automated captioning metrics (BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr-D), a WCAG 2.1-aligned rubric scored by trained accessibility evaluators, and a user study with 18 blind and low-vision participants. Results indicate that GPT-4V with art-optimized prompts achieves the highest CIDEr-D score (0.476) and WCAG sufficiency rating (3.87/5.00), while all three approaches exhibit notable performance degradation on abstract artworks. User preference data and qualitative feedback suggest that contextual richness, in addition to factual accuracy, may play an important role in shaping satisfaction among visually impaired users. These findings provide practical guidance for virtual museum developers seeking to deploy AI-generated accessible content at scale.
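The automated-metric stage of this evaluation is straightforward to reproduce. Below is a minimal sketch, assuming the open-source pycocoevalcap toolkit (the standard COCO caption-evaluation package); the function name score_descriptions and its input dictionaries are illustrative, not part of the study's released code.

```python
# Illustrative sketch (not the authors' code): scoring generated artwork
# descriptions against reference descriptions with pycocoevalcap.
# Note: Meteor and PTBTokenizer shell out to Java, which must be installed.
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_descriptions(references, candidates):
    """references: {artwork_id: [ref_text, ...]};
    candidates: {artwork_id: [generated_text]} (one description per artwork)."""
    tok = PTBTokenizer()
    gts = tok.tokenize({k: [{"caption": c} for c in v] for k, v in references.items()})
    res = tok.tokenize({k: [{"caption": c} for c in v] for k, v in candidates.items()})

    results = {}
    bleu_scores, _ = Bleu(4).compute_score(gts, res)  # returns [BLEU-1 .. BLEU-4]
    results["BLEU-1"], results["BLEU-4"] = bleu_scores[0], bleu_scores[3]
    results["METEOR"], _ = Meteor().compute_score(gts, res)
    results["ROUGE-L"], _ = Rouge().compute_score(gts, res)
    results["CIDEr-D"], _ = Cider().compute_score(gts, res)  # toolkit's CIDEr scorer
    return results
```

In this setup, each artwork ID maps to its SemArt reference description(s) and to the single description produced by one model under one prompt condition, so the function is run once per model-condition pair to populate a results table.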
Published
2026-05-06