Lia will be presenting one of her latest works.
Abstract:
Automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and for supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, often referred to as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark.
Since there is still no single agreed-upon definition of the long tail, I will also present the first step of a research line aimed at identifying the sociodemographic variables that tend to characterize underrepresented entities. This research also seeks to understand whether LLMs produce different outputs for well-known versus less common entities, and what implications this may have in everyday life.
