Categories
Meetings

The problem of Long-tail entities: definition and benchmarking

Lia will be presenting one of her latest works.

Abstract:
Automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and for supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, often referred to as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark.

Since a univocal definition of the long tail remains an open challenge, I will also present the first step of a research line aimed at clearly identifying the sociodemographic variables that tend to characterize underrepresented entities. This research line also aims to understand whether LLMs produce different outputs when dealing with well-known versus less common entities, and what implications this may have in everyday life.

Pushing the Boundaries of Fixed Budget Best Arm Identification

CCC Seminar by Luigi Sauro, Professor of Computer Science at Università degli Studi di Napoli “Federico II”.

Abstract: Suppose you have a finite set of stochastic mechanisms (arms), each of which returns, at every interaction, a reward drawn from a distribution unknown to you. The goal of Fixed-Budget Best Arm Identification is to identify, within a fixed number of sequential interactions, the arm with the highest expected reward. This problem gives rise to a typical exploration vs exploitation dilemma that arises in many application contexts (clinical trials, wireless network selection, recommender systems, A/B testing). In this seminar I will present a new interaction strategy that applies under the assumption that the arms are governed by sub-Gaussian distributions and that makes use of non-linear convex optimization techniques. I will also show a theoretical upper bound that highlights why this strategy improves on the state of the art, and an experimental analysis that supports its effectiveness.
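
For readers new to the setting, here is a minimal sketch of Successive Rejects, a classical fixed-budget baseline due to Audibert, Bubeck and Munos (2010). It is not the strategy presented in the seminar, which relies on convex optimization. The function name, the Gaussian arms, their means, the noise level and the budget are all illustrative assumptions.

```python
import math
import random


def successive_rejects(arms, budget):
    """Guess the index of the best arm using at most `budget` pulls.

    `arms` is a list of zero-argument callables, each returning one
    stochastic reward. This is the classical Successive Rejects scheme:
    K-1 phases, dropping the empirically worst arm after each phase.
    """
    k = len(arms)
    if k == 1:
        return 0
    if budget <= k:
        raise ValueError("budget must exceed the number of arms")

    # Normalising constant from the original algorithm.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, k + 1))

    active = list(range(k))
    totals = [0.0] * k   # cumulative reward per arm
    pulls = [0] * k      # number of pulls per arm
    prev_n = 0

    for phase in range(1, k):
        # Target number of pulls per surviving arm at the end of this phase.
        n_phase = math.ceil((budget - k) / (log_bar * (k + 1 - phase)))
        for arm in active:
            for _ in range(n_phase - prev_n):
                totals[arm] += arms[arm]()
                pulls[arm] += 1
        prev_n = n_phase
        # Reject the arm with the lowest empirical mean.
        worst = min(active, key=lambda a: totals[a] / pulls[a])
        active.remove(worst)

    return active[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical Gaussian (hence sub-Gaussian) arms with assumed means.
    means = [0.1, 0.3, 0.5, 0.9]
    arms = [lambda m=m: rng.gauss(m, 0.5) for m in means]
    print("guessed best arm:", successive_rejects(arms, budget=2000))
```

The whole budget is fixed in advance and the only output is a single guess, so all sampling effort goes into telling the best arm apart from its closest competitors.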

Bio: https://www.docenti.unina.it/teacher/4c55494749534155524f5352414c475537345432394638333941/profile/references

When: June 5th, 15:00
Where: Sala Riunioni, first floor

What the harm? Gender Bias and Inclusive Language in Multilingual Translation Technologies.

Seminar by Beatrice Savoldi, researcher at Fondazione Bruno Kessler (FBK) within the MT research unit.

Abstract
Societal gender asymmetries, inequalities and stereotypes can be embedded in our communication practices and perpetuated in language technologies, including multilingual and Machine Translation (MT) systems used at scale. In this presentation, we will delve into the current landscape of MT and multilingual gender bias, as well as current proposals towards more inclusive language. We will discuss the theoretical, technical and linguistic challenges and opportunities in fostering a more equitable technology for different groups of users.

Bio
Beatrice Savoldi is a researcher at Fondazione Bruno Kessler (FBK) within the MT research unit, where she mainly works on gender inclusive and human-centered translation technologies.
Beatrice carried out a joint international PhD at the Universities of Trento and Augsburg with a dissertation on gender bias in Machine Translation, which was awarded the 2023 best thesis Research Prize from the Augsburg University Foundation.
Her research interests broadly encompass ethical and social considerations of (multilingual) language technologies.

When: 29/05/2026 11.00
Where: Aula 3.06 Thin Client (3rd floor) – Via Sant’Ottavio 54, Torino.

HurtLens: A Perspectivist Corpus Analysis of Hurtful Language


Samuele D’Avenia and Eliana di Palma will be presenting one of their latest works.

Abstract:
Offensive language detection systems often rely on majority-aggregated annotations, overlooking the diversity of perspectives that shape how different communities perceive harm. In this contribution, we introduce HurtLens, a perspectivist corpus of hurtful language that leverages four disaggregated datasets, automatically enriched with lemmas from HurtLex, a multilingual resource of offensive and derogatory terms. Using mixed-effects modeling, we investigate how annotators’ sociodemographic backgrounds, the presence of specific types of offensive language (through HurtLex categories) and their interaction influence offensiveness ratings. Our analysis reveals that offensiveness ratings are influenced both by annotators’ sociodemographic characteristics (particularly when considered in intersection) and by the presence of specific types of offensive language. Additionally, we identify significant interaction effects showing that different demographic groups vary in their sensitivity to texts containing particular types of offensive language.

When: Friday, 15/05 at 11:30
Where: Sala Conferenze, 3rd Floor