Categories
Meetings

The problem of Long-tail entities: definition and benchmarking

Lia will be presenting one of her latest works.

Abstract:
Automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and for supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, often referred to as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark.

Since an unambiguous definition of the long tail remains an open challenge, I will also present the first step of a research line aimed at clearly identifying the sociodemographic variables that tend to characterize underrepresented entities. This research line also aims to understand whether LLMs produce different outputs when dealing with well-known versus less common entities, and what implications this may have in everyday life.


Pushing the Boundaries of Fixed Budget Best Arm Identification

CCC Seminar by Luigi Sauro, Professor of Computer Science at Università degli Studi di Napoli “Federico II”.

Abstract: Suppose you have a finite set of stochastic mechanisms (arms), each of which returns, at every interaction, a reward drawn from a distribution unknown to you. The goal of Fixed-Budget Best Arm Identification is to identify, using a fixed number of sequential interactions, the arm with the highest expected reward. This problem gives rise to a typical exploration vs. exploitation dilemma found in many application settings (clinical trials, wireless network selection, recommender systems, A/B testing). In this seminar I will present a new interaction strategy that applies under the assumption that the arms are governed by sub-Gaussian distributions and that makes use of non-linear convex optimization techniques. I will also show a theoretical upper bound that highlights why this strategy improves on the state of the art, along with an experimental analysis that supports its effectiveness.

Bio: https://www.docenti.unina.it/teacher/4c55494749534155524f5352414c475537345432394638333941/profile/references

When: June 5th, 15:00
Where: Sala Riunioni, 1st Floor


What the harm? Gender Bias and Inclusive Language in Multilingual Translation Technologies.

Seminar by Beatrice Savoldi, researcher at Fondazione Bruno Kessler (FBK) within the MT research unit.

Abstract
Societal gender asymmetries, inequalities and stereotypes can be embedded in our communication practices and perpetuated in language technologies, including multilingual and Machine Translation (MT) systems used at scale. In this presentation, we will delve into the current landscape of MT and multilingual gender bias, as well as current proposals towards more inclusive language. We will discuss the challenges and opportunities (theoretical, technical, and linguistic) in fostering a more equitable technology for different groups of users.

Bio
Beatrice Savoldi is a researcher at Fondazione Bruno Kessler (FBK) within the MT research unit, where she mainly works on gender inclusive and human-centered translation technologies.
Beatrice carried out a joint international PhD at the Universities of Trento and Augsburg with a dissertation on gender bias in Machine Translation, which was awarded the 2023 best thesis Research Prize from the Augsburg University Foundation.
Her research interests broadly encompass ethical and social considerations of (multilingual) language technologies.

When: 29/05/2026, 11:00
Where: Aula 3.06 Thin Client (3rd floor) – Via Sant’Ottavio 54, Torino.


HurtLens: A Perspectivist Corpus Analysis of Hurtful Language


Samuele D’Avenia and Eliana di Palma will be presenting one of their latest works.

Abstract:
Offensive language detection systems often rely on majority-aggregated annotations, overlooking the diversity of perspectives that shape how different communities perceive harm. In this contribution, we introduce HurtLens, a perspectivist corpus of hurtful language leveraging four disaggregated datasets that are automatically enriched through HurtLex lemmas, a multilingual resource of offensive and derogatory terms. Using mixed-effects modeling, we investigate how annotators’ sociodemographic backgrounds, the presence of specific types of offensive language (through HurtLex categories), and their interaction influence offensiveness ratings. Our analysis reveals that offensiveness ratings are influenced both by annotators’ sociodemographic characteristics (particularly when considering them in intersection) and by the presence of specific types of offensive language. Additionally, we identify significant interaction effects showing that different demographic groups vary in their sensitivity to texts containing particular types of offensive language.

When: Friday, 15/05 at 11:30
Where: Sala Conferenze, 3rd Floor


Narrative Spread and Audience Targeting in Social Media: Evidence from Multiple Domains

CCC Seminar by Jussara M. Almeida, Full Professor of Computer Science at the University of Minas Gerais, Brazil.

Abstract:
This talk examines how information spreads, clusters, and targets specific audiences across social media platforms. Drawing on three empirical studies, I analyze (i) the structure and discourse of misogynistic communities in Brazil, (ii) content directed at children on Instagram, and (iii) the emergence and alignment of political narratives in large-scale Telegram conversations during the 2024 U.S. elections. Together, these findings highlight relevant patterns underlying information diffusion across platforms and raise important questions about exposure, influence, and the amplification of targeted content in online ecosystems.

Short bio:
Jussara M. Almeida is a Full Professor of Computer Science at the University of Minas Gerais, Brazil, where she currently leads the Laboratory of Social Computing at the Department of Computer Science. She holds a PhD from the University of Wisconsin-Madison (US) and is a former affiliate member of the Brazilian Academy of Sciences. Her main research interests are social computing, user behavior analysis, and performance analysis and modeling of large-scale distributed systems.

When: 14/04/2026, 15:00-16:00
Where: Sala Riunioni, 3rd Floor


Multilingual Knowledge Graphs and NLP: Bridging Text and Structured Semantics for Scalable, Inclusive AI

CCC Seminar by Virginia Ramón-Ferrer, PhD student and researcher in the Ontology Engineering Group from the Artificial Intelligence (AI) Department at the Universidad Politécnica de Madrid.

Abstract:
The presentation introduces the research activities of the Ontology Engineering Group, outlining its core areas (ontologies, knowledge graphs, NLP/NLG, and open science) and focusing on work in multilingual methods. In particular, it presents contributions on bridging structured data and text through data-to-text generation, multilingual benchmarks (such as Spanish WebNLG), and studies on multilingual and code-switched information retrieval, aiming to support more inclusive AI across languages.

Short Bio:
Virginia Ramón-Ferrer is a PhD student and researcher in the Ontology Engineering Group from the Artificial Intelligence (AI) Department at the Universidad Politécnica de Madrid, with a background in Computer Engineering, specifically in Computer Vision (CV) and Natural Language Processing (NLP). Her current work focuses on multilingual NLP, with an emphasis on multilingual data-to-text generation and Information Retrieval (IR). In particular, she works at the intersection of structured data and text for IR, exploring how structured representations can better support retrieval and generation in multilingual settings.

When: 24/03/2026, h 14:30-16:00
Where: Sala Conferenze, 3rd Floor


The Duality of Social Media Discourse: Characterizing Polluted and Supportive Online Behaviors

CCC Seminar by Virginia Morini, postdoc in the CS Department at the University of Pisa (Italy).

Abstract:
Social media platforms have drastically changed how people interact, share information, and form relationships online, generating massive amounts of behavioral data. In this talk, I will present research examining how homophilic mechanisms – the tendency to interact with similar others – can produce radically different outcomes in online spaces. Through data-driven case studies on Reddit and X/Twitter, I employ a multidisciplinary approach combining network science and natural language processing with psychosociological insights to investigate both potentially harmful environments where cognitive biases are exacerbated and beneficial environments where users provide mutual support. My research characterizes the emergent phenomena in these contrasting spaces, examines the underlying user behaviors and group dynamics, and measures their effects across different platforms. The results demonstrate how online spaces can simultaneously foster problematic phenomena like echo chambers in sociopolitical discussions, while enabling supportive communities around mental health issues. I will highlight how community norms and interaction patterns, rather than platform architecture alone, play a crucial role in determining these divergent outcomes. The presentation will also introduce practical, open-source tools for studying online social phenomena while ensuring reproducibility and privacy protection.

When: 10/03/2026, h11:00
Where: Sala Conferenze, 3rd Floor


Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives

CCC Seminar by Janet Yu Wang, PhD, visiting for 3 months from The Hong Kong Polytechnic University.

Abstract
Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like “this/that” in English and “这/那” in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal–distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal–distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) demonstratives as a new lens for evaluating embodied cognition and cultural conventions, (ii) empirical evidence of cross-cultural asymmetries in human interpretation, (iii) a new perspective on the egocentric–sociocentric debate, showing both orientations coexist but vary across languages, and (iv) a call to address individual variation in future model design.

When: 03/03/2026, h14:00
Where: Sala Conferenze, 3rd Floor


LLM Beliefs Are in Their Heads

CCC Seminar by Alessandro Corona Mendozza, predoc researcher at the Center for Language Technology (University of Copenhagen) and visiting researcher at the University of Turin.

Abstract:
We investigate belief-like representations in decoder-only autoregressive LLMs using linear controlled probes on residual stream activations and single attention heads. Following Herrmann and Levinstein’s (2025) criteria (Accuracy, Use, Coherence, and Uniformity), we find that large models exhibit strong truth sensitivity (Accuracy), and that steering activations along probe directions reliably changes downstream behavior (Use). Coherence, measured via calibrated probes and cross-dataset probing, is moderate across models, while training on diverse data yields domain-consistent truth directions (Uniformity). The results are particularly encouraging at the head level and align with some standard philosophical accounts of belief, e.g., minimal functionalism, supporting the view that LLMs can maintain propositional attitudes under such theoretical frameworks.

Short bio:
Alessandro Corona Mendozza is a predoc researcher working at the intersection of LLM interpretability, AI epistemology and philosophy of mind/language. He is currently assisting in research for an eye-tracking project at the Center for Language Technology (University of Copenhagen) and is visiting the University of Turin.

When: 18/02/2026, h 14:00
Where: Sala Conferenze, 3rd Floor


Evaluation Under Variation: References, Annotators, and Languages

CCC Seminar by Silvia Casola, postdoc researcher at the MaiNLP group of the Ludwig Maximilian University of Munich and Munich Center for Machine Learning.

Abstract:
Automatic evaluation in NLP often assumes a single ground truth, such as a reference or a gold label. However, language is inherently variable: multiple outputs can be valid, annotators frequently disagree, and metric behaviours can differ across languages. In this talk, I will present three case studies showing how evaluation can fail and how it can be improved under such variation. Focusing on NLG, I will show that metrics can be highly sensitive to the choice of reference, leading to large changes in system rankings. I will then examine classification evaluation under annotator disagreement and present an approach for accounting for systematic disagreement. Finally, I will discuss recent work on steering multilingual neural metrics to improve their correlation with humans.
Starting from these failure modes, the talk shows how studying and modeling variation in references, annotations, and languages can improve the stability and reliability of automatic evaluation.

Short Bio:
I am a postdoctoral researcher in the MaiNLP group at Ludwig Maximilian University of Munich and the Munich Center for Machine Learning, supervised by Barbara Plank. I was recently awarded a Marie Skłodowska-Curie Postdoctoral Fellowship for my project GenEval, to be hosted at Universitat Pompeu Fabra (UPF), which investigates the relationship between generation and evaluation in Large Language Models. Previously, I was a postdoctoral researcher at the University of Turin, where I worked on perspective-aware NLP. I completed my PhD at the University of Padua and Fondazione Bruno Kessler, focusing on natural language generation. During my PhD, I was a visiting researcher at UPF and interned with Spotify and Huawei Research. My research interests lie in NLP, with a focus on natural language generation and evaluation.

When: 31/03/2026, h 15:00-16:00
Where: Sala Conferenze, 3rd Floor