Author: Samuele D'Avenia

What’s going on with Statistics in NLP

Post author By Samuele D'Avenia
Post date 07/06/2026

Samuele D’Avenia will provide a brief overview of statistical methods commonly used in NLP, with a focus on common errors.

Abstract
Statistical significance testing is widely used in Natural Language Processing, yet its application is often incomplete or misleading.
This talk provides a high-level overview of common statistical tools employed in NLP research and highlights several recurring issues that can compromise the interpretation of experimental results.

We will mainly focus on hypothesis testing, highlighting the multiple-comparisons problem and large-sample issues, while also discussing current practices.

Additionally, we will provide a brief overview of the current practices within the Italian NLP community.
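
To make the multiple-comparisons issue mentioned in the abstract concrete, here is a minimal sketch that is not taken from the talk. It compares uncorrected testing with two standard corrections, Bonferroni and Holm-Bonferroni. The p-values are invented for illustration and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (m = number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure: True where H0 is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens at each step: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first hypothesis that is not rejected
    return reject


# Illustrative p-values (made up), e.g. one system compared on five test sets.
p_values = [0.012, 0.040, 0.030, 0.004, 0.200]

uncorrected = [p <= 0.05 for p in p_values]
print("uncorrected:", sum(uncorrected), "rejections")
print("Bonferroni: ", sum(bonferroni(p_values)), "rejections")
print("Holm:       ", sum(holm_bonferroni(p_values)), "rejections")
```

With these invented values, four tests look significant without correction, while fewer survive once the number of comparisons is taken into account.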

When: Friday, 10th of July, 11:30
Where: Sala Conferenze, 3rd floor

Meetings

Advocacy Radar: Monitoring Advocacy for Digital Rights

Post author By Samuele D'Avenia
Post date 06/12/2026

Giulio Corradi (Privacy Network) and Marta Marchiori Manerba (University of Turin) will present one of their latest works.

Abstract:
In this talk we present Advocacy Radar, a platform that allows civil society practitioners, journalists, and policy researchers to monitor the state of digital rights advocacy in Italy and Europe from an interactive interface.
Advocacy Radar adapts NLP methods to collect and process heterogeneous public sources – including regulatory authorities, NGOs, and technology media – spanning both Italian and multilingual documents.
Analyses are presented through a dynamic multi-view dashboard. A human-in-the-loop correction mechanism in the interface allows domain experts to refine model outputs and incrementally improve the models over time.
In this seminar, we describe the system architecture, its main dashboard components, and representative use cases for different user profiles.

Remark: The project is developed in partnership with Privacy Network, an association committed to safeguarding digital rights in Italy and beyond.
The work is ongoing; a version of this paper has been submitted to CLiC-it 2026.
We welcome informal feedback and suggestions from the CCC audience: your expertise and perspective would be invaluable in shaping the next stages of development.

Short Bio:
Giulio Corradi is a Management Engineer based in Milan, working as a consultant at SDG Group. He holds an MSc in Management Engineering from the University of Bergamo, with a thesis on algorithmic accountability under the EU AI Act, developed in collaboration with the Department of Law.

His interests sit at the intersection of data-driven analysis, NLP, and business processes, with a particular focus on the social and political implications of technology. He is currently a Research Officer at Privacy Network, one of Italy’s leading digital rights organizations, where he researches automated surveillance systems and their impact on fundamental rights and democratic processes. He also writes for Scomodo, contributing to public discourse on technology, power, and society.

Marta Marchiori Manerba is a Postdoc at the Computer Science Department of the University of Turin, where she works on perspectivist approaches to dialogue modeling. She holds a Ph.D. in AI & Society from the University of Pisa, with a dissertation focused on fairness auditing through explainability, particularly in the context of abusive language detection.

When: Friday, 19th of June, 11:30
Where: Sala Riunioni, 1st floor

From Crowdsourcing to Large Language Models: Advancing Turkish Word Sense Disambiguation

Post author By Samuele D'Avenia
Post date 06/01/2026

The seminar is presented by Dr Dilara Torunoğlu Selamet, Lecturer in the Department of Computer Engineering at Istanbul Technical University (ITU), Türkiye.

Abstract:
Word Sense Disambiguation (WSD) is a fundamental Natural Language Processing (NLP) task that aims to identify the intended meaning of an ambiguous word in context. While substantial progress has been achieved for high-resource languages, WSD remains challenging for morphologically rich and low-resource languages such as Turkish due to the scarcity of large-scale annotated datasets.
In this talk, I will present my PhD research on Turkish Word Sense Disambiguation, which combines data-centric and model-centric approaches. First, I will introduce DodoMe, a large-scale gamified crowdsourcing platform developed to collect sense-annotated Turkish sentences. The resulting dataset contains more than 158,000 annotations covering 30 highly ambiguous Turkish words and represents one of the largest publicly available WSD resources for Turkish.
Second, I will discuss a systematic comparison of modern approaches to WSD, including contextual embedding-based methods, prompting-based inference with Large Language Models (LLMs), and instruction-based fine-tuning of open-source LLMs. The results demonstrate that recent LLMs substantially outperform traditional embedding-based approaches and that instruction tuning can further improve performance when sufficient high-quality annotated data is available.
Finally, I will discuss the broader implications of combining human computation, crowdsourcing, and large language models for developing semantic resources and language technologies for under-resourced languages.

Short Bio:
Dilara Torunoğlu Selamet is a Lecturer in the Department of Computer Engineering at Istanbul Technical University (ITU), Türkiye, and recently defended her PhD in Computer Engineering. She is a member of the ITU Natural Language Processing Research Group, and her research focuses on Natural Language Processing, lexical semantics, word sense disambiguation, meaning representation, and multilingual language technologies.
Her doctoral research investigates Turkish Word Sense Disambiguation through the integration of large-scale crowdsourced datasets and Large Language Models. She is the creator of DodoMe, a gamified crowdsourcing platform designed for collecting semantic annotations in Turkish. Her work explores contextual embeddings, prompting strategies, and instruction-based fine-tuning approaches for semantic disambiguation tasks.
Dilara is actively involved in international collaborations through the UniDive COST Action, contributing to multilingual and multimodal language technology initiatives. She serves as a language leader and coordinator in the AdMiRe (Advancing Multimodal Idiomaticity Representation) shared task series, which focuses on multilingual and multimodal idiomaticity understanding across dozens of languages.
Her recent work includes contributions to the AdMiRe shared tasks at EACL 2026 and LREC-COLING 2026, as well as research on Turkish Word Sense Disambiguation, Abstract Meaning Representation (AMR), Uniform Meaning Representation (UMR), and multilingual idiomaticity understanding. She has co-authored large-scale international publications involving researchers from more than 30 languages and actively contributes to the development of multilingual benchmarks, linguistic resources, and evaluation campaigns for Natural Language Processing.

When: 10/06/2026, h 11:00 am
Where: Sala Conferenze, third floor

Pushing the Boundaries of Fixed Budget Best Arm Identification

Post author By Samuele D'Avenia
Post date 05/25/2026

CCC Seminar by Luigi Sauro, Professor of Computer Science at Università degli Studi di Napoli “Federico II”.

Abstract: Suppose you have a finite set of stochastic mechanisms (arms), each of which returns a reward at every interaction according to a distribution unknown to you. The goal of Fixed-Budget Best Arm Identification is to identify, using a fixed number of sequential interactions, the arm with the highest expected reward. This problem gives rise to a typical exploration vs. exploitation dilemma found in many application settings (clinical trials, wireless network selection, recommender systems, A/B testing). In this seminar I will present a new interaction strategy that applies under the assumption that the arms are governed by sub-Gaussian distributions and that relies on non-linear convex optimization techniques. I will also show a theoretical upper bound that highlights why this strategy improves on the state of the art, and an experimental analysis that supports its effectiveness.
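
For readers new to the fixed-budget setting, the sketch below illustrates the problem with the classical Successive Rejects procedure from the bandit literature. It is not the strategy presented in the seminar. The Gaussian arms, their means, the noise level, and the budget are all assumptions chosen for illustration.

```python
import math
import random


def successive_rejects(arms, budget, rng):
    """Classical Successive Rejects baseline (not the seminar's method).

    arms:   list of callables; arms[i](rng) returns one stochastic reward
    budget: total number of pulls allowed (must exceed the number of arms)
    Returns (index of the recommended arm, number of pulls actually used).
    """
    k = len(arms)
    if budget <= k:
        raise ValueError("budget must be larger than the number of arms")
    log_bar = 0.5 + sum(1.0 / i for i in range(2, k + 1))
    active = list(range(k))
    sums = [0.0] * k
    counts = [0] * k
    n_prev = 0
    for phase in range(1, k):
        # Cumulative number of pulls each surviving arm has after this phase.
        n_phase = math.ceil((budget - k) / (log_bar * (k + 1 - phase)))
        for arm in active:
            for _ in range(n_phase - n_prev):
                sums[arm] += arms[arm](rng)
                counts[arm] += 1
        n_prev = n_phase
        # Discard the arm with the lowest empirical mean reward.
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0], sum(counts)


# Three hypothetical Gaussian (hence sub-Gaussian) arms; arm 2 is the best.
means = [0.1, 0.5, 0.9]
arms = [lambda rng, mu=mu: rng.gauss(mu, 0.1) for mu in means]

best, pulls = successive_rejects(arms, budget=300, rng=random.Random(0))
print("recommended arm:", best, "| pulls used:", pulls)
```

The procedure spends the budget in phases and drops one arm per phase, so the remaining pulls go to the arms that are still plausible candidates for the best one.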

Bio: https://www.docenti.unina.it/teacher/4c55494749534155524f5352414c475537345432394638333941/profile/references

When: June 5th, 15:00
Where: Sala Riunioni, 1st floor

What the harm? Gender Bias and Inclusive Language in Multilingual Translation Technologies.

Post author By Samuele D'Avenia
Post date 05/21/2026

Seminar by Beatrice Savoldi, researcher at Fondazione Bruno Kessler (FBK) within the MT research unit.

Abstract
Societal gender asymmetries, inequalities and stereotypes can be embedded in our communication practices and perpetuated in language technologies, including multilingual and Machine Translation (MT) systems used at scale. In this presentation, we will delve into the current landscape of MT and multilingual gender bias, as well as current proposals towards more inclusive language. We will discuss the challenges and opportunities (theoretical, technical, and linguistic) in fostering a more equitable technology for different groups of users.

Bio
Beatrice Savoldi is a researcher at Fondazione Bruno Kessler (FBK) within the MT research unit, where she mainly works on gender inclusive and human-centered translation technologies.
Beatrice carried out a joint international PhD at the Universities of Trento and Augsburg with a dissertation on gender bias in Machine Translation, which was awarded the 2023 best thesis Research Prize from the Augsburg University Foundation.
Her research interests broadly encompass ethical and social considerations of (multilingual) language technologies.

When: 29/05/2026 11.00
Where: Aula 3.06 Thin Client (3rd floor) – Via Sant’Ottavio 54, Torino.

HurtLens: A Perspectivist Corpus Analysis of Hurtful Language

Post author By Samuele D'Avenia
Post date 05/14/2026

Samuele D’Avenia and Eliana di Palma will present one of their latest works.

Abstract:
Offensive language detection systems often rely on majority-aggregated annotations, overlooking the diversity of perspectives that shape how different communities perceive harm. In this contribution, we introduce HurtLens, a perspectivist corpus of hurtful language that builds on four disaggregated datasets, automatically enriched with lemmas from HurtLex, a multilingual resource of offensive and derogatory terms. Using mixed-effects modeling, we investigate how annotators’ sociodemographic backgrounds, the presence of specific types of offensive language (through HurtLex categories) and their interaction influence offensiveness ratings. Our analysis reveals that offensiveness ratings are influenced both by annotators’ sociodemographic characteristics (particularly when considered in intersection) and by the presence of specific types of offensive language. Additionally, we identify significant interaction effects showing that different demographic groups vary in their sensitivity to texts containing particular types of offensive language.

When: Friday, 15/05 at 11:30
Where: Sala Conferenze, 3rd Floor

Narrative Spread and Audience Targeting in Social Media: Evidence from Multiple Domains

Post author By Samuele D'Avenia
Post date 03/30/2026

CCC Seminar by Jussara M. Almeida, Full Professor of Computer Science at the University of Minas Gerais, Brazil.

Abstract:
This talk examines how information spreads, clusters, and targets specific audiences across social media platforms. Drawing on three empirical studies, I analyze (i) the structure and discourse of misogynistic communities in Brazil, (ii) content directed at children on Instagram, and (iii) the emergence and alignment of political narratives in large-scale Telegram conversations during the 2024 U.S. elections. Together, these findings highlight relevant patterns underlying information diffusion across platforms and raise important questions about exposure, influence, and the amplification of targeted content in online ecosystems.

Short bio:
Jussara M. Almeida is Full Professor of Computer Science at the University of Minas Gerais, Brazil, where she currently leads the Laboratory of Social Computing at the Department of Computer Science. She holds a PhD from the University of Wisconsin-Madison (US) and is a former affiliate member of the Brazilian Academy of Sciences. Her main research interests are social computing, user behavior analysis, and performance analysis and modeling of large-scale distributed systems.

When: 14/04/2026, 15:00-16:00
Where: Sala Riunioni, 3rd Floor

The Duality of Social Media Discourse: Characterizing Polluted and Supportive Online Behaviors

Post author By Samuele D'Avenia
Post date 02/27/2026

CCC Seminar by Virginia Morini, Postdoc at the Computer Science Department of the University of Pisa (Italy).

Abstract:
Social media platforms have drastically changed how people interact, share information, and form relationships online, generating massive amounts of behavioral data. In this talk, I will present research examining how homophilic mechanisms – the tendency to interact with similar others – can produce radically different outcomes in online spaces. Through data-driven case studies on Reddit and X/Twitter, I employ a multidisciplinary approach combining network science and natural language processing with psychosociological insights to investigate both potentially harmful environments where cognitive biases are exacerbated and beneficial environments where users provide mutual support. My research characterizes the emergent phenomena in these contrasting spaces, examines the underlying user behaviors and group dynamics, and measures their effects across different platforms. The results demonstrate how online spaces can simultaneously foster problematic phenomena like echo chambers in sociopolitical discussions, while enabling supportive communities around mental health issues. I will highlight how community norms and interaction patterns, rather than platform architecture alone, play a crucial role in determining these divergent outcomes. The presentation will also introduce practical, open-source tools for studying online social phenomena while ensuring reproducibility and privacy protection.

When: 10/03/2026, h11:00
Where: Sala Conferenze, 3rd Floor

Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives

Post author By Samuele D'Avenia
Post date 02/27/2026

CCC Seminar by Janet Yu Wang, PhD, visiting for 3 months from The Hong Kong Polytechnic University.

Abstract
Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like “this/that” in English and “这/那” in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal–distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal–distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) demonstratives as a new lens for evaluating embodied cognition and cultural conventions, (ii) empirical evidence of cross-cultural asymmetries in human interpretation, (iii) a new perspective on the egocentric–sociocentric debate, showing both orientations coexist but vary across languages, and (iv) a call to address individual variation in future model design.

When: 03/03/2026, h14:00
Where: Sala Conferenze, 3rd Floor

LLM Beliefs Are in Their Heads

Post author By Samuele D'Avenia
Post date 02/11/2026

CCC Seminar by Alessandro Corona Mendozza, predoc researcher at the Center for Language Technology (Copenhagen University), currently visiting the University of Turin.

Abstract:
We investigate belief-like representations in decoder-only autoregressive LLMs using linear controlled probes on residual stream activations and single attention heads. Following Herrmann and Levinstein’s (2025) criteria (Accuracy, Use, Coherence, and Uniformity), we find that large models exhibit strong truth sensitivity (Accuracy), and that steering activations along probe directions reliably changes downstream behavior (Use). Coherence, measured via calibrated probes and cross-dataset probing, is moderate across models, while training on diverse data yields domain-consistent truth directions (Uniformity). The results are particularly encouraging at the head level and align with some standard philosophical accounts of belief, e.g., minimal functionalism, supporting the view that LLMs can maintain propositional attitudes under such theoretical frameworks.

Short bio:
Alessandro Corona Mendozza is a predoc researcher working at the intersection of LLM interpretability, AI epistemology and philosophy of mind/language. He is currently assisting in research for an eye-tracking project at the Center for Language Technology (Copenhagen University) and is visiting the University of Turin.

When: 18/02/2026, h 14:00
Where: Sala Conferenze, 3rd Floor