CCC Seminar by Silvia Casola, postdoctoral researcher in the MaiNLP group at the Ludwig Maximilian University of Munich and the Munich Center for Machine Learning.
Abstract:
Automatic evaluation in NLP often assumes a single ground truth, such as a reference or a gold label. However, language is inherently variable: multiple outputs can be valid, annotators frequently disagree, and metric behaviours can differ across languages. In this talk, I will present three case studies showing how evaluation can fail and how it can be improved under such variation. Focusing on NLG, I will show that metrics can be highly sensitive to the choice of reference, leading to large changes in system rankings. I will then examine classification evaluation under annotator disagreement and present an approach for accounting for systematic disagreement. Finally, I will discuss recent work on steering multilingual neural metrics to improve their correlation with humans.
Starting from these failure modes, the talk shows how studying and modelling variation in references, annotations, and languages can improve the stability and reliability of automatic evaluation.
Short Bio:
I am a postdoctoral researcher in the MaiNLP group at Ludwig Maximilian University of Munich and the Munich Center for Machine Learning, supervised by Barbara Plank. I was recently awarded a Marie Skłodowska-Curie Postdoctoral Fellowship for my project GenEval, to be hosted at Universitat Pompeu Fabra (UPF), which investigates the relationship between generation and evaluation in Large Language Models. Previously, I was a postdoctoral researcher at the University of Turin, where I worked on perspective-aware NLP. I completed my PhD at the University of Padua and Fondazione Bruno Kessler, focusing on natural language generation. During my PhD, I was a visiting researcher at UPF and interned with Spotify and Huawei Research. My research interests lie in NLP, with a focus on natural language generation and evaluation.
When: 31/03/2026, 15:00-16:00
Where: Sala Conferenze, 3rd Floor