Categories
Meetings

Whose Opinions Matter?

Perspective-aware Models to Identify Opinions of Hate Speech Victims in Abusive Language Detection.

Sohail Akhtar will present an in-depth study of novel approaches to hate speech detection, focusing on methods that leverage fine-grained knowledge derived from the annotations of individual annotators.

Title: Whose Opinions Matter? Perspective-aware Models to Identify Opinions of Hate Speech Victims in Abusive Language Detection.

Hate Speech (HS) is a form of abusive language, and its detection on social media platforms is a difficult but important task. The sudden rise in hate speech related incidents on social media is considered a major issue. The technologies being developed for HS detection mainly employ supervised machine learning approaches in Natural Language Processing (NLP). Training such models requires data manually annotated by humans, either crowd-sourced paid workers or domain experts, for training and benchmarking purposes.

Because abusive language is subjective in nature, the annotation of abusive content such as HS may involve highly polarizing topics or events. Therefore, novel approaches are required to model the conflicting perspectives and opinions of people with different personal and demographic backgrounds. These conflicts raise issues concerning the quality of the annotation itself and might also affect the gold standard data used to train NLP models. Annotators might also show different levels of sensitivity to particular forms of hate, which results in low inter-annotator agreement. The online platforms used for HS annotation do not provide any background information about the annotators, and the views and personal opinions of the victims of online hate are often ignored in HS detection tasks.

In this talk, he will present an in-depth study of novel approaches to detecting various forms of abusive language against minorities. The work focuses on developing approaches that leverage fine-grained knowledge derived from the annotations of individual annotators, before a gold standard is created in which the subjectivity of the annotators is averaged out.

The research aimed at developing approaches to model the polarized opinions coming from different communities, under the hypothesis that similar characteristics (ethnicity, social background, culture, etc.) can influence the perspectives of the annotators on a certain phenomenon and that, based on such information, annotators can be grouped together.

The intuition is that by relying on such information, it is possible to divide the annotators into separate groups. Based on this grouping, separate gold standards are created for each group and used to train state-of-the-art deep learning models for abusive language detection. Additionally, an ensemble approach is implemented to combine the perspective-aware classifiers from different groups into an inclusive model.
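The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the talk: the annotator names, the toy labels, the tie-breaking rule and the "flag if any group flags" combination rule are all hypothetical choices.

```python
from collections import Counter

def majority_label(votes):
    """Most common label among votes; ties resolve to the positive class (assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    return max(label for label, count in counts.items() if count == top)

def group_gold_standards(annotations, groups):
    """Build one gold standard per annotator group.

    annotations: {item_id: {annotator: label}}, labels are 0 (not abusive) or 1 (abusive)
    groups:      {group_name: [annotator, ...]}
    """
    gold = {}
    for group, members in groups.items():
        gold[group] = {
            item: majority_label([labels[a] for a in members if a in labels])
            for item, labels in annotations.items()
            if any(a in labels for a in members)
        }
    return gold

def inclusive_ensemble(group_predictions):
    """Flag an item if any group-specific classifier flags it (one possible rule)."""
    return int(any(group_predictions.values()))

# Toy data: two groups that disagree on item "t1".
annotations = {
    "t1": {"a1": 1, "a2": 1, "b1": 0, "b2": 0},
    "t2": {"a1": 0, "a2": 0, "b1": 0, "b2": 0},
}
groups = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
gold = group_gold_standards(annotations, groups)
```

Each per-group gold standard would then be used to train its own classifier, and `inclusive_ensemble` would combine their predictions for a new message.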

The research proposed a novel resource, a multi-perspective English language dataset annotated according to different sub-categories relevant for characterizing online abuse: HS, aggressiveness, offensiveness and stereotype. Unlike previous work, where the annotations were based on crowd-sourcing, this study involved victims from the targeted communities in the annotation process. They volunteered to annotate the dataset, providing a natural selection of the annotator groups based on their personal characteristics. The annotators come from different cultural, social and demographic backgrounds, and one of the groups consists of members of the targeted communities.

Experiments training state-of-the-art deep learning models on this novel resource showed that the proposed approach improves the prediction performance of a state-of-the-art supervised classifier.

Moreover, an in-depth qualitative analysis of the novel dataset examines individual tweets to identify and understand the topics and events causing polarization among the annotators. The analysis showed that the keywords (unigram features) are indeed strongly linked with and influenced by the culture, religion and demographic background of the annotators.

When: On 2nd July at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7_20210702T093000Z?from_login=true

Weights & Biases

Mattia Cerrato presents a tutorial on the Weights & Biases platform, which helps keep track of results, hyperparameters and random seeds in ML experiments.

Title: Experiment tracking with Weights & Biases

Performing experiments is perhaps the most time-consuming activity in ML research, especially at the junior level. Often too little effort is spent on understanding how to optimize this process. The Weights & Biases (W&B) platform provides a simple Python interface which may be used to keep track of results, hyperparameters and random seeds. It has intuitive visualization utilities which may be used to write experimental reports starting from raw performance metric data. Furthermore, it provides an easy way to perform hyperparameter search (random, grid and even Bayesian search strategies are available) and even some light training orchestration capabilities. In this talk, we will see how to extend our experimental scripts so that W&B can help us keep our sanity during the experimental phase of a project.
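As a rough idea of what "extending an experimental script" can look like, here is a minimal sketch, not material from the tutorial itself. The training loop is a stand-in, and the project name, the config values and the choice of offline mode are assumptions made for illustration. The `wandb` package must be installed for `main()` to run.

```python
import random

def run_experiment(config, log):
    """Toy training loop; `log` receives one metrics dict per epoch."""
    rng = random.Random(config["seed"])  # seeded, so the run is reproducible
    loss = 1.0
    for epoch in range(config["epochs"]):
        # Stand-in for a real training step: the loss decays with some noise.
        loss *= 1.0 - config["learning_rate"] * rng.uniform(0.5, 1.0)
        log({"epoch": epoch, "loss": loss})
    return loss

def main():
    """Log the toy run to W&B (requires `pip install wandb`)."""
    import wandb

    # Hypothetical hyperparameters; the seed is tracked alongside them.
    config = {"seed": 13, "learning_rate": 0.1, "epochs": 5}
    # mode="offline" stores the run locally instead of syncing it to the server.
    run = wandb.init(project="reading-group-demo", config=config, mode="offline")
    run_experiment(config, wandb.log)
    run.finish()
```

Keeping the logging call as a parameter means the same loop can be exercised without W&B, for instance by passing `records.append` for a local list `records`.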

When: On 4th June at 11.30

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7_20210604T093000Z

Topic Shift in online debates on Twitter

Komal Florio presents an investigation of topic shift in the public discourse on Social Media.

Title: Topic shift in the public discourse on Social Media: a case study about the covid-19 induced lockdown in Italy in 2020

In this work she tackled the challenge of measuring and quantifying the topic shift in the public discourse on Social Media, using as a case study the online debate on Twitter following the covid-19 related lockdown in Italy in 2020, by means of a dedicated filtering of TWITA, a dataset of tweets in Italian.

At first she tried to predict which messages contained hate speech using AlBERTo, a BERT model fine-tuned on Italian social media language, but the results were far from satisfying. She then tried a lexicon-based approach and found that the dominant categories were derogatory words, insults regarding moral or behavioural defects, and cognitive disabilities or diversity. Nevertheless, the accuracy of this classification was not very high. An analysis of the lexicon words that determined the classification for the top 3 categories suggests that a manual revision of the word list for each category could improve the outcome of this task.
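A lexicon-based pass of this kind can be sketched as follows. The category names and the English word lists are hypothetical placeholders (the actual lexicon and the Italian data from the talk are not reproduced here), and the matching is a plain token lookup.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: category -> set of trigger words.
LEXICON = {
    "derogatory": {"scum", "trash"},
    "moral_defects": {"liar", "coward"},
    "cognitive": {"stupid", "idiot"},
}

def categorize(text, lexicon=LEXICON):
    """Count how many tokens of `text` fall into each lexicon category."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return counts

def dominant_categories(texts, lexicon=LEXICON):
    """Rank categories by total hits over a collection of messages."""
    totals = Counter()
    for text in texts:
        totals.update(categorize(text, lexicon))
    return totals.most_common()
```

Because every hit comes from a single word in the list, inspecting which words drive the top categories is straightforward, which is also where a manual revision of the lists would start.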

She then moved to the most powerful classification tool that was used on these data: topic modeling. A first classification with a Latent Dirichlet Allocation (LDA) algorithm proved valid in extracting the conversation around specific relevant events that happened in Italy between February 2020 and April 2020. To obtain consistent topics over time she then moved to Dynamic Topic Modeling, which extracted “healthcare” and “quarantine” as consistently the predominant topics in the corpus. She analyzed the peaks in documents related to this topic and to the mentioned lexicon categories and found that they happened around the same time slices where the topics “quarantine” and “healthcare” spike as well, showing that the most heated debates happened around public measures that directly and immediately affected both the community (“healthcare”) and personal life (“quarantine”).

She then tried to use all the information gained so far to enhance the hate speech prediction performed by means of AlBERTo. Unfortunately, this experiment did not lead to significant results due to the very small size of the resulting training dataset. Infusing a deep learning model with information extracted from topic modeling certainly sounds like a promising way to enhance the accuracy of hate speech prediction, but she feels that further investigation into the size and characteristics of the datasets is essential to obtain better results.

When: On 21st May at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7_20210521T093000Z

The Octopus Paper

Valerio Basile presents a discussion of the difference between form and meaning of language in neural language models.

Title: Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

The success of the large neural language models on many NLP tasks is exciting. However, these successes sometimes lead to hype in which these models are being described as “understanding” language or capturing “meaning”. In this position paper it is argued that a system trained only on form has a priori no way to learn meaning, and that a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.

Emily M. Bender and Alexander Koller make their point through an incredibly witty story involving a very curious sea creature and a couple of castaways on bear-ridden tropical islands.

Related Paper: Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

When: On 7th May at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7

How to avoid “Sorry, I don’t understand. Can you repeat please?” in a dialogue system

In this talk, Alessandro Mazzei will present the results of a project developed with TIM about the improvement of a dialogue system in the domain of customer service for TELCO. The idea is to compensate for the lack of linguistic information by predicting the intentions of the humans on the basis of domain knowledge.

When: On 23rd April at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7_20210423T093000Z?from_login=true

LessLex

Davide Colla presents a novel multilingual lexical resource called LessLex.

Title: LessLex: Linking Multilingual Embeddings to SenSe Representations of LEXical Items

LessLex is a novel multilingual lexical resource. Unlike the vast majority of existing approaches, it grounds the embeddings on a sense inventory made available by the BabelNet semantic network. In this setting, multilingual access is governed by the mapping of terms onto their underlying sense descriptions, such that all vectors co-exist in the same semantic space. As a result, for each term there is the “blended” terminological vector along with the vectors describing all senses associated with that term. LessLex has been tested on three tasks relevant to lexical semantics: conceptual similarity, contextual similarity, and semantic text similarity. He experimented over the principal data sets for such tasks in their multilingual and crosslingual variants, improving on or closely approaching state-of-the-art results. He concludes by arguing that LessLex vectors may be relevant for practical applications and for research on conceptual and lexical access and competence.
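To make the term-versus-sense distinction concrete, the sketch below stores a blended term vector plus one vector per sense and compares two terms through their closest pair of senses. The three-dimensional vectors, the sense identifiers and the "closest pair" rule are toy assumptions for illustration; they are not LessLex data, nor necessarily the similarity measure used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy entries: one blended term vector plus one vector per sense.
VECTORS = {
    "bank": {
        "term": [0.5, 0.5, 0.0],
        "senses": {"bank#finance": [1.0, 0.0, 0.0], "bank#river": [0.0, 1.0, 0.0]},
    },
    "money": {
        "term": [0.9, 0.1, 0.0],
        "senses": {"money#currency": [0.9, 0.1, 0.0]},
    },
    "shore": {
        "term": [0.1, 0.9, 0.0],
        "senses": {"shore#land": [0.1, 0.9, 0.0]},
    },
}

def closest_senses(term_a, term_b, vectors=VECTORS):
    """Return (similarity, sense_a, sense_b) for the most similar pair of senses."""
    return max(
        (cosine(vec_a, vec_b), sense_a, sense_b)
        for sense_a, vec_a in vectors[term_a]["senses"].items()
        for sense_b, vec_b in vectors[term_b]["senses"].items()
    )

def term_similarity(term_a, term_b, vectors=VECTORS):
    """Similarity of the blended term vectors, ignoring individual senses."""
    return cosine(vectors[term_a]["term"], vectors[term_b]["term"])
```

In this toy setup, "bank" matches "money" through its financial sense and "shore" through its river sense, while the blended vector sits between the two.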

Related Paper: LessLex: Linking Multilingual Embeddings to SenSe Representations of LEXical Items 

When: On 26th March at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7

NLP for Music Information Retrieval

Michael Kurt Fell presents an interesting analysis of Lyrics Structure and Content.

Title: Natural Language Processing for Music Information Retrieval: Deep Analysis of Lyrics Structure and Content

Applications in Music Information Retrieval and Computational Musicology have traditionally relied on features extracted from the music content in the form of audio, but have mostly ignored the song lyrics. More recently, improvements in fields such as music recommendation have been made by taking into account external metadata related to the song. In this talk, he will demonstrate that extracting knowledge from the song lyrics is the next step to improve the user’s experience when interacting with music. To extract knowledge from vast amounts of song lyrics, he will show for different textual aspects (their structure, content, and perception) how Natural Language Processing (NLP) methods can be adapted and successfully applied to lyrics. For the structural aspect, a structural description of the lyrics is obtained by introducing a model that efficiently segments them into their characteristic parts (e.g. intro, verse, chorus). In a second stage, the content of lyrics is represented by summarizing them in a way that respects the characteristic lyrics structure. Finally, on the perception of lyrics, he addresses the problem of detecting explicit content in a song text. This task proves to be very hard, and he will show that the difficulty partially arises from the subjective nature of perceiving lyrics in one way or another depending on the context. As a consequence of this work, he has also created the annotated WASABI Song Corpus, a dataset of two million songs with NLP lyrics annotations on various levels.
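One simple signal for lyrics structure is repetition: lines that recur almost verbatim often belong to a chorus. The sketch below builds a line-by-line self-similarity matrix with Python's standard `difflib` and lists the repeated lines. It is a toy heuristic offered as an assumption for illustration, not the segmentation model presented in the talk, and the 0.9 threshold is arbitrary.

```python
from difflib import SequenceMatcher

def self_similarity(lines):
    """Square matrix of string similarities (0..1) between all pairs of lines."""
    return [
        [SequenceMatcher(None, a.lower(), b.lower()).ratio() for b in lines]
        for a in lines
    ]

def repeated_lines(lines, threshold=0.9):
    """Indices of lines that closely match at least one other line."""
    matrix = self_similarity(lines)
    return [
        i
        for i, row in enumerate(matrix)
        if any(j != i and score >= threshold for j, score in enumerate(row))
    ]

# Invented lyrics, used only to exercise the functions.
lyrics = [
    "Walking down the empty road",
    "Sing it loud, sing it clear",
    "Counting every step I take",
    "Sing it loud, sing it clear",
]
chorus_candidates = repeated_lines(lyrics)
```

A real segmenter would need more than exact repetition, since verses share structure without sharing words, but the matrix is a convenient starting representation.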

Related Work: Michael Fell. Natural Language Processing for Music Information Retrieval: Deep Analysis of Lyrics Structure and Content. Computation and Language [cs.CL]. Université Côte D’Azur, 2020.

When: On 26th February at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7

The Ontology of Migrant Writers

Marco Antonio Stranisci presents a new Computational Ontology of Migrant Writers.

Title: The Ontology of Migrant Writers

Narratives have become a pervasive and multifaceted presence in social media. Within these communicative contexts, journalists and other influential people use them to frame specific and often conflicting points of view on the world. Correspondingly, users are an active part of this creative process because they interact and redefine narratives through their sentiment on specific topics.

However, social media are often affected by stereotypical narratives that increase the level of aggressiveness and verbal violence online, often at the expense of people vulnerable to discrimination. Many of these narratives are mainstream and strongly related to the spreading of Hate Speech (HS). Unfortunately, similar stereotypes are also present in positive narratives, which in several cases depict people vulnerable to HS exclusively as victims. Instead, stories directly created by minorities have poor visibility in the public debate even if the social web hosts a lot of them.

In order to reduce this underrepresentation, a computational ontology of migrant writers has been developed. This resource is aimed at representing people who created literary works and are or have been migrants during their life. It will be used to collect, organize, and make publicly available knowledge about migrant writers and their narratives. The ontology design focused on two research questions:

  • how to model the concept of migrant;
  • how to represent biographical events in their temporal succession.

In the presentation, he will first introduce the backbone ontology of migrant writers, highlighting the most challenging aspects he faced during its development. Then, he will show a series of data collection strategies he implemented to gather content from Wikidata, DBpedia, and Wikipedia.

When: On 12th February, 2021 at 11.30 am

Where: https://unito.webex.com/webappng/sites/unito/meeting/info/910eaf7ad0534d1ba92c5dde0a66a9a7_20210212T103000Z

Zero-Shot Cross-Lingual Hate Speech Detection

Endang Wahyu Pamungkas presents new experiments and challenges in Hate Speech Detection in a multi-lingual context.

Title: Zero-Shot Cross-Lingual Hate Speech Detection

Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative use of language. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting.

In this talk, he will present ongoing work on hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. He will present experiments with traditional and recent neural architectures, and propose two joint-learning models, using different multilingual language representations to transfer knowledge between pairs of languages. The results of the experiments highlight a number of challenges and issues in this particular task.

One of the main challenges is related to the issue of current benchmarks for hate speech detection, in particular how bias related to the topical focus in the datasets influences the classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages in the specific hate speech detection task also remains an open problem. However, the experimental evaluation and the qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.

When: On 29th January, 2021 at 11.30 am

VALICO-UD

Elisa Di Nuovo presents VALICO-UD, a new resource for NLP.

Title: VALICO-UD, an Italian Learner Treebank in Universal Dependencies for NLP tasks

In this talk, a novel parallel treebank made of texts written by learners of Italian and their grammatically corrected versions will be presented. The treebank is annotated according to the Universal Dependencies formalism and is composed of a silver standard (automatically parsed) and a core gold standard which was manually corrected and error-annotated. In addition, the evaluation of three different UDPipe models will be presented, measuring also the impact of gold tokenisation and PoS tagging. To conclude, its applications and annotation choices will be discussed.
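Parser evaluations of this kind are commonly reported as unlabeled and labeled attachment scores (UAS and LAS). The sketch below computes both from two CoNLL-U strings, assuming gold tokenisation so that gold and predicted words align one to one. It is an illustrative simplification, not the evaluation script used for VALICO-UD, and the function names are made up for this example.

```python
def read_arcs(conllu_text):
    """Return a (HEAD, DEPREL) pair for every word line of a CoNLL-U string."""
    arcs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        arcs.append((cols[6], cols[7]))  # 7th and 8th columns: HEAD, DEPREL
    return arcs

def attachment_scores(gold_text, predicted_text):
    """Compute (UAS, LAS) as fractions, assuming identical tokenisation."""
    gold, predicted = read_arcs(gold_text), read_arcs(predicted_text)
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same, non-zero number of words")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / len(gold), correct_arcs / len(gold)
```

UAS counts a word as correct when its head matches; LAS additionally requires the dependency label to match, so LAS can never exceed UAS.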

Paper: Towards an Italian Learner Treebank in Universal Dependencies

When: On 15th January, 2021 at 11.30 am