Research

My current research is organized around three main axes: LLM safety, model specialization, and social and cultural robustness. On the safety side, I am interested in harmful or uncontrolled model behaviors, including the detection of LLM-generated content, the identification of manipulated pre-training data, the study of backdoored models, and, more broadly, questions of interpretability and control.

A second line of my work focuses on the specialization of language models, through post-training, alignment, and instruction-tuning methods, with applications to domains such as finance, education, fact-checking, and code generation.

Finally, I work on the social and cultural robustness of NLP systems, especially in the hate speech and radicalization domain, with a particular interest in representation biases, fairness in evaluation, and methods for making models more reliable across different linguistic, social, and cultural contexts.

Overall, these three directions share a common thread: how to build language models that are safer, more useful in practice, and better able to handle the diversity of real-world data and users. Doing so requires building new models, new datasets, and new ways of investigating their behavior.

One natural next step is to use our backdoored models as controlled playgrounds for interpretability, and in particular to investigate whether sparse autoencoders can help us better isolate and understand the internal mechanisms behind triggered malicious behaviors.
Good news: we finally finished training SAEs for GAPeron, our backdoored model. A few papers should follow soon; stay tuned!
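
For readers unfamiliar with the technique, here is a rough idea of what such a sparse autoencoder looks like. This is only an illustrative sketch, not the actual setup used for GAPeron: the layer sizes, sparsity coefficient, and random stand-in activations below are placeholder choices, and in practice the inputs would be activations collected from a chosen layer of the backdoored model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over transformer activations.

    Activations are encoded into a wider, mostly-zero latent vector and
    decoded back, with an L1 penalty pushing the latents toward sparsity.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)           # reconstruction of the input
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction error plus L1 sparsity penalty on the latent features.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coeff * sparsity

if __name__ == "__main__":
    d_model, d_hidden = 768, 8 * 768      # illustrative sizes, not the real config
    sae = SparseAutoencoder(d_model, d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    acts = torch.randn(64, d_model)       # stand-in for residual-stream activations
    x_hat, z = sae(acts)
    loss = sae_loss(acts, x_hat, z)
    loss.backward()
    opt.step()
```

The hope behind this kind of setup is that individual latent dimensions end up corresponding to interpretable features, so that trigger-related mechanisms show up as a small set of features that activate only on backdoored inputs.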