Hello World
I’m Hannah Rose Kirk.

keywords = {
  Large Language Models; Online Safety; Bias Mitigation; Statistics; China AI;
  Oxford Internet Institute; UK AI Safety Institute; Oxford AI Society;
  Cambridge University; Peking University;
  Sci-Fi Books; Poke Bowls; Documentaries; Emoji 😺😸; Running (slowly!);
}

print MySummary

I currently research AI alignment as a PhD candidate at the University of Oxford and as a Research Scientist at the UK AI Safety Institute.

My current research centres on human-and-model-in-the-loop feedback and data-centric alignment of AI. I am passionate about the societal impact of AI systems as they scale across capabilities, domains and human populations.

My published work spans computational linguistics, economics, ethics and sociology, addressing issues such as alignment, bias, fairness and hate speech from a multidisciplinary perspective. Alongside academia, I often collaborate with industry and policymakers.

Education

.class GetDegrees

2021 - 2025

Oxford Internet Institute, University of Oxford

DPhil in Social Data Science
Fully-funded scholarship
Supervised by Dr Scott A. Hale & Dr Bertie Vidgen
2020 - 2021

Oxford Internet Institute, University of Oxford

MSc in Social Data Science
Distinction, 77%
Awarded the Oxford Internet Institute Thesis Prize for best graduate dissertation
2018 - 2020

Yenching Academy, Peking University

MA in China Studies and Economics
GPA: 3.99, Rank: 2/99
2015 - 2018

Trinity College, University of Cambridge

BA in Economics
Double First Class Honours
Awarded the Roger Dennis Prize for best undergraduate dissertation

Positions

.class AddExperience

September 2024 - Present

UK AI Safety Institute, His Majesty's Government

Research Scientist, Societal Impacts
Investigating the social and psychological capabilities of frontier AI
September 2023 - December 2023

New York University

Visiting Academic in Data Science
Collaborating on human-AI coordination and LLM alignment with Professor He & Professor Bowman
August 2023 - December 2023

OpenAI

Red-Teamer + Consultant
Improving the safety of OpenAI models (DALL-E & GPT-4)
February 2023 - October 2023

Google

External Student Researcher
Co-hosting an adversarial challenge to identify unsafe failure modes in text-to-image models
September 2021 - September 2023

The Alan Turing Institute

Data Scientist in Online Safety
Monitoring and detecting harmful language
September 2021 - July 2023

Rewire Online

Research Scientist
Implementing NLP solutions for online safety
October 2020 - October 2023

Oxford Artificial Intelligence Society

Research Labs Manager
Leading student research projects on AI bias
September 2019 - September 2020

The Berggruen Institute, China Center

Research Scholar
Linking Chinese philosophy to AI and privacy

Grants

.class Find$$$

2023-2024

Microsoft Accelerating Foundation Models Research Programme

Project title: “Personalised and diverse feedback for humans-and-models-in-the-loop”
2022-2024

MetaAI Dynabench Grant

Project title: “Optimizing feedback between humans-and-model-in-the-loop”
2020-2024

Economic and Social Science Research Council

PhD scholarship, Digital Social Science Pathway

(Selected) Publications

return Output

October 2024

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

NeurIPS 2024 (Oral, Top 0.5% of Submissions)
Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
October 2024

LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

NeurIPS 2024 (Oral, Top 0.5% of Submissions)
Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, Hannah Rose Kirk
April 2024

The benefits, risks and bounds of personalizing the alignment of large language models to individuals

Nature Machine Intelligence
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
December 2023

The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models

SOLAR @ NeurIPS 2023
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
December 2023

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values

EMNLP 2023
Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale
November 2023

XSTest: A test suite for identifying exaggerated safety behaviours in large language models

NAACL 2024
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
November 2023

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models

arXiv
Bertie Vidgen, Hannah Rose Kirk, Rebecca Qian, Nino Scherrer, Anand Kannappan, Scott A. Hale, Paul Röttger

Auditing large language models: a three-layered approach

AI & Ethics
Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, Luciano Floridi
March 2023

SemEval-2023 Task 10: Explainable Detection of Online Sexism

SemEval @ ACL 2023
Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, Paul Röttger
January 2023

VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution

NeurIPS 2023
Siobhan Mackenzie Hall, Fernanda Gonçalves Abrantes, Hanwen Zhu, Grace Sodunke, Aleksandar Shtedritski, Hannah Rose Kirk
November 2022

Handling and Presenting Harmful Text in NLP Research

EMNLP 2022
Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen, Leon Derczynski
September 2022

A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning

AACL 2022
Hugo Berg, Siobhan Mackenzie Hall, Yash Bhalgat, Wonsuk Yang, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain
September 2022

Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate

NAACL 2022
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale
August 2022

Tracking abuse on Twitter against football players in the 2021-22 Premier League season

Policy Report
Bertie Vidgen, Yi-Ling Chung, Pica Johansson, Hannah Rose Kirk, Angus Williams, Scott A. Hale, Helen Margetts, Paul Röttger, Laila Sprejer
May 2022

Looking for a Handsome Carpenter! Debiasing GPT-3 Job Advertisements

GeBNLP @ NAACL 2022
Conrad Borchers, Dalia Sara Gala, Benjamin Gilburt, Eduard Oravkin, Wilfried Bounsi, Yuki M. Asano, Hannah Rose Kirk
December 2021

Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models

NeurIPS 2021
Hannah Rose Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, Yuki M. Asano
August 2021

Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

WOAH @ ACL 2021
Hannah Rose Kirk, Yennie Jun, Paulius Rauba, Gal Wachtel, Ruining Li, Xingjian Bai, Noah Broestl, Martin Doff-Sotta, Aleksandar Shtedritski, Yuki M. Asano
August 2020

The Nuances of Confucianism in Technology Policy: an Inquiry into the Interaction Between Cultural and Political Systems in Chinese Digital Ethics

International Journal of Politics, Culture, and Society
Hannah Rose Kirk, Kangkyu Lee, Carlisle Micallef

In the News

display Headlines