Hello World
I’m Hannah Rose Kirk.

keywords = {
  Large Language Models; Online Safety; Bias Mitigation; Statistics; China AI;
  Oxford Internet Institute; UK AI Safety Institute; Oxford AI Society;
  Cambridge University; Peking University;
  Sci-Fi Books; Poke Bowls; Documentaries; Emoji 😺😸; Running (slowly!);
}

print MySummary

I currently research AI alignment as a PhD candidate at the University of Oxford and as a Research Scientist at the UK AI Safety Institute.

My current research centres on human-and-model-in-the-loop feedback and data-centric alignment of AI. I am passionate about the societal impact of AI systems as they scale across capabilities, domains and human populations.

My published work spans computational linguistics, economics, ethics and sociology, addressing issues such as alignment, bias, fairness and hate speech from a multidisciplinary perspective. Alongside academia, I often collaborate with industry and policymakers.

Education

.class GetDegrees

2021 - 2025

Oxford Internet Institute, University of Oxford

DPhil in Social Data Science
Fully-funded scholarship
Supervised by Dr Scott A. Hale & Dr Bertie Vidgen
2020 - 2021

Oxford Internet Institute, University of Oxford

MSc in Social Data Science
Distinction, 77%
Awarded the Oxford Internet Institute Thesis Prize for best graduate dissertation
2018 - 2020

Yenching Academy, Peking University

MA in China Studies and Economics
GPA: 3.99, Rank: 2/99
2015 - 2018

Trinity College, University of Cambridge

BA in Economics
Double First Class Honours
Awarded the Roger Dennis Prize for best undergraduate dissertation

Positions

.class AddExperience

September 2024 - Present

UK AI Safety Institute, His Majesty's Government

Research Scientist, Societal Impacts
Investigating the social and psychological capabilities of frontier AI
September 2023 - December 2023

New York University

Visiting Academic in Data Science
Collaborating on human-AI coordination and LLM alignment with Professor He & Professor Bowman
August 2023 - December 2023

OpenAI

Red-Teamer + Consultant
Improving the safety of OpenAI models (DALL-E & GPT-4)
February 2023 - October 2023

Google

External Student Researcher
Co-hosting an adversarial challenge to identify unsafe failure modes in text-to-image models
September 2021 - September 2023

The Alan Turing Institute

Data Scientist in Online Safety
Monitoring and detecting harmful language
September 2021 - July 2023

Rewire Online

Research Scientist
Implementing NLP solutions for online safety
October 2020 - October 2023

Oxford Artificial Intelligence Society

Research Labs Manager
Leading student research projects on AI bias
September 2019 - September 2020

The Berggruen Institute, China Center

Research Scholar
Linking Chinese philosophy to AI and privacy

Grants

.class Find$$$

2023-2024

Microsoft Accelerating Foundation Models Research Programme

Project title: “Personalised and diverse feedback for humans-and-models-in-the-loop”
2022-2024

MetaAI Dynabench Grant

Project title: “Optimizing feedback between humans-and-model-in-the-loop”
2020-2024

Economic and Social Science Research Council

PhD scholarship, Digital Social Science Pathway

(Selected) Publications

return Output

October 2024

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

NeurIPS 2024 (Oral, Top 0.5% of Submissions)
Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
October 2024

LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

NeurIPS 2024 (Oral, Top 0.5% of Submissions)
Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, Hannah Rose Kirk
April 2024

The benefits, risks and bounds of personalizing the alignment of large language models to individuals

Nature Machine Intelligence
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
December 2023

The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models

SOLAR @ NeurIPS 2023
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
December 2023

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values

EMNLP 2023
Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale
November 2023

XSTest: A test suite for identifying exaggerated safety behaviours in large language models

NAACL 2024
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
November 2023

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models

arXiv
Bertie Vidgen, Hannah Rose Kirk, Rebecca Qian, Nino Scherrer, Anand Kannappan, Scott A. Hale, Paul Röttger

Auditing large language models: a three-layered approach

AI & Ethics
Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, Luciano Floridi
March 2023

SemEval-2023 Task 10: Explainable Detection of Online Sexism

SemEval @ ACL 2023
Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, Paul Röttger
January 2023

VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution

NeurIPS 2023
Siobhan Mackenzie Hall, Fernanda Gonçalves Abrantes, Hanwen Zhu, Grace Sodunke, Aleksandar Shtedritski, Hannah Rose Kirk
November 2022

Handling and Presenting Harmful Text in NLP Research

EMNLP 2022
Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen, Leon Derczynski
September 2022

A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning

AACL 2022
Hugo Berg, Siobhan Mackenzie Hall, Yash Bhalgat, Wonsuk Yang, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain
September 2022

Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate

NAACL 2022
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale
August 2022

Tracking abuse on Twitter against football players in the 2021-22 Premier League season

Policy Report
Bertie Vidgen, Yi-Ling Chung, Pica Johansson, Hannah Rose Kirk, Angus Williams, Scott A. Hale, Helen Margetts, Paul Röttger, Laila Sprejer
May 2022

Looking for a Handsome Carpenter! Debiasing GPT-3 Job Advertisements

GeBNLP @ NAACL 2022
Conrad Borchers, Dalia Sara Gala, Benjamin Gilburt, Eduard Oravkin, Wilfried Bounsi, Yuki M. Asano, Hannah Rose Kirk
December 2021

Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models

NeurIPS 2021
Hannah Rose Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, Yuki M. Asano
August 2021

Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

WOAH @ ACL 2021
Hannah Rose Kirk, Yennie Jun, Paulius Rauba, Gal Wachtel, Ruining Li, Xingjian Bai, Noah Broestl, Martin Doff-Sotta, Aleksandar Shtedritski, Yuki M. Asano
August 2020

The Nuances of Confucianism in Technology Policy: an Inquiry into the Interaction Between Cultural and Political Systems in Chinese Digital Ethics

International Journal of Politics, Culture, and Society
Hannah Rose Kirk, Kangkyu Lee, Carlisle Micallef

In the News

display Headlines