Publications

2025

Paper

RG3

Z. Cao, S. Apel, A. Singla, and V. Demberg

“Pragmatic Reasoning improves LLM Code Generation,” 2025. [Online]. Available: https://arxiv.org/abs/2502.15835.

Abstract

Large Language Models (LLMs) have demonstrated impressive potential in
translating natural language (NL) instructions into program code. However, user
instructions often contain inherent ambiguities, making it challenging for LLMs
to generate code that accurately reflects the user's true intent. To address
this challenge, researchers have proposed to produce multiple candidates of the
program code and then rerank them to identify the best solution. In this paper,
we propose CodeRSA, a novel code candidate reranking mechanism built upon the
Rational Speech Act (RSA) framework, designed to guide LLMs toward more
comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using
one of the latest LLMs on a popular code generation dataset. Our experiment
results show that CodeRSA consistently outperforms common baselines, surpasses
the state-of-the-art approach in most cases, and demonstrates robust overall
performance. These findings underscore the effectiveness of integrating
pragmatic reasoning into code candidate reranking, offering a promising
direction for enhancing code generation quality in LLMs.

BibTeX

@online{Cao_2502.15835,
TITLE = {Pragmatic Reasoning improves {LLM} Code Generation},
AUTHOR = {Cao, Zhuchen and Apel, Sven and Singla, Adish and Demberg, Vera},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2502.15835},
EPRINT = {2502.15835},
EPRINTTYPE = {arXiv},
YEAR = {2025},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Large Language Models (LLMs) have demonstrated impressive potential in translating natural language (NL) instructions into program code. However, user instructions often contain inherent ambiguities, making it challenging for LLMs to generate code that accurately reflects the user's true intent. To address this challenge, researchers have proposed to produce multiple candidates of the program code and then rerank them to identify the best solution. In this paper, we propose CodeRSA, a novel code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework, designed to guide LLMs toward more comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using one of the latest LLMs on a popular code generation dataset. Our experiment results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. These findings underscore the effectiveness of integrating pragmatic reasoning into code candidate reranking, offering a promising direction for enhancing code generation quality in LLMs.},
}

Endnote

%0 Report
%A Cao, Zhuchen
%A Apel, Sven
%A Singla, Adish
%A Demberg, Vera
%+ Multimodal Language Processing, MPI for Informatics, Max Planck Society
External Organizations
External Organizations
Multimodal Language Processing, MPI for Informatics, Max Planck Society
%T Pragmatic Reasoning improves LLM Code Generation
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-FB6A-D
%U https://arxiv.org/abs/2502.15835
%D 2025
%X Large Language Models (LLMs) have demonstrated impressive potential in translating natural language (NL) instructions into program code. However, user instructions often contain inherent ambiguities, making it challenging for LLMs to generate code that accurately reflects the user's true intent. To address this challenge, researchers have proposed to produce multiple candidates of the program code and then rerank them to identify the best solution. In this paper, we propose CodeRSA, a novel code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework, designed to guide LLMs toward more comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using one of the latest LLMs on a popular code generation dataset. Our experiment results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. These findings underscore the effectiveness of integrating pragmatic reasoning into code candidate reranking, offering a promising direction for enhancing code generation quality in LLMs.
%K Computer Science, Computation and Language, cs.CL,Computer Science, Artificial Intelligence, cs.AI,Computer Science, Software Engineering, cs.SE

Paper

RG3

D. Liu, C. Whitehouse, X. Yu, L. Mahon, R. Saxena, Z. Zhao, Y. Qiu, M. Lapata, and V. Demberg

“What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations,” 2025. [Online]. Available: https://arxiv.org/abs/2502.08279.

Abstract

Transforming recorded videos into concise and accurate textual summaries is a
growing challenge in multimodal learning. This paper introduces VISTA, a
dataset specifically designed for video-to-text summarization in scientific
domains. VISTA contains 18,599 recorded AI conference presentations paired with
their corresponding paper abstracts. We benchmark the performance of
state-of-the-art large models and apply a plan-based framework to better
capture the structured nature of abstracts. Both human and automated
evaluations confirm that explicit planning enhances summary quality and factual
consistency. However, a considerable gap remains between models and human
performance, highlighting the challenges of scientific video summarization.

BibTeX

@online{Liu_2502.08279,
TITLE = {What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations},
AUTHOR = {Liu, Dongqi and Whitehouse, Chenxi and Yu, Xi and Mahon, Louis and Saxena, Rohit and Zhao, Zheng and Qiu, Yifu and Lapata, Mirella and Demberg, Vera},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2502.08279},
EPRINT = {2502.08279},
EPRINTTYPE = {arXiv},
YEAR = {2025},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.},
}

Endnote

%0 Report
%A Liu, Dongqi
%A Whitehouse, Chenxi
%A Yu, Xi
%A Mahon, Louis
%A Saxena, Rohit
%A Zhao, Zheng
%A Qiu, Yifu
%A Lapata, Mirella
%A Demberg, Vera
%+ External Organizations
External Organizations
External Organizations
External Organizations
External Organizations
External Organizations
External Organizations
External Organizations
Multimodal Language Processing, MPI for Informatics, Max Planck Society
%T What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-FB46-5
%U https://arxiv.org/abs/2502.08279
%D 2025
%X Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.
%K Computer Science, Computation and Language, cs.CL,Computer Science, Artificial Intelligence, cs.AI,Computer Science, Computer Vision and Pattern Recognition, cs.CV

Paper

D6RG3

V. Suresh, M. H. Mughal, C. Theobalt, and V. Demberg

“Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues,” 2025. [Online]. Available: https://arxiv.org/abs/2503.03474.

Abstract

Research in linguistics shows that non-verbal cues, such as gestures, play a
crucial role in spoken discourse. For example, speakers perform hand gestures
to indicate topic shifts, helping listeners identify transitions in discourse.
In this work, we investigate whether the joint modeling of gestures using human
motion sequences and language can improve spoken discourse modeling in language
models. To integrate gestures into language models, we first encode 3D human
motion sequences into discrete gesture tokens using a VQ-VAE. These gesture
token embeddings are then aligned with text embeddings through feature
alignment, mapping them into the text embedding space. To evaluate the
gesture-aligned language model on spoken discourse, we construct text infilling
tasks targeting three key discourse cues grounded in linguistic research:
discourse connectives, stance markers, and quantifiers. Results show that
incorporating gestures enhances marker prediction accuracy across the three
tasks, highlighting the complementary information that gestures can offer in
modeling spoken discourse. We view this work as an initial step toward
leveraging non-verbal cues to advance spoken language modeling in language
models.

BibTeX

@online{Suresh2503.03474,
TITLE = {Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues},
AUTHOR = {Suresh, Varsha and Mughal, Muhammad Hamza and Theobalt, Christian and Demberg, Vera},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2503.03474},
EPRINT = {2503.03474},
EPRINTTYPE = {arXiv},
YEAR = {2025},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.},
}

Endnote

%0 Report
%A Suresh, Varsha
%A Mughal, Muhammad Hamza
%A Theobalt, Christian
%A Demberg, Vera
%+ External Organizations
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society
Multimodal Language Processing, MPI for Informatics, Max Planck Society
%T Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
%G eng
%U http://hdl.handle.net/21.11116/0000-0011-0D79-8
%U https://arxiv.org/abs/2503.03474
%D 2025
%X Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
%K Computer Science, Computation and Language, cs.CL

Paper

D2RG3

Y. Wang, S. Rao, J.-U. Lee, M. Jobanputra, and V. Demberg

“B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.

Abstract

Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human interpretable explanations than post
hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.

BibTeX

@online{Wang2502.12992,
TITLE = {B-cos {LM}: Efficiently Transforming Pre-trained Language Models for Improved Explainability},
AUTHOR = {Wang, Yifan and Rao, Sukrut and Lee, Ji-Ung and Jobanputra, Mayank and Demberg, Vera},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2502.12992},
EPRINT = {2502.12992},
EPRINTTYPE = {arXiv},
YEAR = {2025},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural models. Meanwhile, B-cos networks have been introduced to improve model explainability through architectural and computational adaptations, but their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous B-cos methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we provide practical guidelines for effectively building B-cos LMs based on our findings. Our code is available at https://anonymous.4open.science/r/bcos_lm.},
}

Endnote

%0 Report
%A Wang, Yifan
%A Rao, Sukrut
%A Lee, Ji-Ung
%A Jobanputra, Mayank
%A Demberg, Vera
%+ External Organizations
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society
External Organizations
External Organizations
Multimodal Language Processing, MPI for Informatics, Max Planck Society
%T B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-C156-3
%U https://arxiv.org/abs/2502.12992
%D 2025
%X Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural models. Meanwhile, B-cos networks have been introduced to improve model explainability through architectural and computational adaptations, but their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous B-cos methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we provide practical guidelines for effectively building B-cos LMs based on our findings. Our code is available at https://anonymous.4open.science/r/bcos_lm.
%K Computer Science, Computation and Language, cs.CL,Computer Science, Artificial Intelligence, cs.AI

2024

Conference paper

RG3

A. Chingacham, M. Zhang, V. Demberg, and D. Klakow

“Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?,” in Proceedings of the 1st Human-Centered Large Language Modeling Workshop (HuCLLM 2024), Bangkok, Thailand, 2024.

Abstract

Large Language Models (LLMs) can generate text by transferring style
attributes like formality resulting in formal or informal text. However,
instructing LLMs to generate text that when spoken, is more intelligible in an
acoustically difficult environment, is an under-explored topic. We conduct the
first study to evaluate LLMs on a novel task of generating acoustically
intelligible paraphrases for better human speech perception in noise. Our
experiments in English demonstrated that with standard prompting, LLMs struggle
to control the non-textual attribute, i.e., acoustic intelligibility, while
efficiently capturing the desired textual attributes like semantic equivalence.
To remedy this issue, we propose a simple prompting approach,
prompt-and-select, which generates paraphrases by decoupling the desired
textual and non-textual attributes in the text generation pipeline. Our
approach resulted in a 40% relative improvement in human speech perception, by
paraphrasing utterances that are highly distorted in a listening condition with
babble noise at a signal-to-noise ratio (SNR) -5 dB. This study reveals the
limitation of LLMs in capturing non-textual attributes, and our proposed method
showcases the potential of using LLMs for better human speech perception in
noise.

BibTeX

@inproceedings{Chingacham_2408.04029,
TITLE = {Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?},
AUTHOR = {Chingacham, Anupama and Zhang, Miaoran and Demberg, Vera and Klakow, Dietrich},
LANGUAGE = {eng},
ISBN = {979-8-89176-152-0},
DOI = {10.18653/v1/2024.hucllm-1.1},
PUBLISHER = {ACL},
YEAR = {2024},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text. However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at a signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.},
BOOKTITLE = {Proceedings of the 1st Human-Centered Large Language Modeling Workshop (HuCLLM 2024)},
EDITOR = {Soni, Nikita and Flek, Lucie and Sharma, Ashish and Yang, Diyi and Hooker, Sara and Schwartz, H. Andrew},
PAGES = {1--15},
ADDRESS = {Bangkok, Thailand},
}

Endnote

%0 Conference Proceedings
%A Chingacham, Anupama
%A Zhang, Miaoran
%A Demberg, Vera
%A Klakow, Dietrich
%+ External Organizations
External Organizations
Multimodal Language Processing, MPI for Informatics, Max Planck Society
External Organizations
%T Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-43EE-7
%R 10.18653/v1/2024.hucllm-1.1
%D 2024
%B 1st Human-Centered Large Language Modeling Workshop
%Z date of event: 2024-08-15 - 2024-08-15
%C Bangkok, Thailand
%X Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text. However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at a signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.
%K Computer Science, Computation and Language, cs.CL
%B Proceedings of the 1st Human-Centered Large Language Modeling Workshop
%E Soni, Nikita; Flek, Lucie; Sharma, Ashish; Yang, Diyi; Hooker, Sara; Schwartz, H. Andrew
%P 1 - 15
%I ACL
%@ 979-8-89176-152-0

Conference paper

RG3

T. Liu, I. Škrjanec, and V. Demberg

“Temperature-scaling Surprisal Estimates Improve Fit to Human Reading Times – But Does it Do so for the ‘Right Reasons’?,” in The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), Bangkok, Thailand, 2024.

BibTeX

@inproceedings{Liu_ACL24,
TITLE = {Temperature-scaling Surprisal Estimates Improve Fit to Human Reading Times -- But Does it Do so for the {\textquotedblleft}Right Reasons{\textquotedblright}?},
AUTHOR = {Liu, Tong and {\v S}krjanec, Iza and Demberg, Vera},
LANGUAGE = {eng},
DOI = {10.18653/v1/2024.acl-long.519},
PUBLISHER = {ACL},
YEAR = {2024},
MARGINALMARK = {$\bullet$},
BOOKTITLE = {The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
EDITOR = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
PAGES = {9598--9619},
ADDRESS = {Bangkok, Thailand},
}

Endnote

%0 Conference Proceedings
%A Liu, Tong
%A Škrjanec, Iza
%A Demberg, Vera
%+ External Organizations
External Organizations
Multimodal Language Processing, MPI for Informatics, Max Planck Society
%T Temperature-scaling Surprisal Estimates Improve Fit to Human Reading Times – But Does it Do so for the “Right Reasons”?
%G eng
%U http://hdl.handle.net/21.11116/0000-000F-EAA1-3
%R 10.18653/v1/2024.acl-long.519
%D 2024
%B 62nd Annual Meeting of the Association for Computational Linguistics
%Z date of event: 2024-08-11 - 2024-08-16
%C Bangkok, Thailand
%B The 62nd Annual Meeting of the Association for Computational Linguistics
%E Ku, Lun-Wei; Martins, Andre; Srikumar, Vivek
%P 9598 - 9619
%I ACL

Paper

RG3

T. Liu, Z. Lai, G. Zhang, P. Torr, V. Demberg, V. Tresp, and J. Gu

“Multimodal Pragmatic Jailbreak on Text-to-image Models,” 2024. [Online]. Available: https://arxiv.org/abs/2409.19149.

Abstract

Diffusion models have recently achieved remarkable advancements in terms of
image quality and fidelity to textual prompts. Concurrently, the safety of such
generative models has become an area of growing concern. This work introduces a
novel type of jailbreak, which triggers T2I models to generate the image with
visual text, where the image and the text, although considered to be safe in
isolation, combine to form unsafe content. To systematically explore this
phenomenon, we propose a dataset to evaluate the current diffusion-based
text-to-image (T2I) models under such jailbreak. We benchmark nine
representative T2I models, including two close-source commercial models.
Experimental results reveal a concerning tendency to produce unsafe content:
all tested models suffer from such type of jailbreak, with rates of unsafe
generation ranging from 8% to 74%. In real-world scenarios, various filters
such as keyword blocklists, customized prompt filters, and NSFW image filters,
are commonly employed to mitigate these risks. We evaluate the effectiveness of
such filters against our jailbreak and found that, while current classifiers
may be effective for single modality detection, they fail to work against our
jailbreak. Our work provides a foundation for further development towards more
secure and reliable T2I models.

BibTeX

@online{Liu_2409.19149,
TITLE = {Multimodal Pragmatic Jailbreak on Text-to-image Models},
AUTHOR = {Liu, Tong and Lai, Zhixin and Zhang, Gengyuan and Torr, Philip and Demberg, Vera and Tresp, Volker and Gu, Jindong},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2409.19149},
EPRINT = {2409.19149},
EPRINTTYPE = {arXiv},
YEAR = {2024},
MARGINALMARK = {$\bullet$},
DATE = {2024},
ABSTRACT = {Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate the image with visual text, where the image and the text, although considered to be safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate the current diffusion-based text-to-image (T2I) models under such jailbreak. We benchmark nine representative T2I models, including two close-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models suffer from such type of jailbreak, with rates of unsafe generation ranging from 8\% to 74\%. In real-world scenarios, various filters such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and found that, while current classifiers may be effective for single modality detection, they fail to work against our jailbreak. Our work provides a foundation for further development towards more secure and reliable T2I models.},
}

Endnote

%0 Report
%A Liu, Tong
%A Lai, Zhixin
%A Zhang, Gengyuan
%A Torr, Philip
%A Demberg, Vera
%A Tresp, Volker
%A Gu, Jindong
%+ External Organizations
External Organizations
External Organizations
External Organizations
Multimodal Language Processing, MPI for Informatics, Max Planck Society
External Organizations
External Organizations
%T Multimodal Pragmatic Jailbreak on Text-to-image Models
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-43FC-7
%U https://arxiv.org/abs/2409.19149
%D 2024
%X Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate the image with visual text, where the image and the text, although considered to be safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate the current diffusion-based text-to-image (T2I) models under such jailbreak. We benchmark nine representative T2I models, including two close-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models suffer from such type of jailbreak, with rates of unsafe generation ranging from 8% to 74%. In real-world scenarios, various filters such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and found that, while current classifiers may be effective for single modality detection, they fail to work against our jailbreak. Our work provides a foundation for further development towards more secure and reliable T2I models.
%K Computer Science, Computer Vision and Pattern Recognition, cs.CV,Computer Science, Artificial Intelligence, cs.AI,Computer Science, Cryptography and Security, cs.CR,Computer Science, Learning, cs.LG

Paper

D6RG3

M. H. Mughal, R. Dabral, M. C. J. Scholman, V. Demberg, and C. Theobalt

“Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2412.06786.

Abstract

Non-verbal communication often comprises semantically rich gestures that
help convey the meaning of an utterance. Producing such semantic co-speech
gestures has been a major challenge for existing neural systems, which can
generate rhythmic beat gestures but struggle to produce semantically
meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based
gesture generation approach that leverages Retrieval Augmented Generation (RAG)
to produce natural-looking and semantically rich gestures. Our neuro-explicit
gesture generation approach is designed to produce semantic gestures grounded
in interpretable linguistic knowledge. We achieve this by using explicit domain
knowledge to retrieve exemplar motions from a database of co-speech gestures.
Once retrieved, we inject these semantic exemplar gestures into our
diffusion-based gesture generation pipeline using DDIM inversion and retrieval
guidance at inference time, without any need for training. Further, we
propose a control paradigm for guidance that allows users to modulate the
amount of influence each retrieval insertion has over the generated sequence.
Our comparative evaluations demonstrate the validity of our approach against
recent gesture generation approaches. The reader is urged to explore the
results on our project page.

BibTeX

@online{Mughal2412.06786,
TITLE = {Retrieving Semantics from the Deep: an {RAG} Solution for Gesture Synthesis},
AUTHOR = {Mughal, Muhammad Hamza and Dabral, Rishabh and Scholman, Merel C. J. and Demberg, Vera and Theobalt, Christian},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2412.06786},
EPRINT = {2412.06786},
EPRINTTYPE = {arXiv},
YEAR = {2024},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.},
}

Endnote

%0 Report
%A Mughal, Muhammad Hamza
%A Dabral, Rishabh
%A Scholman, Merel C. J.
%A Demberg, Vera
%A Theobalt, Christian
%+ Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society
Multimodal Language Processing, MPI for Informatics, Max Planck Society
Multimodal Language Processing, MPI for Informatics, Max Planck Society
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society
%T Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-C006-E
%U https://arxiv.org/abs/2412.06786
%D 2024
%X Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.
%K Computer Science, Computer Vision and Pattern Recognition, cs.CV