Publications

2025

EMNLP2025

# Embodied AI # LLM # Safety

Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making

Yejin Son*, Minseo Kim*, Sungwoong Kim, Seungju Han, Jian Kim, Dongju Jang, Youngjae Yu, Chanyoung Park

Arxiv

EMNLP2025

# Multimodal # Agent # Reasoning

VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms

Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, Youngjae Yu

Arxiv

EMNLP2025

# Multimodal # Document # Information Retrieval

Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

Yejin Choi*, Jaewoo Park*, Janghan Yoon, Saejin Kim, Jaehyun Jeon, Youngjae Yu

EMNLP2025

# Multimodal # Audio # Video

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

Arxiv

EMNLP2025 (Findings)

# Multimodal # Commonsense Reasoning # Abductive Reasoning

Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd

Yejin Son*, Saejin Kim*, Dongjun Min, Youngjae Yu

COLM2025

# Multimodal # Safety # Societal Implications

G1yphD3c0de: Towards Safer Language Models on Visually Perturbed Texts

Yejin Choi, Yejin Yeo, Yejin Son, Seungju Han, Youngjae Yu

COLM2025

# NLP # Fact Verification

Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers

Wooseok Seo*, Seungju Han*, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu

Arxiv

COLM2025

# Multimodal # Video

HIPPO-VIDEO: Simulating Watch Histories with Large Language Models for History-Driven Video Highlighting

Jeongeun Lee, Youngjae Yu, Dongha Lee

Arxiv

ICCV2025

# Video Generation # Distillation # Preference Learning

V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models

Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu

Arxiv

ICCV2025

# 3D # Human Motion # Generation

DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Jungbin Cho*, Junwan Kim*, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu

Arxiv

ICCV2025

# Multimodal # Ambiguity

VAGUE: Visual Contexts Clarify Ambiguous Expressions

Heejeong Nam, Jinwoo Ahn, Keummin Ka, Jiwan Chung, Youngjae Yu

Arxiv

MICCAI2025

# Computer Vision # Scalp Diagnosis # Image Translation

Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation

Youngmin Kim*, Saejin Kim*, Hoyeon Moon, Youngjae Yu, Junhyug Noh

Arxiv

ACL2025

# Multimodal # Nonverbal Conversation # Video # 3D

Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues

Youngmin Kim*, Jiwan Chung*, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, Youngjae Yu

Arxiv

ACL2025 (Oral)

# NLP # Personality # Reinforcement Learning

Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games

Seungwon Lim, Seungbeen Lee, Dongjun Min, Youngjae Yu

Arxiv

ACL2025

# Multimodal # MLLM

Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?

Jiwan Chung, Janghan Yoon, Junhyeong Park, Sangeyl Lee, Joowon Yang, Sooyeon Park, Youngjae Yu

Arxiv

ACL2025

# NLP # LLM # Safety

Representation Bending for Large Language Model Safety

Ashkan Yousefpour*, Taeheon Kim*, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Harrison Ngan, Youngjae Yu, Jonghyun Choi

Arxiv

# Computer Vision # Video # Industrial Application

SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis

Youngmin Kim*, Giyeong Oh*, Kwangsoo Youm, Youngjae Yu

Arxiv

# Computer Vision

Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

Giyeong Oh, Woohyun Cho, Siyeol Kim, Suhwan Choi, Youngjae Yu

Arxiv

# Multimodal # Reasoning

Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Jiwan Chung*, Junhyeok Kim*, Siyeol Kim, Jaeyoung Lee, Minsoo Kim, Youngjae Yu

Arxiv

# Multimodal # MLLM # AI for Science

When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu

Arxiv

# Multimodal # UI

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding

Jaehyun Jeon, Minsoo Kim, Janghan Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu

Arxiv

# NLP # Math # Education

Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park*, Jungyang Park*, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu

Arxiv

# Multimodal # MLLM

Teaching Metric Distance to Autoregressive Multimodal Foundational Models

Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu

Arxiv

# Multimodal # Video # Egocentric

GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Junhyeok Kim*, Jaewoo Park*, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu

Arxiv

# LLM # DPO # Human Preference

KL Penalty Control via Perturbation for Direct Preference Optimization

Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

Arxiv

# LLM # Watermark # Low-rank Adaptation

SEAL: Entangled White-box Watermarks on Low-Rank Adaptation

Giyeong Oh, Saejin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu

Arxiv

ICRA2025

# Embodied AI # Robotics # Navigation

CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

Suhwan Choi, Yongjun Cho, Minchan Kim, Jaeyoon Jung, Myunchul Joe, Yubeen Park, Minseo Kim, Sungwoong Kim, Sungjae Lee, Hwiseong Park, Jiwan Chung, Youngjae Yu

Arxiv

NAACL2025 (Oral)

# Multimodal # LLM # Chart Generation

C^2: Scalable Auto-Feedback for LLM-based Chart Generation

Woosung Koh*, Janghan Yoon*, Minhyung Lee, Youngjin Song, Jaegwan Cho, Jaehyun Kang, Taehyeon Kim, Seyoung Yun, Youngjae Yu, Bongshin Lee

Arxiv

NAACL2025 (Findings)

# NLP # Personality # Psychometrics

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Seungbeen Lee*, Seungwon Lim*, Seungju Han, Giyeong Oh, Jiwan Chung, Minju Kim, Yeonsoo Lee, Dongha Lee, Jinyoung Yeo, Youngjae Yu

Arxiv

NAACL2025 (Findings)

# Multimodal # Egocentric # Dialogue System

EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild

Junhyeok Kim, Minsoo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu

Arxiv

AAAI2025

# 3D # Speech # Facial Expression

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

Jisoo Kim*, Jungbin Cho*, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, Youngjae Yu

Arxiv

AAAI2025

# Multimodal # Debiasing

MASS: Overcoming Language Bias in Image-Text Matching

Jiwan Chung, Seungwon Lim, Sangkyu Lee, Youngjae Yu

Arxiv

AAAI2025

# Multimodal # Video LLM # Preference

i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment

Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Arxiv

2024

DMFR2024

# LLM # Chatbot # Medical

How well do large language model-based chatbots perform in oral and maxillofacial radiology?

Hui Jeong, Sang-Sun Han, Minhyung Lee, Youngjae Yu, Saejin Kim, Kug Jin Jeon

Paper

# Image Generation # Diffusion # Prompt Optimization

TIPO: Text to Image with Text Presampling for Prompt Optimization

Shih-Ying Yeh, Sang-Hyun Park, Giyeong Oh, Min Song, Youngjae Yu

Arxiv

NeurIPS2024

# Multimodal # Creative AI

Towards Visual Text Design Transfer Across Languages

Yejin Choi*, Jiwan Chung*, Sumin Shim, Giyeong Oh, Youngjae Yu

Arxiv

NeurIPS2024

# NLP # AI Safety # LLM # Jailbreaking # Alignment

WILDTEAMING at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Liwei Jiang, Kavel Rao, Seungju Han, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

Arxiv

NeurIPS2024

# NLP # AI Safety # LLM # Moderation

WILDGUARD: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

Arxiv

EMNLP2024

# Multimodal # Ambiguity

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu

Arxiv

# Image Generation # Diffusion # Personalization

Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation

Kangyeol Kim*, Wooseok Seo*, Sehyun Nam, Bodam Kim, Suhyeon Jeong, Wonwoo Cho, Jaegul Choo, Youngjae Yu

Arxiv

ECCV2024

# Multimodal # Video LMM # Preference

ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos

Hyolim Kang, Jeongseok Hyun, Joungbin An, Youngjae Yu, Seon Joo Kim

Arxiv

EMNLP2024 (Findings)

# NLP # Psychological Counseling # Dialogue

CACTUS: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory

Suyeon Lee, Sunghwan Kim, Minju Kim, Dongjin Kang, Dongil Yang, Harim Kim, Minseok Kang, Dayi Jung, Min Hee Kim, Seungbeen Lee, Kyoung-Mee Chung, Youngjae Yu, Dongha Lee, Jinyoung Yeo

Arxiv

EMNLP2024 (Findings)

# Multimodal # Fact Checking # Misinformation

How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models

Jaeyoung Lee, Ximing Lu, Jack Hessel, Faeze Brahman, Youngjae Yu, Yonatan Bisk, Yejin Choi, Saadia Gabriel

Arxiv

EMNLP2024 (Oral)

# Multimodal Understanding # Visual Reasoning

Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

Jiwan Chung*, Sungjae Lee*, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu

Arxiv

ICRA2024

# Robotics # NLP # Uncertainty Estimation

CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents

Jeongeun Park, Seungwon Lim, Joonhyung Lee, Sangbeom Park, Minsuk Chang, Youngjae Yu, Sungjoon Choi

Arxiv

EMNLP2024 (Findings)

# NLP # Education # Question Difficulty Estimation

Large Language Models are Students at Various Levels: Zero-shot Question Difficulty Estimation

Jaewoo Park, Seongjin Park, Hyun Sik Won, Kang Min Kim

Arxiv

ACL2024 (Oral)

# Multimodal # RLAIF

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

Daechul Ahn, Yura Choi, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Arxiv

ACL2024

# NLP # Reward Modeling

Aligning Large Language Models by On-Policy Self-Judgment

Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu

Arxiv

ACL2024 (Findings)

# NLP # Conversation # Recommendation

Pearl: A Review-driven Persona-Knowledge Grounded Conversational Recommendation Dataset

Minjin Kim, Minju Kim, Hana Kim, Beong-woo Kwak, Soyeon Chun, Hyunseo Kim, SeongKu Kang, Youngjae Yu, Jinyoung Yeo, Dongha Lee

Arxiv

ACL2024 (Outstanding)

# NLP # Conversation

Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation

Dongjin Kang, Sunghwan Kim, Taeyoon Kwon, Seungjun Moon, Hyunsouk Cho, Youngjae Yu, Dongha Lee, Jinyoung Yeo

Arxiv

# Korean-LLM # Naver

HyperCLOVA X Technical Report

Jiwan Chung, Sangkyu Lee, Youngjae Yu contributed.

Arxiv

EMNLP2024

# NLP # Reasoning # Code Generation

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, Jinyoung Yeo

Arxiv

NAACL2024

# Multimodal # Commonsense # Video Understanding

SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models

Hyun Lee, Kim Sung-Bin, Seungju Han, Youngjae Yu, Tae-Hyun Oh

Arxiv

ICLR2024

# Text-to-Image # PEFT

Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation

Shih-Ying Yeh, Yu-Guan Hsieh, Zhidong Gao, Bernard B W Yang, Giyeong Oh, Yanmin Gong

Arxiv

2023

Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

VLIS: Unimodal Language Models Guide Multimodal Language Generation

SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents

Long Story Short: a Summarize-then-Search Method for Long Video Question Answering

Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning

Zero-shot Active Visual Search (ZAVIS): Intelligent Object Search for Robotic Assistants

2022

NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

PROSOCIALDIALOG: A Prosocial Backbone for Conversational Agents

2021

Dual Compositional Learning in Interactive Image Retrieval

Parameter Efficient Multimodal Transformers for Video Representation Learning

Self-Supervised Learning of Compressed Video Representations

Transitional Adaptation of Pretrained Models for Visual Storytelling

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos