Abstract
Graphical abstracts play a crucial role in scholarly communication and information visualization by facilitating the rapid comprehension and structured organization of academic content. However, their multimodal nature poses challenges to automated analysis and information extraction, particularly in terms of cross-disciplinary adaptability and semantic understanding. As multimodal large language models are increasingly applied in academic information processing, optimizing graphical abstract interpretation through prompt engineering has emerged as a significant research focus. Despite these advances, systematic frameworks that integrate visual-semantic cues with prompt-based reasoning for precise domain classification remain scarce.
To address this gap, this study proposes a two-stage prompt engineering framework to enhance the interpretive and classificatory capabilities of large language models (LLMs) in analyzing graphical abstracts. In the initial stage, LLMs leverage their multimodal capacity to extract key visual-semantic features, such as concepts and relationships, directly from graphical content, independent of textual metadata. The subsequent stage deploys four prompting strategies—Zero-Shot, Few-Shot, Chain-of-Thought, and Structured Prompting—to formulate preliminary classification hypotheses. These hypotheses are then fused with the extracted visual features through compositional prompting, enabling context-sensitive domain classification with explainable inference pathways.
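The two-stage composition described above can be sketched as plain prompt assembly. All templates, strategy wordings, and feature strings below are illustrative assumptions for exposition, not the authors' actual prompts:

```python
# Hypothetical sketch of the two-stage compositional prompting pipeline.
# Stage 1 extracts visual-semantic features; stage 2 fuses them with one of
# four strategy prompts. Every template here is an assumed placeholder.

STAGE1_TEMPLATE = (
    "You are shown a graphical abstract. List the key concepts and the "
    "relationships depicted between them, ignoring any caption text."
)

STRATEGIES = {
    "zero_shot": "Name the most likely academic discipline of this graphical abstract.",
    "few_shot": (
        "Example 1: molecular structures and reaction arrows -> Chemistry\n"
        "Example 2: network diagram and loss curves -> Computer Science\n"
        "Now classify the current graphical abstract."
    ),
    "chain_of_thought": (
        "Reason step by step: describe the visual conventions, infer the "
        "discipline they suggest, then state your final answer."
    ),
    "structured": 'Return JSON: {"concepts": [...], "discipline": "..."}',
}

def compose_prompt(stage1_features: str, strategy: str) -> str:
    """Fuse stage-1 visual features with a stage-2 strategy prompt."""
    return (
        f"{STAGE1_TEMPLATE}\n\n"
        f"Extracted visual-semantic features:\n{stage1_features}\n\n"
        f"{STRATEGIES[strategy]}"
    )
```

For instance, `compose_prompt("beakers; reaction pathway; catalyst cycle", "chain_of_thought")` yields a single prompt that grounds the reasoning strategy in the stage-1 features, which is the fusion step the framework relies on.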
To validate this framework, we curated a dataset of 177 graphical abstracts from Scopus-indexed journals, each manually annotated with disciplinary labels. Supplementary textual data, including titles, abstracts, keywords, and journal names, served as ground truth for robust evaluation. Performance was assessed using nDCG@k, MAP@k, and Precision@k metrics at k = 1, 2, and 3. Results indicate that Few-Shot and Chain-of-Thought strategies consistently outperform their counterparts, achieving peak scores of 0.510 (MAP@3) and 0.533 (nDCG@3), confirmed by Friedman and Nemenyi tests (p < 0.05). Notably, the study reveals that prompting strategies interact meaningfully with domain-specific visual conventions, indicating the framework's potential for robust cross-disciplinary deployment. These findings highlight the pivotal role of prompt architecture in multimodal reasoning and offer actionable insights for designing effective prompt strategies to advance scholarly knowledge extraction.
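The ranking metrics used for evaluation can be made concrete with a minimal sketch. The definitions below follow one common convention (binary relevance, AP normalized by min(k, number of relevant labels)); the paper's exact variants may differ:

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted domains that are correct."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    """AP@k with binary relevance; MAP@k averages this over all abstracts."""
    hits, score = 0, 0.0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i  # precision at each hit position
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG discounted by log2(rank + 1)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For a ranked prediction `["Chemistry", "Biology", "Physics"]` against the single correct label `{"Biology"}`, Precision@3 is 1/3, AP@3 is 0.5 (the only hit sits at rank 2), and nDCG@3 is 1/log2(3) ≈ 0.63, illustrating how the three metrics weight rank position differently.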