Abstract
Graphical abstracts play a crucial role in scholarly communication and information visualization by facilitating the rapid comprehension and structured organization of academic content. However, their multimodal nature poses challenges to automated analysis and information extraction, particularly in terms of cross-disciplinary adaptability and semantic understanding. As multimodal large language models are increasingly applied in academic information processing, optimizing graphical abstract interpretation through prompt engineering has emerged as a significant research focus. Despite these advances, systematic frameworks that integrate visual-semantic cues with prompt-based reasoning for precise domain classification remain scarce.
To address this gap, this study proposes a two-stage prompt engineering framework to enhance the interpretive and classificatory capabilities of large language models (LLMs) in analyzing graphical abstracts. In the initial stage, LLMs leverage their multimodal capacity to extract key visual-semantic features, such as concepts and relationships, directly from graphical content, independent of textual metadata. The subsequent stage deploys four prompting strategies—Zero-Shot, Few-Shot, Chain-of-Thought, and Structured Prompting—to formulate preliminary classification hypotheses. These hypotheses are then fused with the extracted visual features through compositional prompting, enabling context-sensitive domain classification with explainable inference pathways.
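The two-stage composition described above can be sketched as plain prompt assembly. All templates, strategy wordings, and feature strings below are illustrative assumptions for exposition, not the authors' actual prompts:

```python
# Hypothetical sketch of the two-stage compositional prompting pipeline.
# Stage 1 extracts visual-semantic features; stage 2 fuses them with one of
# four strategy prompts. Every template here is an assumed placeholder.

STAGE1_TEMPLATE = (
    "You are shown a graphical abstract. List the key concepts and the "
    "relationships depicted between them, ignoring any caption text."
)

STRATEGIES = {
    "zero_shot": "Name the most likely academic discipline of this graphical abstract.",
    "few_shot": (
        "Example 1: molecular structures and reaction arrows -> Chemistry\n"
        "Example 2: network diagram and loss curves -> Computer Science\n"
        "Now classify the current graphical abstract."
    ),
    "chain_of_thought": (
        "Reason step by step: describe the visual conventions, infer the "
        "discipline they suggest, then state your final answer."
    ),
    "structured": 'Return JSON: {"concepts": [...], "discipline": "..."}',
}

def compose_prompt(stage1_features: str, strategy: str) -> str:
    """Fuse stage-1 visual features with a stage-2 strategy prompt."""
    return (
        f"{STAGE1_TEMPLATE}\n\n"
        f"Extracted visual-semantic features:\n{stage1_features}\n\n"
        f"{STRATEGIES[strategy]}"
    )
```

For instance, `compose_prompt("beakers; reaction pathway; catalyst cycle", "chain_of_thought")` yields a single prompt that grounds the reasoning strategy in the stage-1 features, which is the fusion step the framework relies on.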
To validate this framework, we curated a dataset of 177 graphical abstracts from Scopus-indexed journals, each manually annotated with disciplinary labels. Supplementary textual data, including titles, abstracts, keywords, and journal names, served as ground truth for robust evaluation. Performance was assessed using nDCG@k, MAP@k, and Precision@k metrics at k = 1, 2, and 3. Results indicate that Few-Shot and Chain-of-Thought strategies consistently outperform their counterparts, achieving peak scores of 0.510 (MAP@3) and 0.533 (nDCG@3), confirmed by Friedman and Nemenyi tests (p < 0.05). Notably, the study reveals that prompting strategies interact meaningfully with domain-specific visual conventions, indicating the framework's potential for robust cross-disciplinary deployment. These findings highlight the pivotal role of prompt architecture in multimodal reasoning and offer actionable insights for designing effective prompt strategies to advance scholarly knowledge extraction.
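The ranking metrics used for evaluation can be made concrete with a minimal sketch. The definitions below follow one common convention (binary relevance, AP normalized by min(k, number of relevant labels)); the paper's exact variants may differ:

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted domains that are correct."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    """AP@k with binary relevance; MAP@k averages this over all abstracts."""
    hits, score = 0, 0.0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i  # precision at each hit position
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG discounted by log2(rank + 1)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For a ranked prediction `["Chemistry", "Biology", "Physics"]` against the single correct label `{"Biology"}`, Precision@3 is 1/3, AP@3 is 0.5 (the only hit sits at rank 2), and nDCG@3 is 1/log2(3) ≈ 0.63, illustrating how the three metrics weight rank position differently.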