When Large Multimodal Models Confront Evolving Knowledge

Challenges and Pathways

Background

"The up-to-date events and entities are constantly emerging on the Internet."

– Evolving Knowledge




Introduction

Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to stay consistent with real-world updates, making it difficult to acquire evolving knowledge without catastrophic forgetting. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, leaving multimodal evolving knowledge injection in LMMs largely unexplored. To address this, we propose the EVOKE benchmark to evaluate LMMs' ability to inject multimodal evolving knowledge in real-world scenarios. Our comprehensive evaluation of multimodal evolving knowledge injection reveals two challenges: (1) existing knowledge injection methods perform terribly on evolving knowledge; (2) supervised fine-tuning causes catastrophic forgetting, with instruction-following ability being the most severely compromised. We further provide pathways and find that: (1) text knowledge augmentation during the training phase improves performance, whereas image augmentation does not; (2) continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting. Our findings indicate that current knowledge injection methods face substantial limitations on evolving knowledge, which motivates further research on more efficient and stable knowledge injection methods.


Contribution 1: Evolving Knowledge Benchmark (EVOKE)

  • We propose an automated pipeline for collecting evolving knowledge to construct EVOKE.
  • EVOKE is a benchmark for evaluating multimodal evolving knowledge injection in LMMs under real-world scenarios.


Contribution 2: Challenges of Evolving Knowledge Injection

  • We conduct extensive experiments on evolving knowledge injection and reveal two challenges.
  • Existing knowledge injection methods perform terribly, and supervised fine-tuning causes catastrophic forgetting.

Contribution 3: Pathways of Evolving Knowledge Injection

  • We provide pathways and demonstrate that text knowledge augmentation during the training phase improves performance.
  • Continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting.

EVOlving KnowledgE Benchmark

The EVOKE benchmark comprises 9,422 knowledge-image pairs for LMM knowledge injection, spanning 159 fine-grained types (29 News types and 130 Entity types).



Overall pipeline of the EVOKE benchmark construction. (a) First, we collect original data from CNN and Wikipedia and filter out popular data. (b) Second, we use GPT-4o to summarize the textual content of the original data. (c) Subsequently, QA pairs are generated by GPT-4o, and query images are downloaded from Google. (d) Finally, we manually review the original knowledge images and query images. Source of the heuristic query: we manually write multiple templates and randomly select one for each piece of data.
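
A minimal sketch of these four stages in Python is shown below, assuming hypothetical stand-in callables (popularity_filter, summarize_with_gpt4o, generate_qa_with_gpt4o, download_google_image, manual_review) for the crawling, GPT-4o prompting, image search, and human review steps; the template strings are illustrative, not the actual hand-written templates.

import random

# Illustrative heuristic-query templates; the real templates are hand-written by the authors.
HEURISTIC_TEMPLATES = [
    "What is shown in this image about {entity}?",
    "Based on the image, what recently happened to {entity}?",
]

def build_evoke_entry(raw_item, popularity_filter, summarize_with_gpt4o,
                      generate_qa_with_gpt4o, download_google_image, manual_review):
    """Sketch of the EVOKE pipeline: filter -> summarize -> QA + query image -> human review.

    All callables are hypothetical stand-ins for the steps in the figure caption above.
    """
    # (a) Keep only non-popular (evolving) knowledge collected from CNN / Wikipedia.
    if popularity_filter(raw_item):
        return None

    # (b) Summarize the textual content of the original data with GPT-4o.
    knowledge_text = summarize_with_gpt4o(raw_item["text"])

    # (c) Generate QA pairs with GPT-4o and download a query image from Google.
    qa_pair = generate_qa_with_gpt4o(knowledge_text)
    query_image = download_google_image(raw_item["entity"])
    heuristic_query = random.choice(HEURISTIC_TEMPLATES).format(entity=raw_item["entity"])

    # (d) Manual review of the original knowledge image and the query image.
    entry = {
        "knowledge": knowledge_text,
        "knowledge_image": raw_item["image"],
        "query_image": query_image,
        "question": qa_pair["question"],
        "answer": qa_pair["answer"],
        "heuristic_query": heuristic_query,
    }
    return entry if manual_review(entry) else None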

Knowledge Injection Challenge 1: Terrible Performance


This finding indicates that although LMMs are not explicitly exposed to query-related knowledge during training, they can leverage existing knowledge to make reasonable inferences in question answering.

Experimental results indicate that none of the evaluated approaches demonstrate superior effectiveness in multimodal evolving knowledge injection tasks. The best-performing method, LLaVA's MM-RAG-Golden Context, achieves an accuracy of only 56.13, which falls short of expectations. Notably, some methods perform extremely poorly; for example, Qwen-VL-Chat's LoRA achieves an accuracy only 1.11 higher than Vanilla. These findings underscore the significant potential for advancement in the field of evolving knowledge injection.

As shown in Table 1, MM-RAG performs well on both LLaVA-v1.5 and Qwen-VL-Chat without updating LLM parameters, thereby avoiding potential side effects. Additionally, the results for Text-Only and Image-Only retrieval are similar, but both are significantly lower than UniIR.
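
As a rough illustration of how MM-RAG injects knowledge as context rather than into model parameters, the sketch below prepends the top-k retrieved knowledge entries to the question; the retriever callable is a hypothetical stand-in for Text-Only, Image-Only, or UniIR-style retrieval, not the exact implementation evaluated here.

def mm_rag_prompt(question, query_image, retriever, knowledge_base, k=1):
    """Assemble an MM-RAG style prompt: retrieved knowledge is injected as context only,
    so the LMM's parameters stay frozen. `retriever` is a hypothetical callable that may
    use the text, the image, or both (e.g., a UniIR-style multimodal retriever).
    """
    retrieved = retriever(question, query_image, knowledge_base, top_k=k)
    context = "\n".join(entry["knowledge"] for entry in retrieved)
    return (
        "Context (retrieved evolving knowledge):\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer concisely based on the image and the context."
    )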

By observation, Perplexity AI outperforms Gemini and approaches the Golden Context performance of Qwen-VL-Chat, highlighting the internet retrieval capabilities of commercial LMMs. IAG methods independently retrieve critical information from the Internet without relying on externally injected data, using it as input for model reasoning. They achieve reasonably good performance without side effects, motivating further research on more efficient IAG methods.


Knowledge Injection Performance on Fine-grained Types


Benchmarks for Evaluating LMMs' Previous Capabilities

To systematically evaluate the side effects of knowledge injection on the general capabilities of LMMs, we conduct comprehensive assessments using 12 benchmark datasets spanning 7 distinct capability dimensions:

  • Comprehensive Evaluation
  • Optical Character Recognition
  • Multidisciplinary
  • Instruction Following
  • Multi-Round QA
  • Mathematical Reasoning
  • Hallucination

MME is a comprehensive evaluation benchmark designed to assess the performance of LMMs across 14 distinct tasks, encompassing both perception and cognition abilities. To ensure fair and accurate comparisons, MME provides concise, manually designed instruction-answer pairs, eliminating the need for extensive prompt engineering.

MMBench is a bilingual benchmark designed to evaluate the comprehensive capabilities of LMMs across multiple modalities. It offers a meticulously curated dataset with over 3,000 multiple-choice questions covering 20 distinct ability dimensions, such as object localization and social reasoning. Additionally, MMBench provides questions in both English and Chinese, enabling comparative evaluations of LMM performance across these languages.

SEED-Bench-2-Plus is a comprehensive benchmark designed to evaluate the performance of LMMs in understanding text-rich visual content, such as charts, maps, and web pages. It consists of 2,300 multiple-choice questions spanning three broad categories: Charts, Maps, and Webs, each covering a wide range of real-world scenarios where text and visual elements are intertwined. The benchmark aims to address the gap in evaluating LMMs' ability to comprehend and reason about visual data that contains significant textual information, which is crucial for practical applications like document analysis, navigation, and web content understanding.

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of LMMs. It encompasses 29 datasets across five key tasks: Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). The benchmark aims to provide a thorough assessment of LMMs' performance in various text-related visual tasks, highlighting their strengths and weaknesses, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions.

MMMU is a comprehensive benchmark designed to evaluate LMMs on tasks that require college-level subject knowledge and deliberate reasoning. It comprises 11,500 meticulously curated multimodal questions sourced from college exams, quizzes, and textbooks, spanning six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering. These questions cover 30 subjects and 183 subfields, featuring 30 diverse image types such as charts, diagrams, maps, tables, music sheets, and chemical structures.

ScienceQA is a benchmark designed to evaluate AI models' abilities in scientific question answering. It includes multiple-choice and free-response questions across core subjects like Mathematics, Physics, Chemistry, and Biology. The benchmark provides knowledge points and detailed explanations for each problem, facilitating comprehensive assessment of reasoning capabilities.

MIA-Bench is a benchmark designed to evaluate the ability of LMMs to adhere strictly to complex instructions. It comprises a diverse set of 400 image-prompt pairs, each crafted to challenge models' compliance with layered instructions, requiring accurate and contextually appropriate responses.

MMDU is a comprehensive evaluation framework designed to assess the capabilities of LMMs in handling multi-turn, multi-image dialog scenarios. It focuses on understanding complex interactions involving multiple images and sequential dialog turns, which are critical for real-world applications like visual storytelling, medical diagnosis, and interactive AI systems. The benchmark includes a diverse dataset with rich annotations, enabling models to be fine-tuned and evaluated on tasks requiring contextual reasoning, image-text alignment, and temporal coherence.

MathVista is a benchmark designed to evaluate the mathematical reasoning capabilities of foundation models within visual contexts. It comprises 6,141 examples drawn from 28 existing multimodal datasets and introduces three new datasets: IQTest, FunctionQA, and PaperQA. These tasks require models to perform fine-grained visual understanding and compositional reasoning.

Math-Vision is a meticulously curated dataset comprising 3,040 high-quality mathematical problems, each embedded within a visual context and sourced from real mathematics competitions. This benchmark spans 16 distinct mathematical disciplines and is organized across five levels of difficulty, offering a comprehensive platform to evaluate the mathematical reasoning abilities of LMMs.

POPE is a benchmark designed to systematically assess object hallucination in LMMs. Object hallucination refers to the tendency of these models to generate descriptions containing objects not present in the corresponding images. POPE addresses this issue by implementing a polling-based query method that evaluates models' accuracy in identifying the existence of specific objects within images. This approach provides a more stable and flexible evaluation of object hallucination, revealing that current LMMs often generate objects inconsistent with the target images.
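
For intuition, POPE-style polling reduces hallucination measurement to scoring binary existence questions; the sketch below assumes a hypothetical ask_model callable that returns a yes/no answer and is not tied to any specific LMM API.

def pope_polling_accuracy(samples, ask_model):
    """samples: iterable of (image, object_name, exists) triples.
    ask_model(image, question) -> "yes" or "no" (hypothetical LMM interface).
    Returns the fraction of existence questions answered correctly.
    """
    correct = 0
    total = 0
    for image, object_name, exists in samples:
        question = f"Is there a {object_name} in the image? Please answer yes or no."
        prediction = ask_model(image, question).strip().lower().startswith("yes")
        correct += int(prediction == exists)
        total += 1
    return correct / max(total, 1)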

HallusionBench is a comprehensive benchmark designed to evaluate LMMs on their ability to accurately interpret and reason about visual data, specifically addressing issues of language hallucination and visual illusion. It comprises 346 images paired with 1,129 questions, divided into visual-dependent and visual-supplement categories. The benchmark introduces a novel structure for visual questions, enabling quantitative analysis of models' response tendencies, logical consistency, and various failure modes.

Knowledge Injection Challenge 2: Catastrophic Forgetting

Supervised fine-tuning causes catastrophic forgetting of LMMs' previous capabilities, with instruction-following ability being the most severely compromised.

Knowledge Injection Pathway 1: Knowledge Augmentation

Text knowledge augmentation during the training phase improves injection performance, while image augmentation does not.
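
One plausible form of train-time text knowledge augmentation is to expand each knowledge passage into several paraphrased supervised fine-tuning samples while leaving the image untouched; the sketch below assumes a hypothetical paraphrase callable (e.g., backed by an LLM) and is illustrative rather than the exact augmentation recipe used here.

def augment_text_knowledge(entry, paraphrase, n_variants=3):
    """Expand one knowledge-image pair into several SFT samples by paraphrasing the
    knowledge text. `paraphrase(text, i)` is a hypothetical callable (e.g., an LLM prompt).
    The image is left untouched, mirroring the finding that text augmentation helps
    while image augmentation does not.
    """
    samples = [{"image": entry["knowledge_image"], "text": entry["knowledge"]}]
    for i in range(n_variants):
        samples.append({
            "image": entry["knowledge_image"],
            "text": paraphrase(entry["knowledge"], i),
        })
    return samples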

Knowledge Injection Pathway 2: Continual Learning

Continual learning methods, especially Replay and MoELoRA, effectively mitigate the forgetting caused by knowledge injection.
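
As a hedged illustration of the Replay idea (not the exact configuration used here), each training set can mix the new evolving-knowledge samples with a small proportion of replayed general instruction-tuning data, so the model keeps seeing the distribution it would otherwise forget.

import random

def replay_mix(new_knowledge_data, general_instruction_data, replay_ratio=0.2, seed=0):
    """Build a training list that interleaves replayed general instruction-tuning samples
    with the new evolving-knowledge samples. `replay_ratio` (illustrative) controls how many
    old samples are replayed per new sample.
    """
    rng = random.Random(seed)
    n_replay = int(len(new_knowledge_data) * replay_ratio)
    replayed = rng.sample(general_instruction_data, min(n_replay, len(general_instruction_data)))
    mixed = list(new_knowledge_data) + replayed
    rng.shuffle(mixed)
    return mixed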

Ablation Experiments

Sequential Fine-Tuning



MM-RAG


Qualitative Examples

Our Team

BibTeX

@article{jiang2025evoke,
  title = {When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways},
  author = {Kailin Jiang and Yuntao Du and Yukai Ding and Yuchen Ren and Ning Jiang and Zhi Gao and Zilong Zheng and Lei Liu and Bin Li and Qing Li},
  year = {2025}
}