PEREA 1.0 Released: Towards Robust and Cognitive Task-solving via Perception Reasoning Disentanglement
- May 3
Updated: May 4
The vision-language model (VLM)-based artificial intelligence paradigm has revolutionized many real-world domains by achieving strong performance. However, it is still far from perfect on abstract, difficult tasks that demand complicated task-solving logic, such as mathematics olympiads and scientific discovery. To achieve better performance in such challenging domains, in this work we propose a cognitive framework design called perception reasoning disentanglement (PEREA). Unlike previous methods, which infer answers in an end-to-end manner, PEREA draws inspiration from the human brain, where the right and left hemispheres respectively handle perceptual understanding and logical reasoning. It proposes disentangling the perception and reasoning processes into two stages, which are handled by a perceptor module and a reasoner module, respectively.

An overview of the PEREA framework is shown in the figure above. In essence, PEREA explicitly disentangles the task-solving process of an AI system into two stages, namely perception and reasoning. The two stages are performed by a perceptor module and a reasoner module, respectively, both implemented on top of backbone VLMs. However, we show empirically that current state-of-the-art VLMs suffer from poor visual understanding when dealing with images that contain abstract, complex patterns. In addition, their reasoning performance is still far from perfect on samples that require a long task-solving logic chain. To mitigate these drawbacks, we introduce a knowledge recap mechanism that forces the perceptor and reasoner modules to first identify which categories of expertise are required to solve the current task. The VLMs then recap the fundamentals of the identified categories and summarize the corresponding knowledge into a reusable knowledge base. Conditioned on the explicit constraints formulated in this knowledge base, the perceptor and reasoner modules can generate descriptive premises of the input with much higher accuracy and conduct more robust long-chain reasoning.
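The two-stage flow with knowledge recap can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the `VLM` callable interface, every function name, and all prompt wordings are hypothetical. It also assumes that each module builds its own knowledge base and that the reasoner sees only the text premises, neither of which is stated in the post.

```python
from typing import Callable, Optional

# Hypothetical interface: a VLM is any callable that takes a text prompt and
# an optional image (raw bytes) and returns text. Not an API from the post.
VLM = Callable[[str, Optional[bytes]], str]


def recap_knowledge(vlm: VLM, task: str, image: Optional[bytes]) -> str:
    """Knowledge recap: identify the required expertise, then summarize it."""
    categories = vlm(
        f"List the categories of expertise required to solve this task:\n{task}",
        image,
    )
    # The fundamentals of those categories become a reusable knowledge base.
    return vlm(
        f"Recap the fundamentals of these categories as a knowledge base:\n{categories}",
        None,
    )


def perceive(vlm: VLM, task: str, image: Optional[bytes], knowledge: str) -> str:
    """Perception stage: turn the input into descriptive premises."""
    prompt = (
        f"Knowledge base:\n{knowledge}\n\n"
        f"Describe the input as explicit premises relevant to this task:\n{task}"
    )
    return vlm(prompt, image)


def reason(vlm: VLM, task: str, premises: str, knowledge: str) -> str:
    """Reasoning stage: reason over the premises only (assumption: no image)."""
    prompt = (
        f"Knowledge base:\n{knowledge}\n\n"
        f"Premises:\n{premises}\n\n"
        f"Reason step by step and answer this task:\n{task}"
    )
    return vlm(prompt, None)


def solve(perceptor: VLM, reasoner: VLM, task: str, image: Optional[bytes]) -> str:
    """Run perception first, then reasoning, each conditioned on a recap."""
    perceptor_knowledge = recap_knowledge(perceptor, task, image)
    premises = perceive(perceptor, task, image, perceptor_knowledge)
    reasoner_knowledge = recap_knowledge(reasoner, task, None)
    return reason(reasoner, task, premises, reasoner_knowledge)
```

Passing the same backbone as both `perceptor` and `reasoner` gives a single-model setup, while passing two different callables would let each stage use a different backbone.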
Results on VisuLogic Bench
We evaluate PEREA on VisuLogic Bench, a challenging benchmark designed to test visual perception and logical reasoning ability in abstract visual problem-solving tasks. The benchmark is particularly suitable for evaluating whether a VLM can correctly perceive complex visual patterns and then perform multi-step reasoning over them.

As shown in the figure, adding PEREA consistently improves the overall accuracy across different backbone VLMs. The improvements are particularly strong for Seed 2.0 Lite, Seed 2.0 Pro, and Gemini 3.1 Flash-Lite, where ZipMind brings absolute accuracy gains of +13.3, +10.0, and +11.3 points, respectively. These results suggest that many failures in difficult visual reasoning tasks arise not only from weak reasoning, but also from incomplete or inaccurate perception. By providing a more structured perceptual foundation before reasoning, ZipMind enables the model to solve abstract visual reasoning problems more robustly.
Overall, the VisuLogic Bench results demonstrate that ZipMind can substantially improve VLM performance on cognitively demanding visual reasoning tasks, with the strongest configuration even exceeding the human reference score.
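For readers unfamiliar with the metric, an absolute accuracy gain in points is simply the difference between two accuracy percentages measured on the same samples. The sketch below illustrates that arithmetic; the function names are hypothetical, and the toy predictions are invented for illustration only and are not taken from VisuLogic Bench or from the reported results.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return accuracy as a percentage of exactly matching predictions."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def absolute_gain(baseline_acc: float, method_acc: float) -> float:
    """Absolute gain in percentage points (not a relative improvement)."""
    return method_acc - baseline_acc


# Toy data, invented for illustration; not benchmark results.
answers = ["A", "B", "C", "D", "A"]
baseline_preds = ["A", "C", "C", "A", "B"]  # 2 of 5 correct
method_preds = ["A", "B", "C", "A", "A"]    # 4 of 5 correct

baseline_acc = accuracy(baseline_preds, answers)
method_acc = accuracy(method_preds, answers)
gain = absolute_gain(baseline_acc, method_acc)
print(f"baseline={baseline_acc:.1f} method={method_acc:.1f} gain=+{gain:.1f}")
```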

Your research has greatly piqued my interest. May I ask if you have published any related papers or technical reports? Or have you made the code and model weights available in the open-source community?