
A wonderful journey to discover knowledge: Get "superpowers" from pre-trained models

In the world of machine learning, training deep networks is not easy. The choice of network architecture, data augmentation method, and optimization algorithm often determines how well a model performs on a given task. As the number of available models grows rapidly, more and more researchers are turning to the idea of "knowledge transfer". In their paper "Fantastic Gains and Where to Find Them", Karsten Roth et al. (2024) explore how to transfer knowledge effectively between arbitrary pretrained models, and even how to obtain "superpowers" from weaker ones.

Knowledge complementarity: the complementary knowledge of pre-trained models

By analyzing a large collection of publicly available model repositories, the researchers found that even when the overall performance of different pre-trained models differs significantly, there is still substantial "complementary knowledge" between them. This refers to information that one model (the teacher) has captured but another model (the student) has not. For example, in an experiment comparing 466 different models, the researchers found that even weaker models can provide a significant number of positive prediction flips in specific categories, suggesting that they hold unique knowledge in some areas.
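As a rough illustration of how such complementary knowledge can be quantified, the sketch below counts "positive prediction flips": samples the teacher classifies correctly but the student gets wrong. The function names and the use of NumPy arrays of class ids are assumptions for illustration, not the authors' evaluation code.

```python
# Minimal sketch (not the authors' code): estimating complementary knowledge
# between two pretrained classifiers by counting positive prediction flips.
# Assumes teacher_preds, student_preds, labels are 1-D integer arrays of class ids.
import numpy as np

def positive_flip_rate(teacher_preds, student_preds, labels):
    # Fraction of samples the teacher gets right but the student gets wrong.
    flips = (teacher_preds == labels) & (student_preds != labels)
    return flips.mean()

def per_class_flip_rates(teacher_preds, student_preds, labels, num_classes):
    # Break the flips down by class to locate where the teacher's extra knowledge sits.
    flips = (teacher_preds == labels) & (student_preds != labels)
    return np.array([
        flips[labels == c].mean() if np.any(labels == c) else 0.0
        for c in range(num_classes)
    ])
```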

This phenomenon raises an important question: can this complementary knowledge be transferred without external rankings or held-out test sets, so that the student model improves without giving up the performance it already has? The authors argue that knowledge transfer is not just a matter of parameter adjustment; it must also account for the complementarity of knowledge between models.

Continual Learning: A New Perspective on Knowledge Transfer

To address the "catastrophic forgetting" that can arise during knowledge transfer, the researchers frame the problem as one of continual learning. Traditional knowledge distillation typically assumes that information flows from a well-trained teacher to an untrained student. In practice, however, the student model is usually already trained and has its own knowledge reserve. The core question is therefore how to retain the student's existing knowledge while absorbing new knowledge from the teacher.
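For contrast, here is a minimal sketch of the classic distillation objective in the sense of Hinton et al. (2015), which implicitly assumes the student starts from scratch; the temperature and loss weighting are illustrative choices, not values from the paper.

```python
# Minimal sketch of the classic knowledge-distillation loss (Hinton et al., 2015),
# shown only to contrast with the continual-learning setting described above.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: keep fitting the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```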

This is where data partitioning comes in. The researchers propose splitting the training data into two parts: one part is used to learn from the teacher model, and the other is used to preserve the student model's existing knowledge. This approach significantly relaxes the constraints placed on weight updates, enabling successful knowledge transfer between very different pretrained models. Experimental results show that with this data partitioning technique, the success rate of knowledge transfer rises to 92.5%, far higher than that of traditional methods.
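The following is a hedged sketch of this data-partitioning idea, assuming a frozen copy of the original student is kept as a reference: samples on which the teacher assigns higher probability to the true class than the frozen student are distilled from the teacher, while the remaining samples are pulled back toward the frozen student to preserve prior knowledge. The selection rule and the plain KL losses are simplifications for illustration, not the paper's exact formulation.

```python
# Hedged sketch of data-partitioned knowledge transfer (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def partitioned_transfer_loss(student_logits, teacher_logits, frozen_student_logits,
                              labels, temperature=1.0):
    idx = torch.arange(labels.size(0))
    # Which model looks more reliable on each sample (probability of the true class)?
    p_teacher = F.softmax(teacher_logits, dim=-1)[idx, labels]
    p_frozen = F.softmax(frozen_student_logits, dim=-1)[idx, labels]
    use_teacher = p_teacher > p_frozen  # partition the batch

    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl_teacher = F.kl_div(log_p_student,
                          F.softmax(teacher_logits / temperature, dim=-1),
                          reduction="none").sum(-1)
    kl_frozen = F.kl_div(log_p_student,
                         F.softmax(frozen_student_logits / temperature, dim=-1),
                         reduction="none").sum(-1)

    # Learn from the teacher where it adds knowledge, preserve the student elsewhere.
    return torch.where(use_teacher, kl_teacher, kl_frozen).mean()
```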

The relationship between model characteristics and knowledge transfer

Beyond the transfer method itself, the researchers also examined in depth how model characteristics affect knowledge transfer. The results show a significant positive correlation between a model's capacity (number of parameters) and its ability to absorb new knowledge. For convolutional neural networks (CNNs) in particular, the strong visual inductive bias makes it easy for existing knowledge to be overwritten, so a more cautious transfer strategy is required.

Knowledge transfer from multiple teacher models

As model repositories continue to grow, how to transfer knowledge effectively from multiple teacher models has become a new research direction. The researchers examined three approaches, sketched below: parallel transfer, sequential transfer, and model soup (Model Soups) fusion. The core question in all of them is how to exploit the knowledge of different models to improve the overall performance of the student. Experimental results show that sequential transfer tends to achieve better knowledge gains, whereas naive parallel transfer does not necessarily bring improvements.
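Below is a hedged sketch of two of these strategies, assuming a generic `transfer(student, teacher, data)` training routine (for example, one built around the partitioned loss above); all names here are illustrative, not the paper's API.

```python
# Illustrative sketch of multi-teacher strategies; assumes a user-supplied
# transfer(student, teacher, data) routine that returns the updated student.
import copy
import torch

def sequential_transfer(student, teachers, data, transfer):
    # Transfer from one teacher at a time; the student accumulates knowledge step by step.
    # (Parallel transfer would instead distill from all teachers within each training step.)
    for teacher in teachers:
        student = transfer(student, teacher, data)
    return student

def model_soup(models):
    # "Model soup": average the weights of several models with identical architecture.
    # Assumes all state-dict entries are floating point (or safely castable).
    soup = copy.deepcopy(models[0])
    avg_state = {
        k: torch.stack([m.state_dict()[k].float() for m in models]).mean(0)
        for k in soup.state_dict()
    }
    soup.load_state_dict(avg_state)
    return soup
```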

Summary and outlook

Through its in-depth study of complementary knowledge, the work of Karsten Roth et al. opens a new horizon: knowledge transfer is not merely a technical detail, and with the right methods, knowledge can be shared across models. In the future, as the number of available models grows and transfer methods continue to improve, we may be able to go even further in machine learning.

References

  1. Roth, K., Thede, L., Koepke, A., Vinyals, O., Hénaff, O., & Akata, Z. (2024). Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer Between Any Pretrained Model. ICLR 2024.
  2. Hinton, G. E., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network.
  3. Kirkpatrick, J., Pascanu, R., & Rabinowitz, N. (2016). Overcoming catastrophic forgetting in neural networks.
  4. Wortsman, M., et al. (2022). Model Soups: Averaging Weights of Multiple Neural Networks Improves Generalization.
  5. Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.