The Potential of OceanGPT
To train OceanGPT (沧渊), we collected an ocean science corpus that spans multiple fields. Because each subfield and topic has its own data characteristics and patterns, we proposed a domain-specific instruction generation framework called DoInstruct. OceanGPT is trained on top of open-source models (such as Qwen, LLaMA, and MiniCPM).
Disclaimer: This project is purely an academic exploration rather than a product. Please be aware that due to the inherent limitations of large language models, there may be issues such as hallucinations.
OceanGPT (沧渊) is designed specifically for the ocean domain and can handle various ocean science tasks, including ocean-specific question answering and content generation. Additionally, we attempt to validate the potential of OceanGPT in simulated underwater embodied intelligence. The model still has limitations such as hallucination, and we will continue to maintain OceanGPT, aiming to enhance its capabilities for real-world applications in marine research and exploration.
Model
Covering approximately 71% of the Earth’s surface, the ocean plays a crucial role in global climate regulation, weather patterns, biodiversity, and human economic development. Ocean science research focuses on the natural characteristics of the ocean, its changing patterns, and the theories, methods, and applications related to the development and utilization of ocean resources. Therefore, we propose a large language model, OceanGPT, designed specifically for the ocean domain. It can handle various ocean science tasks, including Q&A and content generation. Additionally, we attempt to validate the potential of the large language model in simulating underwater robot operations, further exploring the realization of LLM-driven underwater embodied intelligence.
Data quality is crucial for training domain-specific large language models. To train OceanGPT, we collected an ocean science corpus spanning multiple fields. Because each subfield and topic has its own data characteristics and patterns, we propose a domain-specific instruction generation framework named DoInstruct. The framework uses multi-agent collaboration to generate instruction fine-tuning data for ocean science. This approach ensures that the data is both professional and accurate while allowing it to be generated efficiently in parallel. DoInstruct employs agents (such as GPT-3.5-turbo) as experts for each ocean topic, and the agents rapidly expand instructions by collaborating with one another. The framework defines three agents:
Evolutionary Data Synthesis Agent: This agent employs two collaborative strategies. The first supplements and expands the background knowledge of seed samples; the second refines the analysis to enhance the knowledge contained in the seed data.
Fine-tuned Literature Reading Agent: A large language model is first fine-tuned into a model specialized for literature extraction, which enables the agent to extract high-quality sentences from the vast ocean literature.
Quality Assurance Audit Agent: Specific syntactic and semantic rules related to ocean science are predefined, and the agent is constructed through prompting to filter the data against these rules.
We trained OceanGPT on open-source base models (such as Qwen, LLaMA, and MiniCPM) using instructions generated by the DoInstruct framework.
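The division of labour among the three agents can be illustrated with a minimal Python sketch. It is not the DoInstruct implementation: the `Instruction` type, the function names, the prompts, and the `toy_llm` stand-in are all hypothetical, and the audit step here uses a simple length and duplicate rule in place of the prompted rule checks described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instruction:
    topic: str
    text: str


# Hypothetical stand-in for a model call; a real agent would query an LLM here.
LLM = Callable[[str], str]


def evolve(seed: Instruction, llm: LLM) -> List[Instruction]:
    """Evolutionary synthesis: expand the background, then refine the analysis."""
    expanded = llm(f"Expand the background knowledge of: {seed.text}")
    refined = llm(f"Refine and deepen the analysis of: {seed.text}")
    return [Instruction(seed.topic, expanded), Instruction(seed.topic, refined)]


def read_literature(topic: str, document: str, llm: LLM) -> List[Instruction]:
    """Literature reading: keep only sentences the extractor accepts."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [Instruction(topic, s) for s in sentences if llm(f"Keep? {s}") == "yes"]


def audit(candidates: List[Instruction], min_words: int = 4) -> List[Instruction]:
    """Quality audit (assumed rules): drop very short and duplicate items."""
    seen = set()
    kept = []
    for cand in candidates:
        key = cand.text.lower()
        if len(cand.text.split()) >= min_words and key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept


def run_pipeline(seeds: List[Instruction], documents: Dict[str, str], llm: LLM) -> List[Instruction]:
    """Collect candidates from both generating agents, then filter them."""
    candidates: List[Instruction] = []
    for seed in seeds:
        candidates.extend(evolve(seed, llm))
    for topic, doc in documents.items():
        candidates.extend(read_literature(topic, doc, llm))
    return audit(candidates)


def toy_llm(prompt: str) -> str:
    """Toy model: accepts sentences mentioning 'ocean', otherwise echoes the prompt."""
    if prompt.startswith("Keep?"):
        return "yes" if "ocean" in prompt.lower() else "no"
    return prompt


seeds = [Instruction("currents", "What drives thermohaline circulation?")]
documents = {"currents": "Ocean currents redistribute heat across the globe. Ok. The weather is nice today."}
result = run_pipeline(seeds, documents, toy_llm)
for item in result:
    print(item.topic, "|", item.text)
```

Because each seed and each document is processed independently before the audit step, the two generating stages could be run in parallel per topic, which is the property the framework description emphasizes.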
Benchmark
We have released the instruction dataset OceanInstruct and constructed OceanBench, a benchmark dataset for ocean-domain large language models. Experimental results indicate that OceanGPT outperforms baseline language models on the vast majority of tasks, whereas existing open-source large language models struggle with tasks that require specialized knowledge of ocean science. Our multi-agent data generation framework also allows OceanGPT to act as an expert in various subfields of the ocean domain. Together, these results indicate that OceanGPT is a stronger expert model than the baselines across diverse ocean subfields.
Simulated Underwater Embodied Intelligence
We evaluate OceanGPT’s preliminary capability for underwater robot control, including tasks such as trajectory planning, within simulators.
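As an illustration of how a text plan returned by a language model might be sanity-checked before it is handed to a simulator, the following Python sketch parses waypoints and checks them against a depth limit. The plain-text waypoint format (one `x y depth` triple per line), the function names, and the depth limit are assumptions made for this example; they do not describe OceanGPT's actual output or any simulator interface.

```python
import math
from typing import List, Tuple

# Assumed layout: (x, y, depth) in metres, depth measured downward from the surface.
Waypoint = Tuple[float, float, float]


def parse_waypoints(plan_text: str) -> List[Waypoint]:
    """Parse one 'x y depth' triple per line, skipping malformed lines."""
    waypoints: List[Waypoint] = []
    for line in plan_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # blank line or free text, not a waypoint
        try:
            x, y, depth = (float(p) for p in parts)
        except ValueError:
            continue  # three tokens, but not all numeric
        waypoints.append((x, y, depth))
    return waypoints


def path_length(waypoints: List[Waypoint]) -> float:
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))


def within_depth(waypoints: List[Waypoint], max_depth: float) -> bool:
    """True if every waypoint lies between the surface and max_depth."""
    return all(0.0 <= depth <= max_depth for _, _, depth in waypoints)


# Hypothetical model output; the last line is free text and is ignored.
plan = "0 0 5\n3 4 5\n3 4 10\nsurface now"
route = parse_waypoints(plan)
print(route, path_length(route), within_depth(route, max_depth=10.0))
```

A check of this kind only guards against malformed or out-of-range plans; whether a trajectory is actually feasible still has to be evaluated in the simulator.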
Team Leader
Prof. Huajun Chen
Full Professor, College of Computer Science, Zhejiang University
Co-PI
Ningyu Zhang
Associate Professor, School of Software Technology, Zhejiang University
Guozhou Zheng
Associate Professor, Ocean Research Center of Zhoushan, Zhejiang University
Zhen Bi
PhD Student
Yida Xue
PhD Student
Chenxi Wang
MSc Student
Xiaozhuan Liang
MSc Student
Kangwei Liu
MSc Student
Jizhan Fang
MSc Student
Jintian Zhang
MSc Student
Zekun Xi
MSc Student
Hongjie Deng
AI Engineer
Chuankun LI
AI Engineer
Zhenghao Zhu
AI Engineer
Kun Gan
AI Engineer
Copyright ©2024 ZJUKG All Rights Reserved.