OceanGPT(沧渊)

A Large Language Model for Ocean Science Tasks

The Potential of OceanGPT

To train OceanGPT(沧渊), we collected an ocean science corpus that spans multiple fields. Since each subfield and topic has its own data characteristics and patterns, we propose a domain-specific instruction generation framework called DoInstruct. OceanGPT is trained on top of open-source models (such as Qwen, LLaMA, and MiniCPM).

Disclaimer: This project is purely an academic exploration rather than a product. Please be aware that due to the inherent limitations of large language models, there may be issues such as hallucinations.

OceanGPT(沧渊) is designed specifically for the ocean domain and can handle various ocean science tasks, including ocean-specific question answering and content generation. Additionally, we attempt to validate the potential of OceanGPT in simulated underwater embodied intelligence. The model still has limitations such as hallucination, and we will continue to maintain OceanGPT, aiming to enhance its capabilities for real-world applications in marine research and exploration.

We have released OceanGPT (2B, 7B, 14B) at:

HuggingFace

ModelScope

WiseModel

Model

Covering approximately 71% of the Earth’s surface, the ocean plays a crucial role in global climate regulation, weather patterns, biodiversity, and human economic development. Ocean science research focuses on the natural characteristics of the ocean, its changing patterns, and the theories, methods, and applications related to the development and utilization of ocean resources. Therefore, we propose a large language model, OceanGPT, designed specifically for the ocean domain. It can handle various ocean science tasks, including Q&A and content generation. Additionally, we attempt to validate the potential of the large language model in simulating underwater robot operations, further exploring the realization of LLM-driven underwater embodied intelligence.

Data quality is crucial for training domain-specific large language models. To train OceanGPT, we collected an ocean science corpus spanning multiple fields. Each subfield and topic has its own data characteristics and patterns, which led us to propose a domain-specific instruction generation framework named DoInstruct. This framework uses multi-agent collaboration to generate instruction fine-tuning data for ocean science. The approach is intended to keep the data professional and accurate while allowing efficient, parallel data generation. DoInstruct employs agents (such as GPT-3.5-turbo) as experts for each ocean topic, with the agents collaborating to expand the instruction set rapidly. The framework defines three agents:

Evolutionary Data Synthesis Agent: This agent employs two collaborative strategies. The first supplements and expands the background knowledge of seed samples; the second refines the analysis to enhance the knowledge contained in the seed data.
Fine-tuned Literature Reading Agent: A large language model is first fine-tuned into a model specialized for literature extraction, which enables the agent to extract high-quality sentences from the vast ocean science literature.
Quality Assurance Audit Agent: Specific syntactic and semantic rules related to ocean science are predefined, and the agent is constructed through prompting to filter the data against them.

We trained OceanGPT on open-source base models (such as Qwen, LLaMA, and MiniCPM) using the instructions generated by the DoInstruct framework.
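The division of labor among the three agents can be illustrated with a minimal Python sketch. This is not the DoInstruct implementation: every name below (Instruction, evolve, read_literature, passes_audit, do_instruct, and the stub models) is hypothetical, the language models are plain callables so that any backend could be plugged in, and the audit step uses simple hard-coded rules as a stand-in for the prompted audit agent described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical interface: a language model is any function from prompt to text.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Instruction:
    topic: str
    prompt: str
    response: str


def evolve(seed: Instruction, llm: LLM) -> List[Instruction]:
    """Evolutionary synthesis: one variant adds background, one refines the analysis."""
    enriched = llm(f"Add background knowledge on {seed.topic}: {seed.response}")
    refined = llm(f"Refine and deepen this analysis: {seed.response}")
    return [
        Instruction(seed.topic, seed.prompt, enriched),
        Instruction(seed.topic, seed.prompt, refined),
    ]


def read_literature(topic: str, passages: Iterable[str], extractor: LLM) -> List[Instruction]:
    """Literature reading: the extractor (assumed fine-tuned) returns a key sentence per passage."""
    items = []
    for passage in passages:
        sentence = extractor(passage).strip()
        if sentence:
            items.append(Instruction(topic, f"Explain this finding about {topic}.", sentence))
    return items


def passes_audit(item: Instruction, min_words: int = 5) -> bool:
    """Audit stand-in: keep responses that are long enough and end as a sentence."""
    text = item.response.strip()
    return len(text.split()) >= min_words and text.endswith((".", "?", "!"))


def do_instruct(
    seeds: Iterable[Instruction],
    passages_by_topic: Dict[str, List[str]],
    llm: LLM,
    extractor: LLM,
) -> List[Instruction]:
    """Collect candidates from both generating agents, then filter them with the audit."""
    candidates: List[Instruction] = []
    for seed in seeds:
        candidates.extend(evolve(seed, llm))
    for topic, passages in passages_by_topic.items():
        candidates.extend(read_literature(topic, passages, extractor))
    return [c for c in candidates if passes_audit(c)]


# Stub models so the sketch runs without any external service.
def stub_llm(prompt: str) -> str:
    return "Stub completion for: " + prompt


def stub_extractor(passage: str) -> str:
    return passage.split(".")[0] + "."


seeds = [Instruction("tides", "What causes tides?",
                     "Tides are driven by the gravitational pull of the Moon and Sun.")]
passages = {"tides": ["Spring tides occur near new and full moons. They recur twice a month.",
                      "Short."]}
dataset = do_instruct(seeds, passages, stub_llm, stub_extractor)
print(len(dataset), "instructions kept")
```

In this sketch the two generating agents run independently per seed and per topic, which is what would make parallel generation straightforward; the audit is applied once at the end so that both sources are held to the same rules.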

Benchmark

We have released the instruction dataset OceanInstruct and constructed a benchmark dataset named OceanBench for evaluating large language models in the ocean domain. Experimental results indicate that OceanGPT outperforms baseline language models on the vast majority of tasks. In contrast, existing open-source large language models struggle with tasks that require specialized knowledge of ocean science. Additionally, our multi-agent data generation framework allows OceanGPT to act as an expert in various subfields within the ocean domain, which suggests that OceanGPT serves as a comparatively strong expert model across diverse ocean subfields.

Simulated Underwater Embodied Intelligence

We evaluate OceanGPT’s preliminary capability for underwater robot control, including tasks such as trajectory planning, within simulators.

Demo

Apply for trial


Team Leader

Prof. Huajun Chen

Full Professor, College of Computer Science, Zhejiang University

Co-PI

Ningyu Zhang

Associate Professor
School of Software Technology, Zhejiang University

Guozhou Zheng

Associate Professor
Ocean Research Center of Zhoushan, Zhejiang University

Zhen Bi

PhD Student

Yida Xue

PhD Student

Chenxi Wang

MSc Student

Xiaozhuan Liang

MSc Student

Kangwei Liu

MSc Student

Jizhan Fang

MSc Student

Jintian Zhang

MSc Student

Zekun Xi

MSc Student

Hongjie Deng

AI Engineer

Chuankun Li

AI Engineer

Zhenghao Zhu

AI Engineer

Kun Gan

AI Engineer

Copyright ©2024 ZJUKG All Rights Reserved.