Model

Data quality is crucial for training domain-specific large language models. To train OceanGPT, we collected an ocean science corpus spanning multiple fields. Because each subfield and topic has its own data characteristics and patterns, we propose a domain-specific instruction generation framework named DoInstruct. The framework uses multi-agent collaboration to generate instruction fine-tuning data for ocean science. This approach ensures that the data is both professional and accurate while allowing efficient parallel generation. DoInstruct employs agents (such as GPT-3.5-turbo) as experts for each ocean topic, and the agents rapidly expand instructions by collaborating with one another. The framework defines three agents, described below.

Covering approximately 71% of the Earth’s surface, the ocean plays a crucial role in global climate regulation, weather patterns, biodiversity, and human economic development. Ocean science research focuses on the natural characteristics of the ocean, its changing patterns, and the theories, methods, and applications related to the development and utilization of ocean resources. To support this research, we propose OceanGPT, a large language model designed specifically for the ocean domain. It can handle various ocean science tasks, including Q&A and content generation. We also explore the model's potential for simulating underwater robot operations, as a step toward model-driven underwater embodied intelligence.

Evolutionary Data Synthesis Agent: This agent employs two collaborative strategies. First, it supplements and expands the background knowledge of seed samples. Second, it refines the analysis to enhance and improve the knowledge contained in the seed data.
Fine-tuned Literature Reading Agent: A large language model is first fine-tuned into a model specialized for literature extraction, which enables this agent to extract high-quality sentences from the vast ocean science literature.
Quality Assurance Audit Agent: This agent is constructed through prompting with predefined syntactic and semantic rules specific to ocean science, and it filters the generated data to ensure its quality.

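The three agent roles above can be pictured as a small generate-then-filter pipeline. The following is a minimal sketch under stated assumptions, not the DoInstruct implementation: the names `Instruction`, `evolve`, `read_literature`, `audit`, and `run_pipeline`, the prompt wording, and the PASS/FAIL reply convention are all hypothetical, and each model is represented by a plain function from prompt to reply so the example runs offline. The thread pool is only one possible way to illustrate parallel generation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for any chat-model call (e.g. GPT-3.5-turbo):
# takes a prompt string and returns the model's reply as a string.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Instruction:
    topic: str
    text: str


def evolve(seed: Instruction, llm: LLM) -> List[Instruction]:
    """Evolutionary data synthesis: apply both strategies to one seed sample."""
    expanded = llm(f"Supplement the background knowledge of this {seed.topic} sample: {seed.text}")
    refined = llm(f"Refine the analysis in this {seed.topic} sample: {seed.text}")
    return [Instruction(seed.topic, expanded), Instruction(seed.topic, refined)]


def read_literature(topic: str, document: str, extractor: LLM) -> List[Instruction]:
    """Literature reading: the extractor is assumed to reply with one sentence per line."""
    reply = extractor(f"Extract high-quality sentences about {topic}:\n{document}")
    return [Instruction(topic, line.strip()) for line in reply.splitlines() if line.strip()]


def audit(candidates: List[Instruction], auditor: LLM, rules: str) -> List[Instruction]:
    """Quality assurance: keep only candidates that the prompted auditor marks PASS."""
    kept = []
    for cand in candidates:
        verdict = auditor(f"Rules:\n{rules}\n\nText:\n{cand.text}\n\nAnswer PASS or FAIL.")
        if verdict.strip().upper().startswith("PASS"):
            kept.append(cand)
    return kept


def run_pipeline(
    seeds: List[Instruction],
    documents: List[Tuple[str, str]],  # (topic, document text) pairs
    generator: LLM,
    extractor: LLM,
    auditor: LLM,
    rules: str,
    workers: int = 4,
) -> List[Instruction]:
    """Generate candidates in parallel, then filter them through the auditor."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        evolved = pool.map(lambda s: evolve(s, generator), seeds)
        extracted = pool.map(lambda d: read_literature(d[0], d[1], extractor), documents)
        candidates = [item for batch in list(evolved) + list(extracted) for item in batch]
    return audit(candidates, auditor, rules)


# Offline stubs so the sketch runs without any model API.
def stub_generator(prompt: str) -> str:
    return "Expanded: " + prompt


def stub_extractor(prompt: str) -> str:
    # Echo the document, i.e. everything after the first (instruction) line.
    return prompt.split("\n", 1)[1]


def stub_auditor(prompt: str) -> str:
    # Toy rule: accept texts of at least five words.
    text = prompt.split("Text:\n", 1)[1].split("\n\nAnswer", 1)[0]
    return "PASS" if len(text.split()) >= 5 else "FAIL"
```

In a real setting the three stubs would be replaced by calls to actual models: a general chat model for `generator`, a model fine-tuned for literature extraction for `extractor`, and a prompted model carrying the ocean science rules for `auditor`.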
We trained OceanGPT on instructions generated by the DoInstruct framework, starting from open-source base models such as Qwen, LLaMA, and MiniCPM.