Sources

OceanGPT adheres to the principles of openness and open source, promoting research on large ocean models through open instruction datasets and open-source models.

Models

OceanGPT-o-7B

OceanGPT-o-7B-v0.1 was trained on bilingual corpora in the marine domain based on Qwen2.5-VL-7B-Instruct.
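
A minimal inference sketch with Hugging Face Transformers is shown below. The hub ID and the sample image path are assumptions for illustration, not confirmed release names; check the release page for the actual identifiers.

```python
# Hedged sketch: asking OceanGPT-o-7B about a sonar image via transformers.
# "zjunlp/OceanGPT-o-7B-v0.1" and the image path are placeholders (assumptions).
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "zjunlp/OceanGPT-o-7B-v0.1"  # assumed hub ID; verify before use
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("sonar_example.png")  # any local sonar or ocean-science image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the objects visible in this sonar image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```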

OceanGPT-coder-7B

OceanGPT-coder-7B-v0.1 was trained on a self-constructed bilingual code corpus in the marine domain based on Qwen2.5-Coder-7B-Instruct.
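
Since this model targets MOOS code generation (see the limitations below), a usage sketch might look like the following; the hub ID is an assumption, and the prompt is only an example.

```python
# Hedged sketch: asking OceanGPT-coder-7B to draft a MOOS mission snippet.
# "zjunlp/OceanGPT-coder-7B-v0.1" is a placeholder hub ID (assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zjunlp/OceanGPT-coder-7B-v0.1"  # assumed hub ID; verify before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{
    "role": "user",
    "content": "Write a MOOS mission configuration that runs pLogger and pNodeReporter at 4 Hz.",
}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```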

OceanGPT-basic-v0.3

OceanGPT-basic-v0.3 was trained on knowledge-enhanced bilingual corpora in the marine domain based on Qwen. To be released.

OceanGPT-basic-14B-v0.1

OceanGPT-basic-14B-v0.1 was trained on marine-domain corpora based on Qwen1.5-14B. Note: this model is an early version and its performance no longer matches the latest models.

OceanGPT-basic-7B-v0.2

OceanGPT-basic-7B-v0.2 was trained on marine-domain corpora based on Qwen2. Note: this model is an early version and its performance no longer matches the latest models.

OceanGPT-basic-2B-v0.1

OceanGPT-basic-2B-v0.1 was trained on ocean-domain corpora based on MiniCPM-2B. Note: this model is an early version and its performance no longer matches the latest models.
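
The text-only OceanGPT-basic checkpoints can be queried like ordinary chat models; a minimal sketch applicable to any of them is below. The hub ID is a placeholder assumption, so substitute whichever checkpoint you actually download.

```python
# Hedged sketch: plain-text Q&A with an OceanGPT-basic checkpoint.
# "zjunlp/OceanGPT-basic-7B-v0.2" is a placeholder hub ID (assumption).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="zjunlp/OceanGPT-basic-7B-v0.2",  # assumed hub ID; verify before use
    torch_dtype="auto",
    device_map="auto",
)
messages = [{"role": "user", "content": "Which processes drive thermohaline circulation?"}]
result = generator(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # last turn is the model's reply
```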

Instruction Data

OceanInstruct-v0.2

Approximately 50K bilingual text instruction data for the marine science domain, constructed from publicly available corpora.

OceanInstruct-o

Approximately 50K Chinese-English bilingual multimodal instruction data for the marine domain, constructed from publicly available corpora.

OceanInstruct-v0.1

Approximately 10K bilingual text instruction data for the ocean domain, constructed from publicly available corpora. Note: this is only part of the instruction data used by the early models.
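
If the OceanInstruct releases are hosted on the Hugging Face Hub, they could be inspected with the datasets library as sketched below; the dataset ID, split name, and field layout are assumptions rather than confirmed release details.

```python
# Hedged sketch: downloading and inspecting an OceanInstruct release.
# "zjunlp/OceanInstruct-v0.2" and the "train" split are assumptions.
from datasets import load_dataset

dataset = load_dataset("zjunlp/OceanInstruct-v0.2", split="train")
print(dataset)      # prints the actual column names and number of records
print(dataset[0])   # inspect one instruction record
```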

Limitations

1. The model may hallucinate; please verify its outputs carefully.
2. Due to limited computing resources, OceanGPT-o currently only supports natural-language interpretation and generation for certain types of sonar and ocean-science images, and OceanGPT-coder currently only supports MOOS code generation.
3. We have not yet optimized the model's identity, so the generated identity information may resemble that of the Qwen, MiniCPM, LLaMA, or GPT series models.
4. The model's output is sensitive to the prompt, so repeated generations may produce inconsistent results.
5. Some instruction data was synthesized by large models and may contain errors.
