[AI-Ollama Advanced] Text Embeddings

2024-10-31


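¶Preface / Link archive

Reference tutorial
- 5. Text Embeddings (Embeddings) - Powered by MinDoc
This post is basically copy-paste plus my own notes; if it infringes on anything, please go easy on me orz
The reference tutorial's own reference:
- Embedding models · Ollama Blog
Ollama supports embedding models, which makes it possible to build retrieval-augmented generation (RAG) applications that combine text prompts with existing documents or other data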

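¶What is an embedding model

An embedding model is a model trained specifically to generate vector embeddings:
long arrays of numbers that represent the semantic meaning of a given text sequence

Honestly, I can't make sense of that

The generated vector embeddings are stored in a database
The database compares them to find data with similar meaning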
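
To make "comparing vectors" a bit more concrete, here is a minimal sketch using cosine similarity, one common way to measure how close two embeddings are. The three short vectors are made up for illustration (real models return hundreds of dimensions), and the database you use may rely on a different metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for this example only.
llama = [0.9, 0.1, 0.0]
alpaca = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

# Texts with similar meaning should score higher than unrelated ones.
print(cosine_similarity(llama, alpaca) > cosine_similarity(llama, car))  # True
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they have little in common.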

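¶Note

Once Ollama has been started, a tray icon appears on Windows, which means the Ollama service is already running
You don't need to run ollama run gemma to use embeddings; starting a large model is only needed for chat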

¶Embedding model examples

Model	Parameters	Link
shaw/dmeta-embedding-zh	400M	shaw/dmeta-embedding-zh
mxbai-embed-large	334M	mxbai-embed-large
nomic-embed-text	137M	nomic-embed-text
snowflake-arctic-embed	335M	snowflake-arctic-embed
all-minilm	23M	all-minilm

¶shaw/dmeta-embedding-zh

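Dmeta-embedding is a cross-domain, cross-task, out-of-the-box Chinese embedding model
It suits business scenarios such as search, Q&A, intelligent customer service, and LLM+RAG
It can be loaded for inference with tools such as Transformers, Sentence-Transformers, and Langchain

Hugging Face: https://huggingface.co/DMetaSoul/Dmeta-embedding-zh
Documentation: https://ollama.com/shaw/dmeta-embedding-zh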

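¶Advantages

Strong multi-task performance that generalizes well across scenarios
Ranked second on the MTEB Chinese leaderboard (as of 2024.01.25)
Model parameter size is only 400MB
Compared with models whose size runs past the GB level, this greatly reduces inference cost
Supports a context window length of up to 1024
Better suited to long-text retrieval, RAG, and similar scenarios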

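¶Four versions

dmeta-embedding-zh: shaw/dmeta-embedding-zh is a Chinese embedding model with only 400M parameters that suits a wide range of scenarios. It scores well on the MTEB benchmark and is especially suited to semantic retrieval, RAG, and other LLM applications.
dmeta-embedding-zh-q4: the Q4_K_M quantized version of shaw/dmeta-embedding-zh
dmeta-embedding-zh-small: shaw/dmeta-embedding-zh-small is a lighter model than shaw/dmeta-embedding-zh, with fewer than 300M parameters and 30% faster inference.
dmeta-embedding-zh-small-q4: the Q4_K_M quantized version of shaw/dmeta-embedding-zh-small

ollama pull shaw/dmeta-embedding-zh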

¶mxbai-embed-large

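As of March 2024, this model set the SOTA performance record for BERT-large-sized models on MTEB. It outperforms commercial models such as OpenAI's text-embedding-3-large and matches the performance of models 20 times its size
mxbai-embed-large was trained with no overlap with the MTEB data, which suggests the model generalizes well across domains, tasks, and text lengths.

Documentation: https://ollama.com/library/mxbai-embed-large

ollama pull mxbai-embed-large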

¶nomic-embed-text

nomic-embed-text is a text encoder with a large context length
It outperforms OpenAI's text-embedding-ada-002 and text-embedding-3-small
on both short-context and long-context tasks.

Documentation: https://ollama.com/library/nomic-embed-text

ollama pull nomic-embed-text

¶snowflake-arctic-embed

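snowflake-arctic-embed is a suite of text embedding models
focused on creating high-quality retrieval models optimized for performance

The models start from existing open-source text representation models (such as bert-base-uncased) and are trained in a multi-stage pipeline to optimize their retrieval performance

Documentation: https://ollama.com/library/snowflake-arctic-embed

ollama pull snowflake-arctic-embed

¶Five parameter sizes

snowflake-arctic-embed:335m (default)
snowflake-arctic-embed:137m
snowflake-arctic-embed:110m
snowflake-arctic-embed:33m
snowflake-arctic-embed:22m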

¶all-minilm

all-minilm is an embedding model trained on very large sentence-level datasets

Documentation: https://ollama.com/library/all-minilm

ollama pull all-minilm

¶Usage

¶Step 1: Pull a model

To generate vector embeddings, first pull a model (mxbai-embed-large is used as the example here):

ollama pull mxbai-embed-large

¶Step 2: Generate vector embeddings from the model

¶Option 1: REST API

curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "Llamas are members of the camelid family"
}'
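
If curl isn't handy, the same request can be sent from Python's standard library. This is only a sketch mirroring the curl call above: the helper names build_embedding_request and fetch_embedding are my own, and it assumes the service is listening on the default localhost:11434 and that the response carries the vector under the "embedding" key, as in the Python example later in this post.

```python
import json
import urllib.request

# Same endpoint as the curl call; assumes the default local Ollama address.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) the POST request for the embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_embedding(model, prompt):
    """Send the request; needs a running Ollama service with the model pulled."""
    request = build_embedding_request(model, prompt)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]
```

With the service running, fetch_embedding("mxbai-embed-large", "Llamas are members of the camelid family") should return the embedding as a list of numbers.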

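¶Option 2: Python library

import ollama

# returns a response whose "embedding" field holds the vector
response = ollama.embeddings(
  model='mxbai-embed-large',
  prompt='Llamas are members of the camelid family',
)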

¶Option 3: JavaScript library

import ollama from 'ollama'

const response = await ollama.embeddings({
    model: 'mxbai-embed-large',
    prompt: 'Llamas are members of the camelid family',
})

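¶Integrations with other popular embedding tools

Ollama also integrates with popular tools to support embedding workflows
For example: LangChain and LlamaIndex

¶Example

The original tutorial demonstrates:
how to build a retrieval-augmented generation (RAG) application with Ollama and an embedding model

¶Step 1: Generate embeddings

pip install ollama chromadb

¶Source code

Create a file named example.py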

import ollama
import chromadb

documents = [
  "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

# store each document in a vector embedding database
for i, d in enumerate(documents):
  response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
  embedding = response["embedding"]
  collection.add(
    ids=[str(i)],
    embeddings=[embedding],
    documents=[d]
  )

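¶Code walkthrough

Import modules

import ollama
import chromadb

ollama and chromadb are the two libraries this example relies on
ollama is the Python client for the Ollama service and provides the functions for generating text embeddings
chromadb is a vector database used to store and retrieve embedding data efficiently

Define the document list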

documents = [
  "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old",
]

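Defines a list of six facts about camelids, llamas in particular

Create the client and collection

client = chromadb.Client()
collection = client.create_collection(name="docs")

Use chromadb to create a client instance,
then use that client to create a collection named "docs".
This collection will hold the documents' embeddings

Store the document embeddings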

for i, d in enumerate(documents):
  response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
  embedding = response["embedding"]
  collection.add(
    ids=[str(i)],
    embeddings=[embedding],
    documents=[d]
  )

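For each document, first call the embeddings method from the ollama library to generate its embedding. The pretrained "mxbai-embed-large" model is used here
Next, add the generated embedding, its unique identifier (ID), and the original document text to the collection created earlier. This step turns the documents into a form that supports fast similarity search

¶Step 2: Retrieve

Add code that retrieves the most relevant document for an example prompt

¶Source code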

# an example prompt
prompt = "What animals are llamas related to?"

# generate an embedding for the prompt and retrieve the most relevant doc
response = ollama.embeddings(
  prompt=prompt,
  model="mxbai-embed-large"
)
results = collection.query(
  query_embeddings=[response["embedding"]],
  n_results=1
)
data = results['documents'][0][0]

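¶Code walkthrough

Example query

prompt = "What animals are llamas related to?"

Generate an embedding for the query and retrieve relevant documents

# generate an embedding for the query prompt
response = ollama.embeddings(
  prompt=prompt,
  model="mxbai-embed-large"
)

# use this embedding to query the collection created earlier for the most similar document
results = collection.query(
  query_embeddings=[response["embedding"]],
  n_results=1
)  # n_results=1 returns only the single most relevant document

# extract the document text from the query results
data = results['documents'][0][0]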
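
To get a feel for what a query like this does conceptually, here is a brute-force nearest-neighbour sketch in plain Python. It is not how Chroma is implemented (a real vector database uses its own index and distance metric); the top_k helper and the 2-dimensional vectors are invented purely for illustration, and Euclidean distance is just one possible choice of metric.

```python
import math

def top_k(query_embedding, stored, k=1):
    """Return the k documents whose embeddings are closest to the query.

    `stored` is a list of (document, embedding) pairs; closeness is
    measured with Euclidean distance, so smaller means more similar.
    """
    ranked = sorted(
        stored,
        key=lambda pair: math.dist(query_embedding, pair[1]),
    )
    return [document for document, _ in ranked[:k]]

# Hypothetical 2-dimensional embeddings, made up for this example only.
stored = [
    ("Llamas are members of the camelid family", [0.9, 0.1]),
    ("Llamas are vegetarians", [0.1, 0.9]),
    ("Llamas weigh between 280 and 450 pounds", [0.5, 0.5]),
]
query = [1.0, 0.0]

print(top_k(query, stored, k=1))  # ['Llamas are members of the camelid family']
```

Passing k=2 plays the same role as raising n_results in the query above: more candidates come back, ordered from most to least similar.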

¶Step 3: Generate

Use the prompt and the document retrieved in the previous step to generate an answer

¶Source code

# generate a response combining the prompt and data we retrieved in step 2
output = ollama.generate(
  model="llama2",
  prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)

print(output['response'])

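¶Code walkthrough

# use the generate method from the ollama library to
# produce a response to the original query based on the retrieved document
output = ollama.generate(
  model="llama2",  # the model used to generate the response
  prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)
print(output['response'])  # print the generated response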
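
If you retrieve more than one document, the prompt has to carry all of them. The sketch below shows one way to assemble such a prompt; build_rag_prompt is a hypothetical helper of my own (not part of ollama or chromadb), and the bullet-list layout is just one reasonable choice.

```python
def build_rag_prompt(question, retrieved_docs):
    """Combine the retrieved documents and the user's question into one prompt."""
    if not retrieved_docs:
        raise ValueError("no documents were retrieved")
    # One document per line so the model can tell them apart.
    data = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return f"Using this data:\n{data}\nRespond to this prompt: {question}"

docs = [
    "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
    "Llamas are vegetarians and have very efficient digestive systems",
]
print(build_rag_prompt("What animals are llamas related to?", docs))
```

With n_results=2 in the retrieval step, results['documents'][0] would be the list to pass as retrieved_docs, and the returned string would go into the prompt argument of ollama.generate.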

¶Step 4: Run the code

python example.py

Question asked: What animals are llamas related to?
Answer that Llama2 is expected to generate:

Llamas are members of the camelid family, which means they are closely related to two other animals: vicuñas and camels. All three species belong to the same evolutionary lineage and share many similarities in terms of their physical characteristics, behavior, and genetic makeup. Specifically, llamas are most closely related to vicuñas, with which they share a common ancestor that lived around 20-30 million years ago. Both llamas and vicuñas are members of the family Camelidae, while camels belong to a different family (Dromedary).