【AI-Snova】Data Extraction -- RTF extraction

### Preface

  • Examples of text extraction from RTF files with different packages.

  • RTF (Rich Text Format) files

    The structure of an RTF file is based on a markup language: a series of control words and groups that define the document's formatting. For example:

    ```
    {\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
    {\colortbl ;\red255\green0\blue0;}
    \viewkind4\uc1\pard\cf1\b\f0\fs24 This is a \ul rich text \ul0\b0\cf0 file.\par
    }
    ```
      

- `\rtf1`: marks this as an RTF file (version 1).
- `\ansi\ansicpg1252`: specifies the character encoding (ANSI, code page 1252).
- `\deff0`: sets the default font, and `{\fonttbl{\f0\fswiss\fcharset0 Arial;}}` defines the font table.
- `{\colortbl ;\red255\green0\blue0;}`: defines the color table.
- `\viewkind4\uc1\pard\cf1\b\f0\fs24`: begins the paragraph-formatting settings.
- `This is a \ul rich text \ul0\b0\cf0 file.`: the actual text content, interleaved with formatting control words.
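To see why dedicated extractors are worth using, here is a minimal, deliberately naive sketch that strips braces and control words from the sample above with a regex. It recovers the text but also leaks font-table residue (`Arial;;;`), which is exactly the kind of detail real parsers handle by tracking group nesting:

```python
import re

# The sample document from above, as a raw Python string.
SAMPLE_RTF = (
    r"{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fswiss\fcharset0 Arial;}}"
    r"{\colortbl ;\red255\green0\blue0;}"
    r"\viewkind4\uc1\pard\cf1\b\f0\fs24 This is a \ul rich text \ul0\b0\cf0 file.\par"
    r"}"
)

def naive_rtf_to_text(rtf: str) -> str:
    # Drop group braces, then strip control words: a backslash followed by
    # letters, an optional numeric parameter, and one optional trailing space.
    text = re.sub(r"[{}]", "", rtf)
    text = re.sub(r"\\[a-z]+-?\d* ?", "", text)
    return text.strip()

print(naive_rtf_to_text(SAMPLE_RTF))  # note the "Arial;;;" residue before the sentence
```

This is why the rest of the notebook relies on `unstructured`/pandoc instead of hand-rolled regexes.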

### Table of contents

- [Methods to load RTF files](https://github.com/sambanova/ai-starter-kit/blob/5ab65bce8fd5950390bee52a0a429eaff9d6c91e/data_extraction/notebooks/#toc1_)
- [Load from unstructured local RTF loader](https://github.com/sambanova/ai-starter-kit/blob/5ab65bce8fd5950390bee52a0a429eaff9d6c91e/data_extraction/notebooks/#toc1_1_)
- [Load from unstructured io API](https://github.com/sambanova/ai-starter-kit/blob/5ab65bce8fd5950390bee52a0a429eaff9d6c91e/data_extraction/notebooks/#toc1_2_)
- [Evaluate loaded docs by embedding similarity](https://github.com/sambanova/ai-starter-kit/blob/5ab65bce8fd5950390bee52a0a429eaff9d6c91e/data_extraction/notebooks/#toc2_)
- [Embedding & Storage](https://github.com/sambanova/ai-starter-kit/blob/5ab65bce8fd5950390bee52a0a429eaff9d6c91e/data_extraction/notebooks/#toc2_1_)
- [Similarity search](https://github.com/sambanova/ai-starter-kit/blob/5ab65bce8fd5950390bee52a0a429eaff9d6c91e/data_extraction/notebooks/#toc2_2_)



### Dependency imports

```python
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

import glob
import pandas as pd
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm.autonotebook import trange
```
### Methods to load RTF files

```python
folder_loc = os.path.join(kit_dir, 'data/sample_data/sample_files/')
rtf_files = list(glob.glob(f'{folder_loc}/*.rtf'))
file_path = rtf_files[0]
```

### Load the text splitter

```python
text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, just to make splitting evident.
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
    separators=["\n\n\n", "\n\n", "\n", "."],
)
```
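To make the `chunk_size`/`chunk_overlap` interaction concrete, here is a simplified sliding-window sketch. It is not LangChain's actual algorithm (which additionally prefers to cut at the configured separators), but it shows how consecutive chunks share a tail:

```python
def split_with_overlap(text: str, chunk_size: int = 200, chunk_overlap: int = 20):
    # Simplified sliding window: each chunk starts (chunk_size - chunk_overlap)
    # characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("x" * 500)
print([len(c) for c in chunks])  # → [200, 200, 140]
```

The last 20 characters of each chunk reappear at the start of the next, which helps a sentence cut at a chunk boundary remain searchable.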

### Load from unstructured local RTF loader

  • To use pypandoc, pandoc must be installed -> https://pandoc.org/installing.html
```python
from langchain.document_loaders import UnstructuredRTFLoader

loader = UnstructuredRTFLoader(file_path, mode="elements")
docs_unstructured_local = loader.load_and_split(text_splitter=text_splitter)
for doc in docs_unstructured_local:
    print(f'{doc.page_content}\n---')
```
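Since the local RTF path shells out to pandoc via pypandoc, a missing binary only fails at load time. A small preflight check can surface that earlier (a sketch; `ensure_pandoc` and its error message are my own, not part of the kit):

```python
import shutil

def ensure_pandoc() -> str:
    # Look up the pandoc binary on PATH before invoking the RTF loader.
    path = shutil.which("pandoc")
    if path is None:
        raise RuntimeError(
            "pandoc not found on PATH; see https://pandoc.org/installing.html"
        )
    return path
```

Calling `ensure_pandoc()` before `loader.load_and_split(...)` turns a confusing mid-pipeline failure into an explicit error.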

### Load from unstructured io API

```python
from langchain.document_loaders import UnstructuredAPIFileLoader

# Register at Unstructured.io to get a free API key.
load_dotenv(os.path.join(repo_dir, '.env'))

loader = UnstructuredAPIFileLoader(
    file_path,
    mode="elements",
    api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
    url=os.environ.get("UNSTRUCTURED_URL"),
)
docs_unstructured_api = loader.load_and_split(text_splitter=text_splitter)
for doc in docs_unstructured_api:
    print(f'{doc.page_content}\n---')
```
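The loader above reads its credentials from the `.env` file at the repo root via `load_dotenv`. A minimal sketch of that file (the key is a placeholder, and the URL is only what the hosted endpoint has commonly been; verify both against your Unstructured.io account):

```
UNSTRUCTURED_API_KEY=your-api-key-here
UNSTRUCTURED_URL=https://api.unstructured.io/general/v0/general
```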

### Evaluate loaded docs by embedding similarity

### Embedding & Storage

```python
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings(
    model_name='intfloat/e5-large-v2',
    embed_instruction="",  # no instructions needed for candidate passages
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs,
)
vectorstore_unstructured_local = FAISS.from_documents(documents=docs_unstructured_local, embedding=embd_model)
vectorstore_unstructured_api = FAISS.from_documents(documents=docs_unstructured_api, embedding=embd_model)
```
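`normalize_embeddings=True` matters here: once vectors have unit length, their dot product equals cosine similarity, and L2 distance becomes a monotonic function of it (‖a−b‖² = 2 − 2·cos), so FAISS ranks by cosine either way. A quick check in plain Python:

```python
import math

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# With unit vectors, the plain dot product reproduces cosine similarity.
print(abs(dot(normalize(a), normalize(b)) - cosine) < 1e-12)  # → True
```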

### Similarity search

```python
query = "how many columns are?"  # the query to search for

# Run a similarity search against the locally loaded documents;
# returns a list of matching chunks.
ans = vectorstore_unstructured_local.similarity_search(query)
print("-------Unstructured local Loader----------\n")
print(ans[0].page_content)

# Run the same search against the API-loaded documents.
ans_2 = vectorstore_unstructured_api.similarity_search(query)
print("--------Unstructured api loader------------\n")
print(ans_2[0].page_content)
```
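Besides eyeballing the top hits from the two loaders, a cheap lexical metric can sanity-check how close their extractions are. A sketch using word-level Jaccard similarity (the two strings are hypothetical stand-ins for the loaders' outputs):

```python
def jaccard(a: str, b: str) -> float:
    # Word-level Jaccard similarity: |intersection| / |union| of the vocabularies.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

print(round(jaccard("This is a rich text file", "this is a plain text file"), 2))  # → 0.71
```

A score near 1.0 means the local and API loaders extracted essentially the same text; it complements the embedding-similarity comparison rather than replacing it.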
  • Copyright © 2024-2025 brocademaple