[AI-Snova] Data Extraction -- HTML Extraction

Preface

  • examples of text extraction from web pages (HTML) with different packages


Importing dependencies

import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

import pandas as pd
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm.autonotebook import trange

Loading from web pages

urls = [
    "https://en.wikipedia.org/wiki/Unstructured_data",
    "https://unstructured-io.github.io/unstructured/introduction.html",
]

Loading the text splitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, just to make splitting evident.
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
    separators=["\n\n\n", "\n\n", "\n", "."],
)
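As a rough illustration of what the recursive splitter does (a simplified sketch, NOT LangChain's actual implementation, and omitting chunk_overlap): it tries the separators in priority order and keeps recursing on any piece that is still longer than chunk_size.

```python
def recursive_split(text, separators, chunk_size):
    """Toy sketch of recursive character splitting (not LangChain's
    real code): split on the highest-priority separator, then recurse
    with the remaining separators on pieces still over chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    out = []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            out.append(piece)
        else:
            out.extend(recursive_split(piece, rest, chunk_size))
    return out

chunks = recursive_split(
    "para one\n\npara two\n\na much longer third paragraph",
    ["\n\n", " "],
    12,
)
print(chunks)
```

Short paragraphs survive whole, while the over-long third paragraph falls through to the lower-priority space separator, which is why the separator list above is ordered from coarse (`"\n\n\n"`) to fine (`"."`).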

Loading with the AsyncHtmlLoader

from langchain.document_loaders import AsyncHtmlLoader

loader = AsyncHtmlLoader(urls, verify_ssl=False)
docs = loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---')

Cleaning the AsyncHtmlLoader's HTML output with the Html2Text transformer

from langchain.document_transformers import Html2TextTransformer

html2text_transformer = Html2TextTransformer()
docs = html2text_transformer.transform_documents(documents=docs)
for doc in docs:
    print(f'{doc.page_content}\n---')
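The transformer's job is to turn raw HTML into readable plain text. A stdlib-only sketch of the core tag-stripping step (the real html2text package additionally renders links, headings, and lists as Markdown):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script>/<style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

parser = TextExtractor()
parser.feed("<html><body><h1>Unstructured data</h1>"
            "<script>var x = 1;</script><p>has no fixed schema</p></body></html>")
text = " ".join(parser.parts)
print(text)
```

Script and style content is dropped entirely, which is the main thing that makes the transformed `page_content` suitable for chunking and embedding.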

Splitting the Html2Text transformer's output with the RecursiveCharacterTextSplitter

docs = text_splitter.split_documents(docs)
for doc in docs:
    print(f'{doc.page_content}\n---')

Evaluating the loaded docs via embedding similarity

Embedding and storing

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings(
    model_name='intfloat/e5-large-v2',
    embed_instruction="",  # no instructions needed for candidate passages
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs,
)
vectorstore = FAISS.from_documents(documents=docs, embedding=embd_model)

Similarity search

query = "how does unstructured deal with ambiguities?"

ans = vectorstore.similarity_search(query)
print("-------Async local Loader + html2text transformer----------\n")
print(ans[0].page_content)
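Note why `normalize_embeddings=True` was set above: with unit-length vectors, the inner product equals cosine similarity, so the similarity search amounts to ranking documents by dot product with the query vector. A stdlib-only sketch of that ranking (toy 3-d vectors, not real e5 embeddings; `top_k` is a hypothetical helper, not a FAISS or LangChain API):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=1):
    """Rank documents by inner product with the query (equal to cosine
    similarity here, since every vector is normalized). Toy brute-force
    sketch of what the vector store does, not FAISS itself."""
    scored = sorted(doc_vecs.items(), key=lambda kv: -dot(query_vec, kv[1]))
    return [name for name, _ in scored[:k]]

doc_vecs = {
    "same direction": normalize([1.0, 2.0, 2.0]),   # cosine 1.0 with q
    "orthogonal": normalize([2.0, -1.0, 0.0]),      # cosine 0.0 with q
}
q = normalize([1.0, 2.0, 2.0])
best = top_k(q, doc_vecs, k=1)
print(best)
```

A real FAISS index replaces this O(n) scan with an approximate nearest-neighbor structure, but the ranking criterion is the same.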
  • Copyrights © 2024-2025 brocademaple