[AI-Assistant] AI Morning Briefing Bot [Unfinished]

2024-10-18

notes / AI notes


¶Preface / Link archive

"20 group chats asked me about my AI morning briefing; here is how I make it." (qq.com)

AI Morning Briefing Bot

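¶Exploring data sources and scraping tools

AIbase - Letting more people see the road to AGI

Finding:
The page structure is fairly regular, and the news article links are numbered with incrementing IDs → well suited to scraping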
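Because the article links differ only by a numeric ID, candidate URLs can be generated from an ID range. The following is a minimal sketch, assuming the URL pattern `https://www.aibase.com/zh/news/<id>` taken from the example article URL used later in these notes. The function name and the ID range are illustrative, and not every ID is guaranteed to map to a published article.

```python
# Assumed URL pattern, based on the example article URL used later in these notes.
AIBASE_NEWS_URL = "https://www.aibase.com/zh/news/{article_id}"


def build_article_urls(start_id: int, count: int) -> list[str]:
    """Return `count` candidate article URLs with consecutive IDs starting at `start_id`."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return [
        AIBASE_NEWS_URL.format(article_id=article_id)
        for article_id in range(start_id, start_id + count)
    ]


if __name__ == "__main__":
    # Print three consecutive candidate URLs (hypothetical ID range).
    for url in build_article_urls(12386, 3):
        print(url)
```

Each generated URL still has to be crawled and checked, since a gap in the numbering would simply produce a failed or empty page.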

A useful tool I found: Crawl4AI

Crawl4AI: an open-source, LLM-friendly web crawler and scraper

Repository: unclecode/crawl4ai: 🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper (github.com)

¶Crawl4AI usage guide

¶Step 1: Installation

Based on the README of the original repository and answers from ChatGPT

Basic installation

pip install crawl4ai

Installing the synchronous version

pip install "crawl4ai[sync]"

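Note: when you install Crawl4AI, the setup script should install and configure Playwright automatically. If you still run into Playwright-related errors, you can install it manually using one of the methods below.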

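From the command line:

playwright install

or, if the playwright command itself is missing, install the Python package first and then rerun the command above:

pip install playwright

This makes sure Playwright is installed correctly.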
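A quick way to confirm the installation from Python is to check that both packages can be found by the import system. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is illustrative. It only checks the Python packages and does not verify that the Playwright browser binaries were downloaded.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["crawl4ai", "playwright"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("crawl4ai and playwright are importable")
```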

¶Step 2: Quick start

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Replace url with the article detail page URL of the source you want to scrape.
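To fetch more than one article in a single browser session, the same call can be repeated for a list of URLs. The helper below is a hedged sketch: `crawl_many` is a hypothetical name, and it relies only on `crawler.arun(url=...)`, `result.success`, and `result.markdown`, which are the calls and attributes that appear in the examples in these notes.

```python
async def crawl_many(crawler, urls):
    """Crawl each URL in turn; return {url: markdown} for the pages that succeeded."""
    pages = {}
    for url in urls:
        result = await crawler.arun(url=url)
        if result.success:
            pages[url] = result.markdown
    return pages


# Usage, inside the quick-start `async with AsyncWebCrawler(verbose=True) as crawler:` block:
#     pages = await crawl_many(crawler, ["https://www.aibase.com/zh/news/12386"])
```

Crawling sequentially keeps the load on the source site low; failed pages are skipped rather than raising, so the caller can compare the returned keys against the input list to see what was missed.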

¶Step 2.5: Advanced usage

¶Step 2.6: Running JavaScript and using CSS selectors


import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

¶Step 2.7: Using a proxy


import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
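Hard-coding the proxy address makes the script awkward to share between machines. One possible approach, sketched below, is to read the address from an environment variable and pass `proxy` only when it is set. The variable name `CRAWL_PROXY` and the helper `crawler_kwargs` are assumptions made for illustration; the `verbose` and `proxy` keyword arguments are the ones used in the example above.

```python
import os


def crawler_kwargs(env=os.environ):
    """Build keyword arguments for AsyncWebCrawler, adding `proxy` only when configured."""
    kwargs = {"verbose": True}
    proxy = env.get("CRAWL_PROXY", "").strip()  # hypothetical variable name
    if proxy:
        kwargs["proxy"] = proxy
    return kwargs


if __name__ == "__main__":
    # Intended use: AsyncWebCrawler(**crawler_kwargs())
    print(crawler_kwargs({"CRAWL_PROXY": "http://127.0.0.1:7890"}))
```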

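¶Step 3: The code used in the article to scrape AIbase news articles

Goal: the title on its own line, the time on its own line, and the content on its own line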

import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_ai_news_article():
    print("\n--- Extracting AIbase news article data with JsonCssExtractionStrategy ---")

    # Define the extraction schema
    schema = {
        "name": "AIbase News Article",
        "baseSelector": "div.pb-32",  # CSS selector for the main container
        "fields": [
            {
                "name": "title",
                "selector": "h1",
                "type": "text",
            },
            {
                "name": "publication_date",
                "selector": "div.flex.flex-col > div.flex.flex-wrap > span:nth-child(6)",
                "type": "text",
            },
            {
                "name": "content",
                "selector": "div.post-content",
                "type": "text",
            },
        ],
    }

    # Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # Crawl the page with AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.aibase.com/zh/news/12386",  # Replace with the actual target URL
            extraction_strategy=extraction_strategy,
            bypass_cache=True,  # Skip the cache so the latest content is fetched
        )

        if not result.success:
            print("Failed to crawl the page")
            return

        # Parse the extracted content
        extracted_data = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(extracted_data)} record(s)")
        print(json.dumps(extracted_data, indent=2, ensure_ascii=False))

    return extracted_data

# Run the async function
if __name__ == "__main__":
    asyncio.run(extract_ai_news_article())

For the news article at the given link, this code scrapes:
the headline
the publication date
the article body
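The script above prints the extracted records as JSON, while the stated goal is one line each for title, time, and content. A minimal sketch of that last formatting step follows, assuming each record is a dict keyed by the schema's field names (`title`, `publication_date`, `content`). The helper name `format_article` and the sample values are made up for illustration.

```python
import json


def format_article(record):
    """Render one extracted record as title, date, and content on separate lines."""
    fields = ("title", "publication_date", "content")
    # Collapse internal whitespace so each field stays on a single line.
    return "\n".join(" ".join(str(record.get(name, "")).split()) for name in fields)


if __name__ == "__main__":
    # Sample data shaped like the extractor's JSON output (a list of records).
    sample = json.loads(
        '[{"title": "Example title", "publication_date": "Oct 18, 2024", '
        '"content": "First paragraph.\\nSecond paragraph."}]'
    )
    for record in sample:
        print(format_article(record))
```

Missing fields come out as empty lines, so the three-line layout is preserved even when a selector fails to match.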

Result of a successful script run

Scraping an AIbase news article

¶Step 4: A few tech news sites

TechCrunch | Startup and Technology News
AIbase - Letting more people see the road to AGI
To be continued...

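¶Follow-up

It turns out the author of the original article never published the code for the later steps, compiling the briefing and automating it...
So I am pausing here for now!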