03Rag/12使用递归字符文本分割器分割.py
heyong.fu a17c65c4bc feat: rag
2026-05-06 11:35:10 +08:00

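# RecursiveCharacterTextSplitter is the most commonly used text splitter in LangChain.
# It implements a splitting strategy based on text structure and is the recommended
# default for most use cases.
# Why is it recommended?
# - It strikes a good balance between preserving context and keeping chunk sizes manageable.
# - It works out of the box: the default configuration performs well.
# - Parameters only need adjusting when fine-tuning for a specific application.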
from langchain_text_splitters import RecursiveCharacterTextSplitter
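# Create a recursive text splitter with explicit parameters.
# chunk_size=100: each chunk may contain at most 100 characters.
# chunk_overlap=0: adjacent chunks share no characters (an overlap of 0).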
text_splitters = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
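
# Illustration only: a standard-library sketch of what chunk_size and chunk_overlap
# mean for a plain fixed-width window. sliding_chunks_sketch is a hypothetical helper,
# not a LangChain API, and it ignores text structure entirely.
def sliding_chunks_sketch(text, chunk_size, chunk_overlap):
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# sliding_chunks_sketch("abcdefghij", 4, 2) -> ['abcd', 'cdef', 'efgh', 'ghij']
# sliding_chunks_sketch("abcdefghij", 4, 0) -> ['abcd', 'efgh', 'ij']
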
document = f"""{"1"*100}\n{"2"*99}\n\n{"3"*99}\n{"4"*99}"""
# Use the splitter's split_text method to split document into a list of string chunks.
texts = text_splitters.split_text(document)
print(f"Split into {len(texts)} chunks")
for i, text in enumerate(texts):
    print(f"\n{i} ({len(text)} chars) {repr(text)}")
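
# Illustration only: a simplified, standard-library sketch of the idea behind recursive
# splitting. It is not LangChain's implementation, and recursive_split_sketch is a
# hypothetical helper name. It tries the coarsest separator first ("\n\n") and falls
# back to finer ones ("\n", " ", then a hard cut) only for pieces that are still longer
# than chunk_size. Unlike the real splitter, it drops the separators and never merges
# small neighbouring pieces back together.
def recursive_split_sketch(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        # Pieces that already fit are kept; oversized ones recurse with finer separators.
        chunks.extend(recursive_split_sketch(piece, chunk_size, finer))
    return chunks

# Same shape as the document above: four runs separated by "\n" and "\n\n".
sample = "1" * 100 + "\n" + "2" * 99 + "\n\n" + "3" * 99 + "\n" + "4" * 99
sketch_chunks = recursive_split_sketch(sample, 100)
print([len(chunk) for chunk in sketch_chunks])  # [100, 99, 99, 99]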