#19 Crawling web pages related to a question in ChatGPT

Terry Cho 2024. 2. 21. 15:49

조대협 (http://bcho.tistory.com)

LangChain lets you write your own Tools for an Agent and add them with little effort. In this example, we use the DuckDuckGo search tool to find websites related to a question, crawl one of those sites to read its page content, and then summarize it.

To do this, we build a web page crawling tool with BeautifulSoup.

There are several ways to define a custom tool. This example looks at two of them: the decorator and StructuredTool.

First, the decorator approach.

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:90.0) Gecko/20100101 Firefox/90.0'
}

def parse_html(content) -> str:
    # Drop the HTML tags and keep only the first 3,000 characters of text
    soup = BeautifulSoup(content, 'html.parser')
    text_content_with_links = soup.get_text()[:3000]
    return text_content_with_links

@tool
def web_fetch_tool(url:str) -> str:
    """Useful to fetches the contents of a web page"""
    # If a list of URLs is passed in, use only the first one
    if isinstance(url,list):
        url = url[0]
    print("Fetch_web_page URL :",url)
    response = requests.get(url, headers=HEADERS)
    return parse_html(response.content)

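We created a tool named web_fetch_tool. To turn a function into a tool, declare the @tool decorator on it. The data types of the input and the return value must be specified, because the parameter names and types in the function declaration are used as the tool's input and output information.

If you add a docstring in triple quotes (""") on the first line of the function, that docstring becomes the tool's description.

So in the example above, the tool's information is defined as follows.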

Tool name : web_fetch_tool

Tool description : web_fetch_tool(url: str) -> str - Useful to fetches the contents of a web page

Tool argument : {'url': {'title': 'Url', 'type': 'string'}}
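To make the mapping from function to tool metadata concrete, here is a rough sketch that rebuilds the three values above using only the standard library's inspect module. It is not LangChain's implementation: describe_tool and JSON_TYPES are hypothetical names introduced for illustration, and the type mapping covers only a few simple types.

```python
import inspect

# Hypothetical mapping from Python annotations to JSON-schema-style type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(func) -> dict:
    """Derive a name, description, and argument schema from a plain function."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    args = {
        name: {
            "title": name.replace("_", " ").title(),
            "type": JSON_TYPES.get(param.annotation, "object"),
        }
        for name, param in signature.parameters.items()
    }
    return {
        "name": func.__name__,
        # Signature followed by the docstring, like the description shown above
        "description": f"{func.__name__}{signature} - {doc}",
        "args": args,
    }


def web_fetch_tool(url: str) -> str:
    """Useful to fetches the contents of a web page"""
    return ""  # body omitted; only the signature and docstring matter here


info = describe_tool(web_fetch_tool)
print("Tool name :", info["name"])
print("Tool description :", info["description"])
print("Tool argument :", info["args"])
```

The sketch shows why the type hints and the docstring matter: without them there would be nothing to build the argument schema or the description from.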

web_fetch_tool takes a url as its argument and crawls the web page at that URL with requests.get(url). The HTTP headers used for the crawl are stored in the HEADERS variable and passed along with the request.
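The request in the tool has no timeout or error handling, so a slow or failing site can stall the agent or raise an exception. The sketch below shows one way to add both. To stay dependency-free it uses the standard library's urllib rather than requests, and fetch_raw and its error-message format are assumptions, not part of the original code. With requests, the comparable change would be passing timeout=10 to requests.get and calling response.raise_for_status().

```python
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:90.0) "
                  "Gecko/20100101 Firefox/90.0"
}


def fetch_raw(url, timeout: float = 10.0) -> str:
    """Return the page body as text, or a readable error message on failure."""
    if isinstance(url, list):  # same guard as the tool in the article
        url = url[0]
    try:
        request = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and timeouts are OSError subclasses;
        # a malformed URL raises ValueError.
        return f"Fetch failed for {url}: {exc}"


# A data: URL keeps the demo offline; a real call would pass an http(s) URL.
print(fetch_raw("data:text/html,%3Cp%3Ehello%3C%2Fp%3E"))
```

Returning the error as text instead of raising is a design choice for agent use: the model sees the failure as an observation and can try another URL.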

To drop the HTML tags from the crawled page and keep only the text, parse_html uses BeautifulSoup's HTML parser to extract the text portion and returns the first 3,000 characters of it.

Besides the decorator, you can also use StructuredTool. The code below registers the fetch_web_page function as a tool with StructuredTool. Set func to the function to register, name to the tool's name, and description to a description of the tool.
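As a small illustration of the tag-stripping step, the following sketch does a similar extraction with the standard library's html.parser instead of BeautifulSoup. It is a swapped-in technique for demonstration only: TextExtractor and html_to_text are hypothetical names, and skipping script and style content is an assumption about what is useful to feed the summarizer, not something the original code does explicitly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str, limit: int = 3000) -> str:
    """Strip tags and truncate, mirroring the 3,000-character cut in parse_html."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:limit]


sample = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```

The truncation keeps the text passed to the model bounded, which matters because a full page can easily exceed the model's context window.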

def fetch_web_page(url:str) -> str:
    # If a list of URLs is passed in, use only the first one
    if isinstance(url,list):
        url = url[0]
    print("Fetch_web_page URL :",url)
    response = requests.get(url, headers=HEADERS)
    return parse_html(response.content)

web_fetch_tool = StructuredTool.from_function(
    func=fetch_web_page,
    name="WebFetcher",
    description="Useful to fetches the contents of a web page"
)

Now that we have a tool that crawls a given web page URL, let's build the whole application.

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain.tools import Tool, DuckDuckGoSearchResults
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import BaseTool, StructuredTool, tool
import os

os.environ["LANGCHAIN_TRACING_V2"]="true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"]="{YOUR_LANGCHAIN_APIKEY}"
os.environ["LANGCHAIN_PROJECT"]="{YOUR_LANGCHAIN_PROJECT}"
os.environ["OPENAI_API_KEY"] = "{YOUR_OPENAI_KEY}"

model = ChatOpenAI(model="gpt-3.5-turbo-16k")
ddg_search = DuckDuckGoSearchResults()

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:90.0) Gecko/20100101 Firefox/90.0'
}

def parse_html(content) -> str:
    soup = BeautifulSoup(content, 'html.parser')
    text_content_with_links = soup.get_text()[:3000]
    return text_content_with_links

def fetch_web_page(url:str) -> str:
    if isinstance(url,list):
        url = url[0]
    print("Fetch_web_page URL :",url)
    response = requests.get(url, headers=HEADERS)
    return parse_html(response.content)

web_fetch_tool = StructuredTool.from_function(
    func=fetch_web_page,
    name="WebFetcher",
    description="Useful to fetches the contents of a web page"
)

summarization_chain = LLMChain(
    llm=model,
    prompt=PromptTemplate.from_template("Summarize the following content: {content}")
)

summarize_tool = Tool.from_function(
    func=summarization_chain.run,
    name="Summarizer",
    description="Useful to summarizes a web page"
)

tools = [ddg_search, web_fetch_tool, summarize_tool]

template = '''Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}'''

prompt = PromptTemplate.from_template(template)
agent = create_react_agent(model,tools,prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=True,
    handle_parsing_errors=True,
)

# Adjacent string literals are concatenated, so each sentence keeps its trailing space
question = (
    "Tell me about best Korean restaurant in Seoul. "
    "Use search tool to find the information. "
    "To get the details, please fetch the contents from the web sites. "
    "Summarize the details in 1000 words."
)

print(agent_executor.invoke({"input":question}))

In this example, DuckDuckGo search looks up the required information, the URL of a page it finds is passed to web_fetch_tool when needed to extract the body text from that URL, and summarize_tool then produces the summarized output.

First, register the DuckDuckGo search tool. https://duckduckgo.com/ is a search engine like Google that does not collect user information and emphasizes privacy protection. DuckDuckGoSearchResults() returns the URL of each search result along with its text, which suits this example's scenario of crawling the full content of a specific page, so that is the tool used here.

ddg_search = DuckDuckGoSearchResults()
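Because the search tool returns URLs inside a text result, something has to pick a URL out of that text before the page can be fetched. In the agent, the model does this itself; the sketch below shows the same step done by hand. The sample string assumes a result format of "[snippet: ..., title: ..., link: ...]" entries, which may differ between library versions, and extract_links is a hypothetical helper.

```python
import re

# Assumed shape of each entry: "[snippet: ..., title: ..., link: https://...]"
LINK_PATTERN = re.compile(r"link:\s*(https?://[^\s\],]+)")


def extract_links(search_output: str) -> list:
    """Return the distinct URLs found in the search output, in order."""
    links = []
    for url in LINK_PATTERN.findall(search_output):
        if url not in links:
            links.append(url)
    return links


sample_output = (
    "[snippet: Best places to eat, title: Seoul Food Guide, "
    "link: https://example.com/seoul-food], "
    "[snippet: Top ten picks, title: Korean BBQ, "
    "link: https://example.org/bbq]"
)
print(extract_links(sample_output))
```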

Next, let's write the tool that summarizes the search results.

Create an LLMChain named summarization_chain as shown below. The chain's template is "Summarize the following content: {content}", so it summarizes the text it is given.

summarization_chain = LLMChain(
    llm=model,
    prompt=PromptTemplate.from_template("Summarize the following content: {content}")
)

summarize_tool = Tool.from_function(
    func=summarization_chain.run,
    name="Summarizer",
    description="Useful to summarizes a web page"
)

The LLMChain is then registered as a tool with Tool.from_function, as in the full listing.

Once all three tools are created and registered in the tools list, create the agent and then pass the agent and the tools to AgentExecutor.

tools = [ddg_search, web_fetch_tool, summarize_tool]

# template is the ReAct prompt string defined in the full listing
prompt = PromptTemplate.from_template(template)
agent = create_react_agent(model,tools,prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=True,
    handle_parsing_errors=True,
)
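To give a feel for what the executor does with the agent and the tools, here is a simplified sketch of a ReAct-style loop: ask the model, parse the Action and Action Input, run the matching tool, append the observation to the scratchpad, and repeat until a Final Answer appears. It is not LangChain's implementation. run_react_loop, scripted_llm, the stub tools, and the wording of the error observation are all assumptions made for illustration, and the model is replaced by scripted replies so the example runs offline.

```python
import re
from typing import Callable, Dict, List

ACTION_RE = re.compile(r"Action\s*:\s*(.+?)\s*\nAction Input\s*:\s*(.+)", re.DOTALL)


def run_react_loop(llm: Callable[[str], str],
                   tools: Dict[str, Callable[[str], str]],
                   question: str,
                   max_steps: int = 5) -> str:
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(f"Question: {question}\nThought:{scratchpad}")
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None or match.group(1) not in tools:
            # Feed the problem back to the model instead of raising,
            # similar in spirit to handle_parsing_errors=True
            observation = "Invalid or incomplete response"
        else:
            observation = tools[match.group(1)](match.group(2).strip())
        scratchpad += f"{reply}\nObservation: {observation}\nThought:"
    return "Stopped: step limit reached"


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a chat model: returns the given replies one by one."""
    iterator = iter(replies)
    return lambda prompt: next(iterator)


calls = []
stub_tools = {
    "Search": lambda q: calls.append(("Search", q)) or "link: https://example.com/guide",
    "WebFetcher": lambda u: calls.append(("WebFetcher", u)) or "25 must-eat restaurants",
}
replies = [
    " I should search first.\nAction: Search\nAction Input: best Korean restaurant in Seoul",
    " Now fetch the page.\nAction: WebFetcher\nAction Input: https://example.com/guide",
    " I now know the final answer\nFinal Answer: See the 25-restaurant guide.",
]
answer = run_react_loop(scripted_llm(replies), stub_tools, "Best Korean restaurant in Seoul?")
print(answer)
```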

With the agent and agent_executor ready, let's call agent_executor.

# Adjacent string literals are concatenated, so each sentence keeps its trailing space
question = (
    "Tell me about best Korean restaurant in Seoul. "
    "Use search tool to find the information. "
    "To get the details, please fetch the contents from the web sites. "
    "Summarize the details in 1000 words."
)

print(agent_executor.invoke({"input":question}))
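Printing the whole result is hard to read once return_intermediate_steps=True is set, since the returned dictionary then carries the list of (action, observation) pairs next to the final output. The sketch below formats such a result step by step. It runs on a mock dictionary rather than a real agent call: StepAction is a stand-in with only the tool and tool_input fields, format_run is a hypothetical helper, and the sample values are invented for the demo.

```python
from typing import Any, Dict, List, NamedTuple


class StepAction(NamedTuple):
    """Stand-in for the agent's action object; only the fields used below."""
    tool: str
    tool_input: str


def format_run(result: Dict[str, Any], preview: int = 60) -> List[str]:
    """Turn the result dictionary into one readable line per step."""
    lines = []
    steps = result.get("intermediate_steps", [])
    for number, (action, observation) in enumerate(steps, start=1):
        shown = str(observation)[:preview]  # keep long page text short
        lines.append(f"{number}. {action.tool}({action.tool_input!r}) -> {shown}")
    lines.append(f"Final: {result['output']}")
    return lines


mock_result = {
    "input": "Tell me about best Korean restaurant in Seoul.",
    "output": "See the 25-restaurant guide.",
    "intermediate_steps": [
        (StepAction("WebFetcher", "https://example.com/guide"), "25 must-eat restaurants ..."),
    ],
}
for line in format_run(mock_result):
    print(line)
```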

Because verbose=True is set on agent_executor, the intermediate steps can be seen on the console, and the run proceeds roughly in the following order. Even without verbose=True, the run can be monitored through LangSmith.

Question	Tell me about best Korean restaurant in Seoul. Use search tool to find the information. To get the details, please fetch the contents from the web sites. Summarize the details in 1000 words.
Thought	I need to use the DuckDuckGo Results JSON tool to search for the best Korean restaurant in Seoul. Then, I can use the web_fetch_tool to fetch the contents of the websites that provide information about the restaurant. Finally, I can use the Summarizer tool to summarize the details in 1000 words.
Action	DuckDuckGo Results JSON
Action Input	best Korean restaurant in Seoul
Thought	I have fetched the contents of the website "https://www.willflyforfood.net/seoul-food-guide-25-must-eat-restaurants-in-seoul-south-korea/". Now I need to use the Summarizer tool to summarize the details in 1000 words.
Action	Summarizer
Action Input	Contents of the website "https://www.willflyforfood.net/seoul-food-guide-25-must-eat-restaurants-in-seoul-south-korea/" (the web page content read from the website)
Observation	The website "https://www.willflyforfood.net/seoul-food-guide-25-must-eat-restaurants-in-seoul-south-korea/" provides a comprehensive food guide for Seoul, South Korea. It lists 25 must-eat restaurants in the city and offers detailed information about each establishment, including the type of cuisine, specialties, location, and price range. This guide aims to help visitors and locals alike discover the best dining experiences in Seoul.
Thought	I now know the final answer.
Final Answer (result)	The best Korean restaurants in Seoul can be found by referring to the comprehensive food guide provided by "https://www.willflyforfood.net/seoul-food-guide-25-must-eat-restaurants-in-seoul-south-korea/". This guide lists 25 must-eat restaurants in Seoul, offering detailed information about each establishment, including the type of cuisine, specialties, location, and price range.

License: Attribution, NonCommercial, NoDerivatives
