In enterprise operations, contract review is a core risk control process—but traditional manual review has long been plagued by three critical pain points: inefficiency (a complex contract can take hours or even days to review), risk omission (reliance on reviewer experience leads to missed hidden compliance issues), and lack of personalization (failure to adapt to different clients’risk preferences). With the rise of large language models (LLMs) and multi-agent architectures, these challenges now have a superior solution: through collaborative specialized agents, we can achieve automated, accurate, and personalized contract review.
This article will guide you through building an enterprise-grade intelligent contract review and risk analysis system from scratch using LangGraph, covering four core capabilities: Supervisor architecture, short/long-term memory synergy, toolchain integration, and Human-in-the-Loop (HITL) collaboration. Whether you’re an AI engineer, enterprise risk manager, or tech entrepreneur, you’ll gain actionable insights for deploying multi-agent systems in vertical industries (complete with runnable code and real-world best practices).
I. Why Choose Multi-Agent Architecture for Contract Review?
Before diving into technical implementation, let’s clarify the core advantages of multi-agent architecture over single LLMs or traditional systems—conclusions validated through our work on 10+ enterprise contract projects:
1. Fatal Shortcomings of Traditional Solutions
- Single LLM Limitations: While capable of text summarization, single LLMs lack specialized division of labor (e.g., clause extraction and risk assessment require distinct expertise), are prone to hallucinations (e.g., fabricating regulatory references), and struggle to integrate external tools (e.g., regulatory databases, user preference storage).
- Legacy Automation Tools: Rule-based engines fail to handle ambiguous clauses (e.g., “reasonable period” or “significant loss”) and cannot adapt to compliance requirements across industries (finance, e-commerce, manufacturing).
- Pure Manual Review: Low efficiency, high costs, and “experience-dependent” risk omission (e.g., new reviewers often miss cross-border contract regulations).
2. Core Value of Multi-Agent Architecture
- Specialized Division of Labor: Split tasks like “clause extraction,” “risk assessment,” and “compliance verification” across dedicated sub-agents—each focused on a single function to boost accuracy (similar to collaboration between hospital departments).
- Enhanced Robustness: A single agent failure won’t disrupt the entire workflow, as the Supervisor dynamically reroutes tasks.
- Personalization Support: Store user risk preferences (e.g., “financial clients prohibit liability caps exceeding 30%”) in long-term memory for tailored analysis.
- Human-AI Collaboration: Automatically trigger human intervention for high-risk scenarios, balancing efficiency and risk control safety.
Practical Takeaway: In a project with a large law firm we partnered with, our multi-agent contract review system reduced single-contract review time from 4 hours to 15 minutes, cut risk omission rates from 12% to 2.3%, and supported personalized risk rule configuration for 20+ industries.
II. Core System Design: Supervisor Architecture & LangGraph Fundamentals
2.1 Overall Architecture: Supervisor-Led Multi-Agent Collaboration
We adopt a Supervisor architecture, centered on “centralized orchestration + specialized division of labor.” Here’s the architecture diagram:

- Supervisor Agent: The core orchestrator responsible for parsing user requests (e.g., “Review risks in this e-commerce collaboration contract”), assigning tasks to sub-agents, and determining next steps (continue execution/human intervention/terminate).
- Sub-Agents:
- Clause Extraction Agent: Precisely extracts and summarizes key contract clauses (liability, termination, breach of contract, etc.).
- Risk Assessment Agent: Analyzes extracted clauses, user preferences, and external regulations to deliver risk ratings (low/medium/high) and recommendations.
- Memory Layer: Short-term memory (maintains conversation context) + long-term memory (stores user risk preferences, historical review records).
- Tool Layer: Integrates external capabilities (clause extraction tools, regulatory compliance checkers, long-term memory storage tools).
- Human-in-the-Loop Node: Triggers manual review for high-risk scenarios; human instructions are fed back to the system to resume the workflow.
2.2 LangGraph State Definition: The “Shared Brain” of Multi-Agent Systems
In LangGraph, State is the shared data carrier for all agents—acting as the “short-term memory + data warehouse” of the multi-agent system. Inputs, outputs, and intermediate results from all agents are written to the state to ensure workflow continuity.
We use TypedDict to define a structured state (Pydantic is recommended for data validation in production). Below is the code with practical design insights for key fields:
from typing_extensions import TypedDict
from typing import Annotated, List
from langgraph.graph.message import AnyMessage, add_messages
from langgraph.managed.is_last_step import RemainingSteps
# LangGraph State Definition: Core Data Carrier for Contract Review System
class ContractState(TypedDict):
"""
Shared state for the multi-agent contract review system. All nodes (agents/tools/humans) interact via this state.
Practical Tip: Balance completeness and lightness in state fields to avoid workflow slowdowns from redundant data.
"""
# Project/client unique identifier: Links to long-term memory (e.g., client risk preferences)
project_id: str
# Core input: Full contract text uploaded by the user (string parsed from TXT/PDF)
contract_text: str
# Short-term memory: Automatically maintained conversation history (agent responses, tool outputs, human instructions)
# Key Design: The add_messages annotation appends messages automatically—no manual state["messages"].append(...) required
messages: Annotated[List[AnyMessage], add_messages]
# Long-term memory: User-specific configurations loaded from storage (e.g., risk tolerance, priority clauses)
loaded_memory: str
# Intermediate result: Structured summary of the risk assessment report (supports PDF/Excel export)
risk_report: str
# Safety mechanism: Prevents infinite agent loops (set 5-8 steps as a practical limit)
remaining_steps: RemainingSteps
3 Practical State Design Tips (Pitfall Avoidance Guide)
- Core Role of the
messagesField:Annotated[List[AnyMessage], add_messages]is a LangGraph “game-changer”—outputs from all agents/tools are automatically appended to themessageslist. This eliminates manual message management and simplifies short-term memory maintenance. - Necessity of
remaining_steps: In multi-agent systems, LLMs may get stuck in infinite loops (e.g., “Extraction Agent → Tool → Extraction Agent”) due to ambiguous decision-making.RemainingStepsautomatically counts steps and terminates the workflow when the limit is reached (5 steps is sufficient for 99% of review scenarios). - Storage Logic for
loaded_memory: Store “structured summaries” (e.g., “Risk Tolerance: Conservative; Priority Clauses: Liability, Data Compliance”) instead of raw user preference data to reduce agent processing overhead.
III. Core Component Implementation: Agent Responsibilities & Tool Definition
Tools are an agent’s “hands and feet”—they enable agents to access external capabilities (e.g., clause extraction, regulatory checks) instead of relying solely on internal LLM knowledge (which reduces hallucinations). Below is a detailed breakdown of the two core sub-agents, including their responsibilities, tool definitions, and practical optimizations.
3.1 Clause Extraction Agent: Precisely Extract Key Information
Core Responsibility
Locate user-specified clauses (e.g., “liability,” “termination”) from lengthy contract texts and generate concise summaries (avoiding raw, unprocessed text outputs).
Tool Definition (LangChain Tool Standard)
In production, the core logic should be replaced with a “RAG + fine-tuned model” (pure LLM clause extraction accuracy is ~70%, which can be boosted to 92%+ with industry-specific fine-tuning). Below is runnable sample code:
from langchain_core.tools import tool
import json
@tool
def extract_clause_summary(contract_text: str, clause_name: str) -> str:
"""
Extracts original text + concise summary of specified clauses from contract text (supports multi-language contracts).
Practical Optimizations:
1. Actual implementation requires RAG (Retrieval-Augmented Generation): Split contract text into chunks, retrieve relevant segments, and summarize.
2. For complex contracts (e.g., cross-border agreements), add a "clause classification model" for preprocessing to improve extraction accuracy.
3. Standardize output format (e.g., JSON) for easy parsing by the Risk Assessment Agent.
Args:
contract_text (str): Full contract text for review (parsed from PDF/TXT).
clause_name (str): Target clause name (supports fuzzy matching, e.g., "liability," "termination").
Returns:
str: Structured result containing clause fragments and a concise summary.
"""
# Simulated tool execution (replace with RAG + fine-tuned model in production)
# Fuzzy matching logic: Supports user inputs like "liability" or "breach of contract"
if any(keyword in clause_name.lower() for keyword in ["liability", "breach"]):
return json.dumps({
"clause_type": "Liability Clause",
"original_text": "Both parties agree that any breach of this contract resulting in losses to the other party shall be liable for compensation capped at 50% of the total contract value, excluding indirect damages.",
"summary": "Liability Cap: 50% of total contract value (direct damages only)"
}, ensure_ascii=False)
elif any(keyword in clause_name.lower() for keyword in ["termination", "cancellation"]):
return json.dumps({
"clause_type": "Termination Clause",
"original_text": "During the contract term, either party may terminate the contract by providing a 30-day written notice to the other party; the contract shall terminate upon the notice’s effective date.",
"summary": "Termination Condition: 30-day written notice (no additional liquidated damages)"
}, ensure_ascii=False)
else:
return json.dumps({
"clause_type": clause_name,
"original_text": "No matching clause found",
"summary": f"No explicit provisions related to'{clause_name}'were found in the contract. Manual review is recommended."
}, ensure_ascii=False)
# Toolset: Bound to the Clause Extraction Agent later
extraction_tools = [extract_clause_summary]
Practical Optimization Tips
- Boost Extraction Accuracy: Fine-tune BERT/RoBERTa models with industry-specific contract corpora (e.g., finance, e-commerce). Use RAG to retrieve contract segments relevant to the target clause, then summarize with an LLM.
- Structured Output: Return JSON instead of plain text to enable direct parsing by the Risk Assessment Agent, reducing LLM format conversion overhead.
- Fuzzy Matching Support: Users may input keywords like “liability” or “breach of contract”—add fuzzy matching logic to improve user experience.
3.2 Risk Assessment Agent: Compliance Verification + Personalized Risk Judgment
Core Responsibility
Deliver risk assessments using three data sources: ① Clause summaries from the Extraction Agent; ② User risk preferences (e.g., conservative/balanced/aggressive); ③ External regulatory databases (e.g., Civil Code of the People’s Republic of China, E-Commerce Law). The final output includes a risk rating (low/medium/high) and actionable recommendations.
Tool Definition: Compliance Verification Tool
In production, the check_regulatory_compliance tool should integrate with external regulatory databases (e.g., Pkulaw, Lawxin) or on-premises regulatory knowledge bases. Below is the core implementation:
@tool
def check_regulatory_compliance(clause_summary: str, industry: str = "General") -> str:
"""
Verifies clause compliance against industry-specific laws/regulations and identifies potential risks.
Practical Tips:
1. Regularly update the regulatory database (e.g., annual legal revisions, industry regulatory policy updates).
2. Support industry-specific filtering (e.g., financial services require additional checks against Securities Law/Banking Law).
Args:
clause_summary (str): Structured JSON string of the clause summary;
industry (str): Contract industry (General/Finance/E-Commerce/Manufacturing, etc.).
Returns:
str: Compliance result + risk rating + optimization recommendations.
"""
# Parse clause summary (structured input avoids LLM misinterpretation)
clause_data = json.loads(clause_summary)
clause_type = clause_data["clause_type"]
clause_summary = clause_data["summary"]
# Simulated regulatory check (integrate with regulatory database API or RAG knowledge base in production)
risk_level = "Low"
compliance_result = "Compliant"
suggestion = ""
# Example 1: Compliance check for liability caps in the financial industry (regulatory limit: 30%)
if industry == "Finance" and clause_type == "Liability Clause":
if "50%" in clause_summary:
risk_level = "High"
compliance_result = "Non-Compliant"
suggestion = "Per the Guidelines for Compliance Management of Banking Financial Institutions, liability caps in financial contracts shall not exceed 30% of the contract value. Recommend revising the clause to:'Liability capped at 30% of the total contract value.'"
# Example 2: Compliance check for termination clauses in general industries
elif clause_type == "Termination Clause":
if "30-day written notice" in clause_summary:
risk_level = "Low"
compliance_result = "Compliant"
suggestion = "The clause complies with Article 563 of the Civil Code regarding contract termination notice requirements. No significant risks identified."
else:
risk_level = "Medium"
compliance_result = "Pending Confirmation"
suggestion = "No explicit regulatory provisions were found. Recommend manual review of ambiguous terms (e.g.,'reasonable period'should specify a concrete number of days)."
# Structured output: Facilitates subsequent risk report generation
return json.dumps({
"compliance_result": compliance_result,
"risk_level": risk_level,
"risk_reason": f"Clause Type: {clause_type}; Summary: {clause_summary}; Industry: {industry}",
"suggestion": suggestion
}, ensure_ascii=False)
# Toolset for the Risk Assessment Agent
risk_tools = [check_regulatory_compliance]
3.3 Agent Prompt Design: The Key to Avoiding Hallucinations
An agent’s prompts directly determine its behavior. Follow three core principles: clarify responsibilities, define boundaries, and specify formats. Below are battle-tested high-quality prompts:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage
# Initialize LLM (GPT-4o/Claude 3 Opus recommended for production; fine-tune open-source models for complex scenarios)
llm = ChatOpenAI(model="gpt-4o", temperature=0.1) # Low temperature ensures output stability
# 1. Prompt for Clause Extraction Agent (Core: Precision + No Fabrication)
extraction_prompt = ChatPromptTemplate.from_messages([
SystemMessage(
"""
You are a senior contract clause extraction expert specializing in extracting and summarizing key clauses from commercial contracts.
Your Core Rules:
1. Strictly base your work on the provided 'contract_text' and 'tool outputs'—do not fabricate clause content.
2. Only extract clauses explicitly requested by the user (if no clauses are specified, default to extracting 4 core clauses: Liability, Termination, Breach of Contract, Dispute Resolution).
3. Output Format: First list the original clause fragment (max 3 lines), then provide a 1-sentence concise summary.
4. If the tool returns "No matching clause found," inform the user directly without additional content.
"""
),
("placeholder", "{messages}"), # Receives conversation history from the state (user requests, tool outputs)
])
# 2. Prompt for Risk Assessment Agent (Core: Rigor + Data-Driven)
risk_prompt = ChatPromptTemplate.from_messages([
SystemMessage(
"""
You are a professional legal risk assessor responsible for contract compliance and risk analysis.
Your Core Rules:
1. Conduct assessments using three data sources:
- Clause summaries from the Clause Extraction Agent;
- Loaded user risk preferences (loaded_memory field);
- Outputs from the compliance verification tool;
2. Risk Rating Definitions:
- Low Risk: Clause is compliant, aligns with user preferences, and has no potential disputes.
- Medium Risk: Clause is compliant but contains ambiguous language or minor conflicts with user preferences.
- High Risk: Non-compliant clauses, excessive liability caps, unreasonable termination conditions, or other issues that may cause significant losses.
3. Output Requirements: First specify the risk rating, then explain the risk reason, and finally provide concrete, actionable revision recommendations.
4. Trigger human intervention for ambiguous information (e.g., unclear clause wording) instead of making arbitrary judgments.
"""
),
("placeholder", "{messages}"),
])
# 3. Prompt for Supervisor Agent (Core: Clear Decision-Making + Accurate Routing)
supervisor_prompt = ChatPromptTemplate.from_messages([
SystemMessage(
"""
You are the supervisor of an enterprise-grade multi-agent system responsible for orchestrating and deciding the contract review workflow.
Your Core Responsibilities:
1. Parse the user's latest request and determine which sub-agent to invoke:
- If the user needs contract clauses extracted → Invoke extraction_agent;
- If the user needs risk/compliance assessment → Invoke risk_agent;
- If clause extraction and risk assessment are complete → Output "FINISH";
2. Trigger Human Intervention if any of the following conditions are met:
- Risk Assessment Agent returns "High Risk";
- Ambiguous clause wording (e.g., "reasonable period" or "significant loss" not clearly defined);
- User request exceeds current agent capabilities;
3. Output Format:
- To invoke a sub-agent → Return the agent name directly (e.g., "extraction_agent");
- To trigger human intervention → Return "human_in_the_loop";
- To terminate the workflow → Return "FINISH";
4. No additional explanations—only return the specified content (facilitates routing function parsing).
"""
),
("placeholder", "{messages}"),
])
Practical Takeaway: Adding “clear rules and format requirements” to prompts can improve agent behavior accuracy by over 40%. For example, the Supervisor Agent only returns specified keywords (e.g., “extraction_agent”) to avoid parsing errors in the routing function.
IV. Memory System: Synergistic Design of Short-Term & Long-Term Memory
A multi-agent system’s “intelligence” largely depends on its memory capabilities—short-term memory ensures conversation continuity, while long-term memory enables cross-session personalization.
4.1 Short-Term Memory: Maintaining Conversation Context with MemorySaver
LangGraph’s MemorySaver is the official solution for short-term memory. Its core function is to “record and restore the complete state of the graph,” ensuring agents can access historical information (e.g., previous user requests, last tool outputs) during the workflow.
from langgraph.checkpoint.memory import MemorySaver
# Short-Term Memory: MemorySaver (suitable for single-session, single-service instances; use RedisCheckpoint for distributed deployment)
checkpointer = MemorySaver()
Practical Considerations
- Single-Session vs. Distributed Deployment:
MemorySaveris in-memory storage, ideal for single-service, single-session scenarios. For multi-user, distributed deployments, replace it withRedisCheckpointorPostgreSQLCheckpoint. - State Cleanup: Short-term memory automatically expires after the conversation ends (or via a TTL setting) to avoid memory leaks.
- Performance Optimization: For extra-long contracts (e.g., 100,000+ words), store only “clause summaries” instead of full text in the state to reduce data transfer and storage costs.
4.2 Long-Term Memory: Persisting User-Specific Preferences
Long-term memory stores cross-session persistent data, with core use cases including:
- User risk preferences (e.g., “Conservative: No liability caps exceeding 30%”);
- Priority clauses (e.g., a client that requires “data compliance clauses” in every review);
- Historical review records (for user traceability).
We use InMemoryStore for simulation (PostgreSQL + vector database recommended for production) and a structured UserProfile class to manage data:
from langgraph.store.memory import InMemoryStore
from pydantic import BaseModel, Field
# Structured User Preference Model (ensures data standardization and avoids ambiguous wording)
class UserProfile(BaseModel):
risk_tolerance: str = Field(description="User risk tolerance: Aggressive/Balanced/Conservative (only these three values allowed)",
enum=["Aggressive", "Balanced", "Conservative"]
)
preferred_clauses: List[str] = Field(description="List of user-prioritized clauses (e.g., Liability, Data Compliance, Termination)"
)
industry: str = Field(description="User's industry (for industry-specific compliance checks)")
# Long-Term Memory Storage (replace with PostgreSQL + vector database in production for efficient retrieval)
long_term_store = InMemoryStore()
# Node 1: Load Long-Term Memory (executed at workflow start to link user preferences)
def load_memory(state: ContractState) -> ContractState:
"""
Load user preferences from long-term storage and write to the state for the Risk Assessment Agent.
Practical Optimizations:
1. Link users via project_id (not user_id) to support different preferences for multiple projects of the same user.
2. Return default values if no user preferences are found (avoids workflow interruptions).
"""project_id = state.get("project_id","default_project") # Unique project identifier
# Retrieve user configurations from storage (add caching in production to improve performance)
stored_data = long_term_store.get(project_id)
if stored_data and "user_profile" in stored_data:
user_profile = stored_data["user_profile"]
# Format output for direct use by the Risk Assessment Agent
loaded_memory = (f"User Risk Tolerance: {user_profile.risk_tolerance};"
f"Priority Clauses: {', '.join(user_profile.preferred_clauses)};"
f"Industry: {user_profile.industry}"
)
return {"loaded_memory": loaded_memory}
# Return default values if no historical preferences exist (Balanced risk tolerance, General industry)
default_memory = "User Risk Tolerance: Balanced; Priority Clauses: Liability, Termination, Breach of Contract; Industry: General"
return {"loaded_memory": default_memory}
# Node 2: Update Long-Term Memory (executed at workflow end to save user preferences from the current session)
def create_memory(state: ContractState) -> ContractState:
"""
Update user long-term memory based on the current review results (e.g., user-confirmed risk preferences).
Practical Tips:
1. Only update after explicit user confirmation (avoids preference errors from agent misjudgment).
2. Store structured UserProfile objects instead of plain text (facilitates future modifications).
"""project_id = state.get("project_id","default_project")
messages = state["messages"]
# Simulate extracting user preferences from the conversation (in production, let LLM parse messages to generate UserProfile)
# Logic: Update preferences if the user explicitly requested "conservative risk control" during the review
user_risk_tolerance = "Conservative"
preferred_clauses = ["Liability", "Data Compliance", "Breach of Contract"]
industry = "Finance"
# Build structured user configuration
updated_profile = UserProfile(
risk_tolerance=user_risk_tolerance,
preferred_clauses=preferred_clauses,
industry=industry
)
# Write to long-term storage
long_term_store.put(project_id, {"user_profile": updated_profile})
print(f"[Log] Long-Term Memory Updated: User preferences for project {project_id} saved successfully")
return {} # Only perform storage operation—no changes to other state fields
Practical Takeaway: “Structured storage” is critical for long-term memory—use Pydantic models to define user preferences, avoiding parsing difficulties from plain text storage. For example, limiting risk tolerance to three values (“Aggressive/Balanced/Conservative”) reduces ambiguity during agent processing.
V. Human-in-the-Loop (HITL): A “Safety Net” for High-Risk Scenarios
Contract review is a high-risk process—full automation can lead to significant losses (e.g., missed critical compliance issues). Human-in-the-Loop (HITL) balances efficiency and safety: the system handles routine tasks, while humans intervene in high-risk or ambiguous scenarios.
5.1 Trigger Conditions for Human Intervention (Practical Summary)
- Risk Assessment Agent returns “High Risk” (e.g., excessive liability caps, non-compliant clauses);
- Ambiguous clause wording (e.g., “reasonable period” or “significant loss” without clear standards);
- User requests exceed system capabilities (e.g., reviewing foreign regulations for cross-border contracts);
- Tool execution failures (e.g., regulatory database connection errors).
5.2 HITL Node Implementation: From Blocking Wait to Asynchronous Interaction
Below is the core code for the HITL node. In production, integrate it with a frontend UI (e.g., a reviewer dashboard) to support asynchronous human input:
from langchain_core.messages import HumanMessage, AIMessage
def human_in_the_loop_node(state: ContractState) -> ContractState:
"""
Human-in-the-Loop Node: Triggers manual review for high-risk scenarios and resumes the workflow with human instructions.
Production Deployment Recommendations:
1. Replace the blocking input() with an asynchronous task queue (e.g., Celery) to support concurrent reviews by multiple users.
2. The frontend UI should display the contract fragment under review, current risk analysis results, and a field for human input to enable quick decision-making.
3. Support two types of human instructions: Confirm Risk (resume workflow) or Modify Clause (trigger re-extraction).
"""last_message = state["messages"][-1] # Get the last message that triggered human intervention
# Print review information (replace with frontend UI display in production)
print("="*50)
print("⚠️ High-Risk Alert: Human Intervention Required")
print(f"Trigger Reason: {last_message.content}")
print(f"Key Contract Fragment: {state['contract_text'][:500]}...") # Display first 500 characters
print("="*50)
# Simulate human input (replace with asynchronous callback from frontend UI in production)
human_input = input("Enter review instructions (e.g.,'Risk confirmed—proceed with report generation'/'Modify liability cap to 30%'):")
# Write human instructions to the state as a HumanMessage for subsequent processing by the Supervisor Agent
return {"messages": [HumanMessage(content=f"Human Review Instruction: {human_input}")]}
Practical Optimization: Asynchronous Transformation
The code above uses input() for blocking waits, suitable only for testing. For production, implement asynchronous interaction:
- When human intervention is triggered, set the task status to “Pending Review” and send a notification to the reviewer (SMS/WeChat Work).
- After the reviewer completes the operation in the frontend UI, the system writes the instruction to the state.
- The Supervisor Agent detects the human instruction and resumes the workflow (e.g., re-invoking the Risk Assessment Agent).
VI. LangGraph Workflow Construction: From Nodes to a Complete System
LangGraph’s core is the “Graph”—nodes define execution logic, and edges define workflow routing. Below is the complete workflow assembly code with detailed comments:
6.1 Initialize Dependent Components
from langgraph.prebuilt import ToolExecutor, ToolNode, create_react_agent, create_supervisor
from langgraph.graph import StateGraph, END, START
# 1. Create Sub-Agents (LLM + Tools + Prompts)
def create_sub_agent(llm, tools, system_prompt):
"""Encapsulates sub-agent creation logic for code reusability (extract as a utility function in production)"""
tool_executor = ToolExecutor(tools) # Tool executor: Calls tools and returns results
# Create ReAct Agent (automatically determines whether to call tools)
agent_runnable = create_react_agent(llm=llm.bind_tools(tools), # Bind tools to the LLM
tools=tools,
checkpointer=checkpointer, # Link to short-term memory
prompt=system_prompt
)
return agent_runnable, tool_executor
# Create Clause Extraction Agent and Risk Assessment Agent
extraction_agent, extraction_executor = create_sub_agent(llm, extraction_tools, extraction_prompt)
risk_agent, risk_executor = create_sub_agent(llm, risk_tools, risk_prompt)
# Create Tool Nodes (LangGraph pre-built nodes for tool execution)
extraction_tool_node = ToolNode(extraction_executor)
risk_tool_node = ToolNode(risk_executor)
# 2. Create Supervisor Agent (responsible for workflow routing)
members = ["extraction_agent", "risk_agent"] # List of callable sub-agents
supervisor_agent = create_supervisor(
llm=llm,
agents=members,
system_prompt=supervisor_prompt,
handle_messages_correctly=True # Ensures message format compatibility
)
6.2 Define Routing Functions (Workflow Decision Logic)
Routing functions determine “where to go next after a node completes execution”—the core of LangGraph workflows:
def check_for_tool_call(state: ContractState) -> str:
"""Tool Call Check: Determines whether to execute a tool or return to the Supervisor based on the agent's output.
Logic:
- Contains tool calls → Execute tool (continue);
- No tool calls → Return to Supervisor Agent (end).
"""last_message = state["messages"][-1]
return "continue" if last_message.tool_calls else "end"
def route_supervisor(state: ContractState) -> str:
"""Supervisor Agent Routing: Determines the next node based on the Supervisor's output.
Core Routing Logic:
- Supervisor returns "extraction_agent" → Invoke Clause Extraction Agent;
- Supervisor returns "risk_agent" → Invoke Risk Assessment Agent;
- Supervisor returns "human_in_the_loop" → Trigger HITL;
- Supervisor returns "FINISH" → Update long-term memory and terminate.
"""last_message = state["messages"][-1]
content = last_message.content.strip()
if content == "extraction_agent":
return "extraction_agent"
elif content == "risk_agent":
return "risk_agent"
elif content == "human_in_the_loop":
return "human_in_the_loop"
elif content == "FINISH":
return "create_memory"
else:
# Exception Branch: Trigger HITL by default (avoids workflow interruptions)
return "human_in_the_loop"
6.3 Assemble and Compile the LangGraph
# 1. Initialize State Graph (binds the previously defined ContractState)
builder = StateGraph(ContractState)
# 2. Add Core Nodes (carriers for all execution logic)
# Memory-Related Nodes
builder.add_node("load_memory", load_memory) # Load long-term memory
builder.add_node("create_memory", create_memory) # Update long-term memory
# HITL Node
builder.add_node("human_in_the_loop", human_in_the_loop_node)
# Supervisor Agent Node
builder.add_node("supervisor", supervisor_agent)
# Sub-Agents and Their Tool Nodes
builder.add_node("extraction_agent", extraction_agent) # Clause Extraction Agent
builder.add_node("extraction_tool_node", extraction_tool_node) # Extraction Tool Execution Node
builder.add_node("risk_agent", risk_agent) # Risk Assessment Agent
builder.add_node("risk_tool_node", risk_tool_node) # Risk Tool Execution Node
# 3. Define Edges (Workflow Paths)
# Start → Load Long-Term Memory → Supervisor Agent (initialization workflow)
builder.set_entry_point("load_memory")
builder.add_edge("load_memory", "supervisor")
# Supervisor Agent → Conditional Routing (next step determined by Supervisor output)
builder.add_conditional_edges(
source="supervisor",
path=route_supervisor,
destinations={
"extraction_agent": "extraction_agent",
"risk_agent": "risk_agent",
"human_in_the_loop": "human_in_the_loop",
"create_memory": "create_memory"
}
)
# Clause Extraction Agent Workflow: Agent → Tool (if needed) → Agent → Supervisor
builder.add_edge("extraction_agent", "extraction_tool_node", check_for_tool_call)
builder.add_edge("extraction_tool_node", "extraction_agent")
builder.add_edge("extraction_agent", "supervisor") # Return to Supervisor after extraction
# Risk Assessment Agent Workflow: Agent → Tool (if needed) → Agent → Supervisor
builder.add_edge("risk_agent", "risk_tool_node", check_for_tool_call)
builder.add_edge("risk_tool_node", "risk_agent")
builder.add_edge("risk_agent", "supervisor") # Return to Supervisor after assessment
# HITL → Return to Supervisor (resume workflow after human instruction)
builder.add_edge("human_in_the_loop", "supervisor")
# Update Long-Term Memory → Workflow Termination
builder.add_edge("create_memory", END)
# 4. Compile the Graph (bind short-term memory checkpointer)
contract_review_app = builder.compile(checkpointer=checkpointer)
6.4 System Flow Diagram (Core Workflow Visualization)

VII. Practical Deployment Considerations & Optimization Tips
7.1 Performance Optimization
- Prevent Infinite Loops: Set a 5-8 step limit for
remaining_stepsand add exception branches in routing functions (e.g., default HITL trigger). - Tool Call Optimization: Add local caching for high-frequency tools (e.g., clause extraction) to avoid redundant calls.
- LLM Call Optimization: Use streaming output to improve user experience. For long texts, pass “summaries + fragments” to reduce token consumption.
7.2 Accuracy Improvement
- Clause Extraction Models: Pure LLM accuracy is ~70%. Combine industry-specific fine-tuning (e.g., fine-tuning BERT with legal contract datasets) + RAG retrieval to boost accuracy to 92%+.
- Regulatory Database Updates: Establish a regular update mechanism (e.g., monthly synchronization of new regulations) to avoid compliance check errors from outdated laws.
- Human Feedback Loop: Record human-modified clauses and risk judgments to optimize agent prompts and tool logic.
7.3 Scalability Design
- Add New Agents: To add a “Contract Comparison Agent” (compares current and historical contracts), define the agent, tools, and prompts following the same standards, then update the Supervisor’s routing logic.
- Multi-Language Support: Add language detection in tools to automatically call language-specific clause extraction models.
- Multi-Format Input: Integrate PDF/TXT/Word parsing tools (e.g., PyPDF2, python-docx) to support direct file uploads without manual text conversion.
7.4 Enterprise-Grade Deployment Architecture Recommendations
- Development Environment: Use
MemorySaver+InMemoryStorefor local rapid testing. - Staging Environment: Use RedisCheckpoint (short-term memory) + PostgreSQL (long-term memory) to simulate multi-user scenarios.
- Production Environment:
- Short-Term Memory: Redis Cluster (supports high concurrency and distributed sessions).
- Long-Term Memory: PostgreSQL (structured user preferences) + Pinecone (vector database for similar contract retrieval).
- Concurrent Processing: Use Celery for asynchronous task queues to support batch contract reviews.
- Monitoring & Alerts: Integrate Prometheus + Grafana to monitor agent success rates, tool response times, and HITL rates.
Conclusion: Key Steps from Prototype to Enterprise-Grade System
The multi-agent contract review system built with LangGraph in this article offers core advantages: “specialized division of labor + human-AI collaboration + personalized memory”—solving the inefficiency, high risk, and lack of personalization in traditional contract review.
The key to scaling from prototype to enterprise-grade system lies in:
- Architecture Design: A Supervisor architecture ensures workflow control, while state definition balances completeness and lightness.
- Detail Optimization: Clear prompts, structured tool outputs, and synergistic memory systems.
- Deployment Awareness: Considering deployment scenarios (distributed, high-concurrency), risk control (HITL), and scalability (adding new agents).
If you’re building intelligent legal or enterprise risk control systems, this solution’s core logic can be directly reused—simply adjust clause extraction rules, regulatory databases, and risk preference configurations for your specific industry (finance, e-commerce, manufacturing).
Frequently Asked Questions (FAQ)
1、LangGraph vs. LangChain Agent: Why Choose LangGraph?
LangChain Agent is ideal for simple single-agent tasks, while LangGraph supports complex multi-agent collaboration, state persistence, and conditional routing—making it better suited for enterprise-grade systems.
2、How to Handle Extra-Long Contracts (100,000+ Words)?
Split the contract into logical fragments (e.g., by chapter), extract clauses one by one, and aggregate risk assessment results at the end. This avoids performance degradation and accuracy loss from processing overly long texts.
3、What If the HITL Rate Is Too High?
Analyze the root causes of frequent human intervention: If due to ambiguous clauses, optimize the clause extraction tool’s ability to identify vague wording. If due to incorrect risk ratings, adjust the Risk Assessment Agent’s prompts and rules.
4、Can Open-Source LLMs Replace GPT-4o?
Yes, but targeted fine-tuning is required. Recommended open-source models include Llama 3 70B and Qwen 72B. After fine-tuning with legal contract corpora, accuracy can reach ~80% of GPT-4o at a lower cost.