AI/ML Interview: AI Agents — Tool Use, Chain-of-Thought, Planning, Function Calling, RAG Agents, Autonomous Systems

AI agents are LLMs that can reason, plan, and take actions in the world: calling APIs, querying databases, writing code, and orchestrating multi-step workflows. With the rise of function calling (GPT, Claude) and agent frameworks (LangChain, AutoGPT, CrewAI), agent architecture has become a standard ML interview topic at companies building AI products. This guide covers how agents work, their architectures, and the challenges of building reliable agent systems.

What is an AI Agent

An AI agent is an LLM-powered system that can: (1) reason about a problem using chain-of-thought; (2) plan a sequence of actions to achieve a goal; (3) execute actions by calling external tools (APIs, databases, code execution); and (4) observe the results and iterate.

The ReAct pattern (Reasoning + Acting): the agent alternates between thinking (reasoning about what to do next) and acting (calling a tool):

Thought: "I need to find the current stock price of AAPL."
Action: call_api(stock_price, symbol="AAPL")
Observation: "$185.42"
Thought: "Now I need to calculate the portfolio value."
Action: calculate(shares * price)
Observation: "$18,542"
Answer: "Your AAPL holdings are worth $18,542."

Without tools, the LLM can only generate text from its training knowledge, which may be outdated or hallucinated. With tools, the LLM accesses real-time data, performs precise calculations, and takes actions in external systems. This transforms LLMs from knowledge repositories into capable assistants.
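The ReAct trace above can be sketched as a simple loop. This is a toy sketch: the "LLM" is replaced by a scripted stand-in, and both tools are stubs (names like stock_price and calculate are illustrative, not any real framework's API):

```python
# Minimal ReAct-style loop. The "LLM" is a scripted stand-in and both
# tools are stubs; a real agent would prompt a model at each step.

TOOLS = {
    "stock_price": lambda symbol: 185.42,  # stub: real code would call a market-data API
    "calculate": lambda expr: eval(expr, {"__builtins__": {}}, {}),  # stub: use a safe evaluator in production
}

# Scripted (thought, action, arguments) steps standing in for model output.
SCRIPT = [
    ("I need the current price of AAPL.", "stock_price", {"symbol": "AAPL"}),
    ("Now compute the value of 100 shares.", "calculate", {"expr": "100 * 185.42"}),
]

def react_loop():
    trace = []
    for thought, action, args in SCRIPT:
        observation = TOOLS[action](**args)           # Act: execute the chosen tool
        trace.append((thought, action, observation))  # Observe: feeds the next thought
    return trace

final_value = react_loop()[-1][2]  # portfolio value, roughly 18,542
```

The key structural point is the feedback loop: each observation is appended to the trace, which a real agent would include in the next prompt so the model can decide its next step.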

Function Calling and Tool Use

Modern LLMs (GPT-4, Claude, Gemini) support structured function calling. The developer provides a list of available tools with their schemas:

{name: "get_weather", description: "Get current weather for a city", parameters: {city: string, unit: "celsius"|"fahrenheit"}}

When the user asks "What is the weather in Tokyo?", the LLM generates a structured function call: {name: "get_weather", arguments: {city: "Tokyo", unit: "celsius"}}. The application executes the function, returns the result to the LLM, and the LLM generates a natural language response.

Key design decisions:
(1) Tool selection: which tools to expose. Too many tools confuse the model (it may call the wrong one); too few limit capability. Group related tools and provide clear descriptions.
(2) Tool descriptions: the description is the "prompt" for tool selection. Be specific: "Search the product catalog by name, returning price and availability" is better than "Search products."
(3) Error handling: tools may fail (API timeout, invalid input). The agent must handle errors gracefully: retry, fall back to an alternative tool, or inform the user.
(4) Confirmation: for high-stakes actions (sending emails, making purchases, modifying data), require user confirmation before execution. The agent proposes the action; the user approves.
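A minimal dispatcher for this flow might look like the sketch below. The registry, the get_weather stub, and the validation step are illustrative assumptions, not any provider's SDK:

```python
import json

# Hypothetical tool implementation matching the schema above.
def get_weather(city, unit="celsius"):
    # Stub: a real implementation would call a weather API.
    return {"city": city, "temp": 22, "unit": unit}

# Registry of tools the application actually exposes.
REGISTRY = {"get_weather": get_weather}

def execute_tool_call(call_json):
    """Validate a model-emitted call against the registry, then run it."""
    call = json.loads(call_json)
    if call["name"] not in REGISTRY:  # reject hallucinated tool names
        raise ValueError(f"Unknown tool: {call['name']}")
    return REGISTRY[call["name"]](**call["arguments"])

# The model emits structured JSON instead of free text; the app executes it
# and hands the result back to the model for the final response.
result = execute_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}'
)
```

The explicit registry check is what makes hallucinated tool calls fail loudly instead of silently producing fabricated results.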

Chain-of-Thought and Planning

Chain-of-thought (CoT): prompting the LLM to reason step by step before answering. "Let me think through this step by step…" dramatically improves accuracy on complex tasks (math, logic, multi-step reasoning). CoT is the foundation of agent reasoning.

Planning strategies:
(1) Sequential planning: the agent decides the next action based on the current state. Simple but greedy (it may get stuck or take suboptimal paths). Works for most tool-use scenarios.
(2) Plan-then-execute: the agent first generates a full plan (a list of steps), then executes each step. Better for complex multi-step tasks. Risk: the plan may become invalid if an early step fails or returns unexpected results. Mitigation: re-plan after significant deviations.
(3) Tree-of-thought: explore multiple reasoning paths in parallel. At each step, generate several candidate next steps, evaluate each, and pursue the most promising. Better for problems with many possible approaches, but expensive (multiple LLM calls per step).
(4) Reflection: after completing a task (or failing), the agent reviews its actions and learns. "I called the wrong API first. Next time, I should check the documentation tool before calling the data API." Self-reflection improves agent performance over time (within a session, or across sessions with memory).

In practice: ReAct (sequential reasoning + acting) is the standard for most agent applications; use plan-then-execute for complex workflows with many steps.
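Plan-then-execute, strategy (2), can be sketched as follows. The planner is a stub that returns a fixed step list, and the step implementations are toy functions with made-up numbers; every name here is hypothetical:

```python
# Plan-then-execute sketch: a stub "planner" emits the full step list up
# front, then an executor runs each step, threading shared state through.

def plan(goal):
    # Stub: a real agent would ask the LLM to decompose the goal into steps.
    return ["fetch_revenue", "fetch_industry_avg", "compare"]

# Toy step implementations; each takes and returns the shared state dict.
STEPS = {
    "fetch_revenue": lambda s: {**s, "revenue_growth": 0.12},
    "fetch_industry_avg": lambda s: {**s, "industry_avg": 0.08},
    "compare": lambda s: {**s, "delta": round(s["revenue_growth"] - s["industry_avg"], 2)},
}

def execute(goal):
    state = {}
    for step in plan(goal):
        # A production agent would re-plan here if a step fails or
        # returns something the original plan did not anticipate.
        state = STEPS[step](state)
    return state

result = execute("compare our revenue growth to the industry average")
```

The trade-off versus sequential (ReAct-style) planning is visible in the structure: the whole plan is fixed before any step runs, which is why a failed early step calls for re-planning rather than blindly continuing.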

RAG Agents

A RAG agent combines retrieval-augmented generation with tool use. Instead of a static RAG pipeline (retrieve -> generate), the agent decides WHEN to retrieve, WHAT to search for, and WHETHER the retrieved information is sufficient.

Agentic RAG flow:
(1) The user asks a question.
(2) The agent reasons: "I need information about the company refund policy."
(3) Action: search_knowledge_base(query="refund policy").
(4) Observation: retrieves 3 relevant chunks.
(5) The agent evaluates: "These chunks discuss refund timelines but not the exception for damaged goods. I need more information."
(6) Action: search_knowledge_base(query="damaged goods return policy").
(7) Observation: retrieves 2 more chunks.
(8) The agent now has sufficient context to generate a complete answer.

Why agentic RAG is better than static RAG:
(1) Iterative retrieval: the agent retrieves multiple times with refined queries; static RAG retrieves once.
(2) Query reformulation: the agent rewrites queries based on what it has already found, filling gaps.
(3) Source evaluation: the agent can assess whether retrieved information is sufficient or contradictory.
(4) Multi-source: the agent can search different knowledge bases, databases, and APIs to compose a complete answer.

Multi-step reasoning: "What was our revenue growth compared to the industry average?" Step 1: search the internal database for company revenue. Step 2: search industry reports for the average growth. Step 3: calculate and compare. A static RAG pipeline cannot handle this.
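The iterative-retrieval loop can be sketched with a toy in-memory knowledge base. The sufficiency check here is a keyword stand-in for what would really be an LLM judgment call, and all names are illustrative:

```python
# Agentic-RAG sketch: retrieve, check sufficiency, refine the query, repeat.

KNOWLEDGE_BASE = {
    "refund policy": ["Refunds are issued within 14 days of return."],
    "damaged goods return policy": ["Damaged goods may be returned within 60 days."],
}

def search_knowledge_base(query):
    # Stub: real code would run vector search over document chunks.
    return KNOWLEDGE_BASE.get(query, [])

def is_sufficient(chunks, required_topics):
    # Toy stand-in: a real agent would ask the LLM whether the retrieved
    # context fully answers the user's question.
    text = " ".join(chunks).lower()
    return all(topic in text for topic in required_topics)

def agentic_retrieve(queries, required_topics, max_rounds=3):
    chunks = []
    for query in queries[:max_rounds]:  # each round uses a refined query
        chunks += search_knowledge_base(query)
        if is_sufficient(chunks, required_topics):
            break  # stop retrieving once the context looks complete
    return chunks

context = agentic_retrieve(
    ["refund policy", "damaged goods return policy"],
    required_topics=["refund", "damaged"],
)
```

A static pipeline corresponds to running only the first iteration of this loop; the agentic version keeps retrieving until the sufficiency check passes or the round budget runs out.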

Agent Reliability and Challenges

Agents are powerful but unreliable. Failure modes:
(1) Tool selection errors: the agent calls the wrong tool or passes the wrong parameters (a billing inquiry triggers a cancellation tool). Mitigation: clear tool descriptions, input validation, and confirmation for destructive actions.
(2) Infinite loops: the agent retries the same failed action indefinitely. Mitigation: a maximum step count (e.g., stop after 10 iterations), loop detection (detect repeated actions), and timeouts.
(3) Hallucinated tool calls: the agent "calls" a tool that does not exist, or fabricates a result without actually calling. Mitigation: strictly validate that tool calls match the registered tool list.
(4) Context window overflow: long agent sessions accumulate history (thoughts, actions, observations) that exceeds the context window. Mitigation: summarize past steps, keep only the most recent N steps, or use a memory system.
(5) Cost: each agent step requires an LLM call, so a 10-step agent task costs 10x a single-call task, and complex tasks may take 20-50 steps. Budget limits and step limits are essential.

Evaluation: agent evaluation is harder than LLM evaluation. Metrics: task completion rate (did the agent solve the problem?), step efficiency (how many steps?), tool accuracy (correct tool with correct parameters?), and user satisfaction. Test on a suite of representative tasks with known-correct solutions. Benchmark frameworks: GAIA, AgentBench, and ToolBench evaluate agent capabilities across diverse tasks.
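The step-limit and loop-detection mitigations can be sketched as a small wrapper around the agent loop; choose_action and execute are placeholders for the LLM and the tool layer, not a real framework's API:

```python
# Guardrail sketch: cap total steps, and stop when the agent proposes
# the exact same action twice in a row.

def run_agent(choose_action, execute, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = choose_action(history)
        if action is None:                        # agent signals it is done
            return {"status": "done", "history": history}
        if history and history[-1][0] == action:  # loop detection
            return {"status": "loop_detected", "history": history}
        history.append((action, execute(action)))
    return {"status": "step_limit_reached", "history": history}

# Demo: an "agent" that keeps retrying the same failing call is cut off
# the moment it repeats itself, instead of burning the whole step budget.
result = run_agent(
    choose_action=lambda history: "call_flaky_api",
    execute=lambda action: "timeout",
)
```

Production loop detection is usually looser than exact back-to-back repeats (e.g., the same action appearing k times in the last n steps), but the principle is the same: bound both the number of steps and the amount of repetition.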
