The A.I. Beat

Dispatches from the frontier of machine intelligence
Coding · May 4, 2026 · 5 min read

AI-Generated Code: How to Review It and Avoid Common Pitfalls

AI can write code fast, but it can also introduce subtle bugs. Here's a practical guide to reviewing AI-generated code effectively.

AI coding tools now write a significant share of production code. GitHub reports that Copilot generates over 46% of code in files where it is enabled. Google says that more than 25% of new code at the company is AI-generated. JetBrains’ 2026 developer survey found that 82% of professional developers use AI coding assistants at least weekly.

This means that reviewing AI-generated code is no longer a niche skill — it is a core engineering competency. And it requires a different lens than reviewing human-written code, because AI models fail in predictable, specific ways that experienced reviewers learn to spot.

This guide is based on analysis of thousands of AI-generated pull requests and published research on LLM code quality, including Google’s 2025 study of Gemini-generated code across internal repositories and GitClear’s analysis of 150 million lines of code changes.

The Review Pipeline

Effective review of AI-generated code follows a specific order. Each stage catches a different category of defect, and skipping a stage is how bugs slip through.

COMPILE CHECK catches the roughly 8-12% of AI-generated code that contains syntax errors, missing imports, or references to nonexistent modules. That rate has improved (it was 20-30% in 2023), but it is not zero.

LOGIC REVIEW is the most important stage because it catches bugs that compile and even pass basic tests but do not correctly implement the intended behavior. This is the stage most developers rush through.

SECURITY SCAN catches vulnerabilities that the model introduces because it optimizes for functionality, not security. The model’s training data contains millions of examples of insecure code, and it will happily reproduce those patterns.

EDGE CASES catches the failures that only manifest with unusual inputs. AI models are biased toward the happy path because that is what appears most frequently in training data.

The Seven Failure Modes

AI-generated code fails in predictable patterns. Learning these patterns makes you a dramatically more effective reviewer because you know where to look instead of reading every line with equal attention.

Concrete Examples: Before and After

Abstract descriptions of bugs are less useful than seeing them. Here are real patterns (anonymized but representative) from AI-generated pull requests.

Example 1: The Hallucinated Method

AI-generated code:

import pandas as pd

df = pd.read_csv("data.csv")
# Attempt to fill missing values with column median
df.fill_missing(method="median")  # This method does not exist

The method fill_missing does not exist in pandas. The model conflated pandas with a different library or invented a plausible-sounding method. The correct approach:

import pandas as pd

df = pd.read_csv("data.csv")
df = df.fillna(df.median(numeric_only=True))

How to catch it: If you see a method call you do not recognize on a well-known library, check the docs. AI models are particularly prone to hallucinating convenience methods that “should” exist but do not.
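Checking the docs can be partly automated. The following is a minimal sketch, not a complete tool: `attribute_exists` is a hypothetical helper that resolves a dotted path against whatever version of the library is installed locally, so it only reflects your own environment.

```python
import importlib


def attribute_exists(dotted_path: str) -> bool:
    """Return True if a dotted path such as 'json.loads' resolves to a real attribute."""
    parts = dotted_path.split(".")
    # Try the longest importable module prefix first, then walk the remaining attributes.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except (ImportError, ValueError):
            continue
        for attr in parts[split:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


print(attribute_exists("json.loads"))         # a real function
print(attribute_exists("json.fill_missing"))  # an invented one
```

With pandas installed, the same call on `"pandas.DataFrame.fill_missing"` should come back `False`, which is the cue to open the documentation before approving the change.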

Example 2: The Subtle Off-by-One

AI-generated code for paginating API results:

def get_all_pages(base_url, total_items, page_size):
    pages = total_items // page_size  # BUG: misses the last partial page
    results = []
    for page in range(pages):
        response = requests.get(f"{base_url}?page={page}&size={page_size}")
        results.extend(response.json()["items"])
    return results

If total_items is 25 and page_size is 10, this fetches pages 0 and 1 (20 items) but misses the final 5 items on page 2. The fix:

import requests

def get_all_pages(base_url, total_items, page_size):
    pages = (total_items + page_size - 1) // page_size  # Ceiling division
    results = []
    for page in range(pages):
        response = requests.get(f"{base_url}?page={page}&size={page_size}")
        response.raise_for_status()  # Fail loudly instead of parsing an error body
        results.extend(response.json()["items"])
    return results

How to catch it: Integer division is a red flag. Every time you see // in Python or Math.floor in JavaScript near pagination, array slicing, or batch processing, check for the remainder case.
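One way to check the remainder case is to pull the page arithmetic into its own function and test it at the boundaries. This is a small sketch; `page_count` is a hypothetical helper name, and the cases are chosen to cover an empty result, a partial page, an exact multiple, and one item over.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Number of pages needed to hold total_items, using ceiling division."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total_items + page_size - 1) // page_size


# Boundary cases: empty, one short of a full page, exact multiple, one over, the article's 25/10.
cases = {(0, 10): 0, (9, 10): 1, (10, 10): 1, (11, 10): 2, (25, 10): 3}
for (total, size), expected in cases.items():
    assert page_count(total, size) == expected
```

The floor-division version from the buggy example would fail the `(9, 10)`, `(11, 10)`, and `(25, 10)` cases, which is exactly the signal a reviewer wants.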

Example 3: SQL Injection in Plain Sight

AI-generated code:

def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    return db.execute(query)

A user submitting ' OR '1'='1' -- as their username gets access to every record in the table. The fix:

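def get_user(username):
    # Parameterized query: the driver passes username as data, never as SQL.
    # Placeholder style varies by driver (%s here; sqlite3 uses ?).
    query = "SELECT * FROM users WHERE username = %s"
    return db.execute(query, (username,))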

How to catch it: Search for f"SELECT, f"INSERT, f"UPDATE, f"DELETE, and any string concatenation (+ or .format()) near SQL keywords. This should be an automated check in your CI pipeline; it is that common.
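The search described above can be roughed out in a few lines. This is a line-based heuristic sketch, not a replacement for a real scanner: `find_suspicious_sql` and both patterns are illustrative assumptions, and the approach will produce false positives (any f-string containing the word "select") and miss queries built across several lines.

```python
import re

# A SQL verb anywhere on the line.
SQL_VERB = re.compile(r"\b(?:SELECT|INSERT|UPDATE|DELETE)\b", re.IGNORECASE)

# Signs that a string on the line is being built dynamically.
DYNAMIC_STRING = re.compile(
    r"""\b(?:f|rf|fr)["']"""      # f-string prefix
    r"""|\.format\("""            # str.format call
    r"""|["']\s*\+|\+\s*["']"""   # concatenation next to a string literal
    r"""|["']\s*%\s*[\w(]""",     # old-style % interpolation
    re.IGNORECASE,
)


def find_suspicious_sql(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that appear to build SQL from dynamic strings."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if SQL_VERB.search(line) and DYNAMIC_STRING.search(line):
            hits.append((number, line.strip()))
    return hits


sample = "\n".join([
    "query = f\"SELECT * FROM users WHERE username = '{username}'\"",
    'query = "SELECT * FROM users WHERE id = " + user_id',
    'query = "SELECT * FROM users WHERE username = %s"',
])
for number, line in find_suspicious_sql(sample):
    print(number, line)
```

On the sample, the f-string and the concatenation are flagged and the parameterized query is not.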

Example 4: The Race Condition

AI-generated code:

async function transferFunds(fromAccount, toAccount, amount) {
  const balance = await getBalance(fromAccount);
  if (balance >= amount) {
    await deductBalance(fromAccount, amount);
    await addBalance(toAccount, amount);
  }
}

If two transfers from the same account execute simultaneously, both can read the original balance, both pass the check, and the account goes negative. The fix requires either a database transaction with locking or an atomic compare-and-swap operation:

async function transferFunds(fromAccount, toAccount, amount) {
  await db.transaction(async (tx) => {
    const balance = await tx.getBalance(fromAccount, { forUpdate: true });
    if (balance < amount) throw new InsufficientFundsError();
    await tx.deductBalance(fromAccount, amount);
    await tx.addBalance(toAccount, amount);
  });
}

How to catch it: Any time AI code reads a value, makes a decision based on it, and then writes — with await between the read and write — consider whether another operation could modify the value in between. This pattern (check-then-act without locking) is the single most common concurrency bug in AI-generated code.
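The same check-then-act pattern can be reproduced in a few lines of Python with asyncio. This is an illustrative sketch under stated assumptions: the dictionary stands in for a database, `asyncio.sleep(0)` stands in for an awaited database call, and `withdraw_unsafe` and `withdraw_safe` are hypothetical names. An in-process lock only protects a single process, so a real service would still need a database transaction or row lock, as described above.

```python
import asyncio


async def withdraw_unsafe(balances, account, amount):
    balance = balances[account]      # read
    await asyncio.sleep(0)           # another task can run here
    if balance >= amount:            # check uses a possibly stale value
        balances[account] -= amount  # act
        return True
    return False


async def withdraw_safe(balances, lock, account, amount):
    async with lock:                 # read, check, and write happen as one unit
        balance = balances[account]
        await asyncio.sleep(0)
        if balance >= amount:
            balances[account] -= amount
            return True
        return False


async def demo():
    unsafe = {"acct": 100}
    await asyncio.gather(
        withdraw_unsafe(unsafe, "acct", 80),
        withdraw_unsafe(unsafe, "acct", 80),
    )

    safe = {"acct": 100}
    lock = asyncio.Lock()
    await asyncio.gather(
        withdraw_safe(safe, lock, "acct", 80),
        withdraw_safe(safe, lock, "acct", 80),
    )
    return unsafe["acct"], safe["acct"]


unsafe_balance, safe_balance = asyncio.run(demo())
print(unsafe_balance, safe_balance)  # -60 20
```

Both unsafe withdrawals read 100 before either writes, so both pass the check and the balance ends at -60. With the lock, the second withdrawal sees the updated balance of 20 and is refused.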

The Review Checklist

Use this systematically. It takes 5-10 minutes per AI-generated PR and catches the majority of defects.

The Testing Problem

AI-generated tests deserve special scrutiny because they have a unique failure mode: they look thorough while testing nothing meaningful.

The tautological test. The model generates a test that mocks a dependency, configures the mock to return a specific value, then asserts that the function returns that value. The test always passes because it is testing the mock configuration, not the code.

# BAD: Tautological test
from unittest.mock import Mock

def test_get_user():
    mock_db = Mock()
    mock_db.query.return_value = {"name": "Alice", "id": 1}
    user = get_user(mock_db, user_id=1)
    assert user["name"] == "Alice"  # Of course it does -- you told the mock to return that

The test should verify that get_user calls the database with the correct query, handles the case where the user does not exist, and correctly transforms the raw database row into the expected format.
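A sketch of what those three checks might look like follows. The body of `get_user` is not shown in the original, so the implementation here (the query text, the `display_name` field, and the title-casing) is purely a hypothetical stand-in to give the tests something to verify.

```python
from unittest.mock import Mock


# Hypothetical implementation under test; the query and output format are assumptions.
def get_user(db, user_id):
    row = db.query("SELECT id, name FROM users WHERE id = %s", (user_id,))
    if row is None:
        return None
    return {"id": row["id"], "display_name": row["name"].title()}


def test_get_user_queries_by_id():
    db = Mock()
    db.query.return_value = {"id": 1, "name": "alice"}
    get_user(db, user_id=1)
    # Verifies the interaction with the database, not the canned return value.
    db.query.assert_called_once_with("SELECT id, name FROM users WHERE id = %s", (1,))


def test_get_user_missing_returns_none():
    db = Mock()
    db.query.return_value = None
    assert get_user(db, user_id=404) is None


def test_get_user_transforms_row():
    db = Mock()
    db.query.return_value = {"id": 1, "name": "alice"}
    # The expected value differs from the raw row, so the transformation is exercised.
    assert get_user(db, user_id=1) == {"id": 1, "display_name": "Alice"}
```

Each test would fail if the corresponding behavior were removed from `get_user`, which the tautological version cannot claim.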

The happy-path-only suite. AI generates 10 tests, all with valid inputs, all passing. Zero tests for error cases, boundary conditions, or concurrent access. A test suite that only tests the happy path gives false confidence.

The coverage-driven test. The model generates tests that achieve 95% line coverage but do not actually verify correctness. Every line is executed, but the assertions are weak (e.g., assert result is not None instead of checking the actual value).

The fix: when reviewing AI-generated tests, ignore coverage numbers. Instead, ask: “If I introduced a specific bug (e.g., changed < to <=, removed a null check, swapped two function arguments), would any of these tests catch it?” If the answer is no, the tests are not doing their job.
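That question can be tried by hand on a small function. The sketch below uses a hypothetical `can_withdraw` function, a hand-made mutant with `<=` changed to `<`, and two toy suites; the names and values are illustrative only.

```python
def can_withdraw(balance: int, amount: int) -> bool:
    """Original: a withdrawal equal to the balance is allowed."""
    return amount <= balance


def can_withdraw_mutant(balance: int, amount: int) -> bool:
    """Hand-made mutant: <= changed to <."""
    return amount < balance


def weak_suite(fn) -> bool:
    # Executes every line but only checks that something comes back.
    return fn(100, 50) is not None and fn(100, 500) is not None


def strong_suite(fn) -> bool:
    # Checks actual values, including the boundary where amount == balance.
    return fn(100, 50) is True and fn(100, 500) is False and fn(100, 100) is True


print(weak_suite(can_withdraw), weak_suite(can_withdraw_mutant))      # True True
print(strong_suite(can_withdraw), strong_suite(can_withdraw_mutant))  # True False
```

The weak suite passes for both versions, so the bug survives. The strong suite fails on the mutant because of the boundary assertion. Mutation-testing tools automate this idea across a whole codebase.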

Defect Rates: What the Data Shows

Published research provides concrete benchmarks for AI code quality:

GitClear’s 2025 analysis of 150 million lines of code changes found that AI-assisted code has a 41% higher rate of being reverted or immediately updated compared to human-written code. The rate of “churn” — code that is written and then modified within two weeks — increased by 39% in repositories that adopted AI coding tools.

Google’s internal study (presented at ICSE 2025) found that Gemini-generated code had a defect density approximately 1.5-2x that of human-written code on the same tasks, though the defects were generally less severe (more likely to be minor bugs, less likely to be architectural problems).

A 2025 Stanford study of security vulnerabilities found that developers using AI coding assistants produced code with 10-20% more security vulnerabilities than developers writing code without AI assistance. However, developers who used AI assistants and were specifically prompted to review for security issues had vulnerability rates comparable to unassisted developers.

The throughline: AI-generated code is not categorically worse than human code, but it has a higher defect rate that can be compensated for with disciplined review. The developers who trust AI code without review get burned. The developers who review it systematically get the speed benefit without the quality cost.

Automated Checks Worth Running

Not everything needs to be caught by human review. Several categories of AI code defects can be caught automatically in CI:

Static type checking (mypy, TypeScript strict mode, Pyright) catches approximately 15-20% of AI code defects, including wrong return types, missing null checks, and argument type mismatches.

Linting with strict rulesets (ESLint with security plugin, Ruff with all rules enabled, Clippy for Rust) catches deprecated APIs, unused variables, unreachable code, and many security antipatterns.

SQL injection scanning (Bandit for Python, Semgrep with SQL rules) specifically targets the most dangerous class of AI-generated security vulnerabilities. This should be a blocking check on every PR.

Dependency verification (a simple script that checks whether all imported packages exist in your lock file) catches hallucinated packages. This costs nothing to implement and catches a surprisingly common error.

Snapshot testing for API contracts catches the case where AI code changes the shape of a response, breaking downstream consumers. If your function’s return type changes, a snapshot test fails.
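The idea can be sketched without a snapshot library by reducing a response to its shape and comparing that against a stored copy. `shape_of`, the `SNAPSHOT` value, and the sample responses are all hypothetical; list handling here looks only at the first element, which is a simplification.

```python
def shape_of(value):
    """Reduce a JSON-like value to its structure: keys and type names only."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in sorted(value.items())}
    if isinstance(value, list):
        return [shape_of(value[0])] if value else []
    return type(value).__name__


# Stored snapshot of the expected response shape (an assumption for this example).
SNAPSHOT = {"id": "int", "amount_cents": "int", "tags": ["str"]}

old_response = {"id": 7, "amount_cents": 1250, "tags": ["refund"]}
new_response = {"id": 7, "amount": 12.5, "tags": ["refund"]}  # field renamed, type changed

print(shape_of(old_response) == SNAPSHOT)  # True
print(shape_of(new_response) == SNAPSHOT)  # False
```

The second comparison fails even though the new response looks plausible on its own, which is the point: downstream consumers depend on the shape, not just on the values.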

The cost of adding these checks to your CI pipeline is a few hours of setup. The cost of not having them is a steady trickle of bugs in production.

The Right Mental Model

The most effective mental model for reviewing AI code: treat it like a pull request from a prolific contractor who is technically skilled but has never seen your codebase before, does not know your business rules, and will not be around to fix the bugs they introduce.

That contractor writes clean-looking code. It compiles. It often works. But they do not know that user_id in your system can be negative (legacy data), that the payments API returns amounts in cents not dollars, or that the created_at field uses UTC in the database but local time in the API response.

AI-generated code is missing the same thing that contractor is missing: context about your specific system. The review process exists to supply that context through human judgment.

The Bottom Line

AI-generated code is here to stay, and the productivity gains are real. But the code requires review with a specific, learnable methodology. The seven failure modes are predictable. The review checklist takes minutes. The automated checks are cheap to implement.

The developers who will thrive are not the ones who refuse AI coding tools, nor the ones who accept their output uncritically. They are the ones who use AI to write the first draft and apply rigorous, systematic review to turn that draft into production-quality code.

Every line of code is a liability until proven otherwise. That principle applies regardless of who — or what — wrote it.
