How We Achieved 96.5% HTS Classification Accuracy with AI

Harmonize classifies products to the full 10-digit HTS statistical suffix with 96.5% accuracy across 200 real-world SKUs. This is not a curated demo set — it is the Atlas benchmark, a diverse catalog of consumer and industrial products tested against the same RAG pipeline available to every Harmonize user. Here is how the system works, how we measured it, and where it still fails.

The Problem: Manual HTS Classification Is Slow, Inconsistent, and Error-Prone

The Harmonized Tariff Schedule of the United States contains over 18,000 unique tariff lines at the 10-digit level. Classifying a single product requires navigating General Rules of Interpretation (GRI), hundreds of Section and Chapter Notes, and thousands of heading and subheading provisions. A licensed customs broker might spend 15–45 minutes on a single complex classification.

The consequences of getting it wrong are not theoretical. CBP issues penalties under 19 U.S.C. § 1592 for negligent misclassification, with fines up to the domestic value of the goods. In fiscal year 2024, CBP collected over $80 million in penalties related to classification and valuation errors. Even when errors do not trigger formal penalties, misclassification leads to overpaid duties, delayed shipments, and compliance risk that compounds across every entry.

The challenge is not just accuracy — it is consistency. Two brokers classifying the same product may arrive at different codes depending on their experience, the reference materials they consult, and how they interpret ambiguous GRI provisions. There is no single "correct" workflow for classification, which means quality depends heavily on individual expertise.

Our Approach: RAG Over 73,962 CBP CROSS Rulings

Harmonize uses Retrieval-Augmented Generation (RAG) to ground LLM classification in real precedent. Rather than relying on the language model's parametric knowledge of tariff codes — which is unreliable for a domain this specialized — we retrieve relevant CBP binding rulings and tariff provisions before classification.

The system works in five stages:

  1. Supervisor hints. A keyword-based system scans the product description for material, function, and category signals. It generates chapter-level suggestions that bias the retrieval step toward the most relevant parts of the tariff schedule. For example, a product described as a "men's knitted cotton polo shirt" triggers hints for Chapters 61 (knitted apparel) and 52 (cotton), with Chapter 61 weighted higher because the product is a finished garment.
  2. Vector retrieval. The product description is embedded and matched against two ChromaDB vector stores: one containing tariff provisions with their official descriptions, and another containing 73,962 CBP Customs Rulings Online Search System (CROSS) rulings. The retrieval returns the most semantically similar tariff codes and the most relevant prior rulings where CBP classified similar products.
  3. Re-ranking and enhancement. Retrieved results are re-ranked using cross-encoder scoring to surface the most classification-relevant matches. Additional enhancements include GRI chain reasoning (applying the six General Rules of Interpretation in order), HyDE (Hypothetical Document Embeddings) for improved retrieval on vague descriptions, and composite goods detection for multi-material products.
  4. LLM classification. The retrieved context — tariff provisions, CBP ruling excerpts, GRI analysis, and chapter notes — is assembled into a structured prompt sent to GPT-4o. The model returns a classification with the full 10-digit HTS code, a confidence score, and step-by-step reasoning that traces the classification through the GRI hierarchy.
  5. Enrichment. The classification is enriched with specific CBP ruling citations that support the code, confidence calibration based on retrieval quality and ruling agreement, and duty rate information from the tariff schedule.
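In code, the supervisor-hint stage (step 1) can be sketched as a weighted keyword map. This is a toy illustration, not Harmonize's actual rule set; the keywords, chapters, and weights below are assumptions:

```python
# Minimal sketch of the supervisor-hint stage, assuming a simple
# keyword-to-(chapter, weight) map; the real signal set is richer.
HINT_RULES = {
    "knitted": ("61", 2.0),   # knitted apparel, weighted higher for garments
    "shirt":   ("61", 2.0),
    "cotton":  ("52", 1.0),   # cotton as a material signal
    "steel":   ("73", 1.0),   # articles of iron or steel
}

def supervisor_hints(description: str) -> list[tuple[str, float]]:
    """Scan a product description and return (chapter, weight) hints,
    sorted so higher-weighted chapters bias retrieval first."""
    scores: dict[str, float] = {}
    for word in description.lower().split():
        if word in HINT_RULES:
            chapter, weight = HINT_RULES[word]
            scores[chapter] = scores.get(chapter, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

supervisor_hints("men's knitted cotton polo shirt")
# Chapter 61 outscores Chapter 52, matching the polo-shirt example above.
```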

Every classification references real CBP rulings. This is not the LLM generating plausible-sounding citations — these are actual ruling numbers from the CROSS database that were retrieved based on semantic similarity to the product being classified. Users can verify each citation against CBP's public database.

The Benchmark: 200 Real-World SKUs from the Atlas Dataset

The Atlas benchmark consists of 200 products drawn from real commercial catalogs. The dataset was designed to be representative of the classification challenges brokers encounter in practice:

  • Consumer goods: Apparel, footwear, electronics, kitchenware, toys, personal care products
  • Industrial products: Fasteners, raw materials, machinery components, chemical compounds
  • Composite goods: Products made from multiple materials requiring essential character analysis under GRI 3
  • Edge cases: Products that sit at chapter boundaries, require Chapter Note exclusions, or depend on specific material percentages

Each product has a verified ground-truth HTS code at the 10-digit statistical suffix level. Verification was performed by cross-referencing CBP rulings, published tariff schedules, and broker-confirmed classifications.

The benchmark evaluates accuracy at multiple levels: 4-digit heading, 6-digit subheading, 8-digit tariff rate line, and full 10-digit statistical suffix. This matters because a classification that gets the heading right but the statistical suffix wrong may still result in the correct duty rate — but will cause errors in trade statistics reporting and may trigger CBP scrutiny.
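Because HTS codes nest by prefix, multi-level scoring reduces to prefix matching. A minimal sketch, assuming predicted and ground-truth codes are strings with optional dots:

```python
def accuracy_at_levels(pairs, levels=(4, 6, 8, 10)):
    """Compute prefix-match accuracy at each HTS digit level.
    `pairs` is a list of (predicted, ground_truth) code strings; dots
    are stripped so '6109.10.0012' and '6109100012' compare equally."""
    def digits(code):
        return code.replace(".", "")
    results = {}
    for n in levels:
        hits = sum(1 for pred, truth in pairs
                   if digits(pred)[:n] == digits(truth)[:n])
        results[n] = hits / len(pairs)
    return results

# A prediction right at 8 digits but wrong at the statistical suffix
# counts toward the 4-, 6-, and 8-digit scores but not the 10-digit one:
accuracy_at_levels([("6109.10.0012", "6109.10.0040")])
```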

Results: 96.5% at 10-Digit, 100% on CBLE

Across the 200-item Atlas benchmark, Harmonize achieved the following accuracy rates:

  • 4-digit heading: 99.0% (198/200)
  • 6-digit subheading: 98.0% (196/200)
  • 8-digit tariff line: 97.5% (195/200)
  • 10-digit statistical suffix: 96.5% (193/200)

These results were generated using the production classification engine with no benchmark-specific tuning, no manual overrides, and no post-hoc corrections. Every product was classified through the same pipeline available to Harmonize users.

In addition, Harmonize scored 100% on the classification section of the Customs Broker License Exam (CBLE), correctly answering all 22 product classification questions from the April 2025 and October 2024 exams. The CBLE benchmark validates the engine's GRI reasoning and Chapter Note application on CBP-vetted questions. Full CBLE results are published in a separate benchmark article.

How the RAG Pipeline Works: Embedding, Retrieval, Re-Ranking, Classification

Interpreting the accuracy results requires understanding the pipeline architecture. Each stage addresses a specific failure mode in LLM-based classification:

Why not just ask GPT-4o directly?

Language models have general knowledge of tariff schedules, but that knowledge is unreliable at the 10-digit level. In testing, direct GPT-4o classification (without retrieval) achieved roughly 60–70% accuracy at the 8-digit level. The model knows that cotton t-shirts go in Chapter 61, but it cannot reliably distinguish between 6109.10.0012 (men's, knitted, cotton, not ornamented) and 6109.10.0040 (women's, knitted, cotton, not ornamented) without access to the actual tariff provisions and their statistical annotations.

What retrieval adds

By embedding the product description and searching against the tariff schedule and CBP rulings, the system retrieves the specific provisions that apply to the product. The LLM then classifies within the context of those real provisions rather than relying on parametric memory. This is analogous to how a human broker works: they do not memorize all 18,000-plus tariff lines; they look up the relevant section and apply GRI rules to the provisions they find.
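The retrieval step can be illustrated with a toy stand-in: bag-of-words vectors in place of learned embeddings, and an in-memory dictionary in place of the ChromaDB stores. The provision texts below are illustrative shorthand, not actual tariff language:

```python
from collections import Counter
from math import sqrt

# Toy corpus standing in for the tariff-provision vector store.
PROVISIONS = {
    "6109.10.0012": "t-shirts singlets knitted cotton men's",
    "8211.91.5030": "table knives kitchen cutlery stainless steel handles",
    "9018.90.8000": "surgical instruments appliances medical",
}

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' standing in for a learned encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(description: str, k: int = 2):
    """Return the k provisions most similar to the product description."""
    q = embed(description)
    ranked = sorted(PROVISIONS.items(),
                    key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

retrieve("stainless steel kitchen knife")
# the kitchen-cutlery provision ranks first on shared terms
```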

What re-ranking adds

Embedding-based retrieval returns results that are semantically similar, but similarity does not always equal classification relevance. A product described as "stainless steel kitchen knife" might retrieve provisions for surgical instruments (also stainless steel, also knives) alongside the correct kitchen cutlery provisions. Cross-encoder re-ranking scores each retrieved provision against the product description to surface the most classification-relevant matches.
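The distinction can be sketched with a toy pair scorer standing in for the cross-encoder: unlike embedding similarity, it sees the query and each candidate together and can weight use-context signals above shared materials. The keyword list and weights are assumptions for illustration:

```python
# Words that signal how a product is used, not what it is made of.
USE_CONTEXT = {"kitchen", "table", "surgical", "medical", "apparel"}

def pair_score(query: str, provision: str) -> float:
    """Score a (query, provision) pair jointly, the role a cross-encoder
    plays; here a toy rule rewards shared use-context words extra."""
    q, p = set(query.lower().split()), set(provision.lower().split())
    overlap = len(q & p)
    context_bonus = 2 * len(q & p & USE_CONTEXT)
    return overlap + context_bonus

def rerank(query: str, candidates: list[str]) -> list[str]:
    return sorted(candidates, key=lambda c: pair_score(query, c), reverse=True)

rerank("stainless steel kitchen knife",
       ["surgical knives stainless steel",
        "kitchen knives cutlery stainless steel"])
# the kitchen candidate wins despite both sharing "stainless steel"
```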

What CBP rulings add

CBP rulings are the closest thing to ground truth in tariff classification. When CBP has already classified a "battery-powered electric toothbrush" as 8509.80.0095, that ruling provides direct precedent. The RAG pipeline retrieves these rulings and includes them in the classification prompt, giving the LLM concrete examples of how CBP classifies similar products. This is especially valuable for edge cases where the tariff provisions are ambiguous.

Failure Analysis: 7 Misclassified Items — Where AI Still Struggles

Of the 200 Atlas benchmark items, 7 were misclassified at the 10-digit level. Analyzing these failures reveals patterns in where AI classification remains unreliable:

1. Statistical suffix ambiguity (3 errors)

Three products were classified to the correct 8-digit tariff line but assigned the wrong 10-digit statistical suffix. Statistical suffixes are not published with detailed descriptions in the tariff schedule — they are maintained by the U.S. International Trade Commission for trade data purposes and often require knowledge of industry conventions or specific physical characteristics (e.g., fiber diameter ranges for synthetic textiles) that are not always captured in product descriptions.

2. Chapter Note exclusions (2 errors)

Two products triggered Chapter Note exclusions that moved them from the intuitively correct chapter to a different one. For example, a product that appears to be an instrument (Chapter 90) but is excluded by a Note that reclassifies it as a machine (Chapter 84). These exclusions require reading and correctly applying specific legal text that may not be well-represented in the retrieval results if the exclusion is narrowly scoped.

3. Material composition edge cases (2 errors)

Two products required precise material composition analysis to determine classification. One involved a textile product where the classification depended on whether cotton exceeded 50% by weight — information that was present in the product description but required the LLM to perform a calculation and apply the result to a specific tariff provision. The other involved a composite product where the "essential character" determination under GRI 3(b) was genuinely ambiguous.

What these failures have in common

All seven errors share a characteristic: they require information or reasoning that goes beyond semantic similarity. The retrieval step successfully found the relevant provisions, but the classification step either misapplied the provisions or lacked the specific domain knowledge needed to select among very similar alternatives. These are the same areas where human brokers disagree with each other — the genuinely hard classification decisions.

What 96.5% Means in Practice

For a 200-SKU product catalog, 96.5% accuracy means 193 products classified correctly at the full 10-digit level on the first pass. The remaining 7 would require broker review and correction. At an average of 20 minutes per manual classification, the 193 correct results save approximately 64 hours of broker time. The 7 requiring review add roughly 2.5 hours.

Importantly, Harmonize provides confidence scores with every classification. The 7 misclassified items were disproportionately in the lower confidence tiers, meaning the system's self-assessment correctly flagged most of them as uncertain. A broker using Harmonize can prioritize their review time on low-confidence classifications rather than reviewing every result.
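A review-triage step along these lines can be sketched as follows; the 0.85 threshold is illustrative, and in practice it would be calibrated against held-out classifications:

```python
def triage(classifications, threshold=0.85):
    """Split results into auto-accept and review queues by confidence,
    so broker time goes to the uncertain tail first. Each item is a
    dict with at least a 'confidence' key (an assumed shape)."""
    accept = [c for c in classifications if c["confidence"] >= threshold]
    review = [c for c in classifications if c["confidence"] < threshold]
    review.sort(key=lambda c: c["confidence"])  # least certain first
    return accept, review
```

On a 200-item catalog this would surface the handful of low-confidence results for review while the rest pass straight through.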

96.5% is not 100%. We do not claim that AI replaces human expertise in tariff classification. What we claim is that it handles the straightforward 96.5% accurately and flags the remaining 3.5% for expert review — transforming the broker's role from classifying every product from scratch to reviewing and validating AI-generated classifications with supporting evidence.

Try the Classification Engine

Harmonize.ai achieves 96.5% accuracy at the 10-digit HTS level, backed by 73,962 CBP rulings. Try a free classification and see the reasoning, confidence scores, and ruling citations for yourself.

Try Harmonize Free

View full benchmark results • Free 6-digit classification • No account required


This article is for informational purposes only and does not constitute legal or customs brokerage advice. Importers and brokers should consult with a licensed customs broker or trade attorney for guidance on specific classification and compliance decisions. Harmonize.ai is a classification research tool operating under 19 U.S.C. § 1641 — we provide research support, not customs brokerage services.