How We Measured 96.5% 10-digit HTS Accuracy
Skeptical by default? Good. This page documents exactly how Harmonize.ai was benchmarked on the Flexify ATLAS evaluation set of 200 real-world SKUs (run date: Jan 25, 2026), including known failure modes and the controls we used to prevent inflated results.
Benchmark Overview
We evaluated Harmonize.ai as an HTS tariff classification system using the Flexify ATLAS benchmark. Performance was scored at the 10-digit U.S. HTS level with deterministic inference settings and verifiable source retrieval.
How It Works: RAG + Hierarchical Classification Pipeline
Harmonize.ai combines a retrieval-augmented generation (RAG) stack with tariff-specific logic layers. The classifier routes at the chapter level first, then narrows through heading and subheading to the final 10-digit code.
1) Product normalization
Input descriptions are cleaned and standardized before inference, preserving material, function, and form-factor cues.
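As a rough sketch of what this step involves, here is a minimal Python example; the function name, abbreviation table, and cleanup rules are invented for illustration and are not Harmonize.ai's actual normalization logic.

```python
import re

def normalize_description(raw: str) -> str:
    """Standardize a raw SKU description before inference.
    Illustrative sketch only; the production rules are internal."""
    text = raw.lower().strip()
    text = re.sub(r"\s+", " ", text)         # collapse whitespace
    text = re.sub(r"[^\w\s%./-]", "", text)  # drop stray punctuation
    # Expand common abbreviations so material/function cues survive cleaning.
    abbreviations = {"ss": "stainless steel", "ml": "milliliter"}
    for abbr, full in abbreviations.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text

print(normalize_description("  SS Water Bottle, 500 ML, vacuum-insulated  "))
# -> "stainless steel water bottle 500 milliliter vacuum-insulated"
```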
2) 36-chapter detection
A dedicated detector proposes likely HTS chapters to reduce search space and improve routing precision.
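A minimal sketch of the routing idea, assuming a simple keyword-overlap scorer; the production detector spans 36 chapters, while the three-chapter cue table below is an invented stand-in.

```python
# Invented cue table: score candidate chapters by keyword overlap and keep
# the top k to shrink the downstream search space.
CHAPTER_CUES = {
    "39": {"plastic", "polymer", "acrylic"},
    "61": {"knit", "knitted", "crocheted"},
    "73": {"steel", "iron"},
}

def propose_chapters(normalized: str, k: int = 3) -> list[str]:
    tokens = set(normalized.split())
    scores = {ch: len(cues & tokens) for ch, cues in CHAPTER_CUES.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ch for ch in ranked[:k] if scores[ch] > 0]

print(propose_chapters("stainless steel water bottle vacuum-insulated"))
# -> ["73"]
```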
3) CBP retrieval (ChromaDB)
Relevant evidence is retrieved from 73,962 CBP rulings, then attached as citations for downstream reasoning.
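A minimal retrieval sketch using ChromaDB's Python client; the collection name, metadata schema, and filter values are assumptions for illustration. Filtering on the chapters proposed in step 2 is one way to keep retrieval scoped to the reduced search space.

```python
import chromadb

# Assumes a persisted collection of CBP ruling texts already embedded;
# "cbp_rulings" and the "chapter" metadata field are illustrative names.
client = chromadb.PersistentClient(path="./cbp_index")
rulings = client.get_or_create_collection(name="cbp_rulings")

results = rulings.query(
    query_texts=["stainless steel vacuum-insulated water bottle"],
    n_results=5,
    where={"chapter": {"$in": ["73", "96"]}},  # restrict to proposed chapters
)
for ruling_id, doc in zip(results["ids"][0], results["documents"][0]):
    print(ruling_id, doc[:80])
```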
4) o3-mini hierarchical classifier
The o3-mini model classifies at each tariff level in turn, with temperature = 0 for reproducible output.
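A minimal sketch of this call, assuming the OpenAI chat-completions interface and an invented prompt shape. Note that some reasoning-model endpoints reject explicit sampling parameters; temperature = 0 here mirrors the deterministic setting reported for this benchmark.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt shape: classify stepwise from heading to the full
# 10-digit code, constrained to the retrieved CBP evidence. Some
# reasoning-model endpoints reject sampling parameters; temperature=0
# reflects the deterministic setting reported on this page.
response = client.chat.completions.create(
    model="o3-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are an HTS classifier. Cite ruling IDs."},
        {"role": "user", "content": (
            "Product: stainless steel vacuum-insulated water bottle\n"
            "Candidate chapters: 73, 96\n"
            "Evidence: <retrieved CBP ruling excerpts>\n"
            "Return heading, subheading, and final 10-digit HTS code."
        )},
    ],
)
print(response.choices[0].message.content)
```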
5) 17 correction patterns
Post-classification rules catch recurring edge cases (units, composition ambiguities, chapter boundary slips).
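The 17 production patterns are not published, so the sketch below shows only the general shape such a rule can take; the pattern, code prefixes, and example are invented.

```python
import re

# Invented example of a chapter-boundary correction rule (not a real
# production pattern): if a description signals a vacuum vessel, move a
# heading-7323 prediction to heading 9617.
CORRECTIONS = [
    (re.compile(r"\bvacuum[- ]insulated\b"), "7323", "9617"),
]

def apply_corrections(description: str, hts_code: str) -> str:
    for pattern, wrong_prefix, fixed_prefix in CORRECTIONS:
        if pattern.search(description) and hts_code.startswith(wrong_prefix):
            return fixed_prefix + hts_code[len(fixed_prefix):]
    return hts_code

print(apply_corrections("vacuum-insulated bottle", "7323.93.0080"))
# -> "9617.93.0080"
```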
6) Confidence + audit trace
Each result includes confidence scoring and traceable ruling references for reviewer validation.
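A sketch of what an auditable result record might look like; the field names and placeholder ruling IDs below are illustrative, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClassificationResult:
    """Shape of an auditable output record (field names are illustrative)."""
    sku: str
    hts_code: str                 # full 10-digit code
    confidence: str               # "high" | "medium" | "low"
    ruling_citations: list[str] = field(default_factory=list)  # CBP ruling IDs

result = ClassificationResult(
    sku="SKU-0042",
    hts_code="9617.00.1000",
    confidence="high",
    ruling_citations=["<placeholder ruling ID 1>", "<placeholder ruling ID 2>"],
)
print(result)
```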
Results Breakdown
193 of 200 SKUs matched benchmark ground truth at the full 10-digit HTS code level.
Accuracy is reported as exact-match only (no partial-credit scoring).
95% confidence interval: approximately 93–98% based on 193/200 exact matches.
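For reference, a Wilson score interval (one common choice; the page does not state which method was used) computed from 193/200 reproduces the reported range:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

low, high = wilson_interval(193, 200)
print(f"{low:.3f} - {high:.3f}")  # ~0.930 - 0.983, i.e. roughly 93-98%
```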
Known failure modes in the remaining 3.5% (7 of 200 SKUs)
- Borderline chapter-selection cases for highly composite products.
- Descriptions with missing material ratios where multiple subheadings remain plausible.
- Specialized product nomenclature that required additional domain terms to disambiguate.
Confidence Scoring Logic
- High confidence: strong chapter consensus + highly similar rulings + stable top candidate.
- Medium confidence: minor ambiguity; top 2 candidates close in score.
- Low confidence: sparse precedent or conflicting evidence; escalation recommended.
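A sketch of how tiers like these could be assigned from the signals above; the input features and thresholds are invented, since the production cutoffs are not published.

```python
def confidence_tier(chapter_agreement: float, top_similarity: float,
                    margin: float) -> str:
    """Map the signals above to a tier. Thresholds are illustrative only."""
    # High: strong chapter consensus, highly similar rulings, stable top pick.
    if chapter_agreement >= 0.9 and top_similarity >= 0.8 and margin >= 0.15:
        return "high"
    # Low: sparse precedent or near-tied candidates; escalate to review.
    if margin < 0.05 or top_similarity < 0.5:
        return "low"
    # Medium: minor ambiguity, e.g. top 2 candidates close in score.
    return "medium"

print(confidence_tier(chapter_agreement=0.95, top_similarity=0.85, margin=0.2))
# -> "high"
```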
What This Means For You
If you're responsible for import compliance, this benchmark suggests you can automate a large share of routine classification work while preserving reviewability and control.
- Faster research: Reduce manual lookup time by starting from evidence-backed candidates.
- Cleaner audit trails: Attach CBP citation context to each classification decision.
- Smarter escalation: Route low-confidence or edge-case SKUs to broker/legal review early.
- Lower operational risk: Avoid black-box outputs by requiring verifiable source support.
Methodology Notes
We built this benchmark for decision-makers who require evidence, not hype. These controls are in place specifically to avoid over-claiming performance.
- Deterministic inference: Temperature fixed at 0 to ensure repeatability.
- Verifiable citations: Outputs rely on retrieved CBP ruling records, not fabricated references.
- No cherry-picking: Reported metrics are based on the full 200-SKU Flexify ATLAS evaluation set.
- Strict scoring: 10-digit exact-match scoring only; near-miss codes are not counted as correct (a minimal scorer is sketched after this list).
- Transparent limitations: Failure modes are documented to support realistic adoption planning.
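As referenced above, strict scoring reduces to a small exact-match function; this is an illustration of the scoring rule, not the benchmark harness itself.

```python
def exact_match_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Strict 10-digit scoring: normalize formatting, then require equality.
    No credit for near-miss codes that share only the leading digits."""
    def digits(code: str) -> str:
        return "".join(ch for ch in code if ch.isdigit())
    hits = sum(digits(p) == digits(t)
               for p, t in zip(predicted, truth, strict=True))
    return hits / len(truth)

# 193 exact matches out of 200 -> 0.965
print(exact_match_accuracy(["9617001000"] * 193 + ["0000000000"] * 7,
                           ["9617001000"] * 200))
```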
Test It on Your Own SKUs
Run a free classification and inspect the citation trail before you commit.
Try Free Classification