How We Measured 96.5% 10-digit HTS Accuracy
Skeptical by default? Good. This page documents exactly how Harmonize.ai was benchmarked on the Flexify ATLAS evaluation set of 200 real-world SKUs (run date: Jan 25, 2026), including known failure modes and the controls we used to prevent inflated results.
Benchmark Overview
We evaluated Harmonize.ai as an HTS tariff classification system using the Flexify ATLAS benchmark. Performance was scored at the 10-digit U.S. HTS level with deterministic inference settings and verifiable source retrieval.
How It Works: RAG + Hierarchical Classification Pipeline
Harmonize.ai combines a retrieval-augmented generation (RAG) stack with tariff-specific logic layers. The classifier routes at the chapter level first, then narrows through heading and subheading to the final 10-digit code.
1) Product normalization
Input descriptions are cleaned and standardized before inference, preserving material, function, and form-factor cues.
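As a rough sketch of what this step involves, here is a minimal Python example; the function name, abbreviation table, and cleanup rules are invented for illustration and are not Harmonize.ai's actual normalization logic.

```python
import re

def normalize_description(raw: str) -> str:
    """Standardize a raw SKU description before inference.
    Illustrative sketch only; the production rules are internal."""
    text = raw.lower().strip()
    text = re.sub(r"\s+", " ", text)         # collapse whitespace
    text = re.sub(r"[^\w\s%./-]", "", text)  # drop stray punctuation
    # Expand common abbreviations so material/function cues survive cleaning.
    abbreviations = {"ss": "stainless steel", "ml": "milliliter"}
    for abbr, full in abbreviations.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text

print(normalize_description("  SS Water Bottle, 500 ML, vacuum-insulated  "))
# -> "stainless steel water bottle 500 milliliter vacuum-insulated"
```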
2) 36-chapter detection
A dedicated detector proposes likely HTS chapters to reduce search space and improve routing precision.
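A minimal sketch of the routing idea, assuming a simple keyword-overlap scorer; the production detector spans 36 chapters, while the three-chapter cue table below is an invented stand-in.

```python
# Invented cue table: score candidate chapters by keyword overlap and keep
# the top k to shrink the downstream search space.
CHAPTER_CUES = {
    "39": {"plastic", "polymer", "acrylic"},
    "61": {"knit", "knitted", "crocheted"},
    "73": {"steel", "iron"},
}

def propose_chapters(normalized: str, k: int = 3) -> list[str]:
    tokens = set(normalized.split())
    scores = {ch: len(cues & tokens) for ch, cues in CHAPTER_CUES.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ch for ch in ranked[:k] if scores[ch] > 0]

print(propose_chapters("stainless steel water bottle vacuum-insulated"))
# -> ["73"]
```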
3) CBP retrieval (ChromaDB)
Relevant evidence is retrieved from 73,962 CBP rulings, then attached as citations for downstream reasoning.
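A minimal retrieval sketch using ChromaDB's Python client; the collection name, metadata schema, and filter values are assumptions for illustration. Filtering on the chapters proposed in step 2 is one way to keep retrieval scoped to the reduced search space.

```python
import chromadb

# Assumes a persisted collection of CBP ruling texts already embedded;
# "cbp_rulings" and the "chapter" metadata field are illustrative names.
client = chromadb.PersistentClient(path="./cbp_index")
rulings = client.get_or_create_collection(name="cbp_rulings")

results = rulings.query(
    query_texts=["stainless steel vacuum-insulated water bottle"],
    n_results=5,
    where={"chapter": {"$in": ["73", "96"]}},  # restrict to proposed chapters
)
for ruling_id, doc in zip(results["ids"][0], results["documents"][0]):
    print(ruling_id, doc[:80])
```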
4) o3-mini hierarchical classifier
The o3-mini model classifies at each tariff level in turn, with temperature = 0 for reproducible output.
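A minimal sketch of this call, assuming the OpenAI chat-completions interface and an invented prompt shape. Note that some reasoning-model endpoints reject explicit sampling parameters; temperature = 0 here mirrors the deterministic setting reported for this benchmark.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt shape: classify stepwise from heading to the full
# 10-digit code, constrained to the retrieved CBP evidence. Some
# reasoning-model endpoints reject sampling parameters; temperature=0
# reflects the deterministic setting reported on this page.
response = client.chat.completions.create(
    model="o3-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are an HTS classifier. Cite ruling IDs."},
        {"role": "user", "content": (
            "Product: stainless steel vacuum-insulated water bottle\n"
            "Candidate chapters: 73, 96\n"
            "Evidence: <retrieved CBP ruling excerpts>\n"
            "Return heading, subheading, and final 10-digit HTS code."
        )},
    ],
)
print(response.choices[0].message.content)
```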
5) 17 correction patterns
Post-classification rules catch recurring edge cases (units, composition ambiguities, chapter boundary slips).
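The 17 production patterns are not published, so the sketch below shows only the general shape such a rule can take; the pattern, code prefixes, and example are invented.

```python
import re

# Invented example of a chapter-boundary correction rule (not a real
# production pattern): if a description signals a vacuum vessel, move a
# heading-7323 prediction to heading 9617.
CORRECTIONS = [
    (re.compile(r"\bvacuum[- ]insulated\b"), "7323", "9617"),
]

def apply_corrections(description: str, hts_code: str) -> str:
    for pattern, wrong_prefix, fixed_prefix in CORRECTIONS:
        if pattern.search(description) and hts_code.startswith(wrong_prefix):
            return fixed_prefix + hts_code[len(fixed_prefix):]
    return hts_code

print(apply_corrections("vacuum-insulated bottle", "7323.93.0080"))
# -> "9617.93.0080"
```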
6) Confidence + audit trace
Each result includes confidence scoring and traceable ruling references for reviewer validation.
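A sketch of what an auditable result record might look like; the field names and placeholder ruling IDs below are illustrative, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClassificationResult:
    """Shape of an auditable output record (field names are illustrative)."""
    sku: str
    hts_code: str                 # full 10-digit code
    confidence: str               # "high" | "medium" | "low"
    ruling_citations: list[str] = field(default_factory=list)  # CBP ruling IDs

result = ClassificationResult(
    sku="SKU-0042",
    hts_code="9617.00.1000",
    confidence="high",
    ruling_citations=["<placeholder ruling ID 1>", "<placeholder ruling ID 2>"],
)
print(result)
```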
Results Breakdown
193 of 200 SKUs matched benchmark ground truth at the full 10-digit HTS code level.
Accuracy is reported as exact-match only (no partial-credit scoring).
95% confidence interval: approximately 93–98% based on 193/200 exact matches.
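For reference, a Wilson score interval (one common choice; the page does not state which method was used) computed from 193/200 reproduces the reported range:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

low, high = wilson_interval(193, 200)
print(f"{low:.3f} - {high:.3f}")  # ~0.930 - 0.983, i.e. roughly 93-98%
```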
Known failure modes in the remaining 3.5% (7 of 200 SKUs)
- Borderline chapter-selection cases for highly composite products.
- Descriptions with missing material ratios where multiple subheadings remain plausible.
- Specialized product nomenclature that required additional domain terms to disambiguate.
Confidence Scoring Logic
- High confidence: strong chapter consensus + highly similar rulings + stable top candidate.
- Medium confidence: minor ambiguity; top 2 candidates close in score.
- Low confidence: sparse precedent or conflicting evidence; escalation recommended.
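A sketch of how tiers like these could be assigned from the signals above; the input features and thresholds are invented, since the production cutoffs are not published.

```python
def confidence_tier(chapter_agreement: float, top_similarity: float,
                    margin: float) -> str:
    """Map the signals above to a tier. Thresholds are illustrative only."""
    # High: strong chapter consensus, highly similar rulings, stable top pick.
    if chapter_agreement >= 0.9 and top_similarity >= 0.8 and margin >= 0.15:
        return "high"
    # Low: sparse precedent or near-tied candidates; escalate to review.
    if margin < 0.05 or top_similarity < 0.5:
        return "low"
    # Medium: minor ambiguity, e.g. top 2 candidates close in score.
    return "medium"

print(confidence_tier(chapter_agreement=0.95, top_similarity=0.85, margin=0.2))
# -> "high"
```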
What This Means For You
If you're responsible for import compliance, this benchmark suggests you can automate a large share of routine classification work while preserving reviewability and control.
- Faster research: Reduce manual lookup time by starting from evidence-backed candidates.
- Cleaner audit trails: Attach CBP citation context to each classification decision.
- Smarter escalation: Route low-confidence or edge-case SKUs to broker/legal review early.
- Lower operational risk: Avoid black-box outputs by requiring verifiable source support.
Methodology Notes
We built this benchmark for decision-makers who require evidence, not hype. These controls are in place specifically to avoid over-claiming performance.
- Deterministic inference: Temperature fixed at 0 to ensure repeatability.
- Verifiable citations: Outputs rely on retrieved CBP ruling records, not fabricated references.
- No cherry-picking: Reported metrics are based on the full 200-SKU Flexify ATLAS evaluation set.
- Strict scoring: 10-digit exact-match scoring only; near-miss codes are not counted as correct (a minimal scorer is sketched after this list).
- Transparent limitations: Failure modes are documented to support realistic adoption planning.
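As referenced above, strict scoring reduces to a small exact-match function; this is an illustration of the scoring rule, not the benchmark harness itself.

```python
def exact_match_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Strict 10-digit scoring: normalize formatting, then require equality.
    No credit for near-miss codes that share only the leading digits."""
    def digits(code: str) -> str:
        return "".join(ch for ch in code if ch.isdigit())
    hits = sum(digits(p) == digits(t)
               for p, t in zip(predicted, truth, strict=True))
    return hits / len(truth)

# 193 exact matches out of 200 -> 0.965
print(exact_match_accuracy(["9617001000"] * 193 + ["0000000000"] * 7,
                           ["9617001000"] * 200))
```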
Test It on Your Own SKUs
Run a free classification and inspect the citation trail before you commit.
Try Free Classification