Know Your Business: Building an AI Engine That Verifies 1,900 Companies a Day

March 8, 2026 · 5 min read
This is the second post in a series on building business process automation at scale. Last time: infrastructure. This time: what runs on top of it.

Last week I covered the Sovereign Stack, the bare-metal infrastructure that powers Walsenburg Tech. This week's subject is the workload it hosts: a fully automated Know Your Business (KYB) verification engine that processes roughly 1,900 companies a day to determine which ones are real, which are active, and which are worth your time.

The project grew out of a simple need during a job search: evaluating companies more rigorously than a basic Google search allows, so you know whether each one is genuine, active, and worth considering.

The Data: Where Do 300,000 Companies Come From?

Before you can verify anything, you need a list. Here's where ours came from:

  • State Secretaries of State — business filings from Colorado and beyond
  • SEC EDGAR — public company filings
  • FDIC & NCUA — banks and credit unions
  • Wikidata — large employers with structured data

That aggregation produced roughly 300,000 companies, most with only basic information: name, state, and maybe registered agent details. The challenge is figuring out which ones deserve verification attention.

The Verification Tiers: Not All Companies Are Equal

Processing 300,000 companies at once isn't practical, so we built a tiered priority system. The engine works through them in order:

  1. Large companies with websites (100+ employees) — highest likelihood of remote roles
  2. Public/SEC-filed companies — structured hiring, real ATS platforms
  3. Large companies without websites — potential opportunities buried in state filings
  4. Mid-size companies with websites (50–99) — decent remote possibility
  5. Regional tiers — local leads for the Southern Colorado corridor
  6. Everything else — small companies, the long tail

The system finishes each tier before moving on. On a typical day, about 1,900 companies get processed.
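The tier ordering above can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the engine's actual code: the `Company` fields, the `assign_tier` and `processing_order` names, and the `is_regional` flag are all hypothetical, and the post does not say how the regional tiers are defined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:
    name: str
    employees: Optional[int] = None   # None when the source has no headcount
    website: Optional[str] = None
    sec_filer: bool = False
    is_regional: bool = False         # hypothetical flag for the regional tiers


def assign_tier(c: Company) -> int:
    """Return 1 (highest priority) through 6 (long tail), following the list above."""
    size = c.employees or 0
    if size >= 100 and c.website:
        return 1
    if c.sec_filer:
        return 2
    if size >= 100:
        return 3
    if 50 <= size <= 99 and c.website:
        return 4
    if c.is_regional:
        return 5
    return 6


def processing_order(companies: list[Company]) -> list[Company]:
    # sorted() is stable, so companies keep their input order within a tier.
    return sorted(companies, key=assign_tier)
```

Sorting by tier number gives the "finish each tier before moving on" behavior for free: the daily batch is simply the next slice of the sorted list.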

The Checks: Eight Signals Per Company

Each company runs through eight automated verification checks:

  • Website Liveness — Is the domain live, parked, or dead? Follows redirects with the Camoufox stealth browser and captures the page title and final URL.
  • Career Page Discovery — An AI agent navigates the website looking for careers or jobs pages and identifies the ATS platform (Greenhouse, Lever, Workday, and 17 others).
  • Contact Extraction — Pulls emails, phone numbers, and contact page URLs from the site.
  • SEC Filing Check — Cross-references against EDGAR for public filings.
  • Web Search — Searches DuckDuckGo for the company name to find its primary web presence and result snippets.
  • Google Maps — Checks whether the business is marked as permanently closed.
  • Yelp — Verifies operational status (closed or active).
  • Facebook — Confirms existence of a business page.

Each signal is stored individually, so we can re-verify specific data points without rerunning the full check.
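One way to picture per-signal storage is a record that keeps each check's result and timestamp under its own key. The sketch below is an assumption about the shape of the data, not the engine's schema: the signal identifiers, `SignalResult`, `CompanyRecord`, and `reverify` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict

# The eight checks from the list above; the identifier spellings are assumptions.
SIGNALS = (
    "website_liveness", "career_page", "contact_extraction", "sec_filing",
    "web_search", "google_maps", "yelp", "facebook",
)


@dataclass
class SignalResult:
    value: Any
    checked_at: datetime


@dataclass
class CompanyRecord:
    name: str
    signals: Dict[str, SignalResult] = field(default_factory=dict)

    def record(self, signal: str, value: Any) -> None:
        if signal not in SIGNALS:
            raise ValueError(f"unknown signal: {signal}")
        self.signals[signal] = SignalResult(value, datetime.now(timezone.utc))


def reverify(record: CompanyRecord, signal: str, check: Callable[[str], Any]) -> None:
    """Rerun a single check and overwrite only that signal's stored result."""
    record.record(signal, check(record.name))
```

Because every signal carries its own `checked_at`, a stale Yelp status can be refreshed without touching the other seven results.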

What Worked

Stealth browsing was essential. Early career-page scraping attempts got blocked by CAPTCHAs left and right. Switching to Camoufox — a stealth-modified Firefox variant — bypassed most detection. Organizations actually want you to find their career pages; bot detection is just collateral damage from security vendors.

ATS detection was surprisingly reliable. Applicant tracking systems leave distinctive fingerprints: specific URL patterns, JavaScript variables, and meta tags. Greenhouse embeds load from /embed/job_board, Lever boards are hosted on jobs.lever.co, and Workday runs on myworkdayjobs.com. With patterns this distinctive, detection is nearly automatic.
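Fingerprint matching of this kind can be sketched with a few regular expressions. This is an illustration covering only the three platforms named above, and `ATS_PATTERNS` and `detect_ats` are hypothetical names. The Lever entry matches the hosted jobs.lever.co domain, which is my choice for the sketch; a real detector would cover all 20 platforms and also look at JavaScript variables and meta tags.

```python
import re
from typing import Optional

# Illustrative URL fingerprints only; not the engine's actual pattern table.
ATS_PATTERNS = {
    "greenhouse": re.compile(r"/embed/job_board", re.IGNORECASE),
    "lever": re.compile(r"jobs\.lever\.co", re.IGNORECASE),
    "workday": re.compile(r"myworkdayjobs\.com", re.IGNORECASE),
}


def detect_ats(final_url: str, html: str = "") -> Optional[str]:
    """Return the first ATS whose fingerprint appears in the URL or page source."""
    haystack = f"{final_url}\n{html}"
    for platform, pattern in ATS_PATTERNS.items():
        if pattern.search(haystack):
            return platform
    return None
```

Searching the page source as well as the final URL matters because many companies embed the ATS board inside a careers page on their own domain.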

Tiered processing was the right call. Instead of randomly sampling from 300,000 companies, prioritizing by size and web presence gave us useful results on day one. The 884 large companies with websites were verified within hours.

What Didn't Work

DuckDuckGo rate limits hit hard. Daily search quotas ran out faster than expected. The engine now tracks daily usage and throttles back as the quota runs low; early iterations would simply halt mid-batch.
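A quota-aware throttle might look something like the sketch below. Everything here is assumed for illustration: the `DailySearchBudget` class, the `daily_limit` value (not a published DuckDuckGo figure), and the rule that stretches the delay as the budget shrinks.

```python
from datetime import date
from typing import Optional


class DailySearchBudget:
    """Track searches per day and slow down as the quota runs out."""

    def __init__(self, daily_limit: int, base_delay: float = 1.0):
        self.daily_limit = daily_limit   # hypothetical quota
        self.base_delay = base_delay     # seconds between searches at full budget
        self.used = 0
        self.day = date.today()

    def _roll_over(self, today: date) -> None:
        # Reset the counter when the calendar day changes.
        if today != self.day:
            self.day, self.used = today, 0

    def next_delay(self, today: Optional[date] = None) -> Optional[float]:
        """Return seconds to wait before the next search, or None if exhausted."""
        self._roll_over(today or date.today())
        remaining = self.daily_limit - self.used
        if remaining <= 0:
            return None                  # caller skips web search, keeps other checks
        self.used += 1
        # Delay grows as the remaining fraction shrinks, capped at 10x the base.
        fraction_left = remaining / self.daily_limit
        return self.base_delay * min(10.0, 1.0 / fraction_left)
```

Returning `None` instead of raising lets the batch carry on with the other seven checks when search is exhausted, rather than halting mid-batch.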

State filing data is noisy. A huge number of "companies" in state databases are dissolved, delinquent, or shell entities. We added filters for names containing "DELINQUENT" or "DISSOLVED," but there are almost certainly more patterns we haven't caught yet.
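The name filter is simple enough to sketch in a few lines. The two markers come from the paragraph above; `INACTIVE_MARKERS`, `looks_inactive`, and `filter_active` are hypothetical names, and matching on whole words (rather than any substring) is my assumption.

```python
import re

# "DELINQUENT" and "DISSOLVED" come from the post; extend the tuple as new
# patterns turn up in the filings.
INACTIVE_MARKERS = ("DELINQUENT", "DISSOLVED")
_INACTIVE_RE = re.compile(
    r"\b(?:" + "|".join(INACTIVE_MARKERS) + r")\b", re.IGNORECASE
)


def looks_inactive(name: str) -> bool:
    """True when the registered name carries one of the inactive markers."""
    return bool(_INACTIVE_RE.search(name))


def filter_active(names):
    """Drop names flagged as inactive, preserving input order."""
    return [n for n in names if not looks_inactive(n)]
```

Keeping the markers in one tuple means each newly discovered pattern is a one-line change.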

Not all websites are what they seem. Parked domains, squatters, and GoDaddy placeholder pages look "live" if you only check HTTP status codes. The engine now inspects page titles and content to distinguish real sites from parked domains.
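Here is a rough sketch of what inspecting titles and content can mean in practice. The marker phrases are hypothetical examples rather than the engine's real list, `classify_site` is an assumed name, and treating an empty 200 response as a placeholder is my assumption too.

```python
from typing import Optional

# Hypothetical marker phrases; a production list would be tuned against real pages.
PARKED_MARKERS = (
    "domain is for sale",
    "buy this domain",
    "this domain is parked",
    "parked free",
)


def classify_site(status_code: Optional[int], title: str, body: str) -> str:
    """Classify a fetched page as 'dead', 'parked', or 'live'."""
    # No response at all, or an HTTP error, counts as dead.
    if status_code is None or status_code >= 400:
        return "dead"
    text = f"{title}\n{body}".lower()
    if any(marker in text for marker in PARKED_MARKERS):
        return "parked"
    if not text.strip():
        return "parked"   # assumption: a 200 with no title or content is a placeholder
    return "live"
```

The status code only rules sites out; it takes the text check to rule them in.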

The Numbers So Far

Signal                     Count
Companies verified         3,500+
Career pages found         569
ATS platforms detected     20 distinct
Contact emails found       336
Websites confirmed live    621

The system keeps adding roughly 500 companies to the verified pool on a rolling basis.

What's Next

The immediate goal is finishing the top tiers — about 3,500 large and public companies most likely to have remote roles. After that, we start grinding through the long tail.

The bigger play is turning this into a living dataset that updates itself. Companies change — they post new roles, remove career pages, get acquired. One-time scraping goes stale fast. Re-verification on a rolling basis keeps the data current.
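Rolling re-verification comes down to picking the stalest records first. The sketch below assumes a mapping from company ID to last-checked time; the function name, the 30-day freshness window, and the batch size are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def due_for_reverification(
    last_checked: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(days=30),   # hypothetical freshness window
    batch_size: int = 100,                     # hypothetical batch size
) -> List[str]:
    """Return up to batch_size company IDs, never-checked first, then oldest."""
    # Never-checked companies sort ahead of everything else.
    epoch = datetime.min.replace(tzinfo=timezone.utc)
    stale = [
        (checked or epoch, company_id)
        for company_id, checked in last_checked.items()
        if checked is None or now - checked > max_age
    ]
    stale.sort()
    return [company_id for _, company_id in stale[:batch_size]]
```

All timestamps are expected to be timezone-aware UTC values so they compare cleanly with `now`.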

Processing 1,900 companies a day sounds great on paper. But how good are those results, really? In the next post, I'll dig into where the agents fail — and why the same pattern shows up in every automation system I've built over the last 20 years.

The full verification engine is open source — check it out on GitHub.

Have ideas for signals we should add? Get in touch.