About Client
What They Do
- Manufacture & distribute premium cannabis products across Michigan
- Operate retail locations for medical and recreational customers
- Manage a multi-brand portfolio with hundreds of active SKUs
- Fulfill orders across a multi-tier production-to-retail supply chain
Their Clients Include
Problem Statement
The Challenge
What Was Going Wrong
1. Zero Data Visibility for the Business
Pipeline failures were discovered hours or days after they happened — usually when a report looked wrong. There were no status alerts, no execution summaries, no audit trail. Data engineering was a black box to the people who needed the data most.
2. Inefficient Data Refresh Cycles
No mechanism existed to detect new vs. updated records. The default was full data reloads every time — burning API quotas, wasting compute, and still missing updates that fell between refresh windows. No one could trust whether the data was current.
3. Inconsistent and Unstructured API Data
Source data arrived as deeply nested JSON with no warning when the structure changed. Field names varied between endpoints, optional attributes appeared and disappeared, and schema mismatches regularly broke downstream processes.
4. Slow and Unreliable API Extraction
Paginated API calls ran sequentially — one at a time. Extraction jobs that should have taken minutes stretched to hours. Rate limits caused stalls, retries, and silent failures. Data freshness suffered across the board.
Solution Provided
What Is an ETL Pipeline in Simple Terms?
How the System Works:
01
Async Data Extraction
- Connects to all REST API endpoints simultaneously — not sequentially
- Handles pagination and rate limits with exponential backoff retry logic
- Cuts extraction time from 3+ hours to under 10 minutes
02
Schema Normalization & Flattening
- Deeply nested JSON flattened into clean, structured relational tables
- Enforces consistent schemas across every endpoint automatically
- Adapts to upstream schema changes and optional fields without breaking
03
Incremental Load with Change Detection
- Composite-key MERGE logic — only new and changed records are written
- Hourly incremental runs + overnight full refresh as a safety net
- No wasted compute, no missed updates, no stale data
04
Monitoring, Alerts & Audit Trail
- Every pipeline run logged end-to-end
- Automated email summaries sent to stakeholders after each execution
- Full audit trail — every record processed, every error, every retry
Results
System Capabilities
| Capability | Detail |
|---|---|
| Monthly Transactions Processed | 20,000+ |
| Manual Work Eliminated | 8+ hours per week |
| Data Freshness | Hourly (previously end-of-day) |
| Extraction Speed | Under 10 min (previously 3+ hours) |
| Deployment Time | 4 weeks from kickoff to production |
Operational Impact
- 8+ hours per week returned to the operations team — zero manual data wrangling
- Data freshness went from end-of-day to every hour, automatically confirmed
- Full pipeline transparency — business teams see health status without asking engineering
- Scalable foundation built to grow with DnK's transaction volume — no re-engineering needed
Before vs After
| Area | Before | After |
|---|---|---|
| Data Freshness | 12–24 hrs stale (end-of-day batch) | Hourly automated refreshes |
| Extraction Speed | 3+ hours (sequential calls) | Under 10 minutes (async parallel) |
| Manual Data Work | 8+ hrs/week cleanup | Zero manual intervention |
| Schema Handling | Broke on API changes | Auto-adapts to schema drift |
| Change Detection | Full reloads every cycle | MERGE — new/changed records only |
| Error Handling | Failures found hours later | Real-time retry + alerts within minutes |
| Inventory Visibility | Fragmented, no single source of truth | Unified across all brands, SKUs & locations |
| Audit Trail | None | Full log of every record, error & retry |
Tech Stack
| Layer | Technology | Role |
|---|---|---|
| Core Language | Python | Pipeline logic, data processing, orchestration |
| Async HTTP | aiohttp | High-speed parallel API extraction |
| Cloud Infrastructure | AWS | Hosting, compute, storage, scheduling |
| Data Source | REST APIs (LeafLink) | Orders, inventory, customer & fulfillment data |
| Database | PostgreSQL | Structured storage, query layer, analytics |
| Load Strategy | Composite-Key MERGE | Incremental change detection & upsert logic |
| Monitoring | Centralized Logging + Email Alerts | Pipeline health, execution summaries, audit trail |
| Scheduling | Hourly Incremental + Daily Full Refresh | Automated, zero-touch execution |