
Building Price Intelligence from 118K Grocery Data Points

Turn grocery price data into a system that detects fake sales and shrinkflation. Built with Python, SQLite, and basic statistics on 118K data points.

Technology · Python · Data Engineering · Web Scraping · Analytics · Side Projects

Every Australian knows the feeling: you see “HALF PRICE!” on a block of cheese, feel clever for grabbing it, then wonder if the “original” price was inflated last week. Turns out, with enough data, you can answer that question definitively.

I’ve been scraping Coles and Woolworths prices daily for the past two months. What started as a simple “which store is cheaper?” comparison has grown into something more interesting — a price intelligence system with 118,000+ data points across 10,000 products. Here’s what I learned building it, and the patterns that emerged from the data.

The Problem with Grocery Price Comparisons

Most price comparison apps solve the wrong problem. They tell you today’s price at Store A vs Store B. Useful, but shallow.

The real questions shoppers should ask are:

  1. Is this actually a sale? Or did the retailer raise the price two weeks ago and “discount” it back?
  2. Is this product shrinking? Same price, smaller pack — the invisible price increase.
  3. When should I buy this? Some products follow predictable discount cycles.
  4. What’s the long-term trend? Is this category getting more expensive faster than inflation?

Answering these requires historical data. And historical data requires patience — you can’t buy it, you have to collect it.

Architecture: Keep It Boring

The system is deliberately simple:

Daily cron job (6 AM, 6 PM)
    → Scrape product listings (requests + BeautifulSoup)
    → Normalize and deduplicate
    → Store in SQLite (one row per product per day per store)
    → Run analysis queries
    → Push alerts via messaging API

No Kafka. No Spark. No “data lake.” SQLite handles 118K rows without breaking a sweat, and the entire database fits in 50MB.

Schema Design

CREATE TABLE price_history (
    id INTEGER PRIMARY KEY,
    product_id TEXT NOT NULL,
    store TEXT NOT NULL,
    name TEXT NOT NULL,
    price_cents INTEGER NOT NULL,
    unit_price_cents INTEGER,
    pack_size TEXT,
    was_price_cents INTEGER,
    on_special BOOLEAN DEFAULT 0,
    scraped_at TEXT NOT NULL,
    UNIQUE(product_id, store, scraped_at)
);

CREATE INDEX idx_product_store ON price_history(product_id, store);
CREATE INDEX idx_scraped ON price_history(scraped_at);
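One way the twice-daily runs can stay idempotent against this schema is SQLite's upsert (`ON CONFLICT ... DO UPDATE`, available since SQLite 3.24). A sketch of how a scraped row might be written, not necessarily how my writer is implemented:

```python
import sqlite3

def record_price(conn: sqlite3.Connection, row: dict) -> None:
    """Insert one observation; re-running the scraper for the same
    (product, store, scraped_at) overwrites the earlier row instead of
    failing on the UNIQUE constraint."""
    conn.execute(
        """
        INSERT INTO price_history
            (product_id, store, name, price_cents, unit_price_cents,
             pack_size, was_price_cents, on_special, scraped_at)
        VALUES (:product_id, :store, :name, :price_cents, :unit_price_cents,
                :pack_size, :was_price_cents, :on_special, :scraped_at)
        ON CONFLICT(product_id, store, scraped_at) DO UPDATE SET
            price_cents = excluded.price_cents,
            was_price_cents = excluded.was_price_cents,
            on_special = excluded.on_special
        """,
        row,
    )
    conn.commit()
```

The `UNIQUE(product_id, store, scraped_at)` constraint does the heavy lifting: a crashed-and-restarted cron run can safely replay the same day without duplicating rows.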

The key insight: store was_price_cents separately. This is the retailer’s claimed “original” price — and comparing it against your own historical data reveals when it’s fiction.

Fake Sale Detection

This turned out to be the most interesting feature. The algorithm is simple:

def is_fake_sale(product_id: str, store: str, db) -> dict:
    """
    A sale is 'fake' if the 'was' price is higher than any price
    we've seen in the last 30 days before the sale started.
    """
    current = get_current_price(product_id, store, db)
    if not current.on_special or not current.was_price_cents:
        return {"fake": False, "reason": "not on special"}

    history = get_price_history(
        product_id, store, db,
        days=30, before=current.special_start_date
    )
    if not history:
        return {"fake": False, "reason": "insufficient history"}

    max_recent_price = max(h.price_cents for h in history)

    if current.was_price_cents > max_recent_price:
        inflation = current.was_price_cents - max_recent_price
        return {
            "fake": True,
            "reason": f"'Was' price inflated by ${inflation/100:.2f}",
            "was_price": current.was_price_cents,
            "actual_max": max_recent_price,
            "current": current.price_cents,
            "real_discount": max(0, max_recent_price - current.price_cents),
        }

    return {"fake": False, "real_discount": current.was_price_cents - current.price_cents}

In two months of data, roughly 15-20% of “specials” show inflated reference prices. The pattern is consistent: price creeps up over 2-3 weeks, then gets “discounted” back to or near its original level.

Price Trend Detection

For each product, I calculate a simple linear regression over the last 90 days of unit prices:

import numpy as np

def price_trend(product_id: str, store: str, db, days: int = 90):
    """Fit a straight line to recent unit prices and annualize the slope."""
    history = get_price_history(product_id, store, db, days=days)
    if len(history) < 14:
        return None  # need at least two weeks of observations

    x = np.arange(len(history))
    y = np.array([h.unit_price_cents for h in history])
    slope, _intercept = np.polyfit(x, y, 1)  # slope is in cents per day
    annual_change_pct = (slope * 365 / np.mean(y)) * 100

    return {
        # Treat anything within ±0.1 cents/day as noise, not a trend.
        "direction": "up" if slope > 0.1 else "down" if slope < -0.1 else "stable",
        "annual_change_pct": round(annual_change_pct, 1),
        "confidence": min(len(history) / 60, 1.0),
    }
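A quick sanity check of the slope-to-annual-percentage conversion, standalone on synthetic data (using np.polyfit directly, outside the function above):

```python
import numpy as np

# 90 daily unit prices rising exactly 1 cent/day from 1000 cents.
days = np.arange(90)
prices = 1000 + 1.0 * days

slope, intercept = np.polyfit(days, prices, 1)           # slope == 1.0 cents/day
annual_change_pct = slope * 365 / np.mean(prices) * 100  # 365 cents/yr on a ~1044.5c mean

print(round(annual_change_pct, 1))  # prints 34.9
```

Note the denominator is the mean price over the window, so the same absolute slope reads as a larger percentage trend on cheap products than on expensive ones.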

What the Data Shows

Aggregating trends across categories reveals clear patterns. Dairy and fresh meat are outpacing general inflation. Cleaning products are in a price war — good time to stockpile. Beverages are trending down, likely due to competitive pressure from store brands.

These are directional patterns from a limited dataset. Two months of data won’t capture full seasonal cycles.

Best Time to Buy

Some products follow remarkably predictable discount cycles:

from datetime import datetime

def discount_cycle(product_id: str, store: str, db, days: int = 90):
    """Find the average number of days between specials."""
    history = get_price_history(product_id, store, db, days=days)
    special_dates = [h.scraped_at for h in history if h.on_special]

    if len(special_dates) < 2:
        return None

    gaps = []
    for i in range(1, len(special_dates)):
        gap = (special_dates[i] - special_dates[i - 1]).days
        if gap > 1:  # consecutive days are the same special, not a new cycle
            gaps.append(gap)

    if not gaps:
        return None

    avg_gap = sum(gaps) / len(gaps)
    last_special = special_dates[-1]
    days_since = (datetime.now() - last_special).days

    return {
        "avg_cycle_days": round(avg_gap),
        "days_until_next": max(0, round(avg_gap - days_since)),
        "confidence": min(len(gaps) / 5, 1.0),
    }

Popular products like major soft drinks, biscuit brands, and cereal go on special every 2-4 weeks like clockwork. Patience literally pays.

Lessons Learned

1. SQLite is enough. At 10K products × 2 stores × 60 days, you have ~1.2M potential rows. SQLite handles this in milliseconds. The entire DB backs up with cp.

2. Unit price is the only honest metric. Absolute price is meaningless without pack size context. Always compare per-kg or per-litre. This is also how you catch shrinkflation — the unit price jumps even when the sticker price stays flat.

3. Scraping is the easy part. The hard part is product identity resolution. The same product after a rebrand needs fuzzy matching on name + category + pack size.

4. Two months isn’t enough. The real value comes at 6-12 months, when you can see full seasonal cycles. I’m playing the long game.

5. The data tells stories. Price movements aren’t random. They reflect supply chain decisions, competitive dynamics, and deliberate pricing strategies. Once you see the patterns, you can’t unsee them.

What’s Next

The natural extension is a mobile interface — scan a barcode, see the price history, get a “buy now or wait” recommendation. The data layer is the hard part, and that’s already running.

For now, the daily alert system catches the best deals and flags suspicious “sales.” It’s a decent return on a weekend project.
