user@boten-dev ~/posts/privacy-first-analytics $ cat README.md

Privacy-first analytics: how hard can it be?

2026-04-12 5 min read learning privacy de-Googling self-host

I wanted to know if anyone actually reads this site. That's the main thing right now - am I writing into the void or not? But beyond that, I'd also like to understand the traffic better - which posts get the most views, where visitors come from, what devices they use. No tracking individuals, no behavioral profiling, no selling data to advertisers. Just enough data to know what's working and what isn't.

The plan was straightforward: build a small service, embed something on the page, done by the weekend. What could go wrong?

Two hours and a reality check

Roughly two hours into reading about what qualifies as PII - Personally Identifiable Information - I realized this "weekend project" was going to take a bit longer. PII is any data that can be used to identify a specific individual, either on its own or combined with other information. IP addresses and device fingerprints are obvious examples, but what surprised me is how combinations of seemingly harmless data points - screen resolution + timezone + language - can form a unique profile.

Under the GDPR, the definition is even broader - "personal data" covers any information relating to an identifiable natural person. And an IP address? The Court of Justice of the EU has ruled it counts as personal data. Suddenly my "just log a few things" approach started looking a lot more complicated.

Heads up: Data collection - even for something as innocent as page view counting - is strictly regulated in the EU and increasingly worldwide. GDPR, the ePrivacy Directive, and their equivalents in other jurisdictions all have something to say about what you can collect, how you store it, and what you owe your visitors. Ignorance is not a defense. If you're building anything that touches user data, do your homework first.

This is very much a learning-by-doing situation. I didn't start with a legal textbook - I started with "I want a number" and worked backwards from there. And the deeper I dug, the clearer it became that this is part one of what's going to be a longer series. The first chapter of a privacy-meets-analytics journey.

So many ways to not use JavaScript

One thing I knew from the start: I want this site to stay clean HTML and CSS. No JavaScript. So the usual analytics scripts were immediately off the table - no Google Analytics snippet, no Plausible JS tag, nothing that runs in the browser.

That constraint led me down a research path I didn't expect. It turns out there are quite a few ways to track page views without any client-side scripting:

Server-side middleware - if you control the server, you can log every request in Flask's after_request hook. Sees 100% of traffic, zero overhead, invisible to ad-blockers. The best option if you own the stack.

Server log analysis - parsing Nginx or Gunicorn access logs with tools like GoAccess. Same visibility as middleware but requires a post-processing pipeline and careful handling of raw IPs in the logs.

Redirect tracking - routing all links through a /r?url=... endpoint that logs the click before redirecting. Only tracks link clicks though, not page views.

CSS-based tracking - using background-image URLs in CSS to trigger server requests on page render. Creative, but unreliable across browsers and easy to block.

CSS media query tracking - different URLs in @media blocks to detect device type or color scheme preference. Useful as a supplement, not a standalone solution.

Tracking pixel - a tiny invisible image that triggers a server request when the page loads. Works without JavaScript, works with static hosting, and is the approach that resonated with me the most.
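The middleware option is concrete enough to sketch. Here's a hedged, framework-agnostic version as a plain WSGI wrapper (a Flask after_request hook would look similar); the path filtering and in-memory counter are illustrative choices, not a finished design:

```python
from collections import Counter


class PageViewMiddleware:
    """WSGI middleware that counts page views per path.

    Sketch only: a real version would persist counts and
    filter out bots, not just static assets.
    """

    def __init__(self, app):
        self.app = app
        self.counts = Counter()

    def __call__(self, environ, start_response):
        # Count page requests, skip assets like stylesheets.
        path = environ.get("PATH_INFO", "/")
        if not path.startswith("/static/"):
            self.counts[path] += 1
        return self.app(environ, start_response)


# Any WSGI app can be wrapped this way:
def site(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<h1>hello</h1>"]

app = PageViewMiddleware(site)
```

Because it wraps the app itself, it sees every request the server does - which is exactly why it's the strongest option when you own the stack.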

The pixel approach

The idea is deceptively simple. You embed a 1x1 transparent GIF on every page. When a browser loads the page, it requests that image from your server. Your server logs the request - what page was visited, when, from what kind of device - and returns the tiny image. The visitor sees nothing. No scripts run. No cookies are set.

In pseudocode, the server-side flow looks roughly like this:

on request for /pixel.gif:
    if DNT or GPC header is set:
        return the GIF, log nothing

    if request looks like a bot:
        return the GIF, log nothing

    page     = query parameter "page"
    referrer = extract domain from Referer header
    device   = classify User-Agent as mobile/desktop/tablet
    language = first two letters of Accept-Language
    country  = GeoIP lookup from IP, then discard the IP
    time     = current date + 4-hour block

    increment counter for (page, referrer, device, language, country, time)
    return the 1x1 transparent GIF

The key is what happens to the data. The IP address is used only for a GeoIP country lookup - it's never written to disk. The raw User-Agent string is reduced to a device class and thrown away. For external referrers, only the domain name is kept - the full URL path and query string are discarded. Why? Because URLs can contain surprisingly personal information. Think search queries baked into the address bar, session tokens, email addresses in password reset links, someone's shoe size or their mother-in-law's credit card PIN. The point is - there's a lot to think about before deciding what to keep and what to throw away.
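Those reduction steps are small enough to write out. A stdlib-only Python sketch - the device heuristics and the 4-hour bucketing are my own illustrative choices, not a standard:

```python
from datetime import datetime
from urllib.parse import urlparse


def classify_device(user_agent: str) -> str:
    """Reduce a raw User-Agent string to a coarse device class."""
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobile" in ua or "android" in ua or "iphone" in ua:
        return "mobile"
    return "desktop"


def primary_language(accept_language: str) -> str:
    """Keep only the two-letter code of the first listed language."""
    if not accept_language:
        return ""
    return accept_language.split(",")[0].strip()[:2].lower()


def referrer_domain(referer: str) -> str:
    """Keep only the domain; the path and query may carry PII."""
    return urlparse(referer).netloc if referer else ""


def time_bucket(now: datetime) -> str:
    """Date plus a 4-hour block, e.g. '2026-04-12T08' covers 08:00-11:59."""
    hour = (now.hour // 4) * 4
    return f"{now.date().isoformat()}T{hour:02d}"
```

Each function takes something identifying in, and hands something deliberately blunt back out - that asymmetry is the whole design.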

The simplest privacy policy: You can't leak personal data if you never store it. The system does collect anonymous aggregates - page view counts, device classes, country codes - and those could technically be exposed. But since none of it can be tied back to a real person, the worst case scenario is someone finding out that twelve people from Germany read your blog post on a Tuesday.

The architecture is simple: static site on one side, a small Flask app on a VPS on the other. The pixel lives on the VPS. The static site just has an <img> tag pointing to it. No cookies, no ETag, no Last-Modified - nothing that could be used as a tracking identifier. The response headers explicitly forbid caching so that every page load generates a fresh request.
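The response side is small enough to show in full. A sketch of the pixel bytes and the anti-caching headers - the exact header combination is my choice for belt-and-suspenders coverage of old and new caches:

```python
import base64

# A minimal 1x1 transparent GIF, decoded once at startup.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

# Deliberately no ETag, no Last-Modified, no Set-Cookie: nothing the
# browser could echo back later as a de facto tracking identifier.
PIXEL_HEADERS = {
    "Content-Type": "image/gif",
    "Content-Length": str(len(PIXEL_GIF)),
    "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
    "Pragma": "no-cache",
    "Expires": "0",
}
```

The no-store directive matters twice over here: it keeps every page load generating a fresh request (so the counts stay honest), and it keeps caches from holding anything about the visit.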

Respecting Do Not Track

Even though this system doesn't process personal data (and therefore DNT isn't legally required), I want to honor it anyway. If a browser sends DNT: 1 or the newer Sec-GPC: 1 header (Global Privacy Control), the server returns the pixel without logging anything. Yes, it means losing a few percent of data. That's a trade-off I'm comfortable with - the whole point of this project is to respect the visitor's choices.

Although - should I really skip the visit entirely? I could still count it as a page view but with nulls for all user-derived fields like country, device, or language. That way the total visit count stays accurate while still respecting the "don't profile me" intent behind the flag. Honestly, I'm not sure yet.
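Either way, the check itself is one line per header. A sketch that supports both policies - the count_opted_out flag and the record_view helper are hypothetical names of mine, not settled API:

```python
from collections import Counter


def privacy_opt_out(headers: dict) -> bool:
    """True if the visitor sent DNT: 1 or Sec-GPC: 1."""
    return headers.get("DNT") == "1" or headers.get("Sec-GPC") == "1"


def record_view(page, fields, headers, counter, count_opted_out=False):
    """fields: (referrer, device, language, country), derived elsewhere.

    Default policy: an opted-out visit logs nothing at all.
    With count_opted_out=True, the view still counts but every
    user-derived field is nulled out.
    """
    if privacy_opt_out(headers):
        if not count_opted_out:
            return
        fields = (None, None, None, None)
    counter[(page, *fields)] += 1
```

Flipping one default between the two policies is the entire difference, which is probably why the decision is easy to keep postponing.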

The VPS situation

I built this site to be as static as possible. Plain HTML files, a CSS stylesheet, no server required - just drop the files on any hosting platform and you're done. But a tracking pixel needs a server to receive requests and store data. And that means I'm going to need a VPS.

I've been putting this off, but the tracking pixel is just the beginning. I have other projects in the pipeline that need to actually run somewhere to provide real value - not everything can live only as a Docker image on DockerHub or a repository on GitHub. And I'd rather not expose anything to my private home server. A cheap VPS at a European provider like Hetzner (not an ad, please don't come for me, UOKiK, hehe) is looking increasingly inevitable.

What's next

I have an idea for how to implement this so that each page gets its own dedicated image path instead of pointing to the same file with different query parameters. No ?page=/blog/post in the markup - just a clean image reference unique to each page. But that's a topic for the next post in this series.

There will be more posts about this: the implementation, the deployment, maybe even a dashboard, and the edge cases I haven't thought of yet.

If there's one thing I've taken away from this initial deep dive, it's that "measure twice, cut once" applies doubly here. The core tension of this project is maximizing the usefulness of collected data while preserving as much visitor privacy as possible. And of course, doing all of it within the bounds of the law. Those goals are fundamentally at odds, and getting the balance right takes more than a weekend of coding. It takes reading, thinking, and being honest about what you actually need versus what you could technically grab.

And so the ethical web analytics saga begins.