Nan An

Data Scientist, Software Developer, Mathematician, Physicist

The interactive globe displays data gathered from my analytics system. The system also stalls malicious requests.



© Nan An 2025 | Sapiens dominabitur astris

Building My Analytics System

Requirements

I wanted to build an efficient, scalable analytics system to track user activity on my website. Since I have a Cloudflare free-tier account, the following resources are available: Workers, KV caches, and D1 databases. The system should also serve API endpoints for a real-time 3D visualization of visitor locations. I designed the following specifications.

Specifications

This mechanism tries to link sessions by IP address and session ID. The session IDs are signed to prevent tampering. A single user can still generate multiple identities by changing IP address and clearing browsing data, but in general this should give a reasonable approximation of user activity.
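As a rough illustration of what signing a session ID can look like, here is a minimal sketch using HMAC-SHA256 from the Python standard library. The post does not say which signing scheme the system actually uses, so the algorithm, the `id.signature` token layout, the `SECRET_KEY` value, and all function names are assumptions made for this example.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical key for illustration; a real deployment would load it from a secret store.
SECRET_KEY = b"replace-with-a-real-secret"


def _signature(session_id: str, key: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the session ID."""
    return hmac.new(key, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_session_id(session_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a token of the assumed form '<session_id>.<signature>'."""
    return f"{session_id}.{_signature(session_id, key)}"


def verify_session_token(token: str, key: bytes = SECRET_KEY) -> Optional[str]:
    """Return the session ID if the signature matches, otherwise None."""
    session_id, sep, signature = token.rpartition(".")
    if not sep or not session_id:
        return None  # no signature part present
    expected = _signature(session_id, key)
    # compare_digest runs in constant time, so it does not leak how many characters matched.
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return session_id
    return None


def new_session_token() -> str:
    """Create a random session ID and sign it."""
    return sign_session_id(secrets.token_hex(16))
```

A client that edits the ID part of the token cannot produce a matching signature without the key, so `verify_session_token` rejects the modified token.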

Implementation

The system follows a pipeline structure to optimize performance. The frontend logs user activity and sends data to Worker A, which writes to a KV cache. To prevent excessive writes, Worker B (running on a cron job) periodically batches and writes data to the SQL database. Requests follow these general pathways:

Architecture Visualization
w: write, r: read, A -> B: A initiates requests to B

                                          (cron)
Front End --> Worker A -w-> KV Cache <-r- Worker B -w-> SQL Database
                 |          ^                               ^
                 |          |                               |
                 +---- r ---+--- r (if cache-miss) ---------+
                 |
                 | (Malicious Requests)
                 V
               nginx (redirects to slow-lories)

KV Cache Design

Writes are cached using W-[ID]-[No]: Values and remain valid for one hour. I included [No] to handle multiple writes for the same ID. Worker B periodically collects these cached writes and batches them into the database.
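The write path described above can be sketched as follows. This is an in-memory model, not Cloudflare Workers code: `FakeKV` stands in for the KV cache, an in-memory SQLite database stands in for the SQL database, and the `events` table, the function names, and the way `[No]` is passed in are all hypothetical. Deleting keys after a flush is also an assumption; the post only says that Worker B collects and batches the cached writes.

```python
import json
import sqlite3
import time

WRITE_TTL_SECONDS = 3600  # write entries remain valid for one hour


class FakeKV:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl):
        self._data[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.time():
            return None
        return entry[0]

    def list_keys(self, prefix):
        now = time.time()
        return sorted(
            key for key, (_, expires_at) in self._data.items()
            if key.startswith(prefix) and expires_at > now
        )

    def delete(self, key):
        self._data.pop(key, None)


def make_db():
    """Create the hypothetical events table in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (visitor_id TEXT, seq INTEGER, payload TEXT)")
    return db


def cache_write(kv, visitor_id, seq, payload):
    """Worker A side: store one event under the key W-[ID]-[No]."""
    kv.put(f"W-{visitor_id}-{seq}", json.dumps(payload), WRITE_TTL_SECONDS)


def flush_writes(kv, db):
    """Worker B side: move every cached write into the database in one batch."""
    keys = kv.list_keys("W-")
    rows = []
    for key in keys:
        value = kv.get(key)
        if value is None:
            continue  # expired between listing and reading
        # Split from the right so IDs that contain hyphens stay intact.
        visitor_id, seq = key[2:].rsplit("-", 1)
        rows.append((visitor_id, int(seq), value))
    with db:  # a single transaction for the whole batch
        db.executemany(
            "INSERT INTO events (visitor_id, seq, payload) VALUES (?, ?, ?)", rows
        )
    for key in keys:
        kv.delete(key)  # assumed cleanup so entries are not inserted twice
    return len(rows)
```

Because each write gets its own key, several events for the same ID can sit in the cache side by side until the next cron run picks them up.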

Reads are cached using R-[ID]: Values, as only the most recent read needs to be retained. These entries expire after five minutes.

When Worker A attempts to read data, there are three potential sources: the write-cache, the read-cache, and the database. It first checks the write-cache for the most up-to-date data. If no entry is found, it falls back to the read-cache, and as a last resort, queries the database. This approach means that some retrieved data may be up to five minutes outdated, but this is an acceptable trade-off for improved performance.
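The three-step lookup can be sketched like this. Again this is an in-memory model under stated assumptions: `TTLCache` stands in for the KV cache, a plain dictionary stands in for the database, and `read_visitor` is a hypothetical name. Taking the write-cache entry with the highest `[No]` as the most recent, and filling the read-cache after a database query, are guesses at the behaviour rather than details given in the post.

```python
import time

READ_TTL_SECONDS = 300  # read entries expire after five minutes


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

    def keys_with_prefix(self, prefix, now):
        return [
            key for key, (_, expires_at) in self._data.items()
            if key.startswith(prefix) and expires_at > now
        ]


def read_visitor(kv, database, visitor_id, now=None):
    """Return (value, source), trying write-cache, read-cache, then database."""
    now = time.time() if now is None else now

    # 1. Write-cache: the entry with the highest [No] is assumed to be the newest.
    prefix = f"W-{visitor_id}-"
    write_keys = [
        key for key in kv.keys_with_prefix(prefix, now)
        if key[len(prefix):].isdigit()  # skip longer IDs that merely share the prefix
    ]
    if write_keys:
        latest = max(write_keys, key=lambda key: int(key[len(prefix):]))
        return kv.get(latest, now), "write-cache"

    # 2. Read-cache: may be up to five minutes old.
    cached = kv.get(f"R-{visitor_id}", now)
    if cached is not None:
        return cached, "read-cache"

    # 3. Database: last resort; cache the result for later reads (assumed).
    value = database.get(visitor_id)
    if value is not None:
        kv.put(f"R-{visitor_id}", value, READ_TTL_SECONDS, now)
    return value, "database"
```

In this model, a value changed in the database while an `R-[ID]` entry is still live is not visible until that entry expires, which is the five-minute staleness the text accepts.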

In the end

Thanks to Cloudflare’s generous free-tier offerings, all the resources used in this system are free.

Nan An


Email: ann5@mcmaster.ca (Institution), admin@annan.eu.org (Website)

About Me

I’m a Data Scientist and Software Developer pursuing an M.Sc. in Computer Science at McMaster University. My research focuses on software modeling, and I have experience in machine learning, AI, and data-driven problem-solving. I’ve worked on large-scale data projects at the Ministry of Health and Front Row Ventures. I also contribute to open-source projects like Neo4j.

Coming from a background in mathematics and physics, I love optimization and thought experiments. I am also an outdoor person and I love nature.

Z = tr(e^{-βĤ}) (The Canonical Partition Function)
Trefoil Knot 3_1

Experience

Education

Data Science Projects

Software Projects

Notes (Mathematics)

I find the connection between topological and algebraic structures beautiful.