Nan An

Data Scientist, Software Developer, Mathematician, Physicist

The interactive globe displays data gathered from my analytics system. The system also stalls malicious requests.

Building My Analytics System

Requirements

I wanted to build an efficient, scalable analytics system to track user activity on my website. Since I have a Cloudflare free-tier account, these resources are available: Workers, KV caches, and D1 databases. Additionally, the system should serve API endpoints for a real-time 3D visualization of visitor locations. I designed the following specifications.

Specifications

The database logs individual requests made to the website. The schema includes geolocation, time, IP, session ID, and a generated user ID. If two records share the same user ID, the system treats them as coming from the same user.
Users are tracked based on their session ID and IP address. If a new request has a matching session ID, it is assigned the same user ID. However, if only the IP matches but the session ID does not, the system assumes the session ID has been tampered with. In this case, the session ID is reset to the most recent one recorded in the database.
To reduce load, database reads and writes should be cached and batched as much as possible.
Malicious requests should be redirected to separate endpoints that stall them, slowloris-style.
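
The user-linking rule above can be sketched as follows. This is a minimal in-memory model rather than the actual Worker code: the names Record and resolve_identity, the plain list standing in for the database, and the handling of the no-match case (minting a fresh session ID and user ID) are all assumptions made for illustration. It also reads "the most recent one recorded" as the most recent session ID recorded for the matching IP.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One logged request (hypothetical shape; geolocation and time omitted)."""
    ip: str
    session_id: str
    user_id: str


def resolve_identity(records: list[Record], ip: str, session_id: Optional[str]) -> Record:
    """Apply the linking rule: session match first, then IP match, else new user."""
    # Scan newest-first so the most recent record wins when several match.
    if session_id is not None:
        for rec in reversed(records):
            if rec.session_id == session_id:
                # Known session: reuse its user ID and keep the session ID.
                return Record(ip, session_id, rec.user_id)
    for rec in reversed(records):
        if rec.ip == ip:
            # Same IP, unknown session: treat the session ID as tampered with
            # and reset it to the most recent one recorded for this IP.
            return Record(ip, rec.session_id, rec.user_id)
    # Assumption: with no match at all, mint a fresh session ID and user ID.
    return Record(ip, uuid.uuid4().hex, uuid.uuid4().hex)
```

Checking the session ID before the IP means a visitor who changes networks but keeps their session is still recognised, which matches the order in which the two rules are stated.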

This mechanism tries to link sessions by IP and session ID. The session IDs are signed to avoid manipulation. A single user can generate multiple identities by changing IP and clearing browsing data, but in general this should give a reasonable approximation of user activity.
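
One common way to sign a session ID is an HMAC over the ID, appended to it as a suffix. The sketch below assumes HMAC-SHA256 and a token format of "session_id.signature"; the actual scheme used by the system is not stated here, and SECRET_KEY and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical secret; a real deployment would load this from a secret store.
SECRET_KEY = b"replace-with-a-real-secret"


def sign_session_id(session_id: str, key: bytes = SECRET_KEY) -> str:
    """Return 'session_id.signature', where signature is an HMAC-SHA256 hex digest."""
    sig = hmac.new(key, session_id.encode(), hashlib.sha256).hexdigest()
    return f"{session_id}.{sig}"


def verify_session_token(token: str, key: bytes = SECRET_KEY) -> Optional[str]:
    """Return the session ID if the signature is valid, otherwise None."""
    session_id, sep, sig = token.rpartition(".")
    if not sep:
        return None  # no separator, so this cannot be a signed token
    expected = hmac.new(key, session_id.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(sig.encode(), expected.encode()):
        return session_id
    return None


def new_session_token() -> str:
    """Create a fresh random session ID and sign it."""
    return sign_session_id(secrets.token_hex(16))
```

A client that edits the session ID without knowing the key cannot produce a matching signature, so verify_session_token returns None and the request can fall back to IP-based matching.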

Implementation

The system follows a pipeline structure to optimize performance. The frontend logs user activity and sends data to Worker A, which writes to a KV cache. To prevent excessive writes, Worker B (running on a cron job) periodically batches and writes data to the SQL database. Requests follow these general pathways:

Architecture Visualization
w: write, r: read, A -> B: A initiates requests to B

                                          (cron)
Front End --> Worker A -w-> KV Cache <-r- Worker B -w-> SQL Database
                 |          ^                               ^
                 |          |                               |
                 +---- r ---+--- r (if cache-miss) ---------+
                 |
                 | (Malicious Requests)
                 V
               nginx (redirects to slowloris-style endpoints)

To filter out bot traffic, the tracking script only activates on user interactions such as 'mousemove' and 'scroll' events. Once triggered, it sends a request to Worker A.
Worker A logs the request, along with the IP address and session ID. The geographic location is inferred from the IP address using Cloudflare's API.
Worker A writes the request to the KV cache and looks up the user ID. It first tries the cache; on a miss, it reads from the database. Worker A then sends the user ID back to the frontend.
Worker B, on a cron trigger that runs every 30 minutes, collects the writes registered in the cache and writes them to the database in a single batch.
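
Worker B's batching step can be sketched as below. This is a local model under stated assumptions, not the deployed Worker: a dict stands in for the KV cache, an in-memory SQLite connection stands in for the SQL database, and the table name requests, its column names, and the JSON encoding of cached values are all hypothetical. Deleting flushed keys is also an assumption, made so that a later cron run does not insert the same record twice.

```python
import json
import sqlite3

# Hypothetical schema mirroring the fields listed in the specifications.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    user_id    TEXT NOT NULL,
    session_id TEXT NOT NULL,
    ip         TEXT NOT NULL,
    location   TEXT,
    time       TEXT NOT NULL
)
"""


def flush_writes(kv: dict[str, str], db: sqlite3.Connection) -> int:
    """Move every cached write (keys starting with 'W-') into the database in one batch."""
    keys = [k for k in kv if k.startswith("W-")]
    rows = []
    for k in keys:
        rec = json.loads(kv[k])
        rows.append(
            (rec["user_id"], rec["session_id"], rec["ip"], rec.get("location"), rec["time"])
        )
    with db:  # one transaction: the whole batch commits, or none of it does
        db.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows)
    # Remove cached writes only after the commit has succeeded.
    for k in keys:
        del kv[k]
    return len(rows)
```

Grouping the inserts into one transaction is what keeps the number of database writes low: each cron run costs one batch regardless of how many requests were logged in the interval.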

KV Cache Design

Writes are cached as W-[ID]-[No]: Value entries and remain valid for one hour. [No] is a sequence number that distinguishes multiple writes for the same ID. Worker B periodically collects these cached writes and batches them into the database.

Reads are cached as R-[ID]: Value entries, since only the most recent read needs to be retained. These entries expire after five minutes.
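
The key layout and expiry times can be modelled with a small TTL cache. The sketch below is a stand-in for a KV namespace, not the Cloudflare API; TTLCache, write_key, read_key, and the injectable clock are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional

WRITE_TTL = 60 * 60  # write entries stay valid for one hour
READ_TTL = 5 * 60    # read entries expire after five minutes


def write_key(user_id: str, seq: int) -> str:
    """Key for the seq-th cached write of a given ID: W-[ID]-[No]."""
    return f"W-{user_id}-{seq}"


def read_key(user_id: str) -> str:
    """Key for the single cached read of a given ID: R-[ID]."""
    return f"R-{user_id}"


class TTLCache:
    """Minimal stand-in for a KV namespace with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[float, str]] = {}

    def put(self, key: str, value: str, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries
            return None
        return value
```

Because each write gets its own key, two writes for the same ID never overwrite each other, while a read key is simply replaced by the next read.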

When Worker A attempts to read data, there are three potential sources: the write-cache, the read-cache, and the database. It first checks the write-cache for the most up-to-date data. If no entry is found, it falls back to the read-cache, and as a last resort, queries the database. This approach means that some retrieved data may be up to five minutes outdated, but this is an acceptable trade-off for improved performance.
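
The three-level lookup can be sketched as follows. Dicts stand in for the two caches and a callback stands in for the database query; the function name lookup, the returned source label, and the step that fills the read-cache after a database hit are assumptions for illustration.

```python
from typing import Callable, Optional


def lookup(
    user_id: str,
    write_cache: dict[str, str],
    read_cache: dict[str, str],
    query_db: Callable[[str], Optional[str]],
) -> tuple[Optional[str], str]:
    """Return (value, source), checking write-cache, then read-cache, then the database."""
    # 1. Write-cache: the entry with the highest sequence number [No] is the newest.
    prefix = f"W-{user_id}-"
    pending = [
        k for k in write_cache
        if k.startswith(prefix) and k[len(prefix):].isdigit()
    ]
    if pending:
        newest = max(pending, key=lambda k: int(k[len(prefix):]))
        return write_cache[newest], "write-cache"
    # 2. Read-cache: may be up to five minutes stale.
    key = f"R-{user_id}"
    if key in read_cache:
        return read_cache[key], "read-cache"
    # 3. Database as a last resort; remember the result for later reads (assumption).
    value = query_db(user_id)
    if value is not None:
        read_cache[key] = value
    return value, "database"
```

Comparing sequence numbers as integers rather than strings matters once an ID has ten or more pending writes, since "10" sorts before "9" as text.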

In the end

Thanks to Cloudflare’s generous free-tier offerings, all the resources used in this system are free.

Email: ann5@mcmaster.ca (Institution), admin@annan.eu.org (Website)

About Me

I’m a Data Scientist and Software Developer pursuing an M.Sc. in Computer Science at McMaster University. My research focuses on software modeling, and I have experience in machine learning, AI, and data-driven problem-solving. I’ve worked on large-scale data projects at the Ministry of Health and Front Row Ventures. I also contribute to open-source projects like Neo4j.

Coming from a background in mathematics and physics, I love optimization and thought experiments. I am an outdoor person and I love nature.

Z = tr(e^(-βĤ)) (The Canonical Partition Function)

Trefoil Knot 3₁

Experience

Sep 2024 - Present: Data Scientist
Health Science Data Branch, Ontario Ministry of Health
May 2024 - Jun 2024: Guest Researcher
Hausdorff Institute of Mathematics, Universität Bonn
Sep 2023 - Aug 2024: Software Developer (Research Assistant)
Centre for Software Verification, McMaster University
Sep 2021 - Aug 2024: Teaching Assistant
ShanghaiTech (2021 - 2023), McMaster (2023 - 2024)

Education

Sep 2023 - Present: Computer Science M. Sc. (Thesis-based)
McMaster University, Supervised & Funded by Dr. Jacques Carette
Sep 2019 - Jun 2023: Physics B. Sc.
ShanghaiTech University, Thesis: Majorana Fermions and its use in Quantum Computing
Read technical summary of my bachelor's thesis >

Data Science Projects

AGNN: A Graph Neural Network for Fraud Detection [Python]
DSLIB: A Library for Data Science Automation [Python]

Software Projects

TFSI: A tagless final-style interpreter [Haskell]
OpenTelemetry: A telemetry communication software [Haskell/C++]
AnRegex: An NFA Regex Engine [Python]
AnRT: A 3D Ray Tracing Rendering Engine [C++]

Notes (Mathematics)

I find the connection between topological and algebraic structures beautiful.

Notes on Differential Topology
This note was written while I was taking a course by Dr. Daniel Skodlerack. I was the only student in the course, which made the experience quite unique.
Notes on Natural Transformation
Solutions to Selected Real Analysis Problems