I wanted to build a project that deals with clickstream user events as well as transaction data, so I decided to simulate the data with Python, where I can tweak the behaviour, add volume, or add any number of attributes. This project is for learning purposes. It is the first step of my dockerized streaming pipeline project that uses Kafka, Spark Streaming, Minio, Postgres and Grafana.
Each new user follows a mini-journey:
- Registers with a name and location
- Clicks through 3–6 pages
- (Maybe) completes a transaction
All events are timestamped and streamed to Kafka topics.
Technologies Used
- Python – to simulate data using Faker, requests, and random
- RandomUser API – for realistic Indian user profiles
- Kafka – to stream user, click, and transaction events
- Docker – to containerize everything for reproducibility
Step 1: Simulating a New User
Each simulated user has:
- A UUID user ID
- A name, email, and city/state
- A registration timestamp
The simulator tries the https://randomuser.me/api/?nat=in API first, and falls back to Faker if the request fails.
{
  "user_id": "5df1b623-bf7f-40ad-a1e6-731c6a8fc639",
  "name": "Kanak Bisht",
  "email": "kanak.bisht@example.com",
  "location": "Bengaluru, Karnataka",
  "registered_at": "2025-04-02 20:51:12"
}
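Here's a minimal sketch of that fallback logic (the field names match the event above; the actual producer code in the repo may differ slightly):

import uuid
from datetime import datetime

import requests
from faker import Faker

fake = Faker("en_IN")  # Indian locale for the Faker fallback

def generate_user():
    """Try the RandomUser API first, fall back to Faker if the call fails."""
    try:
        resp = requests.get("https://randomuser.me/api/?nat=in", timeout=5)
        resp.raise_for_status()
        data = resp.json()["results"][0]
        name = f"{data['name']['first']} {data['name']['last']}"
        email = data["email"]
        location = f"{data['location']['city']}, {data['location']['state']}"
    except Exception:
        # Fallback: purely synthetic profile
        name = fake.name()
        email = fake.email()
        location = f"{fake.city()}, {fake.state()}"
    return {
        "user_id": str(uuid.uuid4()),
        "name": name,
        "email": email,
        "location": location,
        "registered_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }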
Step 2: Clickstream Events
Each user clicks through 3–6 pages with short delays, generating:
- Actions like click, scroll, hover, navigate
- Pages like /home, /products, /cart
Example click event:
{
  "user_id": "u123",
  "session_id": "SESS123",
  "timestamp": "2025-04-01 20:12:30",
  "page": "/products",
  "device": "iOS",
  "action": "click"
}
To make the user journey feel natural, just like how people behave on an e-commerce site, I added weights to the random choices.
Here's the logic behind click events:
"page": random.choices(
["/home", "/products", "/cart", "/checkout", "/offers"],
weights=[0.4, 0.2, 0.15, 0.1, 0.15],
k=1
)[0]
That means:
- 40% of the time, users land on the homepage /home
- 20% go to product listings /products
- 15% check their cart /cart
- 10% proceed to checkout /checkout
- 15% check out special offers /offers
The weight distribution makes the data more believable and simulates real-world user activity, since most users who visit an e-commerce website don't end up making a purchase. Putting Step 2 together, a session generator along the lines sketched below emits 3–6 weighted click events per user.
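A sketch of that session generator (the action, device, and delay choices here are illustrative assumptions; only the page weights come from the snippet above):

import random
import time
from datetime import datetime

def generate_click_events(user_id):
    """Generate 3-6 click events for one session, with short delays between them."""
    session_id = f"SESS{random.randint(100, 999)}"
    events = []
    for _ in range(random.randint(3, 6)):
        events.append({
            "user_id": user_id,
            "session_id": session_id,
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "page": random.choices(
                ["/home", "/products", "/cart", "/checkout", "/offers"],
                weights=[0.4, 0.2, 0.15, 0.1, 0.15],
                k=1,
            )[0],
            "device": random.choice(["iOS", "Android", "Windows", "macOS"]),
            "action": random.choice(["click", "scroll", "hover", "navigate"]),
        })
        time.sleep(random.uniform(0.5, 2))  # short delay between page views
    return events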
Step 3: Transaction Event
Each transaction includes:
- User & session IDs to trace back the journey
- Order-level info: total amount, payment method and status
- Items: product details including quantity and price
Here's an example of what a transaction event looks like when it gets sent to the transactions Kafka topic:
{
  "user_id": "bc12EF45GH67",
  "session_id": "SESS9832475901",
  "transaction_id": "TXN327594837210",
  "timestamp": "2025-04-04 20:42:33",
  "transaction_amount": 4319.97,
  "payment_method": "credit_card",
  "payment_status": "successful",
  "products": [
    {
      "product_id": "PROD13456",
      "product_category": "Electronics",
      "quantity": 1,
      "unit_price": 3599.99
    },
    {
      "product_id": "PROD98765",
      "product_category": "Books",
      "quantity": 2,
      "unit_price": 359.99
    }
  ]
}
Also, to make transaction events realistic, I gave each product category its own price range so we don't get random price values that make no sense, e.g. buying a book at Rs 5 or an electronics item at Rs 10.
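The per-category ranges and the transaction builder look roughly like this (a sketch; the exact rupee ranges, categories, and payment options are illustrative stand-ins, not the precise values from the repo):

import random
from datetime import datetime

# Assumed per-category price ranges (in Rs) so prices stay plausible
PRICE_RANGES = {
    "Electronics": (1500, 80000),
    "Books": (150, 1500),
    "Clothing": (300, 5000),
    "Groceries": (50, 2000),
}

def generate_transaction(user_id, session_id):
    """Build a transaction with 1-3 products priced within their category's range."""
    products = []
    for _ in range(random.randint(1, 3)):
        category = random.choice(list(PRICE_RANGES))
        low, high = PRICE_RANGES[category]
        products.append({
            "product_id": f"PROD{random.randint(10000, 99999)}",
            "product_category": category,
            "quantity": random.randint(1, 3),
            "unit_price": round(random.uniform(low, high), 2),
        })
    total = round(sum(p["quantity"] * p["unit_price"] for p in products), 2)
    return {
        "user_id": user_id,
        "session_id": session_id,
        "transaction_id": f"TXN{random.randint(10**11, 10**12 - 1)}",
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "transaction_amount": total,
        "payment_method": random.choice(["credit_card", "debit_card", "upi", "netbanking"]),
        "payment_status": random.choices(["successful", "failed"], weights=[0.9, 0.1], k=1)[0],
        "products": products,
    }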
Dockerize the project
Docker-compose setup
services:
  kafka:
    image: bitnami/kafka
    container_name: kafka
    ports: ["9092:9092"]
    environment:
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
  producer:
    image: producer:v1
    volumes:
      - ./containers/producer/producer.py:/app/producer.py
    environment:
      - KAFKA_BROKER=kafka:9092
    depends_on:
      - kafka
To build the producer container, I created a containers folder that holds producer.py, a requirements.txt, and the Dockerfile for the container.
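A minimal requirements.txt for this setup might look like the following (I'm assuming kafka-python as the Kafka client; swap in whichever client the repo actually uses):

faker
requests
kafka-python

And the Dockerfile itself: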
FROM python:3.9-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY producer.py .
ENV KAFKA_BROKER=kafka:9092
CMD ["python", "producer.py"]
Run docker-compose up -d.
If the producer:v1 image doesn't exist yet, build it from this Dockerfile first (for example, docker build -t producer:v1 ./containers/producer), or add a build: section to the compose file, since Compose won't build an image that is only referenced by image:.
Once the containers are up, the generated data gets streamed to the Kafka topics.
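Inside producer.py, the part that streams events boils down to something like this (a sketch assuming the kafka-python client and the generator functions sketched in the earlier steps; the broker address comes from the KAFKA_BROKER environment variable set in the compose file):

import json
import os
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_BROKER", "kafka:9092"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One user journey: profile -> clicks -> (maybe) a transaction
user = generate_user()                                                # Step 1 sketch
clicks = generate_click_events(user["user_id"])                       # Step 2 sketch
txn = generate_transaction(user["user_id"], clicks[0]["session_id"])  # Step 3 sketch

producer.send("users", user)
for event in clicks:
    producer.send("clickstream", event)
producer.send("transactions", txn)
producer.flush()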
Kafka Topics in Use
The simulator pushes to three Kafka topics:
| Topic | What it Stores |
| --- | --- |
| users | Basic user profiles |
| clickstream | Page visits & interactions |
| transactions | Order & payment information |
Now we can subscribe to these topics and make further use of the data.
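For a quick sanity check, a small consumer can subscribe to one of the topics and print events as they arrive (a sketch, again assuming the kafka-python client, run from a container on the same Docker network so kafka:9092 resolves):

import json
from kafka import KafkaConsumer

# Subscribe to the clickstream topic and print events as they arrive
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="kafka:9092",   # as advertised in the compose file
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)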
This project is part of my larger streaming data pipeline series where this data feeds into:
- Spark Streaming
- PostgreSQL
- Minio object storage
- Grafana dashboard
You can find the next part of the blog here: Part 2
The GitHub repo for this project can be found here: GitHub