Building a social platform for University Alumni sounds simple on paper: "Just a CRUD app for profiles." But when you factor in 50 years of batch data, real-time reunions, job boards, and donation tracking, it quickly becomes a complex distributed system challenge. The naive version worked in testing. Then alumni reunion week hit, and 8,000 concurrent users melted the database.
In this deep dive, I will walk you through how we moved from a basic monolithic design to a robust Hybrid Architecture that separates high-throughput event streaming from the core relational data, achieving sub-20ms response times — and how we survived our first viral load test.
1. The Scale Problem: "It's Just a CRUD App, Right?"
The most dangerous assumption in software engineering is that "Internal Tools" don't need scale. The Alumni Portal wasn't just a directory; it was a multi-tenant social platform with three fundamentally different user types, each demanding conflicting database access patterns.
- The Alumni (Read-Heavy): Browsing yearbooks, searching for batchmates by company, city, or graduation year. Requires low latency full-text search across 50,000+ profiles.
- The Student (Write-Heavy): Posting questions on the mentorship forum, applying for referrals, RSVPing for events. High write concurrency during placement season.
- The Admin (Compute-Heavy): Generating annual donation reports, running degree verification against university databases, bulk-emailing 10,000+ alumni. Long-running background tasks.
RBAC Logic
Complex permission gates were required:
- Verified Alumni: Full Access (Profile, Jobs, Events, Donations)
- Unverified: Read-Only Profile (can request verification)
- Banned: Hard Block at Middleware (no API access)
- Super Admin: Audit logs + Impersonation for debugging
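These gates reduce to a role-to-permission map that can be checked before any view logic runs. A minimal, framework-agnostic sketch (role and action names here are illustrative, not our production middleware):

```python
# Minimal RBAC sketch: role -> set of allowed actions (illustrative names).
PERMISSIONS = {
    "verified_alumni": {"profile", "jobs", "events", "donations"},
    "unverified": {"profile_read", "request_verification"},
    "banned": set(),  # hard block: no API access at all
    "super_admin": {"profile", "jobs", "events", "donations",
                    "audit_logs", "impersonate"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if `role` may perform `action`; unknown roles are denied."""
    if role == "banned":
        return False  # short-circuit before any permission lookup
    return action in PERMISSIONS.get(role, set())
```

In the real system this check lives in middleware, so a banned account is rejected before the request ever reaches a view.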
Data Integrity Rules
Zero-tolerance data conditions:
- Unique BatchID: Preventing duplicate graduating classes
- Donation ACIDity: Zero-tolerance for lost transaction data
- Audit Logs: Immutable history of all Admin actions
- Soft Deletes: No hard deletes — every record keeps a deleted_at timestamp
2. The Hybrid Architecture
After the reunion week crash, we redesigned the system. We adopted a Hybrid Monolith-Microservice pattern — keeping the core business logic in Django for stability and developer velocity, while spinning off the real-time "Event Service" into a lightweight Flask microservice with WebSocket support.
Authentication Flow (JWT + OAuth)
Handling session state across two different frameworks (Django & Flask) is tricky. We solved this using Stateless JWTs signed by a shared HMAC-SHA256 secret key, stored in environment variables. The JWT payload carries the user's role, so both Django and Flask can enforce RBAC without hitting the database on every request.
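The shared-secret scheme boils down to standard HS256 JWT mechanics. To make the idea concrete, here is the sign/verify logic written out with only the standard library (a real deployment would use a library such as PyJWT; the key and claim names below are placeholders):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-hmac-key-from-env"  # placeholder; load from env vars in practice

def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload):
    """Build an HS256 JWT: base64url(header).base64url(payload).signature."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token):
    """Return the payload if the signature checks out, else None."""
    header, body, sig = token.split(".")
    signing_input = f"{header}.{body}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    padded = body + "=" * (-len(body) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because both services load the same secret, Flask can trust a token Django issued (and vice versa), and the role claim in the payload drives the RBAC checks without a database round-trip.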
3. Database Optimization & Indexing
The "Directory Search" was our biggest performance bottleneck. Alumni wanted to find batchmates based on extremely specific criteria: "All 2018 batch CSE alumni working at Microsoft in Bangalore." A naive SELECT * FROM alumni WHERE... was acceptable for 100 users, but at 50,000 users, page load hit 3.2 seconds.
The Solution: Partial & Composite Indexes
We analysed 30 days of query logs and found that 90% of searches filtered by batch_year AND company. Instead of blindly indexing every column, we created targeted indexes for the actual query patterns.
```python
from django.contrib.auth.models import User
from django.db import models
from django.db.models import Q

class AlumniProfile(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    batch_year = models.IntegerField(db_index=True)
    department = models.CharField(max_length=50)
    company = models.CharField(max_length=100, blank=True)
    city = models.CharField(max_length=100, blank=True)
    industry = models.CharField(max_length=100)
    is_verified = models.BooleanField(default=False)
    deleted_at = models.DateTimeField(null=True, blank=True)

    class Meta:
        indexes = [
            # Composite: "Find alumni from 2020 at Google"
            models.Index(
                fields=['batch_year', 'company'],
                name='batch_company_idx'
            ),
            # Partial: only index verified, non-deleted users
            # Saves ~35% index storage
            models.Index(
                fields=['batch_year', 'department'],
                name='verified_active_idx',
                condition=Q(is_verified=True, deleted_at__isnull=True)
            ),
        ]
        ordering = ['-batch_year']
```
Query Performance Results
| Query Pattern | Before | After | Improvement |
|---|---|---|---|
| Search by batch + company | 3200ms | 12ms | 266× |
| Browse verified alumni | 1800ms | 8ms | 225× |
| Full-text profile search | 4500ms | 45ms | 100× |
| Donation report (aggregate) | 12000ms | 350ms | 34× |
4. Caching Strategy: Redis as the Secret Weapon
Indexes solved the cold-start problem, but for the most popular queries (like "Browse all alumni"), we still hit PostgreSQL on every request. The solution was a cache-aside layer in Redis with smart invalidation: reads populate the cache, and profile writes purge the stale entries.
```python
import json
from functools import wraps

import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

r = redis.Redis(host='localhost', port=6379, db=0)
CACHE_TTL = 300  # 5-minute expiry

def cache_query(key_prefix, ttl=CACHE_TTL):
    """Decorator: cache database query results in Redis."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Build cache key from function args
            cache_key = f"{key_prefix}:{hash(str(args) + str(kwargs))}"
            # Try cache first
            cached = r.get(cache_key)
            if cached:
                return json.loads(cached)  # Cache HIT -> skip DB
            # Cache MISS -> query DB and store result
            result = func(*args, **kwargs)
            r.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

# Usage: cache alumni search for 5 minutes
@cache_query(key_prefix="alumni_search")
def search_alumni(batch_year, company=None):
    qs = AlumniProfile.objects.filter(
        batch_year=batch_year,
        is_verified=True,
        deleted_at__isnull=True
    )
    if company:
        qs = qs.filter(company__icontains=company)
    return list(qs.values('user__first_name', 'company', 'city'))

# Invalidate on profile update (signal)
@receiver(post_save, sender=AlumniProfile)
def invalidate_cache(sender, instance, **kwargs):
    pattern = "alumni_search:*"
    for key in r.scan_iter(match=pattern):
        r.delete(key)  # Purge stale cache entries
```
The result: cache hit rate of 87% during normal operation. During reunion week, it peaked at 94% — meaning only 6% of requests actually touched PostgreSQL. This is what allowed us to survive 8,000 concurrent users on a single $40/month DigitalOcean Droplet.
5. The Donation System (ACID Transactions)
Handling money requires zero tolerance for error. We integrated with Razorpay as the payment gateway and used Django's transaction.atomic() to ensure that money is never deducted without a corresponding database record. If any step in the chain fails, everything rolls back.
```python
import json
import logging

from django.db import transaction
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

logger = logging.getLogger(__name__)

@csrf_exempt
def razorpay_webhook(request):
    """Handle the Razorpay payment-success webhook with ACID guarantees."""
    payload = json.loads(request.body)
    payment_id = payload['payload']['payment']['entity']['id']
    amount = payload['payload']['payment']['entity']['amount'] / 100  # paise -> rupees

    try:
        with transaction.atomic():
            # Step 1: Create immutable donation record
            donation = Donation.objects.create(
                alumni=get_alumni_from_payment(payment_id),
                amount=amount,
                payment_id=payment_id,
                status='confirmed'
            )
            # Step 2: Update running total (SELECT ... FOR UPDATE prevents races)
            fund = Fund.objects.select_for_update().get(id=1)
            fund.total += amount
            fund.save()
            # Step 3: Generate PDF receipt (Celery async), queued only after the
            # transaction commits so the task never sees an uncommitted donation
            transaction.on_commit(lambda: generate_receipt.delay(donation.id))
        # All steps succeeded -> committed
        return JsonResponse({'status': 'ok'}, status=200)
    except Exception as e:
        # ANY failure inside the atomic block -> entire transaction rolled back
        logger.critical(f"Donation failed: {e}", exc_info=True)
        alert_admin_slack(f"🚨 Donation rollback: {payment_id}")
        return JsonResponse({'status': 'error'}, status=500)
```
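One detail worth calling out: because the endpoint is CSRF-exempt, the webhook must be authenticated another way. Razorpay signs each webhook body with HMAC-SHA256 using your webhook secret and sends the digest in the X-Razorpay-Signature header. A minimal verification sketch (the secret value is a placeholder):

```python
import hashlib
import hmac

WEBHOOK_SECRET = b"razorpay-webhook-secret"  # placeholder; store in env vars

def verify_webhook_signature(body: bytes, received_sig: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare in constant time."""
    expected = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)
```

The check runs before the payload is parsed: reject with a 400 if it fails, so forged payloads never reach the transaction.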
6. Real-Time Events with WebSockets
The Flask Event Service handles the real-time features: live event RSVP counts, chat during virtual reunions, and instant notifications. We used Socket.IO on the server and the React client, with Redis as the pub/sub message broker.
```python
from datetime import datetime, timezone

from flask import Flask
from flask_socketio import SocketIO, emit, join_room

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*",
                    message_queue='redis://localhost:6379/1')

@socketio.on('join_event')
def handle_join(data):
    """An alumnus joins an event room for live updates."""
    event_id = data['event_id']
    user = get_user_from_jwt(data['token'])
    join_room(f"event_{event_id}")
    # Broadcast the updated attendee count to everyone in the room
    count = get_attendee_count(event_id)
    emit('attendee_update', {
        'event_id': event_id,
        'count': count,
        'latest_join': user.name
    }, room=f"event_{event_id}")

@socketio.on('send_message')
def handle_message(data):
    """Real-time chat during virtual reunions."""
    msg = sanitize_input(data['message'])  # XSS prevention
    emit('new_message', {
        'user': data['user_name'],
        'message': msg,
        'timestamp': datetime.now(timezone.utc).isoformat()
    }, room=f"event_{data['event_id']}")
```
7. Frontend Architecture & UX
The frontend required a seamless experience across a complex feature set. We chose React with Redux Toolkit for its component reusability and predictable state management, critical for the dynamic Job Board and Event Calendar.
RTK Query Caching
Managing 500+ job listings client-side required a robust cache. RTK Query automatically caches API responses with tag-based invalidation — reducing server load by 40% on repeat visits. When a new job is posted, only the Jobs tag invalidates.
Optimistic Updates
When a user clicks "Connect" or "RSVP", the UI updates instantly while the POST processes in the background. On failure, we rollback the UI state and show a toast notification. Users perceive zero latency.
8. Deployment & Load Testing
We deployed on a DigitalOcean Droplet ($40/month, 4 vCPU, 8 GB RAM) using Docker Compose with Nginx as the reverse proxy. We chose Docker Compose over Kubernetes because at our scale, orchestration complexity wasn't justified — premature optimization is the root of all evil.
CI/CD Pipeline (GitHub Actions)
- Lint: Black (Python) + ESLint (JS) on every commit
- Test: pytest with 92% coverage — payment logic covers negative amounts, double-charging, and webhook replay attacks
- Build: Docker image built and pushed to GHCR (GitHub Container Registry)
- Deploy: Automated SSH to production via Fabric — zero-downtime blue-green deploy
Load Test Results (Locust)
Before launch, we ran a Locust load test simulating Alumni Reunion Day traffic patterns. Here's what the single Droplet handled:
Locust Stress Test Results
| Metric | Target | Actual | Status |
|---|---|---|---|
| Concurrent Users | 5,000 | 8,200 | ✅ Exceeded |
| P50 Latency | <100ms | 18ms | ✅ 5.5× better |
| P99 Latency | <500ms | 210ms | ✅ Pass |
| Error Rate | <0.1% | 0.02% | ✅ Pass |
| Cache Hit Rate | >80% | 94% | ✅ Exceeded |
| DB Connections (Peak) | <100 | 67 | ✅ Pass |
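For reference, the reunion-day traffic shape can be approximated with a locustfile along these lines (the endpoint paths and task weights are illustrative, not our exact test plan):

```python
from locust import HttpUser, task, between

class AlumniUser(HttpUser):
    wait_time = between(1, 5)  # think time between actions

    @task(8)
    def browse_directory(self):
        # Read-heavy path: the cached directory search
        self.client.get("/api/alumni/?batch_year=2018&company=Microsoft")

    @task(2)
    def rsvp_event(self):
        # Write path: event RSVPs spike during reunion week
        self.client.post("/api/events/42/rsvp/", json={"attending": True})
```

The 8:2 task weighting mirrors the read-heavy alumni workload described in section 1; ramping users up past the 5,000 target is what surfaced the connection-pool leaks.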
Key Takeaways
- Index what you query, not what you store. Analysing 30 days of query logs saved us from blindly indexing every column.
- Cache aggressively, invalidate precisely. Redis with pattern-based invalidation gave us 94% cache hit rate on a $40/month server.
- Hybrid > Monolith > Microservices (at this scale). Django handles 80% of the logic; Flask handles the 20% that needs WebSockets.
- ACID is non-negotiable for money. select_for_update() prevents race conditions in donation counters.
- Load test before you ship. Locust found 3 connection pool leaks we would have missed in manual QA.