System Design

Scalable Alumni Portal: A Hybrid Monolith Strategy

How we architected a platform for 50,000+ alumni using a Hybrid Monolith approach (Django + Flask), handling real-time events and complex data relationships.

Shubham Kulkarni, Software Engineer
[Figure: System Architecture Diagram]

Building a social platform for University Alumni sounds simple on paper: "Just a CRUD app for profiles." But when you factor in 50 years of batch data, real-time reunions, job boards, and donation tracking, it quickly becomes a complex distributed system challenge. The naive version worked in testing. Then alumni reunion week hit, and 8,000 concurrent users melted the database.

In this deep dive, I will walk you through how we moved from a basic monolithic design to a robust Hybrid Architecture that separates high-throughput event streaming from the core relational data, achieving sub-20ms response times — and how we survived our first viral load test.

50k+ Active Alumni
12ms Query Latency
99.9% Uptime

1. The Scale Problem: "It's Just a CRUD App, Right?"

The most dangerous assumption in software engineering is that "internal tools" don't need scale. The Alumni Portal wasn't just a directory; it was a multi-tenant social platform with three fundamentally different user types, each demanding conflicting database access patterns.

  • The Alumni (Read-Heavy): Browsing yearbooks, searching for batchmates by company, city, or graduation year. Requires low latency full-text search across 50,000+ profiles.
  • The Student (Write-Heavy): Posting questions on the mentorship forum, applying for referrals, RSVPing for events. High write concurrency during placement season.
  • The Admin (Compute-Heavy): Generating annual donation reports, running degree verification against university databases, bulk-emailing 10,000+ alumni. Long-running background tasks.

RBAC Logic

Complex permission gates were required (a minimal decorator sketch follows the list):

  • Verified Alumni: Full Access (Profile, Jobs, Events, Donations)
  • Unverified: Read-Only Profile (can request verification)
  • Banned: Hard Block at Middleware (no API access)
  • Super Admin: Audit logs + Impersonation for debugging
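
To make these gates concrete, here's a minimal sketch of how a role check can be enforced as a view decorator. The require_role helper, the ROLE_LEVELS ladder, and the request.jwt_role attribute are illustrative names rather than our exact code; the sketch assumes the auth middleware has already decoded the JWT and attached the role to the request.

rbac.py (illustrative sketch)
from functools import wraps

from django.http import JsonResponse

ROLE_LEVELS = {'banned': 0, 'unverified': 1, 'verified': 2, 'super_admin': 3}

def require_role(minimum):
    """Reject the request unless the JWT-derived role meets the minimum level."""
    def decorator(view):
        @wraps(view)
        def wrapper(request, *args, **kwargs):
            role = getattr(request, 'jwt_role', 'unverified')  # set by auth middleware
            if role == 'banned':
                # Banned users are normally stopped in middleware itself;
                # this is a defence-in-depth fallback
                return JsonResponse({'error': 'account banned'}, status=403)
            if ROLE_LEVELS.get(role, 0) < ROLE_LEVELS[minimum]:
                return JsonResponse({'error': 'insufficient role'}, status=403)
            return view(request, *args, **kwargs)
        return wrapper
    return decorator

# Usage: only verified alumni can access donation endpoints
@require_role('verified')
def create_donation(request):
    ...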

Data Integrity Rules

Non-negotiable data integrity conditions:

  • Unique BatchID: Preventing duplicate graduating classes
  • Donation ACIDity: Zero-tolerance for lost transaction data
  • Audit Logs: Immutable history of all Admin actions
  • Soft Deletes: No hard deletes; every record keeps a deleted_at timestamp (see the manager sketch after this list)
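
Soft deletes are easy to violate if every query must remember to filter on deleted_at. A common Django pattern, sketched below, is a custom default manager that hides deleted rows automatically; ActiveManager and SoftDeleteModel are illustrative names, not our exact code.

managers.py (illustrative sketch)
from django.db import models
from django.utils import timezone

class ActiveManager(models.Manager):
    """Default manager that hides soft-deleted rows from every query."""
    def get_queryset(self):
        return super().get_queryset().filter(deleted_at__isnull=True)

class SoftDeleteModel(models.Model):
    deleted_at = models.DateTimeField(null=True, blank=True)

    objects = ActiveManager()       # default: live rows only
    all_objects = models.Manager()  # escape hatch for audits

    class Meta:
        abstract = True

    def delete(self, using=None, keep_parents=False):
        # Override hard delete: stamp deleted_at instead of removing the row
        self.deleted_at = timezone.now()
        self.save(update_fields=['deleted_at'])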

2. The Hybrid Architecture

After the reunion week crash, we redesigned the system. We adopted a Hybrid Monolith-Microservice pattern — keeping the core business logic in Django for stability and developer velocity, while spinning off the real-time "Event Service" into a lightweight Flask microservice with WebSocket support.

flowchart TD
    subgraph Client ["🖥️ Frontend - React + Redux"]
        UI["React SPA"] --> RTK["RTK Query Cache"]
        UI --> WS["WebSocket Client"]
    end
    subgraph Gateway ["🔒 API Gateway - Nginx"]
        LB["Load Balancer\nRound Robin"]
        Rate["Rate Limiter\n100 req/min"]
        LB --> Rate
    end
    subgraph Core ["🏛️ Core Service - Django"]
        Auth["JWT Auth\nMiddleware"]
        API["REST API\nDRF"]
        Celery["Celery Workers\nBackground Tasks"]
        Auth --> API
    end
    subgraph Events ["⚡ Event Service - Flask"]
        WSGI["Flask API"]
        Socket["Socket.IO\nServer"]
    end
    subgraph Data ["🗄️ Data Layer"]
        PG[("PostgreSQL\nPrimary")]
        Redis[("Redis\nCache + Queue")]
        S3[("S3\nProfile Pics")]
    end
    UI -->|REST API| LB
    WS -->|WebSocket| Socket
    Rate --> Auth
    Rate --> WSGI
    API --> PG
    API --> Redis
    Celery --> PG
    Celery --> Redis
    WSGI --> PG
    Socket --> Redis

Authentication Flow (JWT + OAuth)

Handling session state across two different frameworks (Django & Flask) is tricky. We solved this using Stateless JWTs signed by a shared HMAC-SHA256 secret key, stored in environment variables. The JWT payload carries the user's role, so both Django and Flask can enforce RBAC without hitting the database on every request.

sequenceDiagram
    participant User
    participant Nginx as 🔒 API Gateway
    participant Django as 🏛️ Auth Service
    participant Flask as ⚡ Event Service
    participant DB as 🗄️ Redis/PostgreSQL
    User->>Nginx: POST /login (Credentials)
    Nginx->>Django: Proxy Request
    Django->>DB: Validate User and Role
    DB-->>Django: User Valid (role: alumni_verified)
    Django->>Django: Generate JWT (Exp: 15 min, Role: verified)
    Django-->>User: Return Access + Refresh Tokens
    Note over User, Flask: Subsequent Request (Cross-Service Auth)
    User->>Nginx: GET /api/events (Bearer Token)
    Nginx->>Flask: Proxy with Auth Headers
    Flask->>Flask: Verify JWT Signature (Shared Secret)
    Flask->>DB: Check Token Blacklist (Redis, O(1))
    DB-->>Flask: Token Valid
    Flask-->>User: Return Event Data (200 OK)
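
On the Flask side, the verification step is small. Below is a simplified sketch of the get_user_from_jwt helper that appears in the Event Service code later, assuming PyJWT, the shared secret in a SHARED_JWT_SECRET environment variable, and revoked token IDs (jti claims) kept in a Redis set; those names and the payload shape are assumptions, not our exact code.

event_auth.py (illustrative sketch)
import os

import jwt  # PyJWT
import redis

r = redis.Redis(host='localhost', port=6379, db=1)
SECRET = os.environ['SHARED_JWT_SECRET']  # same HMAC-SHA256 key Django signs with

def get_user_from_jwt(token):
    """Verify a Django-issued JWT in Flask without a per-request DB lookup."""
    # Raises jwt.InvalidTokenError on a bad signature or an expired token
    payload = jwt.decode(token, SECRET, algorithms=['HS256'])
    # O(1) blacklist check covers tokens revoked before their 15-minute expiry
    if r.sismember('jwt_blacklist', payload['jti']):
        raise PermissionError('token revoked')
    return payload  # carries user id and role, so RBAC needs no DB hit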

3. Database Optimization & Indexing

The "Directory Search" was our biggest performance bottleneck. Alumni wanted to find batchmates based on extremely specific criteria: "All 2018 batch CSE alumni working at Microsoft in Bangalore." A naive SELECT * FROM alumni WHERE... was acceptable for 100 users, but at 50,000 users, page load hit 3.2 seconds.

The Solution: Partial & Composite Indexes

We analysed 30 days of query logs and found that 90% of searches filtered by batch_year AND company. Instead of blindly indexing every column, we created targeted indexes for the actual query patterns.

models.py
from django.contrib.auth.models import User
from django.db import models
from django.db.models import Q

class AlumniProfile(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    batch_year = models.IntegerField(db_index=True)
    department = models.CharField(max_length=50)
    company = models.CharField(max_length=100, blank=True)
    city = models.CharField(max_length=100, blank=True)
    industry = models.CharField(max_length=100)
    is_verified = models.BooleanField(default=False)
    deleted_at = models.DateTimeField(null=True, blank=True)

    class Meta:
        indexes = [
            # Composite: "Find alumni from 2020 at Google"
            models.Index(
                fields=['batch_year', 'company'],
                name='batch_company_idx'
            ),
            # Partial: Only index verified, non-deleted users
            # Saves ~35% index storage
            models.Index(
                fields=['batch_year', 'department'],
                name='verified_active_idx',
                condition=Q(is_verified=True, deleted_at__isnull=True)
            ),
        ]
        ordering = ['-batch_year']

Query Performance Results

Query Pattern               | Before    | After  | Improvement
Search by batch + company   | 3,200 ms  | 12 ms  | 266×
Browse verified alumni      | 1,800 ms  | 8 ms   | 225×
Full-text profile search    | 4,500 ms  | 45 ms  | 100×
Donation report (aggregate) | 12,000 ms | 350 ms | 34×
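
A handy way to confirm the planner actually uses these indexes without leaving Python: Django's QuerySet.explain() (available since Django 2.1) runs EXPLAIN on the underlying SQL. The filter values below are just examples.

# Django shell: check that the composite index serves the hot query
qs = AlumniProfile.objects.filter(batch_year=2018, company='Microsoft')
print(qs.explain(analyze=True))
# Look for "Index Scan using batch_company_idx" rather than "Seq Scan";
# if PostgreSQL still prefers a Seq Scan, re-run ANALYZE and check selectivity.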

4. Caching Strategy: Redis as the Secret Weapon

Indexes solved the cold-start problem, but for the most popular queries (like "Browse all alumni"), we still hit PostgreSQL on every request. The solution was a cache-aside layer in Redis with smart invalidation: read from the cache, fall back to the database on a miss, and purge on writes.

cache_layer.py
import hashlib
import json
from functools import wraps

import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

r = redis.Redis(host='localhost', port=6379, db=0)
CACHE_TTL = 300  # 5-minute expiry

def cache_query(key_prefix, ttl=CACHE_TTL):
    """Decorator: cache query results in Redis (invalidated by signal on write)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Build a stable cache key from the function args;
            # built-in hash() is salted per process, so it can't key a shared cache
            digest = hashlib.md5((str(args) + str(kwargs)).encode()).hexdigest()
            cache_key = f"{key_prefix}:{digest}"
            
            # Try cache first
            cached = r.get(cache_key)
            if cached:
                return json.loads(cached)  # Cache HIT → skip DB
            
            # Cache MISS → query DB and store result
            result = func(*args, **kwargs)
            r.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

# Usage: Cache alumni search for 5 minutes
@cache_query(key_prefix="alumni_search")
def search_alumni(batch_year, company=None):
    qs = AlumniProfile.objects.filter(
        batch_year=batch_year,
        is_verified=True,
        deleted_at__isnull=True
    )
    if company:
        qs = qs.filter(company__icontains=company)
    return list(qs.values('user__first_name', 'company', 'city'))

# Invalidate on profile update (Signal)
@receiver(post_save, sender=AlumniProfile)
def invalidate_cache(sender, instance, **kwargs):
    # SCAN (not KEYS) avoids blocking Redis while purging stale entries
    for key in r.scan_iter(match="alumni_search:*"):
        r.delete(key)

The result: cache hit rate of 87% during normal operation. During reunion week, it peaked at 94% — meaning only 6% of requests actually touched PostgreSQL. This is what allowed us to survive 8,000 concurrent users on a single $40/month DigitalOcean Droplet.
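
The hit rate itself is cheap to measure: Redis tracks keyspace_hits and keyspace_misses in its INFO stats. A small helper (server-wide since the last restart, not per key prefix):

def cache_hit_rate(client):
    """Server-wide cache hit rate from Redis INFO counters."""
    stats = client.info('stats')
    hits, misses = stats['keyspace_hits'], stats['keyspace_misses']
    return hits / (hits + misses) if (hits + misses) else 0.0

print(f"Cache hit rate: {cache_hit_rate(r):.1%}")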

5. The Donation System (ACID Transactions)

Handling money requires zero tolerance for error. We integrated with Razorpay as the payment gateway and used Django's transaction.atomic() to ensure that money is never deducted without a corresponding database record. If any step in the chain fails, everything rolls back.

flowchart TD
    A["💳 User Initiates Donation"] --> B{"🏦 Payment Gateway\nRazorpay"}
    B -->|✅ Success| C["🔔 Webhook Triggered"]
    B -->|❌ Failed| Z["Show Error and Retry"]
    subgraph ACID ["🔒 Django Transaction Atomic"]
        E["📝 Create Donation Record"]
        F["📊 Update Total Fund Counter"]
        G["📄 Generate Tax Receipt PDF"]
    end
    C --> ACID
    ACID --> H{"All Operations\nSucceeded?"}
    H -->|Yes| I["✅ Commit Transaction\nSend Confirmation Email"]
    H -->|No| J["⚠️ Rollback Everything"]
    J --> K["🚨 Log Critical Error\nAlert Admin"]

donations/views.py
import json
import logging

from django.db import transaction
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

logger = logging.getLogger(__name__)

@csrf_exempt
def razorpay_webhook(request):
    """Handle Razorpay payment success webhook with ACID guarantees."""
    payload = json.loads(request.body)
    payment_id = payload['payload']['payment']['entity']['id']
    amount = payload['payload']['payment']['entity']['amount'] / 100
    
    try:
        with transaction.atomic():
            # Step 1: Create immutable donation record
            donation = Donation.objects.create(
                alumni=get_alumni_from_payment(payment_id),
                amount=amount,
                payment_id=payment_id,
                status='confirmed'
            )
            
            # Step 2: Update running total (SELECT FOR UPDATE prevents race)
            fund = Fund.objects.select_for_update().get(id=1)
            fund.total += amount
            fund.save()
            
            # Step 3: Queue the PDF receipt task only after the commit lands;
            # on_commit stops Celery racing the still-open transaction
            transaction.on_commit(lambda: generate_receipt.delay(donation.id))
            
        # All 3 steps succeeded → committed
        return JsonResponse({'status': 'ok'}, status=200)
        
    except Exception as e:
        # ANY failure → entire transaction rolled back
        logger.critical(f"Donation failed: {e}", exc_info=True)
        alert_admin_slack(f"🚨 Donation rollback: {payment_id}")
        return JsonResponse({'status': 'error'}, status=500)
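
One thing the view above glosses over is webhook authenticity and replays (both exercised by the test suite described in section 8). Razorpay signs each webhook with an HMAC-SHA256 of the raw body under your webhook secret, sent as a hex digest in the X-Razorpay-Signature header. The sketch below verifies that signature manually and leans on get_or_create for idempotency; the helper name and settings key are illustrative, and it assumes payment_id carries a unique constraint.

import hashlib
import hmac

from django.conf import settings

def verify_and_record(request, payment_id, amount):
    """Reject forged webhooks; make replayed ones a harmless no-op."""
    expected = hmac.new(
        settings.RAZORPAY_WEBHOOK_SECRET.encode(),
        request.body,
        hashlib.sha256,
    ).hexdigest()
    received = request.headers.get('X-Razorpay-Signature', '')
    if not hmac.compare_digest(expected, received):  # constant-time compare
        raise PermissionError('bad webhook signature')

    # A replayed webhook hits the unique payment_id and creates nothing new
    return Donation.objects.get_or_create(
        payment_id=payment_id,
        defaults={'amount': amount, 'status': 'confirmed'},
    )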

6. Real-Time Events with WebSockets

The Flask Event Service handles the real-time features: live event RSVP counts, chat during virtual reunions, and instant notifications. We used Socket.IO on both the server and the React client, with Redis as the pub/sub message broker.

event_service.py
from datetime import datetime

from flask import Flask
from flask_socketio import SocketIO, emit, join_room

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*",
                    message_queue='redis://localhost:6379/1')

@socketio.on('join_event')
def handle_join(data):
    """Alumni joins an event room for live updates."""
    event_id = data['event_id']
    user = get_user_from_jwt(data['token'])
    
    join_room(f"event_{event_id}")
    
    # Broadcast updated attendee count to all in room
    count = get_attendee_count(event_id)
    emit('attendee_update', {
        'event_id': event_id,
        'count': count,
        'latest_join': user.name
    }, room=f"event_{event_id}")

@socketio.on('send_message')
def handle_message(data):
    """Real-time chat during virtual reunions."""
    msg = sanitize_input(data['message'])  # XSS prevention
    emit('new_message', {
        'user': data['user_name'],
        'message': msg,
        'timestamp': datetime.utcnow().isoformat()
    }, room=f"event_{data['event_id']}")
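
A useful side effect of the message_queue argument: Flask-SocketIO lets any other process publish through the same Redis queue, so a Django view or a Celery task can push a notification without touching the Flask app. A minimal sketch (the event name and room convention here are illustrative, mirroring the handlers above):

# notify.py: emit to Socket.IO rooms from outside the Flask process
from flask_socketio import SocketIO

# Write-only client: it connects to the Redis queue, not to an HTTP server
external = SocketIO(message_queue='redis://localhost:6379/1')

def notify_event_cancelled(event_id):
    external.emit('event_cancelled',
                  {'event_id': event_id},
                  room=f"event_{event_id}")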

7. Frontend Architecture & UX

The frontend required a seamless experience across a complex feature set. We chose React with Redux Toolkit for its component reusability and predictable state management, which are critical for the dynamic Job Board and Event Calendar.

RTK Query Caching

Managing 500+ job listings client-side required a robust cache. RTK Query automatically caches API responses with tag-based invalidation — reducing server load by 40% on repeat visits. When a new job is posted, only the Jobs tag invalidates.

Optimistic Updates

When a user clicks "Connect" or "RSVP", the UI updates instantly while the POST processes in the background. On failure, we rollback the UI state and show a toast notification. Users perceive zero latency.

8. Deployment & Load Testing

We deployed on a DigitalOcean Droplet ($40/month, 4 vCPU, 8 GB RAM) using Docker Compose with Nginx as the reverse proxy. We chose Docker Compose over Kubernetes because at our scale, orchestration complexity wasn't justified — premature optimization is the root of all evil.

CI/CD Pipeline (GitHub Actions)

  • Lint: Black (Python) + ESLint (JS) on every commit
  • Test: pytest with 92% coverage — payment logic covers negative amounts, double-charging, webhook replay attacks
  • Build: Docker image built and pushed to GHCR (GitHub Container Registry)
  • Deploy: Automated SSH to production via fabric — zero-downtime blue-green deploy

Load Test Results (Locust)

Before launch, we ran a Locust load test simulating Alumni Reunion Day traffic patterns. Here's what the single Droplet handled:
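
The locustfile itself stayed small. Here's a representative sketch; the endpoint paths and task weights are illustrative, not our exact test plan:

locustfile.py
from locust import HttpUser, between, task

class AlumniUser(HttpUser):
    wait_time = between(1, 3)  # think time between actions

    @task(5)  # directory search dominates reunion-day traffic
    def search_directory(self):
        self.client.get("/api/alumni?batch_year=2018&company=Microsoft")

    @task(2)
    def browse_events(self):
        self.client.get("/api/events")

    @task(1)
    def rsvp_event(self):
        self.client.post("/api/events/42/rsvp", json={"status": "going"})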

Locust Stress Test Results

Metric                | Target   | Actual | Status
Concurrent Users      | 5,000    | 8,200  | ✅ Exceeded
P50 Latency           | < 100 ms | 18 ms  | ✅ 5.5× better
P99 Latency           | < 500 ms | 210 ms | ✅ Pass
Error Rate            | < 0.1%   | 0.02%  | ✅ Pass
Cache Hit Rate        | > 80%    | 94%    | ✅ Exceeded
DB Connections (Peak) | < 100    | 67     | ✅ Pass

Key Takeaways

  • Index what you query, not what you store. Analysing 30 days of query logs saved us from blindly indexing every column.
  • Cache aggressively, invalidate precisely. Redis with pattern-based invalidation gave us 94% cache hit rate on a $40/month server.
  • Hybrid > Monolith > Microservices (at this scale). Django handles 80% of the logic; Flask handles the 20% that needs WebSockets.
  • ACID is non-negotiable for money. select_for_update() prevents race conditions in donation counters.
  • Load test before you ship. Locust found 3 connection pool leaks we would have missed in manual QA.