Optimizing Performance & Reliability in Microservice Architectures using Graph Theory and Network Science
A research project by fourth-year students at Sri Lanka Institute of Information Technology (SLIIT), aimed at revolutionizing microservice reliability through graph-based analysis, predictive simulations, and intelligent resource scheduling.
🛑 IMPORTANT NOTE: We had to use multiple repositories to allow each component to be built, developed and deployed independently. Using a single repo would restrict CI/CD workflows.
| IT Number | Name | Component | Repositories |
|---|---|---|---|
| IT22917270 | Wimalagunasekara D.M.E | Service Graph Engine | 🔗 Link |
| IT22036148 | Minsandi I.G.Y | Kubernetes Scheduler Extender | 🔗 Link |
| IT22314574 | Samarathunge M.P.C | Graph Dependency Alert Engine | 🔗 Link |
| IT22346872 | Sarathchandra I.T.P | Predictive Analysis Engine | 🔗 Link |
Institution: Sri Lanka Institute of Information Technology (SLIIT)
Program: B.Sc. (Hons) in Information Technology, Specializing in Software Engineering
Year: 4th Year, Semester 2
Project Type: Research Project
Supervisor: Dr. Dharshana Kasthurirathna
Co-Supervisor: Mr. Vishan Jayasinghearachchi
GraphFoundry is a cloud-native observability and prediction platform designed to help DevOps teams understand, simulate, and optimize their microservice architectures. The platform combines real-time service topology mapping, predictive failure analysis, intelligent scheduling, and proactive alerting to minimize downtime and improve system reliability.
| # | Component | Tech Stack | Repository | Description |
|---|---|---|---|---|
| 1 | Service Graph Engine | Node.js, Neo4j, Graph Data Science | 🔗 Repo | Real-time service topology mapping and centrality analysis |
| 2 | Predictive Analysis Engine | Node.js, Express | 🔗 Repo | Simulates failure & scaling scenarios before production changes |
| 3 | Graph Dependency Alert Engine | Go, Hexagonal Architecture | 🔗 Repo | Proactive alerting based on service criticality & anomaly detection |
| 4 | Kubernetes Scheduler Extender | Go, Kubernetes API | 🔗 Repo | Graph-aware pod scheduling for optimal placement |
| 5 | Microservice Test Bed | Go, Node.js, Python, Java, C# | 🔗 Repo | 11-service e-commerce demo app (based on Google's Online Boutique) |
| 6 | Integrated Dashboard UI | React, TypeScript, Vite, Tailwind | 🔗 Repo | Unified web interface for simulations and monitoring |
Maintainer: Wimalagunasekara D.M.E. (IT22917270)
Repository: github.com/GraphFoundry/service-graph-engine
The Service Graph Engine is the single source of truth for microservice topology and metrics. It continuously ingests telemetry from Istio service mesh (via Prometheus) and constructs a living graph of service dependencies stored in Neo4j. Leveraging Neo4j's Graph Data Science (GDS) library, it computes centrality metrics (PageRank, Betweenness) to identify critical services.
- Real-time Graph Synchronization: Polls Prometheus every 10s for
istio_requests_totalandistio_request_duration_milliseconds - Centrality Scoring: Calculates PageRank and Betweenness Centrality every 30s to rank service criticality
- HTTP API: Provides RESTful endpoints for graph queries, metrics snapshots, and centrality scores
- Infrastructure Awareness: Tracks Kubernetes pod-to-service mappings for resource-level analysis
Repository: github.com/GraphFoundry/microservice-test-bed
A production-like microservice environment based on Google's Online Boutique demo. Consists of 11 polyglot services communicating via gRPC, deployed on Kubernetes with Istio service mesh. Used as a live testing ground for all GraphFoundry components.
Maintainer: Sarathchandra I.T.P. (IT22346872)
Repository: github.com/GraphFoundry/analysis-engine
The Predictive Analysis Engine is a microservice observability tool that performs predictive impact analysis on service call graphs. It enables operators to simulate infrastructure changes—service failures and scaling operations—before executing them in production, thereby reducing risk and improving operational decision-making. It consumes graph data from the Service Graph Engine and runs Breadth-First Search (BFS) to trace dependency chains.
- Failure Simulation: Predict which services will be affected if a target service fails
- Scaling Simulation: Estimate load redistribution when adding/removing replicas
- Impact Scoring: Calculates blast radius (number of affected services, severity levels)
- Path Tracing: Identifies critical paths where traffic will be disrupted
- Fetch Graph: Retrieves live topology from Service Graph Engine
- Traverse Dependencies: Uses BFS to find all downstream/upstream services
- Calculate Impact: Scores based on centrality, traffic volume, and hop distance
- Generate Recommendations: Suggests mitigation strategies (e.g., circuit breakers, retries)
Maintainer: Minsandi I.G.Y. (IT22036148)
Repository: github.com/GraphFoundry/kubernetes-scheduler-extender
A custom Kubernetes scheduler extension that makes pod placement decisions based on service graph topology. Unlike the default scheduler (which only considers resource availability), this extender factors in service dependencies to co-locate or anti-co-locate services for optimal performance and resilience.
- Topology-Aware Scheduling: Queries Service Graph Engine for service relationships before pod placement
- Affinity Rules: Co-locates frequently communicating services to reduce latency
- Anti-Affinity Rules: Spreads critical services across nodes for fault tolerance
- Dynamic Scoring: Assigns fitness scores to candidate nodes based on graph metrics
- Kubernetes scheduler sends node candidates to extender
- Extender queries Service Graph Engine for service neighborhood
- Scores nodes based on:
- Co-location with upstream dependencies (reduce cross-node traffic)
- Spread of high-centrality services (avoid single point of failure)
- Returns top-ranked node to scheduler
Maintainer: Samarathunge M.P.C. (IT22314574)
Repository: github.com/GraphFoundry/graph-dependency-alert-engine
A proactive alerting system that monitors service health and topology changes to detect potential cascading failures before they happen. Unlike traditional metric-based alerts (CPU, memory), this engine uses graph-based risk scoring to predict when a service degradation might trigger wider outages.
- Risk Scoring: Combines centrality metrics, latency trends, and error rates into a unified risk score
- Anomaly Detection: Uses statistical forecasting (moving average, standard deviation) to detect abnormal patterns
- Cascading Failure Prediction: Simulates failure propagation through dependency graph
- Webhook Notifications: Sends alerts to Slack, PagerDuty, or custom endpoints with HMAC signatures
- Event Bus Architecture: Decouples detection, evaluation, and notification
Repository: github.com/GraphFoundry/integrated-dashboard
Modern frontend dashboard for the Predictive Analysis Engine. Built with Vite, React, TypeScript, and Tailwind CSS. Provides a unified web interface for monitoring service health, running simulations, viewing metrics, analyzing history, and managing alerts.
This is an academic research project for SLIIT Y4S2. For questions or collaboration inquiries, please contact:
- Wimalagunasekara D.M.E. (IT22917270) - Service Graph Engine
- Minsandi I.G.Y. (IT22036148) - Kubernetes Scheduler Extender
- Samarathunge M.P.C. (IT22314574) - Graph Dependency Alert Engine
- Sarathchandra I.T.P. (IT22346872) - Predictive Analysis Engine
| Resource | Link |
|---|---|
| Organization | github.com/GraphFoundry |
| Service Graph Engine | 🔗 Repo |
| Predictive Analysis Engine | 🔗 Repo |
| Graph Alert Engine | 🔗 Repo |
| Scheduler Extender | 🔗 Repo |
| Microservice Test Bed | 🔗 Repo |
| Dashboard UI | 🔗 Repo |
Built by the GraphFoundry Team
Empowering DevOps through intelligent graph-based observability