Scalable Graph Processing on Distributed In-Memory Systems
DOI:
https://doi.org/10.15662/IJRAI.2018.0102002Keywords:
Distributed graph processing, in-memory graph computation, vertex-centric, subgraph-centric, PowerGraph, GraphLab, Giraph, GraphX, GoFFish, performance scalabilityAbstract
The explosive growth of graph-structured data in applications such as social networks, web analytics, and recommendation systems has prompted the development of scalable graph processing frameworks. Distributed inmemory systems have emerged to meet performance demands, processing massive graphs across clusters while maintaining low-latency, iterative operations. This paper surveys the evolution and effectiveness of key systems—such as Pregel, GraphLab (and its distributed variant), PowerGraph, Apache Giraph, GraphX, and GoFFish—each built before 2017. We analyze their programming abstractions (vertex-centric, subgraph-centric), partitioning strategies, and computation models (synchronous vs. asynchronous). Through synthesis of experimental insights from prior studies, including comparisons over billions of edges, we highlight trade-offs in communication overhead, load balancing, and scalability. Notably, PowerGraph introduces novel edge-centric partitioning to mitigate skew, achieving significant performance gains, while GoFFish demonstrates subgraph-centric execution that outperforms vertex-centric solutions on real-world graphs. We discuss the design patterns that foster efficiency—including minimizing network communication, exploiting in-memory locality, and iterative model optimizations. Our findings underscore the importance of tailored architecture depending on graph characteristics and algorithm demands. The study concludes with design guidelines for future systems aiming to process ever-larger graphs in memory, and recommends integrating dynamic partitioning, adaptive computation flow, and optimized memory management as areas for future exploration.