Notes from a 2011 QCon talk about scaling the non-gaming server side of League of Legends. They are not worried about persistence during a game, but rather during match-making, the lobby, or the store. They have soft real-time requirements: not on the order of 10 ms, more on the order of a few seconds. [Twitter and Facebook are soft RT too]
Scalability: it's easier to constrain what the logic developers are allowed to do than to enumerate what they are not allowed to do. Examples: Map/Reduce is a whole paradigm (you have to make your logic fit into a map() and a reduce()); NoSQL's unstructured data is a double-edged sword (you lose the ability to join, but queries come back faster); and under a network partition, you have to pick either atomicity/consistency or availability.
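The Map/Reduce constraint can be made concrete with a small sketch: all the logic has to fit into an independent per-item map step and a combining reduce step. The match records and champion names here are made up for illustration.

```python
from functools import reduce

# Hypothetical input: match records listing champion picks.
matches = [
    {"picks": ["Ashe", "Garen"]},
    {"picks": ["Ashe", "Ryze"]},
]

# map(): emit (champion, 1) pairs per match; each record is processed
# independently, which is what makes the step parallelizable.
mapped = [(champ, 1) for m in matches for champ in m["picks"]]

# reduce(): merge the pairs into per-champion totals.
def merge(acc, pair):
    champ, n = pair
    acc[champ] = acc.get(champ, 0) + n
    return acc

totals = reduce(merge, mapped, {})
print(totals)  # {'Ashe': 2, 'Garen': 1, 'Ryze': 1}
```

The constraint is the point: any logic you cannot phrase this way does not fit the paradigm, and that restriction is what buys easy distribution.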
46:50: Scaling should be dynamic/elastic; you need cluster recomposition and stateless growth patterns. Hence the system should be dynamically configurable: on the fly, you should be able to adjust thread pool sizes, add/remove roles on machines, and switch logic or system algorithms.
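One of those knobs, resizing a thread pool at runtime, can be sketched as a toy worker pool that grows by spawning threads and shrinks with poison pills. The class and mechanism are illustrative, not Riot's implementation; note that on shrink, any idle worker may consume the pill, so the list only tracks the target size.

```python
import queue
import threading

class ResizablePool:
    """Toy worker pool whose size can be adjusted on the fly (hypothetical sketch)."""

    def __init__(self, size):
        self.tasks = queue.Queue()
        self.workers = []
        self.resize(size)

    def _worker(self):
        while True:
            task = self.tasks.get()
            if task is None:          # poison pill: this worker exits
                return
            task()

    def resize(self, new_size):
        while len(self.workers) < new_size:   # grow: spawn more worker threads
            t = threading.Thread(target=self._worker, daemon=True)
            t.start()
            self.workers.append(t)
        while len(self.workers) > new_size:   # shrink: queue poison pills
            self.tasks.put(None)
            self.workers.pop()

    def submit(self, fn):
        self.tasks.put(fn)

pool = ResizablePool(4)
pool.resize(8)   # grow under load
pool.resize(2)   # shrink when idle
print(len(pool.workers))  # 2
```

In the Java world the talk lives in, `ThreadPoolExecutor` exposes the same idea directly via `setCorePoolSize`/`setMaximumPoolSize`.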
Food for thought: list all the benefits/tricks a load balancer can provide to your system.
Caching provides flexibility: failovers and workload distribution (e.g., during hot code updates or restarts). Coherence is a distributed cache used between Hibernate and the DAO layer as "cache-through". If the DAO asks Coherence and there's a cache miss, then Coherence asks Hibernate (itself using a Coherence cache, or calling MySQL over the network on a miss).
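The miss-handling chain described above can be reduced to a minimal sketch: the caller asks the cache, and on a miss the cache itself falls through to the backing store and remembers the result. The class, the loader, and the record shape are all stand-ins, not Coherence's actual API.

```python
class ReadThroughCache:
    """Sketch of the 'cache-through' pattern: on a miss, the cache
    loads from the backing store itself and keeps the result."""

    def __init__(self, loader):
        self.store = {}
        self.loader = loader       # called only on a cache miss

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.loader(key)   # miss: fall through to the DB
        return self.store[key]

# Hypothetical backing store standing in for Hibernate + MySQL.
db_calls = []
def load_from_db(key):
    db_calls.append(key)
    return {"id": key, "name": f"summoner-{key}"}

cache = ReadThroughCache(load_from_db)
cache.get(7)     # miss: hits the "DB"
cache.get(7)     # hit: served from the cache
print(db_calls)  # [7] -- the DB was only queried once
```

The benefit over cache-aside is that callers never talk to the database directly, so the cache layer can be swapped, restarted, or failed over without touching DAO code.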
24:00: How to be sure that cache and DB are consistent? Do not cache! Query the DB directly; a latency of a few seconds is OK for soft RT.
11:30: Serialize the work, not the data. It's faster to serialize the work and send it to the data than to deserialize the data, work on it, then serialize the result.
Avoid moving the data across the network to where the process runs. It's also easier to distribute the work to all the DB nodes than to send data to and receive it from all of them.
The objects sent over the network between server and DB should be small: if you only have to edit one field, you should not have to send a 1 MB object.
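The "ship the work, not the data" idea above can be sketched as a tiny command object applied on the node that owns the record. The command format and record shape here are invented for illustration.

```python
import json

# The record lives on the DB node. Instead of pulling it over the network,
# mutating it, and pushing it back, the client ships a small description of
# the work to be done.
record = {"summoner": "Teemo", "level": 29, "bio": "x" * 1_000_000}  # ~1 MB

def apply_command(rec, command_json):
    """Runs on the node that owns the data."""
    cmd = json.loads(command_json)
    if cmd["op"] == "set_field":
        rec[cmd["field"]] = cmd["value"]
    return rec

# Only ~50 bytes cross the network, not the 1 MB object:
wire = json.dumps({"op": "set_field", "field": "level", "value": 30})
apply_command(record, wire)
print(len(wire), record["level"])
```

Besides saving bandwidth, serializing a small command is cheaper than deserializing and re-serializing the full object on the client side.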
Logging and Testing
42:30: They log each function call with its duration; the overhead is about 1% of performance, but the value is huge if you can graph/plot the logs. Compare the average call duration over the last few minutes to the usual average to detect problems.
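That per-call timing plus rolling-average comparison can be sketched with a decorator. The names, the sample window, and the degradation factor are all illustrative assumptions, not details from the talk.

```python
import time
from collections import defaultdict, deque

# Keep the most recent duration samples per function (window size is arbitrary).
durations = defaultdict(lambda: deque(maxlen=1000))

def timed(fn):
    """Record each call's duration; the talk claims ~1% overhead for this."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            durations[fn.__name__].append(time.perf_counter() - start)
    return wrapper

def looks_degraded(name, baseline_avg, factor=2.0):
    """Compare the recent average call duration against the usual baseline."""
    samples = durations[name]
    if not samples:
        return False
    recent_avg = sum(samples) / len(samples)
    return recent_avg > factor * baseline_avg

@timed
def fetch_profile():
    time.sleep(0.01)  # stand-in for real work

fetch_profile()
print(looks_degraded("fetch_profile", baseline_avg=0.05))  # False: within range
```

In production the samples would be flushed to logs and graphed; the alerting rule is the same comparison of recent average against the historical one.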
17:15: Keep in mind that the code will run in a data center, not on your laptop. 51:25: They use EC2 for load testing: 1000 threads per node, each thread simulating one user. This is realistic because not all clients (threads) run at the same speed in real life, and EC2's network is not always the best/most reliable. It's not the most performant setup, but it looks like a quick-and-dirty way to load test. One of their scale-testing environments has more than 50 machines.
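The thread-per-user setup can be sketched as below; the simulated request is faked with a randomized sleep (in practice each thread would hit the real server), and the run is scaled down from the talk's 1000 threads per node.

```python
import random
import threading

def simulate_user(user_id, results):
    """One thread = one simulated user, as in the talk's EC2 setup."""
    # Stand-in for a real request; uneven sleeps mimic uneven client speeds.
    import time
    time.sleep(random.uniform(0.001, 0.005))
    results.append(user_id)

def run_load_node(n_users):
    """One load-generating node running n_users threads (talk used 1000/node)."""
    results = []
    threads = [threading.Thread(target=simulate_user, args=(i, results))
               for i in range(n_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

done = run_load_node(50)   # scaled down for a quick local run
print(len(done))  # 50
```

Scaling out is then just running the same script on more EC2 nodes, which is what makes the approach quick and dirty but effective.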