04 June 2012

TERA's free-targetting combat: server-side

In the May 2012 issue of GD mag, Seungmo Koo, the server architect for TERA, wrote about how their studio implemented the server back-end for the game's free-targeting combat system. Some parts of the technical solution described in the article were unclear, or the problems were not obvious, so I filled the gaps by guessing their approach.

Basics

During fights, the player uses various kinds of attacks and skills. For the player, free-targeting means paying attention to the avatar's orientation and position. There is no monster selection like in WoW. Looking at actual gameplay footage, the combat feels more lively, but there is a lot of hit and run.

From a system perspective, free-targeting means the server does not know which enemy the avatar is targeting. Hence, the server has to detect if a skill's volume of effect collides with any of the monsters around the avatar. To ensure quick computation of the collision, the server runs at 60 FPS. These features are similar to traditional multiplayer FPS requirements, except that for security and simplicity, there is no lag compensation running on the client-side. All the computations happen on the server-side. However, for an MMO to reach a gameplay as responsive as an FPS, the client should be able to execute the player's actions immediately, account for lag, and eventually run some of the NPCs. TERA accounts for lag like an MMO: in PvE, NPCs are slow to give players time to react. In PvP, however, players with a lower ping to the server have an advantage.

Game design

Every creature in the game world is made of cylinders. A skill reaches its target when the skill's volume of effect collides with one of the target's cylinders. A skill's volume of effect can be a cylinder (e.g. usual AoE around a target or self), a portion of a cylinder (e.g. rotating attack using an axe), a portion of a cone (e.g. a frontal attack using a spear), and so on.

Since skills may last for some time, they are discretized in two to more than a dozen targeting times, and each targeting time is associated with a particular volume of effect. Each targeting time consists of two phases: a search (read) phase, and an update (write) phase. For instance, using a rotating attack, the avatar spins for 360 degrees in 300ms, inflicting damage during its turn. The first targeting time fires at 50ms: in the search phase, the server searches for a target standing in a particular volume in front of the avatar, and, in the update phase, inflicts that target 200 damage. The second targeting time fires at 100ms: the server searches for a target standing in another particular volume on the left of the avatar, and inflicts that target 100 damage. And so on for the remaining targeting times.

Multi-threading and performance

Task Example
Timer Damage over time or NPC AI movement scripts
Region update Add/remove/update the position of a creature in a region
Creature update Update HP, status, position of a creature

Each TERA server runs the whole world for approximately 6k concurrent players. The server has a few things to do in the main loop. First, it processes client packets, and generates timer, region, and/or creature tasks from them. That is when the search-for-target phase happens. Then, it executes scheduled timer tasks, which will likely themselves generate region and creature tasks. And finally, it executes region tasks. The update phase happens in the creature tasks, and also in the region task in case of creature movement, spawn, or death events.

To come up with their final design, the TERA engineers may have followed this train of thoughts:

  • Problem: an 8-core server running Windows reaches 100% CPU with 3k players. How can this be improved?
  • Solution: use asynchronous IO such as Window's IOCP. It automatically creates a task queue for client packets and a worker thread pool to process that packet queue in parallel.
  • New problem: The threads spend too much time waiting for locks when trying to read or update region data (e.g. list of creatures in the region, or creature position).
  • Solution: Avoid contention by replicating the world in each thread. No lock: the thread searches in its own version of the world.
  • New search-phase problem: When executing a region task, the region data of a particular thread becomes inconsistent with other threads.
  • Solution: The thread pushes the region tasks it generates into a queue shared between all threads, so that each thread processes all the region tasks. As an example, when an 8-core server receives a client movement packet, one thread processes it and generates (for instance) one region task from that packet. The thread pushes this region task to the end of the queue with an execution counter of 8. Each of the 8 threads processes the task queue at its own pace. When a thread processes a task, it decrements the task's counter. When the counter reaches 0, the thread deletes the task from the queue. Hence, the task mentioned above will be removed from the queue when all the threads have processed it.
  • New update-phase problem: frequently updating a particular creature causes lock contention.
  • Solution: The world replica of each thread stores creature pointers. When a thread processes a packet or a timer task that results in a creature update, it pushes the creature's function to run (e.g. Creature.receiveDamage), with particular arguments (e.g. 200) to that creature's task queue. If the creature task queue is currently being iterated over by another thread, then the current thread returns to what it was doing before (e.g. main loop). Otherwise, the current thread iterates over the creature's task queue. The queue starts with the task the thread just pushed, but other threads may push tasks to the queue while the current thread is processing its task. The current thread will process those extra tasks too, until the queue is empty. Then, the thread will go back to its main loop. Hence, creature tasks are executed asynchronously; getting a return value for the task execution requires the active object pattern.

The resulting high-level architecture is something like the following:

Questions

Seugmo replied to my original questions (see the comment below); here are some more!

  • How much CPU is caused by packet processing? How much does the CPU increase when 1000 players gather together versus when players are homogeneously scattered?
  • When lots of creatures gather in the same region, the search phase (happening on only one thread) may consume a lot of that thread's CPU. Could the search phase happen faster if the regions were smaller? Could it happen faster if the regions were using Binary Space Partitioning?

3 comments:

  1. Simple Answer (Sorry. I am so busy :)

    0. "As an example, for 16 client movement packets arriving at the server, each of the 8 threads is in charge of two packets" ==> wrong example. It is non-deterministic. (depends on OS circumstance)

    1. for each targeting time

    2. parallelized between threads at each IOCPTask(). very little CPU usage needed. (Just decrypt )

    3. Simple 3D Euclid geometry. Lots of creatures? No problem. That is just a thread-local operation.

    4. whole world run on a single 8-core machine :)

    ReplyDelete
  2. Thanks for the reply, Seugmo. I updated the post, and edited/added questions. Thanks again!

    ReplyDelete
  3. Hello again!

    My Answers:
    1-1. How much CPU is caused by packet processing?
    less than 5%. It was cheaper than I thought.
    1-2. How much does the CPU increase when 1000 players gather together versus when players are homogeneously scattered?
    It's hard to answer exactly. It depends on players' behaviors which invoke frequent broadcasting.

    2-1. When lots of creatures gather in the same region, the search phase (happening on only one thread) may consume a lot of that thread's CPU
    Yeah. You are right. But "How often gather" is more critical factor than "How many creature"

    2-2. Could the search phase happen faster if the regions were smaller?
    Yes, of course. We divided the world into small areas (clusters) as small as possible.

    2-3. Could it happen faster if the regions were using Binary Space Partitioning?
    Maybe it gives less improvement than you thought. Because any creatures know their position (x,y,z), they already know which cluster should be searched.

    For your information, you can refer to http://www.slideshare.net/ujentus/cgdc2010

    ReplyDelete

Note: Only a member of this blog may post a comment.