
Using Game Engines to Develop Virtual Agents

Linköpings universitet/Linköping University | IDA
Bachelor's thesis, 16 hp | Innovative Programming
Spring term 2022 | LIU-IDA/LITH-EX-G--22/072--SE
Nathalie Stensdahl
Kristoffer Svensson
Supervisor: Jalal Maleki
Examiner: Rita Kovordanyi
The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25
years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to download, or to
print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational
purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are
conditional upon the consent of the copyright owner. The publisher has taken technical and administrative
measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is accessed
as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication
and for assurance of document integrity, please refer to its www home page: https://ep.liu.se/.
© 2022 Nathalie Stensdahl
Kristoffer Svensson
Using Game Engines to Develop Virtual Agents
Nathalie Stensdahl
[email protected]
Kristoffer Svensson
[email protected]
Virtual agents have been around since the 1960s and their use is constantly increasing across many different settings. At the same time, game engines are increasingly used for purposes beyond games. In this paper, the aim is
to create two proof-of-concept virtual agents in the form of a
humanoid receptionist as part of a bigger project. The agents
were implemented using two of the most widely used game
engines, Unity and Unreal Engine. During the development,
the different aspects of the engines and their suitability for
developing virtual agents were compared. It was found that
Unity was easier to use for someone inexperienced with
game engines, but that otherwise the choice is largely a matter of personal preference.
Virtual agents are becoming increasingly popular, and
almost everyone you ask has probably interacted with one at
some point. They can be used for various tasks, like setting
an alarm, controlling smart devices, or telling the user what
the weather is like. Companies may utilize them instead of human employees to aid with customer support, among other things [1], and most people carry smartphones with voice-activated assistants right in their pockets.
The first ever voice assistant was introduced in 1961 by IBM
in the form of the IBM Shoebox [2]. It could listen for
instructions for simple mathematical problems through a
microphone and calculate and print the results [3]. Since then, the evolution of virtual agents has progressed steadily.
Siri, launched by Apple with the release of the iPhone 4S in
2011, is considered to be the first modern virtual assistant. A
few familiar assistants introduced after Siri are Amazon’s
Alexa, Google’s Google Assistant and Cortana by Microsoft
[4]. All of these are examples of assistants that use text and
voice to communicate.
The market for virtual assistants is constantly growing with
more features and platforms being added [4] and therefore
the user base will grow, and more companies will want to
develop and use these technologies.
In this paper, the focus will be on assistants with a visual
aspect to them, more specifically assistants represented in a
human form on a screen. Game engines have the potential to
be good tools to develop these agents. Therefore, their
suitability for this purpose will be examined. Two of the
most popular game engines1 will be evaluated and a virtual
agent will be designed using each of them. The engines
chosen for this project are Unity and Unreal Engine. While
developing the agents, the different aspects of the respective
engines will be evaluated and compared against each other
for this specific use case.
The purpose of this paper is to evaluate the suitability of two
popular game engines for developing virtual agents.
This paper is part of a larger project with the goal of
developing a virtual receptionist which will be placed
throughout the Linköping University campus on TV
monitors. The monitors will be equipped with proximity
sensors or a camera that will detect people walking up to the
screen and allow the receptionist to make eye contact and
start listening. The receptionist will be able to answer
questions regarding the location of lectures, offices and other
rooms and give directions to the user based on the location
of the monitor. The directions will be given through speech
and gestures such as pointing and looking.
The goal for this paper is to evaluate the suitability of the
chosen game engines by creating one proof-of-concept
receptionist in each engine. These agents will only have basic
animations, listen to keywords, and respond with predefined
answers specified in a grammar file. The purpose of this
paper is to arrive at conclusions that will be of use to future
development of the receptionist.
Research questions
RQ1: Are relevant plugins and libraries harder to get up and
running in either of the engines?
It will be investigated whether the engines have built-in
support for the functionality required for a virtual agent, or if
plugins from their respective asset stores or third-party
plugins are needed. The ease with which these plugins can be integrated into the respective projects will also be investigated.
RQ2: Is Unity or Unreal the better choice for developing
virtual agents?
The required criteria for virtual agents are specified under
“Virtual agents” in the theory section of this paper. The game
engines will be assessed on whether they are able to fulfill
those. The time requirements for learning each engine will
also be taken into account.
Theory
This section contains a description of terms, techniques and
software used or investigated throughout the work. This is
meant to provide a good understanding of the work and
following sections of the paper.
Each game engine comes with editors. An editor is a visual interface where properties such as textures and objects can be edited and added to the project. A central part of each editor is a large viewport where a preview of the "world" is visible and where the user can run simulations and test the game during development. Objects can also be added, removed, moved around, and otherwise manipulated in the scene. Unreal Engine additionally has editors for editing materials and blueprints.
Virtual Agents
The term “virtual agent” is broad and can mean a lot of things
and imply different features and attributes. According to Hartholt et al. [5], a virtual human needs to be able to simulate human capabilities, such as recognizing speech, understanding natural language, and recognizing facial expressions, and to respond with speech and facial expressions. The authors also note that not all virtual humans have all of these capabilities; most consist of a subset of the defining capabilities.
In this paper, "virtual agent" refers to agents that represent humans in that they can process natural language, respond to questions, and have visual human features, gestures, and attributes. The terms "virtual agent" and "virtual human" will be used interchangeably going forward.
In Unreal Engine, the user is given the option to do
programming in C++ using a text editor such as Visual
Studio or to use blueprints (see Figure 1). Blueprints are a
way to implement functionality and logic using visual
scripting where nodes are created and connected through
input and output pins.
Game engines
Game engines are software frameworks designed to aid
developers in constructing games. The functionality often
includes a rendering engine, a physics engine, sounds,
animation, AI, networking, memory management, threading,
porting to multiple platforms etc. 2.
Unreal Engine
Unreal Engine is a game engine owned by Epic Games. It was first released in 19983 and is used not only for creating games but also in filmmaking to create special effects, in real-life simulations, and in architectural design. Unreal Engine was
used during the creation of the Mandalorian series by
Disney4. It has also been used to simulate visual impairments
to aid with both education and describing symptoms to
patients in the health care industry [6].
Unity
Unity is a game engine written in C++ and C# owned by
Unity Technologies. It was founded in Copenhagen,
Denmark and was first launched in 2005 [7]. In 2021, 50%
of games across mobile, PC, and console were made with
Unity and 5 billion apps made with the engine were
downloaded each month5. Just like Unreal it is mainly used
to create games but is also used for film making and
architectural design. Disney used Unity to create their short
series Baymax Dreams6. Unity has also been used to simulate
smart houses to aid with designing them [8]. Tools built in Unity have been used in educational environments as well.
Figure 1. An example of a Blueprint in Unreal Engine
Textures, which consist of image files, are what is placed on and wrapped around objects to give them the appearance of being made of a specific material such as skin or wood.
A UV map tells the renderer how to apply or map the image
to an object, for example, how a shirt should fit on a
character’s body.
Rigs are like skeletons and are what gives characters bones
and joints used for animating the characters (see Figure 2).
Blendartrack10 is a third-party plugin for Blender that allows the user to record facial movements with its proprietary app for iOS or Android and export the recording into Blender. The movements can then be transferred to a Rigify11 face rig in order to avoid animating manually.
A study [10] conducted at Ithaca College evaluated Unity versus Unreal Engine in the context of teaching and learning in an educational environment. The
authors found that students generally found Unity to be
easier to learn, while Unreal was harder, but provided more
satisfying results. One student thought that Unreal Engine
was as easy to learn as Unity, but they had learned Unity first
and not both at the same time.
Figure 2. Face rig in Blender
Plugins are added to projects to add functionality that isn't included in the game engines natively, such as TTS (text-to-speech) and STT (speech-to-text).
Mixamo7 is a company that provides a free online web service for 3D characters, rigging, and animations. Users can choose one of Mixamo's own characters or upload a custom model and apply a rig and optional animations to it before downloading. Animations can also be downloaded on their own, without any model. The
downside to using Mixamo characters is that they lack face
rigs and separate eyes, teeth, and tongue. This makes
animating facial expressions difficult. The characters and
any animations are exported as .fbx (filmbox) files.
Makehuman8 is a software tool that allows the user to create
a 3D model of a human. Properties such as gender, ethnicity,
and body shape can be set using sliders. The user also gets to
pick clothes and hair as well as eyes, teeth and tongue which
are needed when creating face rigs and animating mouth
movements, such as speech.
Blender9 is a software used for modeling and animating 3D
graphics. When animating in Blender, the user can manually
pose their model and use each pose as a keyframe. Several
keyframes can be used to create an animation. Blender “fills
in the gaps” between keyframes, making the animation look
smooth and the animator's job easier. If the animation needs
to be more precise, more keyframes can be added.
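The "filling in the gaps" between keyframes can be illustrated with a small sketch. This is a hypothetical, simplified model (Blender's default interpolation is Bezier-based, not linear, and the `interpolate` helper is invented for illustration):

```python
# Hypothetical sketch of keyframe interpolation ("filling in the gaps").
# Linear interpolation is shown for simplicity; Blender's default is Bezier.

def interpolate(keyframes, frame):
    """Interpolate a pose value at `frame` between surrounding keyframes.

    keyframes: sorted list of (frame_number, value) pairs.
    """
    # Clamp to the first/last keyframe outside the animated range.
    if frame <= keyframes[0][0]:
        return keyframes[0][1]
    if frame >= keyframes[-1][0]:
        return keyframes[-1][1]
    # Find the pair of keyframes surrounding the requested frame.
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return v0 + t * (v1 - v0)

# Three manually posed keyframes, e.g. an arm rotation in degrees.
poses = [(0, 0.0), (10, 90.0), (20, 0.0)]
print(interpolate(poses, 5))  # → 45.0, halfway between the first two poses
```

Adding more keyframes between the existing ones would override the interpolated values with exact poses, which is how the animation is made more precise.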
A study by S. Babu et al. [11] investigated the conversational skills and social characteristics of a virtual receptionist called Marve. Marve interacted using spoken
language and non-verbal cues such as vision-based gaze
tracking, facial expressions, and gestures. One of the metrics
evaluated in the study was to measure how often people
passing by would stop to interact with him. The authors
found that almost 29% of the times someone passed the
doorway by his display, they stopped to interact with him.
The authors also evaluated how often users wanted to have a
task-oriented versus a social conversation with Marve. The
users were given four conversational topics: the weather, movies, a joke, or leaving messages. They found that 80% of
users preferred social small talk, i.e., the first three topics.
Another study [12] found that even when the intended
function of an agent was not small talk, 18% of the questions
asked were socializing questions, such as asking the agent
about its favorite color.
Researchers at MIT developed what they referred to as an Autonomous Conversational Kiosk, called MACK [13], in
2002. MACK was displayed as a life-sized blue robot on a
flat screen and was aware of his surroundings. Because of
this awareness, MACK was able to give directions to the
users by using gestures. The researchers found that the gestures were effective in conveying information: for example, when MACK pointed to a room behind the user, the user would turn around to look in that direction.
These studies indicate that virtual agents can be useful to people and that they need functionality beyond the purpose they were originally planned for, which is something to keep in mind during the implementation of this project.
The following subsections give an overview of the tools and methods used during implementation as well as a step-by-step description of the development process.
Research setting
All work was done on a PC with Windows 10 Education with
Unity version 2020.3.32f1 and Unreal Engine version 4.27
installed as well as MakeHuman version 1.2.0 and Blender
version 3.1.2.
A connected speaker with an integrated microphone was
used for voice input and output.
Research method
The virtual agents were developed in parallel, i.e., each step
was completed on both agents before moving to the next step,
even if something was much more time consuming for one
of the agents. Throughout development, notes were
continuously taken on issues, such as missing plugins and
tasks that required more time than expected, as well as parts
that were significantly easier to implement in one engine
than the other.
Facial animations requirement
In order for facial animations to work properly, the model
required separate meshes for teeth, tongue, and eyes as well
as a face rig. The character from Mixamo does not have
these. Because of this, a character made in Makehuman was
used instead. The character’s .fbx (filmbox) file was
uploaded to Mixamo and their automatic rigging tool was
used to add a rig to the character’s body. This was done so
that the exported animations from Mixamo would fit the rig
of the character. The rigged Makehuman character was then
imported into Blender to add a face rig to it and combine it
with the Mixamo rig. The Blender plugin Rigify was used to
create the face rig.
Figure 3. Part of the grammar
Importing the model
At the start a character exported from Mixamo was added to
the Unity project. The model was then dragged into the
scene. The textures of the model were displayed as all white
at that point (see Figure 4). To fix this, the embedded textures
and materials had to be extracted in the materials tab of the
character. The rig's animation type was also changed to humanoid, which creates an avatar for the character and enables animation retargeting.
For the grammar, a .csv (comma-separated values) file was
used to make it simple to add or edit which keywords the
agent listens for and what the agent is supposed to say in
response. This file contained a column for keywords and one
column for their corresponding responses. In the responses,
triggers for animations were embedded (See Figure 3) and
could be opened and edited in Excel or any plain text editor.
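The keyword-to-response mapping described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code; the `[BODY:...]` trigger syntax and the `parse_grammar` helper name are assumptions:

```python
# A minimal sketch of parsing the grammar .csv described above, assuming
# one "keyword,response" pair per line, where responses may themselves
# contain commas and embedded animation triggers.

def parse_grammar(text):
    grammar = {}
    for line in text.strip().splitlines():
        # Split on the FIRST comma only, since responses may contain commas.
        keyword, response = line.split(",", 1)
        grammar[keyword.strip().lower()] = response.strip()
    return grammar

# Illustrative grammar contents with embedded animation triggers.
csv_text = """hello,[BODY:WAVE] Hi there! How can I help you?
lecture,The lecture hall is [BODY:POINTLEFT] down that corridor."""

grammar = parse_grammar(csv_text)
print(grammar["hello"])
```

Splitting on the first comma only is what makes it safe for responses to contain punctuation, which is why the file has exactly two logical columns.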
Figure 4. Unextracted textures
The agent needed to be able to hear the user's voice as well
as recognize words and sentences. Unity provides a built-in class called DictationRecognizer12 for this purpose, which uses Windows' speech recognition support. DictationRecognizer was utilized in a C# script and the
“DictationRecognizer.DictationHypothesis” event was used
to call a function that checked if a specified keyword had
been heard.
Since Unity does not have any built-in functionality for text-to-speech, a plugin was used. There were some plugins available in Unity's asset store but none of them were free. A third-party plugin13 was used which utilized Windows' built-in speech synthesizer. The "WindowsVoice.Speak" function was used to enable the agent to talk; it was called with a string as the argument, specifying what the agent should say.
The animations used were all downloaded from Mixamo and
were imported into the project. The animation type was
changed to humanoid in the animation’s rig tab to match the
rig of the character. Doing this created an avatar for the
animation with its own bones. These bones mapped to the
character's avatar and made sure the correct bones moved
during a certain animation. The next step was to create an
Animator Controller (see Figure 5) which was used to create
the different states that run the animations. To go from one
animation to another, a transition is required. In order to
trigger the transitions, Unity uses a feature called triggers.
Triggers can be activated from a C# script.
The bones for the
Makehuman character’s avatar were mapped to the
corresponding bones in the agent’s rig. This was done to
make sure that the animations made for the Mixamo
character worked as intended for the new character (see
Figure 6).
private Animator anim;
anim = GetComponent<Animator>();
anim.SetTrigger("BODY: POINTLEFT");
The SetTrigger function takes a string representing the name of the trigger to activate. Triggers in Unity are booleans that, when set to "true", immediately reset back to "false", removing the need for the developer to reset
the value. Each animation, except the idle animation, was
associated with its own trigger and started playing when the
trigger was set to “true”. Since the triggers automatically
reset, the animations only played for one animation cycle.
Since the idle animation was set to the default state, the agent
always went back to that animation after any of the other
cycles finished.
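The auto-resetting behavior of Unity's triggers can be modeled outside the engine with a small sketch. The `Trigger` class below is an invented stand-in for illustration, not Unity's implementation:

```python
# A sketch of the auto-resetting trigger semantics described above.
# Setting the trigger requests one animation cycle; consuming it (as the
# animator would when taking a transition) resets it to False automatically.

class Trigger:
    def __init__(self):
        self._set = False

    def set(self):
        # Corresponds to anim.SetTrigger(...) in the Unity script.
        self._set = True

    def consume(self):
        """Return True once per set() call, then reset automatically."""
        was_set = self._set
        self._set = False
        return was_set

wave = Trigger()
wave.set()
print(wave.consume())  # True: the animation plays one cycle
print(wave.consume())  # False: no manual reset was needed
```

This one-shot semantics is why each animation played for exactly one cycle before the state machine fell back to the default idle state.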
Figure 6. Bone mapping in Unity
The grammar file’s contents were parsed in the C# script.
Each line was split on the first occurrence of a comma. The
two resulting parts were inserted into a map with the
keywords as keys and the phrases as the values. When the
DictationRecognizer called its hypothesis event, the
recognized phrase was checked for any of the parsed
keywords. If a keyword was found, the corresponding phrase
was further parsed. Since the phrases contained trigger words
for animations, they had to be processed before calling the
Speak function. The phrase to be spoken was split at each
trigger using a regular expression. All resulting elements from the split were put into an array. The array was looped through; when a trigger word was found, SetTrigger was called with it as the argument. The remaining elements were passed to "WindowsVoice.Speak".
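The splitting-and-dispatch logic described above can be sketched engine-agnostically. Here `set_trigger` and `speak` are stand-ins for `anim.SetTrigger` and `WindowsVoice.Speak`, and the `[BODY:...]` trigger syntax is illustrative, not the project's exact format:

```python
import re

# Triggers are assumed to be embedded in responses as [SECTION:NAME].
TRIGGER_RE = re.compile(r"\[([A-Z]+:[A-Z]+)\]")

def process_response(phrase, set_trigger, speak):
    # Splitting on a capturing group keeps the trigger names in the result,
    # interleaved with the spoken text fragments.
    for part in TRIGGER_RE.split(phrase):
        part = part.strip()
        if not part:
            continue
        if re.fullmatch(r"[A-Z]+:[A-Z]+", part):
            set_trigger(part)   # corresponds to anim.SetTrigger(...)
        else:
            speak(part)         # corresponds to WindowsVoice.Speak(...)

# Record the dispatched calls instead of driving a real engine.
events = []
process_response(
    "The lecture hall is [BODY:POINTLEFT] down that corridor.",
    set_trigger=lambda t: events.append(("trigger", t)),
    speak=lambda s: events.append(("speak", s)),
)
print(events)
```

Interleaving the trigger and speech calls in document order is what keeps the pointing gesture roughly synchronized with the matching part of the spoken phrase.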
Unreal Engine
Importing the model
Figure 5. Animator Controller in Unity
Facial animations and retargeting
To make face animations possible, a Makehuman model with
the required meshes for eyes, teeth and tongue was needed.
The new model was imported into Unity and its textures and
materials were extracted in the same way as for the Mixamo
character. This time however, the textures for the hair,
eyebrows and eyelashes showed up as solid blocks. This was fixed by changing the settings in their materials to ones appropriate for hair textures.
The same character model from Mixamo was used and
imported into the Unreal Engine project in the same way as
it was done in Unity. The character’s hair texture was not
visible so some changes to the hair material were required.
Speech-to-text and text-to-speech
Unreal Engine had no built-in support for STT or TTS. The
plugin that was used utilized Microsoft Azure14, which was
free up to a certain number of hours of speech per month,
which was sufficient for the purpose of this project. An
account at Microsoft Azure was required to obtain a
verification key, which was needed to use the plugin.
Initially a Blueprint class was used for the logic, but it
quickly became messy and difficult to follow or modify.
Because of this, a C++ based class of the agent was
implemented and used instead.
The animations from Mixamo were imported into the Unreal
project. To get them to work properly, an animation blueprint
was required. In this blueprint a state machine, similar to the
animator controller in Unity, was added. The state machine
contained different animation states and transitions between
them. The transitions were activated when a specified value was reached; in this case, booleans were used to trigger the animations. Nothing similar to Unity's SetTrigger was found for Unreal Engine. Instead, a boolean was set to true in the C++ script; the animation blueprint's event graph then cast the Pawn to the C++ class and used Get functions to retrieve its value and trigger the animations (see Figure 7).
Facial animations and retargeting
To change to the new model from Makehuman, adapted for
facial expressions, Unreal Engine’s retargeting manager was
used. In the model’s Skeleton Editor, there was a button
called Retarget Manager. As long as the rig type was the
same for the new and the old character and all the bones were
assigned correctly, it was capable of duplicating and
transferring animations from one model to another. This
made it possible to smoothly transition from the Mixamo
character to the Makehuman character.
To parse the grammar file in Unreal Engine, Unreal’s
FFileHelper class was used in the C++ script. This allowed
us to read the file into an array, then loop through it to read
the file line by line. The contents were put into a map in a
similar way to how it was done in Unity.
To implement text-to-speech and speech-to-text
functionality in this project, a mixture of functions built into
the engines, plugins from the asset stores and third-party
plugins found by browsing internet forums were used. STT
in Unity was significantly easier to implement than the others
since it did not require a plugin at all. The TTS plugin for
Unity as well as the TTS and STT plugins for Unreal were
equally simple to get up and running if the time spent on
searching for them online is ignored.
Figure 7. Blueprint for retrieving and setting boolean values
To set the triggering boolean values back to false, animation
notifications15 were utilized. Animation notifications, also
referred to as notifies, are a way to trigger events during an
animation cycle. One was added at the end of the cycle for each of the animations and named according to the animation it came from. An example is the notify called "DoneWaving", which was triggered at the end of the
waving animation cycle. In the animation blueprint class,
that event triggered a change to the corresponding boolean
values (see Figure 8).
Figure 8. Blueprint for resetting values after notify event
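The notify-driven reset pattern can be sketched as follows. `AgentAnimState` is an invented stand-in for the combination of the C++ class and the animation blueprint; only the notify name "DoneWaving" comes from the text:

```python
# A sketch of the notify-based reset pattern described above: a boolean
# starts an animation state, and a notify fired at the end of the cycle
# sets it back to False, returning the state machine to idle.

class AgentAnimState:
    def __init__(self):
        self.is_waving = False   # read by the state machine each tick

    def request_wave(self):
        # Set from the C++-script side to enter the Waving state.
        self.is_waving = True

    def on_notify(self, name):
        # Called by the animation when a notify event fires.
        if name == "DoneWaving":
            self.is_waving = False

agent = AgentAnimState()
agent.request_wave()
print(agent.is_waving)         # True: the Waving state is entered
agent.on_notify("DoneWaving")  # fired at the end of the waving cycle
print(agent.is_waving)         # False: back to the idle state
```

Compared with Unity's self-resetting triggers, the reset here is an explicit extra step, which is part of why this approach took more code and research time.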
When first starting to work on this project, the impression
was that everything could be implemented in Unity using C#
and everything in Unreal using C++ and that Unreal’s
blueprints were optional. This was true for the Unity project.
For Unreal though, blueprints were necessary to achieve the
goals without having to write very complex code. Most of
the official documentation and instructions in online forums
used blueprints and at least a combination of blueprints and
C++ was required.
When using blueprints in Unreal, the editor quickly became
cluttered, hard to follow and difficult to edit with all the
nodes and pins in between (see Figure 9). Some of it was
manually translated into C++ code. For the animation logic
though, we had to use blueprints.
In general, plugins were equally easy to get up and running
in both engines and the user had the option to use third-party
plugins if the asset stores did not have the plugins needed.
RQ2: Is Unity or Unreal the better choice for developing
virtual agents?
Both Unity and Unreal Engine are viable tools for this
application. It is possible to implement a virtual agent in both
engines with some features being easier to do in Unity than
Unreal, like triggering animations. Development in Unreal
Engine required more time spent on learning and research.
Figure 9. STT blueprint
A big difference between the engines, with respect to the
amount of code and time spent, was the implementation of
triggers in Unity versus animation notifications in
combination with boolean values in Unreal. They were both
used for starting and stopping animations in this project. The
trigger solution in Unity required only one line of code to
access the animator controller and one line of code to get
each trigger value and was very quick to implement,
including research. The way to implement the same
functionality in Unreal was through animation notifications
and boolean values. That required significantly more time to
research and implement and the C++ script, the animation
blueprint, and the animations themselves were involved.
In order to implement facial animations, third-party software was needed. Creating a rig for the character's face and merging it with the rig for the body, including bug fixing and research, took approximately five days. Unfortunately, none of the motion capture plugins for Blender that were planned to be used worked as expected, for various reasons. By the time the face rig was complete and the plugins had been tried, there was not enough time left to create the animations manually either.
RQ1: Are relevant plugins and libraries harder to get up and
running in either of the engines?
It was found that Unity had built-in support for STT, which we used for a basic version of dialogue capability. A third-party plugin had to be used for TTS. Unity's asset store
contained a plethora of different plugins for this purpose, but
they were all paid, and since this project did not have a
budget, a third-party plugin found on the web was used. The
plugin was easy to get up and running. All that was required
was to download it and then import it into the Unity project
and add it to the agent.
Unreal, on the other hand, had built-in support for neither TTS nor STT. Unreal's asset store also contained a large number of plugins for this purpose that all came at a cost, except for one plugin with both TTS and STT support. It did, however, utilize Microsoft Azure, which could become costly if the free hours provided were used up.
For the purposes of this project, Unity is more than good
enough when it comes to available functionality and the
visual requirements. It also requires less time learning how it
works. For a beginner, Unity would be the better option with
respect to its relative simplicity compared to Unreal Engine.
For the experienced user, the choice might depend on factors
not investigated in this paper.
Had the project had funding, the search for plugins would
have taken much less time and the number of options
available would have been significantly higher. With more
time and options, one could try out several plugins, which
would aid in answering RQ1 more generally as well as
possibly allow for a smoother development process as some
plugins might be easier to use and might have more
documentation and more online support available (bigger
user base).
When implementing new functionality in Unreal, scripting
only using C++ would have been preferred, but for some
functions, blueprints had to be used since writing code would
have been far more difficult. It worked well but became
repetitive at times with a lot of duplicated code. The blueprint
could possibly have been made to look more compact and
easier to understand with more time, though it worked for
our purposes.
One difference between adding new plugins to the engines
was that Unreal Engine required a restart to enable them.
This was not a big issue when only one plugin was used, but
if multiple plugins get installed, over time it could result in a
lot of time getting wasted on waiting for Unreal to restart.
Given the simple graphics and functionality of the developed
agents, deciding which game engine is best suited for it is not
an easy task. The dilemma of deciding which game engine is better was captured by M. Lewis et al. [14] in an article discussing Unreal Engine and Quake:
A researcher’s choice between the Quake and Unreal
engines is similar to that between Coke and Pepsi. Both
are more than adequate for their intended purpose yet
each has its avid fans and detractors.
Unity was found to be easier to get started with than Unreal
Engine. This is partly because Unity uses regular C# while
Unreal Engine uses their own version of C++ called Unreal
C++. We had experience with both C# and C++ prior to this
project which resulted in a false sense of confidence when
starting to script in the two engines. It quickly became
apparent that many features of the standard library in C++
were not available in Unreal C++. Instead, one was forced
into using their own functions. For example, to read data
from a file one had to use the FFileHelper class instead of std::fstream, the usual way to read files in regular C++. To print something to the console, GLog was used
instead of std::cout. This made scripting in Unreal Engine
cumbersome and caused it to consume more time than
implementing the same thing in Unity.
Another factor that affected the time it took to script in
Unreal was the lack of error messages when using regular
C++ functions. When using std::cout to print to the console,
for example, no indication of error was given from the editor
or the console. Instead, the expected output simply did not
show up, which led to having to resort to online forums for
explanations and solutions.
Our findings regarding the difficulty of learning to use the engines are consistent with the results of the study by P. E. Dickson et al. [10].
The two engines had great support online in the form of
forum posts and tutorials one could use as a reference and
guide to solve issues and implement new features. However, the documentation of both engines could have been more helpful: it did a good job of explaining what functionality was available but left out how the functions worked and were supposed to be used.
At the beginning of this project, we assumed all the work
would be contained within the game engines, but when it was
time to implement facial animations, we had to use an
external tool, in our case Blender. Learning to use Blender
took more time than we had planned. When importing
the new model into the game engines, there were conflicts
between some of the bones in the new rig, which resulted in
more time being spent on debugging. A positive aspect of
using an external tool like Blender, was that the same models
and animations could be used in both engines. We did not
manage to get the facial animations to work properly due to
time restrictions, but given enough time and knowledge it
should be achievable.
Future work
In this project, two virtual agents were created using Unity
and Unreal Engine. They are proofs of concept with basic
functionality, whose purpose is to evaluate the viability of
developing such agents in this kind of environment.
This section describes features of the agents that were not
implemented. Some of them were never planned to be
implemented at this stage, but rather in the future of the
larger project that this paper is part of, and some of them
were not implemented because of time limitations or a lack
of funding. The future works described below are either
essential for a fully functional virtual receptionist or optional
features that would improve it.
All features described would help to gain a better
understanding about the difference in available plugins for
each engine that can aid development as well as creating a
wider basis for evaluating whether one engine is better suited
than the other.
Facial expressions
The agents created in this project do not have any facial
expressions. In order to implement these, one would have to
have spent sufficient time learning about animating in
Blender or have access to software that can create facial
expressions, like the body animations available from
Mixamo. In this project, some time was spent examining the
third-party plugin Blendartrack, which did not work as
expected. If more time was available, one could have used
the plugin to create a wide range of expressions to combine
with body movements. This is an essential feature for a
virtual receptionist as part of conveying emotion.
Dialogue
The agents created in this project can answer simple
questions specified in a grammar file. They only listen for
keywords though, which means that the user can say
anything to the agent and get the correct response as long as
the recognized phrase contains one of the keywords specified
in the grammar.
In future work, a dialogue engine should be implemented,
where the agent can remember earlier questions or phrases in
order to ask for clarifications and answer more complex
questions, as well as participate in small talk. This is a
requirement for the final product.
Eye gaze
Eye gaze can help indicate the flow of the conversation,
indicate interest and aid turn-taking in the conversation [15].
As of now, the agents stare straight ahead, resulting in no eye
contact with the person interacting with them. In the future,
a camera or microphone that senses the direction of the voice
of the person speaking could be utilized to get the agent to
look at the person they are talking to as well as looking
around. This is not an essential feature but would improve
the agent.
Blending animations
Currently, there is no blending between animation cycles for
either agent, which results in some unexpected movement
when transitioning between animations. To solve this issue,
one could research how to blend the animations to achieve
smoother, more human-like transitions. This is not essential
but would greatly improve the sense of human likeness.
Appearance
The appearance of virtual agents influences the user’s
motivation and attitude [16]. At this moment, the agents are
created using MakeHuman and the visual features are
arbitrary. In future works, it would be interesting to take user
preferences into account and do some manual modeling to
create the most approachable agent to encourage interaction.
This is not essential but might improve the user’s feelings
toward the agent.
Synchronizing speech and mouth movements
Creating and synchronizing mouth movements with speech
is something that was planned to be implemented in this
project but was not due to time limitations. It would make
the interactions look and feel more realistic. There were paid
plugins in the asset stores to facilitate the implementation of
this. This is essential for the finished agent, and we hope to
see it implemented in future work.
Synchronizing speech and body movements
As of now, one of the developed agents pauses in order to
play a different animation in the middle of a sentence, while
the other runs the animations and speech asynchronously.
The latter looks and sounds natural with short sentences and
few animations, but it would be preferred to have more
control over it, and it would be interesting to see this
researched and implemented in future work. For the finished
receptionist, this is essential.
Motion capture
As previously mentioned, the third-party plugin
Blendartrack was examined. The same developer released a
plugin called BlendarMocap which can capture motions of
an entire body. The captured movements can then be
transferred to a Rigify rig in Blender. Unfortunately,
BlendarMocap only worked with a Rigify face rig that had
not been upgraded in Blender, whereas the face rig used in
this project had been upgraded. In future works it would be
interesting to see what could be accomplished with these or
similar plugins. This is not an essential feature but could
potentially save a lot of time and provide custom, natural
body movement.
Geographic localization
As mentioned in the introduction, the end goal is for the
agent to be aware of its geographic position at the university
campus and give directions based on that location. The agent
would also need to have access to information about courses,
rooms and employees at the university. This is an essential
feature of the finished agent.
Multi-language support
The agent developed in Unity is, as of now, only able to
speak English, since the third-party plugin uses Windows
built-in voice, which only supports a limited selection of
languages. AzSpeech, which was the plugin used in Unreal
Engine, supports many languages, including Swedish. It also
offers multiple choices of voice. Unfortunately, parsing
the Swedish letters “Å”, “Ä” and “Ö” failed when reading
the grammar file, which required translating the grammar
into English and changing the agent’s spoken language. In
future work, multi-language support could be investigated,
though it is not essential.
Expressing emotions through voice
The current agents have voices that do not express emotions.
There are some paid plugins in the asset stores, for example
Replica Studios for Unreal Engine, which uses AI to convey
emotion. Implementing something like this in future work
is not essential but would create a more natural result.
Recognizing unusual names and words
One of the downsides of the speech recognition in both
agents is that they do not recognize unusual names or words,
such as the last name “Silvervarg” (the closest match was
Silfvervarg) or names of rooms like “E324”, which seems to
be heard as several words. Recognizing uncommon words
and names would be essential for a virtual receptionist in
order for it to properly answer questions in this context.
Limitations
Small sample size
In this project, only two agents were created, one for each
engine. In order to make a fair assessment of the two engines,
it would have been ideal to develop multiple agents with
different attributes to get a wider understanding of the
differences. By only implementing one agent in each engine,
the research questions are only answered for the very specific
use case of creating a simple, proof-of-concept agent, rather
than virtual humanoid agents in general.
Time limitations
Since neither of us had any experience with either engine
before this project, all the features and functionality had to
be learned during development. With more time to learn
about the engines, more focus could have been put on the
differences in functionality rather than which one is the
easiest to learn or the best choice for a beginner. One
example of something we would have liked to have more
time to investigate, was the animating part of the engines.
Since no surveys were used to help reach conclusions
regarding which engine was better suited or more preferred,
this study relies on our own opinions of the engines.
Sometimes, one aspect of an engine was significantly easier
to learn and use, which might have affected our opinion on
that part of the engines, even if it was not “better” in reality,
just less frustrating.
Lack of funding
The lack of funding for this project impacted the possibilities
of comparing the available plugins in the asset stores. It also
ate up a lot of time having to search online for third-party
alternatives, time that could have been spent perfecting the
agents and adding more functionality, allowing for a better
overview of the engines for the comparison.
Conclusion
In this paper, we evaluated two game engines, Unity and
Unreal Engine, by creating one virtual agent using each of
them. The purpose was to assess whether either was better
suited for this task, and to compare how easy plugins were
to use.
We found that installing plugins and getting them up and
running was equally simple in both engines. Which engine
is the better choice for developing virtual agents depends
on whether the developer is a beginner, and is partly a
matter of taste. We found Unity to be easier to get started
with than Unreal Engine for someone inexperienced with
game engines.
During development, many plugins were found that could
have been useful but required payment. To make a fair
comparison of the engines, funding is essential to avoid
such limitations.
The use of, and the time it takes to learn, required
third-party software (“required” here meaning industry
standard), such as Blender, has to be considered, as well as
how well it integrates with the engines, in order to plan the
project better and make the comparisons fair.
Both Unity and Unreal Engine fulfilled the requirements
needed to develop a virtual human.
References
[1] V. Chattaraman, W.-S. Kwon and J. E. Gilbert,
"Virtual agents in retail web sites: Benefits of
simulated social interaction for older users,"
Computers in Human Behavior, vol. 28, no. 6, pp.
2055-2066, 2012.
[2] A. Mutchler, "voicebot.ai," 14 July 2017. [Online].
[Accessed 27 April 2022].
[3] "ibm.com,"
d1/specialprod1_7.html. [Accessed 27 April 2022].
[4] M. B. Hoy, "Alexa, Siri, Cortana, and More: An
Introduction to Voice Assistants," Medical Reference
Services Quarterly, vol. 37, no. 1, pp. 81-88, 2018.
[5] A. Hartholt, D. Traum, S. Marsella, A. Shapiro, G.
Stratou, A. Leuski, L.-P. Morency and J. Gratch, "All
Together Now," in International Workshop on
Intelligent Virtual Agents, Berlin, 2013.
[6] J. Lewis, D. Brown, W. Cranton and R. Mason,
"Simulating visual impairments using the Unreal
Engine 3 game engine," in 2011 IEEE 1st
International Conference on Serious Games and
Applications for Health (SeGAH), 2011.
[7] J. Haas, "A History of the Unity Game Engine," 2014.
[8] W. Lee, S. Cho, P. Chu, H. Vu, S. Helal, W. Song,
Y.-S. Jeong and K. Cho, "Automatic agent generation
for IoT-based smart house simulator," Neurocomputing,
vol. 209, pp. 14-24, 2016.
[9] E. Kucera, O. Haffner and R. Leskovsky, "Interactive
mechatronics education developed in unity engine," in
2018 Cybernetics & Informatics (K&I), Lazy pod
Makytou, 2018.
[10] P. E. Dickson, J. E. Block, G. N. Echevarria and K. C.
Keenan, "An Experience-based Comparison of Unity
and Unreal for a Stand-alone 3D Game Development
Course," in Proceedings of the 2017 ACM Conference
on Innovation and Technology in Computer Science
Education, 2017.
[11] S. Babu, S. Schmugge, T. Barnes and L. F. Hodges,
"“What Would You Like to Talk About?” An
Evaluation of Social Conversations with a Virtual
Receptionist," in International Workshop on
Intelligent Virtual Agents, Berlin, 2006.
[12] Q. V. Liao, M. Davis, W. Geyer, M. Muller and N. S.
Shami, "What Can You Do? Studying Social-Agent
Orientation and Agent Proactive Interactions with an
Agent for Employees," in Proceedings of the 2016
ACM Conference on Designing Interactive Systems,
Brisbane, 2016.
[13] J. Cassell, T. Stocky, T. Bickmore, Y. Gao, Y.
Nakano, K. Ryokai, D. Tversky, C. Vaucelle and H.
Vilhjálmsson, "MACK: Media lab Autonomous
Conversational Kiosk," in Proceedings of Imagina,
Monte Carlo, 2002.
[14] M. Lewis and J. Jacobson, "Game engines,"
Communications of the ACM, vol. 45, no. 1, pp. 27-31,
2002.
[15] K. Ruhland, C. E. Peters, S. Andrist, J. B. Badler, N.
I. Badler, M. Gleicher, B. Mutlu and R. McDonnell,
"A Review of Eye Gaze in Virtual Agents, Social
Robotics and HCI: Behaviour Generation, User
Interaction and Perception," Computer Graphics
Forum, vol. 34, no. 6, pp. 299-326, 2015.
[16] A. L. Baylor, "Promoting motivation with virtual
agents and avatars: role of visual presence and
appearance," Philosophical Transactions of the Royal
Society B: Biological Sciences, vol. 364, no. 1535, pp.
3559-3565, 2009.