Linköpings universitet / Linköping University | IDA | Bachelor's thesis, 16 hp | Innovativ Programmering | Spring term 2022 | LIU-IDA/LITH-EX-G--22/072--SE

Using Game Engines to Develop Virtual Agents

Nathalie Stensdahl, Kristoffer Svensson

Supervisor: Jalal Maleki
Examiner: Rita Kovordanyi

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: https://ep.liu.se/.

© 2022 Nathalie Stensdahl, Kristoffer Svensson

Using Game Engines to Develop Virtual Agents

Nathalie Stensdahl
[email protected]

Kristoffer Svensson
[email protected]

ABSTRACT

Virtual agents have been around since the 1960s and their use is constantly increasing. They are used in many different settings. At the same time, game engines are increasingly used for a variety of purposes. In this paper, the aim is to create two proof-of-concept virtual agents in the form of a humanoid receptionist as part of a bigger project. The agents were implemented using two of the most widely used game engines, Unity and Unreal Engine. During development, the different aspects of the engines and their suitability for developing virtual agents were compared. It was found that Unity was easier to use for someone inexperienced with game engines, but that the choice is otherwise largely a matter of personal preference.

INTRODUCTION
Virtual agents are becoming increasingly popular, and almost everyone you ask has probably interacted with one at some point. They can be used for various tasks, like setting an alarm, controlling smart devices, or telling the user what the weather is like. Companies may utilize them instead of human employees to aid with customer support, amongst other things [1], and most people have smartphones with voice-activated assistants right inside their pockets.

The first ever voice assistant was introduced in 1961 by IBM in the form of the IBM Shoebox [2]. It could listen for instructions for simple mathematical problems through a microphone and calculate and print the results [3]. Since then, the evolution of virtual agents has progressed significantly. Siri, launched by Apple with the release of the iPhone 4S in 2011, is considered to be the first modern virtual assistant. A few familiar assistants introduced after Siri are Amazon's Alexa, Google's Google Assistant and Cortana by Microsoft [4]. All of these are examples of assistants that use text and voice to communicate. The market for virtual assistants is constantly growing, with more features and platforms being added [4]; therefore the user base will grow, and more companies will want to develop and use these technologies.

In this paper, the focus will be on assistants with a visual aspect to them, more specifically assistants represented in a human form on a screen. Game engines have the potential to be good tools to develop these agents. Therefore, their suitability for this purpose will be examined. Two of the most popular game engines¹ will be evaluated and a virtual agent will be designed using each of them. The engines chosen for this project are Unity and Unreal Engine. While developing the agents, the different aspects of the respective engines will be evaluated and compared against each other for this specific use case.

1 https://www.gamedesigning.org/career/video-game-engines/

Purpose

The purpose of this paper is to evaluate the suitability of two popular game engines for developing virtual agents. This paper is part of a larger project with the goal of developing a virtual receptionist which will be placed throughout the Linköping University campus on TV monitors. The monitors will be equipped with proximity sensors or a camera that will detect people walking up to the screen and allow the receptionist to make eye contact and start listening. The receptionist will be able to answer questions regarding the location of lectures, offices and other rooms and give directions to the user based on the location of the monitor. The directions will be given through speech and gestures such as pointing and looking.

The goal for this paper is to evaluate the suitability of the chosen game engines by creating one proof-of-concept receptionist in each engine. These agents will only have basic animations, listen to keywords, and respond with predefined answers specified in a grammar file. The purpose of this paper is to arrive at conclusions that will be of use to future development of the receptionist.

Research questions

RQ1: Are relevant plugins and libraries harder to get up and running in either of the engines?

It will be investigated whether the engines have built-in support for the functionality required for a virtual agent, or if plugins from their respective asset stores or third-party plugins are needed. The ease with which these plugins can be implemented into the respective projects will also be evaluated.

RQ2: Is Unity or Unreal the better choice for developing virtual agents?
The required criteria for virtual agents are specified under "Virtual Agents" in the theory section of this paper. The game engines will be assessed on whether they are able to fulfill those criteria. The time requirements for learning each engine will also be taken into account.

THEORY

This section contains a description of the terms, techniques and software used or investigated throughout the work. It is meant to provide a good understanding of the work and the following sections of the paper.

Virtual Agents

The term "virtual agent" is broad and can mean a lot of things and imply different features and attributes. According to Hartholt et al. [5], a virtual human needs to be able to simulate human capabilities, such as recognizing speech and understanding natural language, recognizing facial expressions, and responding with speech and facial expressions. The authors also say that not all virtual humans will have all of these capabilities, but often rather consist of a subset of all defining capabilities. In this paper, "virtual agent" refers to agents that represent humans in that they can process natural language, respond to questions, and have visual human features, gestures and attributes. The terms "virtual agent" and "virtual human" will be used interchangeably going forward.

Game engines

Game engines are software frameworks designed to aid developers in constructing games. The functionality often includes a rendering engine, a physics engine, sounds, animation, AI, networking, memory management, threading, porting to multiple platforms, etc.²

Unreal Engine

Unreal Engine is a game engine owned by Epic Games. It was first released in 1998³ and is used not only for creating games but also in film making to create special effects, real-life simulations and architectural design. Unreal Engine was used during the creation of the Mandalorian series by Disney⁴. It has also been used to simulate visual impairments to aid with both education and describing symptoms to patients in the health care industry [6].

Unity

Unity is a game engine written in C++ and C#, owned by Unity Technologies. The company was founded in Copenhagen, Denmark, and the engine was first launched in 2005 [7]. In 2021, 50% of games across mobile, PC, and console were made with Unity and 5 billion apps made with the engine were downloaded each month⁵. Just like Unreal, it is mainly used to create games but is also used for film making and architectural design. Disney used Unity to create their short series Baymax Dreams⁶. Unity has also been used to simulate smart houses to aid with designing them [8]. Tools built in Unity have been used in educational environments as well [9].

2 https://en.wikipedia.org/wiki/Game_engine
3 https://unreal.fandom.com/wiki/Unreal_Engine_1
4 https://www.unrealengine.com/en-US/blog/forging-new-paths-for-filmmakers-on-the-mandalorian
5 https://unity.com/our-company
6 https://unity.com/madewith/baymax-dreams

Editors

Each game engine comes with editors. An editor is a visual interface where properties such as textures and objects can be edited and added to the project. A main part of the editors is that they have a large window where the "world" preview is visible and where the user can also run simulations and test the game during development. Objects can also be added, removed, moved around and manipulated in the scene. Unreal Engine also has editors for editing materials and blueprints.

Blueprints

In Unreal Engine, the user is given the option to do programming in C++ using a text editor such as Visual Studio or to use blueprints (see Figure 1). Blueprints are a way to implement functionality and logic using visual scripting, where nodes are created and connected through input and output pins.

Figure 1. An example of a Blueprint in Unreal Engine
Textures

Textures are what is placed on and wrapped around objects to give them the appearance of being made of a specific material, such as skin or wood, and consist of image files. A UV map tells the renderer how to apply or map the image to an object, for example, how a shirt should fit on a character's body.

Rigs

Rigs are like skeletons and are what gives characters bones and joints used for animating the characters (see Figure 2).

Figure 2. Face rig in Blender

Plugins

Plugins are added to projects to add functionality that isn't included in the game engines natively, such as TTS (text-to-speech) and STT (speech-to-text).

Mixamo

Mixamo⁷ is a company that provides a free online web service for 3D characters, rigging and animations. They have the option to choose one of their own characters or upload a custom model and apply a rig and optional animations to them before downloading. The user is also able to choose to download animations only, without any model. The downside to using Mixamo characters is that they lack face rigs and separate eyes, teeth, and tongue. This makes animating facial expressions difficult. The characters and any animations are exported as .fbx (filmbox) files.

Makehuman

Makehuman⁸ is a software tool that allows the user to create a 3D model of a human. Properties such as gender, ethnicity, and body shape can be set using sliders. The user also gets to pick clothes and hair as well as eyes, teeth and tongue, which are needed when creating face rigs and animating mouth movements, such as speech.

Blender

Blender⁹ is a software used for modeling and animating 3D graphics. When animating in Blender, the user can manually pose their model and use each pose as a keyframe. Several keyframes can be used to create an animation. Blender "fills in the gaps" between keyframes, making the animation look smooth and the animator's job easier. If the animation needs to be more precise, more keyframes can be added.

Blendartrack

Blendartrack¹⁰ is a third-party plugin for Blender that allows the user to record facial movements with their proprietary app for iOS or Android and export the recording into Blender. The movements can then be transferred to a Rigify¹¹ face rig in order to avoid animating manually.

7 https://www.mixamo.com/#/
8 http://www.makehumancommunity.org/
9 https://www.blender.org/
10 https://cgtinker.gumroad.com/l/tLEbZ
11 https://docs.blender.org/manual/en/2.81/addons/rigging/rigify.html

RELATED WORKS

A study [10] was conducted at Ithaca College, where the authors evaluated Unity versus Unreal Engine in the context of teaching and learning them in an educational environment. The authors found that students generally found Unity to be easier to learn, while Unreal was harder but provided more satisfying results. One student thought that Unreal Engine was as easy to learn as Unity, but they had learned Unity first and not both at the same time.

A study by S. Babu et al. [11] investigated the conversational skills and social characteristics of a virtual receptionist called Marve. Marve interacted using spoken language and non-verbal cues such as vision-based gaze tracking, facial expressions, and gestures. One of the metrics evaluated in the study was how often people passing by would stop to interact with him. The authors found that almost 29% of the times someone passed the doorway by his display, they stopped to interact with him.
The authors also evaluated how often users wanted to have a task-oriented versus a social conversation with Marve. The users were given four conversational topics to choose from: the weather, movies, a joke, or leaving messages. They found that 80% of users preferred social small talk, i.e., the first three topics. Another study [12] found that even when the intended function of an agent was not small talk, 18% of the questions asked were socializing questions, such as asking the agent about its favorite color.

Researchers at MIT developed what they referred to as an Autonomous Conversational Kiosk, called MACK [13], in 2002. MACK was displayed as a life-sized blue robot on a flat screen and was aware of his surroundings. Because of this awareness, MACK was able to give directions to the users by using gestures. The researchers found that the gestures were effective in conveying information, for example when he would point to a room behind the user and the user would turn around to look in that direction.

These studies indicate that virtual agents can be useful for people, and that they need functionality for more than just the purpose they were planned for, which is something to keep in mind during the implementation of this project.

METHOD

The following subsections make up an overview of the tools and methods used during implementation as well as a step-by-step description of the development process.

Research setting

All work was done on a PC with Windows 10 Education with Unity version 2020.3.32f1 and Unreal Engine version 4.27 installed, as well as MakeHuman version 1.2.0 and Blender version 3.1.2. A connected speaker with an integrated microphone was used for voice input and output.

Research method

The virtual agents were developed in parallel, i.e., each step was completed on both agents before moving to the next step, even if something was much more time consuming for one of the agents. Throughout development, notes were continuously taken on issues, such as missing plugins and tasks that required more time than expected, as well as parts that were significantly easier to implement in one engine than the other.

Facial animations requirement

In order for facial animations to work properly, the model required separate meshes for teeth, tongue, and eyes as well as a face rig. The character from Mixamo does not have these. Because of this, a character made in Makehuman was used instead. The character's .fbx (filmbox) file was uploaded to Mixamo and their automatic rigging tool was used to add a rig to the character's body. This was done so that the exported animations from Mixamo would fit the rig of the character. The rigged Makehuman character was then imported into Blender to add a face rig to it and combine it with the Mixamo rig. The Blender plugin Rigify was used to create the face rig.

Grammar

For the grammar, a .csv (comma-separated values) file was used to make it simple to add or edit which keywords the agent listens for and what the agent is supposed to say in response. This file contained one column for keywords and one column for their corresponding responses. Triggers for animations were embedded in the responses (see Figure 3), and the file could be opened and edited in Excel or any plain text editor.

Figure 3. Part of the grammar
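To make the format concrete, a couple of hypothetical grammar lines could look as follows. This is an illustrative sketch only: the actual keywords, responses and embedded trigger syntax used in the project are the ones shown in Figure 3, and the trigger tokens below are merely modeled on the "BODY: POINTLEFT" trigger name that appears later in the Unity animation code.

    hello,BODY:WAVE Hi! Welcome to Linköping University. How can I help you?
    where is,The room you are looking for is down the corridor to your left. BODY:POINTLEFT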
Unity

Importing the model

At the start, a character exported from Mixamo was added to the Unity project. The model was then dragged into the scene. The textures of the model were displayed as all white at that point (see Figure 4). To fix this, the embedded textures and materials had to be extracted in the materials tab of the character. The rig's animation type was also changed to humanoid, which creates an avatar for the character and enables animations to work correctly.

Figure 4. Unextracted textures

Speech-to-text

The agent needed to be able to hear the user's voice as well as recognize words and sentences. Unity provides a built-in class called DictationRecognizer¹² for this purpose that uses Windows' built-in speech recognition. The DictationRecognizer was utilized in a C# script, and the "DictationRecognizer.DictationHypothesis" event was used to call a function that checked if a specified keyword had been heard.

12 https://docs.unity3d.com/ScriptReference/Windows.Speech.DictationRecognizer.html
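A minimal sketch of how this can be wired up is shown below. It is not the project's actual script: the class name and the keyword handling are illustrative assumptions, and only the DictationRecognizer usage itself follows Unity's documented API.

    using System.Collections.Generic;
    using UnityEngine;
    using UnityEngine.Windows.Speech;

    // Minimal sketch of keyword spotting with Unity's DictationRecognizer.
    // The class name and keyword handling are illustrative, not the project's script.
    public class KeywordListener : MonoBehaviour
    {
        private DictationRecognizer recognizer;

        // keyword -> response phrase, filled in from the parsed grammar file
        private readonly Dictionary<string, string> grammar = new Dictionary<string, string>();

        void Start()
        {
            recognizer = new DictationRecognizer();
            recognizer.DictationHypothesis += OnHypothesis;
            recognizer.Start();
        }

        // Called by the recognizer while the user is speaking.
        private void OnHypothesis(string text)
        {
            foreach (var entry in grammar)
            {
                if (text.ToLower().Contains(entry.Key.ToLower()))
                {
                    Debug.Log("Heard keyword: " + entry.Key);
                    // entry.Value would be handed to the response handling described later
                    break;
                }
            }
        }

        void OnDestroy()
        {
            if (recognizer != null)
            {
                recognizer.DictationHypothesis -= OnHypothesis;
                recognizer.Dispose();
            }
        }
    }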
Text-to-speech

Since Unity does not have any built-in functionality for text-to-speech, a plugin was used. There were some plugins available in Unity's asset store, but none of them were free. A third-party plugin¹³ was used which utilized Windows' built-in speech synthesizer. The "WindowsVoice.Speak" function was used to enable the agent to talk and was called with a string as the argument, specifying what the agent should say.

13 https://chadweisshaar.com/blog/2015/07/02/microsoft-speech-for-unity/

Animations

The animations used were all downloaded from Mixamo and imported into the project. The animation type was changed to humanoid in the animation's rig tab to match the rig of the character. Doing this created an avatar for the animation with its own bones. These bones mapped to the character's avatar and made sure the correct bones moved during a certain animation. The next step was to create an Animator Controller (see Figure 5), which was used to create the different states that run the animations. To go from one animation to another, a transition is required. In order to trigger the transitions, Unity uses a feature called triggers. Triggers can be activated from a C# script:

    private Animator anim;                 // assigned elsewhere, e.g. via GetComponent<Animator>()
    anim.SetTrigger("BODY: POINTLEFT");

The SetTrigger function takes a string which represents the name of the trigger to activate. Triggers in Unity are booleans that, when set to the value "true", immediately reset back to "false", removing the need for the developer to reset the value. Each animation, except the idle animation, was associated with its own trigger and started playing when the trigger was set to "true". Since the triggers automatically reset, the animations only played for one animation cycle. Since the idle animation was set to the default state, the agent always went back to that animation after any of the other cycles finished.

Figure 5. Animator Controller in Unity

Facial animations and retargeting

To make face animations possible, a Makehuman model with the required meshes for eyes, teeth and tongue was needed. The new model was imported into Unity and its textures and materials were extracted in the same way as for the Mixamo character. This time, however, the textures for the hair, eyebrows and eyelashes showed up as solid blocks. This was fixed by changing the settings in their materials to the appropriate ones for hair textures. The bones of the Makehuman character's avatar were mapped to the corresponding bones in the agent's rig. This was done to make sure that the animations made for the Mixamo character worked as intended for the new character (see Figure 6).

Figure 6. Bone mapping in Unity

Grammar

The grammar file's contents were parsed in the C# script. Each line was split on the first occurrence of a comma. The two resulting parts were inserted into a map with the keywords as keys and the phrases as the values. When the DictationRecognizer called its hypothesis event, the recognized phrase was checked for any of the parsed keywords. If a keyword was found, the corresponding phrase was further parsed. Since the phrases contained trigger words for animations, they had to be processed before calling the Speak function. The phrase to be spoken was split at each trigger using a regular expression. All resulting elements from the split were put into an array. The array was looped through, and SetTrigger was called when a trigger was found, with the trigger word as an argument. The rest of the elements were passed to "WindowsVoice.Speak".
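The parsing and response flow described above can be sketched as follows. This is an illustrative outline rather than the project's actual script: the class and method names are invented, the trigger tokens are assumed to look like "BODY:POINTLEFT", and WindowsVoice.Speak is the call provided by the third-party TTS plugin mentioned earlier (so the script only compiles with that plugin present).

    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    using UnityEngine;

    // Illustrative outline of the grammar parsing and response handling described above.
    // Class and method names are invented; trigger tokens are assumed to look like "BODY:POINTLEFT".
    public class GrammarResponder : MonoBehaviour
    {
        private static readonly Regex TriggerPattern = new Regex(@"(BODY:\s?\w+)");
        private Animator anim;

        void Start()
        {
            anim = GetComponent<Animator>();
        }

        // Build the keyword -> phrase map; each grammar line is split on its first comma.
        public static Dictionary<string, string> ParseGrammar(IEnumerable<string> lines)
        {
            var map = new Dictionary<string, string>();
            foreach (string line in lines)
            {
                int comma = line.IndexOf(',');
                if (comma > 0)
                {
                    map[line.Substring(0, comma).Trim()] = line.Substring(comma + 1).Trim();
                }
            }
            return map;
        }

        // Split the phrase on trigger tokens, fire animation triggers, and speak the rest.
        public void Respond(string phrase)
        {
            foreach (string part in TriggerPattern.Split(phrase))
            {
                if (TriggerPattern.IsMatch(part))
                {
                    anim.SetTrigger(part);        // e.g. "BODY:POINTLEFT"
                }
                else if (part.Trim().Length > 0)
                {
                    WindowsVoice.Speak(part);     // third-party TTS plugin call
                }
            }
        }
    }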
Unreal Engine

Importing the model

The same character model from Mixamo was used and imported into the Unreal Engine project in the same way as it was done in Unity. The character's hair texture was not visible, so some changes to the hair material were required.

Speech-to-text and text-to-speech

Unreal Engine had no built-in support for STT or TTS. The plugin that was used utilized Microsoft Azure¹⁴, which was free up to a certain number of hours of speech per month, which was sufficient for the purpose of this project. An account at Microsoft Azure was required to obtain a verification key, which was needed to use the plugin.

14 https://www.unrealengine.com/marketplace/en-US/product/azspeech-async-text-to-voice-and-voice-to-text?sessionInvalidated=true

Initially, a Blueprint class was used for the logic, but it quickly became messy and difficult to follow or modify. Because of this, a C++-based class for the agent was implemented and used instead.

Animations

The animations from Mixamo were imported into the Unreal project. To get them to work properly, an animation blueprint was required. In this blueprint, a state machine, similar to the Animator Controller in Unity, was added. The state machine contained different animation states and transitions between them. The transitions were activated when a specified value was reached. In this case, booleans were used to trigger the animations. Nothing similar to Unity's SetTrigger was found for Unreal Engine. Instead, a boolean was set to true in the C++ script; then, in the animation blueprint's event graph, a Pawn was cast to our C++ class and Get functions were used to retrieve the value and trigger the animations (see Figure 7).

Figure 7. Blueprint for retrieving and setting boolean values

To set the triggering boolean values back to false, animation notifications¹⁵ were utilized. Animation notifications, also referred to as notifies, are a way to trigger events during an animation cycle. One was added to the end of the cycle for each of the animations and named according to which animation it came from. An example would be the notify called "DoneWaving", which was triggered at the end of the waving animation cycle. In the animation blueprint class, that event triggered a change to the corresponding boolean values (see Figure 8).

Figure 8. Blueprint for resetting values after notify event

15 https://docs.unrealengine.com/4.27/en-US/AnimatingObjects/SkeletalMeshAnimation/Sequences/Notifies/

Facial animations and retargeting

To change to the new model from Makehuman, adapted for facial expressions, Unreal Engine's retargeting manager was used. In the model's Skeleton Editor, there was a button called Retarget Manager. As long as the rig type was the same for the new and the old character and all the bones were assigned correctly, it was capable of duplicating and transferring animations from one model to another. This made it possible to smoothly transition from the Mixamo character to the Makehuman character.

Grammar

To parse the grammar file in Unreal Engine, Unreal's FFileHelper class was used in the C++ script. This allowed us to read the file into an array and then loop through it to read the file line by line. The contents were put into a map in a similar way to how it was done in Unity.

RESULTS

To implement text-to-speech and speech-to-text functionality in this project, a mixture of functions built into the engines, plugins from the asset stores and third-party plugins found by browsing internet forums was used. STT in Unity was significantly easier to implement than the others since it did not require a plugin at all. The TTS plugin for Unity as well as the TTS and STT plugins for Unreal were equally simple to get up and running if the time spent on searching for them online is ignored.

When first starting to work on this project, the impression was that everything could be implemented in Unity using C# and everything in Unreal using C++, and that Unreal's blueprints were optional. This was true for the Unity project. For Unreal though, blueprints were necessary to achieve the goals without having to write very complex code. Most of the official documentation and instructions in online forums used blueprints, and at least a combination of blueprints and C++ was required. When using blueprints in Unreal, the editor quickly became cluttered, hard to follow and difficult to edit with all the nodes and pins in between (see Figure 9). Some of it was manually translated into C++ code. For the animation logic though, we had to use blueprints.

Figure 9. STT blueprint

A big difference between the engines, with respect to the amount of code and time spent, was the implementation of triggers in Unity versus animation notifications in combination with boolean values in Unreal. They were both used for starting and stopping animations in this project. The trigger solution in Unity required only one line of code to access the animator controller and one line of code to set each trigger value, and was very quick to implement, including research. The way to implement the same functionality in Unreal was through animation notifications and boolean values. That required significantly more time to research and implement, and involved the C++ script, the animation blueprint, and the animations themselves.
In order to implement facial animations, third-party software was needed. Creating a rig for the character's face and merging it with the rig for the body, including bug fixing and research, took approximately five days. Unfortunately, none of the motion capture plugins for Blender that were planned to be used worked as expected, for various reasons. By the time the face rig was complete and the plugins had been tried, there was not enough time left to create the animations manually.

RQ1: Are relevant plugins and libraries harder to get up and running in either of the engines?

It was found that Unity had built-in support for STT, which we used for a basic version of dialogue capability. A third-party plugin had to be used for TTS. Unity's asset store contained a plethora of different plugins for this purpose, but they were all paid, and since this project did not have a budget, a third-party plugin found on the web was used. The plugin was easy to get up and running. All that was required was to download it, import it into the Unity project and add it to the agent.

Unreal, on the other hand, had built-in support for neither TTS nor STT. Unreal's asset store also contained a large number of plugins for this purpose that all came at a cost, except for one plugin with both TTS and STT support. It did however utilize Microsoft Azure, which could become costly if the free hours they provided were used up. In general, plugins were equally easy to get up and running in both engines, and the user had the option to use third-party plugins if the asset stores did not have the plugins needed.

RQ2: Is Unity or Unreal the better choice for developing virtual agents?

Both Unity and Unreal Engine are viable tools for this application. It is possible to implement a virtual agent in both engines, with some features being easier to do in Unity than Unreal, like triggering animations. Development in Unreal Engine required more time spent on learning and research.

For the purposes of this project, Unity is more than good enough when it comes to available functionality and the visual requirements. It also requires less time to learn how it works. For a beginner, Unity would be the better option with respect to its relative simplicity compared to Unreal Engine. For the experienced user, the choice might depend on factors not investigated in this paper.

DISCUSSION

Had the project had funding, the search for plugins would have taken much less time and the number of options available would have been significantly higher. With more time and options, one could try out several plugins, which would aid in answering RQ1 more generally as well as possibly allow for a smoother development process, as some plugins might be easier to use and might have more documentation and more online support available (a bigger user base).

When implementing new functionality in Unreal, scripting only in C++ would have been preferred, but for some functions, blueprints had to be used since writing code would have been far more difficult. It worked well but became repetitive at times, with a lot of duplicated code. The blueprint could possibly have been made more compact and easier to understand with more time, though it worked for our purposes.

One difference between adding new plugins to the engines was that Unreal Engine required a restart to enable them. This was not a big issue when only one plugin was used, but if multiple plugins get installed, over time it could result in a lot of time getting wasted on waiting for Unreal to restart.

Given the simple graphics and functionality of the developed agents, deciding which game engine is best suited for them is not an easy task. The difficulty of deciding which game engine is the better one was described in an article by M. Lewis et al. [14] when talking about Unreal Engine and Quake: "A researcher's choice between the Quake and Unreal engines is similar to that between Coke and Pepsi. Both are more than adequate for their intended purpose yet each has its avid fans and detractors."
Unity was found to be easier to get started with than Unreal Engine. This is partly because Unity uses regular C# while Unreal Engine uses its own version of C++, called Unreal C++. We had experience with both C# and C++ prior to this project, which resulted in a false sense of confidence when starting to script in the two engines. It quickly became apparent that many features of the standard library in C++ were not available in Unreal C++. Instead, one was forced to use Unreal's own functions. For example, to read data from a file one had to use the FFileHelper class instead of std::ifstream, which is the usual way to read files in regular C++. To print something to the console, GLog was used instead of std::cout. This made scripting in Unreal Engine cumbersome and caused it to consume more time than implementing the same thing in Unity.

Another factor that affected the time it took to script in Unreal was the lack of error messages when using regular C++ functions. When using std::cout to print to the console, for example, no indication of error was given by the editor or the console. Instead, the expected output simply did not show up, which led to having to resort to online forums for explanations and solutions.

Our findings regarding the difficulty of learning to use the engines are consistent with the results of the study by P. E. Dickson et al. [10].

The two engines had great support online in the form of forum posts and tutorials one could use as a reference and guide to solve issues and implement new features. However, the documentation of both engines could have been more helpful: it did a good job of explaining what functionality was available but left out how the functions worked and were supposed to be used.

At the beginning of this project, we assumed all the work would be contained within the game engines, but when it was time to implement facial animations, we had to use an external tool, in our case Blender. Learning to use Blender took more time than we had planned for. When importing the new model into the game engines, there were conflicts between some of the bones in the new rig, which resulted in more time being spent on debugging. A positive aspect of using an external tool like Blender was that the same models and animations could be used in both engines. We did not manage to get the facial animations to work properly due to time restrictions, but given enough time and knowledge it is possible.

LIMITATIONS AND FUTURE WORK

In this project, two virtual agents were created using Unity and Unreal Engine. They are proof-of-concepts with basic functionality whose purpose is to evaluate the viability of developing such agents in this kind of environment. This section describes features of the agents that were not implemented. Some of them were never planned to be implemented at this stage, but rather in the future of the larger project that this paper is part of, and some of them were not implemented because of time limitations or a lack of funding. The future works described below are either essential for a fully functional virtual receptionist or optional features that would improve it. All features described would help to gain a better understanding of the difference in available plugins for each engine that can aid development, as well as creating a wider basis for evaluating whether one engine is better suited than the other.
Facial expressions

The agents created in this project do not have any facial expressions. In order to implement these, one would have to spend sufficient time learning about animating in Blender or have access to software that can create facial expressions, like the body animations available from Mixamo. In this project, some time was spent examining the third-party plugin Blendartrack, which did not work as expected. With more time available, one could have used the plugin to create a wide range of expressions to combine with body movements. This is an essential feature for a virtual receptionist as part of conveying emotion.

Dialogue

The agents created in this project can answer simple questions specified in a grammar file. They only listen for keywords though, which means that the user can say anything to the agent and get the corresponding response as long as the recognized phrase contains one of the keywords specified in the grammar. In future work, a dialogue engine should be implemented, where the agent can remember earlier questions or phrases in order to ask for clarifications and answer more complex questions, as well as participate in small talk. This is a requirement for the final product.

Eye gaze

Eye gaze can help indicate the flow of the conversation, indicate interest and aid turn-taking in the conversation [15]. As of now, the agents stare straight ahead, resulting in no eye contact with the person interacting with them. In the future, a camera or a microphone that senses the direction of the voice of the person speaking could be utilized to get the agent to look at the person they are talking to as well as to look around. This is not an essential feature but would improve the agent.

Blending animations

Currently, there is no blending between animation cycles for either agent, which results in some unexpected movement when transitioning between animations. To solve this issue, one could research how to blend the animations for smoother transitions to achieve human-like movements. This is not essential but would greatly improve the sense of human likeness.

Modeling

The appearance of virtual agents influences the user's motivation and attitude [16]. At this moment, the agents are created using MakeHuman and the visual features are arbitrary. In future work, it would be interesting to take user preferences into account and do some manual modeling to create the most approachable agent to encourage interaction. This is not essential but might improve the user's feelings toward the agent.

Synchronizing speech and mouth movements

Creating and synchronizing mouth movements with speech is something that was planned to be implemented in this project but was not, due to time limitations. It would make the interactions look and feel more realistic. There were paid plugins in the asset stores to facilitate the implementation of this. This is essential for the finished agent, and we hope to see it implemented in future work.

Synchronizing speech and body movements

As of now, one of the developed agents pauses in order to play a different animation in the middle of a sentence, while the other runs the animations and speech asynchronously. The latter looks and sounds natural with short sentences and few animations, but it would be preferable to have more control over it, and it would be interesting to see this researched and implemented in future work. For the finished receptionist, this is essential.
Motion capture

As previously mentioned, the third-party plugin Blendartrack was examined. The same developer has released a plugin called BlendarMocap, which can capture the motions of an entire body. The captured movements can then be transferred to a Rigify rig in Blender. Unfortunately, BlendarMocap only worked with a Rigify face rig that had not been upgraded in Blender, and the face rig used in this project had been upgraded. In future work it would be interesting to see what could be accomplished with these or similar plugins. This is not an essential feature but could potentially save a lot of time and provide custom, natural body movement.

Geographic localization

As mentioned in the introduction, the end goal is for the agent to be aware of its geographic position on the university campus and give directions based on that location. The agent would also need to have access to information about courses, rooms and employees at the university. This is an essential feature of the finished agent.

Multi-language support

The agent developed in Unity is, as of now, only able to speak English, since the third-party plugin uses Windows' built-in voice, which only supports a limited selection of languages. AzSpeech, which was the plugin used in Unreal Engine, supports many languages, including Swedish. It also offers multiple choices for the voice. Unfortunately, parsing the Swedish letters "Å", "Ä" and "Ö" failed when reading the grammar file, requiring translating the grammar to English and changing the spoken language of the agent. In future work, multi-language support is something that could be investigated, though it is not essential.

Expressing emotions through voice

The current agents have voices that do not express emotions. There are some paid plugins in the asset stores, for example Replica Studios¹⁶ for Unreal Engine, that use AI to convey emotion. Implementing something like this in future work is not essential but would create a more natural result.

16 https://replicastudios.com/unreal

Recognizing unusual names and words

One of the downsides of the speech recognition in both agents is that they do not recognize unusual names or words, such as the last name "Silvervarg" (the closest match was Silfvervarg) or names of rooms like "E324", which seems to be heard as several words. Recognizing uncommon words and names would be essential for a virtual receptionist in order for it to properly answer questions in this context.

THREATS TO VALIDITY

Small sample size

In this project, only two agents were created, one for each engine. In order to make a fair assessment of the two engines, it would have been ideal to develop multiple agents with different attributes to get a wider understanding of the differences. By only implementing one agent in each engine, the research questions are only answered for the very specific use case of creating a simple, proof-of-concept agent, rather than virtual humanoid agents in general.

Time limitations

Since neither of us had any experience with either engine before this project, all the features and functionality had to be learned during development. With more time to learn about the engines, more focus could have been put on the differences in functionality rather than on which one is the easiest to learn or the best choice for a beginner. One example of something we would have liked to have more time to investigate was the animating part of the engines.
Subjectivity

Since no surveys were used to help reach conclusions regarding which engine was better suited or more preferred, this study relies on our own opinions of the engines. Sometimes, one aspect of an engine was significantly easier to learn and use, which might have affected our opinion of that part of the engines, even if it was not "better" in reality, just less frustrating.

Lack of funding

The lack of funding for this project impacted the possibilities of comparing the available plugins in the asset stores. It also ate up a lot of time having to search online for third-party alternatives, time that could have been spent perfecting the agents and adding more functionality, allowing for a better overview of the engines for the comparison.

CONCLUSIONS

In this paper, we evaluated two game engines, Unity and Unreal Engine, by creating one virtual agent using each of them. The purpose was to assess whether either was better suited than the other for this task and what the differences were in ease of plugin usage. We found that getting plugins up and running was equally simple in both engines. Which engine was the better choice for developing virtual agents depended on whether the developer was a beginner or not, and was in part a matter of taste. We found Unity to be easier to get started with than Unreal Engine for someone who is inexperienced with game engines.

During the development in this project, a lot of plugins were found that could have been useful but that were paid plugins. In order to make a fair comparison of the engines, funding is essential in order to avoid the limitations caused by its absence. The usage of, and the time it takes to learn, required¹⁷ third-party software such as Blender has to be considered, as well as its ability to be integrated into the engines, in order to better plan out the project and to make the comparisons fair. Both Unity and Unreal Engine fulfilled the requirements needed to develop a virtual human.

17 "required" meaning industry standard

REFERENCES

[1] V. Chattaraman, W.-S. Kwon and J. E. Gilbert, "Virtual agents in retail web sites: Benefits of simulated social interaction for older users," Computers in Human Behavior, vol. 28, no. 6, pp. 2055-2066, 2012.

[2] A. Mutchler, "voicebot.ai," 14 July 2017. [Online]. Available: https://voicebot.ai/2017/07/14/timeline-voice-assistants-short-history-voice-revolution/. [Accessed 27 April 2022].

[3] "ibm.com," [Online]. Available: https://www.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html. [Accessed 27 April 2022].

[4] M. B. Hoy, "Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants," Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81-88, 2018.

[5] A. Hartholt, D. Traum, S. Marsella, A. Shapiro, G. Stratou, A. Leuski, L.-P. Morency and J. Gratch, "All Together Now," in International Workshop on Intelligent Virtual Agents, Berlin, 2013.

[6] J. Lewis, D. Brown, W. Cranton and R. Mason, "Simulating visual impairments using the Unreal Engine 3 game engine," in 2011 IEEE 1st International Conference on Serious Games and Applications for Health (SeGAH), 2011.

[7] J. Haas, "A History of the Unity Game Engine," 2014.

[8] W. Lee, S. Cho, P. Chu, H. Vu, S. Helal, W. Song, Y.-S. Jeong and K. Cho, "Automatic agent generation for IoT-based smart house simulator," Neurocomputing, vol. 209, pp. 14-24, 2016.
Vilhjálmsson, "MACK: Media lab Autonomous Conversational Kiosk," in Proceedings of Imagina, Monte Carlo, 2002. [3] "ibm.com," [Online]. Available: https://www.ibm.com/ibm/history/exhibits/specialpro d1/specialprod1_7.html. [Accessed 27 April 2022]. [14] M. Lewis and J. Jacobson, "Game engines," Communications of the ACM, vol. 45, no. 1, pp. 27-31, 2002. [4] M. B. Hoy, "Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants," Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81-88, 2018. [5] A. Hartholt, D. Traum, S. Marsella, A. Shapiro, G. Stratou, A. Leuski, L.-P. Morency and J. Gratch, "All Together Now," in International Workshop on Intelligent Virtual Agents, Berlin, 2013. [6] J. Lewis, D. Brown, W. Cranton and R. Mason, "Simulating visual impairments using the Unreal Engine 3 game engine," in 2011 IEEE 1st International Conference on Serious Games and Applications for Health (SeGAH), 2011. 17 “required” meaning industry standard [15] K. Ruhland, C. E. Peters, S. Andrist, J. B. Badler, N. I. Badler, M. Gleicher, B. Mutlu and R. McDonnell, "A Review of Eye Gaze in Virtual Agents, Social Robotics and HCI: Behaviour Generation, User Interaction and Perception," Computer Graphics Forum, vol. 34, no. 6, pp. 299-326, 2015. [16] A. L. Baylor, "Promoting motivation with virtual agents and avatars: role of visual presence and appearance," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3559-3565, 2009.