Hy-Line Computer Components
Hello, machine tool! Voice control in industry
Voice control has arrived in our everyday lives, whether in the car, for operating smartphones or home systems. The technology is also becoming increasingly important in the professional environment. Its use is less complicated than expected.
First seen as a nice gimmick, voice control has matured into an integral part of the smart home: controlling music and lights, setting timers and reminders, and filling shopping lists is easy and convenient by voice. While voice control initially offered about the same level of convenience as a wireless TV remote control, an infrastructure has now emerged in which it offers real added value. Amazon, a pioneer with "Alexa", supports the development of voice recognition: in its "MASSIVE" project, Amazon provides data sets in 51 languages that developers can use to test their algorithms and systems.
Importance of language technology
In addition to the traditional display and touchscreen interface, speech has a firm place as a control element, with the spoken word as input command and synthesized speech as output. The consulting company Gartner produces studies on the future of technologies. Its "Gartner Hype Cycle" describes the life phases of a technology in several stages, from the initial euphoria through disillusionment during implementation to productive use.
Speech recognition has already reached the productivity phase, and speech synthesis is well on its way. However, there is still development work to be done on understanding and interpreting natural language. In addition to purely algorithmic speech recognition, speech recognition supported by artificial intelligence (AI) is also very important. But what do we need for use in a professional environment? What distinguishes these applications from conventional voice assistants?
For an ergonomically designed HMI, speech recognition is expected to recognize the spoken word independently of the speaker, to understand several languages if possible, to listen attentively but also to ignore speech that is not addressed to it (voice control is sometimes triggered when the keyword is falsely recognized), and to be tolerant of grammatical variation. Filler words such as "please", "once" and "yes, exactly" as well as throat clearing should be ignored and must not lead to incorrect operation.
Using AI on the device's hardware platform can be difficult: extensive circuitry with high power consumption and a correspondingly high price is not economically feasible. Instead, AI is used in the training phase of the voice system. The result is transferred to the hardware platform, which acts only as an execution engine and therefore requires few hardware and software resources.
Why voice control?
The pandemic has reinforced the tendency to avoid touching every control element; and if hands are occupied, dirty or wet, a task can still be completed by voice control. If you don't want to read the result on a display, synthetic speech output can help. Current technology can no longer be compared with the "voice output" of home computers from the 1980s. Prosody (speech melody) and phrasing sound very natural, and punctuation marks structure the spoken text.
Why is voice control so interesting and important? It is easy to understand and intuitive to use. After the "wake word", which wakes up the system and prompts it to listen, commands can be given or information retrieved in natural language. Ideally, the system can be used as a "Do What I Mean" machine. One argument in favor of voice operation is that speech communicates faster than other input media such as a keyboard. The path from the thought to the speech center is shorter than the detour of controlling the finger muscles to operate a keyboard.
With its HMI 5.0 strategy, Hy-Line aims to use as many senses as possible for interaction between man and machine - where it makes sense. The partnership with Voice Inter Connect, Dresden, for example, is all about incorporating the spoken word into communication, whether as an input medium for controlling the machine or as an output for its status. The GUI, which displays the commands entered and their effects for the user, also plays an important role here.
Voice input with "Natural Language Understanding"
The demands placed on a technology in professional use are much higher than those in the smart home environment. Availability and reliability close to 100% are essential here. While it is merely an inconvenience in the smart home if the light does not switch on on command, it is unthinkable in professional use that the operating light fails to refocus on command. An analysis shows that systems connected to a cloud have latencies that are too high. Offline systems have a clear advantage: not only does the system work deterministically and in real time, the data also remains local and therefore private. Because no connection to a powerful cloud is needed to evaluate and process requests, the device also works where there is no internet coverage, where data can only be transmitted at moderate bandwidth, or when the cloud provider discontinues its service.
The concept presented here is hybrid: the computationally intensive training, during which the language models are created, takes place on a powerful server in the cloud. Only the result is transferred to local memory, where it is used to recognize input during operation. The local computer therefore needs only moderate throughput, which has a positive effect on heat generation and power consumption, and the finished voice control application runs purely on the local system without a connection to a server via the Internet.
Voice output with "text to speech"
Speech synthesis, which can handle even extensive texts, turns voice control into a comprehensive assistance system that combines voice input with voice output. This allows the operator or service technician to select relevant text passages from a stored operating manual using suitable search terms and have them read aloud. During troubleshooting, the eyes remain focused on the machine.
AI is used to create the synthesis models for text-to-speech (TTS): the models use machine learning algorithms that help to convert continuous text into dynamic, natural-sounding speech output. As with speech recognition training, the process here is also two-stage: training in the cloud, interpretation and playback only locally - so data remains confidential and secure.
Kickstart for professional voice control
The main medium is still manual input, whether by keyboard, mouse, gesture control or control buttons. Voice can replace manual input wherever hands are not available because they are being used for other purposes or are dirty. This includes, for example, the HMI on a machine in the production line, where both hands are needed for the workpiece, or the information system at the point of sale, which tells visitors where to find stores in the shopping mall or products on the shelves. In the catering industry, the temperature of professional kitchen appliances can be set to the exact degree while the hands remain clean for the food. In logistics, the storage system gives instructions on where an item should be picked or placed. In medical technology, it is important to keep hands sterile so that viruses and bacteria are not passed on. New fields such as smart caravanning are also suitable for voice control: where individual solutions are currently used for switching lights or querying the fill level of fresh or service water tanks, a uniform interface with voice control can ensure simpler wiring and more ergonomic operation.
A ready-made hardware and software solution paves the way from the idea to the finished implementation of voice control. Figure 1 shows the starter kit, which does more than make the first steps easier: to develop a device that meets professional requirements and can be used around the clock, a WebSDK is available that abstracts the required algorithms and models.
Different languages are already stored in modules. The developer creates the speech user interface (SUI) for the individual application with specific dialogs and commands. Below this is the machine interface, which passes on commands from the SUI to the hardware and the GUI. To make this process as simple as possible, Hy-Line has developed a starter kit that simplifies more than just the first steps on the way to a commercial solution.
The software
A WebSDK is available as part of the starter kit, which can be used to explore the examples and create your own applications. You can create your own dialog models without programming by entering operating phrases with keywords and compiling them on the server. The result is downloaded to the starter kit and works without an Internet connection. The language system grows iteratively as synonyms are formulated as alternative inputs and additional command phrases are added. The architecture receives the text and automatically recognizes keywords, which it assigns as subject or predicate. Filler words such as "please" and "uh" are skipped. The SDK provides APIs through which commands can be passed to the device via MQTT, converting the recognized voice command into a hardware action. This reaction can be a voice output, the switching of a port, an output on the display or the change of a value in a JSON file. The kit is versatile enough to control external devices, so it can be used to create functional prototypes and test acceptance in the target group.
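The keyword handling described above can be illustrated with a minimal Python sketch. It is not the WebSDK's actual implementation: the vocabularies, the function names `parse_command` and `to_message`, the topic layout `hmi/<device>/<subject>/<predicate>` and the payload format are all assumptions made for illustration. The sketch skips filler words, maps synonyms to a subject and a predicate, and builds a topic and JSON payload of the kind that could be published via MQTT.

```python
import json

# Hypothetical vocabularies; in the kit, the dialog model is compiled on the server.
FILLERS = {"please", "uh", "um", "once", "the", "to"}
PREDICATES = {"switch": "switch", "turn": "switch", "set": "set"}   # synonym -> predicate
SUBJECTS = {"light": "light", "lamp": "light", "temperature": "temperature"}  # synonym -> subject
VALUES = {"on": True, "off": False}


def parse_command(text):
    """Return a dict with predicate, subject and value, or None if no command is found."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in FILLERS]  # skip filler words
    predicate = subject = value = None
    for w in words:
        if w in PREDICATES and predicate is None:
            predicate = PREDICATES[w]
        elif w in SUBJECTS and subject is None:
            subject = SUBJECTS[w]
        elif w in VALUES and value is None:
            value = VALUES[w]
        elif w.isdigit() and value is None:
            value = int(w)
    if predicate is None or subject is None:
        return None  # not a valid command: do nothing rather than guess
    return {"predicate": predicate, "subject": subject, "value": value}


def to_message(command, device="demo"):
    """Build an (assumed) MQTT topic and JSON payload for a parsed command."""
    topic = f"hmi/{device}/{command['subject']}/{command['predicate']}"
    payload = json.dumps({"value": command["value"]})
    return topic, payload
```

With these assumed vocabularies, "Please switch the lamp on" and "turn light on" resolve to the same command, which is how synonyms let the dialog grow iteratively. An utterance without a recognized subject and predicate yields `None` and triggers no action.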
The hardware
The voice control kit is powered by a single-board computer in Pico-ITX format based on the i.MX 8M CPU. The operating interface is a 10.1-inch display with HD resolution and a capacitive touchscreen. All components are suitable for industrial use, so a commercial implementation can be built directly on the starter kit. The application created in this way can also be ported to another target platform. In the simplest case, the acoustic output can be provided by a buzzer. However, it is better to use a loudspeaker that can emit broadband acknowledgement tones and voice messages. While earlier systems combined previously recorded audio snippets to output messages, such as announcing the time and date, TTS now offers the freedom to output any text in any language from a text file. The vocabulary is therefore practically unlimited and, just like voice input, works locally on the system without an internet connection at runtime.
Implementation process
Using a web-based development environment, the following steps are required to define a system for your own application. The voice dialog, i.e. the activation word that draws the system's attention to the input, together with the permitted commands and their parameters, is compiled in the web tool as text input (Fig. 2). The first processing step takes place during input: graphemes, i.e. the characters entered, are converted into phonemes, i.e. the smallest acoustic components of speech.
Once all the words have been defined, the AI-based algorithms translate the defined language resources into a statistical and semantic model and offer it for download. The model is downloaded to the target platform and started. The network cable can then be unplugged: the end product runs autonomously. The process in the finished application is shown in Figure 3.
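The grapheme-to-phoneme step can be sketched in Python as a lexicon lookup with a crude letter-by-letter fallback. This is a toy illustration only, not the method used by the web tool: the mini-lexicon, the ARPAbet-style phoneme symbols, the fallback rules and the function names are assumptions.

```python
# Hypothetical mini-lexicon with ARPAbet-style symbols (stress marks omitted).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "machine": ["M", "AH", "SH", "IY", "N"],
    "tool": ["T", "UW", "L"],
    "light": ["L", "AY", "T"],
    "on": ["AA", "N"],
}

# Very rough one-letter fallback rules; real systems use trained models instead.
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def to_phonemes(word):
    """Convert one written word (graphemes) into a list of phoneme symbols."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])  # known word: use the stored pronunciation
    phonemes = []
    for letter in word:
        # Unknown word: map each letter separately; unknown characters are dropped.
        phonemes.extend(LETTER_RULES.get(letter, "").split())
    return phonemes


def phrase_to_phonemes(phrase):
    """Convert a command phrase into one phoneme list per word."""
    return [to_phonemes(word) for word in phrase.split()]
```

The fallback shows why a lexicon or a trained model is needed: spelling the word letter by letter would turn "light" into five phonemes, while the lexicon entry has the three that are actually spoken.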
The audio technology
The brain's ability to use two ears and the geometry of the head to isolate some sounds and completely block out others is astonishing. It enables us to focus on the conversation with the other person at a table in a busy restaurant while blocking out the neighbors who are also talking and the clattering of dishes. This is not so easy for a speech system: only with the help of a directional microphone or electronic filters, which increase the signal-to-noise ratio, can the system achieve an equally high recognition quality. The directional microphone does not have to have the long design familiar from TV interviews. An array of several individual microphones makes it possible to identify the speaker of the "wake word" even in a noisy environment and to follow them if necessary. This greatly increases the recognition accuracy, reaction speed and acceptance of the system. The same technology can be used on the audio output side to emit sound in a specific direction.
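One common way to make a microphone array directional is delay-and-sum beamforming. The following Python sketch shows the principle. It is an assumption that this technique is representative; the article does not state which method the kit uses. The function names and the restriction to whole-sample delays are also assumptions.

```python
import math


def delay_and_sum(channels, delays):
    """Align each microphone channel by its delay (in whole samples) and average.

    channels: list of equally sampled signals, one list of samples per microphone
    delays:   arrival delay of the wanted speaker at each microphone, in samples
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    output = []
    for n in range(length):
        # After alignment the speaker's signal adds up coherently,
        # while sound from other directions and noise partly cancel.
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output


def ideal_array_gain_db(num_mics):
    """SNR gain of an ideal array if the noise is uncorrelated between microphones."""
    return 10 * math.log10(num_mics)
```

Steering the array towards a different speaker only requires a different set of delays, which is how an array can follow a person without moving parts. Under the idealized assumption of uncorrelated noise, four microphones improve the signal-to-noise ratio by about 6 dB.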
Opening up new dimensions
With the addition of speech, every user interface gains a new dimension. Implementation is easier than expected, as the starter kit not only allows you to start a demo, but also to take your first steps with your own commands and outputs. A powerful SDK is available for implementing protocols to control external devices. State-of-the-art technology means that the system works independently of the speaker; 30 languages are predefined. This solution can also be used on platforms with limited CPU and memory resources; a digital signal processor may also be sufficient here.
The author: Rudolf Sosnowsky is Head of Technology at Hy-Line Computer Components Vertriebs GmbH in Unterhaching.