At CES in Las Vegas in 1995, Bill Gates laid out Microsoft's vision for voice interfaces in his keynote. He passionately described a future in which people would talk to machines as naturally and fluidly as they talk to each other. As we all know, that vision did not become reality at the time.
Time flies. In 2011, Scott Forstall, on behalf of Apple, demonstrated their voice solution, Siri, and presented it as nearly flawless. Yet in practice, Siri has mostly become something for people to make fun of.
This raises a question we can't avoid: why did Gates claim in 1995 that this would be how humans interact in the future, and why did Apple promote Siri so grandly in 2011, only for both to end up as toys that people set aside after a few days of playing? One core problem is that the product design concept was fundamentally wrong.

Take Siri. It was designed as an omnipotent voice assistant, like a full-time secretary on call 24 hours a day, able to handle requests such as "wake me up tomorrow morning" or "remind me to do something tomorrow afternoon", as if you really had a secretary. That design was doomed from the start. The reason lies not in speech recognition technology at all, but in the fact that artificial intelligence simply cannot reach the level people expect of it. Imagine how delightful it would be if someone spent two or three thousand yuan, or three to five thousand, on a phone, turned it on, and found something like a living person inside that could chat with him and do whatever he asked.

When Siri first launched, I tried it constantly. For example, I picked up my phone and said "Let so-and-so die", and it replied "Don't be like this", which seemed wonderfully willful and intelligent. I was surprised; I thought, this is artificial intelligence. But repeat it five or six times, seven or eight times, and you find its pretend-human replies start to repeat. If you are as bored as I am and say it ten or twenty times, even a hundred times, you find it just keeps looping through the same responses. There is no team of people sitting behind the screen solving problems for you; the software is merely pretending to be smart. And when software pretends to be smart for you, normal human psychology makes you itch to expose it.
So you probe it for flaws in every way you can, and after a few attempts you stop using it, because it really is that dumb.
Consider an experiment with three groups of participants. In the first group, content is read to the participants while a real human face is shown on the computer screen, and the voice is computer-synthesized speech. In the second group, a robot is placed in the room and the same synthesized speech is played. In the third group, the screen shows an animated photo of a rabbit whose lips move, again paired with the same synthesized voice. Which group rates the speech synthesis lowest?

It is the first group, because the human face raises your expectations and the voice fails to meet them: if you see a real person speaking clumsily in a voice as cold as a machine's, you judge it harshly. But if the picture is a robot, your expectations drop. It's a machine, after all; it can speak at all, so what more can you ask of it? The rabbit is similar. A silly rabbit that can already speak human language? Brother, what more do you want? And it only costs $199.
The lesson from results like these is that you must manage users' psychological expectations. Frankly, many speech recognition systems today reach accuracy of 85% or even above 90%. What more do you want? The failure lies not in the recognition technology but in the fact that artificial intelligence simply cannot meet people's expectations.
There is another, subtler psychological problem with voice software: how others see you when you use it. Have you noticed it? Would you be embarrassed, on a crowded bus, to pull out your phone and say "Wake me up tomorrow morning", or to dictate "Send a text message to Lao Wang with the following content"? When others look at you as if you were a fool, you realize that voice recognition companies must first solve a psychological problem, not a speech recognition problem. Once these problems are figured out, we know how to design a reasonable voice scheme, one whose ultimate goal is that users reach for it every day rather than playing with it a few times and putting it aside. Anything else is just another kind of failure.
First of all, the essence of a telephone is making calls, so its main function is of course making calls. How do we make one today? Usually we turn on the phone's screen, unlock it, open the contacts or phone app, and then dial from recent calls or the address book.

So the fundamental question is this: for speech recognition to be genuinely easy to use, something everyone is willing to use every day, two problems must be solved. The first is psychological: using it must not make people feel embarrassed or lose face. This is very important. The second is efficiency: it must be more convenient than touching the screen with your fingers. If the two methods are about equally efficient, human habit will always favor touch over talking, because many people struggle to overcome the psychological barrier of speaking to a machine even when alone in a room. A product design must take this into account.

So which approach is more efficient? Dialing directly, of course. For example: you pick up the phone and raise it to your ear, and the moment it arrives there, the voice function triggers. This is not our original design; many people don't know that Apple's iOS already has this capability. It relies on two signals. One is the gyroscope, which can track the phone's trajectory through the air; when you bring the phone to your ear, that trajectory is detected. The other is the distance sensor being covered at the same moment, which is also what makes the screen go black. When both conditions are met, the system triggers voice operation.
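The two-signal trigger just described can be sketched in a few lines. This is a hypothetical illustration, not Apple's actual implementation: the `SensorFrame` fields, the 45-degree threshold, and the motion classifier are all invented for the sketch.

```python
# Sketch of the raise-to-ear trigger: fire only when BOTH signals agree,
# so a phone in a pocket (proximity covered, but no lift gesture)
# does not start listening.
from dataclasses import dataclass

@dataclass
class SensorFrame:
    pitch_deg: float         # phone tilt, derived from the gyroscope
    lift_detected: bool      # motion classifier: "raised toward head"
    proximity_covered: bool  # distance sensor blocked (phone against ear)

def should_trigger_voice(frame: SensorFrame) -> bool:
    """Trigger listening only when the lift trajectory and the covered
    proximity sensor coincide, mirroring the two conditions above."""
    raised_to_ear = frame.lift_detected and frame.pitch_deg > 45
    return raised_to_ear and frame.proximity_covered

print(should_trigger_voice(SensorFrame(10, False, True)))  # False (in a pocket)
print(should_trigger_voice(SensorFrame(70, True, True)))   # True (lifted to ear)
```

Requiring both conditions is the point: either signal alone produces false triggers, together they approximate "the phone is at the ear".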
Many phone manufacturers have done the same, using gyroscopes plus light sensors in place of distance sensors to decide when to activate this function. What happens after activation is what matters. On Apple's phones, once Siri is activated you say to it, "Please make a call to so-and-so." Can you imagine a more stupid design? You have already raised the phone to your ear; wouldn't it be better to just say the name directly?
With a voice scheme like Siri, when you say something or ask a question, it says "Wait a moment", somersaults off to the cloud like Sun Wukong on his cloud, fetches some information there, flips back, and tells you a result. Have you noticed that every voice scheme needs a cloud connection? Part of the reason is that the computation involved exceeds what your phone can do locally. But there is another reason: these companies' purposes are not simple. They want your big data. This is not only a matter of personal privacy; they need that data. Google is a big-data company; it needs your data. So when these companies build this software, at least half the purpose is collecting big data, not making a better phone for its own sake. The purpose is not pure.

If you want the thing to be genuinely useful, more than half of its functions should not need the cloud at all; they should use local speech recognition. The advantage is obvious: if there are 1000 people in your address book, the recognizer only needs to match against those 1000 names.
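The local-matching idea can be illustrated with a minimal sketch: the "database" is just the contact list, so recognition reduces to finding the closest name among a small, closed set, with no network round-trip. The names, numbers, and similarity threshold here are invented, and real systems would match on phonetic features rather than spelling; `difflib` stands in for that step.

```python
# Offline name matching against a small contact "database":
# score every contact name against what was heard and return
# the plausible candidates, best match first.
import difflib

contacts = {"Lao Wang": "138-0000-0001",
            "Lao Wong": "138-0000-0002",
            "Xiao Li":  "138-0000-0003"}

def match_contact(heard: str, threshold: float = 0.6) -> list[str]:
    """Return contact names close to the heard string, most similar
    first; an empty list means nothing plausible was found."""
    scored = [(difflib.SequenceMatcher(None, heard.lower(), name.lower()).ratio(),
               name)
              for name in contacts]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]

print(match_contact("Xiao Li"))   # ['Xiao Li'] -> dial immediately
print(match_contact("Lao Wang"))  # ['Lao Wang', 'Lao Wong'] -> ambiguous
```

A single strong match dials directly; two close matches produce exactly the "first or second?" prompt described below.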
So when I pick up the phone and read out the name I want to call, the voice software never touches the cloud. All it has done is build a database in advance, namely the contacts in your address book. Once that database exists, recognition is instant: it returns a result directly, no cloud required. And after the result comes back you don't need to tap a second item; the call is placed directly. If a name sounds similar to another or the pronunciation is ambiguous, it prompts you: the first one or the second? Your entire operation is to pick up the phone and call whoever you want. Say "so-and-so" and you are connected to so-and-so. That is a real smart phone. Imagine how amazing it would be if this function appeared even on the plastic landline in your home.

In fact, speech recognition companies could have implemented this more than ten years ago. For reasons we don't know, no manufacturer has designed it this way, which is very strange. Try it: pick up a few phones, activate them directly, read a name, and place the call. This is the best way to use a phone, and it may change how humans use smart phones. In the future we will no longer fumble through the address book or the call log; to call someone, you just pick up the phone.

I checked my own address book: it has more than 1000 people, and the actual rate of confusion from similar-sounding names is not high. Moreover, once you have made a few calls, the phone can remember whom you call frequently and rank them accordingly. If two names really are hard for the machine to distinguish, it gives you a short list to choose from. But in most cases you pick up the phone, it activates instantly, you say a name, and it dials. That is a real smart phone.
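The frequency-based ranking and the fallback list can be sketched as follows. This is a hypothetical illustration of the idea, not any shipping phone's logic; the call log, names, and the `margin` parameter are made up.

```python
# Disambiguation by call frequency: dial the clear favorite outright,
# and only fall back to asking "first or second?" when the machine
# genuinely cannot decide between similar-sounding names.
from collections import Counter

call_log = ["Lao Wang", "Lao Wang", "Lao Wang", "Lao Wong"]
freq = Counter(call_log)  # how often each contact has been called

def pick_or_ask(candidates: list[str], margin: int = 2):
    """Return ('dial', name) when one candidate clearly dominates,
    else ('ask', ranked_list) so the UI can present a choice."""
    ranked = sorted(candidates, key=lambda n: freq[n], reverse=True)
    if len(ranked) == 1 or freq[ranked[0]] - freq[ranked[1]] >= margin:
        return ("dial", ranked[0])
    return ("ask", ranked)

print(pick_or_ask(["Lao Wang", "Lao Wong"]))  # ('dial', 'Lao Wang')
print(pick_or_ask(["Lao Wong", "Xiao Li"]))   # ('ask', ['Lao Wong', 'Xiao Li'])
```

The design choice matches the text: the common case resolves silently from habit, and the list only appears in the rare genuinely ambiguous case.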
I firmly believe this is the direction smart phones will take, and it completely solves the embarrassment problem psychologically. Suppose we are colleagues at adjacent desks: you are there, I am here typing on my keyboard. If I pick up my phone and simply say "so-and-so", you won't think it strange; out of the corner of your eye you see me pick up the phone and speak, and you assume I am answering a call, or perhaps even talking to you. I feel no embarrassment. But if I pick up the phone and announce "Make a call to so-and-so", you might say, "Damn, rich enough to hire a secretary now?" People will say that. So we should understand: designing a product is not merely a technical problem. Much of the time it is a complex psychological one.