
Voices in AI – Episode 78: A Conversation with Alessandro Vinciarelli

Byron Reese



About this Episode

Episode 78 of Voices in AI features host Byron Reese and Alessandro Vinciarelli as they discuss AI, social signals, and the implications of a computer that can read and respond to emotions. Alessandro Vinciarelli has a Ph.D. in applied mathematics from the University of Bern and is currently a full professor in the School of Computing Science at the University of Glasgow.


Visit to listen to this one-hour podcast or read the full transcript.

Transcript Excerpt

Byron Reese: This is Voices in AI brought to you by GigaOm. I’m Byron Reese. Today our guest is Alessandro Vinciarelli. He is a full professor at the University of Glasgow. He holds a Ph.D. in applied mathematics from the University of Bern. Welcome to the show, Alessandro.

Alessandro Vinciarelli: Welcome. Good morning.

Tell me a little bit about the kind of work you do in artificial intelligence. 

I work on a particular domain that is called social signal processing, which is the branch of artificial intelligence that deals with social psychological phenomena. We can think of the goal of this particular part of the domain as trying to read the mind of people, and through this to interact with people in the same way as people do with one another.

That is like picking up on subtle social cues that people naturally do, teaching machines to do that?

Exactly. At the core of this domain there are what we call social signals, which are nonverbal behavioral cues that people naturally exchange during their social interactions. We talk here about, for example, facial expressions, spontaneous gestures, the posture, how we talk in a broadcast, the way of speaking – not what people say, but how they say it.

The core idea is that basically we can see facial expressions with our eyes, can hear the way people speak with our ears… and so it is also possible to sense these nonverbal behavioral cues with common sensors – like cameras, microphones, and so on. Through automatic analysis of the signals and the application of artificial intelligence approaches, we can map the information we extract from images, audio recordings and so on onto social cues and their meaning for the people that are involved in an interaction.

I guess implicit in that is an assumption that there’s a commonness of social cues across the whole human race? Is that the case?

Yes. Let’s say social signals are the point where nature meets nurture. What does it mean? It means that at the end it’s something that is intimately related to our body, to our evolution, to our very natural being. And in this sense, we all have at our disposal the same expressive means, in the sense that we all have the same way of speaking, the same voice, the same phonetic apparatus. The face is the same for everybody. We have the same muscles at our disposal in order to produce a facial expression. The body is the same for everybody. So, from the way we talk, to our bodies… it is the same for all people around the world.

However, at the same time as we are a part of society, part of a context, we somewhat learn from the others to express specific meaning, like for example a friendly attitude or a hostile attitude or happiness and so on, in a way that somewhat matches the others.

To give an example of how this can work, when I moved to the U.K. … I’m originally from Italy, and I started to teach in this university. A teaching inspector came to see me and told me, “Well, Alessandro, you have to move your arms a little bit less, because you sound very aggressive. You look very aggressive to the students.” You see, in Italy, it is quite normal to move the hands a lot, especially when we communicate in front of an audience. However, here in the U.K., when people use their arms – because everybody around the world does it – I have to do it in a bit more moderate way, in a more let’s say British way, in order to not sound aggressive. So, you see, gestures communicate all over the world. However, the accepted intensity you use changes from one place to the other.

What are some of the practical applications of what you’re working on?

Well, it is quite an exciting time for the community working on these types of topics. After the very pioneering years, if we look at the history of this particular branch of artificial intelligence, we can see that roughly the early 2000s was a very pioneering time. Then the community established more or less between the late 2000s and three or four years ago, when the technology started to work pretty well. And now we are at the point where we start seeing applications of these technologies initially developed at the research level in the laboratories in the real world.

To give an idea, think of today’s personal assistants that can not only understand what we say and what we ask, but also how we express our request. Think of the many animated characters that we can interact with: artificial agents, social robots and so on. They are slowly entering into reality and interacting with people like people do – through gestures, through facial expressions and so on. We see more and more companies that are involved and active in these types of domains. For example, we have systems that manage to recognize the emotions of people through sensors that can be carried like a watch on the wrist.

We have very interesting systems. I collaborate in particular with a company called Neurodata Lab that analyzes the content of multimedia material, trying to get an idea of its emotional content. That can be useful in any type of services about video on demand. There is a major force toward more human computer interfaces, or more in general human/machine interfaces that can figure out how we feel in order to intervene appropriately and interact appropriately with us. These are a few major examples.

So, there’s voice, which I guess you could use over a telephone to determine some emotional state. And there’s facial expressions. And there are other physical expressions. Are there other categories beyond those three that bifurcate or break up the world when you’re thinking of different kinds of signals?

Yes, somewhat. The very fact that we are alive and we have a body somewhat forces us to have nonverbal behavioral cues, as they are called, to communicate through our body. And even if you try not to communicate, that becomes somewhat of a cue and becomes a form of communication. And there are so many nonverbal behavioral cues that psychologists group them into five fundamental classes.

One is whatever happens with the head. Facial expressions, we’ve mentioned, but there are also movements of the head, shaking, nodding and so on. Then we have the posture. Now in this moment we are talking into a microphone. But, for example, when you talk to people, you tend to face them. You can talk to them by not facing them, but the type of impression would be totally different.

Then we have gestures. When we talk about gestures, we talk about the spontaneous movements we make. So, it’s not like the OK gesture with the thumb. It’s not like pointing to something. These have a pretty specific meaning. For example, self-touching… that typically communicates some kind of discomfort. It is the movements we make when we speak. From a cognitive point of view, speaking and gesturing form a bimodal unit, so it’s something that gets lumped together.

Then we have the way of speaking, as I mentioned. Not what we say, but how we say it. So, the sound of the voice, and so on. Then there is appearance, everything we can do in order to change our appearance. So, for example the attractiveness of the person, but also the kind of clothes you wear, the type of ornaments you have, and so on.

And the last one is the organization of space. For example, in a company, the more important you are, the bigger your office is. So space from that point of view communicates a form of social verticality. Similarly, we modulate our distances with respect to other people, not only in physical terms but also in social terms. The closer a person is to us from a social point of view, the closer we let them come from a physical point of view.

So, these are the five wide categories of social signals that psychologists fundamentally recognize as the most important.

Well, as you go through them, I guess I can see how AI would be used. They’re all forms of data that could be measured. So, presumably you can train an artificial intelligence on them. 

That is exactly the core idea of the domain and of the application of artificial intelligence in these types of problems. So, the point is that to communicate with others, to interact with others, we have to manifest our inner state through our behavior – through what we do. Because we cannot imagine communicating something that is not observable… Whatever is observable, meaning it is accessible to our senses, is something that is accessible to artificial sensors. Once you can measure, once you can extract data about something, that is where artificial intelligence comes into play. At the point you can extract data, and the data can be automatically analyzed, then you can automatically infer information about the social and psychological phenomena taking place from the data you managed to capture.

Listen to this one-hour episode or read the full transcript at


Byron explores issues around artificial intelligence and conscious computers in his new book The Fourth Age: Smart Robots, Conscious Computers, and the Future of Humanity.
