Do Robots Need Theory of Mind? Part 2
Why Robots might need Theory of Mind (ToM)
Existential Risk and the AI Alignment Problem
Russell (2019) argues that we have been thinking about building artificial intelligence (AI) systems the wrong way. Since its inception, AI has attempted to build systems that can achieve ‘their own’ goals, albeit that we might give them those goals in the first instance. Instead, he says, we should be building AIs that understand ‘the preference structure’ that a person has and attempt to satisfy goals within the constraints of that preference structure.
In this way, the AI will be able to understand that acting to achieve one goal (e.g. getting a coffee) may interact or interfere with other preferences, goals or constraints (e.g. not knocking someone out of the way in the process) and thereby moderate its behaviour. An AI needs to understand that a goal is not there to be achieved ‘at all cost’. Instead it should be achieved taking into account many other preferences and priorities that might moderate it. Russell argues that if we think of building AIs in this way, we may be able to avoid the existential risk that superhuman AIs will eventually take over, and either deliberately or inadvertently wipe out humanity.
This is an example of what AI researchers have termed ‘the AI alignment problem’, which potentially creates an existential risk to humanity if we find ourselves, having built super-intelligent machines, unable to control them. Nick Bostrom (Bostrom 2014) has also characterised this threat using the example of an AI that is set the goal of producing paperclips and takes it so literally that, in the single-minded execution of this goal and with no appreciation of when to stop, it destroys humanity (for example, in its need for more raw materials). Several other researchers have addressed the AI alignment problem (mainly in terms of laws, regulations and social rules), including Taylor et al. (2017), Hadfield-Menell & Hadfield (2019), Vamplew et al. (2018) and Hadfield-Menell, Andrus & Hadfield (2019).
Russell (2019) goes on to describe how an AI should always have some level of uncertainty about what people want. Such uncertainty would put a check on the single-minded execution of a goal at all cost. It would drive a need for the AI to keep monitoring and maintaining its model of what a person might want at any point in time. It would require the AI to keep checking that what it was doing was ‘on-track’ or ‘aligned’ with a person’s whole preference structure. So, if, for example, you had instructed your self-driving car to take you to the airport and you received a message during the trip that your child had been in a road accident, the AI might recognise this as significant, and check whether you wanted to change your plans.
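To make the idea of uncertainty acting as a check more concrete, here is a minimal sketch in Python. It is my own illustration, not Russell’s formulation: the goal names, the probabilities, the 0.9 confidence threshold and the function names are all assumptions chosen for the example. The AI holds a belief over what the person currently wants, updates it with Bayes’ rule when something significant happens, and checks in with the person whenever no single goal is sufficiently probable.

```python
def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {goal: w / total for goal, w in weights.items()}


def update_belief(belief, likelihood):
    """Bayes' rule: the posterior is proportional to prior times likelihood."""
    return normalise({goal: belief[goal] * likelihood[goal] for goal in belief})


def should_check_in(belief, confidence_threshold=0.9):
    """Ask the person whenever no single goal is sufficiently probable."""
    return max(belief.values()) < confidence_threshold


# Hypothetical belief at the start of the trip: the passenger almost
# certainly still wants to go to the airport.
belief = {"airport": 0.95, "hospital": 0.05}
check_before = should_check_in(belief)

# Assumed likelihoods of receiving the accident message under each goal
# (illustrative numbers only, not taken from any study).
message_likelihood = {"airport": 0.05, "hospital": 0.9}
belief_after = update_belief(belief, message_likelihood)
check_after = should_check_in(belief_after)

print(check_before, check_after, belief_after)
```

With these made-up numbers the car carries on without interruption before the message arrives, but afterwards neither destination is probable enough for it to proceed without asking. The threshold is the design choice that matters: set it too high and the AI pesters the person constantly, set it too low and it ploughs on regardless.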
Russell arrives at this position from addressing the problem of existential risk. It is a proposed solution to the AI alignment problem. Working within this frame of reference, he proposes solutions like ‘Cooperative Inverse Reinforcement Learning’ (Malik et al. 2018), whereby the Autonomous Intelligent System (AIS) attempts to infer the preference structure of a person from observations of their behaviour. This, indeed, seems to be a sensible approach.
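The general idea of inferring preferences from behaviour can be sketched in a few lines of Python. This is not the algorithm of Malik et al. (2018); it is a much simpler Bayesian illustration under my own assumptions. The two candidate preference structures, the reward values, the ‘rationality’ parameter and the function names are all hypothetical. The person is assumed to choose higher-reward options more often (a softmax, or Boltzmann, choice model), and each observed choice shifts the AIS’s belief about which preference structure the person has.

```python
import math


def choice_probability(rewards, chosen, rationality=2.0):
    """Softmax choice model: higher-reward options are chosen more often."""
    weights = {option: math.exp(rationality * r) for option, r in rewards.items()}
    return weights[chosen] / sum(weights.values())


def infer_preferences(hypotheses, prior, observed_choices):
    """Update the belief over candidate preference structures, one choice at a time."""
    posterior = dict(prior)
    for chosen in observed_choices:
        for name, rewards in hypotheses.items():
            posterior[name] *= choice_probability(rewards, chosen)
        total = sum(posterior.values())
        posterior = {name: p / total for name, p in posterior.items()}
    return posterior


# Two hypothetical preference structures the person might have.
hypotheses = {
    "prefers_coffee": {"coffee": 1.0, "tea": 0.0},
    "prefers_tea": {"coffee": 0.0, "tea": 1.0},
}
prior = {"prefers_coffee": 0.5, "prefers_tea": 0.5}

# Observed behaviour: mostly coffee, with one cup of tea.
posterior = infer_preferences(hypotheses, prior, ["coffee", "coffee", "tea", "coffee"])
print(posterior)
```

Note that the single cup of tea does not overturn the inference. Because the choice model allows for occasional departures from the preferred option, the belief leans strongly towards ‘prefers_coffee’ without ever reaching certainty, which is the kind of residual uncertainty Russell argues an AI should retain.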
However, the exact mechanism by which an AIS coordinates its actions with a person or people may well depend on it being able to accurately infer people’s mental states. Otherwise it might have to explicitly check (e.g. by asking) every few seconds whether what it was doing was acceptable, and it would need to ‘read’ when a person found its behaviour unacceptable (e.g. by noting the frown when about to hit somebody on its mission to get the coffee).
The AI alignment problem is precisely the problem that every person has when interacting with another human being. When interacting with somebody else we are unable to directly observe their internal mental states. We cannot know their preference structure and we can only take on trust that their intentions are what they say they are. Their real intentions, beliefs, desires, values, and boundaries could, in principle, be anything. What we do is infer, from their behaviours (including what they say and what we understand from this), what their intentions are. Intentions, beliefs, and preferences are all hidden variables that may be the underlying causes of behaviours but, because they are unobservable, can only be guessed at.
Russell takes this on board and understands that the alignment problem is one that exists between any two agents, human or artificial. He is saying that robots need to be equipped with similar mechanisms to those that people generally have. These are the mechanisms that can model human beliefs, preferences and intentions by making inferences from observations of behaviour. Fortunately, we are not discovering and inventing these mechanisms for the first time.
Alignment with What?
A potential problem with having an AIS infer, reason and act on its analysis of another person’s mental states is that it may not accurately predict the consequences of its own actions. An action designed to do good may, in fact, do harm. In addition to being mistaken about the direction of its effect on mental states (positive or negative) it may also be inaccurate about the extent. So, an act designed to please may have no effect, or an act that is not intended to cause either pleasure or displeasure may have an effect.
This is quite apart from all manner of other complications that we might describe as its ‘policies’. Should, for example, an AIS always act to minimise harm and maximise a person’s pleasure? How should an AIS react if a person consistently fails to take medication prescribed for their benefit? How should it trade off short and longer-term benefits? How does an AIS reconcile differences between two or more people, between a person’s legal obligations and their desires, or between the interests of a person and an organisation (a school, a company, their employer, the tax office and so on)?
In all these cases, the issue comes down to how the AIS evaluates its own choice of possible actions (or inaction) and which stakeholders it takes into account when performing this evaluation. Numerous guidelines have been produced in recent years to help guide developers of AI systems. The good news is that there is considerable agreement about the kinds of principles that apply: not contravening human rights, not doing harm, increasing wellbeing, transparency and explainability in how the AIS arrives at decisions, elimination of bias and discrimination, and clear accountability and responsibility for the AIS’s decisions. The main mechanism for putting these principles into practice is the training and controls (guidelines, standards and legal) of companies, designers and developers. Comparatively little has been proposed for the controls that might be embedded within the AIS itself, and even less about the principles and mechanisms that might be used to achieve this.
We could turn to economics for models of preference and choice, but these models are discredited by findings in the social sciences (e.g. prospect theory), and many would argue that the incentives encouraged by such models are precisely what have led to existential risks like nuclear arms races and climate change. We would therefore need to think very carefully before using these same models to drive the design of artificial intelligences, because of their potential to add yet another existential risk.
The existential risk discussed in relation to AISs has tended to focus on the fear that if an artificial intelligence is given autonomy to achieve its objectives without constraint, then it might do anything. Even simple systems can become unpredictable very quickly, and a system that is unpredictable is out of control. In the anthropomorphic way characteristic of human beings, we project onto the AIS that it would be concerned about its own self-preservation, or that it would discover that self-preservation was a necessary pre-condition to attaining its goal(s). We further project that if it adopts the goal of self-preservation, then it might pursue this at all cost, putting its own self-preservation ahead of even that of its creators. There are some good reasons for these fears, because goals like self-preservation and accumulation of resources are instrumental to the achievement of any other goal and an AIS might easily reason that out (Bostrom 2012). There have been challenges to this line of reasoning, but this debate is not a central concern here. Rather, I am more concerned with whether an AIS can align with the goals of an individual using the same sorts of social cues that we all use in the informal ways in which we, in general, cooperate with each other.
If we are already concerned that the economic and political systems currently in place can have some undesirable consequences, like other existential risks and concentrations of wealth in the hands of a few, then the last thing we would want to do is build into AISs the same mechanisms for evaluating choices as those assumed by classical economic theory. In these posts, I look primarily to psychology (and sometimes philosophy) to provide evidence and analysis of how people make decisions in a social world, particularly one in which we are taking into account our beliefs about other people’s mental states. Whether this provides an answer to the alignment problem remains to be seen, but it is, at least, another perspective that may help us understand the types of control mechanisms we may need as the development of AIS proceeds at an ever-increasing pace.
Cooperation and Collaboration
The paradigm in which robots act as slaves to their human masters is gradually being replaced by one in which robots and humans work collaboratively to achieve some goal (Sheridan 2016). This applies to individual human-robot interactions and to multi-robot teams (Rosenfeld et al. 2017). If robots and AISs generally could infer the mental states of the people around them when performing complex tasks, then this could potentially lead to more intuitive and efficient collaboration between the person and the machine. This requires trust on the part of the human that the robot will play its part in the interaction (Hancock et al. 2011).
As a step on the way, systems have been built where robots collaborate with each other without communication to perform complex tasks using only visual cues (Gassner et al. 2017). Collaboration is especially useful in situations like care giving (Miyachi, Iga & Furuhata 2017) where giving explicit verbal instructions might be difficult (e.g. in cases of Alzheimer’s or autism). Gray et al. (2005) proposed a system of action parsing and goal simulation whereby a robot might infer goals and mental states of others in a collaborative task scenario.
Equipping AISs with the ability to recognise, infer and reason about the mental states of others could have some extraordinary advantages. Not only might we avoid existential risk to humanity (and could there be anything of greater significance?) and make our interactions with robots and AISs generally easy and intuitive, but also we could be living alongside intelligent artefacts that have the robust capacity to carry out moral reasoning. Not only could they keep themselves in check, so that they made only justifiable moral decisions with respect to their own actions, but they might also help us adjudicate our own actions, offering fair, reasonable and justifiable remedies to human transgressions of the law and other social codes. They might become reliable and trustworthy helpers and companions, politely guiding us in solving currently intractable world problems, and protecting us from our own worst human biases, vices, and deficiencies. If they turned out to be better at moral reasoning than people, then, like wise philosophers, they could offer us considered advice to help us achieve our goals and deal with the dilemmas of everyday life.
However, there is much that stands in the way of achieving this utopian relationship with the intelligent artefacts we create, especially if we want an AIS to infer mental states in the same way a person might, by observation and perhaps asking questions. We are beginning to understand patterns of neuronal activity sufficiently well to infer some mental states. For example, Haynes et al. (2007) report being able to tell which of two choices a person is making from looking at neural activity. Elon Musk is creating ‘Neural Lace’ for such a purpose (Cuthbertson 2016), but could mental states be inferred using non-invasive approaches?
In particular, could we create AISs that could infer our mental states, without inadvertently creating an even greater and more immediate existential risk? I will later argue that giving AISs theory of mind, without them having the same sort of controls on social behaviour that empathy gives people, could be a disaster that heightens existential risk in our very attempt to avoid it. In subsequent posts I first consider whether the artificial inferencing of human mental states is even a credible possibility.
References
Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2), 71–85. https://doi.org/10.1007/s11023-012-9281-3
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). Oxford University Press, Inc., USA.
Cuthbertson, A. (2016). Elon Musk: Humans Need ‘Neural Lace’ to Compete With AI. Retrieved from http://europe.newsweek.com/elon-musk-neural-lace-ai-artificial-intelligence-465638?rm=eu
Gassner, M., Cieslewski, T., & Scaramuzza, D. (2017). Dynamic collaboration without communication: Vision-based cable-suspended load transport with two quadrotors. In Proceedings – IEEE International Conference on Robotics and Automation (pp. 5196-5202). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICRA.2017.7989609
Gray, J., Breazeal, C., Berlin, M., Brooks, A., & Lieberman, J. (2005). Action parsing and goal inference using self as simulator. In Proceedings – IEEE International Workshop on Robot and Human Interactive Communication (Vol. 2005, pp. 202–209). https://doi.org/10.1109/ROMAN.2005.1513780
Hadfield-Menell, D., & Hadfield, G. K. (2019). Incomplete contracting and AI alignment. In AIES 2019 – Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 417–422). Association for Computing Machinery, Inc. https://doi.org/10.1145/3306618.3314250
Hadfield-Menell, D., Andrus, M., & Hadfield, G. K. (2019). Legible normativity for AI alignment: The value of silly rules. In AIES 2019 – Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 115–121). Association for Computing Machinery, Inc. https://doi.org/10.1145/3306618.3314258
Hancock, P. A., Billings, D. R., Schaefer, K. E., Chen, J. Y. C., De Visser, E. J., & Parasuraman, R. (2011). A meta-analysis of factors affecting trust in human-robot interaction. Human Factors, 53(5), 517–527. https://doi.org/10.1177/0018720811417254
Haynes, J. D., Sakai, K., Rees, G., Gilbert, S., Frith, C., & Passingham, R. E. (2007). Reading Hidden Intentions in the Human Brain. Current Biology, 17(4), 323–328. https://doi.org/10.1016/j.cub.2006.11.072
Malik, D., Palaniappan, M., Fisac, J., Hadfield-Menell, D., Russell, S., & Dragan, A. (2018). An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning. In Proc. ICML-18, Stockholm.
Miyachi, T., Iga, S., & Furuhata, T. (2017). Human Robot Communication with Facilitators for Care Robot Innovation. In Procedia Computer Science (Vol. 112, pp. 1254-1262). Elsevier B.V. https://doi.org/10.1016/j.procs.2017.08.078
Rosenfeld, A., Agmon, N., Maksimov, O., & Kraus, S. (2017). Intelligent agent supporting human-multi-robot team collaboration. Artificial Intelligence, 252, 211-231. https://doi.org/10.1016/j.artint.2017.08.005
Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control (1st ed.). Allen Lane. ISBN-10: 0241335205, ISBN-13: 978-0241335208
Sheridan, T. B. (2016). Human-Robot Interaction: Status and Challenges. Human Factors, 58(4), 525-32. https://doi.org/10.1177/0018720816644364
Taylor, J., Yudkowsky, E., Lavictoire, P., & Critch, A. (2017). Alignment for Advanced Machine Learning Systems. Miri, 1–25. Retrieved from https://intelligence.org/files/AlignmentMachineLearning.pdf
Vamplew, P., Dazeley, R., Foale, C., Firmin, S., & Mummery, J. (2018). Human-aligned artificial intelligence is a multiobjective problem. Ethics and Information Technology, 20(1), 27–40. https://doi.org/10.1007/s10676-017-9440-6