1. A comprehensive review of task understanding of command-triggered execution of tasks for service robots. Artif Intell Rev 2022. DOI: 10.1007/s10462-022-10347-6

2. Huang K, Han Y, Wu J, Qiu F, Tang Q. Language-Driven Robot Manipulation With Perspective Disambiguation and Placement Optimization. IEEE Robot Autom Lett 2022. DOI: 10.1109/lra.2022.3146955

3. Doering M, Brščić D, Kanda T. Data-Driven Imitation Learning for a Shopkeeper Robot with Periodically Changing Product Information. ACM Trans Hum-Robot Interact 2021. DOI: 10.1145/3451883
Abstract
Data-driven imitation learning enables service robots to learn social interaction behaviors, but such systems cannot adapt after training to changes in the environment, such as a store's changing product lineup. To address this, a novel learning system is proposed that uses neural attention and approximate string matching to copy information from a product-information database into its output. A camera-shop interaction dataset was simulated for training and testing. In an offline, human-judged evaluation, the proposed system outperformed both a baseline and a previous state-of-the-art system.
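
The copy mechanism described in this abstract pairs learned attention with fuzzy string matching so that database values can be emitted verbatim even after products change. Below is a minimal sketch of that idea, assuming stand-in embeddings and illustrative names (attend, copy_field, db_fields); it is a reading of the abstract, not the authors' implementation.

```python
# A minimal sketch, not the authors' implementation: `attend`, `copy_field`,
# and the embeddings are illustrative stand-ins.
import difflib
import numpy as np

def attend(query_vec, field_vecs):
    """Softmax attention weights of a decoder state over field embeddings."""
    scores = field_vecs @ query_vec
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def copy_field(utterance, db_fields, query_vec, field_vecs):
    """Select a database field by blending neural attention with
    approximate string matching, then copy its value verbatim."""
    attn = attend(query_vec, field_vecs)
    # Fuzzy similarity between the customer's words and each field name.
    sims = np.array([
        difflib.SequenceMatcher(None, utterance.lower(), name.lower()).ratio()
        for name, _ in db_fields
    ])
    best = int(np.argmax(attn * sims))
    return db_fields[best][1]

# Usage: because the value is copied at run time, updating the database
# entry changes the robot's answer without any retraining.
fields = [("price", "$499"), ("megapixels", "24 MP"), ("weight", "403 g")]
print(copy_field("how many megapixels does this camera have",
                 fields, np.random.rand(8), np.random.rand(3, 8)))
```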

4. Röder F, Özdemir O, Nguyen PDH, Wermter S, Eppe M. The Embodied Crossmodal Self Forms Language and Interaction: A Computational Cognitive Review. Front Psychol 2021; 12:716671. PMID: 34484079; PMCID: PMC8415221; DOI: 10.3389/fpsyg.2021.716671
Abstract
Human language is inherently embodied and grounded in sensorimotor representations of the self and of the surrounding world. This suggests that the body schema and ideomotor action-effect associations play an important role in language understanding, language generation, and verbal/physical interaction with others. There are computational models that focus purely on non-verbal interaction between humans and robots, and computational models for dialog systems that focus only on verbal interaction, but little research integrates the two. We hypothesize that computational models of the self are well suited to capturing joint verbal and physical interaction: they offer substantial potential to deepen the psychological and cognitive understanding of language grounding and to improve human-robot interaction methods and applications. This review is a first step toward developing models of the self that integrate verbal and non-verbal communication. To this end, we first analyze the relevant findings and mechanisms for language grounding in the psychological and cognitive literature on ideomotor theory. Second, we identify existing computational methods that implement physical decision-making and verbal interaction. Finally, we outline how these methods can be combined into advanced computational interaction models that integrate language grounding with body schemas and self-representations.

Affiliation(s)
- Frank Röder: Knowledge Technology, Department of Informatics, University of Hamburg, Hamburg, Germany

5. Mi J, Lyu J, Tang S, Li Q, Zhang J. Interactive Natural Language Grounding via Referring Expression Comprehension and Scene Graph Parsing. Front Neurorobot 2020; 14:43. PMID: 32670046; PMCID: PMC7331387; DOI: 10.3389/fnbot.2020.00043
Abstract
Natural language provides an intuitive and effective interface between humans and robots. Multiple approaches have been proposed to address natural language visual grounding for human-robot interaction. However, most existing approaches resolve the ambiguity of natural language queries and ground target objects through dialogue systems, which makes the interaction cumbersome and time-consuming. In contrast, we address interactive natural language grounding without auxiliary information. Specifically, we first propose a referring expression comprehension network to ground natural referring expressions. The network extracts visual semantics via a visual semantic-aware network and exploits the rich linguistic context in expressions via a language attention network. Furthermore, we combine the referring expression comprehension network with scene graph parsing to achieve unrestricted and complicated natural language grounding. Finally, we validate the performance of the referring expression comprehension network on three public datasets, and we evaluate the effectiveness of the interactive grounding architecture by conducting extensive natural language query groundings in different household scenarios.
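
A language attention network of the kind described here weights the informative words of an expression before matching it against candidate image regions. The following is a minimal sketch under that reading; ground_expression and the random features are illustrative stand-ins, not the paper's code.

```python
# A minimal sketch (assumed names and shapes, not the paper's network) of
# grounding a referring expression with soft attention over its words.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ground_expression(word_embs, region_feats):
    """Return the index of the region best matching the expression.

    word_embs:    (num_words, d) embeddings of the expression's tokens
    region_feats: (num_regions, d) visual features of candidate regions
    """
    phrase = word_embs.mean(axis=0)           # coarse phrase vector
    word_attn = softmax(word_embs @ phrase)   # weight informative words
    query = word_attn @ word_embs             # attended language query
    region_scores = region_feats @ query      # match regions to the query
    return int(np.argmax(region_scores))

# Usage with random stand-ins for real CNN and word-embedding features:
words = np.random.rand(4, 16)    # e.g., "the red cup left"
regions = np.random.rand(6, 16)  # six detected object regions
print("grounded region:", ground_expression(words, regions))
```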

Affiliation(s)
- Jinpeng Mi: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China; Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany
- Jianzhi Lyu: Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany
- Song Tang: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China; Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany
- Qingdu Li: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China
- Jianwei Zhang: Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany

6. Arkin J, Park D, Roy S, Walter MR, Roy N, Howard TM, Paul R. Multimodal estimation and communication of latent semantic knowledge for robust execution of robot instructions. Int J Rob Res 2020. DOI: 10.1177/0278364920917755
Abstract
The goal of this article is to enable robots to perform robust task execution following human instructions in partially observable environments. A robot’s ability to interpret and execute commands is fundamentally tied to its semantic world knowledge. Commonly, robots use exteroceptive sensors, such as cameras or LiDAR, to detect entities in the workspace and infer their visual properties and spatial relationships. However, semantic world properties are often visually imperceptible. We posit the use of non-exteroceptive modalities including physical proprioception, factual descriptions, and domain knowledge as mechanisms for inferring semantic properties of objects. We introduce a probabilistic model that fuses linguistic knowledge with visual and haptic observations into a cumulative belief over latent world attributes to infer the meaning of instructions and execute the instructed tasks in a manner robust to erroneous, noisy, or contradictory evidence. In addition, we provide a method that allows the robot to communicate knowledge dissonance back to the human as a means of correcting errors in the operator’s world model. Finally, we propose an efficient framework that anticipates possible linguistic interactions and infers the associated groundings for the current world state, thereby bootstrapping both language understanding and generation. We present experiments on manipulators for tasks that require inference over partially observed semantic properties, and evaluate our framework’s ability to exploit expressed information and knowledge bases to facilitate convergence, and generate statements to correct declared facts that were observed to be inconsistent with the robot’s estimate of object properties.
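
The cumulative-belief idea can be made concrete with a discrete Bayes filter over a single latent property: each modality contributes a likelihood, and the posterior after one observation becomes the prior for the next. Below is a minimal sketch; the sensor-model numbers are assumed for illustration and are not from the paper.

```python
# A discrete Bayes filter over one latent binary object property, e.g.,
# whether a box is "heavy": belief ∝ prior × likelihood of each observation.
# Illustrative only; the likelihoods are assumptions, not the authors' model.

def update(belief, likelihood_true, likelihood_false):
    """One Bayesian update of P(property=True) given one observation."""
    num = belief * likelihood_true
    den = num + (1.0 - belief) * likelihood_false
    return num / den

belief = 0.5                                  # uninformative prior
# Assumed sensor models: P(obs | property=True), P(obs | property=False)
evidence = [
    ("language: 'the heavy box'", 0.9, 0.2),  # factual description
    ("vision: large bounding box", 0.6, 0.4), # weakly informative cue
    ("haptics: high lift effort",  0.8, 0.1), # proprioceptive observation
]
for name, l_true, l_false in evidence:
    belief = update(belief, l_true, l_false)
    print(f"{name:32s} -> P(heavy) = {belief:.3f}")
# Contradictory evidence lowers the belief instead of breaking execution,
# and a low-confidence belief can trigger communication back to the human.
```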

Affiliation(s)
- Jacob Arkin: Robotics and Artificial Intelligence Laboratory, University of Rochester, USA
- Daehyung Park: Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA
- Subhro Roy: Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA
- Matthew R Walter: Robot Intelligence through Perception Laboratory, Toyota Technological Institute at Chicago, USA
- Nicholas Roy: Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA
- Thomas M Howard: Robotics and Artificial Intelligence Laboratory, University of Rochester, USA
- Rohan Paul: Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA; Department of Computer Science and Engineering, Indian Institute of Technology Delhi, India

7. Mi J, Liang H, Katsakis N, Tang S, Li Q, Zhang C, Zhang J. Intention-Related Natural Language Grounding via Object Affordance Detection and Intention Semantic Extraction. Front Neurorobot 2020; 14:26. PMID: 32477091; PMCID: PMC7238763; DOI: 10.3389/fnbot.2020.00026
Abstract
Like specific natural language instructions, intention-related natural language queries play an essential role in daily communication. Inspired by the psychological concept of "affordance" and its applications in human-robot interaction, we propose an object affordance-based natural language visual grounding architecture to ground intention-related natural language queries. Formally, we first present an attention-based multi-visual-feature fusion network to detect object affordances from RGB images. When fusing deep visual features extracted from a pre-trained CNN with deep texture features encoded by a deep texture encoding network, the affordance detection network accounts for the interaction of the multiple visual features and preserves their complementary nature by integrating attention weights learned from sparse representations of the features. We train and validate the attention-based object affordance recognition network on a self-built dataset whose images largely originate from MSCOCO and ImageNet. Moreover, we introduce an intention semantic extraction module to extract intention semantics from intention-related natural language queries. Finally, we ground intention-related natural language queries by integrating the detected object affordances with the extracted intention semantics. We conduct extensive experiments to validate the performance of both the object affordance detection network and the overall grounding architecture.
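
Attention-weighted fusion of two feature streams, as described here, can be sketched as scoring each stream, normalizing the scores, and blending. The version below is illustrative only: fuse, w_attn, and the feature dimensions are assumptions, and the learning-from-sparse-representations step is reduced to a fixed parameter matrix.

```python
# A minimal sketch (assumed shapes and names, not the paper's network) of
# attention-weighted fusion of two visual feature streams, e.g., CNN
# appearance features and texture-encoding features, before classification.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(appearance, texture, w_attn):
    """Fuse two d-dim feature vectors with learned per-stream attention.

    w_attn: (2, d) parameters scoring each stream's usefulness; here a
    stand-in for weights learned from sparse representations of the streams.
    """
    scores = np.array([w_attn[0] @ appearance, w_attn[1] @ texture])
    a = softmax(scores)                        # per-stream attention weights
    return a[0] * appearance + a[1] * texture  # complementary, weighted blend

# Usage with random stand-ins for real features:
rng = np.random.default_rng(0)
fused = fuse(rng.random(32), rng.random(32), rng.random((2, 32)))
print(fused.shape)  # (32,) fused descriptor fed to the affordance classifier
```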

Affiliation(s)
- Jinpeng Mi: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China; Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany
- Hongzhuo Liang: Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany
- Nikolaos Katsakis: Human-Computer Interaction, Department of Informatics, University of Hamburg, Hamburg, Germany
- Song Tang: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China; Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany
- Qingdu Li: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China
- Changshui Zhang: Department of Automation, State Key Lab of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
- Jianwei Zhang: Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany