In recent years, the importance of capturing intuitive physics computationally has been widely recognized. Motivated by humans’ innate ability to understand physical phenomena, research on enabling computers to understand the real world has been vigorously pursued. Approaches to understanding the physical world vary: some rely on object recognition through visual inference from image information, while others infer real-world events from image features extracted from video. In contrast, we propose a method that extracts motion inflection points in the real world from the latent hierarchical structure of the physical relationships among recognized objects. Specifically, we modify the Variational Temporal Abstraction (VTA) model so that it can extract inflection points from a given graph structure, which represents the physical relationships among objects through their latent system. We evaluated whether our method correctly detects motion inflection points on a modified CLEVRER dataset [19] and confirmed that it does so with high accuracy.