Filtered-CoPhy : Unsupervised Learning of Counterfactual Physics in Pixel Space

  • Steeven Janny (LIRIS, INSA Lyon, France)
  • Fabien Baradel (Naver Labs Europe, Grenoble, France)
  • Natalia Neverova (Meta AI, London, UK)
  • Madiha Nadri (LAGEPP, Univ. Lyon 1, France)
  • Greg Mori (SFU and Borealis AI, Vancouver, Canada)
  • Christian Wolf (Naver Labs Europe; work done at INSA Lyon)
  




Abstract

Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can differ vastly from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and in scenarios involving fabrics of different shapes, such as shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.



Benchmark

We introduce Filtered-CoPhy, a benchmark suite for counterfactual reasoning about physics in pixel space. Building on the work of Baradel et al. (2019), the benchmark comprises three scenarios: BlocktowerCF, BallsCF and CollisionCF. It was designed and generated under constraints on identifiability and counterfactuality. Each scenario includes training, validation and test experiments. An experiment is represented by two RGB sequences:
  • An observed sequence \(\mathbf{AB}\) with initial condition \(\mathbf{A}=\mathbf{X}_0\) and outcome \(\mathbf{B} = \mathbf{X}_{t=1..T}\). This sequence is used during the abduction step to identify the confounder variables.
  • A counterfactual sequence \(\mathbf{CD}\) derived from \(\mathbf{AB}\). The initial conditions \(\mathbf{A}\) and \(\mathbf{C}\) are linked through a do-operator \(do(\mathbf{X}_0 = \mathbf{C})\), a visual intervention that modifies the initial condition.
A do-operation is a visually observable change in the initial physical setup, such as displacing or removing an object. Experiments are parameterized by a set of intrinsic physical parameters, referred to as confounders, which are not observable from the single initial image \(\mathbf{A}\). The counterfactual task consists of inferring the counterfactual outcome \(\mathbf{D}\) given the observed trajectory \(\mathbf{AB}\) and the counterfactual initial state \(\mathbf{C}\).
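The structure of an experiment and of the counterfactual task can be sketched as follows. This is only an illustrative layout, not the dataset's actual file format: the names `Experiment`, `counterfactual_inputs` and `counterfactual_target` are hypothetical, and frames are represented by opaque placeholders rather than real RGB images.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Frame = Any  # placeholder for one RGB image (hypothetical representation)


@dataclass
class Experiment:
    """One experiment: an observed sequence AB and a counterfactual sequence CD."""
    ab: List[Frame]                # ab[0] is A (X_0), ab[1:] is the outcome B
    cd: List[Frame]                # cd[0] is C (after the do-operation), cd[1:] is D
    confounders: Dict[str, float]  # hidden physical parameters, never shown to a model


def counterfactual_inputs(exp: Experiment) -> Tuple[List[Frame], Frame]:
    """Return what a model may observe: the full sequence AB and the initial frame C."""
    return exp.ab, exp.cd[0]


def counterfactual_target(exp: Experiment) -> List[Frame]:
    """Return the frames D that the model has to predict."""
    return exp.cd[1:]


# Toy usage with strings standing in for images.
exp = Experiment(
    ab=["A", "B1", "B2"],
    cd=["C", "D1", "D2"],
    confounders={"mass_0": 10.0},
)
observed, initial_c = counterfactual_inputs(exp)
target_d = counterfactual_target(exp)
```

The confounders are kept in the record only so that a generator or an evaluation script could use them; a model solving the task would receive `observed` and `initial_c` alone.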

In BlocktowerCF, each experiment involves \(K=3\) or \(4\) stacked cubes in resting contact, potentially in an unstable configuration. The confounder variables are the masses, which we discretize as \(m \in \{1, 10\}\). When generating the experiments, we enforce an identifiability constraint which guarantees that the masses needed to forecast the counterfactual sequence \(\mathbf{CD}\) are identifiable from the trajectories of the cubes in \(\mathbf{AB}\).
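To give a sense of the size of the confounder space, the short sketch below enumerates every assignment of the two discretized mass values to a tower of \(K\) cubes. The helper name `mass_assignments` is hypothetical, and the sketch does not reproduce the identifiability filtering used to generate the benchmark.

```python
from itertools import product

MASS_VALUES = (1, 10)  # discretized masses stated in the text


def mass_assignments(num_cubes):
    """List every way of assigning a mass in MASS_VALUES to each cube."""
    return list(product(MASS_VALUES, repeat=num_cubes))


for k in (3, 4):
    # 2**k assignments: 8 for three cubes, 16 for four cubes
    print(k, len(mass_assignments(k)))
```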

Length    | Sampling rate | # Training experiments | # Validation experiments | # Test experiments
6 seconds | 25 FPS        | 22 564                 | 6 016                    | 3 008

Do-operations consist of either removing the top cube (around 30% of the dataset) or moving one cube in the horizontal plane. In the latter case, we make sure that the cube does not move too far from the tower, so that contact is maintained.
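A minimal sketch of how such a do-operation could be sampled is given below. It is not the benchmark's generation code: the function name, the dictionary layout, and in particular the `max_shift` bound (a stand-in for the requirement that the moved cube stays in contact with the tower) are assumptions made for illustration.

```python
import random


def sample_do_operation(num_cubes, rng, p_remove=0.3, max_shift=0.5):
    """Sample a hypothetical do-operation for a tower of `num_cubes` cubes.

    p_remove  : probability of removing the top cube (about 30% in the text)
    max_shift : assumed bound on the horizontal displacement, arbitrary units
    """
    if rng.random() < p_remove:
        # The top cube is taken to be the one with the highest index.
        return {"type": "remove", "cube": num_cubes - 1}
    cube = rng.randrange(num_cubes)
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    return {"type": "move", "cube": cube, "shift": (dx, dy)}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ops = [sample_do_operation(4, rng) for _ in range(10_000)]
removal_rate = sum(op["type"] == "remove" for op in ops) / len(ops)
```

A real generator would additionally have to check the resulting configuration against the physics engine, since a bounded shift alone does not guarantee contact.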

In BallsCF, each experiment involves \(K=3, 4, 5\) or \(6\) spheres bouncing in a square 2D arena. Each sphere is given an initial velocity that is not affected by the do-operation. The confounder variables are the mass of each sphere, \(m \in \{1, 10\}\), and the initial velocities. When generating the experiments, we enforce an identifiability constraint which guarantees that the masses needed to forecast the counterfactual sequence \(\mathbf{CD}\) are identifiable from the trajectories of the spheres in \(\mathbf{AB}\).

Length    | Sampling rate | # Training experiments | # Validation experiments | # Test experiments
6 seconds | 25 FPS        | 22 723                 | 6 000                    | 3 008

Do-operations consist of either removing a sphere (around 30% of the dataset) or moving one sphere in the horizontal plane.

In CollisionCF, each experiment involves two objects, a sphere and a cylinder: one is static in the center of the scene and the other moves toward the center. The moving object is given an initial velocity that is not affected by the do-operation. The confounder variables are the mass of each object, \(m \in \{1, 10\}\), and the initial velocity of the moving object. When generating the experiments, we enforce an identifiability constraint which guarantees that the masses needed to forecast the counterfactual sequence \(\mathbf{CD}\) are identifiable from the trajectories of the objects in \(\mathbf{AB}\).

Length    | Sampling rate | # Training experiments | # Validation experiments | # Test experiments
3 seconds | 25 FPS        | 13 980                 | 6 000                    | 3 000

Do-operations consist of either flipping the cylinder between vertical and horizontal orientation, or shifting the position of the moving object relative to the resting one along one of the three canonical directions \(x\), \(y\) and \(z\).
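The sequence lengths and dataset sizes follow directly from the figures in the three tables above. The sketch below does the arithmetic, taking the tables to describe BlocktowerCF, BallsCF and CollisionCF in that order (an assumption based on the order in which the scenarios are introduced). Whether the initial frame \(\mathbf{X}_0\) is counted in the frame total is not stated, so the frame counts are approximate.

```python
FPS = 25  # sampling rate shared by all three scenarios

# Figures copied from the tables; the scenario names are assigned by order.
SCENARIOS = {
    "BlocktowerCF": {"seconds": 6, "train": 22564, "val": 6016, "test": 3008},
    "BallsCF": {"seconds": 6, "train": 22723, "val": 6000, "test": 3008},
    "CollisionCF": {"seconds": 3, "train": 13980, "val": 6000, "test": 3000},
}


def summarize(stats):
    """Return (frames per sequence, total number of experiments)."""
    frames = stats["seconds"] * FPS
    total = stats["train"] + stats["val"] + stats["test"]
    return frames, total


for name, stats in SCENARIOS.items():
    frames, total = summarize(stats)
    print(f"{name}: {frames} frames per sequence, {total} experiments")
# BlocktowerCF: 150 frames per sequence, 31588 experiments
# BallsCF: 150 frames per sequence, 31731 experiments
# CollisionCF: 75 frames per sequence, 22980 experiments
```

Since each experiment contains both an observed and a counterfactual sequence, the number of video sequences per scenario is twice the number of experiments.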




Video explanation of the model


References for the baseline methods:
  • PhyDNet: Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Computer Vision and Pattern Recognition (CVPR), 2020.
  • V-CDN: Yunzhu Li, Antonio Torralba, Anima Anandkumar, Dieter Fox, and Animesh Garg. Causal discovery in physical systems from videos. Advances in Neural Information Processing Systems, 33, 2020.

Results

BlocktowerCF
BallsCF
CollisionCF



Download

You can download the Filtered-CoPhy dataset using the link below. Further information is available in the README on the project's GitHub page.

DOWNLOAD


Citation & Contact

If you have any questions about the project, do not hesitate to contact us at steeven.janny@insa-lyon.fr


@inproceedings{janny2022FilteredCoPhy,
    title = "Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space",
    author = {Janny, Steeven  and
              Baradel, Fabien  and
              Neverova, Natalia  and
              Nadri, Madiha  and
              Mori, Greg  and
              Wolf, Christian},
    booktitle = "International Conference on Learning Representations (ICLR)",
    year = "2022"}