Causal Inference Q-network
Causal Inference Q-Network: Toward Resilient Reinforcement Learning
This paper learns invariant representations of perturbed observations.
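As a toy illustration of the invariant-representation idea (not the paper's actual CIQ architecture; the layer sizes, the consistency loss, and all names below are my own assumptions), one can share an encoder between clean and perturbed observations and penalize the distance between their latents:

```python
import torch
import torch.nn as nn

class InvariantQNet(nn.Module):
    # Shared encoder for clean and perturbed observations, plus a Q-head.
    def __init__(self, obs_dim=16, n_actions=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        return self.q_head(self.encoder(obs))

    def loss(self, obs_clean, obs_perturbed, actions, td_target):
        h_c = self.encoder(obs_clean)
        h_p = self.encoder(obs_perturbed)
        # Standard TD regression on the clean branch ...
        q = self.q_head(h_c).gather(1, actions[:, None]).squeeze(1)
        td = ((q - td_target) ** 2).mean()
        # ... plus a consistency term pushing the two representations
        # together, so the latent becomes invariant to the perturbation.
        invariance = ((h_c - h_p) ** 2).mean()
        return td + invariance
```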
Instrumental variables for offline RL
Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning
This paper takes an observable $z_t$ as an instrumental variable that affects the action $a_t$ jointly with the state $x_t$. The difference between $z_t$ and $x_t$ is that $z_t$ affects $x_{t+1}$ only through $a_t$. For example, $a_t$ is the treatment, $x_t$ is the current health status, and $z_t$ is the physician's preference for treatments.
But what is the advantage of introducing such a structure? Because $z_t$ influences $x_{t+1}$ only through $a_t$, it can serve as an instrument that removes the bias unobserved confounders introduce between the action and the next state; the paper calculates $\hat{x}'=f(x',a\mid z)$.
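To see why the exclusion restriction helps, here is a minimal two-stage least squares sketch in a linear toy model (the data-generating process and all variable names are my own illustration, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline data with an unobserved confounder e that affects
# both the action a and the next state x'.
n = 10_000
z = rng.normal(size=n)            # instrument: physician's preference
e = rng.normal(size=n)            # unobserved confounder
x = rng.normal(size=n)            # current state: health status
a = 0.8 * z + 0.5 * x + e         # action depends on z, x and the confounder
x_next = 1.5 * a + 0.7 * x + e    # true effect of a on x' is 1.5

# Naive regression of x' on (a, x) is biased because e correlates with a.
A = np.column_stack([a, x, np.ones(n)])
naive = np.linalg.lstsq(A, x_next, rcond=None)[0]

# Two-stage least squares: predict a from (z, x), then regress x' on the
# prediction; z is excluded from the second stage because it affects x'
# only through a.
Z = np.column_stack([z, x, np.ones(n)])
a_hat = Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
A2 = np.column_stack([a_hat, x, np.ones(n)])
iv = np.linalg.lstsq(A2, x_next, rcond=None)[0]

print("naive effect of a:", naive[0])  # biased upward, away from 1.5
print("2SLS  effect of a:", iv[0])     # close to 1.5
```

The first stage keeps only the part of $a$ explained by $(z,x)$, which is independent of the confounder, so the second-stage coefficient recovers the causal effect.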
RL with confounders
Deconfounding Reinforcement Learning in Observational Settings
This paper considers an unobserved factor $u$ (a confounder, in causal terms) that affects observations, actions, and rewards. It uses a VAE to build an inference model that predicts $x_{t+1}$ from $(x_t,a_t)$ and $z_t$.
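A minimal sketch of such a VAE-style transition model (dimensions, layer sizes, and the Gaussian likelihood are my assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class DeconfoundedTransition(nn.Module):
    # Encoder infers a latent z_t from the observed transition; the decoder
    # predicts x_{t+1} from (x_t, a_t, z_t). Trained by maximizing the ELBO.
    def __init__(self, x_dim=8, a_dim=2, z_dim=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * x_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),   # mean and log-variance of q(z|.)
        )
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + a_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, a, x_next):
        mu, logvar = self.encoder(torch.cat([x, a, x_next], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        x_pred = self.decoder(torch.cat([x, a, z], -1))
        recon = ((x_pred - x_next) ** 2).sum(-1)        # Gaussian NLL up to const.
        kl = 0.5 * (mu**2 + logvar.exp() - 1 - logvar).sum(-1)
        return (recon + kl).mean()                      # negative ELBO
```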
I am puzzled about how and why $u$ can be a time-independent confounder.
In the experiments, it provides a confounded benchmark in which the action space is partitioned and $u$ decides which partition is available. It assumes that $u$ influences the reward function and the observations, yet estimates the reward and action without $u$. So I am still confused about the role of $u$.
Markov Decision Processes with Unobserved Confounders: A Causal Approach
The main difference between an MDP and an MDPUC is that the value function is conditioned on the actually-taken action $x'_t$ when evaluating $v(s_t)$ and $q(s_t,x_t)$, where $x_t$ is the potential (counterfactual) action.
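In symbols (my notation; a hedged reading of the paper's counterfactual value functions, not a verbatim restatement):

$$Q(s_t,x_t\mid x'_t)=\mathbb{E}\big[Y_{x_t}\mid s_t,\,X_t=x'_t\big],\qquad V(s_t\mid x'_t)=\max_{x_t}Q(s_t,x_t\mid x'_t),$$

where $Y_{x_t}$ denotes the return had $x_t$ been enforced, and the conditioning event $X_t=x'_t$ records the action the confounded behavior policy would naturally take.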
Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
This paper introduces unobserved variables into the POMDP via $s_{t+1}=f(s_t,a_t,u_t)$.
Counterfactual Inference (CFI) then works as follows: it uses the data to obtain a posterior distribution over $u$ and re-computes the target value under the estimate $\hat{u}$.
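A minimal numerical sketch of this two-step procedure, with a discrete noise variable and a toy transition of my own choosing (not the paper's model):

```python
import numpy as np

U = np.array([0, 1])                  # possible values of the noise u
prior = np.array([0.7, 0.3])          # p(u)

def f(s, a, u):
    return s + a + u                  # toy deterministic transition

def lik(s_next, s, a, u, eps=1e-3):
    return 1.0 if f(s, a, u) == s_next else eps

# Step 1: posterior over u from the observed transition (s, a, s').
s, a, s_next = 2, 1, 4                # observation implies u = 1
post = np.array([lik(s_next, s, a, u) * p for u, p in zip(U, prior)])
post /= post.sum()

# Step 2: re-compute the target under a counterfactual action a',
# holding the inferred noise fixed (here, averaging over its posterior).
a_cf = 0
s_cf = sum(p * f(s, a_cf, u) for u, p in zip(U, post))
print("posterior p(u):", post)        # ~ [0, 1]
print("counterfactual s':", s_cf)     # ~ 3 rather than the prior mean 2.3
```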
Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models (ICML 2019)
This paper introduces a class of SCMs for generating counterfactual trajectories in POMDPs.
I cannot understand the example in Section 3.1.
It defines Counterfactual Stability: in a categorical SCM, given that outcome $i$ was observed under intervention $I$, if a counterfactual intervention $I'$ raises (or preserves) the probability of $i$ relative to every other outcome, i.e. $p_{I'}(i)/p_{I}(i)\ge p_{I'}(j)/p_{I}(j)$ for all $j\ne i$, then the counterfactual outcome under $I'$ is again $i$.
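The Gumbel-max SCM makes this concrete: an outcome is $y=\arg\max_j(\log p_j+g_j)$ with i.i.d. Gumbel noise $g_j$, and counterfactuals re-sample the noise from its posterior given the factual outcome. A minimal sketch (variable names are mine; the truncated-Gumbel step is the standard top-down construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel(loc=0.0):
    return loc - np.log(-np.log(rng.uniform()))

def posterior_noise(logp, observed):
    # Sample g such that argmax(logp + g) == observed, i.e. the posterior
    # over the exogenous noise given the factual outcome. For normalized p,
    # the winning value is Gumbel(0); the losers are Gumbel(logp_j)
    # truncated to lie below the winner.
    top = gumbel(0.0)
    vals = np.empty(len(logp))
    for j, lp in enumerate(logp):
        if j == observed:
            vals[j] = top
        else:
            vals[j] = -np.log(np.exp(-top) + np.exp(-gumbel(lp)))
    return vals - logp               # noise g_j, so that logp_j + g_j = vals_j

def counterfactual(logp_factual, observed, logp_cf):
    g = posterior_noise(logp_factual, observed)
    return int(np.argmax(logp_cf + g))

p, p_cf = np.array([0.6, 0.3, 0.1]), np.array([0.5, 0.4, 0.1])
print(counterfactual(np.log(p), observed=0, logp_cf=np.log(p_cf)))
```

By the paper's theorem, this mechanism satisfies the stability property above, which is what makes it a reasonable noise posterior for generating counterfactual trajectories.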
Off-Policy Evaluation in Partially Observable Environments
This paper defines the Decoupled POMDP, in which the state decomposes into $(u,z)$, the unobserved and observed components respectively.