LiM
Model
The Linear Mixed (LiM) causal discovery algorithm [1] extends LiNGAM to handle mixed data consisting of both continuous and discrete variables. Estimation proceeds in two phases: first, the log-likelihood of the joint distribution of the data is optimized globally under an acyclicity constraint; then a local combinatorial search is applied to output the final causal graph.
This method is based on the LiM model as shown below,
i) The value of each continuous variable \(x_i\) is a linear function of its parent variables \(x_{\mathrm{pa}(i)}\) plus a non-Gaussian error term \(e_i\), that is,
\[ x_i = \sum_{j \in \mathrm{pa}(i)} b_{ij} x_j + e_i + c_i, \]
where the error terms \(e_i\) are continuous random variables with non-Gaussian densities and are independent of each other, and the coefficients \(b_{ij}\) and intercepts \(c_i\) are constants.
ii) The value of each discrete (binary) variable \(x_i\) equals 1 if the linear function of its parent variables \(x_{\mathrm{pa}(i)}\) plus a Logistic error term \(e_i\) is larger than 0, and equals 0 otherwise. That is,
\[ x_i = \begin{cases} 1, & \text{if } \sum_{j \in \mathrm{pa}(i)} b_{ij} x_j + e_i + c_i > 0, \\ 0, & \text{otherwise,} \end{cases} \]
where the error terms \(e_i\) follow the Logistic distribution, and the other notation is the same as in the continuous case.
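As a concrete illustration, below is a minimal sketch that generates data from a two-variable LiM model by hand; the coefficient value, intercepts, and the uniform error density are arbitrary choices for illustration, not values used elsewhere in this example.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Arbitrary illustrative parameters for a single edge x1 -> x2
b21, c1, c2 = 1.3, 0.0, 0.0

# i) Continuous root variable: non-Gaussian (here uniform) error term
e1 = rng.uniform(-1.0, 1.0, size=n)
x1 = e1 + c1

# ii) Binary child: equals 1 when the linear function of its parent
#     plus a Logistic error term is larger than 0, and 0 otherwise
e2 = rng.logistic(loc=0.0, scale=1.0, size=n)
x2 = (b21 * x1 + e2 + c2 > 0).astype(float)

X_demo = np.column_stack([x1, x2])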
This method makes the following assumptions.
Continuous variables and binary variables.
Linearity
Acyclicity
No hidden common causes
For every pair of binary variables, the baselines are the same when predicting one from the other.
References
[1] Y. Zeng, S. Shimizu, H. Matsui, and F. Sun. Causal discovery for linear mixed data. In Proceedings of the First Conference on Causal Learning and Reasoning (CLeaR 2022), PMLR 177, 2022.
Import and settings
In this example, we need to import numpy and random, in addition to lingam.
import numpy as np
import random
import lingam
import lingam.utils as ut
print([np.__version__, lingam.__version__])
['1.20.3', '1.6.0']
Test data
First, we generate a causal structure with two variables, one of which is randomly chosen to be discrete.
ut.set_random_seed(1)
n_samples, n_features, n_edges, graph_type, sem_type = 1000, 2, 1, 'ER', 'mixed_random_i_dis'
B_true = ut.simulate_dag(n_features, n_edges, graph_type)
W_true = ut.simulate_parameter(B_true) # row to column
no_dis = np.random.randint(1, n_features) # number of discrete vars.
print('There are %d discrete variable(s).' % (no_dis))
nodes = [iii for iii in range(n_features)]
dis_var = random.sample(nodes, no_dis) # randomly select no_dis discrete variables
dis_con = np.full((1, n_features), np.inf)
for iii in range(n_features):
    if iii in dis_var:
        dis_con[0, iii] = 0  # 1: continuous; 0: discrete
    else:
        dis_con[0, iii] = 1
X = ut.simulate_linear_mixed_sem(W_true, n_samples, sem_type, dis_con)
print('The true adjacency matrix is:\n', W_true)
There are 1 discrete variable(s).
The true adjacency matrix is:
[[0. 0. ]
[1.3082251 0. ]]
Causal Discovery for linear mixed data
To run causal discovery, we create a LiM object and call the fit method.
model = lingam.LiM()
model.fit(X, dis_con, only_global=True)
<lingam.lim.LiM at 0x174d475f850>
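Note that only_global=True above runs only the global optimization phase. As a sketch, assuming the fit method also accepts only_global=False to additionally apply the local combinatorial search described in the Model section (check the signature in your installed lingam version):
# Run both estimation phases: global optimization plus the local
# combinatorial search (assumes only_global=False enables the latter).
model_full = lingam.LiM()
model_full.fit(X, dis_con, only_global=False)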
Using the _adjacency_matrix property, we can see the estimated adjacency matrix between the mixed variables.
print('The estimated adjacency matrix is:\n', model._adjacency_matrix)
The estimated adjacency matrix is:
[[ 0. , 0. ],
[-1.09938457, 0. ]]
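As a quick sanity check, we can compare the support of the estimated adjacency matrix with the true DAG; the 0.1 pruning threshold below is an arbitrary illustrative choice.
# Threshold small coefficients and compare the estimated skeleton
# with the support of the true graph (0.1 is an arbitrary cutoff).
est_skeleton = np.abs(model._adjacency_matrix) > 0.1
print('Skeleton recovered:', np.array_equal(est_skeleton, B_true != 0))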