The likelihood of an Exponential Random Graph Model (ERGM) is defined as follows:
$$ \text{P}\left(\left.\mathbf{Y}= \mathbf{y}\vphantom{X = x}\right|\vphantom{\mathbf{Y}= \mathbf{y}}X = x\right) = \frac{% \text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}, x\right)\right\} % }{% \sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}', x\right)\right\} % },\quad \forall \mathbf{y}\in\mathcal{Y} $$
Where $\mathbf{y}\in\mathcal{Y}$ is a random graph, $X$ is a vector of attributes, $\theta$ is a column vector of length $k$ (the model parameters), and $g(\cdot)$ is a function that returns a column vector of sufficient statistics, also of length $k$. In general, from a computational point of view, it is easier to manipulate the likelihood function centered at the observed sufficient statistics, that is:
$$ \text{P}\left(\left.\mathbf{Y}= \mathbf{y}\vphantom{X = x}\right|\vphantom{\mathbf{Y}= \mathbf{y}}X = x\right) = \frac{1}{% \sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}\Delta{}g\left(\mathbf{y}',x\right)\right\} % },\quad \forall \mathbf{y}\in\mathcal{Y} $$
Where $\Delta g(\mathbf{y}') = g(\mathbf{y}', x) - g(\mathbf{y}, x)$.
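To see why the two forms are equivalent, divide both the numerator and the denominator of the first expression by $\text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}, x\right)\right\}$:

$$ \frac{\text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}, x\right)\right\}}{\sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}', x\right)\right\}} = \frac{1}{\sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}\left[g\left(\mathbf{y}', x\right) - g\left(\mathbf{y}, x\right)\right]\right\}} = \frac{1}{\sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}\Delta{}g\left(\mathbf{y}', x\right)\right\}} $$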
In the case of ergmito, we usually look at a pooled model with $n$ networks, i.e.
$$ \prod_{i \in N}\text{P}\left(\left.\mathbf{Y}= \mathbf{y}_i\vphantom{X = x_i}\right|\vphantom{\mathbf{Y}= \mathbf{y}_i}X = x_i\right) = \prod_{i \in N}\frac{1}{% \sum_{\mathbf{y}_i'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}_i'\right)\right\} % },\quad \forall \mathbf{y}_i\in\mathcal{Y} $$
Where $N\equiv\{1,\dots,n\}$ is a vector of indices.
In the case of a single network, the model's log-likelihood is given by
$$ \text{log}\left(\text{P}\left(\cdot\right)\right) = -\text{log}\left(\sum_{\mathbf{y}'\in\mathcal{Y}}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\}\right) $$

In general, we can reduce the computational complexity of these calculations by looking at the isomorphic sufficient statistics, that is, by grouping elements based on the set of unique vectors of sufficient statistics:
$$ -\text{log}\left(\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\}\right) $$
Where $\mathcal{S}$ is the support of the sufficient statistics under $\mathcal{Y}$, $\mathbf{s}\in\mathcal{S}$ is one of its realizations, and $\mathbf{w}_\mathbf{s}\equiv\left|\left\{\mathbf{y}\in\mathcal{Y}\,:\,\Delta g(\mathbf{y}) = \mathbf{s}\right\}\right|$ is the number of networks in $\mathcal{Y}$ whose centered sufficient statistics equal $\mathbf{s}$. Furthermore, we can write this in matrix form:
$$ -\text{log}\left(\mathbf{W}\times\text{exp}\left\{\mathbf{S}\times\theta\right\}\right) $$
Where $\mathbf{W}\equiv\{\mathbf{w}_\mathbf{s}\}_{\mathbf{s}\in\mathcal{S}}$ is a row vector of length $w$, and $\mathbf{S}$ is a matrix of size $w\times k$. The log-likelihood of a pooled model is simply the sum of the individual log-likelihoods.
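As a quick illustration of the matrix form, the following base R sketch evaluates $-\text{log}\left(\mathbf{W}\times\text{exp}\left\{\mathbf{S}\times\theta\right\}\right)$. The weights and statistics here are made-up toy numbers, not a real ERGM support:

```r
# Toy example (made-up numbers): W holds the isomorphism weights of the
# unique centered sufficient statistic vectors, S holds the vectors
# themselves (w rows, k columns; here k = 1).
W     <- matrix(c(1, 3, 3, 1), nrow = 1)   # row vector of weights
S     <- cbind(edges = c(-1, 0, 1, 2))     # centered sufficient statistics
theta <- c(edges = 0.5)

# -log(W x exp(S x theta))
ll <- -log(W %*% exp(S %*% theta))
ll  # a 1x1 matrix holding the log-likelihood
```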
The partial derivative of the log-likelihood with respect to the $j$-th parameter equals:
$$ \frac{\delta}{\delta\theta_j}\text{log}\left(\text{P}\left(\cdot\right)\right) = % - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }, \quad\forall j $$
We can also write this using matrix notation as follows:
$$ \nabla\text{log}\left(\text{P}\left(\cdot\right)\right) = -{\mathbf{S}}^{\text{t}}\times\left[\mathbf{W}\circ\text{exp}\left\{\mathbf{S}\times\theta\right\}\right]/\lambda(\theta) $$
Where $\circ$ is the element-wise product and $\lambda(\theta) = \mathbf{W}\times\text{exp}\left\{\mathbf{S}\times\theta\right\}$.
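Continuing with the same kind of toy support (made-up numbers, $k = 1$), the gradient in matrix form can be sketched in base R as:

```r
# Toy example (made-up numbers): gradient of the log-likelihood in
# matrix form, -S^t x [W o exp(S x theta)] / lambda(theta).
W     <- matrix(c(1, 3, 3, 1), nrow = 1)   # isomorphism weights
S     <- cbind(edges = c(-1, 0, 1, 2))     # centered sufficient statistics
theta <- c(edges = 0.5)

lambda <- as.numeric(W %*% exp(S %*% theta))  # normalizing constant
grad   <- -t(S) %*% (as.vector(W) * exp(S %*% theta)) / lambda
grad  # k x 1 gradient vector
```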
In the case of the hessian, each $(j, l)$-th element, $\frac{\delta^2}{\delta\theta_j\delta\theta_l}\text{log}\left(\text{P}\left(\cdot\right)\right)$, can be computed as:
$$ \frac{% -\left(\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}_j\mathbf{s}_l\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}} \mathbf{s}\right\}\right) }{% \lambda(\theta)% } + \frac{% \left(\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}_j\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}} \mathbf{s}\right\}\right) \left(\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}_l\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}} \mathbf{s}\right\}\right) }{% \lambda(\theta)^2% } $$ Where $\mathbf{s}_j$ is the $j$-th element of the vector $\mathbf{s}$. Once again, we can simplify this using matrix notation:
$$ \frac{% -\mathbf{W}\times\left[\mathbf{S}_j\circ\mathbf{S}_l \circ \text{exp}\left\{\mathbf{S}\times\theta\right\}\right]% }{% \lambda(\theta)% } + \frac{% \left(\mathbf{W}\times\left[\mathbf{S}_j \circ \text{exp}\left\{\mathbf{S}\times\theta\right\}\right]\right) \left(\mathbf{W}\times\left[\mathbf{S}_l \circ \text{exp}\left\{\mathbf{S}\times\theta\right\}\right]\right) }{% \lambda(\theta)^2% } $$
Where Sj is the j-th column of the matrix S.
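Likewise, a base R sketch of the $(j, l)$-th hessian component with toy numbers ($k = 1$, so $j = l = 1$; not a real ERGM support):

```r
# Toy example (made-up numbers): single hessian component of the
# log-likelihood, combining the two terms of the expression above.
W     <- matrix(c(1, 3, 3, 1), nrow = 1)   # isomorphism weights
S     <- cbind(edges = c(-1, 0, 1, 2))     # centered sufficient statistics
theta <- c(edges = 0.5)

ewt    <- exp(S %*% theta)                 # exp(S x theta)
lambda <- as.numeric(W %*% ewt)            # normalizing constant
Sj     <- S[, 1]                           # j-th (and l-th) column of S
h <- -as.numeric(W %*% (Sj * Sj * ewt)) / lambda +
  as.numeric(W %*% (Sj * ewt))^2 / lambda^2
h  # negative of the variance of the sufficient statistic
```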
In the case that the MLE does not exist, i.e. $\theta_k\to\pm\infty$, which occurs when the observed sufficient statistics are not in the interior of their support (for example, a fully connected network), the limits of the log-likelihood, the gradient, and the hessian are finite. This is relevant as some special care needs to be taken when dealing with these cases.
While models for which the MLEs diverge may seem uninteresting, the fact that ERGMs describe a discrete rather than continuous random variable makes this type of event easier to observe when compared to other families of distributions. Furthermore, in the case of methods such as bootstrapping, forward/backward selection, or other classes of algorithms that involve some level of automation during the model-fitting process, it is important to know how to correctly deal with the non-existence of MLEs. Fortunately, as we will show, the log-likelihood and its derivatives are well defined in such cases.
The main question is what happens to the argument of the exponential function that appears throughout these expressions. Depending on the value of $\mathbf{s}_k$, two cases follow:
The observed value of the $k$-th sufficient statistic is at its lower bound, $\mathbf{s}_k = \min_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_k'$. This happens, for example, when trying to fit a model with triangles where there are no observed triangles. In this case, since $\mathbf{s}_k' \geq 0$ for all cases, the theoretical MLE for $\theta_k$ goes to $-\infty$; thus, in the limit, $$ \lim_{\theta_k\to-\infty}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}'\right\} = \left\{\begin{array}{ll} % \text{exp}\left\{\sum_{j\neq k}\mathbf{s}_j'\theta_j\right\} &\quad\text{if }\mathbf{s}_k' = 0 \\ 0 & \quad\text{if }\mathbf{s}_k' > 0 \end{array}\right. $$
The observed value of the $k$-th sufficient statistic is at its upper bound, $\mathbf{s}_k = \max_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_k'$. The most common case is when the sufficient statistic is saturated, for example, in a fully connected graph, where the MLE goes to infinity, $\theta_k\to+\infty$. Similar to before, since $\mathbf{s}_k' \leq 0$ for all cases, in the limit the previous expression is well defined: $$ \lim_{\theta_k\to+\infty}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}'\right\} = \left\{\begin{array}{ll} % \text{exp}\left\{\sum_{j\neq k}\mathbf{s}_j'\theta_j\right\} &\quad\text{if }\mathbf{s}_k' = 0 \\ 0 & \quad\text{if }\mathbf{s}_k' < 0 \end{array}\right. $$
The two previous points can be interpreted as follows: only graphs whose $k$-th sufficient statistic matches that of the observed data influence the model. Therefore, the key to computing the limiting values of the log-likelihood, gradient, and hessian is to partition the summations over $\mathbf{s}'\in\mathcal{S}$ into two sets as follows:

$$ \mathcal{S}_0 \equiv \left\{\mathbf{s}'\in\mathcal{S}\,:\,\mathbf{s}_u' = 0 \quad\forall u\in U\right\},\qquad \mathcal{S}_1 \equiv \mathcal{S}\setminus\mathcal{S}_0 $$
Where $U$ is the set of indices of the observed sufficient statistics that are on the boundary of the support, and whose MLEs thus diverge to $\pm\infty$. In this partition, $\mathcal{S}_0$ contains all the realizations of the sufficient statistics for which the statistics in $U$ are equal to the observed ones, and $\mathcal{S}_1$ contains the rest. With this definition in hand, we now show the asymptotic behavior of the log-likelihood, gradient, and hessian when one or more observed sufficient statistics are not in the interior of the support.
Using the partition of $\mathcal{S}$, the log-likelihood can be written as follows:

$$ -\text{log}\left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} + \sum_{\mathbf{s}'\in\mathcal{S}_1}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\}\right) $$
Then, WLOG, as $\theta_u\to-\infty$ (the limit being $+\infty$ if $\mathbf{s}_u$ is at the upper bound), we have:
$$ \begin{align} & \lim_{\theta_u\to-\infty} -\text{log}\left( % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} + % \sum_{\mathbf{s}'\in\mathcal{S}_1}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\} \right) \\ & = -\text{log}\left( \lim_{\theta_u\to-\infty} % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} + % \sum_{\mathbf{s}'\in\mathcal{S}_1}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\} \right) \\ & = -\text{log}\left( % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} \right) \end{align} $$
Where the second equality follows from the fact that the logarithm function is continuous on $(0, +\infty)$. We can see this in the following example:
Suppose that we have five networks of size 4, like the ones included in the ergmito package (the fivenets dataset), and we want to fit the model fivenets ~ edges + triadcensus(15), with triadcensus(15) equal to a fully connected triad, also known as triad class 300 in Davis and Leinhardt's triadic classification system. Since the fivenets dataset has no fully connected triads, the MLE of that term diverges:
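The warning and estimates shown below come from fitting and summarizing the model; a minimal sketch of the call that produces them (assuming the ergmito package is installed) is:

```r
library(ergmito)
data(fivenets)

# Fit the pooled model; the triadcensus(15) (triad class 300) statistic
# is at the boundary of its support, so its MLE diverges.
ans <- ergmito(fivenets ~ edges + triadcensus(15))
summary(ans)
```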
## Warning: The observed statistics (target.statistics) are near or at the
## boundary of its support, i.e. the Maximum Likelihood Estimates maynot exist or
## be hard to be estimated. In particular, the statistic(s) "edges",
## "triadcensus.300".
##
## ERGMito estimates
## note: A subset of the parameters estimates was replaced with +/-Inf, and the Hessian matrix is not p.s.d.
##
## Call:
## ergmito(model = fivenets ~ edges + triadcensus(15))
##
## Coefficients:
## edges triadcensus.300
## -0.6851 -Inf
The log-likelihood of this model can be computed directly with ergmito using a large negative number instead of -Inf, or by using the equation that shows the limiting value of the log-likelihood:
# Approach using the limiting value
# Approach using the limiting value
l <- with(ans$formulae, {
  sapply(1:nnets(ans), function(i) {
    # Preparing suff stats
    S     <- t(t(stats_statmat[[i]]) - target_stats[i, ])
    W     <- stats_weights[[i]]
    theta <- coef(ans)["edges"]
    # Index of members of S0
    s0idx <- which(S[, 2] == 0)
    -log(sum(W[s0idx] * exp(S[s0idx, 1] * theta)))
  })
})
sum(l)
## [1] -38.16338
# Which should be equivalent to the approach with large negative number
ans$formulae$loglik(c(coef(ans)["edges"], -1e100))
## [1] -38.16338
The gradient for $\theta_j$ equals
$$ - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % } = % - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\} + \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\} + % \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % } $$
WLOG, if $\mathbf{s}_u = \min_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_u'$, the limit of the above expression as $\theta_u\to-\infty$ is evaluated as follows:
$$ \begin{align} \lim_{\theta_u\to-\infty} & - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\} + \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\} + % \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % } \\ & = % - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\} }{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\} } \end{align} $$
If $j = u$, the above expression reduces to 0, as $\mathbf{s}_u' = 0$ for all $\mathbf{s}'\in\mathcal{S}_0$; otherwise, the value is well defined. We can alternatively calculate the gradient of the triad 300 model using the above expression:
g <- with(ans$formulae, {
  sapply(1:nnets(ans), function(i) {
    # Preparing suff stats
    S     <- t(t(stats_statmat[[i]]) - target_stats[i, ])
    W     <- stats_weights[[i]]
    theta <- coef(ans)["edges"]
    # Index of members of S0
    s0idx <- which(S[, 2] == 0)
    -sum(W[s0idx] * S[s0idx, 1] * exp(S[s0idx, 1] * theta)) /
      sum(W[s0idx] * exp(S[s0idx, 1] * theta))
  })
})
sum(g)
## [1] 0.002926372
Which should match the result of the full gradient function evaluated with a large negative value in place of -Inf:
## [,1]
## [1,] 0.002926372
## [2,] 0.000000000
Just like in the other cases, we can rewrite the hessian distinguishing between $\mathcal{S}_0$ and $\mathcal{S}_1$. Without rewriting the whole expression (which would simply take too much space), and WLOG, the $(j, l)$-th component of the limiting hessian as $\theta_u\to-\infty$ equals:
$$ \begin{align} & \frac{% -\left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{s}_j'\mathbf{s}_l'\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\}\right) }{% \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\}% } + \\ & \quad \frac{% \left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{s}_j'\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\}\right) \left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{s}_l'\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\}\right) }{% \left(\sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{j\not\in U}\theta_j\mathbf{s}_j\right\}\right)^2% } \end{align} $$
In the case that either $j$ or $l$ equals $u$, the hessian component equals 0. For values other than $u$, the component is generally non-zero. Again, using the example with the triad 300 term, we can compute the hessian as follows:
h <- with(ans$formulae, {
  sapply(1:nnets(ans), function(i) {
    # Preparing suff stats
    S     <- t(t(stats_statmat[[i]]) - target_stats[i, ])
    W     <- stats_weights[[i]]
    theta <- coef(ans)["edges"]
    # Index of members of S0
    s0idx <- which(S[, 2] == 0)
    # First part
    -sum(W[s0idx] * S[s0idx, 1]^2 * exp(S[s0idx, 1] * theta)) /
      sum(W[s0idx] * exp(S[s0idx, 1] * theta)) +
      # Second bit
      sum(W[s0idx] * S[s0idx, 1] * exp(S[s0idx, 1] * theta))^2 /
        sum(W[s0idx] * exp(S[s0idx, 1] * theta))^2
  })
})
sum(h)
## [1] -12.97199
Which should be equal to evaluating the full hessian function with a very large negative value for the parameter associated with the triad 300 statistic:
## [,1] [,2]
## [1,] -12.97199 0
## [2,] 0.00000 0
This last result is useful for applying the Moore-Penrose generalized inverse to obtain a pseudo-covariance matrix, which in some cases can be better than nothing. Furthermore, the limiting expressions of the log-likelihood, gradient, and hessian have fewer terms to consider, which in principle results in faster calculations.
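For instance, a pseudo-covariance matrix can be sketched from the limiting hessian printed above via MASS::ginv (this assumes the recommended MASS package is available; the hessian values are those shown earlier):

```r
# Limiting hessian from the example (the triad-300 row/column is zero,
# so the matrix is singular and a regular inverse does not exist).
H <- matrix(c(-12.97199, 0, 0, 0), nrow = 2)

# Moore-Penrose generalized inverse of -H as a pseudo-covariance matrix.
vcov_pseudo <- MASS::ginv(-H)
vcov_pseudo
```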