2 Previous Work

The research described in this thesis draws on a broad body of knowledge. This literature review begins with a section on the context for this population synthesis effort, the Integrated Land Use Transportation, Environment (ILUTE) model.

The following section describes the mathematics and algorithms used in population synthesis, starting with a discussion of notation for contingency tables. The properties and history of the Iterative Proportional Fitting (IPF) procedure are reviewed, and some generalizations of the method are discussed in the following section. The discussion then shifts to log-linear modelling for contingency tables, and then looks briefly at the literature connecting IPF and log-linear modelling. The final section reviews the literature dealing with zeros in contingency tables.

The review then shifts to the methods used for population synthesis. Two broad classes of method are included: those using IPF and those using the Combinatorial Optimization method.

1 The ILUTE Model

Figure 2.1: The idealized integrated urban modelling system envisioned by Miller, Kriger & Hunt [34].
Image figure_ilute_core

The ILUTE research program aims to develop next generation models of urban land use, travel and environmental impacts. The project's ultimate goal is the “ideal model” described by Miller et al. [34]. As shown in Figure 2.1, the behavioural core of an ideal model would include land development, residential and business location decision, activity/travel patterns and automobile ownership. The boxes on the left show the main influences on the urban system: demographic shifts in the population, the regional economy, government policy and the transport system itself. Some of these may be exogenous inputs to the model, but Miller et al. suggest that both demographics and regional economics need to be at least partially endogenous.

The ILUTE model is intended to operate in concert with an activity-based travel demand model. The Travel/Activity Scheduler for Household Agents (TASHA) is an activity-based model designed on disaggregate principles similar to ILUTE, and connects personal decision making with household-level resources and activities to form travel chains and tours [35,36].

The operational prototype of the ILUTE system was described in detail by Salvini & Miller [40,39]. To validate the model, it is intended to be run using historical data, allowing comparison against the known behaviour of the urban system over recent years. The baseline year of 1986 was ultimately chosen as a starting point, since the Transportation Tomorrow Survey of travel behaviour in the Greater Toronto Area was first conducted in that year.

The prototype defines a wide range of agents and objects: persons, households, dwellings, buildings, business establishments and vehicles. It also defines various relationships between these agents and objects: in particular, family relationships between persons in households, occupancy relationships between households and their dwellings, ownership of dwellings/vehicles by households or persons, containment of dwellings within buildings, and employment of persons by business establishments.

These represent the full spectrum of possible agents and relationships that need to be synthesized as inputs to the ILUTE model. In earlier work within the ILUTE framework, Guan synthesized persons, families and households and a set of relationships between them [25]. In this thesis, the same agents and relationships are considered (in addition to dwelling units), with the goal of improving the method and quality of the synthetic populations.

The remaining agents and relationships are also important to the ILUTE model, but the focus here is on the demographic and dwelling attributes since these are central to both the ILUTE and TASHA models, and because rich data from the Canadian census is available to support the synthesis. In this research, families are proposed as a new class of agent for the ILUTE modelling framework. While the family is a central theme in both the ILUTE and TASHA models, it was only modelled distinct from the household in Guan's work. Furthermore, the original ILUTE prototype did not allow for multifamily households.

2 Mathematics for Fitting Contingency Tables

Almost all population synthesis procedures rely on data stored in multiway contingency tables. To help understand and explain this type of data, a consistent notation is first defined, and then the mathematical properties of contingency tables and the Iterative Proportional Fitting procedure are described.

1 Notation

Throughout this document, scalar values and single cells in contingency tables will be represented using a regular weight typeface (e.g., $ n$ or $ n_{ijk}$). Multiway contingency tables and their margins will be represented with boldface (e.g., $ \mathbf{n}$ or $ \mathbf{n}_{ijk}$) to indicate that they contain more than one cell. Contingency tables may be one-way, two-way or multiway; the number of subscripts indicates the dimension of the table (e.g., $ \mathbf{n}_{ijk}$).

Suppose three variables $ X$, $ Y$ and $ Z$ vary simultaneously, and are classified into $ I$, $ J$ and $ K$ categories respectively. The variables may be either inherently discrete or continuous, but continuous variables are grouped into a finite set of discrete categories. The variable $ i$ denotes a category of $ X$, and the categories are labelled as $ \{1, 2, \ldots, I\}$, and likewise for $ Y$ and $ Z$. (For example, suppose that these variables represent the attributes of a person, such as age, education and location.) Then, there is a probability $ \pi_{ijk}$ that a random observation will be classified in category $ i$ of the first variable, category $ j$ of the second variable and category $ k$ of the third variable. There are $ C=I\times J \times K$ cells in the table, each of which consists of a count $ n_{ijk}$ of the number of observations with the appropriate categories. Since the table consists of counts, the cells are Poisson distributed; these counts are observations of the underlying multinomial probability mass function $ \pi$$ _{ijk}$. The contingency table has a direct relationship to the list of observations; Figure 2.2 shows an example where a list of observations of three variables is used to form a two-variable contingency table.

Figure 2.2: The link between list-based and contingency table representations of multivariate categorical data. Left: a list of observations, where each row represents a single observation. Variables $ X$, $ Y$ and $ Z$ are observed to fall into different categories. Right: a cross-tabulation of the observations using only variables $ X$ and $ Y$. Each cell $ n_{ij}$ in the table is a count of observations with a given value $ X=i$ and $ Y=j$. It corresponds to a specific set of $ n_{ij}$ observations from the list-based representation.
Image figure_tablelist

Any contingency table can be collapsed to a lower-dimensional table by summing along one or more dimensions; a collapsed table is called a marginal table or margin. The notation $ \mathbf{n}_{i++}$ is used for the margin where the second and third variables are collapsed, leaving only the breakdown of the sample into the $ I$ categories of variable $ X$. The + symbols in the notation indicate that the margin is derived by summing $ \mathbf{n}_{ijk}$ over all categories $ j$ and $ k$. The total size of the tabulated sample is given by $ n_{+++}$, or more typically by $ n$ alone.

In this paper, multiple contingency tables are often considered simultaneously. In a typical application of the Iterative Proportional Fitting (IPF) procedure, a “source” population is sampled and cross-classified to form a multiway table $ \mathbf{n}_{ij}$. A similarly structured multiway table $ \mathbf{N}_{ij}$ is desired for some target population, but less information is available about the target: typically, some marginal totals $ \mathbf{N}_{i+}$ and $ \mathbf{N}_{+j}$ are known. (Depending upon the application, the target and source populations may be distinct or identical; in a common example, the populations are identical but the source sample is small (1-5%) while the target margins may be a complete 100% sample of the population.) The complete multiway table $ \mathbf{N}_{ij}$ of the target population is never known, but the IPF procedure is used to find an estimate $ \mathbf{\hat{N}}_{ij}$. This is achieved through repeated modifications of the table $ n_{ij}$. The entire process and associated notation are shown in Figure 2.3 and Table 2.1.

Table 2.1: Summary of the notation used for multiway tables and IPF. IPF is used to estimate a multiway contingency table for an unknown target population, by modifying a table of a source sample to match known margins of the target population. The notation shown is for three variables $ X,Y,Z$, but more can be used.
Symbol Description
$ C$ The total number of cells in the contingency table, $ C=I\times J \times K$.
$ I,J,K$ The number of categories for variables $ X$, $ Y$ and $ Z$ respectively, the three dimensions of the multiway tables.
$ \mathbf{n}_{ijk}$ or $ \mathbf{n}$ A multiway contingency table of the source sample, of size $ I \times J
\times K$.
$ n$ The size of the source sample. i.e., $ \sum_{i,j,k} n_{ijk}$
$ n_{ijk}$ A single cell of $ \mathbf{n}$, containing the count of observations in the source sample where variable $ X$ was in category $ i$, $ Y$ was in category $ j$ and $ Z$ was in category $ k$.
$ \mathbf{N}_{ijk}$ or $ \mathbf{N}$ A multiway contingency table of the target population, of the same size as $ \mathbf{n}$; never observed.
$ N$ The size of the target population.
$ N_{ijk}$ A single cell in target table $ \mathbf{N}$; never observed.
$ \mathbf{N}_{i++}$ A one-way table containing a margin of $ \mathbf{N}_{ijk}$ showing the total observations for each category $ i$ of variable $ X$. While the full table $ \mathbf{N}_{ijk}$ of the target population is never observed, some margins are known.
$ N_{i++}$ A single entry in $ \mathbf{N}_{i++}$. The + symbols indicate a sum over all categories in that dimension; that is, $ N_{i++} =
$ \mathbf{N}_{ij+}$ The two-way table containing a margin of $ \mathbf{N}$, showing the total observations for each category $ i$ of variable $ X$ and each category $ j$ of variable $ Y$.
$ \mathbf{\hat{N}}_{ijk}$ or $ \mathbf{\hat{N}}$ The IPF estimate of the target multiway table $ \mathbf{N}$, using the initial association pattern in source table $ \mathbf{n}$ and adjusting it to exactly fit a selected set of margins $ \mathbf{N}_{i++}$ (etc.)
$ \hat{N}_{ijk}$ A single cell in the IPF estimate $ \mathbf{\hat{N}}$.
$ \pi$$ _{ijk}$ (or $ \pi_{ijk}$) Table (or cell) of probabilities instead of counts, $ E[n_{ijk}] = n \pi_{ijk}$
$ \Pi$$ _{ijk}$ (or $ \Pi_{ijk}$) Table (or cell) of probabilities instead of counts, $ E[N_{ijk}] = N\Pi_{ijk}$
$ X$ or $ X(i)$ A variable split into $ I$ categories, making up the first dimension of each multiway contingency table.
$ Y$ or $ Y(j)$ A variable split into $ J$ categories.
$ Z$ or $ Z(k)$ A variable split into $ K$ categories. In most cases here, $ Z$ will specifically refer to geographic zones.

Figure 2.3: An illustration of the Iterative Proportional Fitting procedure with two variables $ X$ and $ Y$. The source table $ \mathbf{n}_{ij}$ is modified to match the known target marginals $ \mathbf{N}_{i+}$ and $ \mathbf{N}_{+j}$, producing a fitted table $ \mathbf{\hat{N}}_{ij}$ that approximates the unknown target table $ \mathbf{N}_{ij}$.
Image figure_ipf

Note that the source table $ n_{ij}$ and target margins $ N_{i+}$ are usually integer counts, but the estimated target table $ \hat{N}_{ij}$ produced by Iterative Proportional Fitting is real-valued.

2 History and Properties of Iterative Proportional Fitting

The Iterative Proportional Fitting (IPF) algorithm is generally attributed to Deming & Stephan [16]. (According to [14], it was preceded by a 1937 German publication applying the method to the telephone industry.) The method goes by many names, depending on the field and the context. Statisticians apply it to contingency tables and use the terms table standardization or raking. Transportation researchers use it for trip distribution and gravity models, and sometimes reference early papers in that field by Fratar or Furness [15,23]. Economists apply it to Input-Output models and call the method RAS [32].

Figure 2.4: A simple example algorithm for the Iterative Proportional Fitting procedure using a two-way table and one-way target margins.

Input: Source table $ \mathbf{n}_{ij}$, target margins $ \mathbf{N}_{i+}$ and $ \mathbf{N}_{+j}$, and tolerance $ \epsilon$

Output: Fitted table $ \mathbf{\hat{N}}_{ij}$

% latex2html id marker 693
[tb]\KwIn{Source table $\mathbf{n... procedure using a two-way table and one-way target margins.}

The IPF algorithm is a method for adjusting a source contingency table to match known marginal totals for some target population. Figure 2.4 shows a simple application of IPF in two dimensions. The table $ \mathbf{\hat{N}}^{(\tau)}$ computed in iteration $ \tau+1$ fits the row totals exactly, with some error in the column totals. In iteration $ \tau+2$, an exact fit to the column margins is achieved, but with some loss of fit to the row totals. Successive iteration yields a fit to both target margins within some tolerance $ \epsilon$. The procedure extends in a straightforward manner to higher dimensions, and also with higher-dimensional margins.

Deming and Stephan [16] initially proposed the method to account for variations in sampling accuracy. They imagined that the source table and target marginals were measured on the same population, and that the marginal totals were known exactly, but the source table had been measured through a sampling process with some inaccuracy. The IPF method would then adjust the sample-derived cells to match the more accurate marginal totals. They framed this as a fairly general problem with a $ d$-way contingency table, and considered both one-way margins and higher-order margins (up to $ d-1$ ways). They did not consider the effect of zero values in either the initial cell values or the margins.

Deming and Stephan claimed that the IPF algorithm produces a unique solution that meets two criteria. It exactly satisfies the marginal constraints

$\displaystyle \sum_j \hat{N}_{ij} = N_{i+},\quad \sum_i \hat{N}_{ij} = N_{+j}$ (1)

and they believed that it minimized the weighted least-squares criterion

$\displaystyle \sum_i{\sum_j{\frac{(n_{ij}/n - \hat{N}_{ij}/N)^2}{n_{ij}} }}$ (2)

In a later paper, Stephan realized that IPF only approximately minimized that criterion [50]. He proposed a different algorithm that minimized the least-squares criterion. However, Ireland and Kullback [30] returned to the original IPF algorithm and found that it had interesting properties. They showed that the $ \hat{N}_{ij}$ estimated by the IPF method minimizes the discrimination information criterion (also known as the Kullback-Leibler divergence, or relative entropy) [33,14]. This is conventionally defined in terms of probabilities $ \pi_{ij}$ and $ \hat{\Pi}_{ij}$,

$\displaystyle I(\mathbf{\hat{\Pi}} \Vert$$\displaystyle \mbox{\boldmath$\pi$}$$\displaystyle )$ $\displaystyle = \sum_i{\sum_j{\hat{\Pi}_{ij}\log(\hat{\Pi}_{ij}/\pi_{ij}) }}$ (3)

For the sake of discussion, it can be translated to counts by substituting $ \hat{N}_{ij}=N \hat{\Pi}_{ij}$ and $ n_{ij}=n \pi_{ij}$

$\displaystyle I(\mathbf{\hat{N}} \Vert {\bf n})$ $\displaystyle = I(\mathbf{\hat{\Pi}} \Vert$$\displaystyle \mbox{\boldmath$\pi$}$$\displaystyle )$    
  $\displaystyle = \log{(n/N)} + \frac{1}{N} \sum_i{\sum_j{\hat{N}_{ij}\log(\hat{N}_{ij}/n_{ij}) }}$ (4)
  $\displaystyle = \frac{1}{N}\left(-N\log{(N/n)} + \sum_i{\sum_j{\hat{N}_{ij}\log(\hat{N}_{ij}/n_{ij}) }}\right)$ (5)

For constant target population size $ N$, this is equivalent to minimizing

$\displaystyle \sum_i{\sum_j{\hat{N}_{ij}\log(\hat{N}_{ij}/n_{ij}) }}$ (6)

Note that discrimination information is not symmetric, since $ I(\mathbf{\hat{N}} \Vert \mathbf{n}) \neq I(\mathbf{n} \Vert \mathbf{\hat{N}})$ in general.

Ireland and Kullback included a proof of convergence. It omitted one step, and was corrected by Csiszár in a 1975 paper [13]. Csiszár's treatment was somewhat more general than previous papers. In particular, he adopted a convention for the treatment of zero values in the initial cells:

$\displaystyle \log 0 = -\infty,\quad \log \frac{a}{0}=+\infty,\quad 0 \cdot \pm \infty=0$ (7)

After adopting this convention, he proved convergence with allowance for zeros.

The IPF method is one of several ways of satisfying the marginal constraints (2.1) while minimizing entropy relative to the source table. Alternative algorithms exist for solving this system of equations, including Newton's method. Newton's method offers the advantage of a faster (quadratic) convergence rate, and is also able to estimate the parameters and variance-covariance matrix associated with the system (to be discussed in the following section). However, Newton's method is considerably less efficient in computational storage and is impractical for the large systems of equations that occur in high-dimensional problems. Using the asymptotic Landau $ \mathcal{O}()$ notation conventional in computer science [11], the IPF method requires $ \mathcal{O}(C)$ memory to fit a contingency table with $ C$ cells, while Newton's method requires $ \mathcal{O}(C^2)$ storage [1, chapter 8], [20].

Additionally, the minimum discrimination information of equation (2.6) is not the only possible optimization criterion for comparing the fitted table to the source table. Little & Wu [33] looked at a special case where the source sample and the target margins are drawn from different populations. In their analysis, they compared the performance of four different criteria: minimum discrimination information, minimum least squares, maximum log likelihood and minimum chi-squared. For certain problems, other optimization criteria may offer some advantages over minimum discrimination information.

In summary, the Iterative Proportional Fitting method is a data fusion technique for combining the information from a source multiway contingency table and lower-dimensional marginal tables for a target population. It provides an exact fit to the marginal tables, while minimizing discrimination information relative to the source table.

3 Generalizations of the IPF Method

Following the basic understanding of the IPF method in the late 1960s and early 1970s, the method received further attention in the statistical and information theory community. As discussed earlier, Csiszár's 1975 paper [13] was in part a correction and generalization of Ireland & Kullback's work. However, it also introduced a different conception of the underlying problem. Csiszár did not represent the multiway probability distribution as a $ d$-way contingency table with $ C$ cells. Instead, he conceived of $ I$-space, a much larger $ C$-dimensional space containing all possible probability distributions for the $ C$ cells in the contingency table. He exploited the geometric properties of this space to prove convergence of the algorithm.

The mechanics of his proof and construction of $ I$-space would be a theoretical footnote, except that further extensions and generalizations of the IPF method have been made using the $ I$-space notation and conceptualization. In $ I$-space, the marginal constraints form closed linear sets. The IPF algorithm is described as a series of $ I$-projections onto these linear constraints.

From a cursory reading of a 1985 paper by Dykstra [17], it appears that he generalized Csiszár's theory and proved convergence of the IPF method for a broader class of constraints: namely any closed, convex set in $ I$-space. Dykstra used a complicated example where the cells are not constrained to equal some marginal constraint, but the tables' marginal total vectors were required to satisfy an ordering constraint—for example, requiring that $ n_{i2} < n_{i3}$. Dykstra's iterative procedure was broadly the same as the IPF procedure: an iterative projection onto the individual convex constraints. In other words, small extensions to the IPF method can allow it to satisfy a broader class of constraints beyond simple equality.

Further generalizations of the IPF method are discussed by Fienberg & Meyer [20].

4 Log-Linear Models

Log-linear models provide a means of statistically analyzing patterns of association in a single contingency table. They are commonly used to test for relationships between different variables when data is available in the form of a simple, low-dimensional contingency table. The method itself derives from work on categorical data and contingency tables in the 1960s that culminated in a series of papers by Goodman in the early 1970s [19]. The theory behind log-linear models is well-established and is described in detail elsewhere [54,1,37]. In this section, a few examples are used to provide some simple intuition for their application. The notation used here follows [54], but is fairly similar to most sources.

Consider a single two-dimensional contingency table $ \mathbf{n}_{ij}$ containing observations of variables $ X$ and $ Y$. The general form of a log-linear model for such a table is

$\displaystyle \log{n_{ij}}$ $\displaystyle =$ constant$\displaystyle +$ row term$\displaystyle +$ column term$\displaystyle +$ association terms (8)

The log-linear name comes from this form: the logarithm of the individual cells' counts is the dependent variable (left hand side), and this variable is modelled as a linear sum of the parameters (right hand side). A concrete example is the model of independence,

$\displaystyle \log{n_{ij}}$ $\displaystyle = \lambda + \lambda_{X(i)} + \lambda_{Y(j)}$ (9)

The subscripts here make clear the idea of a “row term”: for a given row $ i$ of table $ \mathbf{n}$, each of the cells in that row of $ \mathbf{n}$ share a single parameter $ \lambda_{X(i)}$; a similar effect can be seen for the columns. This is a model of independence, in that it presumes that the counts can be explained without including any association between variables $ X$ and $ Y$. The alternative model including association is

$\displaystyle \log{n_{ij}}$ $\displaystyle = \lambda + \lambda_{X(i)} + \lambda_{Y(j)} + \lambda_{\mathit{XY}(ij)}$ (10)

This log-linear model can be used to test for statistically significant association between $ X$ and $ Y$ in the observations in a given table $ \mathbf{n}_{ij}$. The null hypothesis $ H_0$ is that the variables are independent. The hypotheses are defined as:

$\displaystyle H_0: \lambda_{\mathit{XY}(ij)} = 0 \quad$ for all $\displaystyle i,j$    
$\displaystyle H_1: \lambda_{\mathit{XY}(ij)} \neq 0 \quad$ for some $\displaystyle i,j$ (11)

If none of the association parameters are statistically different from zero, then the null hypothesis cannot be rejected. If one or more association parameters are statistically different from zero, then this supports the alternative hypothesis that some association exists. This is a typical application of a log-linear model: to test for the existence of association between variables in a contingency table. The association terms here are reminiscent of interaction in Analysis of Variance (ANOVA) although there are important differences. Wickens suggested that a two-way log-linear model is more similar to one-way ANOVA than to two-way ANOVA [54, §3.10].

As more variables are added, higher-order association patterns such as $ \mathit{XYZ}$ can be included, and the number of possible models grows. In practise, only hierarchical models are used, where an association term $ \mathit{XY}$ is only included when lower-order terms $ X$ and $ Y$ are also included. Hierarchical models are usually summarized using only their highest-order terms. For example, the model $ (\mathit{XY},Z)$ implicitly includes $ X$ and $ Y$ terms. For a given set of variables, the model that includes all possible association terms is called the saturated model.

To ensure uniqueness, constraints are usually applied to the parameters of a log-linear model. Two different conventions are common, and tools for estimating log-linear models may use either convention. The ANOVA-type coding or effect coding using the following constraints for a two-way table [37]:

$\displaystyle \sum_i\lambda_{X(i)} = \sum_j\lambda_{Y(j)} = \sum_i\lambda_{\mathit{XY}(ij)} = \sum_j\lambda_{\mathit{XY}(ij)}=0$ (12)

while the dummy-variable coding blocks out one category for each:

$\displaystyle \lambda_{X(1)} = \lambda_{Y(1)} = \lambda_{\mathit{XY}(1j)} = \lambda_{\mathit{XY}(i1)} = 0$

Provided that all cells are non-zero, the breakdown of parameters (and hence degrees of freedom) in a log-linear model is quite simple. For example, a saturated model of a two-way table consists of one constant parameter, $ I-1$ row parameters, $ J-1$ column parameters and $ (I-1)(J-1)$ association parameters for a total of $ IJ$ parameters. The presence of zeros can complicate the parameter counting substantially, however.

After estimating the parameters of a log-linear model based on observed counts $ n_{ij}$, the estimated counts $ \hat{n}_{ij}$ are obtained. The model fits can be tested on these estimates using either the Pearson statistic

$\displaystyle X^2$ $\displaystyle = \displaystyle \sum_i\sum_j{\frac{(n_{ij} - \hat{n}_{ij})^2} {\hat{n}_{ij}}}$ (13)

or the likelihood-ratio statistic

$\displaystyle G^2$ $\displaystyle = \displaystyle 2 \sum_i\sum_j{n_{ij} \log\frac{n_{ij}}{\hat{n}_{ij}}}$ (14)

Both of these are $ \chi^2$-distributed, and hierarchically related models can be compared in terms of fit provided that the number of degrees of freedom in the models are known. The $ G^2$ statistic is clearly related to the discrimination information of equation (2.3). After noting that $ n = \sum_i \sum_j n_{ij} = \sum_i\sum_j \hat{n}_{ij}$, it is clear that $ G^2 = 2nI(\mathbf{n} \Vert \mathbf{\hat{n}})$. This formula is sometimes also known as the minimum discrimination information (MDI) statistic [31].

Finally, the $ G^2$ statistic for the null model ( $ \log n_{ij} = \lambda$) is related to the entropy of a probability distribution. The formula for entropy is

$\displaystyle H($$\displaystyle \mbox{\boldmath$\pi$}$$\displaystyle _{ij})$ $\displaystyle = \displaystyle-\sum_{i,j}{\pi_{ij} \log \pi_{ij}}$ (15)

and can be translated to a table of counts as

$\displaystyle H(\mathbf{n}_{ij})$ $\displaystyle = -\frac{1}{n}\left(-n \log n + \displaystyle\sum_{i,j}{n_{ij} \log n_{ij}}\right)$ (16)

The fitted null model is a uniform probability distribution, $ n_{ij} = n/IJ$. Its $ G^2$ statistic can be shown to equal $ 2n(\log IJ -

5 IPF and Log-Linear Models

The Iterative Proportional Fitting procedure has long been associated with log-linear analysis. Given a log-linear model and a contingency table $ \mathbf{N}_{ijk}$, it is often useful to know what the fitted table of frequencies $ \mathbf{\hat{N}}_{ijk}$ would be under the model. (For intuitive purposes, this is the equivalent of finding the shape of the fitted line under the model of linear regression.)

In this problem, a log-linear model is called direct if the fitted table can be expressed as closed formulae, or indirect if the fitting can only be achieved using an iterative algorithm. For indirect models such as $ (\mathit{XY},\mathit{XZ},\mathit{YZ})$, IPF has long been employed as an efficient way to fit a table to the data. To achieve this, the source table $ \mathbf{n}_{ijk}$ is chosen to have no pattern of association whatsoever; typically, this is done by setting $ n_{ijk}=1$. The target margins used by the IPF procedure are the “minimal sufficient statistics” of the log-linear model; for example, to fit the model $ (\mathit{XY},Z)$, the marginal tables $ \mathbf{N}_{ij+}$ and $ \mathbf{N}_{++k}$ would be applied. The resulting table found by IPF is known to give the maximum likelihood fit.

Each step of the IPF procedure adjusts all cells contributing to a given marginal cell equally. As a result, it does not introduce any new patterns of association that were not present in the source table; and the source table was chosen to include no association whatsoever. The resulting table shows only the modelled patterns of association [54, §5.2].

The IPF procedure is hence an important tool for log-linear modelling. Additionally, log-linear models provide some useful insight into the behaviour of the IPF procedure. Stephan showed that the relationship between the fitted table $ \mathbf{\hat{N}}_{ijk}$ and the source table $ \mathbf{n}_{ijk}$ could be expressed as

$\displaystyle \log{\hat{N}_{ijk}/n_{ijk}}$ $\displaystyle = \lambda + \lambda_{X(i)} + \lambda_{Y(j)} + \lambda_{Z(k)}$ (17)

when fitting to the $ \mathbf{N}_{i++}$, $ \mathbf{N}_{+j+}$ and $ \mathbf{N}_{++k}$ margins [50,33,1] for some choice of $ \lambda$ parameters. This has the exact same form as a log-linear model, but it is not a model; rather, with a suitable choice for the $ \lambda$ parameters, the formula holds exactly for every cell. In other words, this model is sufficient to explain all of the variation between $ \mathbf{n}$ and the IPF estimate $ \mathbf{\hat{N}}$. A similar model can be constructed for any set of margins applied during the IPF procedure, by adding $ \lambda$ terms that correspond to the variables in the margins.

This view of the IPF procedure is mostly useful for interpreting its behaviour. IPF modifies the source table $ \mathbf{n}_{ijk}$ to create a fitted table $ \mathbf{\hat{N}}_{ijk}$; that change, represented by the left hand side of equation (2.18), can be expressed using a small number of parameters (the various $ \lambda$ terms). The number of parameters necessary is directly proportional to the size of the marginal tables used in the fitting procedure; in this case, $ 1 + (I-1) + (J-1) + (K-1)$. This insight is not unique to log-linear models, but it is perhaps easier to understand than the Lagrangian analysis used in early IPF papers.

6 Zero Cells

The only shortcoming of the preceding discussion of IPF and log-linear models concerns zeros in the source table or target margins. While Csiszár's treatment of zeros allows the IPF procedure to handle zeros elegantly, it remains difficult to determine the correct number of parameters used by IPF when either the source or target tables contain zeros.

The zeros can take the form of either structural or sampling zeros. Structural zeros occur when particular combinations of categories are impossible. For example, if a sample of women is cross-classified by age and by number of children, the cell corresponding to “age 0-10” and “1 child” would be a structural zero (as would all higher number of children for this age group). A sampling zero occurs when there is no a priori reason that a particular combination of categories would not occur, but the sample was small enough that no observations were recorded for a particular cell.

Wickens provides a detailed description of the consequences of zero cells for log-linear modelling [54, §5.6]. For high dimensional tables, the number of parameters in a particular model becomes difficult to compute, and this in turn makes it difficult to determine how many degrees of freedom are present. As he notes, however, “The degrees of freedom are off only for global tests of goodness of fit and for tests of the highest-order interactions.” Clogg and Eliason suggested that goodness-of-fit tests are futile when the data becomes truly sparse:

But there is a sense in which goodness-of-fit is the wrong question to ask when sparse data is analyzed. It is simply unreasonable to expect to be able to test a model where there are many degrees of freedom relative to the sample size. [10]

For a small number of zeros, then, it seems that some log-linear analysis may be possible. A sparse table with a large number of zeros, by contrast, is unlikely to be tested for goodness-of-fit.

3 Population Synthesis

Microsimulation and agent-based methods of systems modelling forecast the future state of some aggregate system by simulating the behaviour of a number of disaggregate agents over time [9].

In many agent-based models where the agent is a person, family or household the primary source of data for population synthesis is a national census. In many countries, the census provides two types of data about these agents. Large-sample detailed tables of one or two variables across many small geographic areas are the traditional form of census delivery, and are known as Summary Files in the U.S., Profile Tables or Basic Summary Tabulations (BSTs) in Canada, and Small Area Statistics in the U.K. In addition, a small sample of individual census records is now commonly available in most countries. These samples consist of a list of individuals (or families or households) drawn from some large geographic area, and are called Public-Use Microdata Samples (PUMS) in the U.S. and Canada, or a Sample of Anonymized Records in the U.K. The geographic area associated with a PUMS is the Public-Use Microdata Area (PUMA) in the U.S., and the Census Metropolitan Area (CMA) in Canada. The population synthesis procedure can use either or both of these data sources, and must produce a list of agents and their attributes, preferably at a relatively fine level of geographic detail.

In this document, the terms Summary Tables, PUMS and PUMA will be used. For small areas inside a PUMA, the Census Tract (CT) will often be used (or more generically a “zone”), but any fine geographic unit that subdivides the PUMA could be used. Further details about the data and definitions are presented in Chapter 3.

In most population synthesis procedures, geography receives special attention. This is not because geography is inherently different than any other agent attribute: like the other attributes, location is treated as a categorical variable, with a fine categorization system (small zones like census tracts) that can be collapsed to a coarse set of categories (larger zones, like the Canadian Census Subdivisions), or even collapsed completely (to the full PUMA) to remove any geographic variation. There are two reasons why geography receives special attention: first, because census data is structured to treat geography specially. One data set (the PUMS) provides data on the association between almost all attributes except geography; the other (Summary Tables) includes geography in every table, and gives its association with one or two other variables. Secondly, geography is one of the most important variables for analysis and often has a large number of categories; while an attribute like age can often be reduced to 15-20 categories, reducing geography to such a small number of categories would lose a substantial amount of variation that is not captured in other attributes. For transportation analysis, a fine geographic zone system is essential for obtaining a reasonable representation of travel distances, access to transit systems, and accurate travel demand. As a result, geography is usually broken up into hundreds of zones, sometimes more than a thousand.

1 Zone-by-Zone IPF

Figure 2.5: A zone-by-zone application of IPF for population synthesis, including a Monte Carlo integerization stage. The source table can either be constant ($ X$ independent of $ Y$), or created by cross-classifying a PUMS (shown here). Zone $ \kappa$ is synthesized without consideration of other zones, and under the assumption that its association pattern is the same as the pattern in the PUMS. After integerization, the table no longer exactly fits the margins.
Image figure_ipf_zonebyzone

The Iterative Proportional Fitting method is the most popular means for synthesizing a population. The simplest approach is to consider each small geographic zone independently. Suppose that the geographic zones are census tracts contained in some PUMA. Further, suppose that each agent needs two attributes $ X(i)$ and $ Y(j)$, in addition to a zone identifier $ Z(k)$. An overview of the process is shown in Figure 2.5.

The synthesis is conducted one zone at a time, and the symbol $ \kappa$ denotes the zone of interest. In the simplest approach, the variables $ X$ and $ Y$ are assumed to have no association pattern (i.e., they are assumed to vary independently), and hence the initial table $ \mathbf{n}_{ij\kappa}$ is set to a constant value of one. The summary tables provide the known information about zone $ \kappa$: the number of individuals in each category of variable $ X$ can be tabulated to form $ \mathbf{N}_{i+\kappa}$, and likewise with variable $ Y$ to give $ \mathbf{N}_{+j\kappa}$. These are used as the target margins for the IPF procedure, giving a population estimate for zone $ \kappa$.

However, the variables $ X$ and $ Y$ are unlikely to be independent. The PUMS provides information about the association between the variables, but for a different population: a small sample in the geographically larger PUMA. As discussed by Beckman, Baggerly & McKay [4], under the assumption that the association between $ X$ and $ Y$ in zone $ \kappa$ is the same as the association in the PUMA, the initial table $ \mathbf{n}_{ij\kappa}$ can be set to a cross-classification of the PUMS over variables $ X(i)$ and $ Y(j)$. IPF is then applied, yielding a different result.

The IPF process produces a multiway contingency table for zone $ \kappa$, where each cell contains a real-valued “count” $ \hat{N}_{ij\kappa}$ of the number of agents with a particular set of attributes $ X=i$ and $ Y=j$. However, to define a discrete set of agents integer counts are required. Rounding the counts is not a satisfactory “integerization” procedure for three reasons: the rounded table may not be the best solution in terms of discrimination information; the rounded table may not offer as good a fit to margins as other integerization procedures; and rounding may bias the estimates.

Beckman et al. handled this problem by treating the fitted table as a joint probability mass function (PMF), and then used $ N$ Monte Carlo draws [24] from this PMF to select $ N$ individual cells. These draws can be tabulated to give an integerized approximation $ \mathbf{\hat{N}'}$ of $ \mathbf{\hat{N}}$. This is an effective way to avoid biasing problems, but at the expense of introducing a nondeterministic step into the synthesis.

Finally, given an integer table of counts, individual agents can be synthesized using lookups from the original PUMS list. (See Figure 2.2 for an illustration of the link between the list and tabular representations.) Beckman et al. observed an important aspect of this process: if $ n_{ij}$ is zero (i.e., no records in the PUMS for a particular combination of variables), then the fitted count $ \hat{N}_{ij\kappa}$ will be zero, and this carries through to the integerized count $ \hat{N}'_{ij\kappa}$. Consequently, any cell in $ \mathbf{\hat{N}'}$ that has a non-zero count is guaranteed to have corresponding individual(s) in the PUMS.

2 Multizone IPF

Beckman, Baggerly & McKay [4] discussed the simple zone-by-zone technique and also extended it to define a multizone estimation procedure; their approach has been widely cited [5,21,2] and is described in great detail by [27]. They described a method for using IPF with a PUMS to synthesize a set of households. Their approach addresses a weakness of the zone-by-zone method: the PUMS describes the association pattern for a large geographic area, and the pattern within small zones inside that area may not be identical. Consequently, Beckman et al. made no assumptions about the association pattern of the individual zones, but instead required the sum over all zones to match the PUMS association pattern. This approach is illustrated graphically in Figure 2.6; their paper includes a more detailed numerical example. (The Monte Carlo integerization step is omitted for clarity.)

Figure 2.6: An illustration of Beckman et al.'s fitting procedure using two attributes $ X$ and $ Y$, plus a variable $ Z$ representing the census tract zone within the PUMA. In the left half, $ Z$ is ignored and the PUMS is adjusted to match the summary tables; they differ because the PUMS is derived from a smaller sample than the summary tables. In the right half, the variable $ Z(k)$ is added to represent the $ K$ zones that make up the PUMA. A constant initial table filled with ones is used for a second IPF, which is fitted to the summary tables and the adjusted PUMS. The summary tables now show variation of $ X$ by zone $ Z$ (and likewise $ Y \times Z$), while the adjusted PUMS provides information about the association between $ X$ and $ Y$.
Image figure_beckman

This multizone approach offers an important advantage over the zone-by-zone approach. The zone-by-zone approach uses the PUMS association pattern for the initial table, but it is overruled by the marginal constraints, and its influence on the final result is limited. By applying the PUMS association pattern as a marginal constraint on the IPF procedure, the multizone method guarantees that the known association pattern in the PUMS is satisfied exactly.

3 Synthesis Examples Using IPF

Many projects have applied Beckman et al.'s methods. Most microsimulation projects seem to use a zone-by-zone fitting procedure with a PUMS, followed by Monte Carlo draws as described by Beckman. Few seem to have adopted Beckman's multizone fitting procedure, however. This may be due to storage limitations: synthesizing all zones simultaneously using Beckman's multizone method requires substantially more computer memory, and would consequently limit the number of other attributes that could be synthesized.

In terms of different agent types, Beckman et al. considered households, families and individuals in a single unit. They segmented households into single-family, non-family and group quarters. They then synthesized family households, including a few individual characteristics (age of one family member) and “presence-of-children” variables. They did not synthesize person agents explicitly, did not associate families with dwellings, and did not synthesize the connection between families in multifamily households. Later work on their model (TRANSIMS) did synthesize persons from the households [27]. Their largest table was for family households with $ d=6$ variables and $ C=11,760$ cells before including geography, of which 609 cells were nonzero. Their sample was a 5% sample of a population of roughly 100,000 individuals.

Guo & Bhat [26] applied Beckman's procedure to a population of households in the Dallas-Fort Worth area in Texas. (It is not clear whether they used zone-by-zone synthesis or applied Beckman's multizone approach.) They modified Beckman's integerization procedure by making Monte Carlo draws without replacement, with some flexibility built into the replacement procedure in the form of a user-defined threshold. Further, they proposed a procedure for simultaneously synthesizing households and individuals. In their procedure, the household synthesis includes some information about the individuals within the household: the number of persons and the family structure. A series of individuals are synthesized to attach to the household, using the limited known information from the synthesized household. If any of the synthetic individuals are “improbable” given the number of demographically similar individuals already synthesized, then the entire household is rejected and resynthesized.

The linkage between households and individuals in this method remains weak, and the procedure is fairly ad hoc. Guo & Bhat showed that the method did improve fit in the individual table fits during a single trial, but the results were not adequate for any conclusive claims.

Huang & Williamson [28] implemented an IPF-based method for comparison against the Combinatorial Optimization method. They used a novel zone-by-zone synthesis procedure. In their approach, large zones (“wards”) within a PUMA were first synthesized one-at-a-time, under the assumption that each ward has the same association pattern as the larger PUMA. The ward results were then used as the initial table for the synthesis of finer zones (enumeration districts) within each ward. This multilevel approach improves on the conventional zone-by-zone method.

Huang & Williamson also used an incremental approach to attribute synthesis which they call “synthetic reconstruction” where after an initial fitting to produce four attributes, additional attributes are added one-at-a-time. The motivation for this approach is apparently to avoid large sparse tables, and includes collapsing variables to coarser categorization schemes to reduce storage requirements and sparsity. However, their method is complex and requires substantial judgment to construct a viable sequencing of new attributes, by leveraging a series of conditional probabilities. Furthermore, the connection to the PUMS agents is lost: the resulting population is truly synthetic, with some new agents created that do not exist in the initial PUMS. Since only a subset of the attributes are considered at a time, it is possible that some attributes in the synthetic agents may be contradictory. Nevertheless, the analysis is interesting, and reveals the effort sometimes expended when trying to apply IPF rigorously to obtain a population with a large number of attributes.

Finally, Huang & Williamson proposed a modification to the Monte Carlo procedure, by separating the integer and fractional components of each cell in the multiway table. The integer part is used directly to synthesize discrete agents, and the table of remaining fractional counts is then used for Monte Carlo synthesis of the final agents.

4 Reweighting and Combinatorial Optimization

The primary alternative to the Iterative Proportional Fitting algorithm is the reweighting approach advocated by Williamson. In a 1998 paper [57], Williamson et al. proposed a zone-by-zone method with a different parameterization of the problem: instead of using a contingency table of real-valued counts, they chose a list representation with an integer weight for each row in the PUMS. (As illustrated in Figure 2.2, there is a direct link between tabular and list-based representations.)

For each small zone $ \kappa$ within a PUMA, the zone population is much smaller than the observations in the PUMS; that is, $ N_{++\kappa} < n$. This made it possible for Williamson et al. to to choose a subset of the PUMS to represent the population of the zone, with no duplication of PUMS agents in the zone. In other words, the weight attached to each agent is either zero or one (for a single zone).

To estimate the weights, they used various optimization procedures to find the set of 0/1 weights yielding the best fit to the Summary Tables for a single zone. They considered several different measures of fit, and compared different optimization procedures including hill-climbing, simulated annealing and genetic algorithms. By solving directly for integer weights, Williamson obtained a better fit to the Summary Tables than Beckman et al., whose Monte Carlo integerization step harmed the fit.

Williamson et al. [57] proposed three primary reasons motivating their approach:

  1. Efficiency: a list-based representation uses considerably less storage than a tabular representation, particularly for a large number of attributes.
  2. Flexible aggregation: due to their storage limitations, array-based approaches often collapse finely-categorized attributes to a coarse categorization scheme. The list-based representation allows fine categorizations which can be flexibly aggregated into simple schemes as required. This can be done during the fitting procedure (to align with the categorization of a constraint), or after synthesis.
  3. Linkage: after synthesis, the list of individuals can be expanded to include new attributes from other data sources easily; cross-tabulations are more difficult to disaggregate in a similar manner.

The last two claims are somewhat weak. It is true that many IPF procedure require coarse categorization during fitting, in order to conserve limited memory. However, after completion, Beckman's approach does produce a list of PUMS records (and can be linked to other data sources easily). Even if a coarse categorization was used during fitting, it is still possible to use the fine categorization in the PUMS after synthesis. Nevertheless, IPF does require a carefully constructed category system to make fitting possible, and this can be time-consuming to design and implement.

The reweighting approach has three primary weaknesses. First, the attribute association observed in the PUMS ( $ \mathbf{n}_{ij}$) is not preserved by the algorithm. The IPF method has an explicit formula defining the relationship between the fitted table $ \mathbf{\hat{N}}_{ij}$ and the PUMS table $ \mathbf{n}_{ij}$ in equation (2.18). Beckman et al.'s multizone approach also treats the PUMS association pattern for the entire PUMA as a constraint, and ensures that the full population matches that association pattern. The reweighting method does operate on the PUMS, and an initial random choice of weights will match the association pattern of the PUMS. However, the reweighting procedure does not make any effort to preserve that association pattern. While the reweighting method has been evaluated in many ways [52,28,56,38], it does not appear that the fit to the PUMS at the PUMA level has been tested.

Secondly, the reweighting method is very computationally expensive. When solving for a single zone $ \kappa$, there are $ n$ 0/1 weights, one for each PUMS entry. However, this gives rise to $ \binom{n}{N_{++\kappa}}$ possible combinations; “incredibly large,” in the authors' words [57]. Of course, the optimization procedures are intelligent enough to explore this space selectively and avoid an exhaustive search; nevertheless, the authors reported a runtime of 33 hours using an 800MHz processor [28]. Since the number of permutations grows factorially with the number of individuals in the zone, it is not surprising that the authors chose to work with the smallest zones possible (1440 zones containing an average of 150 households each); it is possible that larger zones would not be feasible.

Finally, the reweighting method uses $ n \times K$ weights to represent a $ K$-zone area. This parameter space is quite large; larger, in fact, than the population itself. It is not surprising that good fits can be achieved with a large number of parameters, but the method is not particularly parsimonious and may overfit the Summary Tables. It is likely that a simpler model with fewer parameters could achieve as good a fit, and would generalize better from the 2% PUMS sample to the full population.