The existing population synthesis procedures have many limitations. In this chapter, the following concerns are discussed:
The chapter is structured largely as a discussion of these issues, together with an attempt to brainstorm solutions to these limitations, focusing particularly on the IPF-based methods; not all ideas are entirely successful. The new methods are intended to be largely independent, but most can be combined if desired. In the following chapters, several of these new methods are implemented and evaluated.
The IPF method has been applied to tables with a large number of dimensions, as many as eight. Such high dimensional tables are almost always sparse, with the vast majority of the cells in the table containing zeros. (High dimensional tables are generally notorious for their difficulties and differences from two-dimensional tables; see  for “ubiquitous” ill-behaved examples.) In these sparse tables, a large fraction of the non-empty cells contain only one observation, and the number of cells is often much larger than the number of observations. In other words, the sample does not provide a statistically meaningful estimate of the probability distribution for such a high dimensional table. However, the high-dimensional table can be collapsed to produce 2D or 3D tables, each of which is adequately sampled and gives a statistically valid distribution of counts. This is illustrated in Table 4.1, where 99.7% of the cells are zero in the 8-way table. The two-way and three-way margins of the table shown ( and ) have 0% and 5.6% zeros, and median cell counts of 90 and 30 respectively. Of course, an 8-way table has many other lower-dimensional margins, and only one such choice is illustrated in Table 4.1. (When collapsing a -way table to dimensions, there are possible choices of variables to keep in the low dimensional table.) Consequently, while a sparse 8-way table does not provide statistically testable 8-variable interaction, it could be viewed as a means of linking the many 2- or 3-way tables formed by its margins.
The cell counts in the high dimensional table say very little statistically; in fact, half of the non-zero cells contain just one observation. The most important information in an 8-way table is not necessarily the counts in the cells, but rather the co-ordinates of the non-zero cells, which determine how each cell influences the various low-dimensional margins. In this manner, the importance of the high-dimensional table is in many ways its sparsity pattern, which provides the link between the many 2- and 3-way tables. However, the sparsity pattern in the high dimensional table is in some ways an artefact of sampling. Some of the zeros in the table represent structural zeros: for example, it is essentially impossible for a family to exist where the wife's age is 15, her education is university-level and the family has eight children. Other zeros may be sampling zeros where the sample did not observe a particular combination of variables, but it does exist in the population.
All of the population synthesis strategies reviewed in Chapter 2 preserved the sparsity pattern of the high-dimensional PUMS. It would be useful if the synthesis procedure could account for the uncertainty surrounding the sparsity pattern. One strategy for resolving this would be to simplify the PUMS, building a variant that fits the low-dimensional margins but makes no assumptions about the associations or sparsity pattern in high dimensions where sampling is inadequate. The primary goal would be to correct sampling zeros in the high-dimensional PUMS, replacing them with small positive counts.
This strategy is illustrated in Figure 4.1. IPF is applied on a high-dimensional table, with the source sample set to a uniform distribution. The constraints applied to IPF are the full set of (say) 3-way cross-tabulations of the PUMS. For a 5-dimensional problem, this is equal to separate constraints. After IPF, the simplified PUMS matches all of the 3-way margins from the PUMS, but includes no 4-way or 5-way interaction between variables. In this example, a 5-way table is shown for illustration, but the intention is for higher-dimensional tables to be simplified. Also, 3-way margins of the PUMS are used as constraints for illustrative purposes, but the actual margins chosen could be selected using tests of statistically significant interaction; If a log-linear model shows some significant 4-way interactions in the PUMS, then these could be included as margins.
This method does indeed reduce the number of zeros in the high-dimensional table, correcting for sampling zeros and capturing some unusual individuals that were not included in the original PUMS. However, this strategy has significant downsides. In particular, it makes no distinction between structural and sampling zeros at high dimensions: there may be structural zeros involving four or more variables that do not appear in 3-way tables, but they will be treated the same as sampling zeros and be filled in with small positive counts.
Furthermore, it makes the entire population synthesis procedure more difficult, since agents can be synthesized who did not exist in the original PUMS. This makes the technique difficult to combine with some of the other methodological improvements discussed here.
For a microsimulation model spanning many aspects of society and the economy it is useful to be able to associate a range of attributes with each agent. Different attributes are useful for different aspects of the agent's behaviour. For a person agent, labour force activity, occupation, industry and income attributes are useful for understanding his/her participation in the labour force. Meanwhile, age, marital status, gender and education attributes might be useful for predicting demographic behaviour.
However, as more attributes are associated with an agent, the number of cells in the corresponding multiway contingency table grows exponentially. A multiway contingency table representing the association pattern between attributes has cells. If a new attribute with categories is added, then times more cells are needed. Asymptotically, the storage space is exponential in the number of attributes. As a result, fitting more than eight attributes with a multizone IPF procedure typically requires more memory than available on a desktop computer. However, the table itself is cross-classifying a fixed number of observations (i.e., a PUMS), and is extremely sparse when a large number of attributes are included as shown in Table 4.1. Is there a way that this sparsity can be exploited to allow the synthesis to create a large number of attributes?
Sparsity is a familiar problem in numerical methods. Many branches of science store large sparse 2D matrices using special data structures that hold only the non-zero sections of the matrix, instead of using a complete array that includes cells for every zero. In the data warehousing field, for example, multiway tables were initially stored in complete arrays accessed through a procedure called MOLAP (Multidimensional On-line Analytical Processing). More recently, that field has shifted to ROLAP (Relational OLAP) methods that allow data to be stored in its natural microdata form while retaining the ability to answer queries efficiently .
The IPF method itself has little impact on the sparsity pattern of a table. After the first iteration of fits to the constraints, the final sparsity pattern is essentially known; some cells may eventually converge to near-zero values, but few other changes occur. Once a cell is set to zero, it remains zero for the remainder of the procedure. Also, the IPF algorithm does not require a complete representation of the underlying table. All that it needs are three operations (described later), which could be implemented using either a complete or sparse representation of the data.
The benefits of using a sparse data structure are substantial. Williamson et al. presented many of the arguments when describing their Combinatorial Optimization method: efficiency in space, flexibility of aggregation and easier linking of data sources . In terms of efficiency, the method described here allows the IPF algorithm to be implemented using storage proportional to the number of non-zero cells in the initial table. For agent synthesis with the zone-by-zone method, this is proportional to (the number of observations in the PUMS) multiplied by (the number of attributes to fit). The multizone method combines several IPF stages, and requires considerably more memory: a similar in the first stage, but in the second stage, where is the number of zones. This is illustrated in Table 4.2.
There are many types of data structures that could be used to represent a sparse high dimensional contingency table. The data structure proposed here is not the most efficient, but is conceptually simple. It borrows directly from Williamson's Combinatorial Optimization method: the data is represented as a list of the PUMS microdata entries, with a weight attached to each. The weight is an expansion factor, representing the number of times to replicate that record to form a complete population. Williamson's representation includes only integer weights, and operates on a zone-by-zone basis. The new approach described here behaves exactly like IPF and hence allows fractional weights. With a small extension, it can also supports multiple zones: instead of attaching one weight to each PUMS entry, weights are attached with one for each zone. An illustration of the format of the data structure is shown in Table 4.3.
As Williamson et al. pointed out, flexible aggregation is a real advantage of a list-based representation. Complete array storage is proportional to the number of categories used for each attributes, while the sparse storage scheme is not affected by the categorization of the attributes. Many applications of IPF that used complete arrays were forced to abandon detailed categorization schemes to conserve space and allow more attributes to be synthesized (e.g., ). This in turn makes it difficult to apply several margins, since different margins may categorize a single attribute differently. When a large number of categories are possible, however, the attribute can be represented with a fine categorization and collapsed to different coarse categorizations as required during the fitting procedure.
The operation of a typical IPF algorithm was presented earlier in Figure 2.4. To implement this procedure with the sparse list structure, the following operations are necessary:
If the target population margins remain stored as a complete array, these operations are relatively straightforward. The collapse operation can be done in a single pass over the list, using the category numbers in each list row as co-ordinates into the complete array that stores the collapsed table. The update operation can likewise be done in a single pass over the list. All of these operations are fast, with complexity equal to the storage cost, . (The method is basically the same when non-PUMS attributes are included and multiple weights are stored per row in the list. The non-PUMS attribute must be handled as a special case, since it is stored slightly differently. Since there are times more weights, the computation cost grows proportionally to , again matching the storage cost.)
The only tricky part of the procedure is setting the initial weights. When using a complete array representation for zone-by-zone methods, the initial table needs to follow the distribution of the PUMS, while for Beckman et al.'s multizone method, the initial table needs to be a uniform distribution, usually by setting all cells to one. In the sparse representation described here, the situation is reversed because there is one row per PUMS entry, instead of one cell grouping many PUMS entries. As a result, the initial weights need to be 1.0 for the zone-by-zone method. For the multizone method, the sum of the weights in the rows that contribute to a single “cell” in the complete table needs to add to one. Therefore, the individual weights should be . The tricky part is finding how many rows share a cell, since the data is not structured for this purpose. To achieve this, the list is sorted by the table dimensions, which has the effect of grouping rows contributing to a single cell. In this manner, the initial weights can be set in time.
Additionally, the multizone procedure requires a fit to the distribution of all non-geographic variables simultaneously. (See the margin on the right half of Figure 2.6.) This margin is high-dimensional—it includes all variables except for the geographic variable. However, it is also sparse, and is in fact computed through an IPF procedure. As a result, it is already stored as a sparse list with a single weight per row. This weight can be treated as a constraint on the total for the weights in each row, and suitable collapse/update procedures are then easily defined. The computation cost is still for this type of constraint.
Finally, the Monte Carlo integerization for this sparse structure is quite simple, and little changed. The list of weights (or 2D array of weights for the multizone procedure) is normalized and treated as a probability mass function, and individual rows (cells for multizone) are synthesized using Monte Carlo draws.
This sparse data structure removes a substantial limitation from the IPF algorithm, but also raises new questions. Is there a limit to the number of attributes that can be synthesized? If there is a limit, how is it related to the size of the PUMS sample?
The answers to these questions remain elusive. Ultimately, the addition of a new attribute is a decision to increase the dimension of the multiway contingency table. As discussed earlier, the behaviour of this high-dimensional table and its relationship to lower-dimensional margins remains poorly understood.
Many census agencies apply random rounding procedures to published tables, including the agencies in Canada, the United Kingdom and New Zealand. Each agency has a base that it uses, and then modifies a cell count by rounded up to the nearest multiple of with a probability , or down with a probability . In most applications, a procedure called unbiased random rounding is used, where . The alternative is called unrestricted random rounding, where is constant and independent of the cell values; for example, with it is equally likely that a cell will be rounded up or down.
For example, cells and marginal totals in Canadian census tables are randomly rounded up or down to a multiple of using the unbiased procedure. For a cell with a count of , there is a 20% probability that it is published as and an 80% probability that it is published as . Most importantly, the expected value is equal to that of the unrounded count; it is therefore an unbiased random rounding procedure.
As discussed by Huang & Williamson , this can lead to conflicts between tables: two different cross-tabulations of the same variable or set of variables may be randomly rounded to different values. The standard IPF procedure will not converge in this situation. The procedure is also unable to take into account the fact that margins do not need to be fitted exactly, since there is a reasonable chance that the correct count is within of the reported count.
Using the termination criterion of Figure 2.4 (line 10), the IPF procedure will not necessarily terminate if two randomly rounded margins conflict. The termination criterion shown requires the fitted table to match all margins simultaneously:
For each cross-tabulation, statistical agencies publish a hierarchy of margins, and these margins are rounded independently of the cells in the table. For a three-way table randomly rounded to give , the data release will also include randomly rounded two-way margins , and , one-way margins , and , and a zero-way total . The sum of the cells does not necessarily match the marginal total. For example, the sum includes randomly rounded counts. The expected value of this sum is the true count , but the variance is large and the sum could be off by as much as in the worst case. By contrast, the reported marginal total also has the correct expected value, but its error is at most .
For this reason, it seems sensible to include the hierarchical margins in the fitting procedure, in addition to the detailed cross-tabulation itself.
As described in equations (2.6) and (2.1), the IPF procedure minimizes the Kullback-Leibler divergence ,
That is, any value within the range is an acceptable solution, with no preference for any single value within that range.
Dykstra's generalization of IPF  provides some fruitful ideas for handling this type of constraint. Csiszár  described the IPF procedure as a series of projections onto the subspace defined by each constraint. Csiszár was not working in a -dimensional space (where is the number of attributes being fitted), but in a -dimensional space (where is the number of cells in the table) representing all possible probability distributions, which has since been called -space.
Nevertheless, the idea of projection is still useful: each iteration of IPF is a modification of the probability distribution to fit a margin. It is a “projection” in that it finds the “closest” probability distribution in terms of Kullback-Leibler divergence, just as the projection of a point onto a plane finds the closest point on the plane in terms of Euclidean distance. (Note, however, that Kullback-Leibler divergence is not a true distance metric.)
Csiszár only considered equality constraints. Dykstra extended Csiszár's method to include a broader range of constraints: any closed convex set in -space. This class of constraints appears to include the desired inequality constraints defined by Equation 4.4. Dykstra's method for applying these constraints is also a projection procedure, finding the set of counts that satisfy the constraint while minimizing the Kullback-Leibler divergence.
To give an example, consider the algorithm of Figure 2.4. Line 5 would be replaced with
However, this projection procedure has its own problems. The standard IPF procedure ignores the probability distribution associated with each marginal value and uses only the published cell count. The projection procedure described here suffers from the opposite problem: it focuses on the range of possible values, without acknowledging that one outcome is known to be more likely than the others. To see this, consider the probability distribution for the unrounded value given the published value shown in Table 4.4. The distribution is triangular, with a strong central peak. The projection algorithm forces the fit to match the range of this distribution, but it treats all values inside this range as equally probable. This would be suitable if the census used unrestricted random rounding, but not for the more typical case where unbiased rounding is used.
Suppose that a population of person agents has been synthesized, with a limited amount of information about their relationships in families (such as a CFSTRUC, which classifies a person as married, a lone parent, a child living with parent(s), or a non-family person). In the absence of any information about how families form, the persons could be formed into families in a naïve manner: randomly select male married persons and attach them to female married persons, and randomly attach children to couples or lone parents. Immediately, problems would emerge: some persons would be associated in implausible manners, such as marriages with age differences over 50 years, marriages between persons living at opposite ends of the city, or parents who are younger than their children.
A well-designed relationship synthesis procedure should carefully avoid such problems. A good choice of relationships satisfies certain constraints between agents' attributes, such as the mother being older than her child, or the married couple living in the same zone. It also follows known probability distributions, so that marriages with age differences over 50 years have a low but non-zero incidence.
Most constraints and probability distributions are observed in microdata samples of aggregate agents, such as families and households. A complete Family PUMS includes the ages of mothers and children, and none of the records includes a mother who is younger than her children.3 Similarly, only a small fraction of the records include marriages between couples with ages differing by more than 50 years. The question, however, is one of method: how can relationships between agents be formed to ensure that the desired constraints are satisfied?
Guo & Bhat  used a top-down approach, synthesizing a household first and then synthesizing individuals to connect to the household. The attributes used to link the two universes were gender and age: the gender of the husband/wife or lone parent are known, and coarse constraints on the age of the household head (15-64 or 65+) and children (some 0-18 or all 18+). These constraints are quite loose, and no constraint is enforced between the husband/wife's ages or parent/child ages.
Guan  used a bottom-up approach to build families, with slightly stronger constraints. The persons were synthesized first, and then assembled to form families. Children are grouped together (and constrained to have similar ages), then attached to parents. Constraints between parent/child ages and husband/wife ages were included, although there are some drawbacks to the method used for enforcement. Guan likewise used a bottom-up approach to combine families and non-family persons into households.
Arentze & Timmermans  only synthesized a single type of agent, the household. Their synthesis included the age and labour force activity of both husband and wife, and the linkage to the number of children in the household. They did not connect this to a separate synthesis of persons with detailed individual attributes, but by synthesizing at an aggregate level, they guaranteed that the population was consistent and satisfied key constraints between family members.
Both Guo & Bhat and Guan's procedures suffer from inconsistencies between the aggregate and disaggregate populations. The family population may contain 50 husband-wife families in zone where the husband has age , while the person population contains only 46 married males of age in zone . In the face of such inconsistencies, either families or persons must be changed: a family could be attached to a male of age , or a person could be modified to fit the family. In both cases, either the family or person population is deemed “incorrect” and modified. The editing procedures are difficult to perform, and inherently ad hoc. Furthermore, as the number of overlapping attributes between the two populations grows, inconsistencies become quite prevalent.
What are the sources of these inconsistencies? They come from two places: first, the fitting procedure used to estimate the population distribution for persons and for families may not give the same totals for a given set of common attributes. Second, even if and agree on all shared attributes, the populations produced by Monte Carlo synthesis may not agree, since the Monte Carlo procedure is non-deterministic. In the following sections, a method is proposed to resolve these two issues.
For the purposes of discussion, consider a simple synthesis example: synthesizing husband-wife families. Suppose that the universe of persons includes all persons, with attributes for gender , family status , age , education and zone . The universe of families includes only husband-wife couples, with attributes for the age of husband and wife , and zone . IPF has already been used to estimate the contingency table cross-classifying persons ( ) and likewise for the table of families ( ). The shared attributes between the two populations are age and zone, and implicitly gender. The two universes do not overlap directly, since only a fraction of the persons belong to husband-wife families; the others may be lone parents, children, or non-family persons, and are categorized as such using the attribute.
In order for consistency between and , the following must be met for and any choice of :
That is, the number of married males of age in zone must be the same as the number of husband-wife families with husband of age in zone . While this might appear simple, it is often not possible with the available data. A margin giving the distribution is probably available to apply to the person population. However, a similar margin for just married males is not likely to exist for the family population; instead, the age breakdown for married males in the family usually comes from the PUMS alone. As a result, equation (4.6) is not satisfied.
One suggestion immediately leaps to mind: if the person population is fitted with IPF first and is known, the slice of where and could be applied as a margin to the family fitting procedure, and likewise for . This is entirely feasible, and does indeed guarantee matching totals between the populations. The approach can be used for the full set of attributes shared between the individual and family populations. There is one downside, however: it can only be performed in one direction. The family table can be fitted to the person table or vice versa, but they cannot be fitted simultaneously.4
Finally, there remains one wrinkle: it is possible that the family population will still not be able to fit the total margin from the individual population, due to a different sparsity pattern. For example, if the family PUMS includes no families where the male is (say) 15-19 years old but the individual PUMS does include a married male of that age, then the fit cannot be achieved. This is rarely an issue when a small number of attributes are shared, but when a large number of attributes are shared between the two populations it is readily observed. The simplest solution is to minimize the number of shared attributes, or to use a coarse categorization for the purposes of linking the two sets of attributes.
Alternatively, the two PUMS could be cross-classified using the shared attributes and forced to agree. For example, for and , then the pattern of zeros in and could be forced to agree by setting cells to zero in one or both tables. (In the earlier example, this would remove the married male of age 15-19 from the Person PUMS.) The person population is then fitted using this modified PUMS, and the family population is then fitted to the margin of the person population.
The second problem with IPF-based synthesis stems from the independent Monte Carlo draws used to synthesize persons and families. For example, suppose that mutually fitted tables and are used with Monte Carlo to produce a complete population of persons and families and . If it can be guaranteed for and that
A simplistic solution would be a stratified sampling scheme: for each combination of and , select a number of individuals to synthesize and make exactly that many draws from the subtables and . This approach breaks down when the number of strata grows large, as it inevitably does when more than one attribute is shared between persons and families.
The problem becomes clearer once the reason for mismatches is recognized. Suppose a Monte Carlo draw selects a family with husband age in zone . This random draw is not synchronized with the draws from the person population, requiring a person of age in zone to be drawn; the two draws are independent. Instead, synchronization could be achieved by conditioning the person population draws on the family population draws. Instead of selecting a random value from the joint distribution
This reversal of the problem guarantees that equation (4.7) is satisfied, and allows consistent relationships to be built between agents. While it has been described here in a top-down manner (from family to person), it can be applied in either direction. The two approaches are contrasted in Figures 4.2 and 4.3.
As demonstrated in the preceding sections, it is possible to synthesize persons and relate them together to form families, while still guaranteeing that the resulting populations of persons and families approximately satisfy the fitted tables and . By carefully choosing a set of shared attributes between the person and family agents and using conditional synthesis, a limited number of constraints can be applied to the relationship formation process. In the example discussed earlier, the ages of husband/wife were constrained; in a more realistic example, the labour force activity of husband/wife, the number of children and the ages of children might also be constrained. Furthermore, multiple levels of agent aggregation could be defined: families and persons could be further grouped into households and attached to dwelling units.
The synthesis order for the different levels of aggregation can be varied as required, using either a top-down or bottom-up approach. However, the method is still limited in the types of relationships it can synthesize: it can only represent nesting relationships. Each individual person can only belong to one family, which belongs to one household. Other types of relationships cannot be synthesized using this method, such as a person's membership in another group (e.g., a job with an employer).