3 Data Sources and Definitions

The data for this project came largely from the Canadian Census administered by Statistics Canada. The census has been conducted every five years since 1981, and Toronto's travel activity survey (the Transportation Tomorrow Survey or TTS) is timed to coincide with census years. The TTS was first conducted in 1986, and this was therefore chosen as the baseline year for the ILUTE model and population synthesis.

The Canadian Census has been conducted as a mail-back self-administered survey since 1971. Eighty percent of households receive a short survey known as the 2A form, while twenty percent receive an expanded version called the 2B form. In 1986, the census was conducted on the first Tuesday of June, which fell on the third day of the month. The 1986 Census was Canada's first full mid-decade census, and was very nearly cancelled due to reduced federal government expenditure in the early 1980s. It was reinstated, but with limited resources. As a result, some useful information was never fully coded or tabulated, such as the place-of-work. However, the provincial government in Ontario did pay for geocoding place-of-work for the entire province, and some tables with geographic distributions of employment do exist, although they can be difficult to obtain [47].

Census data is aggregated by persons, census families, or households and is released in three distinct forms. Profile tables are assembled for each question from the census, showing the breakdown of responses to a single question within a geographic area. Basic Summary Tabulations (BSTs) are cross-tabulations of two to four questions from the census, also including geographic variation. Profile tables and cross-tabulations may be derived from questions on the 2A or 2B forms, and may represent either a 100% sample or a 20% sample that has been expanded to a 100% basis. Finally, Statistics Canada also releases Public Use Microdata (PUMS): a 2% sample of all person responses, a 1% sample of family responses and a 1-4% sample of household responses. Each PUMS data file is associated with a single Census Metropolitan Area (CMA), a large geographic area of more than 100,000 persons that acts as the equivalent of the Public Use Micro Area (PUMA) in the U.S.; the data contains no information about spatial variation within the CMA.

Figure 3.1: The major groups within the Canadian census' person universe. The numbers in parentheses show the size of each grouping (thousands of persons) within the Toronto Census Metropolitan Area (CMA) in the 1986 census. Adapted from [42].


Table 3.1: Sample sizes of some data sources used for synthesis, at different levels of geography. This gives a sense of the sample size in PUMS and Summary Table data, both at a broad geographical scale (the Toronto CMA), and at the finer scale of Census Tracts (CT). The example CT 59.00 is a downtown zone neighbouring the University of Toronto. The BSTs that include an “A” are drawn from the Census 2A form and have a 100% sample, while the “B” tables have a 20% sample.
Source          Geography   Universe                      Sample %   Sample size
Persons
  PUMS          CMA         Non-inst. persons                   2%        67,992
  BST DM86A01   CMA         All persons                       100%     3,427,165
  BST SC86B01   CMA         Non-inst. persons, age 15+         20%       546,470
  BST LF86B04   CMA         Labour force                       20%       395,965
  BST DM86A01   CT 59.00    All persons                       100%         3,745
  BST SC86B01   CT 59.00    Non-inst. persons, age 15+         20%           653
  BST LF86B04   CT 59.00    Labour force                       20%           482
Census Families
  PUMS          CMA         Families in priv. dwellings         1%         9,061
  BST CF86A02   CMA         Families in priv. dwellings       100%       906,385
  BST CF86A02   CT 59.00    Families in priv. dwellings       100%           800
Dwellings / Households
  PUMS          CMA         Occupied private dwellings          1%        11,998
  BST DW86A01   CMA         Occupied private dwellings        100%     1,119,800
  BST DW86B02   CMA         Occupied private dwellings         20%       239,960
  BST DW86A01   CT 59.00    Occupied private dwellings        100%         1,130
  BST DW86B02   CT 59.00    Occupied private dwellings         20%           226


The PUMS data and different summary tables may be drawn from different samples. The population of persons can be broken down into many subgroups, some of which are shown in Figure 3.1. The 2A census form (100% sample) is collected for the full population, while the 2B form (20% sample) is collected only for the non-institutional population 15 years of age and over. Some summary tables are defined on the 2A universe, where exact population counts are available. Others are defined on the 2B universe, by expanding the 20% sample to an estimate of the complete 2B universe. Combining data from tables derived from the 2A and 2B samples can be challenging, because of their differing universes and errors in the 2B estimates. The PUMS uses a different sample again; it is defined on a 2% sample of the full population excluding institutional residents (and residents of incompletely enumerated Indian reserves, which are not an issue in the Toronto CMA). The sample sizes associated with different universes and tables are summarized in Table 3.1.

The universe of persons is slightly complicated. The 1986 census excluded non-permanent residents from all tables; this group includes foreign persons present on a student authorization, an employment authorization or a Minister's permit, as well as refugee claimants. These persons were included in 1991 and subsequent censuses, and they account for a sizeable fraction of the Toronto population. In 1991, there were 98,105 non-permanent residents in the Toronto CMA (2.5% of total); assuming a similar growth rate to the CMA as a whole, this would give approximately 89,000 in 1986. There is no data on this population, however. Institutional collective dwellings are defined as hospitals, orphanages, correctional/penal institutions and religious institutions, and the residents of these institutions (but not the staff) are excluded from many tables. Non-institutional collective dwellings are defined as hotels, motels, tourist homes, lodging- and rooming-houses, work camps, military camps and Hutterite colonies. Temporary residents are persons with a usual dwelling elsewhere in Canada living temporarily in another dwelling; they are usually treated as part of their “permanent” household. However, some dwellings are occupied only by temporary residents, and are a separate category from both occupied and unoccupied dwellings. Finally, foreign residents are foreign diplomats or military personnel stationed in Canada. Temporary, foreign and collective (non-institutional) residents are included in most person-based tables, but not in family, household or dwelling tables.

Statistics Canada makes some modifications to the collected census data before publishing tables. Contradictions in the submitted form are resolved using an edit and impute method. Furthermore, to protect the privacy of individual persons and households, Statistics Canada applies two disclosure control techniques. In any released table, all numbers are randomly rounded (up or down) to a multiple of five and in special cases to a multiple of ten. This is a stronger measure than that used in many countries; the UK and New Zealand use a multiple of three, and the American census does not use random rounding [18]. The UK and Australian agencies apply random rounding only to small cells, but the Canadian agency applies it to every cell in every table. In each reported table, the individual cells and the row and column totals are rounded independently using a procedure called Unbiased Random Rounding. The rounding tends toward the closer multiple of five, so a count of 4 has an 80% probability of being rounded to 5 and a 20% probability of being rounded to 0 [55]. The alternative is called unrestricted random rounding, where there is a fixed probability $p$ that a cell is rounded down, regardless of its value; typically, $p=0.5$ is used.1
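The two rounding schemes can be illustrated with a minimal Python sketch. This is not Statistics Canada's implementation: the function names are hypothetical, and it is assumed here that a count already equal to a multiple of the base is left unchanged.

```python
import random


def unbiased_random_round(count, base=5, rng=random):
    """Round to an adjacent multiple of `base`, favouring the closer one.

    The probability of rounding up is remainder/base, so the expected
    value of the rounded count equals the true count.
    """
    remainder = count % base
    if remainder == 0:
        return count  # assumption: exact multiples are not perturbed
    lower = count - remainder
    return lower + base if rng.random() < remainder / base else lower


def unrestricted_random_round(count, base=5, p_down=0.5, rng=random):
    """Round down with fixed probability `p_down`, regardless of the count."""
    remainder = count % base
    if remainder == 0:
        return count  # assumption: exact multiples are not perturbed
    lower = count - remainder
    return lower if rng.random() < p_down else lower + base


if __name__ == "__main__":
    rng = random.Random(1986)
    draws = [unbiased_random_round(4, rng=rng) for _ in range(100_000)]
    # The share of fives should be close to 0.8 for a true count of 4.
    print("share rounded to 5:", draws.count(5) / len(draws))
```

Because the unbiased scheme preserves expected values, sums of many rounded cells remain close to the true total, although any single cell can be off by up to four persons.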

Finally, in geographic areas with fewer than forty persons, no data is released; this is called area suppression. Additionally, in areas with fewer than 250 persons, no income data is released.

3.1 Family and Household Definitions

The Canadian census family and household definitions are generally intuitive, but some special cases are tricky. As the Census Handbook notes, “it is very difficult to translate complex human relationships into tables” [42].

Figure 3.2: A breakdown of the Canadian census' person universe, by family membership. The numbers in parentheses show the size of each grouping in thousands, aggregated into groupings of persons (P), dwellings/households (D), economic (EF) and census families (CF) within the Toronto Census Metropolitan Area in 1986. Not to scale. Adapted from [42].
$ ^*$ Relatives other than spouse, common-law partner or never-married sons and daughters.

The census distinguishes between two types of families: the “census family” defines a relationship between cohabiting adults and children, while the “economic family” defines other types of family relationships within a single dwelling. The details of family definition are complicated, particularly when considering cohabiting multigeneration families. The household definition is straightforward, consisting of all persons sharing a “dwelling unit;” there is a one-to-one relationship between households and occupied dwelling units. The dwelling unit definition is slightly more complicated, and is defined as living quarters with a private entrance from the outside or from a common hallway. More formally, Figure 3.2 graphically shows the relationship between the different types of family membership.

“People living in the same dwelling are considered a census family only if they meet the following conditions: they are spouses or common-law partners, with or without never-married sons or daughters at home, or a lone parent with at least one son or daughter who has never been married. The census family includes all blood, step- or adopted sons and daughters who live in the dwelling and have never married. It is possible for two census families to live in the same dwelling; they may or may not be related to each other” (quoted from the 1996 definition [49], which is essentially the same as the 1986 definition [42,45]).

No distinction is made between common-law and legal marriage; both are coded as “married.” While homosexual couples are recognized to exist, the census coding does not allow this type of family. Any household that reports a married/common-law couple with the same sex is recoded; either they are cohabiting unmarried individuals or the gender of one individual is changed, making it an opposite-sex marriage [45]. Finally, foster children are treated as lodgers rather than family members. Table 3.2 details an example household that illustrates several unusual aspects of these family definitions.


Table 3.2: An example household containing an unusual family structure. As shown in the census family column, there are three census families here, and three persons who are not in any census family. Marc does not belong to a census family because he is not a “never-married” child; Nicole is not in a census family because she is not a child of any person in the household; and Benjamin is a foster child and is hence treated as a lodger. The economic family column shows how these same persons can be grouped into two economic families, plus one non-family person (Benjamin). Source: [45].
Person     Age   Marital status   Relationship    Census family   Economic family
John       63    Now married      Person 1        1               A
Marie      59    Now married      Wife            1               A
Julie      37    Widowed          Daughter        2               A
Robert     12    Single           Grandchild      2               A
Lucie      09    Single           Grandchild      2               A
Marc       25    Separated        Son             -               A
Nicole     12    Single           Niece           -               A
Benjamin   14    Single           Lodger (ward)   -               -
Brian      24    Now married      Lodger          3               B
Janet      21    Now married      Lodger's wife   3               B
Jerry      03    Single           Lodger's son    3               B


The connection between households and families is also illustrated in Figure 3.2. Each “private household” occupies one dwelling, in the language of the census. This one-to-one relationship between private households and “occupied private dwellings” means that the household PUMS can be used as a PUMS for dwellings. Occupied private dwellings are only one part of the dwelling universe, but almost no data is available on other types of dwellings. The missing parts of this universe are collective dwellings, dwellings occupied by foreign/temporary residents, unoccupied dwellings, some marginal dwellings (e.g., cottages that are not occupied year-round), and some dwellings under construction or conversion.2

3.2 Agent Attributes


Table 3.3: Overview of Person attributes, showing the number of categories for the attributes in each data source. Each column describes a single multiway cross-tabulation derived from the given data source. The rest of the profile tables add no further information, and are not shown.
    Data Source (and sample size)
Attribute Description Profile 2B (20%) CF86A04 (100%) DM86A01 (100%) LF86B01 (20%) LF86B03 (20%) LF86B04 (20%) SC86B01 (20%) Person PUMS (2%)
AGEP Age   4$ ^\ast$ 16 6     6 c
CFSTAT Census Family Status   5           11
HLOSP Highest Level Of Schooling         7   6 12
LFACT Labour Force Activity       3 3     15
OCC81P Occupation           24   17
SEXP Sex 2 2 2 2 2 2 2 2
TOTINCP Total Income 11             c
CTCODE Census Tract 731 731 731 731 731 731 731  
     c continuous, discretized to integer; large number of categories
    $ ^\ast$ missing breakdown for a few cells.



Table 3.4: Overview of Census Family attributes, showing the number of categories for the attributes in each data source. While HHSIZE and HHNUMCF are not present in any family tables, they are present in the Person PUMS, which can be reweighted to a family universe for synthesis. The profile tables add no information beyond that already in the BSTs, and are not shown.
    Data Source  
    (and sample size)  
Attribute Description CF86A02 (100%) CF86A03 (100%) LF86B08 (20%) Family PUMS (1%) Reweighted Person PUMS
AGEF Age (female)       c c
AGEM Age (male)       c c
CFSIZE Census Family Size       7$ ^\dagger$ 7
CFSTRUC Census Family Structure 3 3   16 3$ ^\dagger$
CHILDA Number of Children 0-5   2 2 3  
CHILDB Number of Children 6-14   2 2$ ^\ddagger$ 4  
CHILDC Number of Children 15-17   2 $ ^\ddagger$ 3  
CHILDDE Number of Children 18-24, 25+   2 $ ^\ddagger$ 9  
HHSIZE Household Size         8
HHNUMCF Number of Families in Household         3
LFACTF Labour Force Activity (female)     3 13 15
LFACTM Labour Force Activity (male)       13 15
NUCHILD Number of Children 6 2 2 9 8$ ^\dagger$
ROOM Dwelling # of Rooms       10 10
TENURE Tenure       2 2
CTCODE Census Tract 731 731 731 731    
     c continuous, discretized to integer; large number of categories  
    $ ^\dagger$ inferred from other attributes  
    $ ^\ddagger$ 2 categories for “number of children ages 6 and higher”.  



Table 3.5: Overview of Household/Dwelling Unit attributes, showing the number of categories for the attributes in each data source. Each column shows a single data source's coverage of different attributes. Note that HHNUMCF is missing from the Household PUMS, but present in the Person PUMS, where it can be reweighted to a household or economic family universe. The profile tables add no information beyond that already present in the BSTs, and are not shown.
    Data Source (and sample size)
Attribute Description DW86A01 (100%) DW86A02 (100%) DW86B02 (20%) DW86B04 (20%) HH86A01 (100%) HH86A02 (100%) HH86B01+B02 (20%) Household PUMS (1–4%) Reweighted Person PUMS
BUILTH Dwelling Age     8         7  
DTYPEH Dwelling Type 4 4 4 4       8  
HHNUEF # Econ. Fam. in HH               2 2
HHNUMCF # Cens. Fam. in HH         3 3 3   3
HHSIZE Household Size   10       10   8 8
PAYH Monthly Dwell. Cost             5 c c
PPERROOM Persons Per Room       5       5$ ^\dagger$ 5$ ^\dagger$
ROOM Dwelling # of Rooms               10 10
TENURH Household Tenure 3       2   2 2 2
CTCODE Census Tract 731 731 731 731 731 731 731    
     c continuous, discretized to integer; large number of categories    
    $ ^\dagger$ inferred from other attributes    


The census offers a broad range of attributes that could be used in synthesis. Tables 3.3, 3.4 and 3.5 show the attributes selected for synthesis, and the relevant data sources that include these attributes.

Both the Household PUMS and the Family PUMS lack information on the number of census families sharing a dwelling, and the Family PUMS also lacks information about the household size. These attributes would be useful, but can fortunately be derived from another source: the Person PUMS. Suppose that we consider only the family persons in the Person PUMS, and treat each person as an observation of a census family. Then, the attributes from the Person PUMS could be used to derive information about census families. A similar procedure could be used to gain additional information about households.

However, persons in large families are over-represented in the Person PUMS. To see why, consider the complete population of families and persons, ignoring for the moment that the PUMS is only a small sample. A family of eight persons appears eight times in the person population, while a family of two persons appears twice. Large families are thus over-represented in the person population, but this can be corrected by weighting each observation in the person population by $1/\textsc{Cfsize}$, the inverse of the family size. In the PUMS sample itself, not every member of an eight-person family will be present, but large families will still be observed proportionately more often, and the same reweighting method can be applied to correct this.
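The reweighting argument can be checked on a toy population with a minimal sketch. The record layout (dictionaries keyed by attribute name) and the function name `family_distribution` are assumptions made for illustration, not the format of the actual PUMS files.

```python
from collections import defaultdict


def family_distribution(person_records, attribute):
    """Estimate the family-level distribution of `attribute` from person records.

    Each person is weighted by 1/CFSIZE, which cancels the fact that a
    family of size CFSIZE appears CFSIZE times among the persons.
    """
    weights = defaultdict(float)
    for person in person_records:
        weights[person[attribute]] += 1.0 / person["CFSIZE"]
    total = sum(weights.values())
    return {value: weight / total for value, weight in weights.items()}


# Toy population: one family of eight persons and three families of two.
persons = [{"CFSIZE": 8}] * 8 + [{"CFSIZE": 2}] * 6

unweighted_share_of_8 = sum(p["CFSIZE"] == 8 for p in persons) / len(persons)
weighted = family_distribution(persons, "CFSIZE")

print("person share in 8-person families:", unweighted_share_of_8)  # 8/14
print("family share with CFSIZE = 8:", weighted[8])
print("family share with CFSIZE = 2:", weighted[2])
```

In this toy case, eight of fourteen persons live in the large family, yet it is only one of four families; the weighted estimate recovers the one-in-four share. The same weights can be applied to any other attribute carried on the person record, such as HHSIZE or HHNUMCF.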

3.3 Exploration of a Summary Table


Table 3.6: The contents of the SC86B01 summary table: population by sex, age and highest level of schooling. Since this table is derived from a 20% sample, these counts have been expanded by a factor of five from the original sample.
                                      Age
Sex      Highest Level of Schooling   15-24     25-34    35-44    45-54    55-64    65+
Female   Less than grade 9            6,440     14,330   30,050   41,980   47,515   69,550
         Grades 9-13                  110,165   58,255   52,950   47,170   48,870   50,600
         High school                  50,930    51,645   36,085   22,540   19,300   18,425
         Trades and non-uni           58,650    92,025   68,655   43,035   31,550   26,760
         University w/o degree        35,570    36,250   28,900   13,685   10,005   8,380
         University w/ degree         18,410    68,395   44,060   16,060   8,625    6,340
Male     Less than grade 9            8,035     11,575   23,565   37,025   41,335   43,465
         Grades 9-13                  128,325   57,110   41,030   37,470   35,780   30,505
         High school                  46,955    34,400   21,725   14,985   12,580   10,575
         Trades and non-uni           48,870    89,200   69,015   48,350   35,765   21,055
         University w/o degree        36,505    39,805   31,245   15,990   12,240   8,325
         University w/ degree         14,735    72,130   64,040   30,060   18,755   12,105


Figure 3.3: A mosaic plot showing the breakdown of the SC86B01 summary table: population by sex, age and highest level of schooling. Mosaic plots are useful tools for visualizing the breakdown of categories in low-dimensional contingency tables [22]. As usual for these plots, the area of each box represents the number of persons with a given sex, age and schooling. The difference in age breakdown between the two genders can be easily seen, and the differences in the schooling breakdown between each age group can also be seen. Shading has been added to make it easier to see similar schooling levels.

To help understand the census data (and contingency tables in general), a brief examination of a single summary table is useful. This exploration focuses on the SC86B01 table, a summary table that cross-classifies age, sex and education by zone. The study area is the Toronto CMA, and the geography has been simplified to a set of twelve zones. Table 3.6 shows the counts in SC86B01, excluding the geographic breakdown. Figure 3.3 shows the same information graphically.

What are the statistical properties of this table? Is there statistically significant association between these variables? Is there significant geographic variation? A log-linear model can be used to answer these questions. In the following, the variables $ W(h)$, $ X(i)$, $ Y(j)$ and $ Z(k)$ will be used to represent gender, age, level of schooling and zone respectively.

First, to consider statistically significant association between the variables (excluding geography), a hierarchy of models can be constructed. The final model in this hierarchy $ (\mathit{WXY})$ defines all-way association between the non-geographic variables, and is given by

$\displaystyle \log\left(\hat{N}_{hijk}/5\right) = \lambda + \lambda_W + \lambda_X + \lambda_Y + \lambda_{\mathit{WX}} + \lambda_{\mathit{WY}} + \lambda_{\mathit{XY}} + \lambda_{\mathit{WXY}}$ (18)


Table 3.7: Series of log-linear models to test for association between gender, age and highest level of schooling in SC86B01 table. Each row shows a model that adds one term to the model in the previous row. (The complete model is shown using symbols $ W$, $ X$ and $ Y$ for compactness, but these correspond to SEXP, AGEP and HLOSP.) The statistical significance of each model is tested using the chi-square statistic between adjacent rows; all models are significant at the 99% level.
Model   New term   Df   Deviance ($\Delta G^2$)   Residual Df   Residual Deviance ($G^2$)
NULL       863 628862
$ (W)$ SEXP 1 568 862 628294
$ (W,X)$ AGEP 5 40637 857 587657
$ (W,X,Y)$ HLOSP 5 63315 852 524342
$ (\mathit{WX},Y)$ $ \textsc{Sexp}\times \textsc{Agep}$ 5 1537 847 522806
$ (\mathit{WX},\mathit{WY})$ $ \textsc{Sexp}\times \textsc{Hlosp}$ 5 4338 842 518467
$ (\mathit{WX},\mathit{WY},\mathit{XY})$ $ \textsc{Agep}\times \textsc{Hlosp}$ 25 102424 817 416043
$ (\mathit{WXY})$ $ \textsc{Sexp}\times \textsc{Agep}\times \textsc{Hlosp}$ 25 3525 792 412517


To test for this three-way association, the $G^2$ statistics of model $(\mathit{WXY})$ and the restricted model $(\mathit{WX},\mathit{WY},\mathit{XY})$ are compared and tested using a chi-squared distribution. Because SC86B01 is derived from a 20% sample that was expanded to 100%, the counts must be deflated by a factor of five before estimating the models. Table 3.7 shows the complete series of models leading up to $(\mathit{WXY})$. Each model in the series exhibits statistically significant improvement in fit over the previous model. Consequently, we can reject the hypothesis that there is no three-way association between gender, age and highest level of schooling.


Table 3.8: Series of log-linear models testing association in SC86B01, including geography. Each row shows a model that adds one term to the model in the previous row. The statistical significance of each model is tested using the chi-square statistic between adjacent rows; all models are significant at the 99% level.
Model   New term   Df   Deviance ($\Delta G^2$)   Residual Df   Residual Deviance ($G^2$)   $P(>\vert X\vert)$
NULL       863 628862 0
$ (W)$ SEXP 1 568 862 628294 0
$ (W,X)$ AGEP 5 40637 857 587657 0
$ (W,X,Y)$ HLOSP 5 63315 852 524342 0
$ (W,X,Y,Z)$ ZONE 11 381164 841 143178 0
$ (\mathit{WX},Y,Z)$ $ \textsc{Sexp}\times \textsc{Agep}$ 5 1537 836 141641 0
$ (\mathit{WX},\mathit{WY},Z)$ $ \textsc{Sexp}\times \textsc{Hlosp}$ 5 4338 831 137303 0
$ (\mathit{WX},\mathit{WY},\mathit{XY},Z)$ $ \textsc{Agep}\times \textsc{Hlosp}$ 25 102424 806 34878 0
$ (\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY})$ $ \textsc{Sexp}\times\textsc{Zone}$ 11 207 795 34671 0
$ (\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY},\mathit{XZ})$ $ \textsc{Agep}\times\textsc{Zone}$ 55 11130 740 23541 0
$ (\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY},\mathit{XZ},\mathit{YZ})$ $ \textsc{Hlosp}\times\textsc{Zone}$ 55 15950 685 7591 0
$ (\mathit{WXY},\mathit{WZ},\mathit{XZ},\mathit{YZ})$ $ \textsc{Sexp}\times \textsc{Agep}\times \textsc{Hlosp}$ 25 3520 660 4071 0
$ (\mathit{WXY},\mathit{WXZ},\mathit{YZ})$ $ \textsc{Sexp}\times\textsc{Agep}\times\textsc{Zone}$ 55 304 605 3767 0
$ (\mathit{WXY},\mathit{WXZ},\mathit{WYZ})$ $ \textsc{Sexp}\times\textsc{Hlosp}\times\textsc{Zone}$ 55 733 550 3034 0
$ (\mathit{WXY},\mathit{WXZ},\mathit{WYZ},\mathit{XYZ})$ $ \textsc{Agep}\times\textsc{Hlosp}\times\textsc{Zone}$ 275 2573 275 461 0
$ (\mathit{WXYZ})$ $ \textsc{Sexp}\times\textsc{Agep}\times\textsc{Hlosp}\times\textsc{Zone}$ 275 461 0 0 0


In a second series of models, the influence of geography is included. (In this analysis, the simplified 12-zone representation of geography is used; the full 731-zone system cannot be analyzed with a log-linear model, due to the memory requirements of generalized linear model estimation.) Table 3.8 shows the series of log-linear models leading up to the saturated model $(\mathit{WXYZ})$. As shown, every model is statistically significant with respect to the next simplest model; we can therefore conclude that there is significant four-way association in this dataset. Furthermore, the $(\mathit{WX},\mathit{WY},\mathit{XY},Z)$ model describes about 94% of the deviance in the data (its residual deviance is 34,878 against 628,862 for the NULL model); while the higher-order geographic associations are statistically significant, they are responsible for only a small part of the total deviance.


Table 3.9: Series of log-linear models testing association in SC86B01 relative to PUMS. The left hand side of the model is the ratio of the SC86B01 count to the PUMS count for the same cell.
Model   New term   Df   Deviance ($\Delta G^2$)   Residual Df   Residual Deviance ($G^2$)   $P(>\vert X\vert)$
NULL       863 413094  
$ (W)$ SEXP 1 0.07 862 413094 1
$ (W,X)$ AGEP 5 7 857 413088 0.23
$ (W,X,Y)$ HLOSP 5 11 852 413077 0.05
$ (W,X,Y,Z)$ ZONE 11 381164 841 31912 0
$ (\mathit{WX},Y,Z)$ $ \textsc{Sexp}\times \textsc{Agep}$ 5 2 836 31910 1
$ (\mathit{WX},\mathit{WY},Z)$ $ \textsc{Sexp}\times \textsc{Hlosp}$ 5 72 831 31838 0
$ (\mathit{WX},\mathit{WY},\mathit{XY},Z)$ $ \textsc{Agep}\times \textsc{Hlosp}$ 25 280 806 31558 0
$ (\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY})$ $ \textsc{Sexp}\times\textsc{Zone}$ 11 207 795 31351 0
$ (\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY},\mathit{XZ})$ $ \textsc{Agep}\times\textsc{Zone}$ 55 11130 740 20221 0
$ (\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY},\mathit{XZ},\mathit{YZ})$ $ \textsc{Hlosp}\times\textsc{Zone}$ 55 15945 685 4276 0
$ (\mathit{WXY},\mathit{WZ},\mathit{XZ},\mathit{YZ})$ $ \textsc{Sexp}\times \textsc{Agep}\times \textsc{Hlosp}$ 25 205 660 4071 0
$ (\mathit{WXY},\mathit{WXZ},\mathit{YZ})$ $ \textsc{Sexp}\times\textsc{Agep}\times\textsc{Zone}$ 55 304 605 3767 0
$ (\mathit{WXY},\mathit{WXZ},\mathit{WYZ})$ $ \textsc{Sexp}\times\textsc{Hlosp}\times\textsc{Zone}$ 55 733 550 3034 0
$ (\mathit{WXY},\mathit{WXZ},\mathit{WYZ},\mathit{XYZ})$ $ \textsc{Agep}\times\textsc{Hlosp}\times\textsc{Zone}$ 275 2573 275 461 0
$ (\mathit{WXYZ})$ $ \textsc{Sexp}\times\textsc{Agep}\times\textsc{Hlosp}\times\textsc{Zone}$ 275 461 0 0 0


The final set of models, shown in Table 3.9, simulates the effect of using IPF with a particular set of margins from SC86B01. The left-hand side follows equation (2.18), dividing the fitted margins of SC86B01 ($\hat{N}_{hijk}/5$) by the PUMS counts ($n_{hij}$):

$\displaystyle \log\left(\frac{\hat{N}_{hijk}/5}{n_{hij}}\right) = \lambda + \lambda_W + \lambda_X + \lambda_Y + \lambda_Z + \cdots$ (19)

(In practice, the PUMS term $ n_{hij}$ is used as an offset to the generalized linear model.) This series of models also shows statistically significant improvements, except for the first few terms. Effectively, the series shows the amount of information that SC86B01 adds beyond what is already available in the PUMS. The low deviance associated with the one-way models indicates that the 20% SC86B01 sample adds little information to the 2% PUMS sample of these variables. Terms involving ZONE, by contrast, add a lot of information, since the PUMS includes no geographic variation. The main difference between Tables 3.8 and 3.9 is that the deviance associated with terms that do not involve ZONE drops by 92% or more when the PUMS is included, and usually drops by more than 99%. Much of the non-geographic information is already present in the PUMS.
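To make the offset formulation explicit, equation (19) can be rearranged so that the PUMS count enters the right-hand side as a term with a fixed coefficient of one. This is only an algebraic restatement of equation (19), using the same symbols:

```latex
\log\left(\hat{N}_{hijk}/5\right)
  = \log n_{hij} + \lambda + \lambda_W + \lambda_X + \lambda_Y + \lambda_Z + \cdots
```

Read this way, the $\lambda$ terms measure only what SC86B01 adds beyond the PUMS pattern $n_{hij}$, which is why the terms that do not involve ZONE carry so little deviance in Table 3.9.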

Furthermore, the inclusion of higher-order interactions shows diminishing returns in terms of the explained deviance in Table 3.9. The one-way model $(W,X,Y,Z)$ explains 92.3% of the deviance in the NULL model. Of the remaining 7.7% of total deviance, the two-way model $(\mathit{WX},\mathit{WY},\mathit{WZ},\mathit{XY},\mathit{XZ},\mathit{YZ})$ explains 86.6%. Of the final 1.0% of total deviance, the three-way model $(\mathit{WXY},\mathit{WXZ},\mathit{WYZ},\mathit{XYZ})$ explains 89.2% and the four-way model $(\mathit{WXYZ})$ explains the final 10.8%. The total deviance does depend on the choice of variables and the fineness of the categories in the table, but this trend of diminishing returns is interesting. It suggests that the available census data (largely describing lower-order interactions, with only a few higher-order interactions, apart from the 2%-sample PUMS) may capture the bulk of the actual information about the population. However, this single table is clearly not sufficient to say anything conclusive.
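The percentages in this paragraph follow directly from the residual deviances in Table 3.9, as the short calculation below shows. The dictionary keys and the helper name `share_explained` are labels chosen for this sketch.

```python
# Residual deviances (G^2) read from Table 3.9.
G2 = {
    "null": 413094,
    "one_way": 31912,    # (W,X,Y,Z)
    "two_way": 4276,     # (WX,WY,WZ,XY,XZ,YZ)
    "three_way": 461,    # (WXY,WXZ,WYZ,XYZ)
    "four_way": 0,       # (WXYZ), saturated
}


def share_explained(before, after):
    """Fraction of the deviance `before` that is removed by the richer model."""
    return (before - after) / before


one_way = share_explained(G2["null"], G2["one_way"])
left_after_one_way = G2["one_way"] / G2["null"]
two_way = share_explained(G2["one_way"], G2["two_way"])
left_after_two_way = G2["two_way"] / G2["null"]
three_way = share_explained(G2["two_way"], G2["three_way"])
four_way = G2["three_way"] / G2["two_way"]  # what the saturated model still adds

print(f"one-way explains   {100 * one_way:.1f}% of the NULL deviance")
print(f"remaining          {100 * left_after_one_way:.1f}% of total")
print(f"two-way explains   {100 * two_way:.1f}% of that remainder")
print(f"remaining          {100 * left_after_two_way:.1f}% of total")
print(f"three-way explains {100 * three_way:.1f}% of that remainder")
print(f"four-way explains  {100 * four_way:.1f}% of that remainder")
```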

In closing, this analysis has focused on a single contingency table, SC86B01. No attempt was made to find the best model for SC86B01, particularly in terms of model parsimony; instead, the analysis demonstrated that statistically significant higher-order interactions are present in the data. Furthermore, it is not possible to apply this type of log-linear analysis to multiple contingency tables, although if multiple tables are combined using a fitting procedure the result could be analyzed. The largest limitation, however, is one of software: log-linear analysis is not generally feasible with high-dimensional tables. Nevertheless, the analysis provides valuable insight about the utility of information recorded in contingency tables.