Constructing a Model to Study the Built Environment

This summer we are pleased to share research highlights from our Spatial Data Science Summer Fellows, showcasing their work on the US COVID Atlas and opioid risk environment research projects. 

This week, Built Environment Data Science Fellow Christian Villanueva (BA ’22) shares an update on their work developing a model for measuring the built environment’s impact on opioid use in New Jersey.

This summer, as a HEROP Spatial Data Science Summer Fellow, I have been organizing, cleaning, and compiling data into one consolidated dataset that will be used to study links between the built environment and the ongoing opioid epidemic in New Jersey. Communities throughout the United States have struggled with substance abuse over the last several decades, and the hope of this project is to understand which features of an environment may affect this public health crisis. Our group supports a National Institute on Drug Abuse (NIDA) study directed by Dr. Barbara Tempalski at NDRI-USA, “Developing a public health measure of built environment to assess risk of non-medical opioid use and related mortality in urban and non-urban areas in New Jersey.” Our contributions include conceptual model consideration; data cleaning, wrangling, and integration; and index generation. Our team at the University of Chicago leads the data warehouse collection and integration efforts, as well as some exploratory spatial data analysis, to better understand why opioid overdose events occur in some places and not others.

The first stage of our work involved collecting data at the tract level from public sources, including state and federal offices, the US Census and American Community Survey (ACS), and the Substance Abuse and Mental Health Services Administration (SAMHSA), among others. The challenge at this stage was to generate municipal-level data from Census tract-level data. Tract-level data can be incredibly useful for investigating fine-scale phenomena or studying a smaller area; however, for our purposes, the municipal level is more appropriate. New Jersey’s municipalities are political boundaries that separate one local government from the next. Census tracts, on the other hand, can often cross political boundaries or subdivide one political unit. New Jersey has 565 municipalities, compared with 2,010 census tracts and 21 counties (which are far larger than municipalities). The municipal level is therefore both a useful scale for spatial analysis in New Jersey and a meaningful political and administrative boundary, understood by the public and policymakers.

To generate municipal level data, we aggregated tract level data using a “crosswalk” file created by former team member Gabe Morrison (BA ‘21, MSCAPP ‘22).  Geographic crosswalks can be useful for relating one spatial scale or area to a different spatial scale or area — in this case, relating Census tracts (small, Census population-based boundaries) to municipalities (varying sizes, political and at times geographical boundaries). This crosswalk describes the relationship of tracts to municipalities by breaking down the percentage of each tract within a municipality. Take this simplified map as an example:

There are two municipalities (M1 and M2) and two tracts (T1 and T2). In the crosswalk, T1 would be described as 50% in M1 and 50% in M2, and T2 would be described as 75% in M1 and 25% in M2. This breakdown allows us to weight any tract-level data by its municipal proportion. An obvious application is estimating populations: if we know that 100 people live in each of Census tracts T1 and T2, we can weight them by the crosswalk to find that 125 people live in M1 (50 from T1 plus 75 from T2) and 75 people live in M2 (50 from T1 plus 25 from T2). Another application is weighting resource distances. For example, a syringe exchange site may be x miles away from the centroid (geometric center) of T2, which falls in M1. This distance can then be weighted by the crosswalk to estimate a value for the portion of the tract in M2. This method allows us to generate values for both sides of tracts that straddle municipal boundaries; without weighting, we may represent the data inaccurately. Take T2: if we did not weight it, we would use its centroid to assign it to M1 and would then ascribe all of its population, resources, and features to M1 exclusively, though the map shows that this should not be the case. This method is not without its shortcomings, however, as it assumes populations are evenly spread within each tract and uses a geometric centroid rather than a population centroid.
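The population example above can be sketched in a few lines of Python. This is only an illustration of the weighting idea, not the project's actual crosswalk or code: the `CROSSWALK` rows, the `aggregate_to_municipalities` function, and the tract populations are all hypothetical and simply mirror the two-tract map.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (tract, municipality, share of the tract
# that falls inside that municipality). Shares mirror the example map.
CROSSWALK = [
    ("T1", "M1", 0.50),
    ("T1", "M2", 0.50),
    ("T2", "M1", 0.75),
    ("T2", "M2", 0.25),
]


def aggregate_to_municipalities(tract_values, crosswalk):
    """Apportion a tract-level count to municipalities using crosswalk shares."""
    totals = defaultdict(float)
    for tract, municipality, share in crosswalk:
        totals[municipality] += tract_values[tract] * share
    return dict(totals)


# Assumed populations: 100 people in each tract, as in the text.
population = {"T1": 100, "T2": 100}
municipal_population = aggregate_to_municipalities(population, CROSSWALK)
print(municipal_population)  # {'M1': 125.0, 'M2': 75.0}
```

Summing weighted values like this suits counts such as population. Measures that are averages rather than totals, such as distances, would need a weighted mean instead of a weighted sum.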

Once data was aggregated and brought to the appropriate level, we compiled a master dataset containing 70 variables for 565 municipalities. These variables fall broadly into six categories which characterize the built environment in New Jersey: physical elements, residential elements, commercial elements, health services, strength of community participation, and strength of community economy.

Our initial exploratory data analysis suggests several interesting insights. As a part of my summer fellowship, I conducted a principal component analysis (PCA) to reduce the dimensions of the 70 variables. PCA is useful for large datasets like this, where we want to preserve as much of the variance as possible without having to deal with a large number of variables.

Preliminary exploratory findings on explanatory power from the principal component analysis (PCA)

This plot shows that just under 50% of the variance in the data can be explained by two components, and about 77% of the variance of the whole dataset is explained by 10 components. While we might have hoped for a higher share of explained variance, these results still suggest that there is a way to reduce our dimensions. This will inform the next steps of the PCA, where we look at identifying the variables that contribute to each component. We will also look to omit variables with a high number or rate of missing values, with the goal of generating stronger PCA results with less noise.
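To make the two ideas in this paragraph concrete, here is a minimal Python sketch of how one might read cumulative explained variance off a list of per-component ratios, and how variables with many missing values might be screened out before rerunning a PCA. The function names, the example ratios, and the missing-value cutoff are all assumptions made for illustration; they are not the study's results or its actual code.

```python
from itertools import accumulate


def components_needed(explained_ratios, target):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`, or None if it never does."""
    for count, cumulative in enumerate(accumulate(explained_ratios), start=1):
        if cumulative >= target:
            return count
    return None


def drop_sparse_variables(columns, max_missing_rate):
    """Keep variables whose share of missing (None) values is at or below
    the cutoff. Assumes every variable has at least one observation."""
    kept = {}
    for name, values in columns.items():
        missing_rate = sum(v is None for v in values) / len(values)
        if missing_rate <= max_missing_rate:
            kept[name] = values
    return kept


# Illustrative ratios only, NOT the values from the plot described above.
example_ratios = [0.30, 0.18, 0.08, 0.06, 0.05]
print(components_needed(example_ratios, 0.45))
```

In practice the per-component ratios would come from whatever PCA routine is used, and the missing-value cutoff is a judgment call that depends on the dataset.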

Preliminary exploratory plot on variables from dimension 1 and dimension 2 of the PCA.

The next steps for our group are to further refine NJ medical examiner data and aggregate all unintended deaths from opioid use by municipality. From there, we can run statistical tests to determine whether correlations exist between increased deaths and any built environment factors. In future explorations, our group can create heatmaps showing the distribution of these deaths across spatial scales to identify hotspots. We look forward to continuing to collaborate with our research partners on this analysis.

This post was written by Christian Villanueva (BA ‘22), HEROP Spatial Data Science Summer Fellow.

Featured image at top courtesy of the Fourth National Climate Assessment (NCA4) of the U.S. Global Change Research Program (USGCRP).
