Explorations into US COVID Atlas Data Infrastructure

This summer we are pleased to share research highlights from our Spatial Data Science Summer Fellows, showcasing their work on the US COVID Atlas and multiple opioid risk environment research projects. 

This week, Spatial Infrastructures Data Engineering Fellow Jinfei Zhu (MAPPS ‘22) shares an update on her work with the US COVID Atlas. This piece has been cross-posted from Atlas Insights, the US COVID Atlas blog. 

As a Spatial Data Science Summer Fellow with the Healthy Regions & Policies Lab, I am spending my summer researching and implementing data infrastructure solutions to support the development of the US COVID Atlas. We’ve been busy exploring options to support the growth of the Atlas — and open-source web mapping applications in general — as the Atlas expands its database of COVID and community data and interactive features.

The US COVID Atlas is a free, open-source data visualization tool tracking county- and state-level COVID-19 outcomes, including confirmed cases, deaths, vaccinations, and forecasting. It also links COVID data with community health indicators, such as food insecurity rates, median income levels, and other socioeconomic characteristics and social determinants of health. The Atlas features data from multiple sources, such as the CDC, USA Facts, The New York Times and 1Point3Acres. Data is available from January 21, 2020 through today, providing a historical and holistic view of the pandemic through reliable data sources.

There are two main ways for users to interact with data through the US COVID Atlas —  the API and the web application. On the Atlas backend, data are scraped daily as scheduled tasks running on AWS Elastic Container Service and are updated to Amazon Simple Storage Service (S3) and the public GitHub repository as CSV files. When users make an API call, they will trigger an AWS Lambda function that will read data from S3 bucket and return data to users. The web application is hosted on Netlify.com, which is a static website host and copies the whole GitHub repository to deploy the web page. Many web application features are powered by GeoDa library, such as Hotspot Map and binning calculations.

Data Infrastructure Challenges

The webpage now runs smoothly with jsGeoDa, a javascript library for spatial data analysis using webAssembly. However, taking a long-term view of the Atlas application , there are several bottlenecks: first, the API and web application read data separately, one from AWS S3 and the other from GitHub repository; second, the data still increases daily but the webpage always pulls the whole csv files that contain all data. Even though the webpage can currently load data quickly, in the future, as our database grows, this process will become slower and slower. Also, the current API processing is very slow, sometimes taking up to 10 minutes.


Current data infrastructure for the US COVID Atlas

In light of these challenges, our team has been exploring alternate data infrastructure solutions. The ideal solution could provide data to both web application and API and increase the API reading speed. Additionally, the web application would be able to pull only a piece of data at first, then load all the data — ensuring that the time to open the atlas would not increase much, even as the size of data grows.

One possible solution is to change the data storage formats. Currently, data are stored in CSV (Comma Separated Value) files. This makes data easy to read and parse. However, the types of data need to be inferred — the file doesn’t directly tell us if a column contains string, float, integer or spatial data. The reading speed is also slow.

JSON (JavaScript Object Notation) is another common data storage format and data is stored in key-value pairs. JSONs are easy to share and widely accepted. On the other hand, the data types in this format are not always read clearly, and the repeated keys for each row lead to large file size.

Potential Solutions

Apache Parquet is a relatively new data storage format for big data. It’s smaller and has high query speed with Apache Spark kernel. However, it is mainly for enterprise-level big data of hundreds of gigabytes (GB) or terabytes (TB). The COVID data, though increasing, is less likely to reach that level in the future. (The current data size is only 120 MB.) In a test, converting CSV files to Parquet files doesn’t increase the performance much.

Protocol Buffer is a binary file format with high reading speed. Although it cannot be opened by text editors or read by humans directly, it has fully typed data, is smaller and data is compressed automatically.

Another possible solution is to use a relational database to store all data. The GeoDa ecosystem provides an extension for PostgreSQL database called PostGeoDa, which contains all the functions we need for spatial data analysis. The disadvantage of using a database to manage all data is it’s hard to share due to different database dialects.

Therefore, a cloud (easy to share, simultaneous connection) PostgreSQL-like (fully typed data format) database solution that supports PostGeoDa extension (provides customized spatial data analysis functions) would be an ideal solution!

We have reviewed and weighed the advantages and disadvantages of many cloud database providers, including PlanetScale, Vitess, yugabyteDB, Supabase, Hasura, Cockroach, Google Spanner, Amazon Aurora and Amazon RDS for PostgreSQL. However, unfortunately, none of them support PostGeoDa since it’s still a relatively new extension. That means that for now, we have to keep the front-end calculations and lambda function structure. But we could still try a fast API query method and set the web application read data from the same place.

Enter Google BigQuery, which is Google’s enterprise data warehouse for data analysis. It’s different from Google Spanner, which is for massive scale data such as transactional data. The focus of BigQuery is data analysis, query, and sharing. It has been used by 34 institutions to host public COVID data tools, such as New York Times, USA Facts and CDC.

Google BigQuery platform has many advantages. First, it is serverless, so users don’t need to worry about low-level configuration. Second, it is well established through Google, so likely will continue to grow in the future.. Third, the Atlas database table could be updated directly with Python/Node.js using the BigQuery library. Fourth, it offers both SQL and API options. This means that future US COVID Atlas users could not only retrieve data with API calls, but also SQL queries. Lastly, it could potentially interact directly with Netlify (Atlas website host).

However, Google BigQuery is not a panacea. First, it doesn’t support PostGeoDa, at least for now. Second, the data migration process is slow and requires plenty of configurations. Thirdly, it does not support replacing rows. So, if we wanted to make a certain update or adjustment to historical data, we would have to replace the whole table. Finally, the pricing strategies are complex. There are different policies for data storage, steaming inserts, data query, PostGIS extension, DML, memory, disk, CPU, etc. Though it provides 10 GB free storage and 1 TB free query data for each user per month, it still takes a while to figure out how much it will cost.

Potential future data infrastructure for the US COVID Atlas

With the help of Google BigQuery, the possible future data infrastructure will be like the above diagram. Both users and the web application can read data directly from the database.

Currently, the team is still exploring the best way to migrate data to BigQuery, but a test with other public COVID data shows that the query speed will probably cost no more than a few seconds — a very promising result.

The team’s next steps will include designing the database schema and transferring all data files to BigQuery, setting automatic updates, integrating API to database, exploring the way to directly interact with BigQuery from Netlify.

Overall, it takes exploration to find an efficient way to store the US COVID Atlas data, which is still increasing everyday. In this process, we want our solution to be simple, integrated, and fast. This question is open-ended and requires us to try different new platforms in the cloud. But luckily, the fast-developing technology provides many possible solutions and we can find the one that best fits our needs.

Jinfei Zhu is a student in the Master of Arts Program in Computational Social Science (MAPSS) at the University of Chicago. Her recent research interests include analyzing and visualizing COVID-19 geospatial data and optimizing data solutions for transferring and utilizing time-series geospatial data.

%d bloggers like this: