A Satellite Imagery Dataset for Long-Term Sustainable Development in United States Cities
We aim to provide a comprehensive and representative dataset that pairs satellite imagery with corresponding SDG indicators over a long time span and at multiple spatial scales. To ensure that the indicators thoroughly depict sustainable urban development, we select five SDGs: SDG 1 No poverty (five indicators), SDG 3 Good health and well-being (five indicators), SDG 4 Quality education (five indicators), SDG 10 Reduced inequalities (two indicators), and SDG 11 Sustainable cities and communities (eight indicators). The dataset generation process consists of collecting, processing, and aligning multi-source data, and the overall workflow is presented in Fig. 2. First, we select the 100 most populated cities and gather the corresponding CBG/city boundaries. Second, we collect satellite imagery, population, NTL, OSM, and ACS data from multiple sources. Finally, we process the multi-source data and produce the final output data at the CBG and city levels, containing basic geographic statistics, satellite imagery attributes, and SDG indicators.
Determining the areas of interest and extracting boundaries
We select the 100 most populated cities in the contiguous United States based on the ACS 2021 population data33. The population of these cities ranges from 222,194 to 8,467,513, with a mean of 642,002. The cities of interest and their populations are listed in descending order in Table 1.
Then we collect city geographic boundary files from the U.S. Census Bureau TIGER/Line shapefiles34. The shapefiles are divided by state, and each shapefile contains the city name (called “place” in the file), state name, Federal Information Processing Standard state code, and geographical boundary coordinates. We use the Python package shapely to access the shapefiles and extract the boundary coordinates using the city and state names. The geographic coordinate system is WGS84.
Next, we determine the corresponding CBGs within the cities. The boundaries of all CBGs in the U.S. are gathered from SafeGraph Open Census Data7. The CBG boundaries are identical for the years 2014 to 2019, and the U.S. government adjusted the CBG boundaries for the year 2020.
For each city, we overlay the CBG boundaries on the city boundary, and every CBG whose intersection with the city boundary covers more than 10% of that CBG's area is considered contained in the city. This process uses the Python packages shapely and geopandas. The geographic lookup table between cities and corresponding CBGs is shown in Table 2. At this point, we have the 100 most populated cities and the CBGs spatially contained in them as the areas of interest in our target dataset.
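The 10% overlap rule can be illustrated with a small standard-library sketch. This is not the authors' code: the pipeline uses shapely and geopandas, whereas the snippet below substitutes Sutherland-Hodgman clipping, which is only valid when the city polygon is convex and listed counter-clockwise. The function names and toy coordinates are hypothetical, and areas are computed on planar coordinates, so a real run would first project the WGS84 boundaries.

```python
def shoelace_area(poly):
    """Planar area of a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_to_convex(subject, clip):
    """Sutherland-Hodgman clipping; `clip` must be convex and counter-clockwise."""

    def inside(p, a, b):
        # Left of (or on) the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        if not output:
            break
        a, b = clip[i], clip[(i + 1) % len(clip)]
        points, output = output, []
        for j in range(len(points)):
            p, q = points[j - 1], points[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
    return output


def overlap_ratio(cbg, city):
    """Share of the CBG's area that falls inside the city polygon."""
    inter = clip_to_convex(cbg, city)
    if len(inter) < 3:
        return 0.0
    return shoelace_area(inter) / shoelace_area(cbg)


def cbg_in_city(cbg, city, threshold=0.10):
    """Apply the rule from the text: keep the CBG if more than 10% of it overlaps."""
    return overlap_ratio(cbg, city) > threshold


# Toy example: a square "city" and a CBG that straddles its eastern edge.
city = [(0, 0), (10, 0), (10, 10), (0, 10)]
straddling_cbg = [(8, 0), (12, 0), (12, 4), (8, 4)]
print(overlap_ratio(straddling_cbg, city), cbg_in_city(straddling_cbg, city))
```

Note that the ratio is taken relative to the CBG area, not the city area, so small CBGs on the city edge are judged by how much of themselves lies inside.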
Processing of satellite imagery
Satellite imagery provides a near real-time bird’s-eye view of the earth’s surface. Combined with machine learning techniques, satellite imagery has been widely used in predicting socioeconomic status, especially in urban research, which includes poverty/asset prediction11,14,17, urban pattern mining15, commercial activity prediction16,35, and population prediction12,36. Inspired by the interpretable feature generation from satellite imagery14, we provide satellite imagery visual attributes in our dataset to promote the research of SDG monitoring. The processing of satellite imagery consists of three parts: imagery collection, object detection, and semantic segmentation.
First, we collect the satellite imagery in our dataset from Esri World Imagery37, which gives users access to versions of the World Imagery created over time. The imagery is in RGB format, collected from different satellites, and offered at different spatial resolutions indexed by zoom level; each zoom level splits the entire world into a different number of tiles. The imagery collection process consists of generating image tile numbers from the boundary of each city and the desired zoom level (spatial resolution), and then downloading the images with those tile numbers from the satellite imagery archive. In our target dataset, we set the zoom level to 19, which is about 0.3 m/pixel. We use the June release of the Esri World Imagery archive for each year from 2014 to 2023 to collect the satellite images of the 100 most populated cities, which yields 12,269,976 images per year.
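As a rough illustration of the tile-number generation step, the sketch below converts a bounding box into tile indices and estimates the ground resolution at a given zoom level. It assumes the standard Web Mercator (XYZ) tiling scheme with 256-pixel tiles; the source does not state the tile size, and the function names are hypothetical. It enumerates tiles for a bounding box only, so a real pipeline would still filter tiles against the actual city polygon.

```python
import math

EARTH_CIRCUMFERENCE_M = 40075016.686  # equatorial circumference used by Web Mercator


def lonlat_to_tile(lon, lat, zoom):
    """Column/row of the XYZ tile that contains a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def tiles_for_bbox(west, south, east, north, zoom):
    """All tile indices that cover a bounding box (row numbers grow southward)."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [(x, y) for x in range(x_min, x_max + 1) for y in range(y_min, y_max + 1)]


def ground_resolution(lat, zoom, tile_size=256):
    """Approximate metres per pixel at a latitude (tile_size is an assumption)."""
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat)) / (tile_size * 2 ** zoom)


# At the Equator, zoom 19 gives roughly 0.3 m/pixel, matching the text.
print(round(ground_resolution(0.0, 19), 3))
```

The resolution shrinks with the cosine of latitude, so tiles over U.S. cities are somewhat finer than the equatorial figure.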
Second, many aspects of cities are related to people’s lives and can reveal SDG progress. Transportation in the city is integral to urban development38, and transportation and mobility were recognized as central to sustainable development at the 2012 United Nations Conference on Sustainable Development39. Sports & leisure are highly correlated with citizens’ quality of life40,41. Children and young people benefit greatly from sports, which are inseparable from a quality school education, promoting SDG 3 and SDG 442. Building characteristics (e.g., building type) can reveal the population and income status in urban areas43,44, and the impact of buildings on human well-being cannot be neglected. Therefore, the buildings, cars, and other objects in satellite imagery are correlated with SDG indicators. In our dataset, we consider 17 objects from the abovementioned aspects: transportation, sports & leisure, and building.
The urban object categories are presented in Table 3. We use the YOLOv5s model45,46 pre-trained on the MS COCO dataset47 and finetune it on the xView dataset23 and the DOTA v2 dataset25 to detect objects in the collected satellite imagery. The default parameters48 are used for finetuning the object detection models. We aggregate the number of objects detected in the satellite images at the CBG and city levels to provide visual object attributes at multiple scales.
Third, land cover information such as forests or water also depicts the urban environment but is not captured by the detected objects. Therefore, we add land cover semantic information inferred from satellite imagery to our dataset. We use the Vision Transformer (ViT)-Adapter-based semantic segmentation model49,50,51 pre-trained on the ADE20K dataset52 and finetune it on the LoveDA dataset27 to assign each pixel of the collected satellite imagery to one of seven classes: background, building, road, water, barren, forest, and agriculture. We then compute the pixel-level percentage of each semantic class listed in Table 3 in each satellite image and aggregate the percentages at the CBG and city levels, respectively.
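The pixel-level percentage computation can be sketched as follows. The class-id ordering, the function names, and the pooling rule (summing pixels over all tiles of a region before taking percentages) are assumptions made for illustration; the source does not specify how the per-image percentages are combined.

```python
from collections import Counter

# Hypothetical id order; the real label encoding may differ.
CLASSES = ["background", "building", "road", "water", "barren", "forest", "agriculture"]


def class_percentages(mask):
    """Percentage of pixels per class in one segmentation mask (2D list of class ids)."""
    counts = Counter(value for row in mask for value in row)
    total = sum(counts.values())
    return {name: 100.0 * counts.get(i, 0) / total for i, name in enumerate(CLASSES)}


def region_percentages(masks):
    """Pool the pixels of all tiles in a CBG or city, then compute percentages."""
    counts = Counter()
    for mask in masks:
        for row in mask:
            counts.update(row)
    total = sum(counts.values())
    return {name: 100.0 * counts.get(i, 0) / total for i, name in enumerate(CLASSES)}


# Two toy 2x2 "tiles" belonging to the same region.
tile_a = [[0, 1], [1, 3]]
tile_b = [[5, 5], [5, 5]]
print(region_percentages([tile_a, tile_b]))
```

When all tiles have the same pixel dimensions, pooling pixels gives the same result as averaging the per-tile percentages.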
Processing of basic geographic statistics
For each CBG/city, we present the population, area, centroid coordinates, and geographic boundary, which describe the essential information for the selected areas of interest. Specifically, we collect the population data from 2014 to 2020 from the WorldPop project30,53. The population data is downloaded at a resolution of 3 arc seconds (approximately 100 m at the Equator). We use the Python packages shapely and gdal to crop the population data with the CBG/city geographic boundary and sum the cropped pixel values to obtain the total population. The area (km²) is calculated from the CBG/city boundary data with the Python package geopandas, which is also used to compute the geographic centroid.
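The crop-and-sum step amounts to a zonal sum over a gridded population raster. The sketch below is a simplified stand-in for the gdal-based cropping: it counts a pixel when its centre falls inside the boundary polygon, which is an assumption, since the source does not say how boundary pixels are handled. Function names and the toy grid are hypothetical.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test for a simple polygon given as [(x, y), ...]."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_sum(raster, x_origin, y_origin, pixel_size, poly):
    """Sum raster values whose pixel centres lie inside `poly`.

    raster[row][col] holds people per pixel (None marks no-data); the origin
    is the top-left corner and rows run from north to south.
    """
    total = 0.0
    for r, row in enumerate(raster):
        cy = y_origin - (r + 0.5) * pixel_size
        for c, value in enumerate(row):
            cx = x_origin + (c + 0.5) * pixel_size
            if value is not None and point_in_polygon(cx, cy, poly):
                total += value
    return total


# Toy 3x3 grid with unit pixels; the polygon covers the top-left 2x2 block.
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
boundary = [(0, 1), (2, 1), (2, 3), (0, 3)]
print(zonal_sum(grid, 0.0, 3.0, 1.0, boundary))
```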
Processing of SDG indicators
Our dataset covers five SDGs (SDG 1, SDG 3, SDG 4, SDG 10, and SDG 11) concerning poverty, health, education, inequality, and the built environment at the CBG/city level. SDG 1 “No poverty” focuses on income and the population in poverty status; its indicators are collected from ACS data. SDG 3 “Good health and well-being” and SDG 4 “Quality education” highlight people’s health insurance status and the population with different academic degrees, and the corresponding indicators are extracted from ACS data. SDG 10 “Reduced inequalities” aims to reduce inequality; its indicators come from ACS data and from NTL combined with population data, using a recent algorithm for monitoring regional inequality through NTL54. Finally, SDG 11 “Sustainable cities and communities” reflects the living conditions in a CBG/city, and the related indicators are calculated from OSM historical data and ACS data. Altogether, we collect 25 indicators across five SDGs. The indicators and relevant SDG targets are described in Table 4. Specifically, there are eight SDG targets included in this dataset:
- Target 1.2: By 2030, reduce at least by half the proportion of men, women, and children of all ages living in poverty in all its dimensions according to national definitions.
- Target 1.4: By 2030, ensure that all men and women, in particular the poor and the vulnerable, have equal rights to economic resources, as well as access to basic services, ownership and control over land and other forms of property, inheritance, natural resources, appropriate new technology and financial services, including microfinance.
- Target 3.8: Achieve universal health coverage, including financial risk protection, access to quality essential healthcare services, and access to safe, effective, quality, and affordable essential medicines and vaccines for all.
- Target 4.1: By 2030, ensure that all girls and boys complete free, equitable, and quality primary and secondary education leading to relevant and effective learning outcomes.
- Target 4.3: By 2030, ensure equal access for all women and men to affordable and quality technical, vocational and tertiary education, including university.
- Target 10.2: By 2030, empower and promote the social, economic and political inclusion of all, irrespective of age, sex, disability, race, ethnicity, origin, religion or economic or other status.
- Target 11.2: By 2030, provide access to safe, affordable, accessible and sustainable transport systems for all, improving road safety, notably by expanding public transport, with special attention to the needs of those in vulnerable situations, women, children, persons with disabilities and older persons.
- Target 11.3: By 2030, enhance inclusive and sustainable urbanization and capacity for participatory, integrated and sustainable human settlement planning and management in all countries.
Indicators for SDG 1 “No poverty”
SDG 1 aims to end poverty in all its forms everywhere3. Our target dataset incorporates income and poverty status data to represent the SDG 1 indicators in cities. Specifically, median household income, population above poverty (the number of people whose income in the past 12 months is at or above the poverty level), population below poverty (the number of people whose income in the past 12 months is below the poverty level), and population with a ratio of income to poverty level (the total income divided by the poverty level) under 0.5 and between 0.5 and 0.99 are collected to describe income & poverty in a CBG/city. The poverty threshold is computed every year by the Census Bureau according to family size and the ages of family members, and is adjusted for changes in the Consumer Price Index. The threshold is a country-specific value and does not vary geographically55. Population above/below poverty and population with different ratios of income to poverty level are therefore measurements of poverty status.
We collect the median household income, population above/below poverty, and population with a ratio of income to poverty level under 0.5 and between 0.5 and 0.99 at the CBG level from the ACS data6,7,56. Then, we generate the city-level counts (population above/below poverty and population with a ratio of income to poverty level under 0.5 and between 0.5 and 0.99) by aggregating all the CBG data within the city. Median household income at the city level depends on the income distribution of the city's population and cannot be derived by aggregating CBG medians, so it is gathered directly from ACS data57. The boundary files and ACS data are both collected from the U.S. Census Bureau. ACS data denotes the city as “place”, as in the boundary files, and the ACS definition of a city boundary is the same as that of the U.S. Census Bureau TIGER/Line shapefiles.
Indicators for SDG 3 “Good health and well-being”
SDG 3 aims to ensure healthy lives and promote well-being for all populations at all ages3. In our target dataset, we use the population with no health insurance across all age groups to represent SDG 3 indicators, because health insurance is correlated with the health status of the population in urban regions58,59. Specifically, the civilian noninstitutionalized population and the population with no health insurance under 18, between 18 and 34, between 35 and 64, and over 65 years old are collected from ACS data7 to describe health insurance coverage at the CBG and city levels.
Indicators for SDG 4 “Quality education”
SDG 4 aims to ensure inclusive and equitable quality education and promote lifelong learning opportunities for all3. Therefore, we select indicators that directly depict the education status of a city. From ACS data7, we collect the population enrolled in college, the population that graduated from high school, and the population with a bachelor’s degree, a master’s degree, and a doctorate as indicators of school enrollment & educational attainment to monitor SDG 4.
Indicators for SDG 10 “Reduced inequalities”
SDG 10 aims to reduce inequality within and among countries3. We use income Gini60 and light Gini54 to monitor the progress of SDG 10. The income Gini reveals the inequality of income and is collected from ACS data. The light Gini describes the distribution of NTL per person and thus indirectly reveals regional development inequality. As with the income Gini, the lower the light Gini is, the more equally the region develops, which means the region is moving towards eliminating inequality as targeted by SDG 10. The original paper54 reports the light Gini at 1-degree grid cells, which cannot be used directly in urban scenarios. Therefore, we recalculate the light Gini following the same method54. Specifically, the NTL per person is calculated by dividing the NTL value by the population in every grid cell in each CBG/city. Then, the Gini index60 of NTL per person within the CBG/city boundary is computed as the light Gini. The NTL is the Visible Infrared Imaging Radiometer Suite (VIIRS) data29,54 with a spatial resolution of 15 arc seconds (approximately 500 m at the Equator). We download the VIIRS Nighttime Lights version 2 median monthly radiance (in units of nW/cm²/sr) with background masked from EOG29,61,62,63. Compared with the income Gini derived from traditional income survey data, the light Gini measures NTL inequality in urban regions by treating NTL as an indicator of economic development, which makes it a different measurement of inequality54.
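A minimal sketch of the light Gini computation is given below. It assumes the NTL and population rasters have already been resampled to a common grid and flattened into per-cell lists, that cells with zero population are skipped, and that the Gini index is unweighted; none of these details are stated in the text, and the cited method54 may handle them differently. Function names are hypothetical.

```python
def gini(values):
    """Unweighted Gini index of non-negative values.

    Uses the sorted-rank form G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with ranks i = 1..n over values in ascending order.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * ranked_sum / (n * total) - (n + 1.0) / n


def light_gini(ntl_cells, pop_cells):
    """Gini index of NTL per person over the grid cells of one CBG or city."""
    per_person = [ntl / pop for ntl, pop in zip(ntl_cells, pop_cells) if pop > 0]
    return gini(per_person)


# Toy region with four cells: radiance (nW/cm^2/sr) and population per cell.
radiance = [12.0, 30.0, 8.0, 50.0]
population = [120.0, 100.0, 0.0, 90.0]
print(round(light_gini(radiance, population), 3))
```

A value of 0 means every populated cell has the same light per person, while values closer to 1 indicate that light is concentrated in a few cells.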
Indicators for SDG 11 “Sustainable cities and communities”
SDG 11 aims to make cities and human settlements inclusive, safe, resilient, and sustainable3. We incorporate indicators related to the built environment and land use in the target dataset. Specifically, we generate building density, driving/cycling/walking road density, POI density, land use information, and residential segregation (index of dissimilarity and entropy index) as indicators to monitor SDG 11.
The source data for the urban built environment and land use is collected from OSM31,64,65. We collect the U.S. state-level historical Protocolbuffer Binary Format files from Geofabrik32 from 2014 to 2023. Then we apply the Python package pyrosm to extract the building, driving road, cycling road, walking road, POI, and land use information in cities and CBGs using the corresponding boundary polygons. To calculate building density, we divide the number of buildings by the area of the CBG/city. For each of the three kinds of road density, we divide the total length of that kind of road by the area of the CBG/city. The POI density, defined as the ratio of the number of all POIs to the area of the CBG/city, reflects urban venues associated with human activity. The OSM POIs include all OSM elements with the tags “amenity”, “shop”, or “tourism”. The amenity tag covers useful and important facilities for the urban population, which include Sustenance, Education, Transportation, Financial, Healthcare, Entertainment, Arts & Culture, Public Services, Waste Management, and Others. The shop tag covers the locations of all kinds of shops and the products they sell, such as Food & Beverages, General Store, Mall, Clothing, Shoes, Accessories, Furniture, etc. The tourism tag covers places for tourists, such as Museum, Gallery, Theme Park, Zoo, etc. Moreover, we generate the land use indicators (commercial, industrial, construction, and residential) by calculating the percentage of the CBG/city area occupied by each kind of land use.
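Once the OSM elements have been extracted for a boundary, the density indicators reduce to simple ratios. The sketch below assumes the counts, road lengths (in km), and land use areas (in km²) are already available; these units, the function names, and the input layout are illustrative assumptions rather than the dataset's documented conventions. It does not call pyrosm.

```python
POI_KEYS = ("amenity", "shop", "tourism")


def is_poi(tags):
    """True if an OSM element's tag dictionary carries one of the POI keys."""
    return any(key in tags for key in POI_KEYS)


def built_environment_indicators(n_buildings, road_length_km, elements,
                                 landuse_area_km2, area_km2):
    """Density and land use indicators for one CBG or city.

    road_length_km and landuse_area_km2 map a category name to a total;
    elements is a list of OSM tag dictionaries inside the boundary.
    """
    n_pois = sum(1 for tags in elements if is_poi(tags))
    return {
        "building_density": n_buildings / area_km2,
        "road_density": {k: v / area_km2 for k, v in road_length_km.items()},
        "poi_density": n_pois / area_km2,
        "landuse_percent": {k: 100.0 * v / area_km2 for k, v in landuse_area_km2.items()},
    }


# Toy CBG of 2 km^2.
indicators = built_environment_indicators(
    n_buildings=100,
    road_length_km={"driving": 10.0, "cycling": 2.0, "walking": 4.0},
    elements=[{"amenity": "school"}, {"highway": "residential"}, {"shop": "mall"}],
    landuse_area_km2={"residential": 0.5, "commercial": 0.1},
    area_km2=2.0,
)
print(indicators["poi_density"], indicators["landuse_percent"]["residential"])
```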
The indicators for the built environment quantitatively measure the density of buildings and roads. It should be noted that the indicators for SDG 11 are imperfect since the actual quality of buildings and roads is not provided in the dataset. Users can use the building/road/POI indicators as side information for depicting urban development.
Residential segregation is related to inclusivity in U.S. cities66. We calculate the index of dissimilarity67
$$D=\frac{1}{2}\sum_{i=1}^{n}\left|\frac{w_{i}}{w_{T}}-\frac{b_{i}}{b_{T}}\right|, \tag{1}$$
where n is the number of CBGs in a city, w_i is the population of race “w” (e.g., White) in CBG i, w_T is the total population of race “w” in the city, b_i is the population of race “b” (e.g., Black) in CBG i, and b_T is the total population of race “b” in the city. We calculate the index of dissimilarity for four racial or ethnic groups: Non-Hispanic White (White), Non-Hispanic Black or African American (Black), Non-Hispanic Asian (Asian), and Hispanic66. This gives six pairwise indices of dissimilarity: White-Black, White-Asian, White-Hispanic, Black-Asian, Black-Hispanic, and Asian-Hispanic.
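Equation (1) translates directly into code. The sketch below computes the index for one pair of groups and then for all six pairs; the function names and the toy per-CBG counts are hypothetical.

```python
from itertools import combinations


def dissimilarity(w, b):
    """Index of dissimilarity, Eq. (1): w[i] and b[i] are group counts in CBG i."""
    w_total, b_total = sum(w), sum(b)
    if w_total == 0 or b_total == 0:
        raise ValueError("each group needs a non-zero city-wide total")
    return 0.5 * sum(abs(wi / w_total - bi / b_total) for wi, bi in zip(w, b))


def pairwise_dissimilarity(counts_by_group):
    """Index for every pair of groups; counts_by_group maps group name to per-CBG counts."""
    return {
        (g1, g2): dissimilarity(counts_by_group[g1], counts_by_group[g2])
        for g1, g2 in combinations(counts_by_group, 2)
    }


# Toy city with three CBGs.
city_counts = {
    "White": [500, 300, 200],
    "Black": [100, 300, 600],
    "Asian": [50, 50, 100],
    "Hispanic": [200, 200, 200],
}
for pair, value in pairwise_dissimilarity(city_counts).items():
    print(pair, round(value, 3))
```

The index ranges from 0 (both groups are distributed identically across CBGs) to 1 (the two groups never share a CBG).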
Next, we calculate the entropy index68
$$h_{i}=-\sum_{j=1}^{k}p_{ij}\ln(p_{ij}), \tag{2}$$
where k is the number of racial/ethnic groups and p_ij is the proportion of the jth racial/ethnic group in CBG/city i. We include the White, Black, Asian, and Hispanic population groups at the CBG or city level.
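Equation (2) can be sketched in a few lines. The snippet takes raw group counts rather than proportions and applies the usual convention that a group with zero population contributes nothing to the sum (the limit of p ln p as p approaches 0); the function name is hypothetical.

```python
import math


def entropy_index(group_counts):
    """Entropy index, Eq. (2), from the population counts of the k groups in one area."""
    total = sum(group_counts)
    h = 0.0
    for count in group_counts:
        if count > 0:  # 0 * ln(0) is treated as 0
            p = count / total
            h -= p * math.log(p)
    return h


# White, Black, Asian, Hispanic counts for two toy CBGs.
print(round(entropy_index([250, 250, 250, 250]), 4))  # evenly mixed: ln(4)
print(entropy_index([1000, 0, 0, 0]))                 # a single group: no diversity
```

With four groups the index ranges from 0 (one group only) to ln 4 (all groups equally represented).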
Limitations
The limitations of our dataset include errors from multiple data sources, partial coverage of SDG progress, and the shortcomings of selected indicators.
The errors from data sources include measurement errors in satellite imagery, ACS data collection, OSM, WorldPop population, and NTL data. The measurement errors in satellite imagery processing stem mainly from the object detection and semantic segmentation tasks, and the accuracy metrics are shown in Table 5. The errors in the other data sources are usually considered tolerable in their respective fields; quality assessments are available in the literature69 for ACS data, the literature70,71,72,73 for OSM, the literature74 for WorldPop, and the literature62 for NTL data. ACS data uses sampling error to measure the difference between the true values for the entire population and the estimates based on the sample population, and the magnitude of sampling error is measured by the margin of error69. ACS provides a margin of error for all the ACS estimates that we collect as SDG indicators in our dataset. Dataset users can freely access the margin of error values of the ACS-derived indicators from the ACS official website. OSM data is Volunteered Geographic Information (VGI) and is frequently updated by volunteers. In terms of the road network, OSM is about 83% complete globally70. The building completeness of OSM in San Jose in the U.S. is about 72%, which supports the validity of the OSM building density in our dataset71. Some cities show a large jump in the number of buildings between consecutive years due to lagging annotations. When OSM POIs are compared with Foursquare POIs, 60% of the POIs can be matched with high accuracy72. The accuracy of the OSM land use dataset73 for the U.S. is above 60%. The population data from WorldPop has a coefficient of determination75 (R²) greater than 0.95 when evaluated on population data in China74. The nighttime light intensity also shows high consistency (R² greater than 0.97) when compared with different nighttime light datasets62.
The provided dataset does not cover every aspect of the SDGs and thus cannot be used as the sole measurement for SDG monitoring. However, the dataset still has considerable reference value and can aid decision-making by urban researchers and policymakers.
Finally, some indicators are not always the best indicators for the corresponding SDGs. For example, health insurance as an indicator for SDG 3 (Good health and well-being) may not be the best measurement of health status, because health insurance usage is affected by the income or wealth of the insurance owners.