---
title: "Location Based Services Exercise 23/24"
author: "Erik Neller"
date: "`r Sys.Date()`"
output: pdf_document
---
# Moran's I
Moran's I is a measure of spatial autocorrelation (clustering) for spatial data, defined as
$$
I = \frac{N}{W} \frac{\sum_{i=1}^N \sum_{j=1}^N w_{ij}(x_i-\overline{x})(x_j - \overline{x})}
{\sum_{i=1}^{N}(x_i - \overline{x})^2}
$$
where

- $N$ is the number of spatial units indexed by $i$ and $j$
- $x$ is the variable of interest
- $\overline{x}$ is the mean of $x$
- $w_{ij}$ are the elements of a matrix of spatial weights that denote adjacency
- $W = \sum_{i=1}^N \sum_{j=1}^N w_{ij}$ is the sum of all $w_{ij}$

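Moran's I can be considered agnostic to time-series stationarity: the calculation uses a single cross-section of the data and makes no assumptions about the temporal behavior of the underlying variable.
The deviations from the global mean $\overline{x}$ are calculated at a snapshot in time and weighted by $w_{ij}$.
With row-standardized weights the resulting value typically lies in $[-1, 1]$: positive values indicate that similar values cluster together, negative values indicate dispersion, and values near the expectation under no spatial autocorrelation, $-1/(N-1)$, indicate a spatially random pattern.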
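
To make the formula concrete, the following minimal sketch evaluates it by hand in base R. The helper `morans_i()` and the toy data (four units on a line, each adjacent to its direct neighbors with binary weights) are assumptions made for illustration and are not part of the exercise data.

```{r}
# Hypothetical helper: evaluates the formula for Moran's I directly.
# x: numeric vector of observations, w: N x N matrix of spatial weights
morans_i <- function(x, w) {
  stopifnot(is.matrix(w), nrow(w) == length(x), ncol(w) == length(x))
  n <- length(x)
  d <- x - mean(x)                 # deviations from the global mean
  numerator <- sum(w * outer(d, d))  # sum_i sum_j w_ij * d_i * d_j
  denominator <- sum(d^2)            # sum_i d_i^2
  (n / sum(w)) * (numerator / denominator)
}

# Toy example: four units on a line, neighbors share an edge
x_toy <- c(1, 2, 3, 4)
w_toy <- matrix(0, 4, 4)
w_toy[cbind(1:3, 2:4)] <- 1   # unit i is adjacent to unit i + 1
w_toy <- w_toy + t(w_toy)     # make adjacency symmetric

i_toy <- morans_i(x_toy, w_toy)
print(i_toy)  # 1/3: neighboring values are similar, so I is positive
```

Here $N = 4$, $W = 6$, the weighted cross-product sum is $2.5$ and the sum of squared deviations is $5$, so $I = \frac{4}{6} \cdot \frac{2.5}{5} = \frac{1}{3}$.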

## Sources
- [https://doi.org/10.2307/2332142](http://www.stat.ucla.edu/~nchristo/statistics_c173_c273/moran_paper.pdf)
- https://en.wikipedia.org/wiki/Moran%27s_I

## Calculation

```{r}
library(spdep) # for moran calculation
library(sf)    # for st_as_sf and geometry handling
library(dplyr)
european_iso2 <- c(
  "AL", "AD", "AT", "BY", "BE", "BA", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR",
  "DE", "GR", "HU", "IS", "IE", "IT", "XK", "LV", "LI", "LT", "LU", "MT", "MD", "MC",
  "ME", "NL", "MK", "NO", "PL", "PT", "RO", "RU", "SM", "RS", "SK", "SI", "ES", "SE",
  "CH", "UA", "GB", "VA")
cities <- read.csv("worldcities.csv")
capitals <- cities %>%
  filter(capital == "primary", iso2 %in% european_iso2)
gdp <- read.csv("flat-ui__data-Mon Jan 12 2026.csv")
# Left join: keep every capital, even if no GDP record exists for its country
result <- merge(capitals, gdp, by.x = "iso3", by.y = "Country.Code", all.x = TRUE)

# Keep only the most recent year per country (identified by iso3).
# Countries without any GDP record (Year is NA) are dropped by which.max().
result <- result %>%
  group_by(iso3) %>%
  slice(which.max(Year)) %>%
  ungroup()

# Ensure the variable of interest is numeric and drop incomplete rows
# BEFORE building the weights, so that weights and data have the same length
result$gdp <- as.numeric(result$Value)
result_complete <- result %>% filter(!is.na(gdp), !is.na(lng), !is.na(lat))

# Convert the result dataframe to an sf object (WGS84 longitude/latitude)
result_sf <- st_as_sf(result_complete, coords = c("lng", "lat"), crs = 4326)

# Create a spatial weights matrix using k-nearest neighbors.
# Passing the sf geometry lets knearneigh treat the points as geographic coordinates.
k <- 5  # Number of nearest neighbors
knn_nb <- knn2nb(knearneigh(st_geometry(result_sf), k = k))
weights <- nb2listw(knn_nb, style = "W")  # "W" = row-standardized weights

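# Calculate Moran's I (moran() returns the statistic I and the sample kurtosis K)
n <- length(result_sf$gdp)
s0 <- Szero(weights)  # Sum of all weights, W in the formula
moran_result <- moran(result_sf$gdp, listw = weights, n = n, S0 = s0)
print(moran_result)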

# Perform Monte Carlo simulation
set.seed(123)  # For reproducibility
moran_mc_result <- moran.mc(result_sf$gdp, listw = weights, nsim = 999)
print(moran_mc_result)

towrite <- result[, c('lat','lng', 'Value', 'city', 'iso3')]

write.csv(towrite, file = 'gdp.csv')

```

# Interpretation
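A Moran's I close to 0 (more precisely, close to its expectation under no spatial autocorrelation, $-1/(N-1)$) indicates low spatial autocorrelation, meaning little clustering in the underlying data. Whether the observed value differs from this expectation by more than chance is judged from the p-value of the Monte Carlo test (`moran.mc`). The GDP does not seem to follow a clustered pattern.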
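
To illustrate how a permutation test of this kind judges whether an observed Moran's I is "close to 0", the following base R sketch shuffles the values across locations and compares the observed statistic with the shuffled ones. It is a simplified illustration under assumed toy data (ten units on a line with a steadily increasing variable), not the exact procedure implemented in `moran.mc`; the names `morans_i`, `x_line` and `w_line` are hypothetical.

```{r}
# Hypothetical helper: Moran's I computed directly from its definition
morans_i <- function(x, w) {
  d <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(d, d)) / sum(d^2)
}

# Assumed toy data: ten units on a line, values increase along the line
x_line <- 1:10
n_line <- length(x_line)
w_line <- matrix(0, n_line, n_line)
w_line[cbind(1:(n_line - 1), 2:n_line)] <- 1  # unit i adjacent to unit i + 1
w_line <- w_line + t(w_line)                  # symmetric binary adjacency

observed <- morans_i(x_line, w_line)

# Permutation distribution: reassign the values to locations at random
set.seed(42)  # For reproducibility
nsim <- 999
sims <- replicate(nsim, morans_i(sample(x_line), w_line))

# One-sided pseudo p-value: share of statistics at least as large as the
# observed one, counting the observed value itself
p_value <- (sum(sims >= observed) + 1) / (nsim + 1)

print(observed)           # strongly positive for this clustered toy pattern
print(-1 / (n_line - 1))  # expectation under no spatial autocorrelation
print(p_value)
```

A small pseudo p-value means that few random rearrangements produce a statistic as large as the observed one, which is evidence of clustering. A large one, as the interpretation above suggests for the GDP data, means the observed pattern is compatible with spatial randomness.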