Code and Data for the paper "Online Collaborative Prediction of Regional Vote Results"

Online Collaborative Prediction of Regional Vote Results

This webpage contains the source code and the data used in the paper Online Collaborative Prediction of Regional Vote Results, published in October 2016 by Vincent Etter, Mohammad Emtiyaz Khan, Matthias Grossglauser, and Patrick Thiran at the 3rd IEEE International Conference on Data Science and Advanced Analytics.

If you make use of the code or the data, please cite our paper:

@inproceedings{etter2016dsaa,
	author = "Etter, Vincent and Khan, Mohammad Emtiyaz and Grossglauser, Matthias and Thiran, Patrick",
	title = "Online Collaborative Prediction of Regional Vote Results",
	booktitle = "Proceedings of the 3rd IEEE International Conference on Data Science and Advanced Analytics",
	series = "DSAA '16",
	year = "2016",
}

Quick links

Read the paper

Download the data

Browse the code

Summary

This paper studies a novel dataset of voting data, containing the outcome of issue votes that took place in Switzerland between 1981 and 2014. The vote results are available at the level of municipalities (the smallest administrative regions in Switzerland), resulting in fine-grained outcomes for each vote. On top of these results, the dataset contains metadata about each vote (voting recommendations from political parties) and each region (demographic data, location, language, etc.).

We develop predictive models that use the results of a few regions for a new vote, in conjunction with the metadata about the vote and the regions, in order to predict the result of the vote in other unobserved regions.

The main findings of the paper are the following:

we observe a bi-clustering of vote results, where the outcomes are correlated both across votes and across regions
we show that combining a matrix-factorization model with regression on vote and region features yields good predictive performance with any number of observed results
we demonstrate that using a Bayesian model is key to properly setting the model's hyperparameters, in order to find the appropriate combination of the different terms
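
To make the prediction setting concrete, here is a minimal sketch of the collaborative idea on synthetic data: region factors are extracted from past votes, and the factors of a new vote are estimated from the few regions already observed. It is not the model from the paper (it has no feature regression and no Bayesian treatment), and every variable name and value below is hypothetical.

```matlab
% Toy illustration only: NOT the paper's model. All names and values are hypothetical.
% Synthetic rank-2 data: 6 regions, 4 past votes, 1 new vote.
U_true = [ones(6, 1), (0:5)'];            % latent region factors (6 x 2)
V_past = [0.50  0.40 0.60 0.30;
          0.02 -0.03 0.01 0.05];          % latent factors of the past votes (2 x 4)
Y_past = U_true * V_past;                 % fully observed past results (6 x 4)
y_new  = U_true * [0.45; 0.04];           % results of the new vote (6 x 1)

% Region factors estimated from the past votes (truncated SVD, K = 2).
K = 2;
[U, S, ~] = svd(Y_past, 'econ');
U = U(:, 1:K) * S(1:K, 1:K);

% The new vote is observed in the first three regions only.
obs_idx = false(6, 1);
obs_idx(1:3) = true;

% Least-squares estimate of the new vote's factors from the observed regions.
v_hat = U(obs_idx, :) \ y_new(obs_idx);

% Predict the unobserved regions and measure the error.
y_hat = U(~obs_idx, :) * v_hat;
rmse  = sqrt(mean((y_hat - y_new(~obs_idx)).^2));
```

Because the synthetic data is exactly rank 2 and noise-free, the three observed regions are enough to recover the remaining ones up to numerical precision. Real vote results are noisy, which is where the feature regression and the Bayesian treatment of hyperparameters described above come in.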

Data

The dataset used in this paper has three data files: vote results, vote features, and region features. We give more details about each file below.

Vote results

The first file, data/votes-data.mat, contains the outcome of 289 nationwide votes that took place in Switzerland between January 1981 and December 2014. For each vote, we record the result (proportion of yes) in every Swiss municipality (municipalities are the smallest government division in Switzerland). The raw data can be obtained from the Swiss Federal Statistical Office.

In December 2014, there were 2352 regions (municipalities) in Switzerland. However, because of administrative fusions and divisions, the set of regions changes over time. To have a complete dataset over the 34 years spanned by the votes, we used a fixed set of regions (the ones existing in December 2014) and interpolated the results of regions that did not exist at some point in time. To compute the interpolated result of a region at a given time, we kept track of all the fusions and divisions, took all regions existing at the time of the vote that have a part in the missing region, and computed the average of their results, weighted by their population. For more details about the interpolation procedure, please refer to Section 5.2.1 of this thesis.
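
As a rough illustration of the population-weighted average, the sketch below interpolates the result of one missing region from three regions that existed at the time of the vote. The numbers and variable names are made up, and it assumes that the weight of each old region is its full population; the exact weighting used to build the dataset is the one described in the thesis.

```matlab
% Hypothetical example: a region existing in December 2014 but not at the time
% of the vote, overlapping three regions that existed back then.
old_results    = [0.62; 0.48; 0.55];   % proportion of yes in each old region
old_population = [1200; 300; 500];     % ASSUMPTION: full population of each old region

% Population-weighted average of the old results.
weights      = old_population / sum(old_population);
interpolated = sum(weights .* old_results);   % 0.5815
```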

The resulting vote results are stored as a 2352x289 matrix in data/votes-data.mat. For some votes, however, the interpolation procedure failed, e.g. when historical data was missing. When loading the data (see code/utils/load_data.m), we thus discard these votes, resulting in 281 votes for our analysis.
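
The filtering step can be pictured as follows. This sketch assumes that failed interpolations show up as NaN entries, which is an assumption on our part; code/utils/load_data.m is the reference for how the votes are actually detected and discarded. The matrices and names are toy values.

```matlab
% ASSUMPTION: a failed interpolation is stored as NaN in the results matrix.
% Toy results matrix: 3 regions (rows) x 4 votes (columns).
Y = [0.51 0.62  NaN 0.40;
     0.47 0.58 0.55 0.38;
     0.60  NaN 0.52 0.45];

% Keep only the votes (columns) with a result in every region.
keep    = ~any(isnan(Y), 1);
Y_clean = Y(:, keep);

% Any per-vote metadata has to be filtered with the same mask to stay aligned.
vote_ids       = (1:4)';          % hypothetical vote IDs, one per column of Y
vote_ids_clean = vote_ids(keep);
```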

The outcome of individual votes can be easily visualized on www.predikon.ch.

Vote features

For each vote, the Swiss political parties issue a voting recommendation: in favor, against, or no recommendation. These recommendations can be obtained from the Swiss Federal Statistical Office.

We gathered recommendations from 25 parties and stored them in the file data/votes-features.mat. It is a 289x26 matrix, where the first column contains the ID of the vote and the remaining 25 columns contain the recommendations of each party (encoded as -1 for against, 0 for no recommendation, and 1 for in favor).

Some parties did not exist for all votes (in which case we encode their recommendation as 0), and some always issue the same recommendation. For these reasons, we only keep the parties (columns of the vote features matrix) whose recommendations have a variance larger than 0.5 (see code/utils/load_data.m).
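
The variance filter can be sketched on a small synthetic matrix with the same layout (vote ID in the first column, one party per remaining column). The data is invented, and the use of the sample variance (normalized by N-1) is an assumption; code/utils/load_data.m is the reference implementation.

```matlab
% Synthetic vote-features matrix: 6 votes, 4 hypothetical parties.
% Column 1 is the vote ID; the others are recommendations (-1, 0, or 1).
X = [1   1 -1  0  1;
     2  -1 -1  0  1;
     3   1 -1  0  1;
     4  -1 -1  1  1;
     5   1 -1  0 -1;
     6  -1 -1  0  1];

recs = X(:, 2:end);          % drop the ID column

% Variance of each party's recommendations across votes.
% ASSUMPTION: sample variance (N-1 normalization).
v = var(recs, 0, 1);

% Keep only the parties whose recommendations vary enough.
keep      = v > 0.5;
recs_kept = recs(:, keep);
```

In this toy example, the second party always recommends "against" (variance 0) and the third almost never takes a position (variance about 0.17), so only the first and fourth parties are kept.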

Region features

Each region is characterized by 25 features: location, elevation, demographic attributes, political profile, and languages spoken. The data was obtained directly from the Swiss Federal Statistical Office.

These features are stored as a 2352x25 matrix in the file data/regions-features.mat.

We summarize below the features and their main statistics. The x and y coordinates are defined in the Swiss coordinate system. The election features correspond to the proportion of votes for each of the main political parties during the national elections of 2011.

Feature	Unit	Min	Max	Mean	Median
`x`	meters	487 212.59	826 224.39	633 278.12	627 998.66
`y`	meters	76 505.50	294 280.03	200 725.71	206 342.33
Elevation	meters	196.00	1 960.00	607.49	521.00
Population	count	12.00	380 777.00	3 419.26	1 333.00
Population density	inhabitants/km²	0.80	11 866.48	389.49	157.83
Age 0-19	%	0.00	38.24	21.63	21.60
Age 20-64	%	33.33	80.00	61.09	61.17
Age 65+	%	4.76	66.67	17.28	16.81
Social aid	%	0.00	11.45	1.71	1.27
Foreigners	%	0.00	60.76	14.84	12.55
Jobs	count	4.00	444 198.92	2 064.51	453.91
Election BDP	%	0.00	82.15	7.35	4.84
Election CVP	%	0.00	87.20	14.20	8.42
Election PEV	%	0.00	24.13	2.56	1.89
Election FDP	%	0.00	92.11	14.42	12.13
Election SP	%	0.00	55.33	16.46	16.11
Election PST	%	0.00	28.50	1.58	0.52
Election GL	%	0.00	18.13	5.35	4.81
Election SVP	%	0.00	100.00	30.45	30.08
Election Greens	%	0.00	32.10	7.10	6.27
Election other right	%	0.00	60.73	3.26	1.60
Speaks German	0 = No, 1 = Yes	0	1	0.64	1
Speaks French	0 = No, 1 = Yes	0	1	0.30	0
Speaks Italian	0 = No, 1 = Yes	0	1	0.06	0
Speaks Romansh	0 = No, 1 = Yes	0	1	0.03	0

The file data/regions-feature-names.mat contains the name of each feature, in the same order as the columns of data/regions-features.mat.

Code

All models are implemented in Matlab (we used version R2014b). We make use of GPML and minFunc for the Gaussian Process-based models.

To train all the models described in the paper and then get the performance results on the test set, run code/run.m. It trains each model, saves the resulting model in the models folder, computes the prediction accuracy on the test votes, and stores the results in the results folder.

To see the implementation of the models, look at the classes in the code/models folder. All models share a common interface, with the following methods:

% Fit the model m to the given training data Y, indicated by the indicator matrix train_idx.
% The training error (RMSE or negative loglikelihood) is returned, along with the validation
% RMSE computed on the entries indicated by valid_idx.
[train_error, valid_rmse] = fit(m, Y, train_idx, valid_idx, options, varargin);

% Predict the elements of y identified by test_idx, when observing
% the elements identified by obs_idx.
y_hat = predict(m, y, obs_idx, test_idx, varargin);
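
To illustrate how indicator matrices select the training and validation entries, here is a small self-contained sketch that evaluates a trivial baseline (predicting the training mean everywhere). The baseline is a stand-in and not one of the models in code/models, and it assumes that the indicator matrices are logical arrays of the same size as Y; check the model classes for the exact format they expect.

```matlab
% ASSUMPTION: indicator matrices are logical arrays of the same size as Y.
% Toy results matrix: 4 regions (rows) x 3 votes (columns).
Y = [0.50 0.40 0.60;
     0.55 0.35 0.65;
     0.45 0.45 0.55;
     0.60 0.30 0.70];

% Train on the first two votes and validate on the last one.
train_idx = false(size(Y));
train_idx(:, 1:2) = true;
valid_idx = false(size(Y));
valid_idx(:, 3) = true;

% Stand-in baseline (not a repository model): predict the training mean.
mu    = mean(Y(train_idx));
Y_hat = mu * ones(size(Y));

% RMSE restricted to the entries selected by each indicator matrix.
train_rmse = sqrt(mean((Y(train_idx) - Y_hat(train_idx)).^2));
valid_rmse = sqrt(mean((Y(valid_idx) - Y_hat(valid_idx)).^2));
```

The same masking pattern applies to predict, where obs_idx and test_idx play the roles of the observed and held-out entries of a single vote.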

The models are named following the paper naming convention. For example, model_mf_gp_r_liniso.m corresponds to the MF + GP(r) (linear) model.

Contact

If you have any questions about the code or the data, feel free to email vincent@etter.io.