Deduplicating real estate ads using Naive Bayes record linkage
For our parent company Wüest Partner, we implemented an application to deduplicate 60 million real estate ads from Germany and Switzerland using a multi-step Naive Bayes record linkage model.
Real estate platforms publish millions of rental flat and condominium ads every year. A given region or country is normally covered by several competing platforms, so a single real-world object often appears in multiple published ads.
Because quantifying and modelling the real estate market requires unbiased input data, our aim was to deduplicate real estate ads using Naive Bayes record linkage.
We used commercially available German and Swiss real estate ad data from 2012 to 2019, consisting of approximately 60 million individual records. After multiple data cleaning and preparation steps, we employed a Naive Bayes weighting of 12 to 14 variables to calculate similarity scores between ads and determined a linkage threshold based on expert judgment.
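To illustrate how a Naive Bayes weighting can combine several variables into one similarity score, here is a minimal sketch in Python. It assumes the common agreement/disagreement formulation in which each variable contributes a log-likelihood ratio and, under the independence assumption, the contributions are summed. The variable names, the m and u probabilities, and the helper functions are hypothetical placeholders, not the values or code used in the project.

```python
import math

# Hypothetical probabilities per variable (illustrative values only):
#   m = P(values agree | both ads describe the same object)
#   u = P(values agree | ads describe different objects)
M_U = {
    "rooms": (0.95, 0.20),
    "living_area": (0.90, 0.05),
    "price": (0.85, 0.02),
    "zip_code": (0.98, 0.10),
}


def field_weight(agree, m, u):
    """Log2 likelihood ratio contributed by one compared variable."""
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def similarity_score(ad_a, ad_b, m_u=M_U):
    """Sum the per-variable weights (the Naive Bayes independence assumption).

    Variables missing in either ad are skipped, i.e. they contribute 0.
    """
    score = 0.0
    for field, (m, u) in m_u.items():
        a, b = ad_a.get(field), ad_b.get(field)
        if a is None or b is None:
            continue
        score += field_weight(a == b, m, u)
    return score


def is_link(score, threshold):
    """Two ads are linked when their score reaches the chosen threshold."""
    return score >= threshold
```

Because the score is a plain sum, the contribution of each variable to a linkage decision can be read off directly, which is one reason such models are easy to interpret. The threshold passed to `is_link` stands in for the expert-chosen linkage threshold.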
The deduplication pipeline consisted of three steps:
- Linking ads based on identity comparisons
- Linking similar ads within small regional areas (municipalities)
- Linking similar ads within large regional areas (cantons, states)
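The three steps above can be sketched as follows. This is a simplified illustration, not the production pipeline: the field names, the `toy_score` stand-in (a plain fraction of agreeing fields instead of a Naive Bayes score), the thresholds, and the transitive grouping via union-find are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

COMPARE_FIELDS = ("rooms", "area", "price")  # hypothetical variables


class UnionFind:
    """Tracks which ads belong to the same object group."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            # Keep the smallest index as the group representative.
            self.parent[max(ri, rj)] = min(ri, rj)


def toy_score(a, b):
    """Stand-in similarity: fraction of compared fields that agree."""
    return sum(a[f] == b[f] for f in COMPARE_FIELDS) / len(COMPARE_FIELDS)


def link_within_blocks(ads, uf, block_key, threshold):
    """Compare only ads that share the same value of block_key."""
    blocks = defaultdict(list)
    for idx, ad in enumerate(ads):
        blocks[ad[block_key]].append(idx)
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if uf.find(i) != uf.find(j) and toy_score(ads[i], ads[j]) >= threshold:
                uf.union(i, j)


def deduplicate(ads, local_threshold=0.66, regional_threshold=1.0):
    uf = UnionFind(len(ads))

    # Step 1: link ads that are identical on all fields.
    first_seen = {}
    for idx, ad in enumerate(ads):
        key = (ad["municipality"], ad["region"]) + tuple(ad[f] for f in COMPARE_FIELDS)
        if key in first_seen:
            uf.union(first_seen[key], idx)
        else:
            first_seen[key] = idx

    # Step 2: link similar ads within small areas (municipalities).
    link_within_blocks(ads, uf, "municipality", local_threshold)

    # Step 3: link similar ads within large areas (cantons, states).
    # Using a stricter threshold here is an assumption of this sketch.
    link_within_blocks(ads, uf, "region", regional_threshold)

    return [uf.find(i) for i in range(len(ads))]
```

Restricting comparisons to ads that share a municipality or region keeps the number of candidate pairs manageable, since comparing every ad with every other ad grows quadratically with the number of records.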
The pipeline was deployed as a containerized setup with in-memory calculations in R and out-of-memory calculations and data storage in PostgreSQL. Deduplication linked the roughly 60 million ads to about 14 million object groups (Germany: 10 million, Switzerland: 4 million). The distribution of similarity scores showed high separation power, and the resulting object groups were highly homogeneous in geographic location and price distribution. Furthermore, yearly results corresponded well with published relocation rates.
Using Naive Bayes record linkage to deduplicate real estate ads resulted in a sensible grouping of ads into object groups (rental flats, condominiums). We were able to combine similarities across different variables into a single similarity score. An advantage of the Naive Bayes approach is the high interpretability of the influence of individual variables. However, because we determined the linkage threshold manually, our results are heavily influenced by possible expert biases. The containerized R and PostgreSQL setup proved its portability and scaling capabilities. The same approach could easily be transferred to other domains requiring deduplication of multivariate data sets.
Find out more
Are you facing similar challenges or do you have a similar project where you need help? Then please do not hesitate to contact our Data Scientist Thomas Maier.