
Deduplication model

10. September 2021


Deduplicating real estate ads using Naive Bayes record linkage

For our parent company Wüest Partner, we implemented an application that deduplicates 60 million real estate ads from Germany and Switzerland using a multi-step Naive Bayes record linkage model.

Initial position

Real estate platforms publish millions of ads for rental flats and condominiums every year. A given region or country is normally covered by several competing platforms, so a single real-world object often appears in multiple published ads.

Because quantifying and modelling the real estate market requires unbiased input data, our aim was to deduplicate real estate ads using Naive Bayes record linkage.

Approach

We used commercially available German and Swiss real estate ad data from 2012 to 2019, consisting of approximately 60 million individual records. After multiple data cleaning and preparation steps, we employed a Naive Bayes weighting of 12 to 14 variables to calculate similarity scores between ads and determined a linkage threshold based on expert judgment.
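To illustrate how a Naive Bayes weighting can combine several variables into one score, here is a minimal sketch in Python (the project itself used R). It assumes simple agree/disagree comparisons per variable. The variable names and the probabilities `m` (agreement among true duplicates) and `u` (agreement among unrelated ads) are hypothetical and not taken from the project.

```python
import math

# Hypothetical per-variable probabilities (illustrative, not from the project):
# m = P(values agree | ads describe the same object)
# u = P(values agree | ads describe different objects)
PARAMS = {
    "rooms": (0.95, 0.20),
    "living_area": (0.90, 0.05),
    "zip_code": (0.98, 0.10),
}


def match_weight(agreements, params=PARAMS):
    """Sum the log2 likelihood ratios of all compared variables.

    `agreements` maps a variable name to True (values agree) or False.
    The naive independence assumption is what allows the simple sum.
    """
    total = 0.0
    for variable, agrees in agreements.items():
        m, u = params[variable]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_link(agreements, threshold, params=PARAMS):
    """Link two ads when their score reaches the chosen threshold."""
    return match_weight(agreements, params) >= threshold
```

Each variable contributes its own additive term, so the influence of a single variable on the final score can be read off directly. The threshold passed to `is_link` stands in for the value chosen by expert judgment.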

The deduplication pipeline consisted of three steps:

  1. Linking ads based on identity comparisons
  2. Linking similar ads within small regional areas (municipalities)
  3. Linking similar ads within large regional areas (cantons, states)
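Restricting comparisons to ads from the same region is commonly called blocking. The following Python sketch shows the idea under that assumption: it generates candidate pairs only among ads that share a blocking key. The field names `id`, `municipality` and `canton` are hypothetical, and the project itself ran in R and PostgreSQL.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(ads, block_key):
    """Yield pairs of ad ids whose ads share the same value of `block_key`.

    `ads` is an iterable of dicts. Only ads inside the same block are
    paired, which avoids comparing every ad with every other ad.
    """
    blocks = defaultdict(list)
    for ad in ads:
        blocks[ad[block_key]].append(ad["id"])
    for ids in blocks.values():
        # All unordered pairs within one block, smaller id first.
        yield from combinations(sorted(ids), 2)
```

Calling `candidate_pairs(ads, "municipality")` would correspond to the small-area step and `candidate_pairs(ads, "canton")` to the large-area step. The larger blocks produce more candidate pairs, which is one plausible reason to run the narrow step first.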

Setup

The pipeline was deployed as a containerized setup with in-memory calculations in R and out-of-memory calculations and data storage in PostgreSQL. Deduplication linked the roughly 60 million ads into roughly 14 million object groups (Germany: 10 million, Switzerland: 4 million). The distribution of similarity scores showed high separation power, and the resulting object groups displayed high homogeneity in geographic location and price distribution. Furthermore, yearly results corresponded well with published relocation rates.
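The text does not say how pairwise links were turned into object groups. One common option, shown here purely as an assumption, is to treat links as transitive and take connected components with a union-find structure. The function name `group_links` is hypothetical, and the sketch is in Python rather than the R and PostgreSQL used in the project.

```python
def group_links(ad_ids, links):
    """Group ads into object groups from pairwise links.

    Assumes links are transitive: if A links to B and B links to C,
    all three end up in the same group. Returns a sorted list of
    sorted id lists, with unlinked ads as single-member groups.
    """
    parent = {ad_id: ad_id for ad_id in ad_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative of the group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for ad_id in ad_ids:
        groups.setdefault(find(ad_id), []).append(ad_id)
    return sorted(sorted(group) for group in groups.values())
```

For example, `group_links([1, 2, 3, 4, 5], [(1, 2), (2, 3)])` returns `[[1, 2, 3], [4], [5]]`. Transitive grouping can chain loosely related ads together, so a stricter grouping rule may be preferable when the linkage threshold is low.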

Findings

Using Naive Bayes record linkage to deduplicate real estate ads resulted in a sensible grouping of ads into object groups (rental flats, condominiums). We were able to combine similarities across different variables into a single similarity score. An advantage of the Naive Bayes approach is the high interpretability of the influence of individual variables. However, because we determined the linkage threshold manually, our results may be heavily influenced by expert bias. The containerized R and PostgreSQL setup proved its portability and scaling capabilities. The same approach could easily be transferred to other domains that require deduplication of multivariate data sets.

Find out more

Are you facing similar challenges or do you have a similar project where you need help? Then please do not hesitate to contact our Data Scientist Thomas Maier.