Understanding data through visualizations

18 Oct 22

The human brain processes images and graphics 60,000 times faster than text. No wonder: around 90 percent of all information transmitted to our brain is visual. The following infographic shows the most frequent birth dates of people born in Switzerland since 1969. A table listing the births for each of the 366 possible calendar days is difficult to interpret, but the heatmap reveals interesting insights at first glance:

  • February 29, which only exists in leap years, has (unsurprisingly) by far the fewest births
  • Births on holidays such as January 1st, Christmas and the national holiday on August 1st are also rare, probably because of scheduled C-sections, which hospitals prefer to plan for workdays before or after holidays
  • Most birthdays are celebrated in the second half of September (around nine months after the Christmas holidays and New Year’s Eve 😉)
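
A heatmap of this kind can be sketched in a few lines of ggplot2. The snippet below is only an illustration: the counts are random numbers generated on the spot, not the Swiss birth statistics, and the data frame `births` and its columns are assumed names.

```r
library(ggplot2)

# Synthetic stand-in data: one row per calendar day of a leap year.
# The counts are random numbers, NOT real birth statistics.
set.seed(42)
days <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")
births <- data.frame(
  month = factor(as.integer(format(days, "%m")), levels = 1:12, labels = month.abb),
  day   = as.integer(format(days, "%d")),
  count = rpois(length(days), lambda = 200)
)

# One tile per calendar day; the fill colour encodes the number of births.
heatmap_plot <- ggplot(births, aes(x = day, y = month, fill = count)) +
  geom_tile(colour = "white") +
  scale_y_discrete(limits = rev(month.abb)) +  # January at the top
  scale_fill_viridis_c(name = "Births") +
  labs(x = "Day of month", y = NULL)

print(heatmap_plot)
```

With real data, only the `count` column would change; the mapping of day, month and count to position and color stays the same.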

The same applies in the business environment: only those who visualize their data comprehensibly enable users to interpret it meaningfully and to generate added value from the findings. Such graphics are typically presented on interactive dashboards, created for example with the business intelligence tool «Superset» or the R package «Shiny».

Choosing the right visualization for a complex data set

Often, complex data sets contain more dimensions than our two-dimensional smartphone and computer screens can easily represent. The «Grammar of Graphics» concept by Leland Wilkinson describes how a data set can be appropriately transformed into a figure. It breaks a data visualization down into functional layers: from the raw data, through graphical elements and scales, to the coordinate system. The renowned data scientist Hadley Wickham built the well-known R package ggplot2 on this basis. It lets you define graphics directly in terms of these layers.
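
To make the layers tangible, here is a minimal sketch in which each line of the ggplot2 call corresponds to one of them. It uses `mtcars`, a demo data set that ships with R, purely as neutral example data; the axis titles are our own wording.

```r
library(ggplot2)

# `mtcars` ships with R and only serves as neutral demo data here.
layered_plot <- ggplot(data = mtcars) +             # 1. raw data
  aes(x = wt, y = mpg) +                            # 2. mapping of variables to axes
  geom_point() +                                    # 3. graphical element
  scale_x_continuous(name = "Weight (1000 lbs)") +  # 4. scales
  scale_y_continuous(name = "Miles per gallon") +
  coord_cartesian()                                 # 5. coordinate system

print(layered_plot)
```

Because every layer is a separate term joined with `+`, any one of them can be swapped without touching the others.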

The «Grammar of Graphics» by means of an example

To understand this concept, let’s look at an example. A fictitious company operating six restaurants and four bars in Switzerland analyzes its revenue for September 2022, taking into account how many years each location has been in operation. We map the two main dimensions, «years in operation» and «revenue», to the two axes and choose points with location labels as graphical elements. The first graph immediately shows that the older, established locations generate more revenue than the newly opened ones.
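
A first graph along these lines could be sketched as follows. The location names and all numbers are invented for illustration (the original data of the fictitious company is not part of this article), and the revenue unit of CHF 1,000 is an assumption.

```r
library(ggplot2)

# Invented example data: six restaurants and four bars.
# Revenue is assumed to be in CHF 1,000 for September 2022.
locations <- data.frame(
  location = c("Zurich", "Bern", "Basel", "Lucerne", "Geneva", "Lausanne",
               "St. Gallen", "Winterthur", "Lugano", "Chur"),
  type     = c(rep("Restaurant", 6), rep("Bar", 4)),
  years    = c(1, 2, 4, 6, 8, 10, 1, 3, 6, 9),
  revenue  = c(40, 60, 95, 130, 170, 200, 70, 85, 110, 130)
)

# Years in operation on the x-axis, revenue on the y-axis,
# points as graphical elements, location names as labels.
basic_plot <- ggplot(locations, aes(x = years, y = revenue)) +
  geom_point() +
  geom_text(aes(label = location), vjust = -0.8, size = 3) +
  labs(x = "Years in operation", y = "Revenue (CHF 1,000)")

print(basic_plot)
```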

With ggplot2, adding another dimension to the graph takes only one additional argument. The color shows whether a location is a restaurant or a bar. It seems that newly opened restaurants generate comparatively little revenue, while new bars get off to a better start.
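
The one additional argument is most likely a `colour` mapping inside `aes()`; that is how it is usually done, though the article does not show the original call. The sketch below again uses invented locations and numbers.

```r
library(ggplot2)

# Invented example data (revenue assumed to be in CHF 1,000).
locations <- data.frame(
  location = c("Zurich", "Bern", "Basel", "Lucerne", "Geneva", "Lausanne",
               "St. Gallen", "Winterthur", "Lugano", "Chur"),
  type     = c(rep("Restaurant", 6), rep("Bar", 4)),
  years    = c(1, 2, 4, 6, 8, 10, 1, 3, 6, 9),
  revenue  = c(40, 60, 95, 130, 170, 200, 70, 85, 110, 130)
)

# `colour = type` is the only change compared with a plain scatter plot:
# it maps the location type to the point and label color.
coloured_plot <- ggplot(locations, aes(x = years, y = revenue, colour = type)) +
  geom_point() +
  geom_text(aes(label = location), vjust = -0.8, size = 3, show.legend = FALSE) +
  labs(x = "Years in operation", y = "Revenue (CHF 1,000)", colour = "Type")

print(coloured_plot)
```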

This assumption can be checked with ggplot2’s layers for statistical transformations, in this case regression lines. Restaurants start with lower revenues but overtake bars after a few years of operation. For restaurateurs, this is an indication that the additional investment in restaurants could pay off. By adding and changing individual layers in ggplot2, similar evaluations of other topics, such as personnel costs or store space, can be created with little effort.
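
One way to add such regression lines is `geom_smooth()` with `method = "lm"`, which fits one straight line per color group. The sketch below uses invented numbers chosen to mimic the pattern described above, and also computes the same fits with `lm()` so that the slopes and intercepts can be inspected directly.

```r
library(ggplot2)

# Invented example data (revenue assumed to be in CHF 1,000).
locations <- data.frame(
  location = c("Zurich", "Bern", "Basel", "Lucerne", "Geneva", "Lausanne",
               "St. Gallen", "Winterthur", "Lugano", "Chur"),
  type     = c(rep("Restaurant", 6), rep("Bar", 4)),
  years    = c(1, 2, 4, 6, 8, 10, 1, 3, 6, 9),
  revenue  = c(40, 60, 95, 130, 170, 200, 70, 85, 110, 130)
)

# geom_smooth() adds a statistical layer: one linear fit per location type.
trend_plot <- ggplot(locations, aes(x = years, y = revenue, colour = type)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Years in operation", y = "Revenue (CHF 1,000)", colour = "Type")

print(trend_plot)

# The same linear fits outside the plot, one per location type.
fits <- lapply(split(locations, locations$type),
               function(d) coef(lm(revenue ~ years, data = d)))
print(fits)
```

In this made-up data set the restaurant line has the lower intercept and the steeper slope, which is the pattern behind "start lower, overtake later". With only ten locations, such a fit is an indication rather than proof.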

The R package supports all relevant chart types, such as the heatmap shown above. The data can be scaled and transformed as desired, and interactive graphics can reveal specific details.
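
As a small illustration of switching chart types and transforming scales, the following sketch draws box plots on a logarithmic axis. It uses `diamonds`, a demo data set bundled with ggplot2, only as neutral example data; it is unrelated to the restaurant example.

```r
library(ggplot2)

# `diamonds` ships with ggplot2 and only serves as neutral demo data.
price_plot <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +    # a different geom gives a different chart type
  scale_y_log10() +   # transform the scale instead of the data
  coord_flip() +      # swap the axes via the coordinate system
  labs(x = "Cut", y = "Price (log scale)")

print(price_plot)
```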

Data visualization: ggplot2 can do even more

Technically, with a package as comprehensive as ggplot2, there are hardly any limits. However, it is important to ask whether additional information in a graphic adds value or whether, in the worst case, interpretability suffers. The «Grammar of Graphics» is like the grammar of a language: it is only the first step toward a good sentence, or in this case a good graphic. We at Datahouse are happy to support you in designing your personal visualization solution.