The Data Team @ The Data Lab

Using Shiny for interactive displays of health data: The Scottish Burden of Diseases

Thu, 09 May 2019 11:31:31 +0100

The Accelerator programme, run by The Data Lab between 19 April 2018 and 06 September 2018, was a Scottish Government collaborative project open to employees of the Scottish Government, the Information Services Division, the National Records of Scotland and Registers of Scotland. Employees applying to take part had backgrounds in statistics, economics, operational research and social research, and sought to improve their data skills across a variety of areas.

As part of this initiative, I mentored one of the applicants, Maite Thrower (Senior Analyst from Public Health Intelligence, ISD within NHS / National Services Scotland), and supported her as she learned R and Shiny in order to create insightful visualisations of the Scottish Burden of Disease data, which covers over 100 conditions and injuries. Previously, this complex data had been visualised using static graphs, so it was essential to move to more powerful interactive visualisations via Shiny.

To understand this data, we need to focus on two key measures: the years of life lost (an estimate of the extent to which someone’s life may have been cut short due to an existing condition), and the years lived with disability (an estimate of how long a person has lived with the consequences of a condition). These measures are considered to be influenced by factors such as age, gender and the level of deprivation in the area of residence, all of which would ideally be implemented as inputs in a Shiny app (but please see below for a similar solution implemented for the Global Burden of Disease data, from The Institute for Health Metrics and Evaluation).

Maite describes her experience as a participant in the Accelerator programme:

I wanted to develop an interactive data visualisation for the Scottish Burden of diseases to present our results and design modern visualizations with the aim to reach a higher audience and increase the number of users that refer to our statistics and graphics in their reports and websites. […] Originally, I was going to produce a static tree map and one visualization but I managed, with the direction of the mentor from The Data Lab, to move to a higher level and produce the Shiny app. Public Health Scotland has agreed to fund further development of the interactive visualisation and to publish it in the ScotPHO website. I have also been presenting the results of the program to other teams in ISD.

For my part, I really enjoyed working with Maite and seeing her progress over time, from the initial planning stages of the Accelerator programme, to when Maite was able to deliver a comprehensive interactive visualisation for the Scottish Burden of Diseases data, using R Shiny.

Maite’s interactive Shiny solution will be deployed soon - updates to follow. For the time being, you can access the mentorship material used throughout the Accelerator programme freely in the following GitHub repo:

Constantinescu, A.C. (2018, August). Exploring the Scottish Burden of Diseases Data using R Shiny [R script]. Edinburgh, Scotland: The Data Lab Innovation Centre. Retrieved [Month] [Day], [Year], from https://github.com/TheDataLabScotland/Public_ScotGovAccelerator_2018

Four tips for creating interactive visualisations with Shiny

Wed, 08 May 2019 15:47:08 +0100

I recently presented a toy Shiny app at the Edinburgh Data Visualization Meetup to demonstrate how Shiny can be used to explore data interactively.

In my code-assisted walkthrough, I began by discussing the data used: a set of records detailing customer purchases made on Black Friday (i.e., each customer was given a unique ID, which was repeated in long format in the case of multiple purchases). Both the customers and the items purchased are described along various dimensions (e.g., customer city type, or item category etc.). You can find more details about this dataset on Kaggle here.

After a basic set of data manipulations using data.table in R (see code below for details), the data was ready to be visualised with ggplot2. It is at this stage that I can share my first tip for designing Shiny apps:

Tip #1

Consider starting with a simple, static visualisation (rather than building your Shiny app directly). This strategy helps to streamline the design process and deal with any potential problems one at a time. Starting with a static plot can also help to identify the best visualisation that can highlight the particular relationships you are trying to show in your data.

library( data.table )
library( ggplot2 )
library( viridis )

BFsales <- fread( "BlackFriday.csv" )

BFsales[ , User_ID := as.factor( User_ID ) ]
BFsales[ , Product_ID := as.factor( Product_ID ) ]
BFsales[ , Occupation := as.factor( Occupation ) ]
BFsales[ , Gender := as.factor( Gender ) ]
levels( BFsales$Gender ) <- c( "Female", "Male" )

BFsales[ , Stay_In_Current_City_Years := ordered( Stay_In_Current_City_Years, levels = sort( unique( Stay_In_Current_City_Years ) ) ) ]

BFsales[ , Marital_Status := factor( Marital_Status ) ]
levels( BFsales$Marital_Status ) <- c( "Married", "Single" )

BFsales[ , Product_Category_1 := as.factor( Product_Category_1 ) ]
BFsales[ , Product_Category_2 := as.factor( Product_Category_2 ) ]
BFsales[ , Product_Category_3 := as.factor( Product_Category_3 ) ]

BFsales[ , Age := ifelse( Age == "0-17", "Under 17", Age ) ]
BFsales[ , Age := ifelse( Age == "55+", "Over 55", Age ) ]
BFsales[ , Age := ordered( Age, levels = c( "Under 17", "18-25", "26-35",  "36-45", "46-50", "51-55", "Over 55" ) ) ]


# How much did *individuals* spend on average depending on city category and age?
purchase_by_age_agr <- aggregate( Purchase ~ User_ID + Age + City_Category, data = BFsales, FUN = sum )
purchase_by_age_agr <- aggregate( Purchase ~ Age + City_Category, data = purchase_by_age_agr, FUN = mean )

ggplot( purchase_by_age_agr, 
        aes( x = City_Category, y = Purchase, group = Age, color = Age ) ) + 
  geom_point( size = 2.5 ) +
  geom_line( lwd = 1.5 ) +
  scale_color_viridis_d( direction = -1, begin = 0.20, end = 0.85, option = "B" ) +
  labs( x = "City category",
        color = "Age band" ) +
  ggtitle( "Average spend according to customer age and city type",
           subtitle = "- Add note here -" )

Following the R code above, this is the plot you would get:

After creating this static prototype, we can now start thinking about how to generalise it and integrate elements of interactivity (input menus) via Shiny. This would help us to investigate questions such as whether the product category affects the relationship shown, or whether customers’ marital status, gender, or occupation have any influence as well. Tackling questions such as these with Shiny is a more powerful and elegant option than generating a separate plot for each such scenario.

So, how can we move to Shiny from here? I won’t go into the details here (I covered them at the meetup), since several great tutorials are already available - notably Dean Attali’s. You can also check out other important resources / documentation pages, e.g., the RStudio tutorials here and here, or have a look at the Shiny app gallery to get inspiration and choose a format that suits your needs.
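As a rough illustration of that move, the sketch below wraps the static plot in a minimal Shiny app with a single gender filter. It is not the app shown at the meetup: the helper summarise_spend(), the wrapper make_app() and the input id "gender" are hypothetical names chosen for this example, Gender is assumed to be in its raw F/M coding, and BlackFriday.csv is assumed to sit in the working directory.

```r
library( shiny )
library( data.table )
library( ggplot2 )

# Hypothetical helper: total spend per customer, then the mean of those totals
# per age band and city category, restricted to the selected genders.
summarise_spend <- function( sales, genders ) {
  per_user <- sales[ Gender %in% genders,
                     .( Purchase = sum( Purchase ) ),
                     by = .( User_ID, Age, City_Category ) ]
  per_user[ , .( Purchase = mean( Purchase ) ), by = .( Age, City_Category ) ]
}

ui <- fluidPage(
  titlePanel( "Black Friday spend" ),
  sidebarLayout(
    sidebarPanel(
      # Both options ticked by default, so the first view shows the full data
      checkboxGroupInput( "gender", "Gender",
                          choices = c( "Female" = "F", "Male" = "M" ),
                          selected = c( "F", "M" ) )
    ),
    mainPanel( plotOutput( "spend_plot" ) )
  )
)

# Hypothetical wrapper: builds the app around an already-loaded data.table
make_app <- function( sales ) {
  server <- function( input, output, session ) {
    output$spend_plot <- renderPlot( {
      req( input$gender )   # draw nothing while no gender is selected
      ggplot( summarise_spend( sales, input$gender ),
              aes( x = City_Category, y = Purchase, group = Age, color = Age ) ) +
        geom_point( size = 2.5 ) +
        geom_line( lwd = 1.5 ) +
        scale_color_viridis_d( direction = -1, begin = 0.20, end = 0.85, option = "B" ) +
        labs( x = "City category", color = "Age band" )
    } )
  }
  shinyApp( ui, server )
}

# To launch (assumes the CSV is in the working directory):
# runApp( make_app( fread( "BlackFriday.csv" ) ) )
```

Keeping the aggregation in a plain function, separate from the UI and server, means it can be checked on a small table before any interactivity is involved, in the same spirit as Tip #1.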

Tip #2

Think about the initial state of the app: should the view contain full data? If so, make sure the default options for the inputs cover all the options that exist in your data (e.g., a menu for selecting binary gender should have both checkboxes ticked by default, but users can later opt for looking at a single gender if they so wish). A less obvious case is when missing values exist: these would be filtered out automatically by input menus with specific options, but is this something you want handled in this way?
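One possible way to make both points explicit is sketched below. The helper names and the keep_missing switch are hypothetical, introduced only for this illustration; the idea is that the default selection is derived from the data itself, and that the fate of missing values is a visible decision rather than a side effect of the filter.

```r
library( shiny )

# Hypothetical helper: a checkbox group whose default selection covers every
# non-missing value present in the column, so the initial view shows all data.
all_ticked_input <- function( input_id, label, values ) {
  opts <- as.character( sort( unique( values[ !is.na( values ) ] ) ) )
  checkboxGroupInput( input_id, label, choices = opts, selected = opts )
}

# Hypothetical helper: keep the rows matching the selected options.
# NA never matches a selected option, so rows with missing values are dropped
# unless keep_missing = TRUE asks for them to be retained.
filter_by_selection <- function( df, column, selected, keep_missing = TRUE ) {
  values <- df[[ column ]]
  keep <- values %in% selected
  if ( keep_missing ) keep <- keep | is.na( values )
  df[ keep, , drop = FALSE ]
}

# Toy data standing in for a column with gaps, such as Product_Category_2
toy <- data.frame( Product_Category_2 = c( 2, NA, 8, 2 ),
                   Purchase = c( 10, 20, 30, 40 ) )

widget   <- all_ticked_input( "cat2", "Product category 2", toy$Product_Category_2 )
with_na  <- filter_by_selection( toy, "Product_Category_2", c( "2", "8" ) )
drop_na  <- filter_by_selection( toy, "Product_Category_2", c( "2", "8" ),
                                 keep_missing = FALSE )
```

In this toy case with_na keeps all four rows while drop_na keeps three, which is the difference a user would otherwise never see.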

Tip #3

How should the app handle multiple connected sessions? It might be a good idea to have larger data objects / constants visible across all connected sessions for efficiency. It is worth thinking about this in more detail and setting up your app according to Scoping guidance.
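The distinction can be sketched in a few lines. The object names below are placeholders for illustration: anything created outside the server function is built once per R process and shared by all sessions, while anything created inside the server function is rebuilt for each new session.

```r
library( shiny )

# Outside the server function: created once when the app starts and visible
# to every connected session. A large dataset would typically be loaded here.
shared_cities <- c( "A", "B", "C" )   # placeholder for a larger shared object

ui <- fluidPage(
  selectInput( "city", "City category", choices = shared_cities ),
  textOutput( "summary" )
)

server <- function( input, output, session ) {
  # Inside the server function: created afresh for each session, so every
  # visitor gets their own copy.
  session_started <- Sys.time()

  output$summary <- renderText( {
    paste0( "City category ", input$city,
            "; this session started at ", format( session_started, "%H:%M:%S" ) )
  } )
}

# Assigning the app object does not launch it; use runApp( app ) to do so
app <- shinyApp( ui, server )
```

Opening the app in two browser tabs should show different start times but the same list of city categories, since only the latter is shared.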

Tip #4

Make sure you are setting up the correct dependencies between reactive layers in your Shiny app: to make full use of Shiny’s clever reactive system, you need to pay special attention to setting up the correct links between objects. A great post on Execution scheduling will go a long way towards clarifying this.
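The layering can be tried out at the console without building a full app. In the sketch below (toy data and hypothetical names throughout), a reactiveVal stands in for an input menu, one reactive expression filters the data, and two further reactive expressions build on that filtered result. A counter records how often the filtering step actually runs.

```r
library( shiny )

toy_sales <- data.frame( Age = c( "18-25", "18-25", "26-35" ),
                         Purchase = c( 100, 300, 500 ),
                         stringsAsFactors = FALSE )

n_filter_runs <- 0   # counts executions of the filtering step

# Stands in for an input such as input$age
selected_age <- reactiveVal( "18-25" )

# Layer 1: depends on selected_age only
filtered_sales <- reactive( {
  n_filter_runs <<- n_filter_runs + 1
  toy_sales[ toy_sales$Age == selected_age(), ]
} )

# Layer 2: both depend on filtered_sales, not on selected_age directly
mean_spend <- reactive( mean( filtered_sales()$Purchase ) )
n_rows     <- reactive( nrow( filtered_sales() ) )

# isolate() lets us read reactive expressions outside a running app
first_mean <- isolate( mean_spend() )   # triggers the filtering step
first_rows <- isolate( n_rows() )       # reuses the cached filtered data
runs_before_change <- n_filter_runs

selected_age( "26-35" )                 # invalidates everything downstream
second_mean <- isolate( mean_spend() )  # filtering runs again
runs_after_change <- n_filter_runs
```

Because both summaries hang off the same filtered_sales() expression, the filter runs once per change of the input rather than once per output, which is the benefit of linking the layers correctly.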

If you are curious about Black Friday sales, you can see the Shiny app in action below (click the image to visit the app):

The code controlling the app’s behaviour can be found on GitHub here.

Further practice

If you like, feel free to build on my example further. I’ve left out various menus that could still be included. So as practice, you could:

Tweak the UI to include suitable inputs for:
- Occupation
- Marital status
- Product category 2 or 3 (careful about handling missing data here)
Update the server function to use these newly-added inputs!

Interactive Intelligence

Mon, 22 Apr 2019 12:25:39 +0100

Artificial Intelligence and Machine Learning have captured a large share of academic and industry attention in recent years, both in terms of new capabilities and the implications for society. Many state-of-the-art techniques provide important capabilities for different fields, yet we are far from creating artificial general intelligence. Human-In-The-Loop (HITL) is a branch of Artificial Intelligence (AI) where natural (human) and artificial (machine) intelligence combine to create more accurate AI algorithms. In such systems, humans are involved at every stage, creating a feedback loop from training to testing that results in a more accurate model. This approach is a blend of supervised learning (using labelled training data) and active learning (interacting with users for feedback).

Recent advances in the field of AI have given rise to techniques such as active learning and co-operative learning. The backbone of any machine learning algorithm is data, and usually these datasets are unlabelled (e.g. images). In the training stage, a human is required to label this dataset manually (the output, e.g. cat or dog). This data is then fed to the machine learning model for training; this is referred to as supervised learning. In this technique, the algorithm learns from labelled data in order to predict unseen cases. Using what we already know, we can go deeper and create more sophisticated techniques to uncover other insights and features that exist in the training dataset, with the goal of getting more accurate and automated results.

As can be seen in the graph below, when the AI classifier is not confident about its output, the need for human intervention arises, which in turn leads to a more accurate output.

In the testing and evaluation phase, human and machine expertise are combined by allowing the human to correct any inaccurate results that have been produced. Specifically, the human corrects the labels that the machine was not able to assign with high accuracy (e.g. a dog classified as a cat). The human takes the same approach when the machine is overly confident about a wrong prediction. With each iteration the performance of the algorithm increases, opening a path towards automated lifelong learning by reducing the need for future human intervention. At the end of such work, the results are sent to a domain expert to make decisions that enable a bigger impact. For example, in a hospital, images of tumours could be assessed by the model, which compares new images with all previous records; once the output is produced, it could be sent to a cancer expert to classify further and feed back to the model.
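The review step described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function names, the column names and the 0.8 confidence threshold are all hypothetical, and the predictions are made-up toy values rather than output from a real classifier.

```r
# Hypothetical helper: flag predictions whose confidence is below a threshold
# so that a human can check them. The 0.8 default is an arbitrary assumption.
route_predictions <- function( predictions, threshold = 0.8 ) {
  predictions$needs_review <- predictions$confidence < threshold
  predictions
}

# Hypothetical helper: overwrite labels with the human's corrections and mark
# those cases as reviewed, ready to be fed back into the next training round.
apply_corrections <- function( predictions, corrections ) {
  idx <- match( corrections$id, predictions$id )
  predictions$label[ idx ] <- corrections$label
  predictions$needs_review[ idx ] <- FALSE
  predictions
}

# Toy predictions: the classifier is unsure about images 2 and 3
preds <- data.frame( id = 1:3,
                     label = c( "cat", "dog", "cat" ),
                     confidence = c( 0.95, 0.55, 0.70 ),
                     stringsAsFactors = FALSE )

routed <- route_predictions( preds )

# The human reviewer decides that image 2 is actually a cat
human_fix <- data.frame( id = 2L, label = "cat", stringsAsFactors = FALSE )
corrected <- apply_corrections( routed, human_fix )
```

A threshold like this only catches predictions the model is unsure about; confidently wrong predictions would need a separate check, for example a human reviewing a random sample of high-confidence cases.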

Allowing HITL will significantly change the way business workflows are carried out at scale by creating a pipeline that includes data collection, model training, testing, deployment and maintenance. However, much work on the roles and definitions of HITL is needed to create an impactful Machine Learning ecosystem. Finding the most efficient ways to combine human and machine know-how will require hybrid processes, which in turn can open paths for automated systems that extend beyond Machine Learning workflows to other fields such as robotics. This is a very exciting time to be involved in this field, as industry and academia are pushing the limits further every day.