Background

The ongoing COVID-19 epidemic is a global pandemic and could at some point affect up to two thirds of the population. Containment is possible if a local outbreak is detected early. However, not all surveillance and health care systems have the capacity or infrastructure to find early cases. In recent years, Google search trends (GT) have been studied as a potential early warning system for various infectious diseases, with mixed results. For seasonal influenza, GT can predict the timing of the peaks accurately, but predicting case numbers is more difficult (see for example here). The question is whether GT can be used to predict outbreaks of COVID-19 in various countries. One major problem with GT data is that it is unclear whether search trends reflect disease activity in a population or the population's reaction to media coverage of a disease. This short study investigates this issue.

Methods

In this case study I will investigate three questions:

  1. Which countries that are currently affected by COVID-19 show a substantial increase in GT activity in January/February 2020 compared to the previous three years?

  2. Is GT activity a reflection of media coverage of COVID-19, and how does GT activity relate to epidemiological data for COVID-19?

  3. Do we see any increases in GT activity in unaffected countries that are not related to media coverage?

I retrieved two sets of GT data: the weekly “web search” data for the period of 2017-01-01 to 2020-03-05 (3-year data) and the daily “web search” and “news” data for the period of 2020-01-01 to 2020-03-05 (60-day data). I assume here that the GT activity for “news” is a good reflection of the actual media coverage.

The search terms I collected GT data on are “pneumonia”, “cough” and “fever” for web data and “coronavirus” for news coverage. The GT data were queried with the gtrendsR R package.

I translated these terms into local languages for each country using the translateR R package. The COVID-19 data were digitized from the WHO situation reports; the datasets can be found in my GitHub repository.

I first used the 3-year GT data in countries with at least one case and calculated the mean relative search activity for the pre- and post-COVID periods (Jan-2017 to Dec-2019 and Jan to Feb 2020) and the mean increase as post-mean/pre-mean. I then looked at the 60-day GT data for all countries with at least a twofold increase in pre- vs. post-COVID-19 activity. I omitted data from countries that generally have a low search volume, defined as 10 or more days (i.e. >=6.5%) with no search activity. For the remaining countries, I smoothed the 60-day GT data with a moving average over a 3-day window to reduce the noise, and compared the “web search” activity to the “news” activity and the incident confirmed cases.
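The three processing steps described above (fold increase, low-volume filter, smoothing) can be sketched in a few lines. This is only an illustrative Python sketch, not the code used for this study, which was done in R. The function names are hypothetical, and the centered window in the moving average is an assumption, since the text only specifies a 3-day window.

```python
def fold_increase(pre, post):
    """Mean increase, computed as post-mean / pre-mean."""
    pre_mean = sum(pre) / len(pre)
    post_mean = sum(post) / len(post)
    if pre_mean == 0:
        return float("nan")  # undefined without any baseline activity
    return post_mean / pre_mean


def is_low_volume(daily_hits, min_zero_days=10):
    """True if the series has 10 or more days with no search activity."""
    zero_days = sum(1 for h in daily_hits if h == 0)
    return zero_days >= min_zero_days


def moving_average(values, window=3):
    """Moving average over `window` days.

    Assumption: the window is centered and odd-sized. Positions without
    a full window (the edges of the series) are returned as None.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            smoothed.append(None)
        else:
            chunk = values[i - half:i + half + 1]
            smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up relative search activity, for illustration only.
print(fold_increase(pre=[10, 12, 8, 10], post=[30, 50]))  # 4.0
print(moving_average([0, 3, 6, 3, 0]))  # [None, 3.0, 4.0, 3.0, None]
```

A trailing window (averaging each day with the two preceding days) would be an equally plausible reading of the text; it would shift the smoothed curve by one day relative to the centered version.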

Results

1. Identification of countries with increased GT activity in the past two months

I first looked into the search activity of the past 3 years for “pneumonia”, “cough” and “fever” in the most widely spoken local language in all countries with at least one case of coronavirus as of March 5, 2020.

Many European countries show seasonal peaks for all search terms, which is indicative of seasonal influenza activity. Pneumonia is the search term that stands out for many countries.

The following plots show the x-fold increase in mean search activity for the three search terms in the language spoken by the majority (first half of the bars) and in English (second half of the bars).


Almost all affected countries show a small increase in activity, but countries with widespread transmission show a massive increase in search activity for pneumonia after January 1, 2020 compared to the previous 3 years (e.g. Taiwan, China, Japan, Hong Kong). These are the countries with at least a twofold increase in GT activity for “pneumonia”:

2. Comparison of GT web search, GT news activity and COVID-19 incidence

For countries with at least a twofold increase in activity for “pneumonia”, I compared the web search trends, the media coverage according to GT news activity, and the incident confirmed cases of COVID-19. I omitted countries with a low search volume, because these data provide more noise than information. The GT time series were smoothed to reduce the noise.

The same for English:

Generally, the GT curves are smoother for countries with widespread transmission: Singapore, South Korea, Indonesia, Hong Kong, Vietnam, Japan, Italy and Germany. For these countries it appears that web and news GT activity as well as COVID-19 activity coincide over time. We can use cross-correlation with different time lags to examine whether the web activity precedes the news activity. For this I calculated the Pearson correlation coefficient between web search hits and news hits for each country and language with lags of -7 to 6. The following plots show the distribution of the lag times at which the correlation was maximal for each country, for the local language (left) and English (right):

These estimates show that, on average, the time lag between web and news activity with the highest correlation is 0. This is the correlation of the smoothed GT web and news activity by country, for the most widely spoken local language, at time lag 0:

For countries with smoother GT data and widespread transmission (Germany, Italy, Japan), there is very good agreement between web and news activity at time lag 0. The median correlation coefficient is 0.48. Countries with noisier GT data have a lower correlation, as expected.
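The lagged cross-correlation used in this section can be sketched as follows. This is an illustrative Python sketch rather than the code behind the results, which were produced in R. The function names are hypothetical, and the sign convention is an assumption: lag k pairs web[t + k] with news[t], so a negative best lag means that web activity leads news activity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant series
    return sxy / sqrt(sxx * syy)


def lagged_correlation(web, news, lag):
    """Correlate web[t + lag] with news[t] over the overlapping days.

    Assumes both series cover the same days and have the same length.
    """
    if lag >= 0:
        x, y = web[lag:], news[:len(news) - lag]
    else:
        x, y = web[:len(web) + lag], news[-lag:]
    return pearson(x, y)


def best_lag(web, news, lags=range(-7, 7)):
    """Return (lag, r) for the lag in -7..6 with the highest correlation."""
    scored = [(k, lagged_correlation(web, news, k)) for k in lags]
    scored = [(k, r) for k, r in scored if r == r]  # drop undefined (NaN)
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])


# Made-up daily hits: the web series is the news series shifted two days earlier.
news = [0, 0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 2, 5, 2, 0, 0, 0, 0, 0, 0]
web = news[2:] + [0, 0]
lag, r = best_lag(web, news)
print(lag)  # -2, i.e. web activity leads news activity by two days
```

With only about two months of daily data, each lag is estimated from a slightly different and fairly short overlap, so small differences between neighbouring lags should not be over-interpreted.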

Hong Kong, Indonesia, Singapore, Macao, Taiwan, South Korea and Malaysia seem to show an increase in web search activity for pneumonia in the first half of January without corresponding news activity, but it is unclear whether this is due to alternative media coverage (i.e. coverage not reflected in GT news activity), reactions to rumours, noise, or actual cases googling their symptoms.