This page was generated from examples/od_prophet_weather.ipynb.

Time-series outlier detection using Prophet on weather data

Method

The Prophet outlier detector uses the Prophet time series forecasting package explained in this excellent paper. The underlying Prophet model is a decomposable univariate time series model combining trend, seasonality and holiday effects. The model forecast also includes an uncertainty interval around the estimated trend component using the MAP estimate of the extrapolated model. Alternatively, full Bayesian inference can be done at the expense of increased compute.

The upper and lower values of the uncertainty interval can then be used as outlier thresholds for each point in time. First, the distance from the observed value to the nearest uncertainty boundary (upper or lower) is computed. If the observation is within the boundaries, the outlier score equals the negative distance. As a result, the outlier score is lowest when the observation equals the model prediction. If the observation is outside the boundaries, the score equals the distance and the observation is flagged as an outlier.

One of the main drawbacks of the method is that the model needs to be refit as new data comes in. This is undesirable for applications with high throughput and real-time detection.
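The scoring rule described above can be sketched in a few lines of NumPy. This is only an illustration of the rule as stated in the text, not the detector's actual implementation; the function name `interval_outlier_score` and the toy interval are assumptions made for the example.

```python
import numpy as np


def interval_outlier_score(y, yhat_lower, yhat_upper):
    """Signed distance from y to the nearest interval boundary.

    Hypothetical helper: negative inside [yhat_lower, yhat_upper],
    positive outside.
    """
    y = np.asarray(y, dtype=float)
    # Inside the interval both terms are negative, and the larger one is the
    # negated distance to the nearest boundary. Outside the interval exactly
    # one term is positive and equals the distance to the violated boundary.
    return np.maximum(y - yhat_upper, yhat_lower - y)


# Toy interval [0, 10] with its midpoint at 5.
y = np.array([5.0, 9.5, 11.0, -1.0])
scores = interval_outlier_score(y, yhat_lower=0.0, yhat_upper=10.0)
is_outlier = scores > 0  # a score above zero means the point left the interval

print(scores)      # values: -5.0, -0.5, 1.0, 1.0
print(is_outlier)  # values: False, False, True, True
```

With a symmetric interval, the score is lowest at the midpoint, which is where the observation equals the prediction.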

Dataset

The example uses a weather time series dataset recorded by the Max Planck Institute for Biogeochemistry. The dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity. These were collected every 10 minutes, beginning in 2003. Like the TensorFlow time-series tutorial, we only use data collected between 2009 and 2016.

[1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import tensorflow as tf

from alibi_detect.od import OutlierProphet
from alibi_detect.utils.saving import save_detector, load_detector

Load dataset

[2]:
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True
)
csv_path, _ = os.path.splitext(zip_path)
df = pd.read_csv(csv_path)
df['Date Time'] = pd.to_datetime(df['Date Time'], format='%d.%m.%Y %H:%M:%S')
print(df.shape)
df.head()
(420551, 15)
[2]:
Date Time p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%) VPmax (mbar) VPact (mbar) VPdef (mbar) sh (g/kg) H2OC (mmol/mol) rho (g/m**3) wv (m/s) max. wv (m/s) wd (deg)
0 2009-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.3 3.33 3.11 0.22 1.94 3.12 1307.75 1.03 1.75 152.3
1 2009-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.4 3.23 3.02 0.21 1.89 3.03 1309.80 0.72 1.50 136.1
2 2009-01-01 00:30:00 996.53 -8.51 264.91 -9.31 93.9 3.21 3.01 0.20 1.88 3.02 1310.24 0.19 0.63 171.6
3 2009-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.2 3.26 3.07 0.19 1.92 3.08 1309.19 0.34 0.50 198.0
4 2009-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.1 3.27 3.08 0.19 1.92 3.09 1309.00 0.32 0.63 214.3

Select subset to test Prophet model on:

[3]:
n_prophet = 10000

The Prophet model expects a DataFrame with 2 columns: one named ds with the timestamps and one named y with the time series to be evaluated. We will just look at the temperature data:

[4]:
d = {'ds': df['Date Time'][:n_prophet], 'y': df['T (degC)'][:n_prophet]}
df_T = pd.DataFrame(data=d)
print(df_T.shape)
df_T.head()
(10000, 2)
[4]:
ds y
0 2009-01-01 00:10:00 -8.02
1 2009-01-01 00:20:00 -8.41
2 2009-01-01 00:30:00 -8.51
3 2009-01-01 00:40:00 -8.31
4 2009-01-01 00:50:00 -8.27
[5]:
plt.plot(df_T['ds'], df_T['y'])
plt.title('T (in °C) over time')
plt.xlabel('Time')
plt.ylabel('T (in °C)')
plt.show()
../_images/examples_od_prophet_weather_8_0.png

Load or define outlier detector

The pretrained outlier and adversarial detectors used in the example notebooks can be found here. You can either manually download the relevant files in the od_prophet_weather folder to e.g. the local directory my_dir or, if you have the Google Cloud SDK installed, download the whole folder as follows:

!gsutil cp -r gs://seldon-models/alibi-detect/od_prophet_weather my_dir
[6]:
load_outlier_detector = False
[7]:
filepath = './od_prophet_weather/'  # change to directory where model is downloaded
if load_outlier_detector:  # load pretrained outlier detector
    od = load_detector(filepath)
else:  # initialize, fit and save outlier detector
    od = OutlierProphet(threshold=.9)
    od.fit(df_T)
    save_detector(od, filepath)
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
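
The log message is expected for this subset: 10,000 samples at 10-minute intervals cover only about 69 days, far less than a full year. As far as we are aware, Prophet's automatic setting enables yearly seasonality only when the history spans roughly two years or more; treat that threshold as an assumption and check the Prophet documentation. The sketch below reproduces the span of the training data from the first timestamp and the sampling interval shown earlier.

```python
import pandas as pd

n_prophet = 10000  # number of training samples, as defined above
start = pd.Timestamp('2009-01-01 00:10:00')  # first timestamp in the dataset

# Rebuild the training timestamps from the 10-minute sampling interval.
ds = pd.date_range(start=start, periods=n_prophet, freq='10min')
span = ds[-1] - ds[0]

print(ds[-1])  # 2009-03-11 10:40:00, the last training timestamp
print(span)    # 69 days 10:30:00
```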

See the detector documentation and the original Prophet documentation for how to customize the Prophet-based outlier detector: add seasonalities or holidays, opt for a saturating logistic growth model, or apply parameter regularization.

Predict outliers on test data

Define the test data. It is important that the timestamps of the test data follow the training data. We check this below by comparing the first few rows of the test DataFrame with the last few of the training DataFrame:

[8]:
n_periods = 1000
d = {'ds': df['Date Time'][n_prophet:n_prophet+n_periods],
     'y': df['T (degC)'][n_prophet:n_prophet+n_periods]}
df_T_test = pd.DataFrame(data=d)
df_T_test.head()
[8]:
ds y
10000 2009-03-11 10:50:00 4.12
10001 2009-03-11 11:00:00 4.62
10002 2009-03-11 11:10:00 4.29
10003 2009-03-11 11:20:00 3.95
10004 2009-03-11 11:30:00 3.96
[9]:
df_T.tail()
[9]:
ds y
9995 2009-03-11 10:00:00 2.69
9996 2009-03-11 10:10:00 2.98
9997 2009-03-11 10:20:00 3.66
9998 2009-03-11 10:30:00 4.21
9999 2009-03-11 10:40:00 4.19
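
The visual comparison above can also be done programmatically. The following is a minimal sketch, assuming the 10-minute sampling interval of this dataset; the helper name `follows_training` is hypothetical and not part of alibi-detect. The two small DataFrames reuse rows from the outputs above.

```python
import pandas as pd


def follows_training(df_train, df_test, freq='10min'):
    """Hypothetical check: test data starts one step after training ends
    and its timestamps are strictly increasing."""
    step = pd.Timedelta(freq)
    starts_next = df_test['ds'].iloc[0] - df_train['ds'].iloc[-1] == step
    increasing = df_test['ds'].is_monotonic_increasing and df_test['ds'].is_unique
    return bool(starts_next and increasing)


# Last training rows and first test rows from the outputs above.
train_tail = pd.DataFrame({
    'ds': pd.to_datetime(['2009-03-11 10:30:00', '2009-03-11 10:40:00']),
    'y': [4.21, 4.19],
})
test_head = pd.DataFrame({
    'ds': pd.to_datetime(['2009-03-11 10:50:00', '2009-03-11 11:00:00']),
    'y': [4.12, 4.62],
})

ok = follows_training(train_tail, test_head)
print(ok)  # True
```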

Predict outliers on test data:

[10]:
od_preds = od.predict(
    df_T_test,
    return_instance_score=True,
    return_forecast=True
)

Visualize results

We can first visualize our predictions with Prophet’s built-in plotting functionality. This also allows us to include historical predictions:

[11]:
future = od.model.make_future_dataframe(periods=n_periods, freq='10min', include_history=True)  # '10min' is the non-deprecated alias of '10T'
forecast = od.model.predict(future)
fig = od.model.plot(forecast)
../_images/examples_od_prophet_weather_18_0.png

We can also plot the breakdown of the different components in the forecast. Since we did not do full Bayesian inference with mcmc_samples, the uncertainty intervals of the forecast are determined by the MAP estimate of the extrapolated trend.

[12]:
fig = od.model.plot_components(forecast)
../_images/examples_od_prophet_weather_20_0.png

It is clear that the further into the future we predict, the wider the uncertainty intervals that determine the outlier thresholds.

Let’s overlay the actual data with the predicted upper and lower outlier thresholds and check where we predicted outliers:

[13]:
forecast['y'] = df['T (degC)'][:n_prophet+n_periods]
[14]:
pd.plotting.register_matplotlib_converters()  # needed to plot timestamps
forecast[-n_periods:].plot(x='ds', y=['y', 'yhat', 'yhat_upper', 'yhat_lower'])
plt.title('Predicted T (in °C) over time')
plt.xlabel('Time')
plt.ylabel('T (in °C)')
plt.show()
../_images/examples_od_prophet_weather_23_0.png

Outlier scores and predictions:

[15]:
od_preds['data']['forecast']['threshold'] = np.zeros(n_periods)
od_preds['data']['forecast'][-n_periods:].plot(x='ds', y=['score', 'threshold'])
plt.title('Outlier score over time')
plt.xlabel('Time')
plt.ylabel('Outlier score')
plt.show()
../_images/examples_od_prophet_weather_25_0.png

The outlier scores naturally trend down as uncertainty increases when we predict further into the future.

Let’s look at some individual outliers:

[16]:
df_fcst = od_preds['data']['forecast']
df_outlier = df_fcst.loc[df_fcst['score'] > 0]
[17]:
print('Number of outliers: {}'.format(df_outlier.shape[0]))
df_outlier[['ds', 'yhat', 'yhat_lower', 'yhat_upper', 'y']]
Number of outliers: 53
[17]:
ds yhat yhat_lower yhat_upper y
273 2009-03-13 08:20:00 1.935789 -1.826629 5.516289 5.54
280 2009-03-13 09:30:00 2.546610 -1.601865 6.204580 7.22
281 2009-03-13 09:40:00 2.661307 -1.495502 6.633411 7.11
282 2009-03-13 09:50:00 2.782348 -1.494948 6.871585 7.22
283 2009-03-13 10:00:00 2.909426 -1.298733 6.698220 7.50
284 2009-03-13 10:10:00 3.042175 -1.093707 6.964684 7.71
285 2009-03-13 10:20:00 3.180168 -0.841032 7.068118 7.93
286 2009-03-13 10:30:00 3.322914 -0.884821 7.502285 7.98
287 2009-03-13 10:40:00 3.469862 -0.631244 7.934278 7.97
288 2009-03-13 10:50:00 3.620397 -0.496608 7.327824 8.11
289 2009-03-13 11:00:00 3.773845 -0.408267 7.595441 8.31
290 2009-03-13 11:10:00 3.929473 -0.072862 7.958980 8.22
291 2009-03-13 11:20:00 4.086493 -0.010458 8.128684 8.47
292 2009-03-13 11:30:00 4.244068 0.054846 8.501898 8.65
293 2009-03-13 11:40:00 4.401316 0.314271 8.553597 8.73
294 2009-03-13 11:50:00 4.557318 0.436760 8.401234 8.86
295 2009-03-13 12:00:00 4.711127 0.829825 8.690445 9.03
296 2009-03-13 12:10:00 4.861773 0.724462 8.896792 9.20
297 2009-03-13 12:20:00 5.008280 0.990100 8.728768 9.27
306 2009-03-13 13:50:00 5.989570 1.914167 9.902415 10.18
310 2009-03-13 14:30:00 6.146584 1.897654 9.957388 10.43
314 2009-03-13 15:10:00 6.108253 1.927911 10.028861 10.24
316 2009-03-13 15:30:00 6.021822 1.821570 9.899489 10.40
317 2009-03-13 15:40:00 5.963937 1.839981 9.926509 10.32
320 2009-03-13 16:10:00 5.741674 1.660523 9.957886 10.07
321 2009-03-13 16:20:00 5.654549 1.465388 9.434279 9.85
322 2009-03-13 16:30:00 5.562586 1.611620 9.407458 9.62
323 2009-03-13 16:40:00 5.466817 1.135502 9.618235 9.75
324 2009-03-13 16:50:00 5.368270 1.342344 9.310384 9.52
325 2009-03-13 17:00:00 5.267951 1.177343 9.045009 9.19
435 2009-03-14 11:20:00 4.284622 0.283918 8.698228 8.77
437 2009-03-14 11:40:00 4.597406 0.064846 8.831519 9.11
439 2009-03-14 12:00:00 4.904994 0.271837 9.091282 9.51
440 2009-03-14 12:10:00 5.054459 0.467917 9.607358 9.71
441 2009-03-14 12:20:00 5.199739 0.144739 9.485994 9.64
442 2009-03-14 12:30:00 5.339857 0.837788 9.698491 9.76
443 2009-03-14 12:40:00 5.473850 0.917273 9.392643 9.74
444 2009-03-14 12:50:00 5.600781 1.123531 9.744170 9.91
469 2009-03-14 17:00:00 5.406697 0.639853 9.572675 9.66
473 2009-03-14 17:40:00 4.996114 0.359447 9.173769 9.49
474 2009-03-14 17:50:00 4.897644 0.070645 9.390001 9.42
476 2009-03-14 18:10:00 4.710216 -0.004431 9.217274 9.22
477 2009-03-14 18:20:00 4.622153 0.300309 8.932628 9.16
514 2009-03-15 00:30:00 2.740187 -2.492916 7.336338 7.41
515 2009-03-15 00:40:00 2.698383 -2.213351 7.422480 7.56
516 2009-03-15 00:50:00 2.658244 -2.476084 7.266544 7.56
517 2009-03-15 01:00:00 2.619797 -2.000476 7.501970 7.53
518 2009-03-15 01:10:00 2.583029 -2.474907 7.270634 7.44
520 2009-03-15 01:30:00 2.514298 -2.567598 7.188325 7.36
521 2009-03-15 01:40:00 2.482139 -2.154455 7.286245 7.36
523 2009-03-15 02:00:00 2.421542 -2.598990 7.052340 7.28
524 2009-03-15 02:10:00 2.392767 -2.674122 7.125171 7.23
527 2009-03-15 02:40:00 2.310348 -2.907795 6.809362 6.92

The outliers occur on 3 consecutive days which were warmer than predicted.
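
One way to confirm this kind of observation is to count the flagged rows per calendar day. The sketch below uses only a short excerpt of the timestamps from the table above, stored in a hypothetical `df_excerpt`; in the notebook the same grouping could be applied to `df_outlier['ds']`.

```python
import pandas as pd

# Excerpt of flagged timestamps copied from the table above (not the full list).
df_excerpt = pd.DataFrame({
    'ds': pd.to_datetime([
        '2009-03-13 08:20:00',
        '2009-03-13 09:30:00',
        '2009-03-14 11:20:00',
        '2009-03-15 00:30:00',
        '2009-03-15 00:40:00',
    ])
})

# Truncate each timestamp to midnight and count rows per day.
per_day = df_excerpt['ds'].dt.normalize().value_counts().sort_index()

# The days are consecutive if every gap between them is exactly one day.
gaps = per_day.index.to_series().diff().dropna()
consecutive = bool((gaps == pd.Timedelta(days=1)).all())

print(per_day)
print(consecutive)  # True for this excerpt
```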