autots.datasets package

Submodules

autots.datasets.fred module

FRED (Federal Reserve Economic Data) Data Import

Requires an API key from FRED and the fredapi package (pip install fredapi).

autots.datasets.fred.get_fred_data(fredkey: str, SeriesNameDict: dict | None = None, long=True, observation_start=None, sleep_seconds: int = 1, **kwargs)

Imports data from the Federal Reserve. For simplest results, make sure all requested series share the same frequency.

Parameters:
  • fredkey (str) – an API key from FRED

  • SeriesNameDict (dict) – pairs of FRED series IDs and series names, like {'SeriesID': 'SeriesName'}, or a list of FRED IDs. The series ID must match a FRED ID, but the name can be anything. If None, several default series are returned.

  • long (bool) – if True, return long-style data; else return wide-style data with a datetime index

  • observation_start (datetime) – passed to Fred get_series

  • sleep_seconds (int) – seconds to sleep between each series call; usually reduces the chance of failure
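
A minimal usage sketch (not from the source), assuming a valid API key; the key below is a placeholder and the series-name mapping is illustrative, though 'DGS10' and 'SP500' are real FRED series IDs that also appear as defaults for load_live_daily later in this page:

```python
from autots.datasets.fred import get_fred_data

fred_df = get_fred_data(
    fredkey="YOUR_FRED_API_KEY",  # placeholder; obtain a free key from FRED
    SeriesNameDict={"DGS10": "10yr_treasury", "SP500": "sp500"},
    long=False,  # wide-style data with a datetime index
    observation_start="2018-01-01",
    sleep_seconds=2,
)
```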

autots.datasets.synthetic module

Synthetic Daily Data Generator with Labeled Changepoints, Anomalies, and Holidays

@author: winedarksea with Claude Sonnet v4.5

Matching test file in tests/test_synthetic_data.py

class autots.datasets.synthetic.SyntheticDailyGenerator(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, trend_changepoint_freq=0.5, level_shift_freq=0.1, level_shift_strength=0.4, anomaly_freq=0.05, shared_anomaly_prob=0.2, shared_level_shift_prob=0.2, weekly_seasonality_strength=1.0, yearly_seasonality_strength=1.0, noise_level=0.1, include_regressors=False, anomaly_types=None, disable_holiday_splash=False)

Bases: object

Generate realistic synthetic daily time series data with labeled components.

Creates multivariate time series with:
  • Piecewise linear trends with changepoints
  • Level shifts (instantaneous and ramped)
  • Seasonality (weekly, yearly) with stochastic variation
  • Holiday effects (common and custom) with splash and bridge effects
  • Anomalies with various post-event patterns
  • Noise with regime changes
  • Optional regressor impacts
  • Business day series with weekend NaN
  • Multiple scales across series

All components are labeled and stored for model evaluation.

Variability Across Series:
  • Noise levels vary 0.5x-2.0x the base noise_level per series
  • Weekly seasonality strength varies 0.3x-2.5x per series
  • Yearly seasonality strength varies 0.2x-2.0x per series
  • Level shift frequency varies across series (some have none, some have several)
  • This creates a range from subtle, hard-to-detect patterns to very obvious ones

Event Scaling with Dataset Length:
  • Events (anomalies, level shifts, etc.) scale appropriately with n_days
  • Short datasets (< 1 year) use probabilistic event generation
  • Longer datasets use Poisson-based event counts
  • Level shifts are rare events, appropriately distributed

Template Compatibility:
  • Template structure is compatible with TimeSeriesFeatureDetector
  • Both use the same JSON-friendly format for components and labels
  • Templates can be saved/loaded and used for model evaluation

Parameter Tuning:
  • Use tune_to_data() to optimize parameters to match real-world data
  • Tuning adjusts frequency and strength parameters based on statistical properties
  • See TUNING_GUIDE.md for detailed usage examples

Parameters:
  • start_date (str or pd.Timestamp) – Start date for the time series

  • n_days (int) – Number of days to generate

  • n_series (int) – Number of time series to generate

  • random_seed (int) – Random seed for reproducibility

  • trend_changepoint_freq (float) – Probability per year of a trend changepoint (default 0.5)

  • level_shift_freq (float) – Probability per year of a level shift (default 0.1)

  • level_shift_strength (float) – Controls the magnitude of level shifts as a percentage of the series baseline. Shifts will be sampled from 10% to this value (skewed toward 10%), but always at least 5x the noise standard deviation for detectability (default 0.4 = 40%)

  • anomaly_freq (float) – Probability per week of an anomaly (default 0.05)

  • weekly_seasonality_strength (float) – Base strength of weekly seasonality (default 1.0). Actual per-series strength varies 0.3x-2.5x this value.

  • yearly_seasonality_strength (float) – Base strength of yearly seasonality (default 1.0). Actual per-series strength varies 0.2x-2.0x this value.

  • noise_level (float) – Base noise level, relative to signal (default 0.1). Actual per-series level varies 0.5x-2.0x this value.

  • include_regressors (bool) – Whether to include regressor effects (default False)

  • anomaly_types (list of str or None) – List of anomaly types to generate. Valid types are: 'point_outlier', 'noisy_burst', 'impulse_decay', 'linear_decay', 'transient_change'. If None (default), all types are generated.

  • disable_holiday_splash (bool) – If True, holidays will only affect a single day with no splash or bridge effects (default False)

Examples

Basic usage:

>>> from autots.datasets import generate_synthetic_daily_data
>>> gen = generate_synthetic_daily_data(n_days=365, n_series=5)
>>> data = gen.get_data()
>>> labels = gen.get_all_labels()

Tuning to real-world data:

>>> import pandas as pd
>>> real_data = pd.read_csv('real_data.csv', index_col=0, parse_dates=True)
>>> gen = generate_synthetic_daily_data(
...     start_date=real_data.index[0],
...     n_days=len(real_data),
...     n_series=len(real_data.columns),
... )
>>> results = gen.tune_to_data(real_data, n_iterations=20, verbose=True)
>>> gen._generate()  # Regenerate with tuned parameters
>>> tuned_data = gen.get_data()
SERIES_TYPE_DESCRIPTIONS = {'autocorrelated_noise': 'Autocorrelated Noise (AR)', 'business_day': 'Business Day (weekend NaN)', 'granger_lagged': 'Granger Lagged (7-day lag from Lunar Holidays)', 'lunar_holidays': 'Lunar Holidays', 'multiplicative_seasonality': 'Multiplicative Seasonality (AR noise)', 'no_level_shifts': 'No Level Shifts', 'ramadan_holidays': 'Ramadan Holidays', 'saturating_trend': 'Saturating Trend (logistic)', 'seasonality_changepoints': 'Seasonality Changepoints', 'standard': 'Standard', 'time_varying_seasonality': 'Time-Varying Seasonality', 'variance_regimes': 'Variance Regimes (GARCH)'}
TEMPLATE_VERSION = '1.0'
get_all_labels(series_name=None)

Get all labels in a structured format for easy model evaluation.

Parameters:

series_name (str, optional) – If provided, return labels for specific series only.

Returns:

Comprehensive dictionary of all labels and metadata.

Return type:

dict

get_anomalies(series_name=None)

Get anomaly labels: {series_name: [(date, magnitude, type, duration, shared), …]}
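
As a sketch of consuming these labels, assuming only the tuple layout documented above (series names are whatever the generator assigns):

```python
from autots.datasets import generate_synthetic_daily_data

gen = generate_synthetic_daily_data(n_days=730, n_series=3)
# Each series maps to a list of (date, magnitude, type, duration, shared) tuples
for series, events in gen.get_anomalies().items():
    for date, magnitude, kind, duration, shared in events:
        print(series, date, kind, round(magnitude, 2), duration, shared)
```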

get_components(series_name=None)

Get individual components for analysis.

Parameters:

series_name (str, optional) – If provided, return components for specific series. If None, return all components.

Returns:

Dictionary of {series_name: {component_name: array}}

Return type:

dict

get_data()

Get the generated time series data.

get_holiday_config()

Get holiday splash/bridge configuration: {holiday_name: {'has_splash': bool, 'has_bridge': bool}}

get_holiday_impacts(series_name=None)

Get holiday impact labels (main holiday dates only): {series_name: {date: impact}}

get_holiday_splash_impacts(series_name=None)

Get holiday splash/bridge day impacts: {series_name: {date: impact}}

get_lagged_influences(series_name=None)

Get lagged influence information for Granger-style causal relationships.

Parameters:

series_name (str, optional) – If provided, return lagged influence info for specific series. If None, return all lagged influences.

Returns:

Dictionary of {series_name: {'source': source_series, 'lag': lag_days, 'coefficient': coef}}, or a single dict if series_name is specified

Return type:

dict

get_level_shifts(series_name=None)

Get level shift labels: {series_name: [(date, magnitude, type, shared), …]}

get_noise_changepoints(series_name=None)

Get noise distribution changepoints: {series_name: [(date, old_params, new_params), …]}

get_noise_to_signal_ratios()

Get noise-to-signal ratios for all series.

get_regressor_impacts(series_name=None)

Get regressor impacts: {series_name: {'by_date': {date: {regressor: impact}}, 'coefficients': {…}}}

get_regressors()

Get the generated regressors (if any).

get_seasonality_changepoints(series_name=None)

Get seasonality changepoints: {series_name: [(date, description), …]}

get_series_noise_levels()

Get per-series noise levels.

get_series_scales()

Get scale factors for all series.

get_series_seasonality_strengths()

Get per-series seasonality strengths.

get_series_type_description(series_name)

Get human-readable description for a series type.

Parameters:

series_name (str) – Name of the series

Returns:

Human-readable description of the series type

Return type:

str

get_template(series_name=None, deep=True)

Get the JSON-friendly template describing the generated data.

get_trend_changepoints(series_name=None)

Get trend changepoint labels: {series_name: [(date, old_slope, new_slope), …]}

machine_summary(series_name=None, include_events=True, include_regressors=True, max_events_per_type=25, round_decimals=6, as_json=False)

Return a structured summary tailored for LLM or tool consumption.

plot(series_name=None, figsize=(16, 12), save_path=None, show=True)

Plot a series with all its labeled components clearly marked.

Parameters:
  • series_name (str, optional) – Name of series to plot. If None, randomly selects one.

  • figsize (tuple, optional) – Figure size (width, height) in inches. Default (16, 12).

  • save_path (str, optional) – If provided, saves the plot to this path instead of displaying.

  • show (bool, optional) – Whether to display the plot. Default True.

Returns:

fig – The generated figure object

Return type:

matplotlib.figure.Figure

Raises:

ImportError – If matplotlib is not installed

classmethod render_template(template, return_components=False)

Render a template into time series using the generator’s renderer.
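
A hedged round-trip sketch combining get_template() and render_template(); the docs state the template is JSON-friendly, but the exact return value of render_template is an assumption here:

```python
import json

from autots.datasets import SyntheticDailyGenerator, generate_synthetic_daily_data

gen = generate_synthetic_daily_data(n_days=365, n_series=2)
template = gen.get_template()   # JSON-friendly dict describing the generated data
saved = json.dumps(template)    # can be persisted to disk and reloaded later
# Re-render the same series from the saved template
rendered = SyntheticDailyGenerator.render_template(json.loads(saved))
```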

summary()

Print a summary of the generated data.

to_csv(filepath, include_regressors=False)

Save generated data to CSV.

Parameters:
  • filepath (str) – Path to save the CSV file

  • include_regressors (bool) – Whether to include regressors in the output

tune_to_data(target_df, n_iterations=20, n_standard_series=None, metric='mse', verbose=True, random_seed=None)

Tune generator parameters to match real-world time series data.

This method optimizes the generator’s parameters to minimize the difference between synthetic data and real-world data based on distributional statistics. Special series types are not tuned but will still be generated with optimized base parameters.

TODO: this is a fairly basic implementation, and won't tune many aspects of real-world data

Parameters:
  • target_df (pd.DataFrame) – Real-world time series data to match (DatetimeIndex, numeric columns)

  • n_iterations (int, optional) – Number of optimization iterations (default 20)

  • n_standard_series (int, optional) – Number of standard series to generate for comparison during tuning. If None, uses min(target_df.shape[1], 5) series.

  • metric (str, optional) – Distance metric to minimize: 'mse', 'mae', 'wasserstein' (default 'mse')

  • verbose (bool, optional) – Whether to print progress (default True)

  • random_seed (int, optional) – Random seed for tuning process (default None, uses current random_seed)

Returns:

Dictionary containing:
  • 'best_params': Optimized parameter dictionary
  • 'best_score': Best score achieved
  • 'target_stats': Statistics from target data
  • 'synthetic_stats': Statistics from best synthetic data (scaled)
  • 'scale_multiplier': Factor to multiply synthetic data by to match target magnitude

Return type:

dict

Notes

Updates self with best parameters found. After calling this method, new data generation will use the tuned parameters.

Important: The synthetic data is generated on a base scale (~50), which may differ from your real-world data scale. The returned 'scale_multiplier' should be applied to generated data to match the magnitude of the target data:

>>> gen._generate()  # Regenerate with tuned parameters
>>> scaled_data = gen.data * gen.tuning_results['scale_multiplier']

The scale multiplier is chosen so that the mean of absolute series means matches between target and synthetic data, keeping the overall magnitude similar.

Raises:
  • ImportError – If scipy is not installed (required for optimization)

  • ValueError – If target_df is invalid

autots.datasets.synthetic.augment_with_synthetic_bounds(X: DataFrame, Y, ratio: float, random_seed: int = 0, max_fraction: float = 0.25)

Prepend synthetic boundary samples to X/Y to anchor scaling.

Parameters:
  • X (pandas.DataFrame) – Feature matrix with datetime-based index.

  • Y (array-like) – Response array aligned with X (1d or 2d).

  • ratio (float) – Share of synthetic samples relative to len(X). Capped by max_fraction.

  • random_seed (int, default 0) – Base seed to make augmentation repeatable.

  • max_fraction (float, default 0.25) – Maximum share of synthetic samples permitted.
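
A usage sketch under the assumption that the function returns the augmented (X, Y) pair (the docs do not state the return type); the data below is synthetic filler:

```python
import numpy as np
import pandas as pd

from autots.datasets.synthetic import augment_with_synthetic_bounds

X = pd.DataFrame(
    np.random.default_rng(0).normal(size=(100, 3)),
    index=pd.date_range("2024-01-01", periods=100, freq="D"),
)
Y = np.random.default_rng(1).normal(size=100)

# ratio=0.1 requests ~10% synthetic boundary rows, capped by max_fraction
X_aug, Y_aug = augment_with_synthetic_bounds(X, Y, ratio=0.1, random_seed=0)
```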

autots.datasets.synthetic.generate_synthetic_daily_data(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, **kwargs)

Quick function to generate synthetic daily data.

Parameters:
  • start_date (str) – Start date for the time series

  • n_days (int) – Number of days to generate

  • n_series (int) – Number of series to generate

  • random_seed (int) – Random seed for reproducibility

  • **kwargs – Additional parameters passed to SyntheticDailyGenerator

Returns:

generator – Generator object with data and labels

Return type:

SyntheticDailyGenerator
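
A brief sketch of the **kwargs passthrough, using only parameters documented for SyntheticDailyGenerator above:

```python
from autots.datasets import generate_synthetic_daily_data

gen = generate_synthetic_daily_data(
    n_days=365,
    n_series=4,
    include_regressors=True,  # forwarded to SyntheticDailyGenerator
)
df = gen.get_data()
regressors = gen.get_regressors()  # the generated regressors (if any)
```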

Module contents

Tools for Importing Sample Data

class autots.datasets.SyntheticDailyGenerator(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, trend_changepoint_freq=0.5, level_shift_freq=0.1, level_shift_strength=0.4, anomaly_freq=0.05, shared_anomaly_prob=0.2, shared_level_shift_prob=0.2, weekly_seasonality_strength=1.0, yearly_seasonality_strength=1.0, noise_level=0.1, include_regressors=False, anomaly_types=None, disable_holiday_splash=False)

Bases: object

Re-exported from autots.datasets.synthetic; see the full documentation of SyntheticDailyGenerator in the autots.datasets.synthetic module above.

autots.datasets.generate_synthetic_daily_data(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, **kwargs)

Re-exported from autots.datasets.synthetic; see generate_synthetic_daily_data above.

autots.datasets.load_artificial(long=False, date_start=None, date_end=None)

Load artificially generated series from random distributions.

Parameters:
  • long (bool) – if True, long-style data; if False, wide-style data

  • date_start – str or datetime.datetime of start date

  • date_end – str or datetime.datetime of end date
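
An illustrative call using the documented parameters:

```python
from autots.datasets import load_artificial

df_wide = load_artificial(long=False, date_start="2018-01-01", date_end="2022-01-01")
```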

autots.datasets.load_daily(long: bool = True)

Daily sample data.

```python
from os.path import join

from autots.datasets import load_live_daily

# most of the wiki data was chosen to show holidays or holiday-like patterns
wiki = [
    'United_States', 'Germany', 'List_of_highest-grossing_films', 'Jesus',
    'Michael_Jackson', 'List_of_United_States_cities_by_population',
    'Microsoft_Office', 'Google_Chrome', 'Periodic_table', 'Standard_deviation',
    'Easter', 'Christmas', 'Chinese_New_Year', 'Thanksgiving',
    'List_of_countries_that_have_gained_independence_from_the_United_Kingdom',
    'History_of_the_hamburger', 'Elizabeth_II', 'William_Shakespeare',
    'George_Washington', 'Cleopatra', 'all',
]
df2 = load_live_daily(
    observation_start="2017-01-01",
    weather_years=7,
    trends_list=None,
    gov_domain_list=None,
    wikipedia_pages=wiki,
    fred_series=['DGS10', 'T5YIE', 'SP500', 'DEXUSEU'],
    sleep_seconds=10,
    fred_key="93873d40f10c20fe6f6e75b1ad0aed4d",
    weather_data_types=["WSF2", "PRCP"],
    weather_stations=["USW00014771"],  # looking for intermittent
    tickers=None,
    london_air_stations=None,
    weather_event_types=None,
    earthquake_min_magnitude=None,
)
data_file_name = join("autots", "datasets", "data", "holidays.zip")
df2.to_csv(
    data_file_name,
    index=True,
    compression={
        'method': 'zip',
        'archive_name': 'holidays.csv',
        'compresslevel': 9,  # maximum compression level (0-9)
    },
)
```

Sources: Wikimedia Foundation

Parameters:
  • long (bool) – if True, return data in long format; otherwise return wide
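
An illustrative call in both formats:

```python
from autots.datasets import load_daily

df_wide = load_daily(long=False)  # datetime index, one column per series
df_long = load_daily(long=True)
```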

autots.datasets.load_hourly(long: bool = True)

Traffic data from the MN DOT via the UCI data repository.

autots.datasets.load_linear(long=False, shape=None, start_date: str = '2021-01-01', introduce_nan: float | None = None, introduce_random: float | None = None, random_seed: int = 123)

Create a dataset of simple linear series for testing edge cases.

Parameters:
  • long (bool) – whether to make long or wide

  • shape (tuple) – shape of output dataframe

  • start_date (str) – first date of index

  • introduce_nan (float) – fraction of rows to make null (0.2 = 20%)

  • introduce_random (float) – shape of gamma distribution

  • random_seed (int) – seed for random
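
An illustrative call exercising the documented options:

```python
from autots.datasets import load_linear

df = load_linear(
    shape=(200, 5),        # 200 rows, 5 series
    introduce_nan=0.1,     # make ~10% of rows null
    introduce_random=2.0,  # shape of the gamma distribution
    random_seed=123,
)
```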

autots.datasets.load_live_daily(long: bool = False, observation_start: str | None = None, observation_end: str | None = None, fred_key: str | None = None, fred_series=['DGS10', 'T5YIE', 'SP500', 'DCOILWTICO', 'DEXUSEU', 'WPU0911'], tickers: list = ['MSFT'], trends_list: list = ['forecasting', 'cycling', 'microsoft'], trends_geo: str = 'US', weather_data_types: list = ['AWND', 'WSF2', 'TAVG', 'PRCP'], weather_stations: list = ['USW00094846', 'USW00014925', 'USW00014771'], weather_years: int = 5, noaa_cdo_token: str | None = None, london_air_stations: list = ['CT3', 'SK8'], london_air_species: str = 'PM25', london_air_days: int = 180, earthquake_days: int = 180, earthquake_min_magnitude: int = 5, gsa_key: str | None = None, nasa_api_key: str = 'DEMO_KEY', gov_domain_list=['nasa.gov'], gov_domain_limit: int = 600, wikipedia_pages: list = ['Microsoft_Office', 'List_of_highest-grossing_films'], wiki_language: str = 'en', weather_event_types=['%28Z%29+Winter+Weather', '%28Z%29+Winter+Storm'], caiso_query: str | None = None, eia_key: str | None = None, eia_respondents: list = ['MISO', 'PJM', 'TVA', 'US48'], timeout: float = 300.05, sleep_seconds: int = 10, **kwargs)

Generates a dataframe of data up to the present day. Requires an active internet connection. Try to be respectful of these free data sources by not calling them too heavily or too often. Pass None instead of a specification list to exclude a data source.

Parameters:
  • long (bool) – whether to return in long format or wide

  • observation_start (str) – %Y-%m-%d earliest day to retrieve, passed to Fred.get_series and yfinance.history. Note that APIs with more restrictions have their own default lengths (set below) and ignore this.

  • observation_end (str) – %Y-%m-%d most recent day to retrieve

  • fred_key (str) – https://fred.stlouisfed.org/docs/api/api_key.html

  • fred_series (list) – list of FRED series IDs. This requires the fredapi package.

  • tickers (list) – list of stock tickers, requires yfinance pypi package

  • trends_list (list) – list of search keywords, requires pytrends pypi package. None to skip.

  • weather_data_types (list) – NCEI NOAA API data types (GHCN Daily weather elements): PRCP, SNOW, TMAX, TMIN, TAVG, AWND, WSF1, WSF2, WSF5, WSFG

  • weather_stations (list) – from NCEI NOAA api station ids. Pass empty list to skip.

  • noaa_cdo_token (str) – API token from https://www.ncdc.noaa.gov/cdo-web/token (free, required for weather data)

  • london_air_stations (list) – londonair.org.uk source station IDs. Pass empty list to skip.

  • london_air_species (str) – what measurement to pull from London Air. Not all stations have all metrics.

  • earthquake_min_magnitude (int) – smallest earthquake magnitude to pull from earthquake.usgs.gov. Set None to skip this.

  • gsa_key (str) – api key from https://open.gsa.gov/api/dap/

  • nasa_api_key (str) – API key for https://api.nasa.gov/. Set to None to skip NASA DONKI data.

  • gov_domain_list (list) – list of government-run domains to get traffic data for. Can be very slow, so fewer is better. Some examples: ['usps.com', 'ncbi.nlm.nih.gov', 'cdc.gov', 'weather.gov', 'irs.gov', 'usajobs.gov', 'studentaid.gov', 'nasa.gov', 'uk.usembassy.gov', 'tsunami.gov']

  • gov_domain_limit (int) – max number of records. Smaller will be faster. Max is currently 10000.

  • wikipedia_pages (list) – list of Wikipedia pages, html encoded if needed (underscore for space)

  • weather_event_types (list) – list of html encoded severe weather event types https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Export-Format.pdf

  • caiso_query (str) – ENE_SLRS or None; other queries can be tried but probably won't work due to other hardcoded params

  • timeout (float) – used by some queries

  • sleep_seconds (int) – increasing this may reduce probability of server download failures
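
A hedged sketch that disables most sources, following the documented skip conventions (None or an empty list, depending on the parameter):

```python
from autots.datasets import load_live_daily

df = load_live_daily(
    long=False,
    fred_key=None,                  # no FRED key: FRED series skipped
    tickers=None,                   # skip yfinance stock data
    trends_list=None,               # skip Google Trends
    weather_stations=[],            # empty list skips weather
    london_air_stations=[],         # empty list skips London Air
    earthquake_min_magnitude=None,  # None skips earthquake data
    gov_domain_list=None,           # skip government domain traffic
    wikipedia_pages=["Microsoft_Office"],
    sleep_seconds=10,
)
```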

autots.datasets.load_monthly(long: bool = True)

Federal Reserve of St. Louis monthly economic indicators.

autots.datasets.load_sine(long=False, shape=None, start_date: str = '2021-01-01', introduce_random: float | None = None, random_seed: int = 123)

Create a dataset of sine-wave series for testing edge cases.

autots.datasets.load_weekdays(long: bool = False, categorical: bool = True, periods: int = 180)

Test edge cases by creating a Series with values as day of week.

Parameters:
  • long (bool) – if True, return a df with columns 'value' and 'datetime'; if False, return a Series with a dt index

  • categorical (bool) – if True, return str/object, else return int

  • periods (int) – number of periods, i.e., the length of data to generate
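
For example:

```python
from autots.datasets import load_weekdays

weekday_series = load_weekdays(long=False, categorical=True, periods=60)
```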

autots.datasets.load_weekly(long: bool = True)

Weekly petroleum industry data from the EIA.

autots.datasets.load_yearly(long: bool = True)

Federal Reserve of St. Louis annual economic indicators.

autots.datasets.load_zeroes(long=False, shape=None, start_date: str = '2021-01-01')

Create a dataset of just zeroes for testing edge cases.