autots.datasets package¶
Submodules¶
autots.datasets.fred module¶
FRED (Federal Reserve Economic Data) Data Import
Requires an API key from FRED and pip install fredapi
- autots.datasets.fred.get_fred_data(fredkey: str, SeriesNameDict: dict | None = None, long=True, observation_start=None, sleep_seconds: int = 1, **kwargs)¶
Imports Data from Federal Reserve. For simplest results, make sure requested series are all of the same frequency.
- Parameters:
fredkey (str) – an API key from FRED
SeriesNameDict (dict) – pairs of FRED Series IDs and Series Names, like {‘SeriesID’: ‘SeriesName’}, or a list of FRED IDs. Series IDs must match FRED IDs, but the names can be anything. If None, several default series are returned.
long (bool) – if True, return long style data; else return wide style data with a datetime index
observation_start (datetime) – passed to Fred get_series
sleep_seconds (int) – seconds to sleep between each series call; usually reduces the chance of failure
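Example
A minimal usage sketch; the API key and series mapping shown are placeholders, not defaults:
>>> from autots.datasets.fred import get_fred_data
>>> series = {'DGS10': '10-Year Treasury', 'SP500': 'S&P 500'}  # FRED ID -> display name
>>> df = get_fred_data(
...     'YOUR_FRED_API_KEY', SeriesNameDict=series, long=False, sleep_seconds=2
... )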
autots.datasets.synthetic module¶
Synthetic Daily Data Generator with Labeled Changepoints, Anomalies, and Holidays
@author: winedarksea with Claude Sonnet v4.5
Matching test file in tests/test_synthetic_data.py
- class autots.datasets.synthetic.SyntheticDailyGenerator(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, trend_changepoint_freq=0.5, level_shift_freq=0.1, level_shift_strength=0.4, anomaly_freq=0.05, shared_anomaly_prob=0.2, shared_level_shift_prob=0.2, weekly_seasonality_strength=1.0, yearly_seasonality_strength=1.0, noise_level=0.1, include_regressors=False, anomaly_types=None, disable_holiday_splash=False)¶
Bases: object
Generate realistic synthetic daily time series data with labeled components.
Creates multivariate time series with:
- Piecewise linear trends with changepoints
- Level shifts (instantaneous and ramped)
- Seasonality (weekly, yearly) with stochastic variation
- Holiday effects (common and custom) with splash and bridge effects
- Anomalies with various post-event patterns
- Noise with regime changes
- Optional regressor impacts
- Business day series with weekend NaN
- Multiple scales across series
All components are labeled and stored for model evaluation.
Variability Across Series:
- Noise levels vary 0.5x-2.0x the base noise_level per series
- Weekly seasonality strength varies 0.3x-2.5x per series
- Yearly seasonality strength varies 0.2x-2.0x per series
- Level shift frequency varies across series (some have none, some have several)
- This creates a range from subtle, hard-to-detect patterns to very obvious ones
Event Scaling with Dataset Length:
- Events (anomalies, level shifts, etc.) scale appropriately with n_days
- Short datasets (< 1 year) use probabilistic event generation
- Longer datasets use Poisson-based event counts
- Level shifts are rare events, appropriately distributed
Template Compatibility:
- Template structure is compatible with TimeSeriesFeatureDetector
- Both use the same JSON-friendly format for components and labels
- Templates can be saved/loaded and used for model evaluation
Parameter Tuning:
- Use tune_to_data() to optimize parameters to match real-world data
- Tuning adjusts frequency and strength parameters based on statistical properties
- See TUNING_GUIDE.md for detailed usage examples
- Parameters:
start_date (str or pd.Timestamp) – Start date for the time series
n_days (int) – Number of days to generate
n_series (int) – Number of time series to generate
random_seed (int) – Random seed for reproducibility
trend_changepoint_freq (float) – Probability per year of a trend changepoint (default 0.5)
level_shift_freq (float) – Probability per year of a level shift (default 0.1)
level_shift_strength (float) – Controls the magnitude of level shifts as a percentage of the series baseline. Shifts will be sampled from 10% to this value (skewed toward 10%), but always at least 5x the noise standard deviation for detectability (default 0.4 = 40%)
anomaly_freq (float) – Probability per week of an anomaly (default 0.05)
weekly_seasonality_strength (float) – Base strength of weekly seasonality (default 1.0). Actual per-series strength will vary 0.3x-2.5x this value.
yearly_seasonality_strength (float) – Base strength of yearly seasonality (default 1.0). Actual per-series strength will vary 0.2x-2.0x this value.
noise_level (float) – Base noise level (default 0.1, relative to signal). Actual per-series level will vary 0.5x-2.0x this value.
include_regressors (bool) – Whether to include regressor effects (default False)
anomaly_types (list of str or None) – List of anomaly types to generate. Valid types are: ‘point_outlier’, ‘noisy_burst’, ‘impulse_decay’, ‘linear_decay’, ‘transient_change’. If None (default), all types are generated.
disable_holiday_splash (bool) – If True, holidays will only affect a single day with no splash or bridge effects (default False)
Examples
Basic usage:
>>> from autots.datasets import generate_synthetic_daily_data
>>> gen = generate_synthetic_daily_data(n_days=365, n_series=5)
>>> data = gen.get_data()
>>> labels = gen.get_all_labels()
Tuning to real-world data:
>>> import pandas as pd
>>> real_data = pd.read_csv('real_data.csv', index_col=0, parse_dates=True)
>>> gen = generate_synthetic_daily_data(
...     start_date=real_data.index[0],
...     n_days=len(real_data),
...     n_series=len(real_data.columns),
... )
>>> results = gen.tune_to_data(real_data, n_iterations=20, verbose=True)
>>> gen._generate()  # Regenerate with tuned parameters
>>> tuned_data = gen.get_data()
- SERIES_TYPE_DESCRIPTIONS = {'autocorrelated_noise': 'Autocorrelated Noise (AR)', 'business_day': 'Business Day (weekend NaN)', 'granger_lagged': 'Granger Lagged (7-day lag from Lunar Holidays)', 'lunar_holidays': 'Lunar Holidays', 'multiplicative_seasonality': 'Multiplicative Seasonality (AR noise)', 'no_level_shifts': 'No Level Shifts', 'ramadan_holidays': 'Ramadan Holidays', 'saturating_trend': 'Saturating Trend (logistic)', 'seasonality_changepoints': 'Seasonality Changepoints', 'standard': 'Standard', 'time_varying_seasonality': 'Time-Varying Seasonality', 'variance_regimes': 'Variance Regimes (GARCH)'}¶
- TEMPLATE_VERSION = '1.0'¶
- get_all_labels(series_name=None)¶
Get all labels in a structured format for easy model evaluation.
- Parameters:
series_name (str, optional) – If provided, return labels for specific series only.
- Returns:
Comprehensive dictionary of all labels and metadata.
- Return type:
dict
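Example
A hedged sketch of consuming the returned dictionary; the top-level keys are not enumerated above, so the inspection pattern below is illustrative. It assumes get_data() returns a wide DataFrame whose columns are series names:
>>> from autots.datasets import generate_synthetic_daily_data
>>> gen = generate_synthetic_daily_data(n_days=365, n_series=3)
>>> labels = gen.get_all_labels()
>>> sorted(labels.keys())  # inspect available label categories
>>> one_series = gen.get_all_labels(series_name=gen.get_data().columns[0])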
- get_anomalies(series_name=None)¶
Get anomaly labels: {series_name: [(date, magnitude, type, duration, shared), …]}
- get_components(series_name=None)¶
Get individual components for analysis.
- Parameters:
series_name (str, optional) – If provided, return components for specific series. If None, return all components.
- Returns:
Dictionary of {series_name: {component_name: array}}
- Return type:
dict
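Example
A short sketch iterating the documented {series_name: {component_name: array}} structure:
>>> from autots.datasets import generate_synthetic_daily_data
>>> gen = generate_synthetic_daily_data(n_days=730, n_series=2)
>>> components = gen.get_components()
>>> name = next(iter(components))
>>> for comp_name, values in components[name].items():
...     print(comp_name, len(values))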
- get_data()¶
Get the generated time series data.
- get_holiday_config()¶
Get holiday splash/bridge configuration: {holiday_name: {‘has_splash’: bool, ‘has_bridge’: bool}}
- get_holiday_impacts(series_name=None)¶
Get holiday impact labels (main holiday dates only): {series_name: {date: impact}}
- get_holiday_splash_impacts(series_name=None)¶
Get holiday splash/bridge day impacts: {series_name: {date: impact}}
- get_lagged_influences(series_name=None)¶
Get lagged influence information for Granger-style causal relationships.
- Parameters:
series_name (str, optional) – If provided, return lagged influence info for specific series. If None, return all lagged influences.
- Returns:
Dictionary of {series_name: {‘source’: source_series, ‘lag’: lag_days, ‘coefficient’: coef}} or single dict if series_name is specified
- Return type:
dict
- get_level_shifts(series_name=None)¶
Get level shift labels: {series_name: [(date, magnitude, type, shared), …]}
- get_noise_changepoints(series_name=None)¶
Get noise distribution changepoints: {series_name: [(date, old_params, new_params), …]}
- get_noise_to_signal_ratios()¶
Get noise-to-signal ratios for all series.
- get_regressor_impacts(series_name=None)¶
Get regressor impacts: {series_name: {‘by_date’: {date: {regressor: impact}}, ‘coefficients’: {…}}}
- get_regressors()¶
Get the generated regressors (if any).
- get_seasonality_changepoints(series_name=None)¶
Get seasonality changepoints: {series_name: [(date, description), …]}
- get_series_noise_levels()¶
Get per-series noise levels.
- get_series_scales()¶
Get scale factors for all series.
- get_series_seasonality_strengths()¶
Get per-series seasonality strengths.
- get_series_type_description(series_name)¶
Get human-readable description for a series type.
- Parameters:
series_name (str) – Name of the series
- Returns:
Human-readable description of the series type
- Return type:
str
- get_template(series_name=None, deep=True)¶
Get the JSON-friendly template describing the generated data.
- get_trend_changepoints(series_name=None)¶
Get trend changepoint labels: {series_name: [(date, old_slope, new_slope), …]}
- machine_summary(series_name=None, include_events=True, include_regressors=True, max_events_per_type=25, round_decimals=6, as_json=False)¶
Return a structured summary tailored for LLM or tool consumption.
- plot(series_name=None, figsize=(16, 12), save_path=None, show=True)¶
Plot a series with all its labeled components clearly marked.
- Parameters:
series_name (str, optional) – Name of series to plot. If None, randomly selects one.
figsize (tuple, optional) – Figure size (width, height) in inches. Default (16, 12).
save_path (str, optional) – If provided, saves the plot to this path instead of displaying.
show (bool, optional) – Whether to display the plot. Default True.
- Returns:
fig – The generated figure object
- Return type:
matplotlib.figure.Figure
- Raises:
ImportError – If matplotlib is not installed
- classmethod render_template(template, return_components=False)¶
Render a template into time series using the generator’s renderer.
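Example
A sketch of a save/load round trip, assuming the template serializes directly with the standard json module (it is documented as JSON-friendly) and that render_template returns the rendered series:
>>> import json
>>> from autots.datasets import SyntheticDailyGenerator, generate_synthetic_daily_data
>>> gen = generate_synthetic_daily_data(n_days=365, n_series=2)
>>> template = gen.get_template()
>>> with open('template.json', 'w') as f:
...     json.dump(template, f)
>>> with open('template.json') as f:
...     loaded = json.load(f)
>>> rendered = SyntheticDailyGenerator.render_template(loaded)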
- summary()¶
Print a summary of the generated data.
- to_csv(filepath, include_regressors=False)¶
Save generated data to CSV.
- Parameters:
filepath (str) – Path to save the CSV file
include_regressors (bool) – Whether to include regressors in the output
- tune_to_data(target_df, n_iterations=20, n_standard_series=None, metric='mse', verbose=True, random_seed=None)¶
Tune generator parameters to match real-world time series data.
This method optimizes the generator’s parameters to minimize the difference between synthetic data and real-world data based on distributional statistics. Special series types are not tuned but will still be generated with optimized base parameters.
TODO: this is a fairly basic implementation and won’t tune many aspects of real-world data
- Parameters:
target_df (pd.DataFrame) – Real-world time series data to match (DatetimeIndex, numeric columns)
n_iterations (int, optional) – Number of optimization iterations (default 20)
n_standard_series (int, optional) – Number of standard series to generate for comparison during tuning. If None, uses min(target_df.shape[1], 5) series.
metric (str, optional) – Distance metric to minimize: ‘mse’, ‘mae’, ‘wasserstein’ (default ‘mse’)
verbose (bool, optional) – Whether to print progress (default True)
random_seed (int, optional) – Random seed for tuning process (default None, uses current random_seed)
- Returns:
Dictionary containing:
- ‘best_params’: Optimized parameter dictionary
- ‘best_score’: Best score achieved
- ‘target_stats’: Statistics from target data
- ‘synthetic_stats’: Statistics from best synthetic data (scaled)
- ‘scale_multiplier’: Factor to multiply synthetic data by to match target magnitude
- Return type:
dict
Notes
Updates self with best parameters found. After calling this method, new data generation will use the tuned parameters.
Important: The synthetic data is generated on a base scale (~50), which may differ from your real-world data scale. The returned ‘scale_multiplier’ should be applied to generated data to match the magnitude of the target data:
>>> gen._generate()  # Regenerate with tuned parameters
>>> scaled_data = gen.data * gen.tuning_results['scale_multiplier']
The scale multiplier matches the mean of absolute means between target and synthetic data, ensuring the overall magnitude is similar.
- Raises:
ImportError – If scipy is not installed (required for optimization)
ValueError – If target_df is invalid
- autots.datasets.synthetic.augment_with_synthetic_bounds(X: DataFrame, Y, ratio: float, random_seed: int = 0, max_fraction: float = 0.25)¶
Prepend synthetic boundary samples to X/Y to anchor scaling.
- Parameters:
X (pandas.DataFrame) – Feature matrix with datetime-based index.
Y (array-like) – Response array aligned with X (1d or 2d).
ratio (float) – Share of synthetic samples relative to len(X). Capped by max_fraction.
random_seed (int, default 0) – Base seed to make augmentation repeatable.
max_fraction (float, default 0.25) – Maximum share of synthetic samples permitted.
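Example
A minimal sketch; the docstring does not state the return value explicitly, so the assumption here is that the augmented (X, Y) pair is returned:
>>> import numpy as np
>>> import pandas as pd
>>> from autots.datasets.synthetic import augment_with_synthetic_bounds
>>> idx = pd.date_range('2023-01-01', periods=100, freq='D')
>>> X = pd.DataFrame({'feature': np.random.randn(100)}, index=idx)
>>> Y = np.random.randn(100)
>>> X_aug, Y_aug = augment_with_synthetic_bounds(X, Y, ratio=0.1, random_seed=0)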
- autots.datasets.synthetic.generate_synthetic_daily_data(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, **kwargs)¶
Quick function to generate synthetic daily data.
- Parameters:
start_date (str) – Start date for the time series
n_days (int) – Number of days to generate
n_series (int) – Number of series to generate
random_seed (int) – Random seed for reproducibility
**kwargs – Additional parameters passed to SyntheticDailyGenerator
- Returns:
generator – Generator object with data and labels
- Return type:
SyntheticDailyGenerator
Module contents¶
Tools for Importing Sample Data
- class autots.datasets.SyntheticDailyGenerator(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, trend_changepoint_freq=0.5, level_shift_freq=0.1, level_shift_strength=0.4, anomaly_freq=0.05, shared_anomaly_prob=0.2, shared_level_shift_prob=0.2, weekly_seasonality_strength=1.0, yearly_seasonality_strength=1.0, noise_level=0.1, include_regressors=False, anomaly_types=None, disable_holiday_splash=False)¶
Bases: object
Generate realistic synthetic daily time series data with labeled components.
Creates multivariate time series with:
- Piecewise linear trends with changepoints
- Level shifts (instantaneous and ramped)
- Seasonality (weekly, yearly) with stochastic variation
- Holiday effects (common and custom) with splash and bridge effects
- Anomalies with various post-event patterns
- Noise with regime changes
- Optional regressor impacts
- Business day series with weekend NaN
- Multiple scales across series
All components are labeled and stored for model evaluation.
Variability Across Series:
- Noise levels vary 0.5x-2.0x the base noise_level per series
- Weekly seasonality strength varies 0.3x-2.5x per series
- Yearly seasonality strength varies 0.2x-2.0x per series
- Level shift frequency varies across series (some have none, some have several)
- This creates a range from subtle, hard-to-detect patterns to very obvious ones
Event Scaling with Dataset Length:
- Events (anomalies, level shifts, etc.) scale appropriately with n_days
- Short datasets (< 1 year) use probabilistic event generation
- Longer datasets use Poisson-based event counts
- Level shifts are rare events, appropriately distributed
Template Compatibility:
- Template structure is compatible with TimeSeriesFeatureDetector
- Both use the same JSON-friendly format for components and labels
- Templates can be saved/loaded and used for model evaluation
Parameter Tuning:
- Use tune_to_data() to optimize parameters to match real-world data
- Tuning adjusts frequency and strength parameters based on statistical properties
- See TUNING_GUIDE.md for detailed usage examples
- Parameters:
start_date (str or pd.Timestamp) – Start date for the time series
n_days (int) – Number of days to generate
n_series (int) – Number of time series to generate
random_seed (int) – Random seed for reproducibility
trend_changepoint_freq (float) – Probability per year of a trend changepoint (default 0.5)
level_shift_freq (float) – Probability per year of a level shift (default 0.1)
level_shift_strength (float) – Controls the magnitude of level shifts as a percentage of the series baseline. Shifts will be sampled from 10% to this value (skewed toward 10%), but always at least 5x the noise standard deviation for detectability (default 0.4 = 40%)
anomaly_freq (float) – Probability per week of an anomaly (default 0.05)
weekly_seasonality_strength (float) – Base strength of weekly seasonality (default 1.0). Actual per-series strength will vary 0.3x-2.5x this value.
yearly_seasonality_strength (float) – Base strength of yearly seasonality (default 1.0). Actual per-series strength will vary 0.2x-2.0x this value.
noise_level (float) – Base noise level (default 0.1, relative to signal). Actual per-series level will vary 0.5x-2.0x this value.
include_regressors (bool) – Whether to include regressor effects (default False)
anomaly_types (list of str or None) – List of anomaly types to generate. Valid types are: ‘point_outlier’, ‘noisy_burst’, ‘impulse_decay’, ‘linear_decay’, ‘transient_change’. If None (default), all types are generated.
disable_holiday_splash (bool) – If True, holidays will only affect a single day with no splash or bridge effects (default False)
Examples
Basic usage:
>>> from autots.datasets import generate_synthetic_daily_data
>>> gen = generate_synthetic_daily_data(n_days=365, n_series=5)
>>> data = gen.get_data()
>>> labels = gen.get_all_labels()
Tuning to real-world data:
>>> import pandas as pd
>>> real_data = pd.read_csv('real_data.csv', index_col=0, parse_dates=True)
>>> gen = generate_synthetic_daily_data(
...     start_date=real_data.index[0],
...     n_days=len(real_data),
...     n_series=len(real_data.columns),
... )
>>> results = gen.tune_to_data(real_data, n_iterations=20, verbose=True)
>>> gen._generate()  # Regenerate with tuned parameters
>>> tuned_data = gen.get_data()
- SERIES_TYPE_DESCRIPTIONS = {'autocorrelated_noise': 'Autocorrelated Noise (AR)', 'business_day': 'Business Day (weekend NaN)', 'granger_lagged': 'Granger Lagged (7-day lag from Lunar Holidays)', 'lunar_holidays': 'Lunar Holidays', 'multiplicative_seasonality': 'Multiplicative Seasonality (AR noise)', 'no_level_shifts': 'No Level Shifts', 'ramadan_holidays': 'Ramadan Holidays', 'saturating_trend': 'Saturating Trend (logistic)', 'seasonality_changepoints': 'Seasonality Changepoints', 'standard': 'Standard', 'time_varying_seasonality': 'Time-Varying Seasonality', 'variance_regimes': 'Variance Regimes (GARCH)'}¶
- TEMPLATE_VERSION = '1.0'¶
- get_all_labels(series_name=None)¶
Get all labels in a structured format for easy model evaluation.
- Parameters:
series_name (str, optional) – If provided, return labels for specific series only.
- Returns:
Comprehensive dictionary of all labels and metadata.
- Return type:
dict
- get_anomalies(series_name=None)¶
Get anomaly labels: {series_name: [(date, magnitude, type, duration, shared), …]}
- get_components(series_name=None)¶
Get individual components for analysis.
- Parameters:
series_name (str, optional) – If provided, return components for specific series. If None, return all components.
- Returns:
Dictionary of {series_name: {component_name: array}}
- Return type:
dict
- get_data()¶
Get the generated time series data.
- get_holiday_config()¶
Get holiday splash/bridge configuration: {holiday_name: {‘has_splash’: bool, ‘has_bridge’: bool}}
- get_holiday_impacts(series_name=None)¶
Get holiday impact labels (main holiday dates only): {series_name: {date: impact}}
- get_holiday_splash_impacts(series_name=None)¶
Get holiday splash/bridge day impacts: {series_name: {date: impact}}
- get_lagged_influences(series_name=None)¶
Get lagged influence information for Granger-style causal relationships.
- Parameters:
series_name (str, optional) – If provided, return lagged influence info for specific series. If None, return all lagged influences.
- Returns:
Dictionary of {series_name: {‘source’: source_series, ‘lag’: lag_days, ‘coefficient’: coef}} or single dict if series_name is specified
- Return type:
dict
- get_level_shifts(series_name=None)¶
Get level shift labels: {series_name: [(date, magnitude, type, shared), …]}
- get_noise_changepoints(series_name=None)¶
Get noise distribution changepoints: {series_name: [(date, old_params, new_params), …]}
- get_noise_to_signal_ratios()¶
Get noise-to-signal ratios for all series.
- get_regressor_impacts(series_name=None)¶
Get regressor impacts: {series_name: {‘by_date’: {date: {regressor: impact}}, ‘coefficients’: {…}}}
- get_regressors()¶
Get the generated regressors (if any).
- get_seasonality_changepoints(series_name=None)¶
Get seasonality changepoints: {series_name: [(date, description), …]}
- get_series_noise_levels()¶
Get per-series noise levels.
- get_series_scales()¶
Get scale factors for all series.
- get_series_seasonality_strengths()¶
Get per-series seasonality strengths.
- get_series_type_description(series_name)¶
Get human-readable description for a series type.
- Parameters:
series_name (str) – Name of the series
- Returns:
Human-readable description of the series type
- Return type:
str
- get_template(series_name=None, deep=True)¶
Get the JSON-friendly template describing the generated data.
- get_trend_changepoints(series_name=None)¶
Get trend changepoint labels: {series_name: [(date, old_slope, new_slope), …]}
- machine_summary(series_name=None, include_events=True, include_regressors=True, max_events_per_type=25, round_decimals=6, as_json=False)¶
Return a structured summary tailored for LLM or tool consumption.
- plot(series_name=None, figsize=(16, 12), save_path=None, show=True)¶
Plot a series with all its labeled components clearly marked.
- Parameters:
series_name (str, optional) – Name of series to plot. If None, randomly selects one.
figsize (tuple, optional) – Figure size (width, height) in inches. Default (16, 12).
save_path (str, optional) – If provided, saves the plot to this path instead of displaying.
show (bool, optional) – Whether to display the plot. Default True.
- Returns:
fig – The generated figure object
- Return type:
matplotlib.figure.Figure
- Raises:
ImportError – If matplotlib is not installed
- classmethod render_template(template, return_components=False)¶
Render a template into time series using the generator’s renderer.
- summary()¶
Print a summary of the generated data.
- to_csv(filepath, include_regressors=False)¶
Save generated data to CSV.
- Parameters:
filepath (str) – Path to save the CSV file
include_regressors (bool) – Whether to include regressors in the output
- tune_to_data(target_df, n_iterations=20, n_standard_series=None, metric='mse', verbose=True, random_seed=None)¶
Tune generator parameters to match real-world time series data.
This method optimizes the generator’s parameters to minimize the difference between synthetic data and real-world data based on distributional statistics. Special series types are not tuned but will still be generated with optimized base parameters.
TODO: this is a fairly basic implementation and won’t tune many aspects of real-world data
- Parameters:
target_df (pd.DataFrame) – Real-world time series data to match (DatetimeIndex, numeric columns)
n_iterations (int, optional) – Number of optimization iterations (default 20)
n_standard_series (int, optional) – Number of standard series to generate for comparison during tuning. If None, uses min(target_df.shape[1], 5) series.
metric (str, optional) – Distance metric to minimize: ‘mse’, ‘mae’, ‘wasserstein’ (default ‘mse’)
verbose (bool, optional) – Whether to print progress (default True)
random_seed (int, optional) – Random seed for tuning process (default None, uses current random_seed)
- Returns:
Dictionary containing:
- ‘best_params’: Optimized parameter dictionary
- ‘best_score’: Best score achieved
- ‘target_stats’: Statistics from target data
- ‘synthetic_stats’: Statistics from best synthetic data (scaled)
- ‘scale_multiplier’: Factor to multiply synthetic data by to match target magnitude
- Return type:
dict
Notes
Updates self with best parameters found. After calling this method, new data generation will use the tuned parameters.
Important: The synthetic data is generated on a base scale (~50), which may differ from your real-world data scale. The returned ‘scale_multiplier’ should be applied to generated data to match the magnitude of the target data:
>>> gen._generate()  # Regenerate with tuned parameters
>>> scaled_data = gen.data * gen.tuning_results['scale_multiplier']
The scale multiplier matches the mean of absolute means between target and synthetic data, ensuring the overall magnitude is similar.
- Raises:
ImportError – If scipy is not installed (required for optimization)
ValueError – If target_df is invalid
- autots.datasets.generate_synthetic_daily_data(start_date='2015-01-01', n_days=2555, n_series=10, random_seed=42, **kwargs)¶
Quick function to generate synthetic daily data.
- Parameters:
start_date (str) – Start date for the time series
n_days (int) – Number of days to generate
n_series (int) – Number of series to generate
random_seed (int) – Random seed for reproducibility
**kwargs – Additional parameters passed to SyntheticDailyGenerator
- Returns:
generator – Generator object with data and labels
- Return type:
SyntheticDailyGenerator
- autots.datasets.load_artificial(long=False, date_start=None, date_end=None)¶
Load artifically generated series from random distributions.
- Parameters:
long (bool) – if True long style data, if False, wide style data
date_start – str or datetime.datetime of start date
date_end – str or datetime.datetime of end date
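Example
A quick usage sketch:
>>> from autots.datasets import load_artificial
>>> df = load_artificial(long=False, date_start='2020-01-01', date_end='2022-01-01')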
- autots.datasets.load_daily(long: bool = True)¶
Daily sample data.
```
# most of the wiki data was chosen to show holidays or holiday-like patterns
wiki = [
    'United_States', 'Germany', 'List_of_highest-grossing_films', 'Jesus',
    'Michael_Jackson', 'List_of_United_States_cities_by_population',
    'Microsoft_Office', 'Google_Chrome', 'Periodic_table', 'Standard_deviation',
    'Easter', 'Christmas', 'Chinese_New_Year', 'Thanksgiving',
    'List_of_countries_that_have_gained_independence_from_the_United_Kingdom',
    'History_of_the_hamburger', 'Elizabeth_II', 'William_Shakespeare',
    'George_Washington', 'Cleopatra', 'all',
]
df2 = load_live_daily(
    observation_start="2017-01-01",
    weather_years=7,
    trends_list=None,
    gov_domain_list=None,
    wikipedia_pages=wiki,
    fred_series=['DGS10', 'T5YIE', 'SP500', 'DEXUSEU'],
    sleep_seconds=10,
    fred_key="93873d40f10c20fe6f6e75b1ad0aed4d",
    weather_data_types=["WSF2", "PRCP"],
    weather_stations=["USW00014771"],  # looking for intermittent
    tickers=None,
    london_air_stations=None,
    weather_event_types=None,
    earthquake_min_magnitude=None,
)
data_file_name = join("autots", "datasets", 'data', 'holidays.zip')
df2.to_csv(
    data_file_name,
    index=True,
    compression={
        'method': 'zip',
        'archive_name': 'holidays.csv',
        'compresslevel': 9,  # Maximum compression level (0-9)
    },
)
```
Sources: Wikimedia Foundation
- Parameters:
long (bool) – if True, return data in long format; otherwise return wide
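Example
A usage sketch; the exact long-format column names are not specified above:
>>> from autots.datasets import load_daily
>>> wide = load_daily(long=False)  # wide style: datetime index, one column per series
>>> long_df = load_daily(long=True)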
- autots.datasets.load_hourly(long: bool = True)¶
Traffic data from the MN DOT via the UCI data repository.
- autots.datasets.load_linear(long=False, shape=None, start_date: str = '2021-01-01', introduce_nan: float | None = None, introduce_random: float | None = None, random_seed: int = 123)¶
Create a dataset of linear series for testing edge cases.
- Parameters:
long (bool) – whether to make long or wide
shape (tuple) – shape of output dataframe
start_date (str) – first date of index
introduce_nan (float) – percent of rows to make null. 0.2 = 20%
introduce_random (float) – shape of gamma distribution
random_seed (int) – seed for random
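Example
A sketch exercising the NaN injection; introduce_nan=0.2 should null roughly 20% of rows:
>>> from autots.datasets import load_linear
>>> df = load_linear(long=False, shape=(200, 5), introduce_nan=0.2, random_seed=123)
>>> df.isna().mean()  # per-column null share, roughly 0.2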
- autots.datasets.load_live_daily(long: bool = False, observation_start: str | None = None, observation_end: str | None = None, fred_key: str | None = None, fred_series=['DGS10', 'T5YIE', 'SP500', 'DCOILWTICO', 'DEXUSEU', 'WPU0911'], tickers: list = ['MSFT'], trends_list: list = ['forecasting', 'cycling', 'microsoft'], trends_geo: str = 'US', weather_data_types: list = ['AWND', 'WSF2', 'TAVG', 'PRCP'], weather_stations: list = ['USW00094846', 'USW00014925', 'USW00014771'], weather_years: int = 5, noaa_cdo_token: str | None = None, london_air_stations: list = ['CT3', 'SK8'], london_air_species: str = 'PM25', london_air_days: int = 180, earthquake_days: int = 180, earthquake_min_magnitude: int = 5, gsa_key: str | None = None, nasa_api_key: str = 'DEMO_KEY', gov_domain_list=['nasa.gov'], gov_domain_limit: int = 600, wikipedia_pages: list = ['Microsoft_Office', 'List_of_highest-grossing_films'], wiki_language: str = 'en', weather_event_types=['%28Z%29+Winter+Weather', '%28Z%29+Winter+Storm'], caiso_query: str | None = None, eia_key: str | None = None, eia_respondents: list = ['MISO', 'PJM', 'TVA', 'US48'], timeout: float = 300.05, sleep_seconds: int = 10, **kwargs)¶
Generates a dataframe of data up to the present day. Requires an active internet connection. Try to be respectful of these free data sources by not calling them too heavily. Pass None instead of specification lists to exclude a data source.
- Parameters:
long (bool) – whether to return in long format or wide
observation_start (str) – %Y-%m-%d earliest day to retrieve, passed to Fred.get_series and yfinance.history. Note that APIs with more restrictions have their own default lengths (set by the parameters below) and ignore this.
observation_end (str) – %Y-%m-%d most recent day to retrieve
fred_key (str) – https://fred.stlouisfed.org/docs/api/api_key.html
fred_series (list) – list of FRED series IDs. This requires fredapi package
tickers (list) – list of stock tickers, requires yfinance pypi package
trends_list (list) – list of search keywords, requires pytrends pypi package. None to skip.
weather_data_types (list) – from NCEI NOAA api data types, GHCN Daily Weather Elements PRCP, SNOW, TMAX, TMIN, TAVG, AWND, WSF1, WSF2, WSF5, WSFG
weather_stations (list) – from NCEI NOAA api station ids. Pass empty list to skip.
noaa_cdo_token (str) – API token from https://www.ncdc.noaa.gov/cdo-web/token (free, required for weather data)
london_air_stations (list) – londonair.org.uk source station IDs. Pass empty list to skip.
london_air_species (str) – what measurement to pull from London Air. Not all stations have all metrics.
earthquake_min_magnitude (int) – smallest earthquake magnitude to pull from earthquake.usgs.gov. Set None to skip this.
gsa_key (str) – api key from https://open.gsa.gov/api/dap/
nasa_api_key (str) – API key for https://api.nasa.gov/. Set to None to skip NASA DONKI data.
gov_domain_list (list) – list of government-run domains to get traffic data for. Can be very slow, so fewer is better. Some examples: [‘usps.com’, ‘ncbi.nlm.nih.gov’, ‘cdc.gov’, ‘weather.gov’, ‘irs.gov’, “usajobs.gov”, “studentaid.gov”, ‘nasa.gov’, “uk.usembassy.gov”, “tsunami.gov”]
gov_domain_limit (int) – max number of records. Smaller will be faster. Max is currently 10000.
wikipedia_pages (list) – list of Wikipedia pages, html encoded if needed (underscore for space)
weather_event_types (list) – list of html encoded severe weather event types https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Export-Format.pdf
caiso_query (str) – ENE_SLRS or None, can try others but probably won’t work due to other hardcoded params
timeout (float) – used by some queries
sleep_seconds (int) – increasing this may reduce probability of server download failures
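Example
A hedged sketch pulling only two free sources and skipping the rest, following the docstring conventions (None to skip most sources, empty lists for the station-based ones). Requires internet access plus the optional yfinance package for tickers:
>>> from autots.datasets import load_live_daily
>>> df = load_live_daily(
...     long=False,
...     observation_start='2022-01-01',
...     fred_key=None,
...     fred_series=None,
...     tickers=['MSFT'],
...     trends_list=None,
...     weather_stations=[],
...     london_air_stations=[],
...     earthquake_min_magnitude=None,
...     gov_domain_list=None,
...     wikipedia_pages=['Microsoft_Office'],
...     weather_event_types=None,
... )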
- autots.datasets.load_monthly(long: bool = True)¶
Federal Reserve of St. Louis monthly economic indicators.
- autots.datasets.load_sine(long=False, shape=None, start_date: str = '2021-01-01', introduce_random: float | None = None, random_seed: int = 123)¶
Create a dataset of sine waves for testing edge cases.
- autots.datasets.load_weekdays(long: bool = False, categorical: bool = True, periods: int = 180)¶
Test edge cases by creating a Series with values as day of week.
- Parameters:
long (bool) – if True, return a df with columns “value” and “datetime” if False, return a Series with dt index
categorical (bool) – if True, return str/object, else return int
periods (int) – number of periods, ie length of data to generate
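Example
A short sketch of both output shapes:
>>> from autots.datasets import load_weekdays
>>> ser = load_weekdays(long=False, categorical=True, periods=60)  # Series of day names
>>> df = load_weekdays(long=True)  # columns "value" and "datetime"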
- autots.datasets.load_weekly(long: bool = True)¶
Weekly petroleum industry data from the EIA.
- autots.datasets.load_yearly(long: bool = True)¶
Federal Reserve of St. Louis annual economic indicators.
- autots.datasets.load_zeroes(long=False, shape=None, start_date: str = '2021-01-01')¶
Create a dataset of just zeroes for testing edge case.
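Example
A sketch of the zero and sine fixtures; introduce_random in load_sine is assumed to mirror load_linear’s gamma-shape parameter:
>>> from autots.datasets import load_zeroes, load_sine
>>> zeros = load_zeroes(long=False, shape=(100, 3))
>>> sines = load_sine(long=False, shape=(100, 3), introduce_random=1.0)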