We aren’t hard-coding rules today. We are building a brain.
Traditional algorithmic trading relies on if/then logic. Reinforcement Learning (RL) is different. It’s a trial-and-error engine, similar to training a dog: you don’t tell the agent how to trade; you define the environment, give it a balance of virtual cash, reward it when it makes money, and penalize it when it loses.
This protocol uses Stable Baselines3 and the PPO (Proximal Policy Optimization) algorithm to train an agent on EURUSD hourly data. The approach is model-free — the agent learns directly from noisy market data without a predefined model of the market’s dynamics.
Here is the complete technical breakdown to build, train, and deploy this system.
The Architecture
We are splitting the codebase into four distinct modules for modularity and debugging ease.
- indicators.py: Data preprocessing and feature engineering.
- trading_env.py: The custom OpenAI Gym environment (the “world”).
- train_agent.py: The PPO training loop.
- test_agent.py: Out-of-sample evaluation.
Prerequisites:
You will need gym, numpy, pandas, pandas_ta, matplotlib, and stable_baselines3.
Phase 1: Data Engineering (indicators.py)
The agent needs vision. Raw price data isn’t enough; we need to feed it derived technical features. We utilize RSI, SMA, ATR, and a custom Slope metric.
Create indicators.py:
```python
import pandas as pd
import pandas_ta as ta


def load_and_preprocess_data(csv_path: str) -> pd.DataFrame:
    """
    Loads EURUSD data, cleans it, and attaches technical indicators.
    """
    # Load data, parsing the broker's timestamp column as the index
    df = pd.read_csv(csv_path, parse_dates=['Gmt time'], index_col='Gmt time')

    # Cleanup: sort chronologically and drop incomplete rows
    df = df.sort_index()
    if df.isnull().values.any():
        df = df.dropna()

    # Feature engineering via pandas_ta
    df['rsi_14'] = ta.rsi(df['Close'], length=14)
    df['sma_20'] = ta.sma(df['Close'], length=20)
    df['sma_50'] = ta.sma(df['Close'], length=50)
    df['atr'] = ta.atr(df['High'], df['Low'], df['Close'], length=14)

    # Custom feature: SMA slope (difference between consecutive steps)
    df['sma_20_slope'] = df['sma_20'].diff()

    # Drop the NaN rows generated by indicator warm-up periods
    df.dropna(inplace=True)
    return df
```
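The slope feature is nothing exotic: it is the first difference of a rolling mean. A minimal sketch on a few synthetic prices (a 3-period window for brevity; the pipeline above uses length=20 on hourly data):

```python
import pandas as pd

# Synthetic close prices; in the real pipeline this column comes from the CSV
df = pd.DataFrame({'Close': [1.10, 1.11, 1.12, 1.12, 1.13, 1.14]})

# Rolling mean (the SMA), then its step-to-step difference (the slope)
df['sma_3'] = df['Close'].rolling(window=3).mean()
df['sma_3_slope'] = df['sma_3'].diff()

print(df.round(4))
```

The first rows come out NaN because the rolling window and the diff both need warm-up history — exactly why `dropna()` runs after the indicators are attached.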
Phase 2: The Environment (trading_env.py)
This is the core. We must inherit from gym.Env to create a standard interface that Stable Baselines3 can interact with.
The Logic:
- Action Space: The agent can Wait, Buy, or Sell.
- Parameters: It chooses Stop Loss (SL) and Take Profit (TP) distances from a predefined list (60, 90, 120 pips).
- Reward Function: Pure PnL. Positive profit = positive reward. Loss = negative reward.
- Safety Protocol: If a candle touches both SL and TP, we assume the worst-case scenario (Loss) to prevent the model from hallucinating profits.
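With three SL options and three TP options, the formula 1 + 2 × 3 × 3 gives 19 discrete actions. One way to map a flat action index back to a (direction, SL, TP) triple — the exact decoding scheme here is my own choice; any consistent bijection over the grid works:

```python
SL_OPTIONS = [60, 90, 120]   # stop-loss distances, in pips
TP_OPTIONS = [60, 90, 120]   # take-profit distances, in pips

N_ACTIONS = 1 + 2 * len(SL_OPTIONS) * len(TP_OPTIONS)  # 19

def decode_action(action: int):
    """Map a discrete action index to (direction, sl_pips, tp_pips)."""
    if action == 0:
        return None  # no trade
    idx = action - 1
    n_combos = len(SL_OPTIONS) * len(TP_OPTIONS)
    direction = 'long' if idx < n_combos else 'short'  # first half long
    idx %= n_combos
    sl = SL_OPTIONS[idx // len(TP_OPTIONS)]
    tp = TP_OPTIONS[idx % len(TP_OPTIONS)]
    return direction, sl, tp

print(N_ACTIONS)         # 19
print(decode_action(1))  # ('long', 60, 60)
print(decode_action(10)) # ('short', 60, 60)
```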
Create trading_env.py:
```python
import gym
import numpy as np
from gym import spaces


class ForexTradingEnv(gym.Env):
    def __init__(self, df, window_size=30, sl_options=(60, 90, 120), tp_options=(60, 90, 120)):
        super().__init__()
        self.df = df
        self.window_size = window_size
        self.sl_options = list(sl_options)
        self.tp_options = list(tp_options)
        self.n_steps = len(self.df)

        # Action space:
        #   0 = no trade
        #   1.. = Buy/Sell decisions combined with SL/TP permutations
        #   Size: 1 + 2 * len(sl) * len(tp)
        n_actions = 1 + 2 * len(self.sl_options) * len(self.tp_options)
        self.action_space = spaces.Discrete(n_actions)

        # Observation space: window size x number of features
        self.obs_shape = (window_size, self.df.shape[1])
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=self.obs_shape, dtype=np.float32
        )

        # Initialization
        self.current_step = window_size
        self.equity = 10000.0
        self.equity_curve = []

    def reset(self):
        self.current_step = self.window_size
        self.equity = 10000.0
        self.equity_curve = []
        return self._get_observation()

    def _get_observation(self):
        # The window of rows ending at (but excluding) current_step
        return self.df.iloc[
            self.current_step - self.window_size : self.current_step
        ].values.astype(np.float32)

    def _decode_action(self, action):
        # Actions 1..N enumerate the (direction, SL, TP) grid:
        # the first half of the grid is long, the second half short.
        idx = action - 1
        n_combos = len(self.sl_options) * len(self.tp_options)
        direction = 1 if idx < n_combos else -1  # 1 = long, -1 = short
        idx %= n_combos
        sl_pips = self.sl_options[idx // len(self.tp_options)]
        tp_pips = self.tp_options[idx % len(self.tp_options)]
        return direction, sl_pips, tp_pips

    def step(self, action):
        # Action 0 (wait) earns nothing; use a small negative to force action
        reward = 0.0

        if action != 0:
            direction, sl_pips, tp_pips = self._decode_action(action)
            pip = 0.0001  # EURUSD pip size
            entry = self.df['Close'].iloc[self.current_step]
            sl_price = entry - direction * sl_pips * pip
            tp_price = entry + direction * tp_pips * pip

            # Scan future candles until SL or TP is hit; reward is PnL in pips
            for i in range(self.current_step + 1, self.n_steps):
                high = self.df['High'].iloc[i]
                low = self.df['Low'].iloc[i]
                hit_sl = low <= sl_price if direction == 1 else high >= sl_price
                hit_tp = high >= tp_price if direction == 1 else low <= tp_price
                # CRITICAL: SL is checked first, so a candle that touches
                # both SL and TP resolves as the worst case (a loss).
                if hit_sl:
                    reward = -float(sl_pips)
                    break
                if hit_tp:
                    reward = float(tp_pips)
                    break

        # Update state
        self.equity += reward
        self.equity_curve.append(self.equity)
        self.current_step += 1
        done = self.current_step >= self.n_steps - 1
        obs = self._get_observation()
        info = {'equity': self.equity}
        return obs, reward, done, info
```
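The safety protocol deserves isolating, because it is the part most backtests get wrong. A standalone sketch (the function name and signature are mine, not part of the article's modules): given one candle's high/low and the trade's SL/TP prices, resolve the trade, counting a candle that touches both levels as a loss.

```python
def resolve_candle(direction, entry, sl_price, tp_price, high, low):
    """Return realized pips for one candle, or None if neither level was hit.

    Worst-case rule: a candle that touches both SL and TP counts as a loss.
    """
    pip = 0.0001  # EURUSD pip size
    if direction == 'long':
        hit_sl = low <= sl_price
        hit_tp = high >= tp_price
    else:
        hit_sl = high >= sl_price
        hit_tp = low <= tp_price
    if hit_sl:          # checked first: both hit -> loss
        return -abs(entry - sl_price) / pip
    if hit_tp:
        return abs(tp_price - entry) / pip
    return None

# Long from 1.1000, SL 60 pips below, TP 60 pips above.
# A wide candle touches both levels -> worst case: -60 pips.
print(round(resolve_candle('long', 1.1000, 1.0940, 1.1060,
                           high=1.1070, low=1.0930), 1))  # -60.0
```

Without this pessimistic tie-break, hourly data cannot tell you whether price hit the stop or the target first inside the candle, and the agent learns to exploit that ambiguity.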
Phase 3: Training (train_agent.py)
We use the PPO algorithm. It is robust, handles noise well, and strikes a balance between exploration (trying new things) and exploitation (using what works).
Critical Nuance:
Overfitting is the enemy. In the video, training for 50,000 steps resulted in a “money printer” equity curve on training data but failure on test data. Reducing training to 10,000 steps produced a less perfect training curve but significantly better generalization on unseen data. Less is often more.
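The discipline behind that comparison is a plain walk-forward split: fit on the older slice, judge only on the newer one. A sketch with a synthetic hourly index (the cutoff mirrors the article's 2020-2023 / 2023-2025 split; the data itself is dummy):

```python
import numpy as np
import pandas as pd

# Synthetic hourly index spanning the whole period
idx = pd.date_range('2020-01-01', '2024-12-31', freq='h')
df = pd.DataFrame({'Close': np.linspace(1.05, 1.15, len(idx))}, index=idx)

# Walk-forward split: train strictly before the cutoff, test from it onward
df_train = df.loc[:'2022-12-31']
df_test = df.loc['2023-01-01':]

print(df_train.index.max() < df_test.index.min())  # True
```

In practice you would export the two slices to the separate CSVs that train_agent.py and test_agent.py load.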
Create train_agent.py:
```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from indicators import load_and_preprocess_data
from trading_env import ForexTradingEnv


def main():
    # 1. Load training data (2020-2023)
    df = load_and_preprocess_data('data/EURUSD_2020_2023.csv')

    # 2. Initialize environment
    # DummyVecEnv is the vectorized wrapper Stable Baselines3 expects
    env = DummyVecEnv([lambda: ForexTradingEnv(df, window_size=30)])

    # 3. Define model (PPO)
    model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./tensorboard_log")

    # 4. Train
    # CAUTION: high timestep counts lead to overfitting.
    # Recommended starting range: 10,000 - 20,000
    model.learn(total_timesteps=10000)

    # 5. Save
    model.save("model_eurusd_ppo")
    print("Agent trained and saved.")


if __name__ == "__main__":
    main()
```
Phase 4: Execution & Evaluation (test_agent.py)
Never trust a backtest on training data. We load the saved model and run it on unseen data (2023-2025). This is the only metric that matters.
Create test_agent.py:
```python
import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from indicators import load_and_preprocess_data
from trading_env import ForexTradingEnv


def main():
    # 1. Load testing data (unseen 2023-2025)
    df_test = load_and_preprocess_data('data/EURUSD_2023_2025.csv')

    # 2. Re-create the environment around the test data
    env = DummyVecEnv([lambda: ForexTradingEnv(df_test, window_size=30)])

    # 3. Load the trained model
    model = PPO.load("model_eurusd_ppo")

    # 4. Run loop
    # Note: DummyVecEnv auto-resets the env when the episode ends, which
    # wipes the env's internal equity_curve list - so we record equity
    # from the info dict at every step instead.
    obs = env.reset()
    done = [False]
    equity_curve = []
    while not done[0]:
        # deterministic=True removes exploration randomness for testing
        action, _states = model.predict(obs, deterministic=True)
        obs, rewards, done, info = env.step(action)
        equity_curve.append(info[0]['equity'])

    # 5. Plot the equity curve
    plt.figure(figsize=(12, 6))
    plt.plot(equity_curve, label="Equity (Test Data)")
    plt.title("Reinforcement Learning Agent Performance")
    plt.xlabel("Time Steps")
    plt.ylabel("Account Balance")
    plt.legend()
    plt.show()


if __name__ == "__main__":
    main()
```
Observations & Strategy
When you run this, you will notice something fascinating.
- The Overfit Trap: At 50k steps, the agent memorizes the specific noise of 2020-2023. It fails to adapt to the regime changes in 2024.
- The Sweet Spot: At 10k steps, the agent learns broader concepts—momentum and mean reversion—rather than specific price points.
This implies that for financial RL, generalization requires constraining the model’s capacity to memorize.
This is not a “set and forget” money printer. It is a framework. To make this production-ready, you must inject volatility features into the observation space and likely switch to RecurrentPPO (the LSTM-based variant from sb3-contrib) to give the agent “memory” of sequences beyond the immediate window.