Build a Self-Learning AI Trading Agent with Python & PPO [Full Code]

We aren’t hard-coding rules today. We are building a brain.

Traditional algorithmic trading relies on if/then logic. Reinforcement Learning (RL) is different. It’s a trial-and-error engine, similar to training a dog: you don’t tell the agent how to trade; you define the environment, give it a balance of virtual cash, reward it when it profits, and punish it when it loses money.

This walkthrough uses Stable Baselines3 and the PPO (Proximal Policy Optimization) algorithm to train an agent on EURUSD hourly data. The approach is model-free: the agent learns directly from noisy market data without a predefined map of how the market behaves.

Here is the complete technical breakdown to build, train, and deploy this system.


The Architecture

We are splitting the codebase into four distinct modules for modularity and debugging ease.

  1. indicators.py: Data preprocessing and feature engineering.
  2. trading_env.py: The custom OpenAI Gym environment (the “world”).
  3. train_agent.py: The PPO training loop.
  4. test_agent.py: Out-of-sample evaluation.

Prerequisites:
You will need gym, numpy, pandas, pandas_ta, stable_baselines3, and matplotlib (for plotting results). Install them with:

pip install gym numpy pandas pandas-ta stable-baselines3 matplotlib


Phase 1: Data Engineering (indicators.py)

The agent needs vision. Raw price data isn’t enough; we need to feed it derived technical features. We utilize RSI, SMA, ATR, and a custom Slope metric.

Create indicators.py:

import pandas as pd
import pandas_ta as ta

def load_and_preprocess_data(csv_path: str) -> pd.DataFrame:
    """
    Loads EURUSD data, cleans it, and attaches technical indicators.
    """
    # Load Data
    df = pd.read_csv(csv_path, parse_dates=['Gmt time'], index_col='Gmt time')
    
    # Cleanup: chronological order, drop incomplete rows
    df = df.sort_index()
    df = df.dropna()

    # Feature Engineering via pandas_ta
    df['rsi_14'] = ta.rsi(df['Close'], length=14)
    df['sma_20'] = ta.sma(df['Close'], length=20)
    df['sma_50'] = ta.sma(df['Close'], length=50)
    df['atr'] = ta.atr(df['High'], df['Low'], df['Close'], length=14)

    # Custom Feature: SMA Slope (Difference between steps)
    df['sma_20_slope'] = df['sma_20'].diff()

    # Drop NaN values generated by indicators
    df.dropna(inplace=True)

    return df
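A quick way to sanity-check the feature logic without installing pandas_ta: the SMA and slope columns reduce to plain pandas operations. A dependency-free sketch with synthetic prices (the values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic close prices standing in for EURUSD hourly data.
rng = np.random.default_rng(0)
close = pd.Series(1.10 + rng.normal(0, 0.0005, 100).cumsum())

# pandas_ta's ta.sma is a rolling mean; the slope feature is its
# bar-to-bar difference.
sma_20 = close.rolling(20).mean()
sma_20_slope = sma_20.diff()

# The indicators create NaNs during their warm-up window, which is
# why load_and_preprocess_data() ends with dropna().
print(sma_20.isna().sum())        # 19
print(sma_20_slope.isna().sum())  # 20
```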

Phase 2: The Environment (trading_env.py)

This is the core. We must inherit from gym.Env to create a standard interface that Stable Baselines3 can interact with.

The Logic:

  • Action Space: The agent can Wait, Buy, or Sell.
  • Parameters: It chooses Stop Loss (SL) and Take Profit (TP) distances from a predefined list (60, 90, 120 pips).
  • Reward Function: Pure PnL. Positive profit = positive reward. Loss = negative reward.
  • Safety Protocol: If a candle touches both SL and TP, we assume the worst-case scenario (Loss) to prevent the model from hallucinating profits.
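Before writing the environment, it helps to enumerate that action space by hand. A quick sketch with the default 60/90/120 pip options — the exact index ordering here is an assumption; any consistent mapping works, as long as step() decodes it the same way:

```python
# Enumerate the discrete action space. Action 0 is "no trade";
# actions 1..18 encode (direction, SL, TP) combinations.
sl_options = [60, 90, 120]
tp_options = [60, 90, 120]

actions = [(0, 'NO_TRADE', None, None)]
idx = 1
for direction in ('LONG', 'SHORT'):
    for sl in sl_options:
        for tp in tp_options:
            actions.append((idx, direction, sl, tp))
            idx += 1

# Matches the environment's formula: 1 + 2 * len(sl) * len(tp)
n_actions = 1 + 2 * len(sl_options) * len(tp_options)
print(n_actions)   # 19
print(actions[1])  # (1, 'LONG', 60, 60)
```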

Create trading_env.py:

import gym
import numpy as np
from gym import spaces

class ForexTradingEnv(gym.Env):
    def __init__(self, df, window_size=30, sl_options=[60, 90, 120], tp_options=[60, 90, 120]):
        super(ForexTradingEnv, self).__init__()
        
        self.df = df
        self.window_size = window_size
        self.sl_options = sl_options
        self.tp_options = tp_options
        self.n_steps = len(self.df)
        
        # Action Space: 
        # 0 = No Trade
        # 1 = Buy/Sell decisions combined with SL/TP permutations
        # Formula: 1 + 2 * len(sl) * len(tp)
        n_actions = 1 + 2 * len(sl_options) * len(tp_options)
        self.action_space = spaces.Discrete(n_actions)
        
        # Observation Space: Window size * Number of features
        self.shape = (window_size, self.df.shape[1])
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=self.shape, dtype=np.float32)
        
        # Initialization
        self.current_step = window_size
        self.equity = 10000.0
        self.positions = []
        self.equity_curve = []

    def reset(self):
        self.current_step = self.window_size
        self.equity = 10000.0
        self.equity_curve = []
        return self._get_observation()

    def _get_observation(self):
        # Returns the window of data ending at current_step
        return self.df.iloc[self.current_step - self.window_size : self.current_step].values.astype(np.float32)

    def _decode_action(self, action):
        # Actions 1..N map to (direction, SL, TP) combinations: the first
        # half of the indices are longs, the second half shorts.
        idx = action - 1
        n_combos = len(self.sl_options) * len(self.tp_options)
        direction = 1 if idx < n_combos else -1  # 1 = long, -1 = short
        idx = idx % n_combos
        sl_pips = self.sl_options[idx // len(self.tp_options)]
        tp_pips = self.tp_options[idx % len(self.tp_options)]
        return direction, sl_pips, tp_pips

    def step(self, action):
        done = False
        reward = 0.0
        
        # Action 0: Do Nothing / Hold
        if action == 0:
            reward = 0.0  # No penalty for waiting; use a small negative to force action
        
        else:
            direction, sl_pips, tp_pips = self._decode_action(action)
            pip = 0.0001  # EURUSD pip size
            entry = self.df['Close'].iloc[self.current_step]
            sl_price = entry - direction * sl_pips * pip
            tp_price = entry + direction * tp_pips * pip
            
            # Walk forward candle by candle until SL or TP is hit
            # (linear scan; acceptable for hourly data)
            for i in range(self.current_step + 1, self.n_steps):
                high = self.df['High'].iloc[i]
                low = self.df['Low'].iloc[i]
                hit_sl = low <= sl_price if direction == 1 else high >= sl_price
                hit_tp = high >= tp_price if direction == 1 else low <= tp_price
                # CRITICAL: if both SL and TP are touched in the same candle,
                # assume the worst case and book the loss.
                if hit_sl:
                    reward = -float(sl_pips)  # PnL in pips: the feedback loop
                    break
                if hit_tp:
                    reward = float(tp_pips)
                    break

        # Update State
        self.equity += reward
        self.equity_curve.append(self.equity)
        self.current_step += 1
        
        if self.current_step >= self.n_steps - 1:
            done = True
            
        obs = self._get_observation()
        info = {'equity': self.equity}
        
        return obs, reward, done, info

Phase 3: Training (train_agent.py)

We use the PPO algorithm. It is robust, handles noise well, and strikes a balance between exploration (trying new things) and exploitation (using what works).

Critical Nuance:
Overfitting is the enemy. In the video, training for 50,000 steps resulted in a “money printer” equity curve on training data but failure on test data. Reducing training to 10,000 steps produced a less perfect training curve but significantly better generalization on unseen data. Less is often more.
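The other guard against overfitting is the split itself: train and test data must be separated in time, never shuffled. A minimal illustration with dummy data (the dates mirror the article’s 2020–2023 / 2023–2025 split):

```python
import pandas as pd

# Dummy daily data standing in for EURUSD candles.
dates = pd.date_range('2020-01-01', '2024-12-31', freq='D')
df = pd.DataFrame({'Close': range(len(dates))}, index=dates)

# Split strictly by time: everything before the cutoff trains,
# everything from the cutoff onward tests. No shuffling.
cutoff = pd.Timestamp('2023-01-01')
train = df[df.index < cutoff]
test = df[df.index >= cutoff]

print(train.index.max().date())  # 2022-12-31
print(test.index.min().date())   # 2023-01-01
```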

Create train_agent.py:

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from indicators import load_and_preprocess_data
from trading_env import ForexTradingEnv

def main():
    # 1. Load Training Data (2020-2023)
    df = load_and_preprocess_data('data/EURUSD_2020_2023.csv')
    
    # 2. Initialize Environment
    # Stable Baselines3 expects a vectorized environment, so we wrap
    # our single env in DummyVecEnv
    env = DummyVecEnv([lambda: ForexTradingEnv(df, window_size=30)])
    
    # 3. Define Model (PPO)
    model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./tensorboard_log")
    
    # 4. Train
    # CAUTION: High timesteps = Overfitting. 
    # Recommended start: 10,000 - 20,000
    model.learn(total_timesteps=10000)
    
    # 5. Save
    model.save("model_eurusd_ppo")
    print("Agent trained and saved.")

if __name__ == "__main__":
    main()
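Since the model logs to ./tensorboard_log, you can watch reward and loss curves during or after training:

```shell
# Launch TensorBoard against the log directory defined in train_agent.py,
# then open the printed localhost URL in a browser
tensorboard --logdir ./tensorboard_log
```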

Phase 4: Execution & Evaluation (test_agent.py)

Never trust a backtest on training data. We load the saved model and run it on unseen data (2023-2025). This is the only metric that matters.

Create test_agent.py:

import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from indicators import load_and_preprocess_data
from trading_env import ForexTradingEnv

def main():
    # 1. Load Testing Data (Unseen 2023-2025)
    df_test = load_and_preprocess_data('data/EURUSD_2023_2025.csv')
    
    # 2. Re-create Environment
    env = DummyVecEnv([lambda: ForexTradingEnv(df_test, window_size=30)])
    
    # 3. Load Model
    model = PPO.load("model_eurusd_ppo")
    
    # 4. Run Loop
    # Note: DummyVecEnv auto-resets the inner environment when the episode
    # ends, which wipes its equity_curve attribute. So we collect the
    # equity from the info dict at each step instead.
    obs = env.reset()
    done = False
    equity_curve = []
    
    while not done:
        # deterministic=True removes exploration randomness for testing
        action, _states = model.predict(obs, deterministic=True)
        obs, rewards, dones, infos = env.step(action)
        done = dones[0]
        equity_curve.append(infos[0]['equity'])
    
    # 5. Plot Equity Curve
    
    plt.figure(figsize=(12,6))
    plt.plot(equity_curve, label="Equity (Test Data)")
    plt.title("Reinforcement Learning Agent Performance")
    plt.xlabel("Time Steps")
    plt.ylabel("Account Balance")
    plt.legend()
    plt.show()

if __name__ == "__main__":
    main()
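The script plots the curve; to compare runs numerically, it also helps to summarize it. A small helper computing final PnL and maximum drawdown (the sample equity values are made up):

```python
import numpy as np

def summarize_equity(equity_curve):
    """Final PnL and maximum drawdown of an equity curve."""
    eq = np.asarray(equity_curve, dtype=float)
    pnl = float(eq[-1] - eq[0])
    running_peak = np.maximum.accumulate(eq)   # best balance so far
    max_drawdown = float(np.max(running_peak - eq))
    return pnl, max_drawdown

# Made-up curve: rises to 10300, dips to 10100, recovers to 10250.
pnl, dd = summarize_equity([10000, 10300, 10100, 10250])
print(pnl)  # 250.0
print(dd)   # 200.0
```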

Observations & Strategy

When you run this, you will notice something fascinating.

  • The Overfit Trap: At 50k steps, the agent memorizes the specific noise of 2020-2023. It fails to adapt to the regime changes in 2024.
  • The Sweet Spot: At 10k steps, the agent learns broader concepts—momentum and mean reversion—rather than specific price points.

This implies that for financial RL, generalization requires constraining the model’s capacity to memorize.

This is not a “set and forget” money printer. It is a framework. To make this production-ready, you must inject volatility indices into the observation space and likely switch to RecurrentPPO (the LSTM-based variant in sb3-contrib) to give the agent “memory” of past sequences beyond the immediate window.
