AI Training · NVIDIA NitroGen · Behavioral Cloning · Classic Arcade

Teaching AI to Play Joust: Fine-tuning NVIDIA's NitroGen Model (Part 1)

Setting up the system for recording human gameplay and preparing to train a vision-to-action transformer

By Garry Osborne · December 27, 2024 · 18 min read

Introduction

What if you could teach an AI to play classic arcade games just by showing it gameplay videos? That's exactly what NVIDIA's NitroGen model promises: a vision-to-action transformer trained on 40,000 hours of gameplay across 1,000+ different games. When I saw NitroGen drop, it reminded me of OpenAI Gym from several years back, where you could train an AI model to learn to play Atari 2600 games. So I thought: why don't I see how well NitroGen plays one of my favorite games, Joust, the game showcased in Ernest Cline's book "Ready Player One"? Well, it turns out that NitroGen was not trained to play Joust!

In this two-part series, I document my journey fine-tuning NitroGen to master Joust, the classic 1982 arcade game. Part 1 covers the setup, data collection strategy, and getting human gameplay recordings working. Part 2 will cover training the model and evaluating its performance.

Why Joust?

Joust is an excellent test case for several reasons:

  1. Simple but challenging: Horizontal movement + one button (flap), but requires good timing
  2. Clear objectives: Stay alive, defeat enemies, collect eggs
  3. Limited action space: Perfect for behavioral cloning with limited training data
  4. Nostalgic: Classic arcade game from the golden age

Unlike modern games with complex controls, Joust's simplicity means we can achieve meaningful results without massive computational resources.

The Vision

Fine-tune NVIDIA's NitroGen, a pre-trained 493M-parameter vision-to-action transformer, to play Joust competently by:

  • Using MAME to run an emulation of Joust (I have purchased the ROM)
  • Recording 5-10 hours of human gameplay demonstrations using MAME
  • Training through behavioral cloning (imitation learning)
  • Training on cloud infrastructure (RunPod H100 80GB)*

*Originally planned for RTX 3050 (4GB VRAM), but switched to cloud training for faster iteration and better results at comparable cost.

No reinforcement learning, no reward engineering, just "watch and learn" from human gameplay.
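To make "watch and learn" concrete: behavioral cloning is plain supervised learning on (frame, action) pairs. Here is a minimal sketch of one training step, assuming a generic PyTorch policy with a 21-dimensional action head; the actual NitroGen fine-tuning loop (covered in Part 2) may use a different loss and output format.

import torch
import torch.nn.functional as F

def behavioral_cloning_step(model, optimizer, frames, actions):
    """One supervised update: predict the recorded 21D gamepad action from the frame.
    `model` is a placeholder for the fine-tuned policy; the real loss (e.g. per-button
    classification vs. regression) depends on its output head.
    """
    optimizer.zero_grad()
    pred = model(frames)              # (batch, 21) predicted gamepad vector
    loss = F.mse_loss(pred, actions)  # simple regression loss as an illustration
    loss.backward()
    optimizer.step()
    return loss.item()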

Project Setup

Hardware Configuration

Development & Data Collection Setup:

  • GPU: NVIDIA RTX 3050 (4GB VRAM) - for model inference testing
  • RAM: 32GB
  • Storage: Local SSD + NAS for large datasets
  • OS: Windows 11

Training Platform:

  • Service: RunPod cloud GPU rental
  • GPU: H100 80GB PCIe (rented on-demand)
  • Pricing: ~$1.49/hr spot pricing
  • Duration: 12-15 hours estimated for full training run

Originally, I planned to train locally on the RTX 3050, but cloud rental proved more practical: comparable cost (~$15-22 total) with dramatically faster results (12-15 hours vs 6-9 days). This demonstrates how cloud infrastructure democratizes access to professional-grade GPUs—you don't need to own datacenter hardware to use it.

Software Stack

Core Components:

  • MAME: The Multiple Arcade Machine Emulator, used to run Joust
  • NitroGen: NVIDIA's vision-to-action transformer (493M parameters)
  • PyTorch: Deep learning framework with CUDA support
  • Python 3.11: Via Pinokio environment (includes all ML dependencies)

Key Libraries:

  • opencv-python: Video processing and screen capture
  • pygame: Gamepad input handling
  • pywin32: Windows API access for keyboard input
  • zmq: Client-server architecture for model inference
  • mss/dxcam: Screen capture (software fallback + hardware acceleration)
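Before recording anything, a quick sanity check that the stack imports cleanly and PyTorch sees the GPU saves headaches later. This is a throwaway snippet, not one of the project's scripts.

import cv2
import mss      # imported only to confirm it is installed
import pygame
import torch
import zmq

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("OpenCV:", cv2.__version__, "| pygame:", pygame.version.ver, "| pyzmq:", zmq.__version__)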

The Data Challenge

My initial plan was to leverage YouTube videos of Joust gameplay with gamepad overlays visible on screen. I could extract the controller inputs from the overlay and sync them with gameplay frames. It turns out there are maybe three videos of Joust gameplay with a visible gamepad overlay, and one of those was Joust on a Game Boy, played in low-resolution grayscale and not useful. So I had to dust off my arcade Joust skills and set up a way to capture my own gameplay. This may take some time. I really suck at Joust.

Reality check: Out of 206 videos (68.5 hours total), only 3 videos had usable gamepad overlays (~1.5% hit rate).

This forced a pivot: record my own gameplay.

Building the Recording System

Challenge 1: Physical Gamepad or Keyboard?

I don't own a gamepad. Solution? Map keyboard controls to gamepad data:

# Windows virtual-key codes for system-wide keyboard state
VK_LEFT = 0x25      # Left arrow
VK_RIGHT = 0x27     # Right arrow
VK_LCONTROL = 0xA2  # Left Ctrl (FLAP button)

def read_gamepad_state():
    # Map keyboard state to the left analog stick's X axis
    # (is_key_pressed() reads the Windows API; see Challenge 3 below)
    left_x = 0.0
    if is_key_pressed(VK_LEFT):
        left_x = -1.0  # Stick pushed left
    elif is_key_pressed(VK_RIGHT):
        left_x = +1.0  # Stick pushed right

    return GamepadState(
        j_left=(left_x, 0.0),  # Left analog stick (x, y)
        j_right=(0.0, 0.0),    # Right stick unused
        buttons=[1.0 if is_key_pressed(VK_LCONTROL) else 0.0, ...]
    )

Key insight: NitroGen expects standard gamepad data (21D action space: 2 joysticks × 2 axes + 17 buttons). The j_left field isn't "move left"—it's the left analog stick position where negative X = left direction and positive X = right direction.
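For illustration, here is a minimal sketch of a container like the GamepadState used above and how it flattens into that 21D vector. This is my own illustrative class; the project's actual class and field layout may differ.

from dataclasses import dataclass, field

import numpy as np

@dataclass
class GamepadState:
    """Illustrative container for one frame of controller input."""
    j_left: tuple = (0.0, 0.0)     # left stick (x, y), each axis in [-1, 1]
    j_right: tuple = (0.0, 0.0)    # right stick (x, y)
    buttons: list = field(default_factory=lambda: [0.0] * 17)  # 17 button channels

    def to_vector(self) -> np.ndarray:
        # 2 sticks x 2 axes + 17 buttons = 21D action vector
        return np.array([*self.j_left, *self.j_right, *self.buttons], dtype=np.float32)

For Joust, only the left stick's X axis and one button channel ever change; the other 19 dimensions stay at zero.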

Challenge 2: Screen Capture from MAME

Capturing gameplay frames proved surprisingly tricky:

Attempt 1: Use dxcam for hardware-accelerated DirectX capture

  • Failed on my hybrid graphics system (Intel integrated + NVIDIA discrete)
  • Error: "device interface or feature level not supported"

Solution: Automatic fallback to mss (software capture)

  • Works reliably across all systems
  • Still achieves 60 FPS capture (adequate for training data)

Critical requirement: MAME must run in windowed mode (the -window flag). Exclusive fullscreen uses hardware overlays that can't be captured; forget this and you'll record nothing but black frames.
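Here is a minimal sketch of the dxcam-to-mss fallback, assuming the MAME window's screen region is already known. The grab_frame name and region handling are mine, not the project's.

import numpy as np
import mss

try:
    import dxcam
    _camera = dxcam.create(output_color="BGR")   # hardware-accelerated DirectX path
except Exception:
    _camera = None  # hybrid Intel/NVIDIA systems can fail here; fall back to mss

def grab_frame(region):
    """Capture one frame of the MAME window. region = (left, top, right, bottom)."""
    if _camera is not None:
        frame = _camera.grab(region=region)
        if frame is not None:      # dxcam returns None when nothing has changed
            return frame
    # Software fallback: slower, but works everywhere and still manages ~60 FPS
    left, top, right, bottom = region
    with mss.mss() as sct:
        shot = sct.grab({"left": left, "top": top,
                         "width": right - left, "height": bottom - top})
    return np.array(shot)[:, :, :3]  # drop the alpha channel (BGRA -> BGR)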

Challenge 3: Keyboard Input When MAME Has Focus

This was the biggest technical hurdle. Multiple failed attempts:

Attempt 1: Use pygame.event.get() to read keyboard

  • Problem: Consumed events before other parts could see them
  • Hotkeys stopped working

Attempt 2: Use pygame.key.get_pressed()

  • Problem: Only sees input when pygame window has focus
  • MAME has focus during gameplay → all zeros recorded

Final Solution: Windows API GetAsyncKeyState()

import ctypes

def is_key_pressed(vk_code):
    return ctypes.windll.user32.GetAsyncKeyState(vk_code) & 0x8000 != 0

This reads system-wide keyboard state regardless of window focus. Works perfectly even when MAME is in foreground.

Lesson learned: Don't assume high-level libraries will work for background input monitoring. Sometimes you need to drop down to OS APIs.

The Recording Workflow

The final recording system works beautifully:

# 1. Start MAME in windowed mode
d:\mame-exe\mame.exe joust -window

# 2. Start recording with keyboard controls
scripts\record_joust.bat --keyboard

# 3. Play for 10-30 minutes (I usually last 5 - 10 minutes tops)
# 4. Press Ctrl+C to save

What gets recorded:

  • 256×256 PNG frames at 60 FPS
  • Synchronized gamepad state (joystick + buttons) in JSONL format
  • Metadata (duration, actual FPS achieved, input device)

Example action data:

{
  "frame_id": 1536,
  "timestamp": 148.58,
  "j_left": [1.0, 0.0],
  "j_right": [0.0, 0.0],
  "buttons": [1.0, 0.0, 0.0, ...],
  "observation_path": "observations/001536.png"
}

This shows: moving right (j_left X = 1.0) while flapping (button 0 = 1.0).
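Putting the pieces together, the core recording loop can be sketched roughly as below. It reuses the hypothetical grab_frame() and read_gamepad_state() helpers from the earlier sketches; names and pacing logic are illustrative, not the project's actual script.

import json
import os
import time

import cv2  # opencv-python

TARGET_FPS = 60
FRAME_SIZE = (256, 256)

def record_session(out_dir, region, stop_flag):
    """Grab a frame, read input, write one PNG and one JSONL line per frame.
    stop_flag is any callable returning True when recording should end; the real
    recorder also writes metadata.json and handles Ctrl+C cleanly.
    """
    os.makedirs(os.path.join(out_dir, "observations"), exist_ok=True)
    start = time.perf_counter()
    with open(os.path.join(out_dir, "actions.jsonl"), "w") as actions:
        frame_id = 0
        while not stop_flag():
            frame = grab_frame(region)        # Challenge 2: dxcam/mss capture
            state = read_gamepad_state()      # Challenge 1: keyboard -> gamepad
            rel_path = f"observations/{frame_id:06d}.png"
            cv2.imwrite(os.path.join(out_dir, rel_path),
                        cv2.resize(frame, FRAME_SIZE))
            actions.write(json.dumps({
                "frame_id": frame_id,
                "timestamp": round(time.perf_counter() - start, 2),
                "j_left": list(state.j_left),
                "j_right": list(state.j_right),
                "buttons": list(state.buttons),
                "observation_path": rel_path,
            }) + "\n")
            frame_id += 1
            # Simple pacing toward 60 FPS; drift compensation omitted for brevity
            next_tick = start + frame_id / TARGET_FPS
            time.sleep(max(0.0, next_tick - time.perf_counter()))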

Data Format Design

Each recording session produces:

human_demos/joust_20241227_143052/
├── observations/
│   ├── 000000.png
│   ├── 000001.png
│   └── ... (60 frames per second)
├── actions.jsonl      # Frame-by-frame input data
└── metadata.json      # Session stats

This format is designed to convert cleanly to NitroGen's expected Parquet training format with frame-action pairs.
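As a rough sketch of that conversion (column names are my guesses, not NitroGen's actual schema), something like the following with pandas and pyarrow would do the flattening:

import json
from pathlib import Path

import pandas as pd  # to_parquet() requires pyarrow (pip install pyarrow)

def session_to_parquet(session_dir: str, out_path: str):
    """Flatten one recording session's actions.jsonl into a Parquet table.
    Column names are illustrative; match them to NitroGen's expected schema.
    """
    session = Path(session_dir)
    rows = []
    with open(session / "actions.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            rows.append({
                "frame_id": rec["frame_id"],
                "timestamp": rec["timestamp"],
                "image_path": str(session / rec["observation_path"]),
                # 21D action: left stick (x, y) + right stick (x, y) + 17 buttons
                "action": rec["j_left"] + rec["j_right"] + rec["buttons"],
            })
    pd.DataFrame(rows).to_parquet(out_path, index=False)

# Example: session_to_parquet("human_demos/joust_20241227_143052", "train/joust_000.parquet")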

Technical Insights

1. Understanding Gamepad Data Models

Coming from a keyboard/mouse background, gamepad data structure wasn't intuitive:

  • Physical gamepad: Has left stick, right stick, and buttons
  • Data representation: j_left = left stick (x, y), j_right = right stick (x, y)
  • Joust usage: Only left stick X-axis (horizontal) + button A (flap)

The naming can be confusing: j_left sounds like "move left" but actually means "left joystick position," and the X value determines direction. Think of a two-player arcade cabinet: "left" and "right" identify which stick, not which way to move.

2. Screen Capture Gotchas

  • Exclusive fullscreen = hardware overlays = uncapturable
  • Windowed mode = standard rendering = works with mss/dxcam
  • Always test capture during actual gameplay, not just menus
  • First frame black? You're probably in fullscreen mode

3. Input Reading Complexity

Reading input across applications is harder than it seems:

| Method | Works When MAME Has Focus? | System-Wide? |
| --- | --- | --- |
| pygame.event.get() | No | No |
| pygame.key.get_pressed() | No | No |
| keyboard library | Yes, but blocks pygame | Yes |
| GetAsyncKeyState() | Yes | Yes |

For background monitoring, OS-level APIs are the only reliable solution.

Current Status

After solving these technical challenges, the recording system is production-ready:

  • ✅ Reliable 60 FPS screen capture from MAME
  • ✅ Accurate keyboard-to-gamepad mapping
  • ✅ System-wide input reading (works with MAME in focus)
  • ✅ Auto-start recording with countdown
  • ✅ Clean data format ready for training pipeline
  • ✅ Proper error handling and fallbacks

Training Data Requirements

Based on NitroGen's architecture and similar behavioral cloning projects:

  • Minimum: 2-3 hours (basic gameplay ability)
  • Recommended: 5-10 hours (competent performance)
  • Optimal: 10-15 hours (strong generalization)

Why so little compared to NitroGen's original training?

  1. Fine-tuning, not training from scratch: The model already understands gameplay concepts
  2. Single game focus: Limited action space and scenarios compared to 1,000+ games
  3. Transfer learning: Pre-trained vision encoder already extracts relevant features

Lessons Learned

1. Cloud Infrastructure Democratizes AI Development

Initially, I planned to train on my RTX 3050 (4GB VRAM), which would have required extensive memory optimization and 6-9 days of training time. Switching to RunPod's H100 rental proved more practical: $15-22 for 12-15 hours of training versus similar electricity costs but a week of waiting. This demonstrates a key insight: you don't need to own datacenter GPUs to use them. Cloud rental democratizes access to professional-grade infrastructure.

2. Data Collection is Hard

My initial "just scrape videos" plan fell apart immediately. Only 1.5% of videos had usable overlays. This pivot to recording human gameplay, while more time-intensive, gives full control over data quality, diverse scenarios, and clean ground truth with no OCR errors.

3. Systems Integration is 80% of the Work

The actual model fine-tuning will be relatively straightforward PyTorch code. But getting the data collection pipeline working required understanding Windows APIs, debugging screen capture issues, solving input reading across processes, handling window detection edge cases, and implementing fallback strategies. This is typical of real-world ML projects: data engineering dominates model engineering.

What's Next: Part 2

In Part 2, I'll cover:

  1. Data Processing Pipeline - Converting JSONL recordings to Parquet training format, data validation and quality checks, augmentation strategies
  2. Training Implementation - PyTorch dataset and dataloader, RunPod setup and environment configuration, training loop with monitoring
  3. Fine-tuning Process - Hyperparameter selection, training curves and metrics, debugging training issues
  4. Evaluation and Results - Qualitative gameplay assessment, quantitative metrics, comparison to baseline
  5. Reflections - What worked well, what I'd do differently, cost analysis, future improvements

Current Progress

Status: Data collection phase

Recordings completed: 3.5 hours

Target: 15 hours of diverse gameplay (optimal)

Platform: RunPod H100 80GB (spot pricing ~$1.49/hr)

Estimated training cost: $15-22 total

Next milestone: First training run on RunPod

Outro

This project demonstrates that modern foundation models enable exciting possibilities even for individual developers through accessible cloud infrastructure. The key is:

  1. Start small: Single game, simple controls, modest data requirements
  2. Transfer learning: Leverage pre-trained models, don't train from scratch
  3. Use the right tools: Cloud GPU rental democratizes access to professional infrastructure
  4. Iterate ruthlessly: Expect initial plans to fail, have fallback strategies
  5. Document problems: Every issue you solve helps the next person

The recording system is working beautifully. Now it's time to collect 15 hours of gameplay and see if behavioral cloning can teach an AI to master Joust.

Stay tuned for Part 2, where we'll find out if all this engineering effort pays off with a competent Joust-playing AI.


Follow along with the full implementation details in the GitHub repository (available after Part 2 publication).

Published on Tomorrow's Innovations | tomorrowsinnovations.co
