AI Training · NVIDIA NitroGen · Behavioral Cloning · Classic Arcade

Teaching AI to Play Joust: Fine-tuning NVIDIA's NitroGen Model (Part 1)

Setting up the system for recording human gameplay and preparing to train a vision-to-action transformer

By Garry Osborne · December 27, 2024 · 18 min read

Introduction

What if you could teach an AI to play classic arcade games just by showing it gameplay videos? That's exactly what NVIDIA's NitroGen model promises: a vision-to-action transformer trained on 40,000 hours of gameplay across 1,000+ different games. When I saw NitroGen drop, it reminded me of OpenAI Gym from several years back, where you could train an AI model to learn to play Atari 2600 games. So I thought: why don't I see how well NitroGen plays one of my favorite games, Joust, the game showcased in Ernest Cline's book "Ready Player One"? Well, it turns out that NitroGen was not trained to play Joust!

In this two-part series, I document my journey fine-tuning NitroGen to master Joust, the classic 1982 arcade game. Part 1 covers the setup, data collection strategy, and getting human gameplay recordings working. Part 2 will cover training the model and evaluating its performance.

Why Joust?

Joust is an excellent test case for several reasons:

  1. Simple but challenging: Horizontal movement + one button (flap), but requires good timing
  2. Clear objectives: Stay alive, defeat enemies, collect eggs
  3. Limited action space: Perfect for behavioral cloning with limited training data
  4. Nostalgic: Classic arcade game from the golden age

Unlike modern games with complex controls, Joust's simplicity means we can achieve meaningful results without massive computational resources.

The Vision

Fine-tune NVIDIA's NitroGen, a pre-trained 493M-parameter vision-to-action transformer, to play Joust competently by:

  • Using MAME to run an emulation of Joust (I have purchased the ROM)
  • Recording 5-10 hours of human gameplay demonstrations using MAME
  • Training through behavioral cloning (imitation learning)
  • Training on cloud infrastructure (RunPod H100 80GB)*

*Originally planned for RTX 3050 (4GB VRAM), but switched to cloud training for faster iteration and better results at comparable cost.

No reinforcement learning, no reward engineering, just "watch and learn" from human gameplay.
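To make "watch and learn" concrete: behavioral cloning is plain supervised learning on (frame, action) pairs. Here is a minimal sketch of one training step, assuming a generic PyTorch policy with a 21-dimensional action head; the actual NitroGen fine-tuning loop (covered in Part 2) may use a different loss and output format.

import torch
import torch.nn.functional as F

def behavioral_cloning_step(model, optimizer, frames, actions):
    """One supervised update: predict the recorded 21D gamepad action from the frame.
    `model` is a placeholder for the fine-tuned policy; the real loss (e.g. per-button
    classification vs. regression) depends on its output head.
    """
    optimizer.zero_grad()
    pred = model(frames)              # (batch, 21) predicted gamepad vector
    loss = F.mse_loss(pred, actions)  # simple regression loss as an illustration
    loss.backward()
    optimizer.step()
    return loss.item()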

Project Setup

Hardware Configuration

Development & Data Collection Setup:

  • GPU: NVIDIA RTX 3050 (4GB VRAM) - for model inference testing
  • RAM: 32GB
  • Storage: Local SSD + NAS for large datasets
  • OS: Windows 11

Training Platform:

  • Service: RunPod cloud GPU rental
  • GPU: H100 80GB PCIe (rented on-demand)
  • Pricing: ~$1.49/hr spot pricing
  • Duration: 12-15 hours estimated for full training run

Originally, I planned to train locally on the RTX 3050, but cloud rental proved more practical: comparable cost (~$15-22 total) with dramatically faster results (12-15 hours vs 6-9 days). This demonstrates how cloud infrastructure democratizes access to professional-grade GPUs—you don't need to own datacenter hardware to use it.

Software Stack

Core Components:

  • MAME: The Multiple Arcade Machine Emulator, used to run Joust
  • NitroGen: NVIDIA's vision-to-action transformer (493M parameters)
  • PyTorch: Deep learning framework with CUDA support
  • Python 3.11: Via Pinokio environment (includes all ML dependencies)

Key Libraries:

  • opencv-python: Video processing and screen capture
  • pygame: Gamepad input handling
  • pywin32: Windows API access for keyboard input
  • zmq: Client-server architecture for model inference
  • mss/dxcam: Screen capture (software fallback + hardware acceleration)
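Before recording anything, a quick sanity check that the stack imports cleanly and PyTorch sees the GPU saves headaches later. This is a throwaway snippet, not one of the project's scripts.

import cv2
import mss      # imported only to confirm it is installed
import pygame
import torch
import zmq

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("OpenCV:", cv2.__version__, "| pygame:", pygame.version.ver, "| pyzmq:", zmq.__version__)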

The Data Challenge

My initial plan was to leverage YouTube videos of Joust gameplay with gamepad overlays visible on screen. I could extract the controller inputs from the overlay and sync them with gameplay frames. It turns out there are maybe three videos of Joust gameplay with a visible gamepad overlay, and one of those was Joust on a Game Boy, played in low-resolution grayscale and not useful. So I had to dust off my arcade Joust skills and set up a way to capture my own gameplay. This may take some time. I really suck at Joust.

Reality check: Out of 206 videos (68.5 hours total), only 3 videos had usable gamepad overlays (~1.5% hit rate).

This forced a pivot: record my own gameplay.

Building the Recording System

Challenge 1: Physical Gamepad or Keyboard?

I don't own a gamepad. Solution? Map keyboard controls to gamepad data:

# Windows virtual-key codes for system-wide keyboard state
VK_LEFT = 0x25      # Left arrow
VK_RIGHT = 0x27     # Right arrow
VK_LCONTROL = 0xA2  # Left Ctrl (FLAP button)

def read_gamepad_state():
    # Map keyboard state to the left analog stick's X axis
    # (is_key_pressed() reads the Windows API; see Challenge 3 below)
    left_x = 0.0
    if is_key_pressed(VK_LEFT):
        left_x = -1.0  # Stick pushed left
    elif is_key_pressed(VK_RIGHT):
        left_x = +1.0  # Stick pushed right

    return GamepadState(
        j_left=(left_x, 0.0),  # Left analog stick (x, y)
        j_right=(0.0, 0.0),    # Right stick unused
        buttons=[1.0 if is_key_pressed(VK_LCONTROL) else 0.0, ...]
    )

Key insight: NitroGen expects standard gamepad data (21D action space: 2 joysticks × 2 axes + 17 buttons). The j_left field isn't "move left"—it's the left analog stick position where negative X = left direction and positive X = right direction.
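For illustration, here is a minimal sketch of a container like the GamepadState used above and how it flattens into that 21D vector. This is my own illustrative class; the project's actual class and field layout may differ.

from dataclasses import dataclass, field

import numpy as np

@dataclass
class GamepadState:
    """Illustrative container for one frame of controller input."""
    j_left: tuple = (0.0, 0.0)     # left stick (x, y), each axis in [-1, 1]
    j_right: tuple = (0.0, 0.0)    # right stick (x, y)
    buttons: list = field(default_factory=lambda: [0.0] * 17)  # 17 button channels

    def to_vector(self) -> np.ndarray:
        # 2 sticks x 2 axes + 17 buttons = 21D action vector
        return np.array([*self.j_left, *self.j_right, *self.buttons], dtype=np.float32)

For Joust, only the left stick's X axis and one button channel ever change; the other 19 dimensions stay at zero.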

Challenge 2: Screen Capture from MAME

Capturing gameplay frames proved surprisingly tricky:

Attempt 1: Use dxcam for hardware-accelerated DirectX capture

  • Failed on my hybrid graphics system (Intel integrated + NVIDIA discrete)
  • Error: "device interface or feature level not supported"

Solution: Automatic fallback to mss (software capture)

  • Works reliably across all systems
  • Still achieves 60 FPS capture (adequate for training data)

Critical requirement: MAME must run in windowed mode (the -window flag). Exclusive fullscreen uses hardware overlays that can't be captured; forget this and you'll record nothing but black frames.
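Here is a minimal sketch of the dxcam-to-mss fallback, assuming the MAME window's screen region is already known. The grab_frame name and region handling are mine, not the project's.

import numpy as np
import mss

try:
    import dxcam
    _camera = dxcam.create(output_color="BGR")   # hardware-accelerated DirectX path
except Exception:
    _camera = None  # hybrid Intel/NVIDIA systems can fail here; fall back to mss

def grab_frame(region):
    """Capture one frame of the MAME window. region = (left, top, right, bottom)."""
    if _camera is not None:
        frame = _camera.grab(region=region)
        if frame is not None:      # dxcam returns None when nothing has changed
            return frame
    # Software fallback: slower, but works everywhere and still manages ~60 FPS
    left, top, right, bottom = region
    with mss.mss() as sct:
        shot = sct.grab({"left": left, "top": top,
                         "width": right - left, "height": bottom - top})
    return np.array(shot)[:, :, :3]  # drop the alpha channel (BGRA -> BGR)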

Challenge 3: Keyboard Input When MAME Has Focus

This was the biggest technical hurdle. Multiple failed attempts:

Attempt 1: Use pygame.event.get() to read keyboard

  • Problem: Consumed events before other parts could see them
  • Hotkeys stopped working

Attempt 2: Use pygame.key.get_pressed()

  • Problem: Only sees input when pygame window has focus
  • MAME has focus during gameplay → all zeros recorded

Final Solution: Windows API GetAsyncKeyState()

import ctypes

def is_key_pressed(vk_code):
    return ctypes.windll.user32.GetAsyncKeyState(vk_code) & 0x8000 != 0

This reads system-wide keyboard state regardless of window focus. Works perfectly even when MAME is in foreground.

Lesson learned: Don't assume high-level libraries will work for background input monitoring. Sometimes you need to drop down to OS APIs.

The Recording Workflow

The final recording system works beautifully:

# 1. Start MAME in windowed mode
d:\mame-exe\mame.exe joust -window

# 2. Start recording with keyboard controls
scripts\record_joust.bat --keyboard

# 3. Play for 10-30 minutes (I usually last 5 - 10 minutes tops)
# 4. Press Ctrl+C to save

What gets recorded:

  • 256×256 PNG frames at 60 FPS
  • Synchronized gamepad state (joystick + buttons) in JSONL format
  • Metadata (duration, actual FPS achieved, input device)

Example action data:

{
  "frame_id": 1536,
  "timestamp": 148.58,
  "j_left": [1.0, 0.0],
  "j_right": [0.0, 0.0],
  "buttons": [1.0, 0.0, 0.0, ...],
  "observation_path": "observations/001536.png"
}

This shows: moving right (j_left X = 1.0) while flapping (button 0 = 1.0).
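Putting the pieces together, the core recording loop can be sketched roughly as below. It reuses the hypothetical grab_frame() and read_gamepad_state() helpers from the earlier sketches; names and pacing logic are illustrative, not the project's actual script.

import json
import os
import time

import cv2  # opencv-python

TARGET_FPS = 60
FRAME_SIZE = (256, 256)

def record_session(out_dir, region, stop_flag):
    """Grab a frame, read input, write one PNG and one JSONL line per frame.
    stop_flag is any callable returning True when recording should end; the real
    recorder also writes metadata.json and handles Ctrl+C cleanly.
    """
    os.makedirs(os.path.join(out_dir, "observations"), exist_ok=True)
    start = time.perf_counter()
    with open(os.path.join(out_dir, "actions.jsonl"), "w") as actions:
        frame_id = 0
        while not stop_flag():
            frame = grab_frame(region)        # Challenge 2: dxcam/mss capture
            state = read_gamepad_state()      # Challenge 1: keyboard -> gamepad
            rel_path = f"observations/{frame_id:06d}.png"
            cv2.imwrite(os.path.join(out_dir, rel_path),
                        cv2.resize(frame, FRAME_SIZE))
            actions.write(json.dumps({
                "frame_id": frame_id,
                "timestamp": round(time.perf_counter() - start, 2),
                "j_left": list(state.j_left),
                "j_right": list(state.j_right),
                "buttons": list(state.buttons),
                "observation_path": rel_path,
            }) + "\n")
            frame_id += 1
            # Simple pacing toward 60 FPS; drift compensation omitted for brevity
            next_tick = start + frame_id / TARGET_FPS
            time.sleep(max(0.0, next_tick - time.perf_counter()))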

Data Format Design

Each recording session produces:

human_demos/joust_20241227_143052/
├── observations/
│   ├── 000000.png
│   ├── 000001.png
│   └── ... (60 frames per second)
├── actions.jsonl      # Frame-by-frame input data
└── metadata.json      # Session stats

This format is designed to convert cleanly to NitroGen's expected Parquet training format with frame-action pairs.
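As a rough sketch of that conversion (column names are my guesses, not NitroGen's actual schema), something like the following with pandas and pyarrow would do the flattening:

import json
from pathlib import Path

import pandas as pd  # to_parquet() requires pyarrow (pip install pyarrow)

def session_to_parquet(session_dir: str, out_path: str):
    """Flatten one recording session's actions.jsonl into a Parquet table.
    Column names are illustrative; match them to NitroGen's expected schema.
    """
    session = Path(session_dir)
    rows = []
    with open(session / "actions.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            rows.append({
                "frame_id": rec["frame_id"],
                "timestamp": rec["timestamp"],
                "image_path": str(session / rec["observation_path"]),
                # 21D action: left stick (x, y) + right stick (x, y) + 17 buttons
                "action": rec["j_left"] + rec["j_right"] + rec["buttons"],
            })
    pd.DataFrame(rows).to_parquet(out_path, index=False)

# Example: session_to_parquet("human_demos/joust_20241227_143052", "train/joust_000.parquet")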

Technical Insights

1. Understanding Gamepad Data Models

Coming from a keyboard/mouse background, gamepad data structure wasn't intuitive:

  • Physical gamepad: Has left stick, right stick, and buttons
  • Data representation: j_left = left stick (x, y), j_right = right stick (x, y)
  • Joust usage: Only left stick X-axis (horizontal) + button A (flap)

The naming can be confusing: j_left sounds like "move left" but actually means "left joystick position," and the X value determines direction. Think of a two-player arcade cabinet: "left" and "right" identify which stick, not which way to move.

2. Screen Capture Gotchas

  • Exclusive fullscreen = hardware overlays = uncapturable
  • Windowed mode = standard rendering = works with mss/dxcam
  • Always test capture during actual gameplay, not just menus
  • First frame black? You're probably in fullscreen mode

3. Input Reading Complexity

Reading input across applications is harder than it seems:

| Method | Works When MAME Has Focus? | System-Wide? |
| --- | --- | --- |
| pygame.event.get() | No | No |
| pygame.key.get_pressed() | No | No |
| keyboard library | Yes, but blocks pygame | Yes |
| GetAsyncKeyState() | Yes | Yes |

For background monitoring, OS-level APIs are the only reliable solution.

Current Status

After solving these technical challenges, the recording system is production-ready:

  • ✅ Reliable 60 FPS screen capture from MAME
  • ✅ Accurate keyboard-to-gamepad mapping
  • ✅ System-wide input reading (works with MAME in focus)
  • ✅ Auto-start recording with countdown
  • ✅ Clean data format ready for training pipeline
  • ✅ Proper error handling and fallbacks

Training Data Requirements

Based on NitroGen's architecture and similar behavioral cloning projects:

  • Minimum: 2-3 hours (basic gameplay ability)
  • Recommended: 5-10 hours (competent performance)
  • Optimal: 10-15 hours (strong generalization)

Why so little compared to NitroGen's original training?

  1. Fine-tuning, not training from scratch: The model already understands gameplay concepts
  2. Single game focus: Limited action space and scenarios compared to 1,000+ games
  3. Transfer learning: Pre-trained vision encoder already extracts relevant features

Lessons Learned

1. Cloud Infrastructure Democratizes AI Development

Initially, I planned to train on my RTX 3050 (4GB VRAM), which would have required extensive memory optimization and 6-9 days of training time. Switching to RunPod's H100 rental proved more practical: $15-22 for 12-15 hours of training versus similar electricity costs but a week of waiting. This demonstrates a key insight: you don't need to own datacenter GPUs to use them. Cloud rental democratizes access to professional-grade infrastructure.

2. Data Collection is Hard

My initial "just scrape videos" plan fell apart immediately. Only 1.5% of videos had usable overlays. This pivot to recording human gameplay, while more time-intensive, gives full control over data quality, diverse scenarios, and clean ground truth with no OCR errors.

3. Systems Integration is 80% of the Work

The actual model fine-tuning will be relatively straightforward PyTorch code. But getting the data collection pipeline working required understanding Windows APIs, debugging screen capture issues, solving input reading across processes, handling window detection edge cases, and implementing fallback strategies. This is typical of real-world ML projects: data engineering dominates model engineering.

What's Next: Part 2

In Part 2, I'll cover:

  1. Data Processing Pipeline - Converting JSONL recordings to Parquet training format, data validation and quality checks, augmentation strategies
  2. Training Implementation - PyTorch dataset and dataloader, RunPod setup and environment configuration, training loop with monitoring
  3. Fine-tuning Process - Hyperparameter selection, training curves and metrics, debugging training issues
  4. Evaluation and Results - Qualitative gameplay assessment, quantitative metrics, comparison to baseline
  5. Reflections - What worked well, what I'd do differently, cost analysis, future improvements

Current Progress

Status: Data collection phase

Recordings completed: 3.5 hours

Target: 15 hours of diverse gameplay (optimal)

Platform: RunPod H100 80GB (spot pricing ~$1.49/hr)

Estimated training cost: $15-22 total

Next milestone: First training run on RunPod

Outro

This project demonstrates that modern foundation models enable exciting possibilities even for individual developers through accessible cloud infrastructure. The key is:

  1. Start small: Single game, simple controls, modest data requirements
  2. Transfer learning: Leverage pre-trained models, don't train from scratch
  3. Use the right tools: Cloud GPU rental democratizes access to professional infrastructure
  4. Iterate ruthlessly: Expect initial plans to fail, have fallback strategies
  5. Document problems: Every issue you solve helps the next person

The recording system is working beautifully. Now it's time to collect 15 hours of gameplay and see if behavioral cloning can teach an AI to master Joust.

Stay tuned for Part 2, where we'll find out if all this engineering effort pays off with a competent Joust-playing AI.


Follow along with the full implementation details in the GitHub repository (available after Part 2 publication).

Published on Tomorrow's Innovations | tomorrowsinnovations.co
