Introduction
What if you could teach an AI to play classic arcade games just by showing it gameplay videos? That's exactly what NVIDIA's NitroGen model promises: a vision-to-action transformer trained on 40,000 hours of gameplay across 1,000+ different games. When I saw NitroGen drop, it reminded me of OpenAI Gym from several years back, where you could train an AI model to learn to play Atari 2600 games. So I thought: why not see how well NitroGen plays one of my favorite games, Joust, showcased in Ernest Cline's "Ready Player One"? Well, it turns out that NitroGen was not trained to play Joust!
In this two-part series, I document my journey fine-tuning NitroGen to master Joust, the classic 1982 arcade game. Part 1 covers the setup, data collection strategy, and getting human gameplay recordings working. Part 2 will cover training the model and evaluating its performance.
Why Joust?
Joust is an excellent test case for several reasons:
- Simple but challenging: Horizontal movement + one button (flap), but requires good timing
- Clear objectives: Stay alive, defeat enemies, collect eggs
- Limited action space: Perfect for behavioral cloning with limited training data
- Nostalgic: Classic arcade game from the golden age
Unlike modern games with complex controls, Joust's simplicity means we can achieve meaningful results without massive computational resources.
The Vision
Fine-tune NVIDIA's NitroGen, a pre-trained 493M-parameter vision transformer, to play Joust competently by:
- Using MAME to run an emulation of Joust (I purchased the ROM)
- Recording 5-10 hours of human gameplay demonstrations using MAME
- Training through behavioral cloning (imitation learning)
- Training on cloud infrastructure (RunPod H100 80GB)*
*Originally planned for RTX 3050 (4GB VRAM), but switched to cloud training for faster iteration and better results at comparable cost.
No reinforcement learning, no reward engineering, just "watch and learn" from human gameplay.
Project Setup
Hardware Configuration
Development & Data Collection Setup:
- GPU: NVIDIA RTX 3050 (4GB VRAM) - for model inference testing
- RAM: 32GB
- Storage: Local SSD + NAS for large datasets
- OS: Windows 11
Training Platform:
- Service: RunPod cloud GPU rental
- GPU: H100 80GB PCIe (rented on-demand)
- Pricing: ~$1.49/hr spot pricing
- Duration: 12-15 hours estimated for full training run
Originally, I planned to train locally on the RTX 3050, but cloud rental proved more practical: comparable cost (~$15-22 total) with dramatically faster results (12-15 hours vs 6-9 days). This demonstrates how cloud infrastructure democratizes access to professional-grade GPUs—you don't need to own datacenter hardware to use it.
Software Stack
Core Components:
- MAME: The Multiple Arcade Machine Emulator (MAME) for running Joust
- NitroGen: NVIDIA's vision-to-action transformer (493M parameters)
- PyTorch: Deep learning framework with CUDA support
- Python 3.11: Via Pinokio environment (includes all ML dependencies)
Key Libraries:
- opencv-python: Video processing and screen capture
- pygame: Gamepad input handling
- pywin32: Windows API access for keyboard input
- zmq: Client-server architecture for model inference
- mss/dxcam: Screen capture (software fallback + hardware acceleration)
The Data Challenge
My initial plan was to leverage YouTube videos of Joust gameplay with gamepad overlays visible on screen. I could extract the controller inputs from the overlay and sync them with gameplay frames. It turns out there are maybe three videos of Joust gameplay with a visible gamepad overlay, one of which was Joust on a Game Boy, played in grayscale at low resolution and not useful. So I had to dust off my arcade Joust skills and set up a way to capture my own gameplay. This may take some time. I really suck at Joust.
Reality check: Out of 206 videos (68.5 hours total), only 3 videos had usable gamepad overlays (~1.5% hit rate).
This forced a pivot: record my own gameplay.
Building the Recording System
Challenge 1: Physical Gamepad or Keyboard?
I don't own a gamepad. Solution? Map keyboard controls to gamepad data:
```python
# Windows API virtual-key codes for system-wide keyboard state
VK_LEFT = 0x25      # Left arrow
VK_RIGHT = 0x27     # Right arrow
VK_LCONTROL = 0xA2  # Left Ctrl (FLAP button)

def read_gamepad_state():
    # Map keyboard state to gamepad format
    # (is_key_pressed() wraps the Windows API; see Challenge 3 below)
    left_x = 0.0
    if is_key_pressed(VK_LEFT):
        left_x = -1.0   # Stick pushed left
    elif is_key_pressed(VK_RIGHT):
        left_x = +1.0   # Stick pushed right

    return GamepadState(
        j_left=(left_x, 0.0),   # Left analog stick (x, y)
        j_right=(0.0, 0.0),     # Right stick unused
        buttons=[1.0 if is_key_pressed(VK_LCONTROL) else 0.0, ...],
    )
```

Key insight: NitroGen expects standard gamepad data (a 21D action space: 2 joysticks × 2 axes + 17 buttons). The j_left field isn't "move left"; it's the left analog stick position, where negative X = left and positive X = right.
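To make that 21D layout concrete, here's a minimal sketch of how a recorded state could flatten into an action vector. The ordering (four stick axes first, then 17 buttons) and the GamepadState dataclass are assumptions for illustration, not NitroGen's documented schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GamepadState:
    # Stand-in container matching the fields used above
    j_left: Tuple[float, float] = (0.0, 0.0)    # Left stick (x, y)
    j_right: Tuple[float, float] = (0.0, 0.0)   # Right stick (x, y)
    buttons: List[float] = field(default_factory=lambda: [0.0] * 17)

def to_action_vector(state: GamepadState) -> List[float]:
    # Assumed layout: [lx, ly, rx, ry, button_0 ... button_16] = 21 values
    return [*state.j_left, *state.j_right, *state.buttons]

# Example: pushing right while holding FLAP (button 0)
vec = to_action_vector(GamepadState(j_left=(1.0, 0.0), buttons=[1.0] + [0.0] * 16))
assert len(vec) == 21
```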
Challenge 2: Screen Capture from MAME
Capturing gameplay frames proved surprisingly tricky:
Attempt 1: Use dxcam for hardware-accelerated DirectX capture
- Failed on my hybrid graphics system (Intel integrated + NVIDIA discrete)
- Error: "device interface or feature level not supported"
Solution: Automatic fallback to mss (software capture)
- Works reliably across all systems
- Still achieves 60 FPS capture (adequate for training data)
Critical requirement: MAME must run in windowed mode (-window flag). Fullscreen uses hardware overlays that can't be captured, so forgetting this results in black frames.
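For reference, here's a minimal sketch of the capture logic with the dxcam-to-mss fallback. The capture region is a placeholder; the real recorder locates the MAME window first:

```python
import numpy as np

# Placeholder capture region; real code would find the MAME window's coordinates
REGION = {"top": 100, "left": 100, "width": 640, "height": 480}

def make_capturer():
    """Prefer hardware-accelerated dxcam, fall back to software mss."""
    try:
        import dxcam
        camera = dxcam.create()  # can fail on hybrid Intel + NVIDIA systems
        region = (REGION["left"], REGION["top"],
                  REGION["left"] + REGION["width"],
                  REGION["top"] + REGION["height"])
        return lambda: camera.grab(region=region)  # returns None if no new frame yet
    except Exception:
        import mss
        sct = mss.mss()
        return lambda: np.array(sct.grab(REGION))[:, :, :3]  # drop the alpha channel

grab_frame = make_capturer()
frame = grab_frame()  # numpy array of shape (height, width, 3), or None
```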
Challenge 3: Keyboard Input When MAME Has Focus
This was the biggest technical hurdle. Multiple failed attempts:
Attempt 1: Use pygame.event.get() to read keyboard
- Problem: Consumed events before other parts could see them
- Hotkeys stopped working
Attempt 2: Use pygame.key.get_pressed()
- Problem: Only sees input when pygame window has focus
- MAME has focus during gameplay → all zeros recorded
Final Solution: Windows API GetAsyncKeyState()
```python
import ctypes

def is_key_pressed(vk_code):
    # Bit 15 of GetAsyncKeyState indicates the key is currently held down
    return ctypes.windll.user32.GetAsyncKeyState(vk_code) & 0x8000 != 0
```

This reads system-wide keyboard state regardless of window focus, and works even when MAME is in the foreground.
Lesson learned: Don't assume high-level libraries will work for background input monitoring. Sometimes you need to drop down to OS APIs.
The Recording Workflow
The final recording system works beautifully:
```
# 1. Start MAME in windowed mode
d:\mame-exe\mame.exe joust -window

# 2. Start recording with keyboard controls
scripts\record_joust.bat --keyboard

# 3. Play for 10-30 minutes (I usually last 5-10 minutes tops)

# 4. Press Ctrl+C to save
```

What gets recorded:
- 256×256 PNG frames at 60 FPS
- Synchronized gamepad state (joystick + buttons) in JSONL format
- Metadata (duration, actual FPS achieved, input device)
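To tie the pieces together, here is a rough sketch of the capture loop. grab_frame and read_gamepad_state are the helpers sketched earlier; the actual recorder adds window detection, the start countdown, and error handling:

```python
import json
import time
from pathlib import Path

import cv2

TARGET_FPS = 60
FRAME_INTERVAL = 1.0 / TARGET_FPS

def record_session(out_dir, grab_frame, read_gamepad_state, duration_s=600):
    """Capture (frame, input) pairs at ~60 FPS and append actions as JSONL rows."""
    out = Path(out_dir)
    (out / "observations").mkdir(parents=True, exist_ok=True)
    start = time.time()
    frame_id = 0
    with open(out / "actions.jsonl", "a") as actions:
        while time.time() - start < duration_s:
            t0 = time.time()
            frame = grab_frame()            # screen-capture helper
            state = read_gamepad_state()    # keyboard-to-gamepad helper
            if frame is not None:
                obs_path = f"observations/{frame_id:06d}.png"
                cv2.imwrite(str(out / obs_path), cv2.resize(frame, (256, 256)))
                actions.write(json.dumps({
                    "frame_id": frame_id,
                    "timestamp": round(t0 - start, 2),
                    "j_left": list(state.j_left),
                    "j_right": list(state.j_right),
                    "buttons": list(state.buttons),
                    "observation_path": obs_path,
                }) + "\n")
                frame_id += 1
            # Sleep off whatever remains of this frame's 1/60 s budget
            time.sleep(max(0.0, FRAME_INTERVAL - (time.time() - t0)))
```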
Example action data:
```json
{
  "frame_id": 1536,
  "timestamp": 148.58,
  "j_left": [1.0, 0.0],
  "j_right": [0.0, 0.0],
  "buttons": [1.0, 0.0, 0.0, ...],
  "observation_path": "observations/001536.png"
}
```

This shows: moving right (j_left X = 1.0) while flapping (button 0 = 1.0).
Data Format Design
Each recording session produces:
```
human_demos/joust_20241227_143052/
├── observations/
│   ├── 000000.png
│   ├── 000001.png
│   └── ... (60 frames per second)
├── actions.jsonl    # Frame-by-frame input data
└── metadata.json    # Session stats
```

This format is designed to convert cleanly to NitroGen's expected Parquet training format with frame-action pairs.
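As a rough sketch of why that conversion should be clean, here is one way to turn a session's actions.jsonl into a Parquet table of frame-action pairs using pandas. The exact column names and schema NitroGen expects are assumptions to be pinned down in Part 2, and the paths are illustrative:

```python
import json
from pathlib import Path

import pandas as pd

def session_to_parquet(session_dir, out_path):
    """Convert one recording session's actions.jsonl into a Parquet table."""
    session = Path(session_dir)
    rows = []
    with open(session / "actions.jsonl") as f:
        for line in f:
            row = json.loads(line)
            # Resolve the image path relative to the session directory
            row["observation_path"] = str(session / row["observation_path"])
            rows.append(row)
    pd.DataFrame(rows).to_parquet(out_path, index=False)  # needs pyarrow or fastparquet

# Example usage (paths illustrative)
session_to_parquet("human_demos/joust_20241227_143052", "joust_session_000.parquet")
```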
Technical Insights
1. Understanding Gamepad Data Models
Coming from a keyboard/mouse background, gamepad data structure wasn't intuitive:
- Physical gamepad: Has left stick, right stick, and buttons
- Data representation: j_left = left stick (x, y), j_right = right stick (x, y)
- Joust usage: only the left stick's X-axis (horizontal) + button A (flap)
The naming can be confusing: j_left sounds like "move left" but actually means "left joystick position." The X value determines direction. Think of a two-player arcade cabinet: the name tells you which physical stick it is, not which way the character moves.
2. Screen Capture Gotchas
- Exclusive fullscreen = hardware overlays = uncapturable
- Windowed mode = standard rendering = works with mss/dxcam
- Always test capture during actual gameplay, not just menus
- First frame black? You're probably in fullscreen mode
3. Input Reading Complexity
Reading input across applications is harder than it seems:
| Method | Works When MAME Has Focus? | System-Wide? |
|---|---|---|
| pygame.event.get() | No | No |
| pygame.key.get_pressed() | No | No |
| keyboard library | Yes, but blocks pygame | Yes |
| GetAsyncKeyState() | Yes | Yes |
For background monitoring, OS-level APIs are the only reliable solution.
Current Status
After solving these technical challenges, the recording system is production-ready:
- ✅ Reliable 60 FPS screen capture from MAME
- ✅ Accurate keyboard-to-gamepad mapping
- ✅ System-wide input reading (works with MAME in focus)
- ✅ Auto-start recording with countdown
- ✅ Clean data format ready for training pipeline
- ✅ Proper error handling and fallbacks
Training Data Requirements
Based on NitroGen's architecture and similar behavioral cloning projects:
- Minimum: 2-3 hours (basic gameplay ability)
- Recommended: 5-10 hours (competent performance)
- Optimal: 10-15 hours (strong generalization)
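For a sense of scale, at the 60 FPS capture rate used here even the middle of that range is a lot of frame-action pairs (quick back-of-the-envelope math):

```python
# Frame-action pairs captured at 60 FPS
frames_per_hour = 60 * 60 * 60          # 216,000 pairs per hour
print(f"{5 * frames_per_hour:,}")        # 1,080,000 pairs at 5 hours
print(f"{15 * frames_per_hour:,}")       # 3,240,000 pairs at 15 hours
```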
Why so little compared to NitroGen's original training?
- Fine-tuning, not training from scratch: The model already understands gameplay concepts
- Single game focus: Limited action space and scenarios compared to 100+ games
- Transfer learning: Pre-trained vision encoder already extracts relevant features
Lessons Learned
1. Cloud Infrastructure Democratizes AI Development
Initially, I planned to train on my RTX 3050 (4GB VRAM), which would have required extensive memory optimization and 6-9 days of training time. Switching to RunPod's H100 rental proved more practical: $15-22 for 12-15 hours of training versus similar electricity costs but a week of waiting. This demonstrates a key insight: you don't need to own datacenter GPUs to use them. Cloud rental democratizes access to professional-grade infrastructure.
2. Data Collection is Hard
My initial "just scrape videos" plan fell apart immediately. Only 1.5% of videos had usable overlays. This pivot to recording human gameplay, while more time-intensive, gives full control over data quality, diverse scenarios, and clean ground truth with no OCR errors.
3. Systems Integration is 80% of the Work
The actual model fine-tuning will be relatively straightforward PyTorch code. But getting the data collection pipeline working required understanding Windows APIs, debugging screen capture issues, solving input reading across processes, handling window detection edge cases, and implementing fallback strategies. This is typical of real-world ML projects: data engineering dominates model engineering.
What's Next: Part 2
In Part 2, I'll cover:
- Data Processing Pipeline - Converting JSONL recordings to Parquet training format, data validation and quality checks, augmentation strategies
- Training Implementation - PyTorch dataset and dataloader, RunPod setup and environment configuration, training loop with monitoring
- Fine-tuning Process - Hyperparameter selection, training curves and metrics, debugging training issues
- Evaluation and Results - Qualitative gameplay assessment, quantitative metrics, comparison to baseline
- Reflections - What worked well, what I'd do differently, cost analysis, future improvements
Current Progress
Status: Data collection phase
Recordings completed: 3.5 hours (already started)
Target: 15 hours of diverse gameplay (optimal)
Platform: RunPod H100 80GB (spot pricing ~$1.49/hr)
Estimated training cost: $15-22 total
Next milestone: First training run on RunPod
Outro
This project demonstrates that modern foundation models enable exciting possibilities even for individual developers through accessible cloud infrastructure. The key is:
- Start small: Single game, simple controls, modest data requirements
- Transfer learning: Leverage pre-trained models, don't train from scratch
- Use the right tools: Cloud GPU rental democratizes access to professional infrastructure
- Iterate ruthlessly: Expect initial plans to fail, have fallback strategies
- Document problems: Every issue you solve helps the next person
The recording system is working beautifully. Now it's time to collect 15 hours of gameplay and see if behavioral cloning can teach an AI to master Joust.
Stay tuned for Part 2, where we'll find out if all this engineering effort pays off with a competent Joust-playing AI.
Follow along with the full implementation details in the GitHub repository (available after Part 2 publication).
Published on Tomorrow's Innovations | tomorrowsinnovations.co