
Why I couldn't train Gemma 3n locally (and why I'm using Vertex AI instead)

2025-10-16 · 5 min read

I've been working on a pronunciation coaching app that needs a multimodal AI capable of understanding both text and audio. The app currently uses Google Gemini's native audio models, but I wanted to fine-tune a dedicated model for pronunciation coaching. Google's Gemma 3n seemed perfect with its native audio support through the Universal Speech Model encoder. I have a MacBook Pro M4 Pro with 48GB of RAM. Surely that's enough to train a 5B parameter model locally, right?

Spoiler: it's not. Here's what I learned from failing.

Why I tried local training

The appeal was obvious:

  • No cloud costs: GPU training runs $2-5 per attempt on services like RunPod or Vast.ai
  • Privacy: My training data stays local
  • Fast iteration: No uploading datasets or configuring cloud environments
  • Hands-on learning: I wanted to experiment with PyTorch training scripts, understand memory optimization techniques, and learn through trial and error. Locally, I can modify code and retry in seconds, versus deploying to the cloud and waiting for scripts to run

I had 988 training examples covering pronunciation errors, speaking delivery analysis, and coaching conversations. It seemed like a perfect candidate for overnight local training.
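
To give a sense of what one of those examples looks like, here's a hypothetical record; the fields and wording are purely illustrative, not my exact schema:

example = {
    "messages": [
        {"role": "user",
         "content": "Here's my recording of 'comfortable'. How did I do?"},
        {"role": "assistant",
         "content": "You pronounced all four syllables. Native speakers usually "
                    "compress it to three: 'COMF-tuh-bul'. Try dropping the second syllable."},
    ]
}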

What does training a model actually mean?

Quick clarification: when people say "training," they might mean two very different things.

Training from scratch is what Google and OpenAI do. You start with random numbers and teach a neural network everything from zero. This takes billions of text samples, hundreds of GPUs, and months of compute time. It's expensive and slow.

Fine-tuning is taking an existing trained model and teaching it your specific task. The hard work is already done. The model knows language and reasoning. You're just showing it your specific format and domain. Like hiring someone who already knows how to code and teaching them your codebase, versus teaching someone to code from scratch.

For my use case, I want fine-tuning. The model already understands English and pronunciation. I just need it to give feedback in my format.

How to fine-tune with Hugging Face

Hugging Face makes this whole process way easier than it should be. They host thousands of pre-trained models and give you Python libraries to fine-tune them. You don't need a PhD to do this.

Here's the basic workflow:

from transformers import AutoProcessor, AutoModelForImageTextToText

# Download the model and processor
processor = AutoProcessor.from_pretrained("google/gemma-3n-E2B-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3n-E2B-it")

Two lines of code. That downloads a 5.4 billion parameter AI model to your laptop. The processor converts your data into the right format, and model is the neural network itself.
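
To make the processor's role concrete, here's a rough sketch of turning one text-plus-audio message into model inputs, following the chat-template pattern from the Transformers docs (the audio path is a placeholder):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "samples/comfortable.wav"},  # placeholder clip
            {"type": "text", "text": "How was my pronunciation of 'comfortable'?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)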

Then you use the Trainer class to actually run the training:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/gemma3n",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    learning_rate=3e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_dataset,
)

trainer.train() # Start training!

This is great because you skip all the low-level stuff like writing gradient descent loops and managing checkpoints. The downside? You don't really know what's happening under the hood, especially how much memory everything uses. Which is exactly the problem I ran into.
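
If you want at least some visibility into memory use while training, a small callback can log MPS memory after each step. A rough sketch, assuming you're running on Apple's MPS backend:

import torch
from transformers import TrainerCallback

class MPSMemoryLogger(TrainerCallback):
    # Logs Metal (MPS) memory usage after every optimizer step
    def on_step_end(self, args, state, control, **kwargs):
        if torch.backends.mps.is_available():
            allocated_gb = torch.mps.current_allocated_memory() / 1e9
            driver_gb = torch.mps.driver_allocated_memory() / 1e9
            print(f"step {state.global_step}: {allocated_gb:.1f} GB allocated, "
                  f"{driver_gb:.1f} GB held by the Metal driver")

# trainer = Trainer(..., callbacks=[MPSMemoryLogger()])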

The memory problem

Training a large language model requires fitting several things in memory simultaneously:

  1. Model weights: For Gemma 3n E2B (5.4B parameters), that's ~11GB in float16
  2. Gradients: Another ~11GB (same size as weights)
  3. Optimizer states: Adam keeps 2 states per parameter, adding ~22GB
  4. Activations: Intermediate computation results
  5. Batch data: The training examples being processed

Total: 50-60GB for the smaller E2B model, 70-80GB for E4B. My 48GB of RAM wasn't going to cut it.
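
The back-of-the-envelope math behind the first three items, assuming weights, gradients, and both Adam states are all stored in float16, looks like this:

params = 5.4e9                # Gemma 3n E2B parameter count
bytes_per_value = 2           # float16 = 2 bytes

weights_gb = params * bytes_per_value / 1e9            # ~10.8 GB
gradients_gb = params * bytes_per_value / 1e9          # ~10.8 GB, one gradient per weight
adam_states_gb = 2 * params * bytes_per_value / 1e9    # ~21.6 GB, momentum + variance

print(f"{weights_gb + gradients_gb + adam_states_gb:.0f} GB before activations and batch data")
# ~43 GB, and activations plus batch data push the total to 50-60GB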

What I tried

Attempt 1: Full fine-tuning with E4B
Out of memory during optimizer initialization. The model loaded (16GB), but the optimizer couldn't allocate its state.

Attempt 2: Smaller batch sizes
Reducing batch_size from 2 to 1 just delayed the OOM error to the backward pass.

Attempt 3: LoRA with E4B
LoRA (Low-Rank Adaptation) only trains small adapter layers instead of the full model, reducing trainable parameters from 7.8B to 40M (0.5%). But the frozen base model still needs to fit in memory: 59GB for E4B alone.
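
For reference, attaching the adapters with the peft library looks roughly like this; the rank, alpha, and target modules are illustrative choices rather than the exact configuration I used:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which projections get adapters (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs total parameters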

Attempt 4: LoRA with E2B
This almost worked. The smaller E2B model with LoRA used 40-45GB, leaving 3-8GB of headroom. Training started and ran for a few hours before hitting memory errors during specific batches with longer sequences.

Why it ultimately failed

Even with LoRA and batch_size=1 (and no gradient checkpointing, which might have bought some headroom at the cost of speed), 48GB wasn't enough for consistent training. The memory usage varied by example:

  • Short examples: 38-42GB (fine)
  • Long examples: 46-50GB (crashes)

I could probably make it work by filtering out longer examples or reducing max_length to 1024 tokens, but at that point I'm compromising the model's capabilities to fit hardware constraints.
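
If you did want to go that route, filtering by token length is straightforward with the datasets library. This sketch assumes each record has already been rendered into a "text" field, which may not match your preprocessing:

MAX_TOKENS = 1024

def short_enough(example):
    # Count tokens in the rendered prompt; drop anything over the cap
    token_count = len(processor.tokenizer(example["text"])["input_ids"])
    return token_count <= MAX_TOKENS

my_dataset = my_dataset.filter(short_enough)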

Moving to Vertex AI

After three days of failed attempts, I'm switching to Google Cloud's Vertex AI for training:

  • A100 40GB GPU: Plenty of memory for the E4B model with LoRA
  • Managed infrastructure: No memory debugging, no crashes
  • Cost: ~$3-5 for a complete training run (3 epochs)
  • Time: 2-3 hours vs my attempted 6-9 hours locally
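
As a preview, submitting a custom training job through the Vertex AI Python SDK looks roughly like this; the project, bucket, and container image are placeholders:

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                       # placeholder
    location="us-central1",
    staging_bucket="gs://my-training-bucket",   # placeholder
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="gemma3n-pronunciation-lora",
    container_uri="us-docker.pkg.dev/my-project/training/gemma3n:latest",  # placeholder image
)

job.run(
    machine_type="a2-highgpu-1g",          # 1x A100 40GB
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    replica_count=1,
)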

I'll write a separate blog post about the Vertex AI training process, including:

  • Setting up custom training jobs with Gemma 3n
  • Uploading datasets to Cloud Storage
  • Monitoring training with TensorBoard
  • Deploying the trained model to Vertex AI endpoints

When can you train locally?

Local training on Apple Silicon is viable if:

  • Your model is smaller: 1-3B parameter models with LoRA fit comfortably in 48GB
  • You have more RAM: The M4 Max with 128GB would handle Gemma 3n E2B easily
  • You use quantization: 4-bit or 8-bit quantization reduces memory further but adds complexity

Key lessons

  • Memory is the bottleneck: 48GB of unified memory sounds like a lot until you try training 5B+ parameter models. The optimizer states alone can double or triple your memory requirements
  • LoRA helps but isn't magic: It reduces trainable parameters by 99%, but the frozen base model weights still need to fit in memory during forward and backward passes
  • Cloud training is cost-effective: $3-5 for a training run is reasonable compared to days of debugging OOM errors and failed experiments
  • Know your hardware limits: Apple Silicon's unified memory is excellent for inference and smaller model training (1-3B parameters), but training 5B+ models reliably requires dedicated GPU VRAM
