Lab: Fine-Tuning

Hugging Face Fine-Tuning (Trainer API)

Adapt a pretrained model to a specific task with a small, focused dataset.

Objective

Use the Hugging Face Trainer API to fine-tune a model for Yelp review classification. To keep iteration fast, this lab first validates the whole pipeline on a small subset and only then scales up.

Prerequisites

Python installed and working in VS Code.
A Hugging Face account and access token.
Basic comfort running Python scripts.

Tip: Fine-tuning needs compute. Start with a small subset to confirm everything works before scaling up.

Step 1: Install Dependencies

pip install transformers datasets evaluate accelerate scikit-learn

Step 2: Log in to Hugging Face

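Authenticate to access gated models and enable model sharing. login() prompts for your access token; use a token with write permission, because later steps push the model to the Hub.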

from huggingface_hub import login

login()

Step 3: Load and Tokenize the Dataset

We'll use the Yelp Reviews dataset and tokenize it with the tokenizer that matches the BERT checkpoint.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)
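To make the two tokenizer options above concrete, here is a minimal pure-Python sketch of what padding to a fixed length and truncation do to a list of token IDs. The helper pad_or_truncate, the toy IDs, and the pad ID of 0 are assumptions for illustration only; the real tokenizer also handles special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop anything beyond max_length.
    ids = list(token_ids[:max_length])
    # The attention mask marks real tokens with 1 and padding with 0.
    mask = [1] * len(ids)
    # Padding: fill the remainder with pad_id.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence gets padded, a long one gets cut (toy IDs, hypothetical).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every review ends up the same length, a short review costs as much compute as a long one. That is one reason to validate on a small subset before training on everything.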

Step 4: Use a Small Subset First

Start small to validate your pipeline quickly.

small_train = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval = dataset["test"].shuffle(seed=42).select(range(1000))
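A random subset of 1,000 reviews will not be perfectly balanced across the five classes, so it can be worth checking the label distribution before training. The sketch below uses a hypothetical helper, label_counts, and made-up labels; in the lab you would most likely pass it small_train["label"] instead.

```python
from collections import Counter

def label_counts(labels, num_classes=5):
    """Count how many examples fall into each class index 0..num_classes-1."""
    counts = Counter(labels)
    return [counts.get(i, 0) for i in range(num_classes)]

# Toy labels standing in for small_train["label"] (illustrative values only).
toy_labels = [0, 4, 4, 2, 1, 4, 0, 3]
print(label_counts(toy_labels))
```

If one class is nearly absent from the subset, accuracy on that class will say little about the model, which is a reason to increase the subset size.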

Step 5: Load Your Model

Load a pretrained model and specify the number of labels for your classification task. The Yelp Review dataset has 5 rating classes (1-5 stars).

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased",
    num_labels=5
)

Important: The checkpoint has no pretrained head for this task, so the head used during pretraining is discarded and a new, randomly initialized classification head is added. You'll see a warning like: "Some weights of BertForSequenceClassification were not initialized from the model checkpoint". This is expected: you're about to train this new head, together with the rest of the model, on your specific task.
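The dataset stores each rating as an integer from 0 to 4, so class index 0 means one star and index 4 means five stars. A small sketch of the mapping follows; the label strings are my own choice, not names taken from the dataset. from_pretrained accepts id2label and label2id keyword arguments if you want the saved model to carry readable labels, though the lab works without them.

```python
NUM_LABELS = 5

# Class index i corresponds to a rating of i + 1 stars.
id2label = {
    i: f"{i + 1} star" + ("" if i == 0 else "s")
    for i in range(NUM_LABELS)
}
# Reverse mapping from readable name back to class index.
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)
```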

Step 6: Define Metrics Computation

The Trainer is an optimized training loop that abstracts away boilerplate code. To measure model performance, define how accuracy is computed from predictions:

import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predicted classes
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Learning Moment: The compute_metrics function receives raw logits (unnormalized scores) from the model. np.argmax() selects the class with the highest score, and then we compare to actual labels using the Evaluate library's accuracy function.
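As a worked example of the argmax step, the sketch below runs the same logic on three made-up rows of logits. It computes accuracy by hand with NumPy rather than through the Evaluate library, so it runs without downloading anything; the numbers are illustrative, not model output.

```python
import numpy as np

# Toy logits for 3 reviews over 5 classes (made-up numbers).
logits = np.array([
    [0.1, 0.2, 2.5, 0.3, 0.1],   # highest score at index 2
    [1.9, 0.4, 0.2, 0.1, 0.0],   # highest score at index 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # highest score at index 4
])
labels = np.array([2, 1, 4])

# Pick the highest-scoring class per row, then compare with the true labels.
predictions = np.argmax(logits, axis=-1)
accuracy = float((predictions == labels).mean())

print(predictions.tolist())
print({"accuracy": accuracy})
```

Two of the three predictions match their labels here, so the accuracy is two thirds. The second row shows a miss: the model's top class is 0, but the true label is 1.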

Step 7: Configure Training Hyperparameters

Hyperparameters are settings like learning rate, batch size, and number of epochs that control the training process. The Trainer uses TrainingArguments to configure these:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="yelp_review_classifier",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True
)

Key hyperparameters explained:

output_dir: Where to save checkpoints and final model.
eval_strategy="epoch": Evaluate accuracy at the end of each training epoch.
learning_rate=2e-5: Controls how much the weights change per training step. Lower rates are safer for fine-tuning.
per_device_train_batch_size=8: Process 8 examples at a time during training.
num_train_epochs=2: Run through the dataset twice.
weight_decay=0.01: Regularization to prevent overfitting.
push_to_hub=True: Upload the model to the Hugging Face Hub each time it is saved (here, at the end of each epoch). Requires that you are logged in.
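Batch size and epoch count together determine how many optimizer steps the run takes. The arithmetic can be sketched as below, assuming a single device and no gradient accumulation; training_steps is a hypothetical helper, and with gradient accumulation the Trainer's own rounding may differ slightly from this estimate.

```python
import math

def training_steps(num_examples, batch_size, epochs, num_devices=1, grad_accum=1):
    """Return (steps_per_epoch, total_steps) for a simple training run."""
    effective_batch = batch_size * num_devices * grad_accum
    # The last, partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * epochs

# The settings used in this step: 1,000 examples, batch size 8, 2 epochs.
steps_per_epoch, total_steps = training_steps(1000, 8, 2)
print(steps_per_epoch, total_steps)
```

For this lab's settings that is 125 steps per epoch and 250 steps in total, which is a useful number to have in mind when reading the progress bar.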

Step 8: Create and Run the Trainer

Instantiate the Trainer with the model, training arguments, datasets, and metrics function. Then call train() to start:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()

Tip: With 1,000 training examples and batch size 8, this should complete in roughly 5-10 minutes on a GPU. On a CPU-only machine, expect it to take considerably longer.

Step 9: Push Your Model to the Hub

Share your fine-tuned model on Hugging Face Hub for others to use and discover:

trainer.push_to_hub()

This uploads:

Your fine-tuned model weights
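The tokenizer files, if the tokenizer was passed to the Trainer (the Trainer call in Step 8 does not pass it)
Training configuration (all hyperparameters)
An auto-generated model card (README.md) summarizing the training setup and results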

Step 10: Evaluate and Iterate

Review metrics: Check accuracy and loss at each epoch printed to console.
Scale up: If results look good, increase training samples from 1,000 to 5,000 or more.
Experiment: Try different base models (google-bert/bert-base-uncased, distilbert/distilbert-base-uncased) or hyperparameters. When you change the base model, re-tokenize the dataset with that model's tokenizer.
Monitor overfitting: Compare training loss vs. eval loss. If eval loss plateaus or rises while training loss continues dropping, consider adding regularization or reducing epochs.
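The overfitting check can be sketched as a small function over per-epoch losses. Note that this version looks only for the stricter signal of eval loss rising while training loss falls, not for a plateau. The helper first_overfit_epoch, the dictionary layout, and the loss values are all assumptions for illustration; the Trainer keeps its logged values in trainer.state.log_history, whose entries are laid out differently and would need reshaping first.

```python
def first_overfit_epoch(history):
    """Return the first epoch where eval loss rose while train loss fell, else None.

    `history` is a list of dicts with "epoch", "train_loss" and "eval_loss" keys.
    """
    for prev, curr in zip(history, history[1:]):
        train_fell = curr["train_loss"] < prev["train_loss"]
        eval_rose = curr["eval_loss"] > prev["eval_loss"]
        if train_fell and eval_rose:
            return curr["epoch"]
    return None

# Made-up numbers for illustration, not results from this lab.
toy_history = [
    {"epoch": 1, "train_loss": 1.40, "eval_loss": 1.20},
    {"epoch": 2, "train_loss": 1.05, "eval_loss": 1.10},
    {"epoch": 3, "train_loss": 0.80, "eval_loss": 1.18},
]
print(first_overfit_epoch(toy_history))
```

In the toy data, both losses fall from epoch 1 to 2, but at epoch 3 the eval loss climbs while the training loss keeps falling, so epoch 3 is flagged.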

Advanced: Full-Scale Training Example

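Once your small-subset pipeline works, scale up with the full dataset. If you want to start from the original checkpoint rather than continue from the small-subset run, reload the pretrained model first by rerunning Step 5: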

# Load the full dataset (if you want to train on all data)
full_train = dataset["train"]
full_eval = dataset["test"]

# Update training arguments for longer training
training_args = TrainingArguments(
    output_dir="yelp_review_classifier_full",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # Larger batch if GPU available
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
    logging_steps=100  # Log every 100 steps
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_train,
    eval_dataset=full_eval,
    compute_metrics=compute_metrics
)

trainer.train()

Production Note: The full Yelp training split has 650,000 examples (plus 50,000 for testing), so three epochs can take many hours even on a single GPU, depending on the hardware. Consider using Hugging Face's hosted training infrastructure or Google Colab for larger datasets.

Installation & Setup Details

Virtual Environment Setup (Recommended)

Use uv, a fast Rust-based Python package manager, or pip:

# Using uv (recommended for speed)
uv venv .env
source .env/bin/activate

# Install Transformers and dependencies
uv pip install transformers datasets evaluate accelerate scikit-learn

Or with standard pip:

python -m venv .env
source .env/bin/activate  # On Windows: .env\Scripts\activate
pip install transformers datasets evaluate accelerate scikit-learn

Verify Installation

Test that everything is working:

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('hugging face is the best'))"

Expected output (the exact score may differ slightly):

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]

GPU Acceleration (Optional)

For faster training, enable CUDA with PyTorch:

# Check if NVIDIA GPU is available
nvidia-smi

# Install GPU-optimized PyTorch (CUDA 11.8 shown; pick the index URL that matches your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu118

Offline Mode (Advanced)

If you need to run without internet access:

from huggingface_hub import snapshot_download

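# Pre-download the model while you still have internet access
snapshot_download(repo_id="google-bert/bert-base-cased", repo_type="model")

# In the offline training script, set this at the very top,
# before importing transformers or huggingface_hub (the value is read at import time)
import os
os.environ["HF_HUB_OFFLINE"] = "1"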

Caching & Model Loading

Transformers automatically caches downloaded models. Control the cache location with environment variables:

# Check cache location
import os
cache_dir = os.environ.get("HF_HUB_CACHE", "~/.cache/huggingface/hub")
print(f"Cache location: {cache_dir}")

# Or set a custom location (do this before importing transformers, or it will be ignored)
os.environ["HF_HUB_CACHE"] = "/custom/cache/path"

Default cache locations:

Linux/Mac: ~/.cache/huggingface/hub
Windows: C:\Users\YourUsername\.cache\huggingface\hub
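The lookup order behind these defaults can be sketched in a few lines. This is a simplified model, not the library's actual implementation: it assumes an explicit HF_HUB_CACHE takes priority, then the hub folder under HF_HOME, then the default under the home directory, and it ignores other variables the library may also consult. resolve_hub_cache is a hypothetical helper.

```python
import os
from pathlib import Path

def resolve_hub_cache(env=None):
    """Return the Hub cache directory implied by the given environment mapping."""
    env = os.environ if env is None else env
    # An explicit HF_HUB_CACHE wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(os.path.expanduser(env["HF_HUB_CACHE"]))
    # Otherwise the cache lives in the "hub" folder under HF_HOME.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(os.path.expanduser(hf_home)) / "hub"

# Resolve against the real environment, then against a hypothetical one.
print(resolve_hub_cache())
print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
```

Passing a plain dictionary instead of os.environ makes it easy to check what a given combination of variables would resolve to without changing your shell.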

Resources & Further Learning

Transformers Documentation: Official API reference
Task Recipes: Classification, summarization, QA, and more
Training Tips & Tricks: Advanced Trainer features and debugging
Model Hub: Browse 500K+ pretrained models
Trainer API Reference: Full parameter documentation