Adapt a pretrained model to a specific task with a small, focused dataset.
Use the Hugging Face Trainer API to fine-tune a model for Yelp review classification. This lab prioritizes speed and correctness by using a smaller subset first.
pip install transformers datasets evaluate accelerate
Authenticate to access gated models and enable model sharing.
from huggingface_hub import login
login()
We’ll use the Yelp Reviews dataset and tokenize it with a BERT model.
from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
def tokenize(examples):
    # Pad every review to the model's maximum length and truncate longer ones
    return tokenizer(examples["text"], padding="max_length", truncation=True)
dataset = dataset.map(tokenize, batched=True)
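To make the effect of padding="max_length" and truncation=True concrete, here is a minimal, dependency-free sketch. The helper pad_or_truncate, the toy max_length of 6, and the token IDs are assumptions made for illustration and are not the tokenizer's actual implementation. The real tokenizer pads to the model's maximum length (512 tokens for this checkpoint).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past the limit
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)      # how many pad tokens are still needed
    return ids + [pad_id] * padding, mask + [0] * padding


# A short sequence gets padded, a long one gets cut (made-up token IDs).
short_ids, short_mask = pad_or_truncate([101, 1188, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up the same length, which is what lets the Trainer stack them into fixed-size batches.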
Start small to validate your pipeline quickly.
small_train = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval = dataset["test"].shuffle(seed=42).select(range(1000))
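The reason for passing seed=42 is reproducibility: the same seed yields the same subset on every run. The sketch below illustrates the idea with Python's standard library only. The function seeded_subset and the toy row count are hypothetical, and the Datasets library uses its own permutation logic, so the actual indices it picks will differ.

```python
import random


def seeded_subset(num_rows, subset_size, seed):
    """Return subset_size row indices chosen by a seeded shuffle."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same order
    return indices[:subset_size]           # analogous to .select(range(n))


first = seeded_subset(num_rows=10_000, subset_size=1000, seed=42)
second = seeded_subset(num_rows=10_000, subset_size=1000, seed=42)
print(first == second, len(set(first)))
```

Because both calls use the same seed, the two subsets are identical and contain no duplicate rows.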
Load a pretrained model and specify the number of labels for your classification task. The Yelp Review dataset has 5 rating classes (1-5 stars).
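It can help to make the mapping between label IDs and star ratings explicit. The dataset stores labels as integers 0 to 4, so label 0 is a 1-star review and label 4 is a 5-star review. The sketch below builds id2label and label2id dictionaries. The label strings are an assumption chosen for readability.

```python
NUM_LABELS = 5

# Label id 0 is a 1-star review, label id 4 is a 5-star review.
id2label = {i: f"{i + 1} star{'s' if i else ''}" for i in range(NUM_LABELS)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
# These dictionaries can optionally be passed next to num_labels, for example:
# AutoModelForSequenceClassification.from_pretrained(
#     "google-bert/bert-base-cased",
#     num_labels=NUM_LABELS, id2label=id2label, label2id=label2id,
# )
```

Passing the dictionaries is optional. They only affect how predictions are displayed, not how the model trains.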
from transformers import AutoModelForSequenceClassification
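model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased",
    num_labels=5,
)
When the model loads, you will see a warning such as "Some weights of BertForSequenceClassification were not initialized from the model checkpoint". This is expected: the pretrained checkpoint has no classification head, so a new, randomly initialized one is added, and you're about to fine-tune it on your specific task.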
The Trainer is an optimized training loop that abstracts away boilerplate code. To measure model performance, define how accuracy is computed from predictions:
import numpy as np
import evaluate
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predicted classes
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
The compute_metrics function receives raw logits (unnormalized scores) from the model. np.argmax() selects the class with the highest score for each example, and the Evaluate library's accuracy metric then compares those predictions to the actual labels.
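To see what this computes, here is a small worked example in plain Python that mirrors np.argmax(logits, axis=-1) followed by an accuracy calculation. The helpers argmax and accuracy and the toy numbers are illustrative assumptions, not part of the Evaluate library.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Return the predicted classes and the fraction that match the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)


# Three toy examples with five class scores each (made-up numbers).
toy_logits = [
    [0.1, 0.2, 2.5, 0.3, -1.0],   # highest score at index 2
    [1.9, 0.4, 0.1, 0.0, -0.5],   # highest score at index 0
    [-0.2, 0.0, 0.3, 0.9, 1.1],   # highest score at index 4
]
toy_labels = [2, 1, 4]

predictions, acc = accuracy(toy_logits, toy_labels)
print(predictions, acc)
```

Two of the three predictions match their labels, so the accuracy is 2/3.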
Hyperparameters are settings like learning rate, batch size, and number of epochs that control the training process. The Trainer uses TrainingArguments to configure these:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="yelp_review_classifier",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True,
)
Key hyperparameters explained:
output_dir: Where to save checkpoints and the final model.
eval_strategy="epoch": Evaluate accuracy at the end of each training epoch.
learning_rate=2e-5: Controls how much the weights change per training step. Lower rates are safer for fine-tuning.
per_device_train_batch_size=8: Process 8 examples at a time on each device during training.
num_train_epochs=2: Run through the training set twice.
weight_decay=0.01: Regularization to reduce overfitting.
push_to_hub=True: Upload the model to the Hugging Face Hub each time a checkpoint is saved.
Instantiate the Trainer with the model, training arguments, datasets, and metrics function. Then call train() to start:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics,
)
# Start training
trainer.train()
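As a sanity check on the progress bar, you can estimate how many optimizer steps this run should take. The sketch below assumes a single device and no gradient accumulation. The function optimizer_steps is a hypothetical helper, and the Trainer's exact count can differ under other settings.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                    num_devices=1, gradient_accumulation_steps=1):
    """Estimate how many optimizer updates a training run performs."""
    effective_batch = (
        per_device_batch_size * num_devices * gradient_accumulation_steps
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# 1,000 examples, batch size 8, 2 epochs, as configured above.
steps = optimizer_steps(num_examples=1000, per_device_batch_size=8, num_epochs=2)
print(steps)  # 125 steps per epoch * 2 epochs = 250
```

If the progress bar shows a very different total, check the subset size and batch size before waiting for the run to finish.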
Share your fine-tuned model on Hugging Face Hub for others to use and discover:
trainer.push_to_hub()
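This uploads the fine-tuned model weights and configuration to your account on the Hub.
To experiment further, try other models (for example bert-base-uncased or distilbert/distilbert-base-uncased) or hyperparameters.
Once your small-subset pipeline works, scale up with the full dataset: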
# Load the full dataset (if you want to train on all data)
full_train = dataset["train"]
full_eval = dataset["test"]
# Update training arguments for longer training
training_args = TrainingArguments(
    output_dir="yelp_review_classifier_full",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # Larger batch if GPU available
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
    logging_steps=100,  # Log every 100 steps
)
# Note: this reuses the model already fine-tuned on the small subset.
# Reload it with from_pretrained() first to start from the pretrained weights.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_train,
    eval_dataset=full_eval,
    compute_metrics=compute_metrics,
)
trainer.train()
To set up the environment, create a virtual environment and install the libraries with uv, a fast Rust-based Python package manager, or with pip:
# Using uv (recommended for speed)
uv venv .env
source .env/bin/activate
# Install Transformers and dependencies
uv pip install transformers datasets evaluate accelerate
Or with standard pip:
python -m venv .env
source .env/bin/activate # On Windows: .env\Scripts\activate
pip install transformers datasets evaluate accelerate
Test that everything is working:
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('hugging face is the best'))"
Expected output:
[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
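The pipeline returns a list with one dictionary per input string, each holding a label and a score. The snippet below shows one way to read that structure. It uses a literal copy of the sample output rather than calling the pipeline, so it runs without downloading a model.

```python
# Literal copy of the sample output shown above.
result = [{'label': 'POSITIVE', 'score': 0.9998704791069031}]

top = result[0]                          # one dict per input string
label, score = top["label"], top["score"]
print(f"{label} ({score:.2%})")
```

If your own run returns a list shaped like this, the installation is working.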
For faster training, enable CUDA with PyTorch:
# Check if NVIDIA GPU is available
nvidia-smi
# Install GPU-optimized PyTorch (CUDA 11.8)
pip install torch --index-url https://download.pytorch.org/whl/cu118
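After installing, you can confirm from Python whether PyTorch actually sees the GPU. This is a minimal sketch: describe_device is a hypothetical helper name, and it falls back gracefully when PyTorch is not installed.

```python
def describe_device():
    """Return 'cuda', 'cpu', or 'torch-not-installed'."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if torch.cuda.is_available():
        # Report the name of the first visible GPU.
        print("GPU:", torch.cuda.get_device_name(0))
        return "cuda"
    return "cpu"


device = describe_device()
print("Training would run on:", device)
```

A result of "cpu" on a machine with an NVIDIA GPU usually points to a CPU-only PyTorch build or a driver mismatch.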
If you need to run without internet access:
from huggingface_hub import snapshot_download
# Pre-download model
snapshot_download(repo_id="google-bert/bert-base-cased", repo_type="model")
# Then, in the training process, set this before importing
# transformers or huggingface_hub (the flag is read at import time)
import os
os.environ["HF_HUB_OFFLINE"] = "1"
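Because huggingface_hub reads this variable when it is imported, one reliable option is to set it in the environment of the process that runs training. The sketch below launches a child Python process with the flag set. The child code is a stand-in that only echoes the variable. In practice you would launch your own training script instead (for example a hypothetical train.py).

```python
import os
import subprocess
import sys

# Copy the current environment and switch the Hub client to offline mode.
offline_env = {**os.environ, "HF_HUB_OFFLINE": "1"}

# Stand-in for the real training script: it only echoes the variable.
child_code = "import os; print(os.environ.get('HF_HUB_OFFLINE'))"
completed = subprocess.run(
    [sys.executable, "-c", child_code],
    env=offline_env,
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = completed.stdout.strip()
print(seen_by_child)
```

Exporting HF_HUB_OFFLINE=1 in the shell before starting Python achieves the same thing.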
Transformers automatically caches downloaded models. Control the cache location with environment variables:
# Check cache location
import os
cache_dir = os.environ.get("HF_HUB_CACHE", "~/.cache/huggingface/hub")
print(f"Cache location: {cache_dir}")
# Or set a custom location (do this before importing transformers)
os.environ["HF_HUB_CACHE"] = "/custom/cache/path"
Default cache locations:
Linux/macOS: ~/.cache/huggingface/hub
Windows: C:\Users\YourUsername\.cache\huggingface\hub
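The cache location follows a simple precedence: HF_HUB_CACHE wins if set, otherwise the cache lives in a hub folder under HF_HOME, which itself defaults to ~/.cache/huggingface. The sketch below approximates that lookup. The function resolve_hub_cache is a hypothetical helper, and it ignores other variables (such as XDG_CACHE_HOME) that the library may also consult.

```python
import os


def resolve_hub_cache(env=None):
    """Approximate where the Hub cache lives for a given environment."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        # An explicit cache path takes priority.
        return os.path.expanduser(env["HF_HUB_CACHE"])
    # Otherwise fall back to <HF_HOME>/hub, with HF_HOME defaulting to
    # ~/.cache/huggingface.
    hf_home = env.get("HF_HOME") or os.path.join("~", ".cache", "huggingface")
    return os.path.join(os.path.expanduser(hf_home), "hub")


print(resolve_hub_cache({}))
print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
print(resolve_hub_cache({"HF_HUB_CACHE": "/custom/cache/path"}))
```

Calling resolve_hub_cache() with no argument uses the current process environment.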