Migrating from AWS SageMaker to GCP Vertex AI: A Training Environment Transition


Previously, I used AWS SageMaker Studio for model training in my work. However, when I received a generous $10,000 credit from Google Cloud for Startups, I decided to transition our training environment to Vertex AI Workbench.

This article explores the usability differences between SageMaker and Vertex AI and documents our migration process.

Building the Model Training Environment

Creating the Dockerfile

In SageMaker, application code was not included in the container image. Instead, we used dependencies to load external code and entry_point to specify a shell script that switched Conda environments and executed the code.

SageMaker Training Script:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri="*****.dkr.ecr.ap-northeast-1.amazonaws.com/bert-training:latest",
    role=role,
    instance_type="ml.g4dn.2xlarge",
    instance_count=1,
    base_job_name="pre-training",
    output_path="s3://sagemaker/output_data/pre_training",
    sagemaker_session=session,
    entry_point="pre-training.sh",
    dependencies=["bert-training"],
    checkpoint_s3_uri="s3://sagemaker/checkpoints/summary",
    checkpoint_local_path="/opt/ml/checkpoints/",
    use_spot_instances=True,
    max_wait=120 * 60 * 60,
    max_run=120 * 60 * 60,
    hyperparameters={
        "wandb_api_key": "*******",
        "mlm": True,
        "do_train": True,
        "field_hs": 64,
        "output_dir": "/opt/ml/checkpoints/",
        "data_root": "/opt/ml/input/data/input_data/",
        "data_fname": "pre_training_data",
        "num_train_epochs": 3,
        "save_steps": 100,
        "per_device_train_batch_size": 8,
    },
    tags=[{"Key": "Project", "Value": "AIResearch"}],
)

estimator.fit({"input_data": "s3://sagemaker/input_data/pre_training_data.csv"})

Unlike SageMaker, Vertex AI does not offer an entry_point for specifying commands, so we included the application code directly in the container image and installed the necessary packages without using a Conda environment.

Dockerfile for Vertex AI:

# Dockerfile for model training on Vertex AI
FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-12

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
    pandas==1.4.3 \
    scikit-learn==1.1.1 \
    transformers==4.26.0 \
    numpy==1.23.1 \
    imbalanced-learn==0.10.1 \
    wandb \
    python-dotenv \
    google-cloud-storage

COPY . /app

ENTRYPOINT ["python", "main.py"]

Deploying the Container Image

We used the following commands to deploy the Docker image to Google Cloud’s Artifact Registry:

docker buildx build --platform linux/amd64 -f Dockerfile.vertex -t asia-northeast1-docker.pkg.dev/ai/bert-training/pre-training:latest .
gcloud auth configure-docker asia-northeast1-docker.pkg.dev
docker push asia-northeast1-docker.pkg.dev/ai/bert-training/pre-training:latest
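The push above assumes the target repository already exists in Artifact Registry. If it does not, it can be created first; a sketch, with the repository name and region taken from the image path above:

```shell
# One-time setup: create the Docker repository in Artifact Registry
# before the first push.
gcloud artifacts repositories create bert-training \
  --repository-format=docker \
  --location=asia-northeast1
```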

Migrating Input Data

We transferred input data from S3 to Cloud Storage using the following steps:

1. Open ‘Create a Transfer Job’ in Cloud Storage.
2. Select ‘Amazon S3’ as the source and ‘Google Cloud Storage’ as the destination.
3. Create an IAM user with AmazonS3ReadOnlyAccess and enter the provided credentials.
4. Specify the destination bucket and start the transfer job.
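The same transfer can be scripted instead of clicked through; a hedged sketch using the Storage Transfer Service CLI, where the credentials file path and the exact bucket prefixes are placeholders rather than our actual values:

```shell
# Create a one-time Storage Transfer Service job from S3 to GCS.
# aws-creds.json is a placeholder file holding the IAM user's
# accessKeyId / secretAccessKey pair.
gcloud transfer jobs create \
  s3://sagemaker/input_data gs://bert-training/vertex/input_data \
  --source-creds-file=aws-creds.json
```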

Writing the Training Script

In Vertex AI, the training script can be written as follows:

from google.cloud import aiplatform

def create_custom_job(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    args: list = None,
    bucket_name: str = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=bucket_name)

    custom_job = {
        "display_name": display_name,
        "worker_pool_specs": [{
            "machine_spec": {
                "machine_type": "n1-highmem-32",
                "accelerator_type": "NVIDIA_TESLA_V100",
                "accelerator_count": 4,
            },
            "replica_count": 1,
            "container_spec": {
                "image_uri": container_image_uri,
                "args": args,
            },
        }]
    }

    job = aiplatform.CustomJob(**custom_job)
    job.run(sync=True)

project_id = "ai"
display_name = "pre-training"
container_image_uri = "asia-northeast1-docker.pkg.dev/ai/bert-training/pre-training:latest"
bucket_name = "gs://bert-training"
location = "us-central1"
args = [
    "--mlm",
    "--do_train",
    "--field_hs", "64",
    "--data_fname", "pre_training_data",
    "--num_train_epochs", "1",
    "--save_steps", "100",
    "--per_device_train_batch_size", "8",
    "--gcs_bucket_name", "bert-training",
    "--gcs_blob_name", "vertex/input_data/pre_training_data.csv",
    "--local_data_path", "./data/action_history/pre_training_data.csv",
]

create_custom_job(
    project=project_id,
    display_name=display_name,
    container_image_uri=container_image_uri,
    bucket_name=bucket_name,
    location=location,
    args=args,
)
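The args list above mirrors the hyperparameters dict from the SageMaker script: boolean hyperparameters become bare flags, and everything else becomes a flag/value pair. A hypothetical helper (not part of our codebase) for mechanically converting one form to the other:

```python
def hyperparameters_to_args(hyperparameters: dict) -> list:
    # Flatten a SageMaker-style hyperparameter dict into the flag list
    # Vertex AI's container_spec expects. True booleans become bare
    # flags; False booleans are dropped; other values become
    # "--key value" pairs with the value stringified.
    args = []
    for key, value in hyperparameters.items():
        if isinstance(value, bool):
            if value:
                args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args

print(hyperparameters_to_args({"mlm": True, "do_train": False, "field_hs": 64}))
# → ['--mlm', '--field_hs', '64']
```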

Vertex AI does not automatically place input data into a container path as SageMaker does with S3 paths. Therefore, the application must explicitly handle the download and upload of training artifacts.

from google.cloud import storage
import os

def download_csv_from_gcs(bucket_name, source_blob_name, destination_file_path):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_path)
    print(f"CSV file {source_blob_name} downloaded to {destination_file_path}.")

def upload_directory_to_gcs(bucket_name, source_directory, destination_blob_prefix):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    for root, _, files in os.walk(source_directory):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, source_directory)
            blob_path = os.path.join(destination_blob_prefix, relative_path)

            blob = bucket.blob(blob_path)
            blob.upload_from_filename(local_path)
            print(f"Uploaded {local_path} to {blob_path}.")

These functions are integrated into the application as follows:

def main(args):
    output_dir = args.output_dir  # Directory where training outputs are saved
    bucket_name = args.gcs_bucket_name  # Cloud Storage bucket name
    destination_blob_prefix = "vertex/output_data/pre_training"
    upload_directory_to_gcs(bucket_name, output_dir, destination_blob_prefix)

if __name__ == "__main__":
    parser = define_main_parser()
    opts = parser.parse_args()
    download_csv_from_gcs(opts.gcs_bucket_name, opts.gcs_blob_name, opts.local_data_path)
    main(opts)
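define_main_parser comes from our application code and is not shown here; a minimal sketch of what it has to provide to accept the Vertex AI args list above, with argument names taken from that list and all defaults being assumptions:

```python
import argparse

def define_main_parser():
    # Hypothetical reconstruction: covers only the flags that appear in
    # the Vertex AI args list; the real parser defines more options.
    parser = argparse.ArgumentParser(description="BERT pre-training entry point")
    parser.add_argument("--mlm", action="store_true")
    parser.add_argument("--do_train", action="store_true")
    parser.add_argument("--field_hs", type=int, default=64)
    parser.add_argument("--data_fname", default="pre_training_data")
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--save_steps", type=int, default=100)
    parser.add_argument("--per_device_train_batch_size", type=int, default=8)
    parser.add_argument("--output_dir", default="./output")
    parser.add_argument("--gcs_bucket_name")
    parser.add_argument("--gcs_blob_name")
    parser.add_argument("--local_data_path")
    return parser

opts = define_main_parser().parse_args(["--mlm", "--field_hs", "64"])
print(opts.mlm, opts.field_hs)
# → True 64
```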

Additional Notes

Regional Constraints

When we attempted to use the NVIDIA Tesla V100 GPU in the asia-northeast1 (Tokyo) region, we encountered errors. Further investigation revealed significant restrictions on the GPUs available in this region, prompting us to switch to the us-central1 (Iowa) region. It was a lesson in the importance of considering regional resource differences before migration.

https://cloud.google.com/vertex-ai/docs/quotas
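The accelerators actually offered in a region can be checked before committing to it; the gcloud command below is real, and the zone filter is just our example:

```shell
# List accelerator types available in Tokyo zones; V100s may be absent,
# which is what forced our move to us-central1.
gcloud compute accelerator-types list \
  --filter="zone:asia-northeast1"
```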

Spot Instances Unavailable

Unlike SageMaker, where we frequently utilized spot instances to save costs, Vertex AI does not support them. This was not a major issue due to the ample GCP credits we had, but it’s worth noting for those planning budgets. In fact, the guaranteed resource allocation in Vertex AI provided unexpected benefits over the potential for training interruptions with SageMaker’s spot instances.
