Table of contents:
Context
My process
Summary of method & results
TLDR:
| Metric | DeepCell + tensorflow-2.8.4 | DeepCell + tensorflow-2.8.4-redux | Delta |
|---|---|---|---|
| Compressed size | 3.2 GB | 4.0 GB | +0.8 GB (+25%) |
| VULNs | 553 | 125 | -428 (-77%) |
| Critical | 1 | 1 | 0 (0%) |
| High | 80 | 29 | -51 (-63%) |
| Medium | 349 | 53 | -296 (-85%) |
| Low | 123 | 42 | -81 (-66%) |
Read on for the how, why, wherefore, and finally.
Context & motivation
Previously we switched from the DeepLearning container to the base TensorFlow container.
Unfortunately, the container has 553 security vulnerabilities according to Google’s scanner.
The 553 issues break down this way:
1 critical [vuln]
80 high
349 medium
123 low
The official 2.8.4 container was published in Nov 2022. That’s 1.5 years of OS updates at least. I looked up the 2.8.4 source and found that it’s using Ubuntu 20.04 as the base OS. Of note, we’re using the x86_64 architecture according to the container image layer: ENV NVARCH=x86_64.
So the obvious thing to do is to switch to the most recent Ubuntu version, 24.04, right? Well, no, that’s a short party: NVIDIA doesn’t have CUDA packages for 24.04 in their repository. So it’s off to 22.04 – still two years more recent and, more importantly, with CUDA packages.
My process, as I did it
I wouldn’t do it this way again, but this is how I did it.
Updating the base Ubuntu image + dependencies.
First, I forked the TensorFlow repository. I did a master-only clone, so I needed to fetch the tag information after cloning. Then I could reset to the 2.8.4 version.
git remote add upstream https://github.com/tensorflow/tensorflow.git
git fetch upstream
# Reset master branch to 2.8.4
git checkout master
git reset --hard v2.8.4
git push --force
# Clean out local copy (everything after 2.8.4)
git gc
Then, I updated the build steps. Here’s what I did, following the instructions in the containers readme.
1. Build the tf-tools build tools container:
docker build -t tf-tools -f tools.Dockerfile .
2. Set up aliases:
alias asm_images="docker run --rm -v $(pwd):/tf -v /var/run/docker.sock:/var/run/docker.sock tf-tools python3 assembler.py "
3. Update build settings. I started with changing the file partials/ubuntu/version.partial.Dockerfile to use Ubuntu 22.04.
4. Regenerate the Dockerfiles.
5. Rebuild the desired TF-2.8 image. I tagged the build with the version 2.8.4-rebuilt, so the build system tags the GPU-accelerated container 2.8.4-rebuilt-gpu.
6. Done, or fix build errors and loop back to step 3.
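Steps 4 and 5 might look roughly like the sketch below. It assumes the commands run from the dockerfiles directory of the TensorFlow checkout, that the tf-tools image from step 1 exists, and that the assembler flags match the ones documented in that directory's README at the 2.8.4 tag (worth double-checking against your checkout). The function names are made up for illustration.

```shell
# Sketch only: function names are illustrative, flags assumed from the dockerfiles README.

# Step 4: regenerate the Dockerfiles from the partials. Runs as the current
# user so the generated files are not owned by root.
regen_dockerfiles() {
  docker run --rm -u "$(id -u):$(id -g)" -v "$(pwd)":/tf tf-tools \
    python3 assembler.py --release dockerfiles --construct_dockerfiles
}

# Step 5: build the image(s) whose tag ends in rebuilt-gpu, using 2.8.4-rebuilt
# as the tag prefix. Needs the Docker socket so the assembler can run builds.
rebuild_gpu_image() {
  docker run --rm -v "$(pwd)":/tf -v /var/run/docker.sock:/var/run/docker.sock tf-tools \
    python3 assembler.py --release versioned --arg _TAG_PREFIX=2.8.4-rebuilt \
    --build_images --only_tags_matching '.*rebuilt-gpu$'
}
```

The tag filter is only there to save time; dropping it should build every image in the versioned release.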
Dependency updates
Following this process, here’s what I fixed at first:
Downgrade requests & urllib libraries (see GitHub bug).
Update base Ubuntu to 22.04.
Update CUDA from 11.2.1 to 11.8.0.
Parameterize CUDA patch level (to support .0 instead of .1).
Update CUDNN from 8.1.0.77-1 to 8.6.0.163-1.
Update libnvinfer from 7.2.2-1 to 8.5.3-1.
I didn’t love the major version update. But things seem fine.
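To illustrate the "parameterize CUDA patch level" item: the idea is to stop hard-coding the patch digit in the base image tag and compose the tag from separately pinned pieces. The snippet below shows that composition in plain shell. The variable name CUDA_PATCH is hypothetical, and the real change lives in the Dockerfile partials as build arguments rather than in a script like this.

```shell
# Compose the NVIDIA base image tag from separately pinned pieces.
UBUNTU_VERSION=22.04
CUDA=11.8        # major.minor
CUDA_PATCH=0     # hypothetical variable; the patch digit was 1 for 11.2.1

BASE_IMAGE="nvidia/cuda:${CUDA}.${CUDA_PATCH}-base-ubuntu${UBUNTU_VERSION}"
echo "$BASE_IMAGE"   # prints nvidia/cuda:11.8.0-base-ubuntu22.04
```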
At this point the container built, and I could run DeepCell. It output a segmentation image that seems plausible.
However, a new error message popped up in the logs…
2024-05-28 19:38:34.423903: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-05-28 19:38:34.424006: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
Is this an error? Is it a problem to rely on the driver? I don’t know, but I wanted to clear out the error.
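One quick way to confirm what the log is complaining about is to check, from a shell inside the container, whether ptxas is reachable through $PATH at all. A minimal sketch (the function name is made up):

```shell
# Report whether ptxas is reachable via $PATH.
check_ptxas() {
  if command -v ptxas >/dev/null 2>&1; then
    echo "ptxas found at $(command -v ptxas)"
  else
    echo "ptxas not on PATH"
  fi
}

check_ptxas
```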
Finding ptxas
I found a GitHub issue that seemed similar (missing ptxas) and saw a suggestion to install nvidia-cuda-toolkit. Alright: but that exploded the container size from 6.5 GB to 12.13 GB … unacceptable 😤 (Incidentally, this is too large for Cloud Shell to build on its limited persistent disk.)
At this point I struggled for a couple of hours. The nvidia-cuda-toolkit package info says it uses CUDA 11.5. But the prebuilt containers had 11.7 and 11.8, not 11.5 (I’d previously selected 11.8). The 11.5 packages weren’t available in NVIDIA’s Ubuntu 22.04 package repo.
Along the way, I switched the base container from NVIDIA’s nvidia/cuda:11.8.0-base-ubuntu22.04 to nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04. Rather than pick the versions myself, I figured going with an official NVIDIA container with the files I was installing anyhow made sense.
I eventually found this “ptxas version issue” linked from a TensorFlow discussion asking whether to worry about a version mismatch warning. Not the same as our message that it’s missing, but close enough.
This part caught my eye:
Interesting idea. I cherry-picked the binary by launching a container from the rebuilt image, and installing the very large nvidia-cuda-toolkit:
Need to get 1603 MB of archives.
After this operation, 4505 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Gulp. One very long download later, I had a ptxas binary.
/usr/bin/ptxas
Now to copy it back to the host, so I can add it to the redux repo for direct insertion into the container. Back on the host:
Then I installed it into /usr/bin/ptxas in the Dockerfile.
Lo and behold: no more ptxas error when running DeepCell.
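Pulled together, the cherry-pick could be scripted roughly as follows. This is a sketch rather than the exact commands used: the container name ptxas-donor and the function name are invented, the image tag is assumed to be the one from the rebuild step, and -L is there in case /usr/bin/ptxas turns out to be a symlink.

```shell
IMAGE="tensorflow:2.8.4-rebuilt-gpu"   # assumed tag from the rebuild step
DONOR="ptxas-donor"                     # hypothetical container name

extract_ptxas() {
  # Install the toolkit in a scratch container, leaving the image itself untouched.
  docker run --name "$DONOR" "$IMAGE" \
    bash -c 'apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-cuda-toolkit' \
    || return 1
  # Copy the single binary to the host, following symlinks.
  docker cp -L "$DONOR":/usr/bin/ptxas ./ptxas || return 1
  # Throw away the multi-gigabyte scratch container.
  docker rm "$DONOR"
}

# In the Dockerfile, a line along these lines then drops the binary in place:
#   COPY ptxas /usr/bin/ptxas
```

Doing the install in a scratch container keeps the multi-gigabyte toolkit out of the image layers; only the one binary ends up in the build context.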
Summary
The container was rebuilt by:
Forking TensorFlow 2.8.4 from source.
Switching to the nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 base image.
Cherry-picking ptxas from nvidia-cuda-toolkit (avoids ~4 GB of unnecessary files).
The container rebuild yielded these changes:
| Metric | DeepCell + tensorflow-2.8.4 | DeepCell + tensorflow-2.8.4-redux | Delta |
|---|---|---|---|
| Compressed size | 3.2 GB | 4.0 GB | +0.8 GB (+25%) |
| VULNs | 553 | 125 | -428 (-77%) |
| Critical | 1 | 1 | 0 (0%) |
| High | 80 | 29 | -51 (-63%) |
| Medium | 349 | 53 | -296 (-85%) |
| Low | 123 | 42 | -81 (-66%) |
It’s too bad we added 25% to the container size. This may be because I moved away from the TensorFlow container build’s selective dependencies to the full runtime package.
Still, 77% reduction in VULNs (and 63% for the highs) is very good.
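The deltas are plain percent changes relative to the original container. As a quick worked check, here is a small helper (the name pct_delta is made up) that reproduces the two headline figures:

```shell
# pct_delta OLD NEW: signed percent change from OLD to NEW, rounded to a whole number.
pct_delta() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.0f%%\n", (new - old) / old * 100 }'
}

pct_delta 553 125   # total VULNs: prints -77%
pct_delta 3.2 4.0   # compressed size in GB: prints +25%
```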
The critical VULN is in TensorFlow versions before 2.11.1. It allows malicious users running custom TensorFlow Python code to access memory that’s not theirs in some cases. Since we’re running our own Python code, and DeepCell’s, we’re safe as long as nobody sticks naughty code into those layers. But we’re also stuck with 2.8.4 and can’t upgrade to 2.11, so the rationalization is rationalized.
If I were to do it again, I’d skip hand-picking library versions & move straight to an official NVIDIA runtime container.
Appendix
Helpful command to get into the TF container shell to poke around for files:
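For example, something like the following opens an interactive shell in a throwaway container. The default image tag is assumed to be the one produced by the rebuild, and the function name is invented.

```shell
# Open a bash shell in a disposable container (removed on exit).
# Usage: tf_shell [image]
tf_shell() {
  docker run --rm -it "${1:-tensorflow:2.8.4-rebuilt-gpu}" bash
}
```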
I ran out of disk space on Cloud Shell a few times. Clear out the docker cache like so:
⚠️ Don’t run these as-is if you have other containers/images you want to keep!
# Delete the previously built image
# (make room for new one)
docker image rm tensorflow:2.8.4-rebuilt-gpu
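Beyond removing one specific image, Docker's own housekeeping commands can show where the space went and reclaim build cache. A sketch, wrapped in a function (hypothetical name) so nothing gets deleted just by pasting it; the same warning about other containers and images applies.

```shell
docker_cleanup() {
  docker system df          # disk usage by images, containers, volumes, build cache
  docker builder prune -f   # drop the build cache
  docker image prune -f     # drop dangling (untagged) images
}
```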
In the end, Cloud Shell (which has a limited disk) became a hassle for iterating on builds. I considered a Cloud Workstation, but there’s a fixed $0.20/hr cost whether or not you have a workstation running … and I really just need a place to run Docker with disk space, so I used my local computer (a Mac). The downloads weren’t as fast as on cloud, but hey.
Side note: I’m super impressed with how easy it was to rebuild TF from source. Nice job y’all 🤩
Tools used in the rebuild:
GCP Cloud Shell
GCP Artifact Registry container scanner
Docker (local + Cloud Shell)
git & GitHub
TensorFlow
apt-file (to look up which package installed a file)
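As an example of the apt-file lookup, assuming a Debian or Ubuntu system where installing apt-file is acceptable (the function name is invented):

```shell
# Find which package ships a file whose path contains the given pattern.
which_package() {
  apt-file search "$1"
}

# One-time setup: apt-get install -y apt-file && apt-file update
# Example:        which_package bin/ptxas
```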