Gemini API 102: Next steps beyond “Hello World!”

TL;DR:

The previous post in this (short) series introduced developers to the Gemini API with more user-friendly and useful “Hello World!” samples than in the official Google documentation, covering both places you can access the API from: Google AI and GCP Vertex AI. Your next steps are to enhance that example and learn a few more features of the Gemini API: support for streaming and multi-turn conversations (chat), upgrading to the latest 1.0 or even 1.5 API versions, and switching to multimodality… stick around!

Introduction

Are you a developer interested in using Google APIs? You’re in the right place, as this blog is dedicated to that craft from Python and sometimes Node.js. Previous posts showed you how to use Google credentials like API keys or OAuth client IDs with Google Workspace (GWS) APIs. Other posts introduced serverless computing or showed you how to export Google Docs as PDF.

The previous post kicked off the conversation about generative AI, presenting “Hello World!” examples that help you get started with the Gemini API in a more user-friendly way than Google’s documentation does. It presented code samples showing you how to use the API from both Google AI as well as GCP Vertex AI. This post follows up with a sample supporting streaming output, another supporting multi-turn conversations (“chat”), a multimodal example, and instructions on upgrading to the latest 1.0 and 1.5 models, the latter of which is in public preview at the time of this writing.

Whereas your initial journey began with a variety of examples, code in Python & Node.js, and API access from Google AI & Vertex AI, this post focuses specifically on the “upgrades,” so I’m sticking with Gemini API access in Python on Google AI. Use the previous post’s variety of samples to “extrapolate” porting to Node.js or running on Vertex AI.

Prerequisites

The example assumes you’ve performed the prerequisites from the previous post:

Installed the Google GenAI Python package with: pip install -U pip google-generativeai

Created an API key
Saved the API key as a string to settings.py as API_KEY = 'YOUR_API_KEY_HERE' (sketched below), and followed the suggestions for only hard-coding it in prototypes and keeping it safe when deploying to production
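
For reference, that settings.py is nothing more than the single assignment, something like this (substitute your real key, and keep the file out of version control):

# settings.py (do NOT commit this file; add it to .gitignore)
API_KEY = 'YOUR_API_KEY_HERE'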

For today’s code sample, there are a couple more packages to install:

The popular Python HTTP requests library

The Python Imaging Library (PIL)’s flexible fork, Pillow

You can do so along with updating the GenAI package with: pip install -U Pillow requests google-generativeai (or pip3)

The “OG”

Let’s start with the original script from the first post that we’re going to upgrade here, gemtxt-simple-gai.py:

import google.generativeai as genai
from settings import API_KEY

PROMPT = 'Describe a cat in a few sentences'
MODEL = 'gemini-pro'
print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content(PROMPT)
print(response.text)

[CODE] gemtxt-simple-gai.py: “Hello World!” sample from previous post

 

Review the original post if you need a description of the code. This is the starting point for the remaining examples here.

Upgrade API version

The simplest update is to upgrade the API version. The original Gemini API 1.0 version was named gemini-pro. It was replaced soon thereafter by gemini-1.0-pro, and after that is the latest version, gemini-1.0-pro-latest.
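
By the way, if you’re ever unsure which model names your API key can access, the SDK can enumerate them with list_models(); here’s a quick sketch:

import google.generativeai as genai
from settings import API_KEY

genai.configure(api_key=API_KEY)

# Display every model available to this key that can generate content
for model in genai.list_models():
    if 'generateContent' in model.supported_generation_methods:
        print(model.name)

Now, on to the upgraded script: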

import google.generativeai as genai
from settings import API_KEY

PROMPT = 'Describe a cat in a few sentences'
MODEL = 'gemini-1.0-pro-latest'
print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content(PROMPT)
print(response.text)

[CODE] gemtxt-simple10-gai.py: Uses latest Gemini 1.0 Pro model

 

The one-line upgrade was effected by updating the MODEL variable. This “delta” version is available in the repo as gemtxt-simple10-gai.py. Executing it results in output similar to the original version:

$ python3 gemtxt-simple10-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model & prompt 'Describe a
cat in a few sentences'

A cat is a small, furry mammal with sharp claws and teeth. It
is a carnivore, meaning that it eats other animals. Cats are
often kept as pets because they are affectionate and playful.
They are also very good at catching mice and other small
rodents.

If you have access to the 1.5 API, update MODEL to gemini-1.5-pro-latest. The remaining samples below stay with the latest 1.0 model as 1.5 is still in preview.

Streaming

The next easiest update is to change to streaming output. When sending a request to an LLM (large language model), sometimes you don’t want to wait for all of the output from the model to return before displaying to users. To give them a better experience, “stream” the output as it comes instead:

import google.generativeai as genai
from settings import API_KEY

PROMPT = 'Describe a cat in a few sentences'
MODEL = 'gemini-1.0-pro-latest'
print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content(PROMPT, stream=True)
for chunk in response:
    print(chunk.text, end='')
print()

[CODE] gemtxt-stream10-gai.py: Produces streaming output

 

Switching to streaming requires only the stream=True flag passed to the model’s generate_content() method. The loop displays the chunks of data returned by the LLM as they come in. To keep the spacing consistent, set Python’s print() function to not output a NEWLINE (\n) after each chunk via the end parameter. Instead, keep chaining the chunks together and issue the NEWLINE after all have been retrieved and displayed. This version is also available in the repo as gemtxt-stream10-gai.py. Its output here isn’t going to reveal the streamed effect, so you’ll have to take my word for it. 🙂

$ python3 gemtxt-stream10-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model & prompt 'Describe a
cat in a few sentences'

A cat is a small, carnivorous mammal with soft fur, retractable
claws, and sharp teeth. They are known for their independence,
cleanliness, and playful nature. With its keen senses and
graceful movements, a cat exudes both mystery and intrigue. Its
sleek body is covered in sleek fur that ranges in color from
black to white to tabby.
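
One more streaming note: once the stream has been fully consumed, the SDK assembles the complete reply for you, so the aggregated text is still available afterwards. A quick sketch building on the script above:

response = model.generate_content(PROMPT, stream=True)
for chunk in response:
    print(chunk.text, end='')
print()
# After full iteration, the response object holds the assembled reply
full_text = response.text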

Multi-turn conversations (chat)

Now, you may be building a chat application or executing a workflow where your user or system must interact with the model more than once, keeping context between exchanges. To facilitate this, Google provides a convenience chat object, obtained with start_chat(), which features a send_message() method for communicating with the model instead of generate_content(), as shown below:

import google.generativeai as genai
from settings import API_KEY

PROMPTS = (
    'Describe a cat in a few sentences',
    "Since you're now a feline expert, what are the top three "
    'most friendly cat breeds for a family with small children?',
)
MODEL = 'gemini-1.0-pro-latest'
print('** GenAI text: %r model\n' % MODEL)

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
chat = model.start_chat()
for prompt in PROMPTS:
    print(' USER: %r\n' % prompt)
    response = chat.send_message(prompt)
    print(' MODEL: %r\n' % response.text)

[CODE] gemtxt-simple10-chat-gai.py: Supports multi-turn conversations

 

While the flow is slightly different from what you’ve already seen, the basic operations are the same: send a prompt to the model and await the response. The core difference is that you’re sending multiple messages in a row, and each succeeding message maintains the full context of the ongoing “conversation.” This version is found in the repo as gemtxt-simple10-chat-gai.py, and shown here is one sample exchange with the model:

$ python3 gemtxt-simple10-chat-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model

USER: 'Describe a cat in a few sentences'

MODEL: 'A feline enigma, the cat slinks with feline grace,
its ethereal presence both alluring and aloof. Its piercing
gaze holds secrets untold, while its velvety coat invites
gentle caresses. With an air of both mischief and mystery,
it roams the home, leaving paw prints on the heart.'

USER: "Since you're now a feline expert, what are the top
three most friendly cat breeds for a family with small
children?"

MODEL: 'As a feline expert, here are the top three most
friendly cat breeds for a family with small children:\n\n1.
**Ragdoll:** Known for their docile and affectionate nature,
Ragdolls are gentle giants that love to be around people.
They are known to go limp and relaxed when picked up, hence
their name.\n\n2. **Maine Coon:** These large and shaggy
cats are known for their sweet and playful personalities.
They are very tolerant of children and love to be involved
in family activities.\n\n3. **Siberian:** Hypoallergenic
and affectionate, Siberians are a great choice for families
with children who may have allergies. They are known for
their loyalty and love to be petted and cuddled.'
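
Because the chat object maintains the conversation state, you can inspect the entire exchange at any point via its history attribute; here’s a quick sketch assuming the chat object from the script above:

# Each turn (user prompt or model reply) is stored on the chat session
for turn in chat.history:
    print('%s: %s' % (turn.role, turn.parts[0].text))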

I’m not a cat owner, so I can’t vouch for Gemini’s accuracy there. Add a comment below if you have a take on it. Now let’s switch gears a bit.

So far, all of the enhancements and corresponding samples are text-based, single-modality requests. A whole new class of functionality is available if a model can accept data in addition to text, in other form factors such as images, audio, or video content. The Google AI documentation states that this wider variety of input “creates many additional possibilities for generating content, analyzing data, and solving problems.”

Multimodal

Some Gemini models, and by extension, their corresponding APIs, support multimodality, “prompting with text, image, and audio data.” Video is also supported, but you need to use the File API to convert videos to a series of image frames. You can also use the File API to upload all of the assets used in your prompts.
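
While we won’t exercise it here, a minimal sketch of a File API upload via the SDK’s upload_file() function might look like the following (the video.mp4 filename is hypothetical; video prompts also assume a model that accepts them, like 1.5 Pro, and large files may need time to finish processing before use):

import google.generativeai as genai
from settings import API_KEY

genai.configure(api_key=API_KEY)

# Upload a local asset via the File API, then reference it in a prompt
video = genai.upload_file('video.mp4')   # hypothetical local file
model = genai.GenerativeModel('gemini-1.5-pro-latest')
response = model.generate_content(('Summarize this video', video))
print(response.text)

Now back to today’s image example.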

The sample script below takes an image and asks the LLM for some information about it. The image is:

[IMAGE] Indoors dome waterfall; SOURCE: author (CC-BY-4.0)

 

The prompt is a fairly straightforward query: “What airport is this at, and what’s the waterfall’s name?” Below is the original sample “upgraded” with multimodality and found in the repo as gemmmd-simple10loc-gai.py:

from PIL import Image
import google.generativeai as genai
from settings import API_KEY

IMG = 'dome-waterfall.jpg'
DATA = Image.open(IMG)
PROMPT = "What airport is this at, and what's the waterfall's name?"
MODEL = 'gemini-1.0-pro-vision-latest'
print('** GenAI multimodal: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content((PROMPT, DATA))
print(response.text)

[CODE] gemmmd-simple10loc-gai.py: Multimodal sample with text & local image prompt

 

You’ll see these key updates from the original app:

Change to multimodal model: Gemini 1.0 Pro to Gemini 1.0 Pro Vision
Import Pillow and use it to read the image data given its filename
New prompt: pass in prompt string plus image payload

The MODEL variable now points to gemini-1.0-pro-vision-latest, the image filename is passed to Pillow to read its DATA, and rather than a single PROMPT string, pass in both the PROMPT and image DATA as a 2-tuple to generate_content(). Everything else can stay the same.

Running this script reveals the location and waterfall name:

$ python3 gemmmd-simple10loc-gai.py
** GenAI multimodal: 'gemini-1.0-pro-vision-latest' model & prompt
"What airport is this at, and what's the waterfall's name?"

This is the Jewel Changi Airport in Singapore. The waterfall is
called the Rain Vortex.

Online data vs. local

The final update is to take the previous example and change it to access an image online rather than requiring it to be available on the local system. For this, we’ll use one of Google’s stock images:

[IMAGE] Friendly man in office environment; SOURCE: Google

 

This one is pretty much identical to the one above but uses the Python requests library to fetch the image for Pillow. The script below is also in the repo, named gemmmd-simple10url-gai.py:

from PIL import Image
import requests
import google.generativeai as genai
from settings import API_KEY

IMG_URL = 'https://google.com/services/images/section-work-card-img_2x.jpg'
IMG_RAW = Image.open(requests.get(IMG_URL, stream=True).raw)
PROMPT = 'Describe the scene in this photo'
MODEL = 'gemini-1.0-pro-vision-latest'
print('** GenAI multimodal: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content((PROMPT, IMG_RAW))
print(response.text)

[CODE] gemmmd-simple10url-gai.py: Multimodal sample with text & online image prompt

 

New here is the import of requests, followed by its use to perform an HTTP GET on the image URL (IMG_URL), reading the binary payload into IMG_RAW, which is passed along with the text prompt to generate_content(). Running this script results in the following output:

$ python3 gemmmd-simple10url-gai.py
** GenAI multimodal: 'gemini-1.0-pro-vision-latest' model &
prompt 'Describe the scene in this photo'

A young Asian man is sitting at a desk in an office. He is
wearing a white shirt and black pants. He has a big smile on
his face and is gesturing with his hands. There is a laptop,
notebook, and pen on the desk. There is a couch and some
plants in the background. The man is probably giving a
presentation or having a conversation with someone.

I originally designed a multi-turn conversation here, intending to further query the model: if this image were used in a press release, what product(s) could the company that originated the content possibly be selling? Unfortunately, multi-turn conversations aren’t supported by Gemini multimodal models (yet).

Summary

Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google’s Gemini models via API is part of that picture. The previous post in the series got your foot in the door, presenting a more digestible “Hello World!” sample than what’s available in Google’s documentation. This post helped you with the next steps, providing “102” samples that enhance the original script, furthering your exploration of Gemini API features but doing so without overburdening you with large swaths of code.

More advanced features are available via the Gemini API that we didn’t cover here; they merit separate posts of their own:

Model tuning
Function calling
Embeddings
Safety settings

The next post in the series focuses on Gemini’s responses and explores the differences between the 1.0 and 1.5 models’ outputs across a variety of queries, so stay tuned for that. If you found an error in this post or have a topic you want me to cover in the future, drop a note in the comments below! I’ve been on the road lately talking about Google APIs, AI included of course. Find the travel calendar at the bottom of my consulting site… I’d love to meet you IRL if I’m visiting your region!

Resources

Google AI Gemini 1.0 models; Python code samples from this post

Original 1.0 app
Latest 1.0 app
Latest 1.0 streaming app
Latest 1.0 chat app
Latest 1.0 multimodal (local image) app
Latest 1.0 multimodal (online image) app

Other blog post code samples

Gemini API samples
Other Google APIs samples

Gemini API (Google AI)

API overview
QuickStart page
QuickStart code
GenAI API reference

Gemini API (GCP Vertex AI)

QuickStart page
QuickStart code
Gemini docs
Gemini model API reference
Gemini Python SDK reference

Gemini API (differences between both platforms)

Google AI for GCP Vertex AI users
GCP Vertex AI for Google AI users

Gemini 1.5 (preview)

Launch
Paper
Private preview
Public preview
Public preview (GCP)

Other Generative AI and Gemini resources

General GenAI docs
Gemini home page
Gemini models overview
Gemini models information (& quotas)

Gemini whitepaper (PDF)
Bard (app) name-change to Gemini
Gemini Pro/Ultra vs. ChatGPT 3.5/4 performance benchmarks


WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall’s bestselling “Core Python” series, co-author of “Python Web Development with Django”, and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb or buy him a coffee (or tea)!
