Vision - MyTokenGate
1. Usage Scenarios
Vision-Language Models (VLMs) are large language models that accept both visual (image) and linguistic (text) input. With a VLM, you can submit images together with text, and the model interprets the image content alongside the conversation context while following your instructions. For example:
- Visual Content Interpretation: The model can interpret and describe the information in an image, such as objects, text, spatial relationships, colors, and atmosphere.
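- Multi-turn Conversations: The model can carry on a dialogue that combines visual content with the conversation context.
- Partial Replacement of Traditional Machine Vision Models: VLMs can take over some tasks of dedicated models, such as OCR.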
- Future Applications: With continuous improvements in model capabilities, VLMs can be applied to areas such as visual agents and robotics.
2. Usage Method
To use a VLM, call the /chat/completions API with a message that contains either an image URL or a base64-encoded image. The detail parameter controls how the image is preprocessed.
2.1 Explanation of Image Detail Control Parameters
MyTokenGate provides three options for the detail parameter: low, high, and auto.
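As a minimal sketch, the image part of a message can be built with a small helper that only accepts the three detail values listed above. The helper names image_part and text_part are hypothetical and not part of any SDK, and checking the value on the client side is an assumption made here for illustration; the gateway may handle unexpected values differently.

```python
from typing import Optional

# The three values this document lists for the detail parameter.
DETAIL_OPTIONS = ("low", "high", "auto")


def image_part(url: str, detail: Optional[str] = None) -> dict:
    """Build one image_url content part (hypothetical helper)."""
    image_url = {"url": url}
    if detail is not None:
        if detail not in DETAIL_OPTIONS:
            raise ValueError(f"detail must be one of {DETAIL_OPTIONS}, got {detail!r}")
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}


def text_part(text: str) -> dict:
    """Build one text content part (hypothetical helper)."""
    return {"type": "text", "text": text}


# A user message with one image and one instruction.
message = {
    "role": "user",
    "content": [
        image_part("https://example.com/image.png", detail="high"),
        text_part("Describe this image"),
    ],
}
```

Leaving detail as None omits the key entirely, which matches the message examples in this document that do not set it.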
2.2 Example Formats for message Containing Images
2.2.1 Using Image URLs
{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/image.png",
                "detail": "high"
            }
        },
        {
            "type": "text",
            "text": "Describe this image"
        }
    ]
}
2.2.2 Base64 Format
# Python dict literal; base64_image is a str holding the base64-encoded image data
{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}",
                "detail": "low"
            }
        },
        {
            "type": "text",
            "text": "Describe this image"
        }
    ]
}
2.2.3 Multiple Images
{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/image1.png"
            }
        },
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        },
        {
            "type": "text",
            "text": "Compare these two images"
        }
    ]
}
3. Supported Models
Currently supported VLM models:
- gpt-4o: GPT-4 multimodal version
- gemini-2.5-flash-image: Gemini image understanding
- gemini-2.5-pro: Gemini Pro multimodal
- gemini-3-pro-image-preview: Gemini 3 image preview
- gemini-3.1-flash-image-preview: Gemini 3.1 image preview
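If it helps to catch typos before a request is sent, the list above can be mirrored in a small client-side check. This is only a sketch: SUPPORTED_VLM_MODELS and require_vlm_model are hypothetical names, and the set is a snapshot of the list in this document, so it may lag behind what the gateway actually accepts.

```python
# Snapshot of the models listed in this document; update it when the list changes.
SUPPORTED_VLM_MODELS = frozenset({
    "gpt-4o",
    "gemini-2.5-flash-image",
    "gemini-2.5-pro",
    "gemini-3-pro-image-preview",
    "gemini-3.1-flash-image-preview",
})


def require_vlm_model(model: str) -> str:
    """Return the model name unchanged, or raise if it is not in the snapshot."""
    if model not in SUPPORTED_VLM_MODELS:
        raise ValueError(f"{model!r} is not in the documented VLM list")
    return model
```

The returned name can be passed directly as the model argument of a chat completion request.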
4. Usage Examples
4.1 Image Understanding
from openai import OpenAI

# Point the OpenAI SDK at the MyTokenGate gateway.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://gateway.mytokengate.com/v1"
)

# Send one image plus a text instruction and stream the reply.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the content of this image"
                }
            ]
        }
    ],
    stream=True
)

# Print each text fragment as it arrives; skip chunks with no choices or no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
4.2 Multi-Image Understanding
from openai import OpenAI

# Point the OpenAI SDK at the MyTokenGate gateway.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://gateway.mytokengate.com/v1"
)

# Send two images plus a text instruction and stream the reply.
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image1.png"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image2.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Compare these two images"
                }
            ]
        }
    ],
    stream=True
)

# Print each text fragment as it arrives; skip chunks with no choices or no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
5. Notes
- Recommended image size: under 20MB
- Supported formats: PNG, JPEG, GIF, WebP
- When using base64, add the data:image/jpeg;base64, prefix to the URL
- For multiple images, no more than 10 images are recommended
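The base64 note can be sketched as a helper that turns image bytes into a data URL. The names to_data_url and file_to_data_url are hypothetical. The sketch makes two assumptions for illustration: it picks the MIME type from the file suffix, so the prefix becomes data:image/png;base64, for a PNG instead of always using image/jpeg, and it treats the recommended 20MB size as a hard client-side limit.

```python
import base64
from pathlib import Path

# Recommended size from the notes above, enforced here as a conservative check.
MAX_IMAGE_BYTES = 20 * 1024 * 1024

# MIME types for the formats listed as supported.
MIME_BY_SUFFIX = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def to_data_url(data: bytes, suffix: str) -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    mime = MIME_BY_SUFFIX.get(suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported image format: {suffix!r}")
    if len(data) >= MAX_IMAGE_BYTES:
        raise ValueError("image should be under 20MB")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a local image file and return its data URL."""
    p = Path(path)
    return to_data_url(p.read_bytes(), p.suffix)
```

The returned string goes in the url field of an image_url content part, in the same place an https URL would go.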