Vision - MyTokenGate

1. Usage Scenarios

Vision-language models (VLMs) are large language models that accept both visual (image) and textual input. Given images and text together, a VLM interprets the image content alongside the conversation context and follows the instructions in the prompt. Typical uses include:

Visual content interpretation: the model can interpret and describe the information in an image, such as objects, text, spatial relationships, colors, and atmosphere.
Multi-turn conversations that combine visual content with the conversation context.
Partial replacement of traditional machine vision models, such as OCR.
Future applications: as model capabilities continue to improve, VLMs can be applied to areas such as visual agents and robotics.

2. Usage Method

To use a VLM, call the /chat/completions API with a message whose content includes an image, given either as an image URL or as a base64-encoded image. The detail parameter controls how the image is preprocessed.

2.1 Explanation of Image Detail Control Parameters

MyTokenGate provides three options for the detail parameter: low, high, and auto.
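
A minimal sketch of how the detail value could be validated before it is placed in a message. The helper name image_part is hypothetical and is not part of the OpenAI SDK or the MyTokenGate API; the only facts taken from this page are the three allowed values and the image_url structure used in the examples of section 2.2. Leaving detail out when none is passed is an assumption that mirrors the multi-image example, which omits the field.

```python
from typing import Optional

# The three values listed in this section.
ALLOWED_DETAIL = {"low", "high", "auto"}


def image_part(url: str, detail: Optional[str] = None) -> dict:
    """Build one image_url content part (hypothetical helper, not an SDK call)."""
    part = {"type": "image_url", "image_url": {"url": url}}
    if detail is not None:
        if detail not in ALLOWED_DETAIL:
            raise ValueError(
                f"detail must be one of {sorted(ALLOWED_DETAIL)}, got {detail!r}"
            )
        part["image_url"]["detail"] = detail
    return part
```

Calling image_part("https://example.com/image.png", "high") produces the same image entry as the URL example in section 2.2.1.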

2.2 Example Formats for message Containing Images

2.2.1 Using Image URLs


{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/image.png",
                "detail": "high"
            }
        },
        {
            "type": "text",
            "text": "Describe this image"
        }
    ]
}

2.2.2 Base64 Format


{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}",
                "detail": "low"
            }
        },
        {
            "type": "text",
            "text": "Describe this image"
        }
    ]
}
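
The base64_image variable in the message above has to be produced from the image bytes first. The following is a sketch of one way to do that with the Python standard library. The function name image_to_data_url is hypothetical, and guessing the MIME type from the file extension (rather than always using image/jpeg as in the example) is an assumption about what the gateway accepts.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Read a local image and return it as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    # Encode the raw bytes as base64 text.
    base64_image = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{base64_image}"
```

The returned string can be used directly as the "url" value of an image_url entry. Older Python versions may not recognize every extension (for example .webp), in which case the helper raises ValueError instead of guessing.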

2.2.3 Multiple Images


{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/image1.png"
            }
        },
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        },
        {
            "type": "text",
            "text": "Compare these two images"
        }
    ]
}

3. Supported Models

Currently supported VLMs:

gpt-4o - GPT-4 multimodal version
gemini-2.5-flash-image - Gemini image understanding
gemini-2.5-pro - Gemini Pro multimodal
gemini-3-pro-image-preview - Gemini 3 image preview
gemini-3.1-flash-image-preview - Gemini 3.1 image preview

4. Usage Examples

4.1 Image Understanding


from openai import OpenAI
 
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://gateway.mytokengate.com/v1"
)
 
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the content of this image"
                }
            ]
        }
    ],
    stream=True
)
 
for chunk in response:
    # Skip chunks that carry no choices or no text delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)

4.2 Multi-Image Understanding


from openai import OpenAI
 
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://gateway.mytokengate.com/v1"
)
 
response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image1.png"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image2.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Compare these two images"
                }
            ]
        }
    ],
    stream=True
)
 
for chunk in response:
    # Skip chunks that carry no choices or no text delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
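
When the full reply is needed as one string rather than printed incrementally, the streamed chunks from the examples above can be gathered with a small helper. This is a sketch: collect_stream is a hypothetical name, and it assumes only that each chunk exposes choices[0].delta.content, as in the loops above.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas of a streamed response into one string."""
    pieces = []
    for chunk in chunks:
        # Some chunks may carry no choices at all; skip them.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        # The delta content can be None or empty, e.g. on the final chunk.
        if text:
            pieces.append(text)
    return "".join(pieces)
```

With a streaming response object, the call would be text = collect_stream(response).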

5. Notes

Recommended image size: under 20 MB.
Supported formats: PNG, JPEG, GIF, WebP.
When sending a base64-encoded image, prefix the encoded data with data:image/jpeg;base64, and pass the result as the url value.
For multi-image requests, use no more than 10 images.
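
The limits in these notes can be checked locally before a request is sent. The sketch below is illustrative only: check_images and the constant names are hypothetical, the format check relies on file extensions rather than file contents, and reading "under 20MB" as 20 * 1024 * 1024 bytes is an assumption.

```python
import os
from typing import List

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # "under 20MB", read here as 20 MiB (assumption)
MAX_IMAGES = 10                     # recommended upper bound per request
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def check_images(paths: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if len(paths) > MAX_IMAGES:
        problems.append(f"{len(paths)} images given, recommended maximum is {MAX_IMAGES}")
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            # Unsupported files are reported without touching the file system.
            problems.append(f"{path}: unsupported format {ext or '(none)'}")
        elif os.path.getsize(path) >= MAX_IMAGE_BYTES:
            problems.append(f"{path}: file is not under 20 MB")
    return problems
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem at once.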