Challenge specifications

This section gives detailed specifications of each of the four challenges.

Overview

Your TIL-AI 2025 mission consists of four challenges across distinct fields of study in artificial intelligence: automatic speech recognition (ASR), computer vision (CV), optical character recognition (OCR), and reinforcement learning (RL).

This document provides the technical specifications for the four challenges. It aims to rigorously define the design and requirements, not to serve as a first-time guide to the TIL-AI mission. For a non-technical introduction, take a look at the Opening Ceremony slides.

A copy of the input and output formats is provided in the README.md file in each model's directory in the til-ai/til-25 template repository. These copies are provided for your convenience; if there is a conflict in meaning between a copy in README.md and the specifications in this document, this document takes precedence.

Score breakdown

Your overall score is calculated by a weighted sum of the scores for each challenge. The weights are as follows:

| Challenge | Weight |
| --- | --- |
| ASR | 20% |
| CV | 20% |
| OCR | 20% |
| RL | 40% |

Each challenge's score is calculated by a weighted sum of accuracy/reward and inference speed score:

| Challenge | Accuracy/reward weight | Speed score weight |
| --- | --- | --- |
| ASR | 75% | 25% |
| CV | 75% | 25% |
| OCR | 75% | 25% |
| RL | 75% | 25% |

The speed score is calculated by:

$$1-\frac{\min(t_{\text{elapsed}},t_{\max})}{t_{\max}}$$

where $t_{\text{elapsed}}$ is the time taken for your model to complete inference on the whole test set, and $t_{\max}$ is the inference duration beyond which your model gets zero speed score. For Qualifiers, $t_{\max}$ is 30 minutes.

In short, the less time your model takes to generate its predictions, the better your speed score. A perfect score requires instantaneous inference.
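
For concreteness, here is a minimal sketch of the speed score calculation (the 30-minute $t_{\max}$ is the Qualifiers value stated above):

```python
def speed_score(t_elapsed: float, t_max: float = 30 * 60) -> float:
    """Speed score: 1 for instantaneous inference, 0 at or beyond t_max (seconds)."""
    return 1 - min(t_elapsed, t_max) / t_max

# Example: finishing the whole test set in 15 minutes yields 0.5
print(speed_score(15 * 60))  # 0.5
```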

ASR

Your automatic speech recognition challenge is to transcribe a noisy recording of speech. Given an audio file, your model is expected to produce a text transcript of what was spoken.

All speech will be in English, though speakers may have a variety of accents. Each audio file contains exactly one speaker.

Track variations

The input audio for the Advanced track will contain more noise than that for the Novice track.

Scoring

The accuracy score of your ASR model is calculated by $\max(0, 1 - WER)$, where WER is the word error rate as calculated by JiWER. Before computing the WER between predicted and ground truth transcripts, the following transforms are applied to the predicted transcript:

```python
jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.SubstituteRegexes({"-": " "}),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])
```
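
For reference, a minimal sketch of the resulting accuracy metric, assuming the same transform is also applied to the ground-truth transcript and the jiwer 3.x keyword API (both are assumptions; the evaluator's exact setup isn't shown here):

```python
import jiwer

transform = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.SubstituteRegexes({"-": " "}),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])

# Word error rate between a ground-truth and a predicted transcript
wer = jiwer.wer(
    reference="the quick brown fox",
    hypothesis="the quick brown box",
    reference_transform=transform,
    hypothesis_transform=transform,
)
accuracy_score = max(0, 1 - wer)  # 0.75 for this example (1 substitution in 4 words)
```
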
Input and output formats

Input

The input is sent via a POST request to the /asr route on port 5001. It is a JSON document structured as follows:

```json
{
  "instances": [
    {
      "key": 0,
      "b64": "BASE64_ENCODED_AUDIO"
    },
    ...
  ]
}
```

The b64 key of each object in the instances list contains the base64-encoded bytes of the input audio in WAV format. The length of the instances list is variable.
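
For illustration, a minimal sketch of decoding one instance (the soundfile dependency is an assumption; any WAV decoder works):

```python
import base64
import io

import soundfile as sf  # assumption: any WAV reader can stand in here

def decode_audio(instance: dict):
    """Decode the base64-encoded WAV bytes of one instance."""
    wav_bytes = base64.b64decode(instance["b64"])
    audio, sample_rate = sf.read(io.BytesIO(wav_bytes))
    return audio, sample_rate
```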

Output

Your route handler function must return a dict with this structure:

```json
{
    "predictions": [
        "Predicted transcript one.",
        "Predicted transcript two.",
        ...
    ]
}
```

where each string in predictions is the predicted ASR transcription for the corresponding audio file.

The $k$-th element of predictions must be the prediction corresponding to the $k$-th element of instances for all $1 \le k \le n$, where $n$ is the number of input instances. The length of predictions must equal that of instances.

CV

Your CV challenge is to detect and classify objects in an image. Given an image, your model is expected to produce the bounding box and predicted category of every occurrence of an object belonging to a category in the target list. Each scene may contain zero or more targets.

Your output is expected to contain the following values:

  • $(x, y)$: The coordinates in pixels of the top-left corner of the predicted bounding box; $x$ is the horizontal coordinate and $y$ is the vertical.
  • $(w, h)$: The width and height in pixels of the predicted bounding box.
  • category_id: The index of the predicted category in the target list.

All bounding box coordinates should be zero-indexed; that is, $(x, y) = (0, 0)$ means the top-left corner of the bounding box is at the top-left corner of the image.
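
If your detector outputs corner coordinates instead, a minimal conversion sketch (the function name is hypothetical):

```python
def corners_to_xywh(x1: float, y1: float, x2: float, y2: float) -> list[float]:
    """Convert (x1, y1, x2, y2) corner coordinates into the required
    [x, y, w, h] format: top-left corner plus width and height."""
    return [x1, y1, x2 - x1, y2 - y1]
```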

Target list

| Category index | Object type |
| --- | --- |
| 0 | cargo aircraft |
| 1 | commercial aircraft |
| 2 | drone |
| 3 | fighter jet |
| 4 | fighter plane |
| 5 | helicopter |
| 6 | light aircraft |
| 7 | missile |
| 8 | truck |
| 9 | car |
| 10 | tank |
| 11 | bus |
| 12 | van |
| 13 | cargo ship |
| 14 | yacht |
| 15 | cruise ship |
| 16 | warship |
| 17 | sailboat |

Track variations

The input images for the Advanced track will contain more noise than those for the Novice track. The targets in the Advanced track images will also be smaller than those in the Novice track.

Scoring

The accuracy score is calculated by mean average precision over IoU thresholds ranging from 0.5 to 0.95 with a stride of 0.05, otherwise known as mAP@.5:.05:.95. See here for details.
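
As a refresher, a prediction counts as a true positive at threshold $t$ only if its IoU with a ground-truth box is at least $t$. A minimal IoU sketch for boxes in the [x, y, w, h] format above (the evaluator's exact implementation may differ):

```python
def iou(box_a: list[float], box_b: list[float]) -> float:
    """Intersection over union of two boxes in [x, y, w, h] format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (clamped to zero if the boxes don't overlap)
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```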

Input and output formats

Input

The input is sent via a POST request to the /cv route on port 5002. It is a JSON document structured as follows:

```json
{
  "instances": [
    {
      "key": 0,
      "b64": "BASE64_ENCODED_IMAGE"
    },
    ...
  ]
}
```

The b64 key of each object in the instances list contains the base64-encoded bytes of the input image in JPEG format. The length of the instances list is variable.

Output

Your route handler function must return a dict with this structure:

```json
{
    "predictions": [
        [
            {
                "bbox": [x, y, w, h],
                "category_id": category_id
            },
            ...
        ],
        ...
    ]
}
```

where x, y, w, h, and category_id are defined as above.

If your model detects no objects in a scene, your handler should output an empty list for that scene.

The $k$-th element of predictions must be the prediction corresponding to the $k$-th element of instances for all $1 \le k \le n$, where $n$ is the number of input instances. The length of predictions must equal that of instances.

OCR

Your OCR challenge is to read text in a scanned document. Given an image of a scanned document, your model is expected to produce a transcription of the contents of the document.

Input documents may be typeset in a variety of fonts and layouts. They may contain Latin letters, numerals, and punctuation.

Training data

Provided with each image are word-, line-, and paragraph-level bounding boxes which you can use to train your OCR model. This data is provided in hOCR format, though you can easily convert it to other formats as necessary.
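
A minimal sketch of reading word-level boxes out of an hOCR file (BeautifulSoup is an assumption, as is the exact markup; hOCR typically tags words as `<span class='ocrx_word' title='bbox x1 y1 x2 y2 ...'>`):

```python
import re

from bs4 import BeautifulSoup  # assumption: any HTML parser works

def word_boxes(hocr: str) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Return (text, (x1, y1, x2, y2)) for each word in an hOCR document."""
    soup = BeautifulSoup(hocr, "html.parser")
    boxes = []
    for span in soup.find_all(class_="ocrx_word"):
        # The title attribute looks like "bbox 36 92 96 116; x_wconf 96"
        match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", span.get("title", ""))
        if match:
            boxes.append((span.get_text(), tuple(map(int, match.groups()))))
    return boxes
```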

Track variations

The input documents for the Advanced track will be more visually degraded than those for the Novice track, and feature a greater diversity of layouts.

Scoring

The accuracy score of your OCR model is calculated by $\max(0, 1 - CER)$, where CER is the character error rate as calculated by JiWER. Before computing the CER between predicted and ground truth transcripts, the following transforms are applied to the predicted transcript:

```python
jiwer.Compose([
    jiwer.SubstituteRegexes({"-": ""}),
    jiwer.RemoveWhiteSpace(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.ReduceToListOfListOfChars(),
])
```
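
This is the character-level analogue of the ASR metric. A minimal sketch, again assuming the same transform is applied on both sides and the jiwer 3.x keyword API (both are assumptions):

```python
import jiwer

transform = jiwer.Compose([
    jiwer.SubstituteRegexes({"-": ""}),
    jiwer.RemoveWhiteSpace(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.ReduceToListOfListOfChars(),
])

# Character error rate between a ground-truth and a predicted transcript
cer = jiwer.cer(
    reference="Ground truth text.",
    hypothesis="Predicted text.",
    reference_transform=transform,
    hypothesis_transform=transform,
)
accuracy_score = max(0, 1 - cer)
```
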
Input and output formats

Input

The input is sent via a POST request to the /ocr route on port 5003. It is a JSON document structured as follows:

```json
{
  "instances": [
    {
      "key": 0,
      "b64": "BASE64_ENCODED_IMAGE"
    },
    ...
  ]
}
```

The b64 key of each object in the instances list contains the base64-encoded bytes of the input image in JPEG format. The length of the instances list is variable.

Output

Your route handler function must return a dict with this structure:

```json
{
    "predictions": [
        "Predicted transcript one.",
        "Predicted transcript two.",
        ...
    ]
}
```

where each string in predictions is the predicted OCR transcription for the corresponding document.

The $k$-th element of predictions must be the prediction corresponding to the $k$-th element of instances for all $1 \le k \le n$, where $n$ is the number of input instances. The length of predictions must equal that of instances.

RL

Your RL challenge is to direct your agent through the game map while interacting with other agents and completing challenges.

Given an observation, your model is expected to produce the next action to take in order to complete its objective.

Gameplay overview

This section provides a technical overview of the TIL-AI 2025 mission gameplay, and assumes you are already familiar with the concept. Otherwise, take a look at the Opening Ceremony slides.

Each team's RL agent navigates a maze-like game map in a high-stakes wargame scenario. The map is structured as a 16×16 grid, to which agents' moves are discretized.

Each match is played by four teams and consists of four rounds. Within each round, each team plays one of two roles: Scout or Guard. Teams rotate roles between rounds, such that by the end of a match, each team has played the Scout role once and the Guard role three times.

The Scout's objective is to:

  • Avoid capture by the Guards,
  • Collect Reconnaissance Points placed around the map, and
  • Complete challenges located at fixed points in the map.

The Guards' objective is to:

  • Capture the Scout.

The game takes place over discrete time steps. All agents move in step, and at the same rate. The game ends when 100 time steps elapse, or when the Scout is captured, whichever happens first.

Reconnaissance Points are evenly distributed across the map, one per grid cell. The Scout collects a Reconnaissance Point by traveling to the cell in which the point is located. Once collected, a Reconnaissance Point never respawns. Nothing happens when a Guard travels to a cell containing a Reconnaissance Point.

Challenges are sparsely distributed throughout the map. The Scout activates a challenge by traveling to the cell in which the challenge is located. A challenge is one of the ASR, CV, or OCR tasks. After activating a challenge, the Scout attempts to solve it, and the resulting score is converted to points. No extra in-game time passes while the challenge is being solved; that is, activating a challenge "stops the world".

Note

During Qualifiers, each challenge (ASR, CV, OCR, and RL) is evaluated and scored separately. The RL agent will continue to receive rewards for activating challenges, but no challenge is actually sent to the ASR, CV, or OCR models, and their performance has no effect on the RL score.

Observation space

The RL environment uses a discrete observation space. Observations are provided to your RL agent as a dictionary with five key/value pairs:

  • viewcone: A 7×5 area of tiles around the agent, extending 2 tiles in the left, rearward, and right directions and 4 tiles in the forward direction. See Viewcone below.
  • direction: The direction the agent is currently facing: 0 denotes right, 1 down, 2 left, and 3 up.
  • scout: 1 if the agent is currently a Scout, 0 if it is currently a Guard.
  • location: The $(x, y)$ coordinate of the agent within the environment.
  • step: The current step count of the environment.

Viewcone

The observation for each tile is represented as an 8-bit unsigned integer.

Agents can see 4 tiles ahead in the forward direction and 2 tiles ahead in other directions. Agents can also hear the presence of other agents so long as the other agents are within the viewcone, even if there are no sightlines to the other agents' locations. Each tile within the viewcone is represented as an 8-bit set of binary flags, where the 2 least significant bits denote tile information, while the other 6 bits denote tile occupancy and wall information.

| Value of last 2 bits (`tile & 0b11`) | Meaning |
| --- | --- |
| 0 | No vision |
| 1 | Empty tile |
| 2 | Recon (1 point) |
| 3 | Mission (5 points) |

| Bit index (least significant bit = index 0) | Tile contains a... |
| --- | --- |
| 2 | Scout |
| 3 | Guard |
| 4 | Right wall |
| 5 | Bottom wall |
| 6 | Left wall |
| 7 | Top wall |

For example, a tile with value 130 would be a tile that contains a top wall (128) as well as a Recon point (2). A tile with value 123 would be a tile with left (64), bottom (32), and right (16) walls, a Guard (8) as well as a Mission (3).
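
A minimal sketch of unpacking these flags with bitwise operations:

```python
def decode_tile(tile: int) -> dict:
    """Unpack one viewcone tile (an 8-bit unsigned integer)."""
    return {
        "tile_type": tile & 0b11,  # 0 no vision, 1 empty, 2 recon, 3 mission
        "has_scout": bool(tile & 0b0000_0100),    # bit 2
        "has_guard": bool(tile & 0b0000_1000),    # bit 3
        "wall_right": bool(tile & 0b0001_0000),   # bit 4
        "wall_bottom": bool(tile & 0b0010_0000),  # bit 5
        "wall_left": bool(tile & 0b0100_0000),    # bit 6
        "wall_top": bool(tile & 0b1000_0000),     # bit 7
    }

# The worked example above: 130 = top wall (128) + Recon point (2)
print(decode_tile(130))
```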

Action space

The action space of the RL environment is discrete. At each time step, your agent may choose one of the following actions:

| Value | Action |
| --- | --- |
| 0 | Move forward |
| 1 | Move backward |
| 2 | Turn left |
| 3 | Turn right |
| 4 | Stay |

The forward and backward actions cause the agent to move one grid cell in the direction the agent is currently facing (or in the reverse direction). The left and right actions cause the agent to rotate 90 degrees to its left or right. The stay action results in the agent not moving or turning.

Rewards

These are the rewards used to evaluate RL models during Qualifiers.

| Outcome | Scout reward | Guard reward |
| --- | --- | --- |
| Scout collects a Reconnaissance Point | 1 | 0 |
| Scout activates a challenge | 5 | 0 |
| Scout is captured by a Guard | -50 | 50 |

Note

Reward refers specifically to the value of the reward function. This should not be confused with the score teams achieve in the Semifinals and Finals, which is calculated from a combination of metrics, including your other models' performance in the challenges your RL agent activates.

Notes

The container running your trained RL agent is likely to be independent of the training environment, and can thus have a more limited set of requirements/packages installed solely for inference.

Track variations

For the Novice track, the map used in the environment will be held fixed.

For the Advanced track, the map used in the environment will vary for each match, requiring participants' agents to be robust and able to adapt to many novel environments.

Scoring

For the automated evaluation, the score returned is the sum of all rewards attained by the agent during evaluation, divided by the number of rounds, then divided by 100.
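
A minimal sketch of that normalization (the per-round rewards here are illustrative):

```python
def rl_score(round_rewards: list[float]) -> float:
    """Sum of all rewards, divided by the number of rounds, then by 100."""
    return sum(round_rewards) / len(round_rewards) / 100

# Example: four rounds with total rewards of 120, 30, 45, and 5
print(rl_score([120, 30, 45, 5]))  # 0.5
```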

Input and output formats

Input

The input is sent via a POST request to the /rl route on port 5004. It is a JSON document structured as follows:

```json
{
  "instances": [
    {
      "observation": {
        "viewcone": [[0, 0, ..., 0], [0, 0, ..., 0], ..., [0, 0, ..., 0]],
        "direction": 0,
        "location": [0, 0],
        "scout": 0,
        "step": 0
      }
    }
  ]
}
```

The observation is a representation of the inputs the agent senses in its environment. See the observation space specifications to learn how to interpret the observation.

The length of the instances array is 1.

During evaluation for Qualifiers, a GET request will be sent to the /reset route to signal that a round has ended. All agents are reset to their starting positions (possibly with new roles), and any persistent state your code may have stored must be cleared.

Output

Your route handler function must return a dict with this structure:

```json
{
    "predictions": [
        {
            "action": 0
        }
    ]
}
```

The action is an integer representing the next movement your agent intends to take. See the action space specifications for a list of possible movements.
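
Putting the RL input and output specifications together, a minimal route-handler sketch (FastAPI and the placeholder policy are assumptions; any web framework serving the same routes on port 5004 works):

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/rl")
async def rl(payload: dict) -> dict:
    """Return one action per observation instance."""
    predictions = []
    for instance in payload["instances"]:
        observation = instance["observation"]
        action = 4  # placeholder policy: "Stay"; substitute your trained agent
        predictions.append({"action": action})
    return {"predictions": predictions}

@app.get("/reset")
async def reset() -> dict:
    """Called when a round ends: clear any persistent per-round state here."""
    return {}
```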
