Challenge specifications
This section gives detailed specifications of each of the four challenges.
Your TIL-AI 2025 mission consists of four challenges across distinct fields of study in artificial intelligence: automatic speech recognition (ASR), computer vision (CV), optical character recognition (OCR), and reinforcement learning (RL).
This document provides the technical specifications for the four challenges. It aims to rigorously define the design and requirements, not to be a first-time guide to the TIL-AI mission for the uninitiated. For a non-technical introduction, take a look at the Opening Ceremony slides.
A copy of the input and output formats is provided in the `README.md` file in each model's directory in the til-ai/til-25 template repository. These copies are provided for your convenience; if there is a conflict in meaning between a copy in a `README.md` and the specifications in this document, this document takes precedence.
Your overall score is calculated by a weighted sum of the scores for each challenge. The weights are as follows:
Challenge | Weight |
---|---|
ASR | 20% |
CV | 20% |
OCR | 20% |
RL | 40% |
Each challenge's score is calculated by a weighted sum of accuracy/reward and inference speed score:
Challenge | Accuracy/reward weight | Speed score weight |
---|---|---|
ASR | 75% | 25% |
CV | 75% | 25% |
OCR | 75% | 25% |
RL | 75% | 25% |
The speed score is a decreasing function of your model's inference time. In short, the less time your model takes to generate its predictions, the better your speed score. A perfect speed score requires instantaneous inference.
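To see how these weights combine, here is a minimal sketch of the scoring arithmetic. The per-challenge weights and the 75/25 accuracy-versus-speed split come from the tables above; the accuracy and speed values passed in are placeholders, since the exact speed-score formula is not reproduced here.

```python
# Illustrative only: combines the published weights; all inputs assumed in [0, 1].
CHALLENGE_WEIGHTS = {"asr": 0.20, "cv": 0.20, "ocr": 0.20, "rl": 0.40}
ACCURACY_WEIGHT, SPEED_WEIGHT = 0.75, 0.25

def challenge_score(accuracy: float, speed: float) -> float:
    """Weighted sum of accuracy/reward score and speed score for one challenge."""
    return ACCURACY_WEIGHT * accuracy + SPEED_WEIGHT * speed

def overall_score(scores: dict[str, tuple[float, float]]) -> float:
    """Weighted sum over challenges, e.g. {"asr": (0.9, 0.6), "cv": (0.8, 0.5), ...}."""
    return sum(
        CHALLENGE_WEIGHTS[name] * challenge_score(accuracy, speed)
        for name, (accuracy, speed) in scores.items()
    )
```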
Your automatic speech recognition challenge is to transcribe a noisy recording of speech. Given an audio file, your model is expected to produce a text transcript of what was spoken.
All speech will be in English, but the speaker may speak with a variety of accents. In each audio file, there is only one speaker.
The input audio for the Advanced track will contain more noise than those for the Novice track.
The accuracy score of your ASR model is based on a word-level error rate computed with jiwer, with the following transformation applied to both the reference and the predicted transcripts:
```python
jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.SubstituteRegexes({"-": " "}),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])
```
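As an illustration of how such a transform is typically used, here is a sketch assuming a recent jiwer 3.x release, where `wer()` accepts `reference_transform` and `hypothesis_transform` arguments:

```python
import jiwer

transform = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.SubstituteRegexes({"-": " "}),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])

# Word error rate between a reference transcript and a model prediction.
error = jiwer.wer(
    "the quick brown fox",
    "the quick-brown fox",
    reference_transform=transform,
    hypothesis_transform=transform,
)
print(error)  # 0.0 once hyphens become spaces and case is ignored
```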
Input and output formats
Input
The input is sent via a POST request to the `/asr` route on port 5001. It is a JSON document structured as such:
```json
{
  "instances": [
    {
      "key": 0,
      "b64": "BASE64_ENCODED_AUDIO"
    },
    ...
  ]
}
```
The `b64` key of each object in the `instances` list contains the base64-encoded bytes of the input audio in WAV format. The length of the `instances` list is variable.
Output
Your route handler function must return a `dict` with this structure:
```json
{
  "predictions": [
    "Predicted transcript one.",
    "Predicted transcript two.",
    ...
  ]
}
```
where each string in `predictions` is the predicted ASR transcription for the corresponding audio file.

The $k$-th element of `predictions` must be the prediction corresponding to the $k$-th element of `instances` for all $k$; in particular, the length of `predictions` must equal that of `instances`.
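To make the format concrete, here is a minimal route-handler sketch. It uses FastAPI and a placeholder `transcribe()` function, both of which are illustrative assumptions; the actual template repository may structure this differently.

```python
import base64

from fastapi import FastAPI, Request

app = FastAPI()

def transcribe(wav_bytes: bytes) -> str:
    """Placeholder: run your ASR model on raw WAV bytes and return a transcript."""
    raise NotImplementedError

@app.post("/asr")
async def asr(request: Request) -> dict:
    body = await request.json()
    predictions = []
    for instance in body["instances"]:
        wav_bytes = base64.b64decode(instance["b64"])  # decode the WAV payload
        predictions.append(transcribe(wav_bytes))
    # One transcript per instance, in the same order as the input list.
    return {"predictions": predictions}
```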
Your CV challenge is to detect and classify objects in an image. Given an image, your model is expected to produce the bounding box and predicted category of every occurrence of an object belonging to a category in the target list. Each scene may contain zero or more targets.
Your output is expected to contain the following values:
- $(x, y)$: The coordinates in pixels of the top-left corner of the predicted bounding box. $x$ is the horizontal coordinate and $y$ is the vertical.
- $(w, h)$: The width and height in pixels of the predicted bounding box.
- `category_id`: The index of the predicted category in the target list.

All bounding box coordinates should be zero-indexed; that is, the top-left pixel of the image has coordinates $(0, 0)$.

The categories in the target list are:
Category index | Object type |
---|---|
0 | cargo aircraft |
1 | commercial aircraft |
2 | drone |
3 | fighter jet |
4 | fighter plane |
5 | helicopter |
6 | light aircraft |
7 | missile |
8 | truck |
9 | car |
10 | tank |
11 | bus |
12 | van |
13 | cargo ship |
14 | yacht |
15 | cruise ship |
16 | warship |
17 | sailboat |
The input images for the Advanced track will contain more noise than those for the Novice track. The targets in the Advanced track input images will also be smaller in size than those for the Novice track.
The accuracy score is calculated by mean average precision (mAP) over IoU thresholds ranging from 0.5 to 0.95 with a stride of 0.05, otherwise known as mAP@[.5:.05:.95].
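For intuition, mAP@[.5:.05:.95] averages the average precision computed at each IoU threshold in 0.50, 0.55, ..., 0.95. Below is a minimal sketch of the IoU computation for two boxes in the $[x, y, w, h]$ format defined above; it is an illustration, not the official evaluation code.

```python
def iou(box_a: list[float], box_b: list[float]) -> float:
    """Intersection over union for two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive at threshold 0.5 only if
# iou(predicted_box, ground_truth_box) >= 0.5, and so on for higher thresholds.
```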
Input and output formats
Input
The input is sent via a POST request to the `/cv` route on port 5002. It is a JSON document structured as such:
```json
{
  "instances": [
    {
      "key": 0,
      "b64": "BASE64_ENCODED_IMAGE"
    },
    ...
  ]
}
```
The `b64` key of each object in the `instances` list contains the base64-encoded bytes of the input image in JPEG format. The length of the `instances` list is variable.
Output
Your route handler function must return a `dict` with this structure:
```json
{
  "predictions": [
    [
      {
        "bbox": [x, y, w, h],
        "category_id": category_id
      },
      ...
    ],
    ...
  ]
}
```
where `x`, `y`, `w`, `h`, and `category_id` are defined as above.
If your model detects no objects in a scene, your handler should output an empty list for that scene.
The $k$-th element of `predictions` must be the prediction corresponding to the $k$-th element of `instances` for all $k$; in particular, the length of `predictions` must equal that of `instances`.
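As a sketch of how detector output might be converted into this structure, assume a hypothetical model that returns corner-format boxes $(x_1, y_1, x_2, y_2)$ with a class index per detection; the names and input format here are illustrative, not part of the template.

```python
def to_prediction(detections: list[tuple[float, float, float, float, int]]) -> list[dict]:
    """Convert (x1, y1, x2, y2, class_index) detections for one image into the
    required list of {"bbox": [x, y, w, h], "category_id": ...} dicts."""
    results = []
    for x1, y1, x2, y2, class_index in detections:
        results.append({
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # corner format -> [x, y, w, h]
            "category_id": class_index,
        })
    return results  # an empty input list yields [] for a scene with no targets
```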
Your OCR challenge is to read text in a scanned document. Given an image of a scanned document, your model is expected to produce a transcription of the contents of the document.
Input documents may be typeset in a variety of fonts and layouts. They may contain letters of the Latin alphabet, numerals, and punctuation.
Provided with each image are word-, line-, and paragraph-level bounding boxes which you can use to train your OCR model. This data is provided in hOCR format, though you can easily convert it to other formats as necessary.
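For instance, hOCR is plain (X)HTML in which each word is typically a `<span class="ocrx_word">` element whose `title` attribute carries a `bbox x1 y1 x2 y2` entry. A minimal parsing sketch using BeautifulSoup follows; the library choice is an assumption, and any HTML/XML parser works.

```python
from bs4 import BeautifulSoup

def word_boxes(hocr_text: str) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Extract (word, (x1, y1, x2, y2)) pairs from an hOCR document."""
    soup = BeautifulSoup(hocr_text, "html.parser")
    boxes = []
    for span in soup.find_all("span", class_="ocrx_word"):
        # The title attribute looks like: "bbox 100 200 150 220; x_wconf 95"
        bbox_part = next(p for p in span["title"].split(";") if p.strip().startswith("bbox"))
        x1, y1, x2, y2 = (int(v) for v in bbox_part.split()[1:5])
        boxes.append((span.get_text(strip=True), (x1, y1, x2, y2)))
    return boxes
```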
The input documents for the Advanced track will be more visually degraded than those for the Novice track, and feature a greater diversity of layouts.
The accuracy score of your OCR model is based on a character-level error rate computed with jiwer, with the following transformation applied to both the reference and the predicted transcripts:
```python
jiwer.Compose([
    jiwer.SubstituteRegexes({"-": ""}),
    jiwer.RemoveWhiteSpace(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.ReduceToListOfListOfChars(),
])
```
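Analogously to the ASR example above, a character error rate can be computed with jiwer (again a sketch assuming jiwer 3.x keyword arguments):

```python
import jiwer

transform = jiwer.Compose([
    jiwer.SubstituteRegexes({"-": ""}),
    jiwer.RemoveWhiteSpace(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.ReduceToListOfListOfChars(),
])

# Character error rate between a reference document text and a prediction.
error = jiwer.cer(
    "Quarterly report",
    "Quarterly re-port",
    reference_transform=transform,
    hypothesis_transform=transform,
)
print(error)  # 0.0 once hyphens and whitespace are stripped
```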
Input and output formats
Input
The input is sent via a POST request to the `/ocr` route on port 5003. It is a JSON document structured as such:
```json
{
  "instances": [
    {
      "key": 0,
      "b64": "BASE64_ENCODED_IMAGE"
    },
    ...
  ]
}
```
The `b64` key of each object in the `instances` list contains the base64-encoded bytes of the input image in JPEG format. The length of the `instances` list is variable.
Output
Your route handler function must return a `dict` with this structure:
```json
{
  "predictions": [
    "Predicted transcript one.",
    "Predicted transcript two.",
    ...
  ]
}
```
where each string in `predictions` is the predicted OCR transcription for the corresponding document.

The $k$-th element of `predictions` must be the prediction corresponding to the $k$-th element of `instances` for all $k$; in particular, the length of `predictions` must equal that of `instances`.
Your RL challenge is to direct your agent through the game map while interacting with other agents and completing challenges.
Given an observation, your model is expected to produce the next action to take in order to complete its objective.
This section provides a technical overview of the TIL-AI 2025 mission gameplay, and assumes you are already familiar with the concept. Otherwise, take a look at the Opening Ceremony slides.
Each team's RL agent navigates a maze-like game map in a high-stakes wargame scenario. The map is structured as a 16x16 grid, to which agents' moves are discretized.
Each match is played by four teams, and consists of four rounds. Within each round, each team plays one of two roles: Scout or Guard. Teams rotate roles between rounds, such that by the end of a match, each team has played the Scout role once and the Guard role three times.
The Scout's objective is to:
- Avoid capture by the Guards,
- Collect Reconnaissance Points placed around the map, and
- Complete challenges located at fixed points in the map.
The Guards' objective is to:
- Capture the Scout.
The game takes place over discrete time steps. All agents move in step, and at the same rate. The game ends when 100 time steps elapse, or when the Scout is captured, whichever happens first.
Reconnaissance Points are evenly distributed across the map, one per grid cell. The Scout collects a Reconnaissance Point by traveling to the cell in which the point is located. After a Reconnaissance Point is collected, it never respawns. Nothing happens when a Guard travels to a cell with a Reconnaissance Point.
Challenges are sparsely distributed throughout the map. The Scout activates a challenge by traveling to the cell in which the challenge is located. A challenge is one of the ASR, CV, or OCR tasks. After activating the challenge, the Scout attempts to solve it, and its score is converted to points. No extra in-game time passes while the challenge is being solved; that is, activating a challenge "stops the world".
Note
During Qualifiers, each challenge (ASR, CV, OCR, and RL) is evaluated and scored separately. The RL agent will continue to receive rewards for activating challenges, but no challenge is actually sent to the ASR, CV, or OCR models, and their performance has no effect on the RL score.
The RL environment uses a discrete observation space. Observations are provided to your RL agent as a dictionary with five key/value pairs:
- `viewcone`: A 7×5 area around the agent, which extends 2 tiles in the left, rearward, and right directions, and 4 tiles in the forward direction. See the viewcone details below.
- `direction`: The direction the agent is currently facing. 0 denotes right, 1 down, 2 left, and 3 up.
- `scout`: 1 if the agent is currently a Scout, 0 if the agent is currently a Guard.
- `location`: The $(x, y)$ coordinate of the agent within the environment.
- `step`: The current step count of the environment.
The observation for each tile is represented as an 8-bit unsigned integer.
Agents can see 4 tiles ahead in the forward direction and 2 tiles ahead in other directions. Agents can also hear the presence of other agents so long as the other agents are within the viewcone, even if there are no sightlines to the other agents' locations. Each tile within the viewcone is represented as an 8-bit set of binary flags, where the 2 least significant bits denote tile information, while the other 6 bits denote tile occupancy and wall information.
Value of last 2 bits (`tile & 0b11`) | Meaning |
---|---|
0 | No vision |
1 | Empty tile |
2 | Recon (1 point) |
3 | Mission (5 points) |
Bit index (0 = least significant bit) | Tile contains a... |
---|---|
2 | Scout |
3 | Guard |
4 | Right wall |
5 | Bottom wall |
6 | Left wall |
7 | Top wall |
For example, a tile with value `130` would be a tile that contains a top wall (`128`) as well as a Recon point (`2`). A tile with value `123` would be a tile with left (`64`), bottom (`32`), and right (`16`) walls, a Guard (`8`), as well as a Mission (`3`).
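A minimal sketch of decoding a viewcone tile value using the bit layout above (the function and dictionary names are illustrative):

```python
def decode_tile(tile: int) -> dict:
    """Split an 8-bit viewcone tile value into tile type, occupancy, and walls."""
    tile_types = {0: "no vision", 1: "empty", 2: "recon", 3: "mission"}
    return {
        "tile": tile_types[tile & 0b11],    # 2 least significant bits
        "scout": bool(tile & 0b0000_0100),  # bit 2
        "guard": bool(tile & 0b0000_1000),  # bit 3
        "walls": {
            "right":  bool(tile & 0b0001_0000),  # bit 4
            "bottom": bool(tile & 0b0010_0000),  # bit 5
            "left":   bool(tile & 0b0100_0000),  # bit 6
            "top":    bool(tile & 0b1000_0000),  # bit 7
        },
    }

# decode_tile(130) -> recon tile with a top wall;
# decode_tile(123) -> mission tile containing a Guard, with left/bottom/right walls.
```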
The action space of the RL environment is discrete. During its turn, your agent may choose from one of the following actions:
Value | Action |
---|---|
0 | Move forward |
1 | Move backward |
2 | Turn left |
3 | Turn right |
4 | Stay |
The `forward` and `backward` actions cause the agent to move one grid cell in the direction the agent is currently facing (or in the reverse direction). The `left` and `right` actions cause the agent to rotate 90 degrees to its left or right. The `stay` action results in the agent not moving or turning.
These are the rewards used to evaluate RL models during Qualifiers.
Outcome | Scout Reward | Guard Reward |
---|---|---|
Scout collects a Reconnaissance Point | 1 | 0 |
Scout activates a challenge | 5 | 0 |
Scout is captured by a Guard | -50 | 50 |
Note
Reward refers specifically to the value of the reward function. This should not be confused with the score teams achieve in the Semifinals and Finals, which is calculated from a combination of metrics, including your other models' performance in the challenges your RL agent activates.
The container holding your trained RL agent is likely to be independent of the training environment, and can therefore have a more limited set of requirements/packages installed, solely for inference.
For the Novice track, the map used in the environment will be held fixed.
For the Advanced track, the map used in the environment will vary for each match, requiring participants' agents to be robust and able to adapt to many novel environments.
For the automated evaluation, the score returned will be a sum of all rewards attained by the agent during evaluation, divided by the number of rounds, then divided by 100.
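For instance, under this formula (a hypothetical worked example, not an actual result):

```python
total_reward = 250   # sum of all rewards attained across the evaluation
num_rounds = 4
score = total_reward / num_rounds / 100
print(score)  # 0.625
```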
Input and output formats
The input is sent via a POST request to the `/rl` route on port 5004. It is a JSON object structured as such:
```json
{
  "instances": [
    {
      "observation": {
        "viewcone": [[0, 0, ..., 0], [0, 0, ..., 0], ..., [0, 0, ..., 0]],
        "direction": 0,
        "location": [0, 0],
        "scout": 0,
        "step": 0
      }
    }
  ]
}
```
The observation is a representation of the inputs the agent senses in its environment. See the observation space specifications to learn how to interpret the observation.
The length of the `instances` array is 1.
During evaluation for Qualifiers, a GET request will be sent to the `/reset` route to signal that a round has ended, all agents are being reset to their starting positions (possibly with new roles), and any persistent state information your code may have stored must be cleared.
Your route handler function must return a `dict` with this structure:
```json
{
  "predictions": [
    {
      "action": 0
    }
  ]
}
```
The action is an integer representing the next movement your agent intends to take. See the action space specifications for a list of possible movements.
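Putting the input and output formats together, here is a minimal route-handler sketch using FastAPI, with a placeholder `select_action()` policy; both are illustrative assumptions rather than the template's actual structure.

```python
from fastapi import FastAPI, Request

app = FastAPI()

def select_action(observation: dict) -> int:
    """Placeholder policy: map an observation to an action in {0, 1, 2, 3, 4}."""
    return 4  # e.g. default to "Stay"

@app.post("/rl")
async def rl(request: Request) -> dict:
    body = await request.json()
    predictions = []
    for instance in body["instances"]:  # length 1 during evaluation
        predictions.append({"action": select_action(instance["observation"])})
    return {"predictions": predictions}

@app.get("/reset")
def reset() -> dict:
    # Clear any persistent per-round state here: the round has ended and
    # agents are being reset to their starting positions.
    return {}
```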