Token Counting - flight505/ContextCraft GitHub Wiki

Token Counting

Token counting is a fundamental feature in ContextCraft that helps you optimize your AI interactions by monitoring and managing token usage. This page explains what tokens are, how they're counted, and how to use token metrics effectively.

What Are Tokens?

Tokens are the basic units that AI models like GPT-4 process. A token can be:

  • A word
  • Part of a word
  • A character
  • A punctuation mark
  • A whitespace character

For English text, a token is roughly equivalent to 4 characters or 3/4 of a word on average. Code tends to tokenize differently than natural language, with symbols and operators often counting as individual tokens.
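The 4-characters-per-token rule of thumb above can be turned into a quick estimator. This is an approximation only — real model tokenizers depend on a learned vocabulary, and code usually produces more tokens per character than prose — but it is useful for ballpark figures:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb.

    Approximation only: actual tokenizers (e.g. BPE-based ones) depend on
    the model's vocabulary, and code tokenizes more densely than English.
    """
    return max(1, round(len(text) / 4))

estimate_tokens("Token counting is a fundamental feature.")  # 10 (estimated)
```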

How ContextCraft Counts Tokens

ContextCraft provides real-time token counting for:

  1. Individual Files: See token counts for each file in your project
  2. Selected Files: Track token usage for your current selection
  3. Processed Output: Monitor token counts after compression/comment removal
  4. Total Context Usage: View the total tokens that will be sent to the AI

Token Counting Algorithm

ContextCraft uses advanced tokenization algorithms similar to those used by AI models to provide accurate estimates of token usage:

  1. The tokenizer analyzes each character sequence in your code
  2. It applies language-specific rules to identify token boundaries
  3. It maintains a running count of identified tokens
  4. It displays this information in real-time as you select files
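The steps above can be sketched as a toy tokenizer. This is not ContextCraft's actual implementation (real model tokenizers use learned byte-pair-encoding merges); the regex rules here are simple stand-ins for the language-specific boundary rules the steps describe:

```python
import re

# Toy boundary rules: words, numbers, whitespace runs, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]+|\d+|\s+|[^\w\s]")

def count_tokens(source: str) -> int:
    count = 0
    for _ in TOKEN_RE.finditer(source):  # steps 1-2: scan and apply boundary rules
        count += 1                       # step 3: maintain a running count
    return count

count_tokens("total = price * 1.2")  # 11 tokens under these toy rules
```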

Token Counter Interface

The token counter appears in multiple locations within ContextCraft:

  1. Status Bar: Shows the total token count of your current selection
  2. File Tree: Displays token counts next to each file (when enabled)
  3. Control Container: Shows detailed token metrics for selected files
  4. Output Preview: Provides token counts for processed output

Understanding Token Metrics

ContextCraft provides several token metrics to help you manage context usage:

Raw Token Count

This is the unprocessed token count of your selected files before any optimizations.

Processed Token Count

This is the token count after applying optimizations like:

  • Code compression
  • Comment removal
  • Whitespace reduction
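As a rough illustration of how comment removal and whitespace reduction shrink the processed count, here is a deliberately naive sketch for Python-style source. It is not ContextCraft's real pipeline (it would mishandle `#` inside string literals, for example):

```python
import re

def processed_source(code: str) -> str:
    """Strip # comments, collapse runs of spaces, and drop blank lines."""
    no_comments = re.sub(r"#[^\n]*", "", code)       # comment removal
    collapsed = re.sub(r"[ \t]+", " ", no_comments)  # whitespace reduction
    return "\n".join(
        line.rstrip() for line in collapsed.splitlines() if line.strip()
    )

processed_source("x = 1  # set x\n\n# a comment\ny = 2\n")  # "x = 1\ny = 2"
```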

Token Savings

This metric shows how many tokens you're saving through optimization:

  • Displayed as a number and percentage
  • Updates in real-time as you adjust settings
  • Helps quantify the effectiveness of your optimization strategies
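The savings metric itself is simple arithmetic; a sketch of the number-plus-percentage form described above:

```python
def token_savings(raw: int, processed: int) -> tuple[int, float]:
    """Return (tokens saved, percentage of the raw count saved)."""
    saved = raw - processed
    pct = 100.0 * saved / raw if raw else 0.0
    return saved, pct

token_savings(raw=12_000, processed=9_000)  # (3000, 25.0)
```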

Model Context Limits

ContextCraft displays the context limit for your selected AI model:

  • Shows how much of the available context you're using
  • Provides visual indicators when approaching limits
  • Helps you stay within the model's processing capabilities
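Under the hood this is a simple ratio. The window size below is a hypothetical figure for the example; use whatever context limit ContextCraft reports for your selected model:

```python
def context_usage_pct(tokens: int, limit: int) -> float:
    """Percentage of the model's context window consumed by the selection."""
    return 100.0 * tokens / limit

context_usage_pct(96_000, 128_000)  # 75.0 (% of a hypothetical 128k window)
```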

Managing Token Usage

Token Budget Planning

  1. Set a Target Budget:

    • Determine how many tokens you want to allocate to code context
    • Leave room for the AI's response and your prompts
    • ContextCraft helps visualize your budget allocation
  2. Prioritize Files:

    • Select the most important files first
    • Use token counts to guide your selection process
    • Balance coverage with token efficiency
  3. Apply Optimizations Strategically:

    • Use compression for large files with repetitive patterns
    • Remove comments from heavily-documented files
    • Preserve critical files in their original form
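Steps 1 and 2 amount to a greedy selection under a budget. A sketch, assuming files are already ranked by importance (the paths and token counts below are made up for the example):

```python
def select_files(ranked: list[tuple[str, int]], budget: int) -> list[str]:
    """Take the highest-priority files that fit within the token budget.

    `ranked` is a list of (path, token_count) pairs, most important first.
    """
    selected: list[str] = []
    used = 0
    for path, tokens in ranked:
        if used + tokens <= budget:  # skip files that would blow the budget
            selected.append(path)
            used += tokens
    return selected

ranked = [("src/core.py", 4000), ("src/api.py", 3500), ("src/utils.py", 2500)]
select_files(ranked, budget=8000)  # ['src/core.py', 'src/api.py']
```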

Visual Token Indicators

ContextCraft provides visual cues to help manage tokens:

  1. Color-Coded Status:

    • Green: Well within token limits
    • Yellow: Approaching token limits
    • Red: Exceeding token limits
  2. Progress Bars:

    • Shows context usage relative to model limits
    • Updates dynamically as you adjust your selection
  3. Warning Notifications:

    • Alerts when you exceed recommended token limits
    • Provides suggestions for optimization
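The color-coded status maps naturally onto usage thresholds. The 75% and 100% cutoffs below are illustrative assumptions, not ContextCraft's documented values:

```python
def token_status(used: int, limit: int) -> str:
    """Map context usage to a color-coded status (thresholds assumed)."""
    ratio = used / limit
    if ratio > 1.0:
        return "red"     # exceeding token limits
    if ratio > 0.75:
        return "yellow"  # approaching token limits
    return "green"       # well within token limits
```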

Token Efficiency Best Practices

  1. Focus on Relevance:

    • Include only files directly relevant to your query
    • Exclude test files, generated code, and dependencies when possible
  2. Leverage Optimizations:

    • Use code compression for large files
    • Remove comments when they're not essential
    • Apply whitespace reduction for minor additional savings
  3. Balance Context and Detail:

    • For architectural questions: More files with compression
    • For implementation details: Fewer files without compression
  4. Monitor Token Usage Patterns:

    • Track which files consistently use the most tokens
    • Look for opportunities to refactor token-heavy files
    • Consider breaking large files into smaller modules
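Step 4 (monitoring usage patterns) can start as simply as ranking files by token count to spot refactoring candidates; the file names below are hypothetical:

```python
def heaviest_files(counts: dict[str, int], top: int = 3) -> list[tuple[str, int]]:
    """Rank files by token count, heaviest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]

heaviest_files({"a.py": 500, "b.py": 2000, "c.py": 1200, "d.py": 100}, top=2)
# [('b.py', 2000), ('c.py', 1200)]
```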

Language-Specific Token Considerations

Different programming languages have different tokenization characteristics:

Language                Token Efficiency   Notes
JavaScript/TypeScript   Medium             Symbols and operators count as separate tokens
Python                  High               Whitespace-significant, relatively token-efficient
Java                    Low                Verbose syntax uses more tokens
HTML/CSS                Low                Tags and attributes consume many tokens
JSON                    Medium             Structure overhead but predictable

Advanced Token Management

Custom Tokenization Rules

In some ContextCraft versions, you can define custom tokenization rules:

  1. Token Weight Adjustments:

    • Prioritize certain file types over others
    • Weight tokens by file importance
  2. Token Budgeting:

    • Allocate specific token budgets to different parts of your codebase
    • Get warnings when individual sections exceed their budget
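A per-section budget check of the kind described above might look like this sketch (the section names and budget figures are hypothetical):

```python
def over_budget(section_tokens: dict[str, int],
                budgets: dict[str, int]) -> list[str]:
    """Return sections whose token usage exceeds their allocated budget."""
    return [
        name for name, used in section_tokens.items()
        if used > budgets.get(name, float("inf"))  # no budget = no warning
    ]

over_budget({"frontend": 9000, "backend": 4000},
            {"frontend": 8000, "backend": 6000})  # ['frontend']
```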

Token Analytics

ContextCraft may provide token usage analytics:

  1. Historical Usage:

    • Track token usage over time
    • Identify optimization opportunities
  2. Project-Wide Analysis:

    • Get insights into token distribution across your project
    • Find token-heavy files and patterns
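Project-wide distribution can be approximated by aggregating per-file counts, for example by extension; a sketch assuming you already have per-file token counts:

```python
import os
from collections import defaultdict

def tokens_by_extension(counts: dict[str, int]) -> dict[str, int]:
    """Aggregate per-file token counts by file extension."""
    totals: dict[str, int] = defaultdict(int)
    for path, n in counts.items():
        ext = os.path.splitext(path)[1] or "(none)"
        totals[ext] += n
    return dict(totals)

tokens_by_extension({"a.py": 100, "b.py": 50, "c.md": 20})
# {'.py': 150, '.md': 20}
```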
