Neural Networks
Neural networks are computational models inspired by biological neural networks, consisting of interconnected nodes (neurons) that process and transmit information. This comprehensive guide covers the fundamental architectures, training algorithms, and implementation strategies for various neural network types.
Table of Contents
- Feedforward Networks
- Convolutional Networks
- Recurrent Networks
- Attention Mechanisms
- Custom Architectures
- Training Algorithms
- Implementation Guidelines
- Performance Optimization
Feedforward Networks
Feedforward neural networks are the simplest type of artificial neural network where connections between nodes do not form cycles. Information flows in one direction from input to output.
Multi-Layer Perceptron (MLP)
The most common feedforward architecture, consisting of:
- Input Layer: Receives raw data
- Hidden Layers: Process information through weighted connections
- Output Layer: Produces final predictions
```python
# Example MLP Architecture
class MLP:
    def __init__(self, input_size, hidden_sizes, output_size):
        self.layers = []
        prev_size = input_size
        # Hidden layers
        for hidden_size in hidden_sizes:
            self.layers.append(Linear(prev_size, hidden_size))
            self.layers.append(ReLU())
            prev_size = hidden_size
        # Output layer
        self.layers.append(Linear(prev_size, output_size))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```
Key Characteristics
- Universal Approximation: Can approximate any continuous function
- Non-linearity: Activation functions enable complex pattern learning
- Depth vs Width: Deeper networks can represent more complex functions
Common Activation Functions
- ReLU: `f(x) = max(0, x)`. Most popular; helps prevent vanishing gradients
- Sigmoid: `f(x) = 1/(1 + e^(-x))`. Output range [0, 1]
- Tanh: `f(x) = tanh(x)`. Output range [-1, 1]
- Leaky ReLU: `f(x) = max(αx, x)`. Prevents dead neurons
- GELU: `f(x) = x * Φ(x)`. Smooth activation used in transformers
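For reference, these activations can be written as short NumPy functions. The sketch below is framework-agnostic (tanh is available directly as `np.tanh`) and uses the common tanh approximation for GELU.

```python
import numpy as np

# Illustrative NumPy versions of the activations listed above
def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```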
Convolutional Networks
Convolutional Neural Networks (CNNs) are specialized for processing grid-like data such as images, using convolution operations to detect local features.
Core Components
Convolutional Layer
Applies learnable filters to detect features:
```python
# Convolution operation (single-channel pseudocode)
def conv2d(input, kernel, stride=1, padding=0):
    # Apply kernel across input with specified stride and padding
    # (assumes `input` has already been zero-padded when padding > 0)
    output_height = (input.height + 2*padding - kernel.height) // stride + 1
    output_width = (input.width + 2*padding - kernel.width) // stride + 1
    output = zeros(output_height, output_width)
    for i in range(output_height):
        for j in range(output_width):
            output[i, j] = sum(input[i*stride:i*stride+kernel.height,
                                     j*stride:j*stride+kernel.width] * kernel)
    return output
```
Pooling Layer
Reduces spatial dimensions while preserving important features:
- Max Pooling: Takes maximum value in each region
- Average Pooling: Takes average value in each region
- Global Average Pooling: Reduces feature maps to single values
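As a quick illustration of how pooling shrinks feature maps, here is a minimal max-pooling sketch over a single-channel NumPy array, written in the same loop style as the conv2d pseudocode above; it is not tied to any framework.

```python
import numpy as np

def max_pool2d(x, kernel_size=2, stride=2):
    # x: single-channel feature map of shape (H, W)
    out_h = (x.shape[0] - kernel_size) // stride + 1
    out_w = (x.shape[1] - kernel_size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + kernel_size,
                       j * stride:j * stride + kernel_size]
            out[i, j] = window.max()  # keep only the strongest activation
    return out
```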
CNN Architecture Example
```python
class CNN:
    def __init__(self, num_classes):
        self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1)
        self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1)
        self.conv3 = Conv2D(64, 128, kernel_size=3, padding=1)
        self.pool = MaxPool2D(kernel_size=2, stride=2)
        self.dropout = Dropout(0.5)
        # 128 * 4 * 4 assumes 32x32 inputs (three 2x2 poolings: 32 -> 16 -> 8 -> 4)
        self.fc1 = Linear(128 * 4 * 4, 512)
        self.fc2 = Linear(512, num_classes)

    def forward(self, x):
        x = relu(self.conv1(x))
        x = self.pool(x)
        x = relu(self.conv2(x))
        x = self.pool(x)
        x = relu(self.conv3(x))
        x = self.pool(x)
        x = x.flatten()  # flatten feature maps before the fully connected layers
        x = relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
```
Advanced CNN Architectures
ResNet (Residual Networks)
Introduces skip connections to enable training of very deep networks:
```python
class ResidualBlock:
    def __init__(self, channels):
        self.conv1 = Conv2D(channels, channels, 3, padding=1)
        self.conv2 = Conv2D(channels, channels, 3, padding=1)
        self.bn1 = BatchNorm2D(channels)
        self.bn2 = BatchNorm2D(channels)

    def forward(self, x):
        residual = x
        out = relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += residual  # Skip connection
        return relu(out)
```
DenseNet
Each layer connects to all subsequent layers:
- Promotes feature reuse
- Reduces vanishing gradient problem
- Requires fewer parameters
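A minimal sketch of the dense-connectivity idea, written in the same pseudocode style as the other blocks on this page (Conv2D, relu, and concatenate are assumed to be defined as above; the growth_rate naming follows the DenseNet paper):

```python
class DenseBlock:
    def __init__(self, in_channels, growth_rate, num_layers):
        # Each layer sees the channel-wise concatenation of all previous feature maps
        self.layers = []
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(Conv2D(channels, growth_rate, kernel_size=3, padding=1))
            channels += growth_rate

    def forward(self, x):
        features = [x]
        for conv in self.layers:
            out = relu(conv(concatenate(features, axis=1)))  # reuse all earlier features
            features.append(out)
        return concatenate(features, axis=1)
```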
EfficientNet
Systematically scales network dimensions (depth, width, resolution):
- Compound scaling methodology
- Better accuracy-efficiency trade-off
- Mobile-friendly architectures
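A sketch of the compound-scaling rule: a single coefficient phi scales depth, width, and resolution together through per-dimension constants alpha, beta, gamma. The baseline numbers and coefficient values below are illustrative defaults, not a reproduction of any particular EfficientNet variant.

```python
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    # depth ~ alpha^phi, width ~ beta^phi, resolution ~ gamma^phi
    depth = int(round(base_depth * alpha ** phi))
    width = int(round(base_width * beta ** phi))
    resolution = int(round(base_resolution * gamma ** phi))
    return depth, width, resolution

# Example with a hypothetical baseline: scale a small network by phi = 2
# depth, width, resolution = compound_scale(18, 64, 224, phi=2)
```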
Recurrent Networks
Recurrent Neural Networks (RNNs) process sequential data by maintaining hidden states that capture temporal dependencies.
Vanilla RNN
Basic recurrent architecture with simple hidden state update:
```python
class VanillaRNN:
    def __init__(self, input_size, hidden_size, output_size):
        self.hidden_size = hidden_size
        self.Wxh = initialize_weights(input_size, hidden_size)
        self.Whh = initialize_weights(hidden_size, hidden_size)
        self.Why = initialize_weights(hidden_size, output_size)
        self.bh = zeros(hidden_size)
        self.by = zeros(output_size)

    def forward(self, inputs):
        hidden = zeros(self.hidden_size)
        outputs = []
        for x in inputs:
            hidden = tanh(x @ self.Wxh + hidden @ self.Whh + self.bh)
            output = hidden @ self.Why + self.by
            outputs.append(output)
        return outputs, hidden
```
Long Short-Term Memory (LSTM)
Addresses vanishing gradient problem with gating mechanisms:
LSTM Cell Components
- Forget Gate: Decides what information to discard
- Input Gate: Determines what new information to store
- Output Gate: Controls what parts of cell state to output
```python
class LSTMCell:
    def forward(self, x, hidden, cell):
        # Concatenate input and hidden state
        combined = concatenate([x, hidden])
        # Forget gate
        forget_gate = sigmoid(combined @ self.Wf + self.bf)
        # Input gate
        input_gate = sigmoid(combined @ self.Wi + self.bi)
        candidate = tanh(combined @ self.Wc + self.bc)
        # Update cell state
        cell = forget_gate * cell + input_gate * candidate
        # Output gate
        output_gate = sigmoid(combined @ self.Wo + self.bo)
        hidden = output_gate * tanh(cell)
        return hidden, cell
```
Gated Recurrent Unit (GRU)
Simplified alternative to LSTM with fewer parameters:
- Combines forget and input gates into update gate
- Merges cell state and hidden state
- Often performs similarly to LSTM with faster training
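A GRU cell sketch in the same style as the LSTMCell above; the weight matrices (Wz, Wr, Wh) and biases are assumed to be initialized elsewhere, as in the LSTM example.

```python
class GRUCell:
    def forward(self, x, hidden):
        combined = concatenate([x, hidden])
        # Update gate (plays the role of the LSTM forget + input gates)
        update_gate = sigmoid(combined @ self.Wz + self.bz)
        # Reset gate controls how much of the past state feeds the candidate
        reset_gate = sigmoid(combined @ self.Wr + self.br)
        candidate = tanh(concatenate([x, reset_gate * hidden]) @ self.Wh + self.bh)
        # Interpolate between the previous hidden state and the candidate
        hidden = (1 - update_gate) * hidden + update_gate * candidate
        return hidden
```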
Bidirectional RNNs
Process sequences in both forward and backward directions:
```python
class BiLSTM:
    def __init__(self, input_size, hidden_size):
        self.forward_lstm = LSTM(input_size, hidden_size)
        self.backward_lstm = LSTM(input_size, hidden_size)

    def forward(self, inputs):
        forward_outputs = self.forward_lstm(inputs)
        backward_outputs = self.backward_lstm(reverse(inputs))
        backward_outputs = reverse(backward_outputs)
        # Concatenate forward and backward outputs
        outputs = concatenate([forward_outputs, backward_outputs], axis=-1)
        return outputs
```
Attention Mechanisms
Attention mechanisms allow models to focus on relevant parts of input sequences, revolutionizing sequence-to-sequence tasks.
Scaled Dot-Product Attention
Core attention computation used in Transformers:
```python
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query matrix [batch_size, seq_len, d_k]
    K: Key matrix   [batch_size, seq_len, d_k]
    V: Value matrix [batch_size, seq_len, d_v]
    """
    d_k = Q.size(-1)
    # Compute attention scores
    scores = (Q @ K.transpose(-2, -1)) / sqrt(d_k)
    # Apply mask if provided
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Apply softmax
    attention_weights = softmax(scores, dim=-1)
    # Apply attention to values
    output = attention_weights @ V
    return output, attention_weights
```
Multi-Head Attention
Allows model to attend to different representation subspaces:
```python
class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = Linear(d_model, d_model)
        self.W_k = Linear(d_model, d_model)
        self.W_v = Linear(d_model, d_model)
        self.W_o = Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Linear transformations and split into heads
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Apply attention
        attention_output, attention_weights = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads and apply output projection
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(attention_output)
```
Transformer Architecture
Complete transformer block combining attention and feed-forward layers:
```python
class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.feed_forward = Sequential([
            Linear(d_model, d_ff),
            ReLU(),
            Linear(d_ff, d_model)
        ])
        self.dropout = Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
```
Attention Variants
Cross-Attention
Attention between different sequences (encoder-decoder):
```python
def cross_attention(encoder_output, decoder_hidden):
    # Query from decoder, Keys and Values from encoder
    attention_output, _ = scaled_dot_product_attention(
        Q=decoder_hidden,
        K=encoder_output,
        V=encoder_output
    )
    return attention_output
```
Self-Attention
Attention within the same sequence:
- Captures long-range dependencies
- Parallelizable computation
- Foundation of Transformer models
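Concretely, self-attention is the scaled dot-product attention defined earlier with queries, keys, and values all projected from the same sequence. A minimal sketch, where the projection matrices W_q, W_k, W_v are assumed to be learned parameters:

```python
def self_attention(x, W_q, W_k, W_v):
    # x: [batch_size, seq_len, d_model]; Q, K, V all come from the same sequence
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v
    output, attention_weights = scaled_dot_product_attention(Q, K, V)
    return output, attention_weights
```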
Custom Architectures
Design principles for creating specialized neural network architectures.
Architecture Design Principles
Inductive Biases
Build assumptions about the problem into the architecture:
- Translation Invariance: CNNs for spatial data
- Permutation Invariance: Graph networks for unordered data
- Temporal Dependencies: RNNs for sequential data
Modular Design
Create reusable components:
```python
class ResidualBlock:
    def __init__(self, dim, norm_type='batch'):
        self.conv1 = Conv1D(dim, dim, 3, padding=1)
        self.conv2 = Conv1D(dim, dim, 3, padding=1)
        self.norm1 = get_norm_layer(norm_type, dim)
        self.norm2 = get_norm_layer(norm_type, dim)
        self.activation = ReLU()

    def forward(self, x):
        residual = x
        x = self.activation(self.norm1(self.conv1(x)))
        x = self.norm2(self.conv2(x))
        return self.activation(x + residual)

class CustomArchitecture:
    def __init__(self, input_dim, num_blocks=4):
        self.embedding = Linear(input_dim, 256)
        self.blocks = [ResidualBlock(256) for _ in range(num_blocks)]
        self.output = Linear(256, 1)

    def forward(self, x):
        x = self.embedding(x)
        for block in self.blocks:
            x = block(x)
        return self.output(x)
```
Hybrid Architectures
CNN-RNN Combination
For spatiotemporal data:
```python
class CNNRNN:
    def __init__(self, cnn_features, rnn_hidden, num_classes):
        self.cnn = CNN(output_features=cnn_features)
        self.rnn = LSTM(cnn_features, rnn_hidden)
        self.classifier = Linear(rnn_hidden, num_classes)

    def forward(self, sequence_of_images):
        cnn_features = []
        for image in sequence_of_images:
            features = self.cnn(image)
            cnn_features.append(features)
        rnn_output, _ = self.rnn(cnn_features)
        # Classify from the last time step's hidden representation
        return self.classifier(rnn_output[-1])
```
Attention-Enhanced CNNs
Adding attention to convolutional networks:
```python
class AttentionCNN:
    def __init__(self, num_classes):
        self.backbone = ResNet50()
        self.attention = SelfAttention(2048)
        self.classifier = Linear(2048, num_classes)

    def forward(self, x):
        features = self.backbone(x)  # [B, 2048, H, W]
        # Reshape for attention
        B, C, H, W = features.shape
        features_flat = features.view(B, C, H*W).transpose(1, 2)  # [B, HW, C]
        # Apply attention
        attended_features = self.attention(features_flat)  # [B, HW, C]
        # Global average pooling
        pooled_features = attended_features.mean(dim=1)  # [B, C]
        return self.classifier(pooled_features)
```
Neural Architecture Search (NAS)
Evolutionary Approach
```python
import random

class ArchitectureGenome:
    def __init__(self):
        self.layers = []
        self.connections = []
        self.fitness = 0.0

    def mutate(self):
        # Add, remove, or modify layers
        mutation_type = random.choice(['add_layer', 'remove_layer', 'modify_layer'])
        if mutation_type == 'add_layer':
            layer_type = random.choice(['conv', 'attention', 'residual'])
            self.layers.append(create_layer(layer_type))
        elif mutation_type == 'remove_layer' and len(self.layers) > 1:
            self.layers.pop(random.randint(0, len(self.layers) - 1))
        elif mutation_type == 'modify_layer':
            layer_idx = random.randint(0, len(self.layers) - 1)
            self.layers[layer_idx] = modify_layer(self.layers[layer_idx])

    def crossover(self, other):
        # Combine two parent architectures at a split point
        child = ArchitectureGenome()
        split_point = len(self.layers) // 2
        child.layers = self.layers[:split_point] + other.layers[split_point:]
        return child
```
Training Algorithms
Comprehensive overview of neural network training methods and optimization techniques.
Gradient Descent Variants
Stochastic Gradient Descent (SGD)
Basic optimization algorithm:
```python
class SGD:
    def __init__(self, parameters, lr=0.01, momentum=0.0, weight_decay=0.0):
        self.parameters = parameters
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.velocity = {param: zeros_like(param) for param in parameters}

    def step(self, gradients):
        for param, grad in zip(self.parameters, gradients):
            if self.weight_decay > 0:
                grad += self.weight_decay * param
            if self.momentum > 0:
                self.velocity[param] = self.momentum * self.velocity[param] + grad
                param -= self.lr * self.velocity[param]
            else:
                param -= self.lr * grad
```
Adam Optimizer
Adaptive learning rate method:
```python
class Adam:
    def __init__(self, parameters, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.parameters = parameters
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0  # time step
        self.m = {param: zeros_like(param) for param in parameters}  # first moment
        self.v = {param: zeros_like(param) for param in parameters}  # second moment

    def step(self, gradients):
        self.t += 1
        for param, grad in zip(self.parameters, gradients):
            # Update biased first moment estimate
            self.m[param] = self.beta1 * self.m[param] + (1 - self.beta1) * grad
            # Update biased second raw moment estimate
            self.v[param] = self.beta2 * self.v[param] + (1 - self.beta2) * grad**2
            # Compute bias-corrected first moment estimate
            m_hat = self.m[param] / (1 - self.beta1**self.t)
            # Compute bias-corrected second raw moment estimate
            v_hat = self.v[param] / (1 - self.beta2**self.t)
            # Update parameters
            param -= self.lr * m_hat / (sqrt(v_hat) + self.eps)
```
Learning Rate Scheduling
Cosine Annealing
```python
class CosineAnnealingLR:
    def __init__(self, optimizer, T_max, eta_min=0):
        self.optimizer = optimizer
        self.T_max = T_max
        self.eta_min = eta_min
        self.base_lr = optimizer.lr
        self.current_epoch = 0

    def step(self):
        self.current_epoch += 1
        lr = self.eta_min + (self.base_lr - self.eta_min) * \
             (1 + cos(pi * self.current_epoch / self.T_max)) / 2
        self.optimizer.lr = lr
```
Warm-up and Decay
```python
class WarmupCosineSchedule:
    def __init__(self, optimizer, warmup_steps, total_steps):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.base_lr = optimizer.lr
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            # Warm-up phase
            lr = self.base_lr * self.step_count / self.warmup_steps
        else:
            # Cosine decay phase
            progress = (self.step_count - self.warmup_steps) / \
                       (self.total_steps - self.warmup_steps)
            lr = 0.5 * self.base_lr * (1 + cos(pi * progress))
        self.optimizer.lr = lr
```
Regularization Techniques
Dropout
Randomly zero out neurons during training:
```python
class Dropout:
    def __init__(self, p=0.5):
        self.p = p
        self.training = True

    def forward(self, x):
        if self.training:
            mask = (random.uniform(0, 1, size=x.shape) > self.p).astype(float)
            return x * mask / (1 - self.p)  # Scale to maintain expected value
        else:
            return x
```
Batch Normalization
Normalize layer inputs:
```python
class BatchNorm1D:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        # Learnable parameters
        self.gamma = ones(num_features)
        self.beta = zeros(num_features)
        # Running statistics
        self.running_mean = zeros(num_features)
        self.running_var = ones(num_features)
        self.training = True

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(axis=0)
            batch_var = x.var(axis=0)
            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + \
                                self.momentum * batch_mean
            self.running_var = (1 - self.momentum) * self.running_var + \
                               self.momentum * batch_var
            # Normalize
            x_norm = (x - batch_mean) / sqrt(batch_var + self.eps)
        else:
            # Use running statistics during inference
            x_norm = (x - self.running_mean) / sqrt(self.running_var + self.eps)
        return self.gamma * x_norm + self.beta
```
Advanced Training Techniques
Gradient Clipping
Prevent exploding gradients:
```python
def clip_grad_norm(parameters, max_norm):
    total_norm = 0
    for param in parameters:
        param_norm = param.grad.norm()
        total_norm += param_norm ** 2
    total_norm = sqrt(total_norm)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for param in parameters:
            param.grad *= clip_coef
```
Mixed Precision Training
Use both float16 and float32 for efficiency:
```python
class MixedPrecisionTrainer:
    def __init__(self, model, optimizer, loss_scale=128.0):
        self.model = model
        self.optimizer = optimizer
        self.loss_scale = loss_scale

    def train_step(self, inputs, targets):
        # Forward pass in float16
        with autocast():
            outputs = self.model(inputs)
            loss = criterion(outputs, targets)
        # Scale loss to prevent underflow
        scaled_loss = loss * self.loss_scale
        # Backward pass
        scaled_loss.backward()
        # Unscale gradients
        for param in self.model.parameters():
            param.grad /= self.loss_scale
        # Check for invalid gradients
        if not self.has_inf_or_nan_gradients():
            self.optimizer.step()
        self.optimizer.zero_grad()
        return loss.item()
```
Implementation Guidelines
Data Pipeline Best Practices
Efficient Data Loading
```python
class DataPipeline:
    def __init__(self, dataset, batch_size, num_workers=4):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers

    def get_dataloader(self, shuffle=True):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=shuffle,
            num_workers=self.num_workers,
            pin_memory=True,          # Faster GPU transfer
            persistent_workers=True   # Keep workers alive
        )

    def apply_transforms(self, transform_list):
        self.dataset.transform = Compose(transform_list)
```
Data Augmentation
```python
class AugmentationPipeline:
    def __init__(self):
        self.transforms = [
            RandomHorizontalFlip(p=0.5),
            RandomRotation(degrees=10),
            ColorJitter(brightness=0.2, contrast=0.2),
            RandomCrop(224, padding=4),
            Normalize(mean=[0.485, 0.456, 0.406],
                      std=[0.229, 0.224, 0.225])
        ]

    def __call__(self, image):
        for transform in self.transforms:
            image = transform(image)
        return image
```
Model Initialization
Xavier/Glorot Initialization
```python
def xavier_uniform(tensor):
    fan_in = tensor.size(-1)
    fan_out = tensor.size(0)
    std = sqrt(2.0 / (fan_in + fan_out))
    bound = sqrt(3.0) * std  # uniform bound giving the same variance as the normal variant
    return tensor.uniform_(-bound, bound)

def xavier_normal(tensor):
    fan_in = tensor.size(-1)
    fan_out = tensor.size(0)
    std = sqrt(2.0 / (fan_in + fan_out))
    return tensor.normal_(0, std)
```
He Initialization
```python
def he_uniform(tensor):
    fan_in = tensor.size(-1)
    std = sqrt(2.0 / fan_in)
    bound = sqrt(3.0) * std
    return tensor.uniform_(-bound, bound)

def he_normal(tensor):
    fan_in = tensor.size(-1)
    std = sqrt(2.0 / fan_in)
    return tensor.normal_(0, std)
```
Training Loop Template
```python
class Trainer:
    def __init__(self, model, optimizer, criterion, device):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device
        self.history = {'train_loss': [], 'val_loss': [], 'val_acc': []}

    def train_epoch(self, train_loader):
        self.model.train()
        total_loss = 0
        num_batches = 0
        for batch_idx, (data, targets) in enumerate(train_loader):
            data, targets = data.to(self.device), targets.to(self.device)
            # Forward pass
            outputs = self.model(data)
            loss = self.criterion(outputs, targets)
            # Backward pass
            self.optimizer.zero_grad()
            loss.backward()
            # Gradient clipping (optional)
            clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()
            total_loss += loss.item()
            num_batches += 1
        return total_loss / num_batches

    def validate(self, val_loader):
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for data, targets in val_loader:
                data, targets = data.to(self.device), targets.to(self.device)
                outputs = self.model(data)
                loss = self.criterion(outputs, targets)
                total_loss += loss.item()
                _, predicted = outputs.max(1)
                total += targets.size(0)
                correct += predicted.eq(targets).sum().item()
        accuracy = 100.0 * correct / total
        avg_loss = total_loss / len(val_loader)
        return avg_loss, accuracy

    def train(self, train_loader, val_loader, epochs, scheduler=None):
        best_val_acc = 0
        for epoch in range(epochs):
            # Training
            train_loss = self.train_epoch(train_loader)
            # Validation
            val_loss, val_acc = self.validate(val_loader)
            # Learning rate scheduling
            if scheduler:
                scheduler.step()
            # Save best model
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                self.save_checkpoint('best_model.pth')
            # Record history
            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_loss)
            self.history['val_acc'].append(val_acc)
            print(f'Epoch {epoch+1}/{epochs}:')
            print(f'  Train Loss: {train_loss:.4f}')
            print(f'  Val Loss: {val_loss:.4f}')
            print(f'  Val Acc: {val_acc:.2f}%')
            print()

    def save_checkpoint(self, filepath):
        torch.save({
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'history': self.history
        }, filepath)

    def load_checkpoint(self, filepath):
        checkpoint = torch.load(filepath)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.history = checkpoint['history']
```
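A hypothetical usage sketch wiring the trainer together with standard PyTorch components; the model, data loaders, and hyperparameters here are placeholders rather than values prescribed by this guide.

```python
# Hypothetical setup: names and hyperparameters are placeholders
model = CNN(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

trainer = Trainer(model, optimizer, criterion, device='cuda')
# train_loader and val_loader are assumed to be built elsewhere (e.g. via DataPipeline)
trainer.train(train_loader, val_loader, epochs=50, scheduler=scheduler)
```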
Performance Optimization
Model Optimization Techniques
Quantization
Reduce model precision for faster inference:
```python
def quantize_model(model, calibration_data):
    # Post-training quantization
    model.eval()
    # Fuse operations (conv + bn + relu)
    model_fused = fuse_modules(model, [['conv', 'bn', 'relu']])
    # Prepare for quantization
    model_prepared = prepare(model_fused, inplace=False)
    # Calibrate with representative data
    with torch.no_grad():
        for data, _ in calibration_data:
            model_prepared(data)
    # Convert to quantized model
    model_quantized = convert(model_prepared, inplace=False)
    return model_quantized
```
Pruning
Remove unnecessary connections:
```python
def structured_pruning(model, pruning_ratio=0.2):
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # Calculate importance scores (L1 norm per output channel)
            importance = torch.abs(module.weight).sum(dim=(1, 2, 3))
            # Determine channels to prune
            num_channels = module.weight.size(0)
            num_prune = int(num_channels * pruning_ratio)
            _, indices_to_prune = torch.topk(importance, num_prune, largest=False)
            # Create mask
            mask = torch.ones(num_channels, dtype=torch.bool)
            mask[indices_to_prune] = False
            # Apply pruning to this layer's output channels
            # (downstream layers' input channels must be adjusted separately)
            module.weight.data = module.weight.data[mask]
            if module.bias is not None:
                module.bias.data = module.bias.data[mask]
```
Knowledge Distillation
Transfer knowledge from large model to smaller one:
```python
class DistillationLoss:
    def __init__(self, temperature=4.0, alpha=0.5):
        self.temperature = temperature
        self.alpha = alpha
        self.kl_div = nn.KLDivLoss(reduction='batchmean')
        self.ce_loss = nn.CrossEntropyLoss()

    def __call__(self, student_outputs, teacher_outputs, targets):
        # Soft targets from teacher
        soft_targets = F.softmax(teacher_outputs / self.temperature, dim=1)
        soft_prob = F.log_softmax(student_outputs / self.temperature, dim=1)
        # Distillation loss
        distill_loss = self.kl_div(soft_prob, soft_targets) * (self.temperature ** 2)
        # Hard targets loss
        student_loss = self.ce_loss(student_outputs, targets)
        # Combined loss
        return self.alpha * distill_loss + (1 - self.alpha) * student_loss
```
Hardware Optimization
GPU Memory Management
```python
class MemoryOptimizer:
    def __init__(self, model):
        self.model = model

    def optimize_memory(self):
        # Enable gradient checkpointing
        self.model.gradient_checkpointing_enable()
        # Use mixed precision
        self.scaler = GradScaler()
        # Clear cache periodically
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def forward_with_checkpointing(self, x):
        # Trade compute for memory
        return checkpoint(self.model, x)
```
Distributed Training
```python
class DistributedTrainer:
    def __init__(self, model, world_size):
        self.model = model
        self.world_size = world_size

    def setup(self, rank):
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        # Initialize process group
        dist.init_process_group("nccl", rank=rank, world_size=self.world_size)
        # Set device and move the model to it
        torch.cuda.set_device(rank)
        self.model.to(rank)
        # Wrap model for distributed training (after the process group exists)
        self.model = DistributedDataParallel(self.model, device_ids=[rank])
```
Inference Optimization
Model Serving
```python
class ModelServer:
    def __init__(self, model_path, device='cuda'):
        self.device = device
        self.model = self.load_model(model_path)
        self.model.eval()
        # Optimize for inference
        self.model = torch.jit.script(self.model)  # TorchScript
        # Warm up
        dummy_input = torch.randn(1, 3, 224, 224).to(device)
        with torch.no_grad():
            for _ in range(10):
                _ = self.model(dummy_input)

    def predict(self, inputs):
        with torch.no_grad():
            inputs = inputs.to(self.device)
            outputs = self.model(inputs)
            return F.softmax(outputs, dim=1)

    def batch_predict(self, batch_inputs):
        predictions = []
        for inputs in batch_inputs:
            pred = self.predict(inputs)
            predictions.append(pred)
        return torch.cat(predictions, dim=0)
```
ONNX Export
```python
def export_to_onnx(model, dummy_input, filepath):
    model.eval()
    torch.onnx.export(
        model,
        dummy_input,
        filepath,
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }
    )
```
Conclusion
Neural networks continue to evolve with new architectures, training techniques, and optimization methods. The key to successful implementation lies in:
- Understanding the Problem: Choose architectures that match your data and task requirements
- Proper Training: Use appropriate optimization, regularization, and scheduling techniques
- Efficient Implementation: Optimize for both training and inference performance
- Continuous Learning: Stay updated with latest research and best practices
This comprehensive guide provides the foundation for building, training, and deploying neural networks across various domains. Remember that successful neural network development requires both theoretical understanding and practical experience with implementation details.
For specific implementations and advanced techniques, refer to the latest research papers and framework documentation. The field of neural networks is rapidly advancing, and staying current with developments is crucial for optimal results.