Milestone 3: Architectural Tactic for Availability - SENG-350-2024-fall/Team3 GitHub Wiki
The team created a set of checks to evaluate the system's requirements and some of its possible points of failure:
- Does the system detect and log errors and faults effectively?
- In case of faults, does the system recover without user intervention?
- Are there failover/backup mechanisms in place to help ensure continuous data availability?
- Is the system designed to handle errors gracefully, possibly providing alternative functionality or feedback to the users?
- Is the system capable of managing multiple concurrent requests, avoiding overload on any single resource?
The checklist helped derive a set of five architectural tactics that can be implemented:
For fault detection, it is crucial that any failure in the system is logged and recognized quickly, so that troubleshooting and intervention can happen in time. In this project, this is implemented by configuring Django's logging to capture errors. Records at level ERROR and above from the django logger are directed to error.log, which allows issues to be monitored as they occur.
# settings.py
import os  # needed for os.path.join below

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        # Write ERROR (and more severe) records to error.log in the project root
        'file': {
            'level': 'ERROR',
            'class': 'logging.FileHandler',
            'filename': os.path.join(BASE_DIR, 'error.log'),
        },
    },
    'loggers': {
        'django': {
            'handlers': ['file'],
            'level': 'ERROR',
            'propagate': True,
        },
    },
}
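One detail worth checking is which logger names actually reach `error.log`. The configuration above attaches the file handler to the `django` logger only, so a view module that logs through something like `logging.getLogger(__name__)` would likely need its own entry (or a root logger entry). The sketch below is a stand-alone check using only the standard library. The temporary log path, the `selection` logger name, and the helper names are assumptions for illustration, not taken from the project.

```python
import logging
import logging.config
import os
import tempfile


def build_logging_config(log_path, app_logger="selection"):
    """Return a dictConfig-style dict that sends ERROR records to log_path.

    `app_logger` is a hypothetical name for the project's own app package.
    """
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "file": {
                "level": "ERROR",
                "class": "logging.FileHandler",
                "filename": log_path,
            },
        },
        "loggers": {
            # Django's own loggers (django.request, django.db.backends, ...)
            "django": {"handlers": ["file"], "level": "ERROR", "propagate": True},
            # The application's loggers, e.g. getLogger("selection.views")
            app_logger: {"handlers": ["file"], "level": "ERROR", "propagate": False},
        },
    }


def demo():
    """Log one error and one warning, then return the contents of the log file."""
    log_path = os.path.join(tempfile.mkdtemp(), "error.log")
    logging.config.dictConfig(build_logging_config(log_path))
    view_logger = logging.getLogger("selection.views")
    view_logger.error("Triage error: simulated failure")
    view_logger.warning("below ERROR, so this is not written")
    with open(log_path) as fh:
        return fh.read()
```

Calling `demo()` returns the file contents, which should contain the simulated error line but not the warning. The same idea can be used as a quick manual check that a simulated view error lands in the log.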
The triage function in views.py wraps its logic in a try/except block that captures and logs errors, which helps ensure that faults do not go unnoticed.
# views.py
import logging

from django.shortcuts import redirect, render

logger = logging.getLogger(__name__)  # module-level logger used by the views

def triage(request):
    try:
        if request.method == 'POST':
            # Get the triage data from POST
            severity = int(request.POST.get('severity', 0))
            symptoms = int(request.POST.get('symptoms', 0))
            duration = int(request.POST.get('duration', 0))
            additional = request.POST.get('additional', '')
            # Calculate priorityScore
            priorityScore = severity * 0.5 + symptoms * 0.3 + duration * 0.2
            # Determine the recommended action based on priorityScore
            if priorityScore < 2:
                recommended_action = 'No immediate action required'
                options = ['Virtual Meeting']
            elif priorityScore < 5:
                recommended_action = 'Consider visiting a clinic'
                options = ['Clinic Visit', 'Virtual Meeting']
            else:
                recommended_action = 'Immediate medical attention required'
                options = ['Visit to ED', 'Paramedic Visit']
            context = {
                'recommended_action': recommended_action,
                'options': options,
            }
            return render(request, 'selection/options.html', context)
        else:
            return redirect('home')
    except Exception as e:
        # Any failure (including non-numeric form input) is logged and shown as a 500 page
        logger.error(f"Triage error: {e}")
        return render(request, 'errors/500.html', status=500)
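The scoring rule inside the view can also be expressed as a pure function, which makes the thresholds easy to test without a request object. This is a minimal sketch under that assumption; the function name `triage_recommendation` is hypothetical, while the weights, thresholds, and labels are copied from the view above.

```python
def triage_recommendation(severity, symptoms, duration):
    """Return (recommended_action, options) using the weights from the triage view."""
    # Weighted sum: severity counts most, then symptoms, then duration
    priority_score = severity * 0.5 + symptoms * 0.3 + duration * 0.2
    if priority_score < 2:
        return "No immediate action required", ["Virtual Meeting"]
    if priority_score < 5:
        return "Consider visiting a clinic", ["Clinic Visit", "Virtual Meeting"]
    return "Immediate medical attention required", ["Visit to ED", "Paramedic Visit"]
```

For example, severity 4, symptoms 3, and duration 2 give 4(0.5) + 3(0.3) + 2(0.2) = 3.3, which falls in the middle band and recommends a clinic visit. A score of exactly 2 or exactly 5 falls into the higher band, because both comparisons are strict.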
Fault recovery allows the system to handle errors and recover from them, either by retrying the operation or by providing an alternative. This is implemented through retry mechanisms: functions such as book_appointment in views.py use a for loop that attempts to complete the booking up to three times if it fails. If all retries are exhausted, the system displays an error page instead of letting the application crash.
from time import sleep

from django.contrib.auth.decorators import login_required

from .models import Appointment, Schedule  # assumed location of the models

@login_required
def book_appointment(request, schedule_id):
    max_retries = 3
    schedule = Schedule.objects.get(schedule_id=schedule_id)
    if request.method == 'POST':
        for attempt in range(max_retries):
            try:
                schedule.is_booked = True
                schedule.save()
                appointment = Appointment.objects.create(
                    user=request.user,
                    schedule=schedule
                )
                return render(request, 'selection/booking_confirmation.html', {'appointment': appointment})
            except Exception as e:
                logger.error(f"Booking attempt {attempt + 1} failed: {e}")
                sleep(1)  # Brief delay before retry
        # All attempts failed: show an error page instead of crashing
        return render(request, 'errors/booking_failed.html')
    return render(request, 'selection/booking_confirm.html', {'schedule': schedule})
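The retry loop in the view can be factored into a small reusable helper. The sketch below is one possible shape, not the project's code: the name `retry` and its parameters are hypothetical, and the `sleep` argument is injectable only so the helper can be tested without real delays.

```python
import time


def retry(operation, max_retries=3, delay=1.0, sleep=time.sleep, on_error=None):
    """Call operation() until it succeeds or max_retries attempts have failed.

    Returns the operation's result, or re-raises the last exception once
    every attempt has failed.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            if on_error is not None:
                on_error(attempt, exc)  # e.g. log the failed attempt
            if attempt == max_retries:
                raise  # out of attempts: let the caller render an error page
            sleep(delay)  # brief pause before the next attempt
```

Unlike the loop in the view, this version does not sleep after the final failed attempt. In the booking view, the two writes (marking the schedule as booked and creating the appointment) could additionally be wrapped in Django's `transaction.atomic()` so that a failed attempt does not leave a schedule marked as booked without an appointment.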
Redundancy helps ensure that the system's data remains available even if the primary resource fails. This is implemented through a database configuration that defines a read replica alongside the primary database, so that reads can fail over to the replica if the primary is unavailable.
DATABASES = {
    # Primary database: handles all writes
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mister_ed',
        'USER': 'postgres',
        'PASSWORD': '----',
        'HOST': 'localhost',
        'PORT': '5432',
    },
    # Read replica: used when the primary is unavailable
    'replica': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mister_ed',
        'USER': 'postgres',
        'PASSWORD': '----',
        'HOST': 'replica-db-host',  # for now set to localhost
        'PORT': '5432',
    }
}
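Defining a second alias does not by itself make Django switch databases. Routing is normally done with a database router listed in `DATABASE_ROUTERS`, or with explicit `.using('replica')` calls. The source does not show how failover is wired, so the following is only a minimal sketch of one way it could be done. The class name, the module path in the comment, and the health check are all assumptions.

```python
class PrimaryReplicaRouter:
    """Send reads to 'default', falling back to 'replica' when the primary is down.

    Hypothetical sketch. To enable it, settings.py would list the class, e.g.
    DATABASE_ROUTERS = ['mister_ed.routers.PrimaryReplicaRouter']  (assumed path)
    """

    def primary_is_up(self):
        # Imported lazily so this module can be loaded before Django is configured.
        from django.db import connections
        from django.db.utils import OperationalError
        try:
            # Coarse check: opens a connection if none exists yet. It does not
            # detect a connection that was open and has since gone stale.
            connections["default"].ensure_connection()
            return True
        except OperationalError:
            return False

    def db_for_read(self, model, **hints):
        return "default" if self.primary_is_up() else "replica"

    def db_for_write(self, model, **hints):
        # A read replica cannot accept writes, so writes always target the primary.
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases hold the same data, so relations between them are allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Only migrate the primary; the replica receives changes via replication.
        return db == "default"
```

With a router like this, reads keep working during a primary outage while writes such as bookings still fail and fall through to the retry and error-page handling. Checking the primary on every read adds overhead, so a real deployment would probably cache the health result for a short time.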
Exception handling helps ensure that errors do not crash the system, and graceful degradation allows the system to keep functioning in a limited way when a partial failure occurs. This is implemented through custom error pages in templates/errors, such as 500.html, which present user-friendly error messages.
<!-- errors/500.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Server Error</title>
</head>
<body>
    <div class="container mt-5 text-center">
        <h1>Oops! Something went wrong.</h1>
        <p>We’re experiencing a temporary issue. Please try again later.</p>
        <a href="{% url 'home' %}" class="btn btn-primary">Return to Home</a>
    </div>
</body>
</html>
The system needs load balancing and efficient resource management in order to handle concurrent requests effectively; this reduces response time and helps prevent overload. In the mister_ed application, Waitress is used as a production-ready WSGI server that efficiently manages concurrent requests. Caching is the second component of the implementation: Django's caching is enabled for frequently accessed views, which reduces the load on the server.
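# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'unique-snowflake',
    }
}

# views.py
from django.contrib.auth.decorators import login_required
from django.views.decorators.cache import cache_page

# Caution: cache_page caches per URL; per-user pages may additionally need vary_on_cookie
@cache_page(60 * 15)  # cache the response for 15 minutes
@login_required
def appointment_list(request):
    ...  # view body omitted

# shell
waitress-serve --port=8000 mysite.wsgi:application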
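To make the effect of a 15-minute cache concrete, here is a small time-to-live cache written with the standard library only. It is a stand-in that illustrates the idea, not how Django's `cache_page` is implemented. The names `ttl_cache` and `clock` are hypothetical, and the clock is injectable only so expiry can be tested without waiting.

```python
import time
from functools import wraps


def ttl_cache(seconds, clock=time.monotonic):
    """Cache a function's results per argument tuple for `seconds` seconds."""
    def decorator(func):
        store = {}  # maps args -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now + seconds, value)
            return value

        return wrapper
    return decorator
```

Because the cache key is the argument tuple, a function taking a user id keeps one entry per user. That is the same concern as with cached per-user views: the key has to include whatever distinguishes one user's page from another's.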
- For fault detection, the test method involves simulating errors in views such as triage and checking that error.log captures them correctly.
- For fault recovery, testing involves triggering booking failures and confirming that the retry mechanism functions as expected and allows up to 3 retries.
- Redundancy is tested by simulating a primary database failure and confirming that the data can still be accessed through the replica.
- Exception handling is tested by forcing exceptions in views and confirming that error pages such as 500.html are displayed to the user.
- Load balancing is more complex to test, as it requires the database to be moved from localhost to a location accessible by multiple devices; testing would most likely involve load testing with multiple users. The cache can be enabled and disabled to compare the difference in performance.