Reliability - universAAL/middleware GitHub Wiki

Table of Contents Blackbox Description Design decisions Failure Diagnosis Module in universAAL Blackbox Description Bundles Features Design Decisions Fault Containment Region in universAAL Use case and failure modes of the Fault Containment Regions Hardware Faults Software Faults Implementation of The Diagnosis Framework Error Detection Unit Backbox Description Bundles Features Design Decisions Message classification concept Conceptual model of Error detection unit Implementation Fault detection mechanism in time domain Semantic fault detection mechanism References

Blackbox Description

To improve the reliability of systems using redundant subsystems is to support fault tolerance mechanisms that adapt to reliability changes of subsystems during the system’s life time. For instance, if the ability to recover from errors is exhausted for a particular replicated subsystem because too many permanent errors have accumulated (e.g., one replica of a Triple Modular Redundancy system has failed permanently), appropriate actions have to be taken to enhance the reliability (e.g., the migration of the replicated system functionality to a different IP core in a multi-core system). This is a requirement for enabling sustained operation of components, which is demanded for applications that require a non-stop operation throughout their entire lifetime. In addition, since on-call maintenance can be very cost intensive due to maintenance contracts and service outages, the universAAL architecture shall enable the shift of on-call maintenance to periodic maintenance. A shift to periodic maintenance can be achieved by fault-tolerance techniques that retain, in case of an internal error, the correct system functionality until the next scheduled service date.

Ground Rules

Creating fault tolerant behavior in a hardware/software system is a complex process. Faults have diverse sources, from physical failures of the hardware, logic errors in the software, either internal or external. They may be operational errors or the result of malicious use. Faults can be temporary or persistent. Software faults are due to flaws in the design of the system.

Descriptions of failure scenarios range from the complete loss of power to the failure of individual components. The classification of failures and their consequences is always unique to the particulars of the service provided. Consequently, approaches for achieving a particular level of dependability will vary. Fault prevention in the design phase, and fault removal through maintenance are important means in delivering reliable software.

There is a precise and rigorous terminology used in the literature to describe the basic concepts of dependable computing codified by Laprie ^[1]; see also Avizienis, Laprie, and Randall ^[2]. A system is a collection of interacting components that deliver a service through a service interface to a user. The user can be a human operator, or another computer system. The service delivered by a system is its behaviour perceived by the user. Dependability of a computing system is the ability to deliver service that can justifiably be trusted. Applications can emphasize different attributes of dependability, including:

Availability, the readiness for correct service.
Reliability, the continuity of that service.
Safety, the avoidance of catastrophic consequences on the environment.
Security, the prevention of unauthorized access.

The function of a system is what the system is intended to do, as described by the functional specification. A system failure occurs when the service delivered does not comply with the specification. The system state is the set of the component states.

An error is a system state that may lead to failure. An error is detected if an error message or signal is produced within the system or latent if not detected. A fault is the cause of an error, and is active when it results in an error, otherwise it is dormant.
Fault tolerance is the ability of a system to deliver of correct service in the presence of faults ^[3]. This is achieved by error processing —removing the system error state— and by treating the source of fault. The ability to detect and process error states and assess the consequences is critical requirements of fault tolerant design.

Fault tolerance —both hardware and software— is achieved through some kind of redundancy. Hardware redundancy techniques often make use of multiple identical units, in addition to a means for arbitrating the resulting output. ECC memory, for example, uses a few extra bits to detect and correct errors resulting from faults in the individual storage bits. Running the same input data through a faulty software module multiple times yields the same erroneous result each time. Software fault tolerance is built by applying algorithmic diversity, computing results through independent paths, and by judging the results. This adds complexity to the system in general. Adding software fault tolerance will improve system reliability only if the gains made by the added redundancy are not offset by commensurate new faults introduced by the parallel code.

Design decisions

Reliability building block goal is to improve the reliability aspects of the universAAL platform. Therefore the reliability building block is a vertical layer cross over all layers of universAAL, especially in the Middleware. This can be done by dealing with to major challenges of reliability and enhance the system efficiency. The first action point, the creation of a framework to diagnose the system behaviour by detecting the faults that might occur during the systems operation, and take decisions to overcome such cases. Taking into consideration the existing components of the Middleware, the following components will be reused in the Diagnosis Framework: Context Events, Context Bus and the Situation Reasoner (see Context Group wikipages for more details). The Diagnosis Framework, should not create further effort on the operational load of the platform or interrupt other services. The Middleware has a message based communication.Hence, fault detection mechanisms is also using message classification algorithms in order to categorize messages and differentiate all message types interacting in the platform. The diagnosis framework uses a knowledge base of rules that determine the behaviour of the system and define possible solutions. This knowledge base has to be fed continuously with new knowledge and cases to be able to decide in more and more use cases. A Fault injection framework has been implemented to create a high effort testing scenarios for a number of nodes in an uSpace, after the end of this check, a file of feedback results can be used in the knowledge base that is used in the Diagnosis Framework. The Fault Injection Framework in its final version will be fully independent bundle from the Middleware. This will also give universAAL administrators the ability to test the functionality of any uuSpace remotely. The third bundle in the Reliability building block is the Time Triggered patch, this patch is giving the users of universAAL platform the possibility to have an advantage of using a time triggered communication in there uuSpaces where many reliability aspects are taken already into consideration in the infrastructure used in such communication (e.g. global time synchronization, reliable communication of critical events in the system).

Failure Diagnosis Module in universAAL

Blackbox Description

Fault Diagnosis is the process of determining the type, size and location of the most possible fault together with the temporal specification of the fault. Diagnosis is the reasoning process for detection, isolation, analysis and recovery of occurring faults. A Symptom is the subjective evidence of a failure that indicates the existence of fault.

The notion of Fault Containment Region (FCR) is a key concept for reasoning about the behaviour of a system in the presence of faults. The knowledge about the immediate impact of a fault can serve as the starting point for the reliability analysis of a system. In addition, fault-tolerance mechanisms such as triple modular redundancy require replicas to be assigned to independent FCRs. FCR is the set of subsystems that share one or more common resources that can be affected by a single fault and is assumed to fail independently from other FCRs.^[4]

The main components for the diagnosis infrastructure for universAAL are as follows.

Error detection unit: is the basic component for monitoring system's software components, processes, and processing nodes and reporting errors.
Fault analyser : is the component which analyses the gathered error reports by detection unit based on diagnosis rules
Failure Notifier: is the component that typically gathers fault reports and publishes the diagnosis decisions

Bundles

Artifact: Failure Diagnosis Module in universAAL
GIT Address	http://github.com/universAAL/context/tree/master/ctxt.reliability.reasoner

Features

This artefact offers the following features.

Fault hypotheses: on top of which the rules for diagnosis are designed
Fault Containment Regions: to identify the FCRs in universAAL
Failure modes: to identify specific failure modes for specific FCR
Failure diagnosis rules: to reasone on the root cause of any failure
Reliability reasoner: the reasoning engine to make decisions on detected error events

Design Decisions

Fault Containment Region in universAAL

As diagnosis involves the backtracking from Failure to Fault, the knowledge about the possible FCRs in universAAL platform are used in the fault analysis process. This fault analysis also uses knowledge about the failures in the both time and value domain. To gather this knowledge, the whole platform is divided into several Fault Containment Regions (FCRs) together with the specific failure modes that they can show. In the following, these FCRs are listed with appropriate possible failure modes and use cases. universAAL platform can be formulated from the diagnosis point of view where the whole platform is divided into FCRs.

In the following, a comprehensive list of Fault Containment Region with respective failure modes is listed. In this list, each of the FCRs is enlisted with its input, output and rationale so that the inclusion if this FCR is justified. The failure modes for each of the components are classified as follows. From the consistency point of view

Consistent Failure
Inconsistent Failure

From the time and value perspective

Timing failure
1. Early Timing Failure: eg. Babbling idiot
2. Late Timing Failure: eg. Omission, Crash, Fail-stop
Value failure

Use case and failure modes of the Fault Containment Regions

Hardware Faults

The Hardware faults that can be presented in the following user cases of the fault containment regions are:

FCR	Node
Rationale	A single node shares processor, memory, power supply. Single physical fault will affect other software components
Input
Output
Service
Failure Modes	•Omission failure: a node can stop sending or receiving signals (messages) to or form the physical channels. This failure mode appears as a late timing failure mode from our [1st]. •Babbling idiot: a node can send untimely messages to the channel. This failure mode is also a late timing failure mode

FCR	Communication channel
Rationale	Physical fault of the communication channel will lead to communication breakdown of the nodes that are connected through the channel. The value failure is not considered here as the error correcting codes are able to deal with the value failure. It is also assumed that the channel itself cannot create messages by itself.
Input
Output
Service
Failure Modes	Crash failure: a physical communication channel may not produce any output and this failure can remain undetected by the correct FCRs

(*)Solely dependent on the application and/or specific implementation.

Software Faults

The Software faults that can be presented in the following user cases of the fault containment regions are:

FCR	Operating System (OS)
Rationale	An OS failure renders an entire node unusable because applications depend on core OS services for memory allocation, process management and I/O.
Input
Output
Service
Failure Modes	The failure modes for OS require a deeper understanding of the structure of the OS and are kept for future development.

(*)OS is out of scope for this diagnosis implementation

FCR	Middleware
Rationale	As middleware acts a broker between the hardware and application software, failure to this will lead to total system failure
Input	•Input from uSpace Managers •Input from uSpace applications •Packets from Ethernet (Input from layers below) •Packages and messages from components (Input from layers above)
Output	•Context event for context manager •Service invoke for the service manager
Service	Provides two types of connection points •Connection points between instances of middleware •Connection points between the local components of the node and the system
Failure Modes	•Omission failure: a middleware can send or receive messages or packages to or from the layers above and below. This includes a scenario where a piece of middleware is not delivering the required package to other piece of middleware. This failure mode is already handled by the current uPnP connectors •Crash failure: a piece of middleware can omit its output for all subsequent input until it is restarted again. A suitable scenario for it is as follows. A new uPnP device has joined the middleware bus. If the middleware fails to produce the context event related to the joining of the new device to context bus and it requires that the middleware has to be restarted, then this scenario falls into a crash failure. •Byzantine failure: this failure mode covers all the arbitrary failures that may occur to the middleware

FCR	I/O Drivers
Rationale	Failure to I/O drivers will lead an I/O device not able to work as an I/O driver controls devices and maintain the data acquisition and visualization.
Input
Output
Service	•Provides an interface for hardware like printers, video adapters, sound cards, network cards, digital cameras etc. •For hardware: a.Interfacing directly b.Writing to or reading from a device control register c.Using some higher-level interface (e.g. Video BIOS) d.Using another lower-level device driver (e.g. file system drivers using disk drivers) •For software: a.Allowing the operating system direct access to hardware resources b.Implementing only primitives c.Implementing an interface for non-driver software
Failure Modes	This FCR is related to the failure modes of OS which leads to check if it is in our scope of work or not

(*)Solely dependent on the application and/or specific implementation.

FCR	Application components
Rationale	Failure to one application component is contained within that component
Input
Output
Service	Application dependent
Failure Modes

(*)Solely dependent on the application and/or specific implementation.

FCR	uSpace Gateway
Rationale	Failure to an uSpace gateway leads to intra and inter-uSpace bridging mechanism failure and failure to one gateway is contained within it
Input	•Incoming request from a remote user (includes another uSpace)to start and or publish its service and or service request •Message from output bus
Output	•Authenticated or denied request to the remote user (includes another uSpace) and or service provider •A communication channel between the remote user (uSpace) and the current uSpace •Message to input bus
Service	•Manages the activities among the Space Federation (intra-uSpace) or inside the same space (inter-uSpace) •Acts as an IO handler within an uSpace •Provides certain communication service to other IO handlers •Provides a mechanism to check the trustworthiness (authentication/deny request) •Enable intra-space communications •Log the uSpace activities.
Failure Modes	•Omission failure: An example scenario for this Omission failure is as follows. A remote device tries to connect to the uSpace through the uSpace Gateway. The gateway is not responding to the incoming request for the remote device to join the uSpace. Another scenario would be if the gateway is omitting the messages that it received from the uSpace output bus and it should pass those messages to the remote user (includes another uSpace) •Babbling idiot: an example scenario for this Babbling Idiot failure is as follows. The faulty uSpace Gateway is constantly sending high priority messages to the uSpace (specifically in the buses(Input bus)) •Value failure: an example scenario for this failure is as follows. A remote trustworthy user (including another uSpace) tries to join the current uSpace using the uSpace Gateway, but the gateway is denying the remote connection

FCR	Connectors (of ACL)
Rationale	Protocol specific
Input
Output
Service	Application dependent
Failure Modes

(*)Solely dependent on the application and/or specific implementation.

FCR	ACL (Abstract Connection Layer)
Rationale	Failure to ACL leads to breakdown of connectivity among the instances of the middleware
Input	Registration message from SodaPopPeer Listener request from PeerDiscoveryListener
Output	Registration message of the P2PConnector to the underlying Hosting OSGI Framework
Service	•Peer-discovery •Creating proxies of remote implementations of SodaPopPeer •Forwarding calls made to SodaPopPeer proxies to the real implementations of it on the side of remote peers.
Failure Modes	•Fail stop failure •Value failure: an ACL maintains a queue for the incoming request for the registration message from the SodaPopPeer. If the ACL fails to produce the correct registration message for the underlying OSGI framework, a value failure occurs. •Babbling idiot: this failure mode includes a scenario that ACL produces a correct registration message for the P2PConnector, but it produces it very late

FCR	SODAPOP layer
Rationale	Failure to SODAPOP layer leads to disconnection between ACL and AAL specific layer
Input	Incoming calls from ACL to bind a peer
Output	Communication between the peers using the buses
Service	•Finds peers by PeerDiscoveryListener interface •Peers access middleware by SodaPopPeer interface •Buses communicate with own peers by SodaPop interface •Serialize and deserialize messages by MessageContentSerializer interface
Failure Modes	•Timing failure: untimely (de-)serialization of messages; create the communication between the peers when one of them has already been absent •Omission failure: the sender (receiver) SODAPOP layer fails to send(receive) the message •Value message failure: the message contents does not comply with the interface specification

FCR	Virtual Communication Bus (Context, Service, UI)
Rationale	Failure to the logical bus leads to cease of communication messages as the buses are the Connection Points towards AAL Specific layer
Input	(De-)registration messages
Output	Messages defined by BusStrategy abstract class
Service	Management of the message queue Propagate messages
Failure Modes	•Value message failure: transmitted message do not comply with the interface specification •Timing message failure: unspecified instance of message in time domain

Implementation of The Diagnosis Framework

In this section, an integrated detection and diagnosis framework is presented that can identify anomalies and find the most probable root cause of not only severe problems but even smaller degradations as well. Detecting an anomaly is based on monitoring uAAL component profile (see the following section on Error Detection Unit). Diagnosis is based on reports of previous fault cases by identifying and learning their characteristic impact on different performance indicators.

In common day terminologies, detection and diagnosis are hardly separated. Commonly, by the phrase “detecting a problem”, one often means two things actually: first, the confirmation that there is a problem at all and second, the verification of the nature or type of the problem itself. An example might be as follows. Sensors to register weight in the bed can activate the lighting of the route to the toilet, when the bed is left. But in one instance, it has been detected that the lighting does not activate although the bed is left because the weight sensor of the bed is generating no signal. The correct terminology in this example would be to say that an unusual behavior has been detected (i.e., the lights are not activated) and it is diagnosed that, e.g., the cause is a damaged sensor that has to be replaced. Detecting that the lights will not activate does not necessarily mean that there is a problem with the lighting itself; nevertheless, simply looking at the symptom level with this granularity it is impossible to tell if there is a serious problem with the lights or the master switch of the lights just has to be restarted. Therefore, if an unusual behavior is detected, a more thorough diagnosis has to be conducted in order to find out if there is actually a problem and what is the root cause behind it. Since the terms “detection” and “diagnosis” often carry an implicit duality, they have to be precisely defined before used in an engineering system such as an Ambient Assisted Living space: Detection basically means to identify something unusual in the network. However, in the context of the integrated framework in uAAL, the role of the detection process is only to provide a common view of possible indicators (symptoms) to the diagnosis to facilitate their correlation but deciding if there is a fault at all or what it is will be left to the diagnosis. Diagnosis means to investigate the root cause that could have caused the detected symptoms. In the framework, the input of the diagnosis is the output of the detection unit. The output of the diagnosis might as well be that there is in fact no problem at all. Usually, after the diagnosis of the root cause is done, certain corrective actions have to be performed in order to resolve the problem. Sometimes the root cause is harder to investigate than providing the action without knowing the underlying mechanisms; e.g., several failures can have a common corrective action (like restarting the sensor) but the root cause is unknown for the maintenance operator. It is even possible that the associated action is not a direct correction of the fault but the recommended escalation (e.g., alarming manual support line). Therefore, using the corrective action instead of the specific root cause is also acceptable. The root cause or the corrective action are what the diagnosis returns and they will be jointly referred to as the target of the diagnosis.

The integrated diagnosis framework uses the power of the Context bus in universAAL so that looking at any context event gives the indication of any symptom for a fault. It also uses the reasoning power of SPARQL and also uses the Publish/Subscribe model in universAAL. The integrated diagnosis framework is depicted in the following figure.

From the context bus, the context events related to faults are taken as symptoms for a failure. These symptoms are analyzed by a priori knowledge of the FCR and the related static knowledge on the associated failure mode. These symptoms are further queried by Reliability Reasoner with the help of the KB (Knowledge Base) and Dependability Ontology. These symptoms can be analyzed either in a rule based approach or simple SPARQL query. The rules for the failure analaysis are inside the Reliability Reasoner. Then the reasoner will publish the context event with the diagnosis information into the context bus. This diagnosis information includes the actions for the failure that have to be adopted for the specific failure modes for that specific FCR.

Error Detection Unit

Backbox Description

In highly distributed systems, where a large number of hardware and software are contributed to serve a certain scenario, the probability of fault occurrence will be significant. Some of the provided services are critical and need to be served with relatively high reliability and availability i.e. the corporate components should provide at least a degraded level of this service even with fault existence. To tolerate the faults in such systems, three interrelated phases should be followed:

Fault detection.
Fault diagnosis.
Fault masking and recover.

The first phase is responsible for detecting anomalies within a system, a pre-knowledge about the correct system behavior is required, such that any deviation from the normal behavior either in time or value domain can be caught easily by the error detection mechanism. Then, the detected anomalies would be analyzed and diagnosed by a certain diagnosis technique. After diagnosing the faults, a certain action should be taken either by recovering the faults online or offline or by blocking the fault and preventing them from elaborating to another healthy unit. Because of its importance in fault tolerance operation, an Error detection framework has been created, the framework is based on classifying the exchanged messages within the network of the distributed system according to its specifications in time and value.

Bundles

Artifact: Error Detection Unit
GIT Address	http://github.com/universAAL/context/tree/master/ctxt.error.detection.unit

Features

EDU comes to enhance the reliability of the universal platform by discovering the faults of the exchanged messages in different domains. The discovered faults can then be forwarded to the diagnostic unit to take the suitable action. Several fault detection methods has been implemented in order to cover a wide range of faults. These methods may be classified as follow:

Detecting the faults in time domain for both of the periodic and sporadic messages. These methods have the ability to detect the temporary and the permanent faults in time domain
Detecting methods in semantic domain, several check process has been implemented (range check, 1st derivative). These methods have the ability to detect the temporary and the design faults n semantic domain.

In addition to the general purposes fault detection methods, EDU has been built in an extendable way to accept any other application specific check process.

Design Decisions

Message classification concept

The principle of Error detection by using message classification is introduced for the first by (Jones Kopetz) in Dependable System of Systems conceptual model (DSoS). The conceptual model of DSoS classifies the messages as shown in the next table . Before sending a message from one node to another one within the same network, code protective bits should be added to the message (e.g. CRC bits), then the output assertion on the sending node should verify the message. The message is classified as checked if it passed the output assertion. To be permitted the message has to pass the input assertion check of the destination node. The syntactic check should be done on the message by checking the codes bits to make sure that the message is still valid and has not been truncated. After that, the message should be checked against its receiving time and semantic to see whether the message is timely and correct therefore can be used further, otherwise the message will not be used.

Attribute	Explanation	Antonym
valid	A message is valid if it contains a correct CRC.	invalid
checked	A message is checked if it passes the output assertion.	not checked
permitted	A message is permitted with respect to a receiver if it passes the input assertion of that receiver.	not permitted
timely	A message is timely if it is in agreement with the temporal specification	untimely
correct	A message is correct if it is in agreement with the temporal and the value specification.	incorrect
insidious	A message is insidious if it is permitted but incorrect	not insidious

Conceptual model of Error detection unit

The next figure depicts a simple network, which consist of several universAAL aware communication nodes. EDU has been realized in each universAAL node as a separate software component by occupying the location between middleware and the application layer. EDU is not application specific, but it uses some functions from the underlying operating system to ensure its predictable behavior. However, EDU should be configured by the application developer to meet the specification of his application.

EDU has been designed only to handle the received events by other uAAL-Components. Thus, whenever a uAAL-Component receives a new event from the context bus, it can deliver this event to the EDU to check the events against several fault type that should be predefined at the design time by the uAAL-Components itself. The physical location of the EDU on the receiving node will help the EDU in monitoring the sender status by analyzing its messages. Actually two design possibilities were available; whether putting EDU in the sending side or the receiving side. In some situation it becomes difficult for the sending node to judge itself. Suppose for instance that the sending node has mismatched the system synchronization due to a drift in its oscillation, in this case, it’ll be unreasonable to trust on the node’s decision whether the message timing is correct or not. As mentioned in previous sections, EDU is relying on message classification concept to detect anomalies in the received messages. Next figure shows the follow of the received message inside the EDU, and how the message classification concept has been realized inside it.

First of all the incoming message should pass the syntactic check to see if the received message is valid or not. In fact, the syntactic check tests if the received message has already been configured by the user. If not the message is dropped and doesn’t precede the other processes, at the same time an indication goes to the diagnostic unit to tell him about the invalid message. If the message is valid, a time check should be done to verify the timing of the message. Depending on the timing behavior of the different messages (e.g. periodic or sporadic messages), different time check algorithms may be required. The timely messages should finally pass the semantic check to make sure that the received message is error free. To check the message semantic, different software methods are available. Some of these methods are not application specific and can be applied generally like limit check, 1st derivative check , etc… while other methods require more information about the application like plausibility check, process model based check. If the message has been dropped in any one of these check points, an indication is made to the diagnostic unit to take the suitable decision. However, to take an accurate decision, accurate information of the caught anomaly should be provided from the error detection unit. This information should contain the error type, location, and time to help the diagnostic unit in taking the right decision easily. The faulty messages that are generated by the EDU on each node, are finally published on the context bus, see Figure 1. Diagnostic unit should be able to subscribe to all of these events from the different nodes. Physically, the Diagnostic unit should be in one centric node, and this node should have the ability to connect to all distributed nodes by using a suitable networking. In order to realize the EDU, several assumptions and requirements should be taken in consideration before getting in the implementation phase:

Deterministic behavior: as check functions in EDU are relying on pre-knowledge information in both time and value domain, a deterministic behavior for both of the middleware and the communication infrastructure is required to ensure message consistency.
Synchronization among the communication nodes: in order to have a unique timing view, the senders and the receivers of the messages should be in synchrony.
Syntactic check within communication Architecture: the framework assume that no fault can happen to the message content in communication network i.e. the communication network has the ability to catch the syntactic fault (e.g. flipped bits, truncated message) by implementing some type of check function (e.g. CRC check).
Extendibility: The framework should be expendable to adapt to any new error detection mechanism.

Implementation

Before getting into the practical elements of the EDU and how these elements have been implemented, several design aspects might be clarified at first. One of the most important feature in EDU, is its ability to detect the errors in time domain, the error detection mechanism in time domain should be accurate enough to handle the timing errors in high resolution. Thus a high resolution time stamping and timer mechanisms need to be used. To cope with this issue, specific timing functions have been utilized from the OS (Linux OS in our case). Although this procedure might made the EDU a non-portable code, equivalent timing functions of other OS can be found and replace that of Linux. Because of that and to make the code more flexible, the main core of the EDU has been implemented in C Language and provided with native methods to make the interface to the middleware which is already implemented with java. To achieve the message classification inside the EDU, a pre-knowledge about the message specifications both in time and in value domain are required. These specifications should be delivered by the application developer at the design time and before the using of EDU. An XML configuration file has been created to make it easier for the developer to give the specification of its message. To manipulate the specification of the message, a parser function has been created for parsing the information from the XML file and providing them to the main data structure of the EDU. The data structure inside EDU, consists mainly of a hash table that comprises the message ID as a key and a list of check processes’ structures as a value to the related key, see next figure , where each message should pass a number of check points that are associated to the related message during the design phase. Suppose a certain message which has the message ID “101“as in the next figure. Message 101 is supposed to be configured as periodic message and have an integer value that should be tested against a certain threshold by applying the limit check and the 1st derivative check processes, therefore three check processes should applied on this message. By finding out the message ID in the hash table, a pointer to the head of the check processes list will be returned as value, for our case, the pointer refer to the periodic field. The periodic related information of message 101 such as the period value and the phase value will be found in its structure instance (Periodic struct.). After that a periodic check function will be called to compare between the stored time information and the time information that is extracted from the received message, if message 101 met its time specification, then it is considered as timely otherwise untimely message indication may be given to the diagnostic unit. By terminating the periodic check function, the pointer will refer to the second check process and so on until the check processes list is finished.

Fault detection mechanism in time domain

To cover fault hypothesis in time domain, two types of messages may be distinguished according to its timing behavior:

Periodic messages: this type of messages has a fixed period after which a new message of the same type should be transmitted over the network. One periodic message could be transmitted more than one time within one cluster cycle, therefore periodic messages with the same period but at different phases may be differentiated.
Sporadic messages: in contrast to periodic messages, the sporadic messages have to be re-generated within a certain range of time. In other words it has a minimum and maximum inter arrival time and it should be re-transmitted within this range.

For both types the messages could be timely when it meets its specification or it could be represented by untimely if one of these situations happens:

Message came early.
Message came late.
Message didn’t come.

The first two errors can be detected depending on event triggered mechanism by comparing the receiving instance of the message with the pre-known message specification. In case of message losing no event would be generated, therefore another mechanism has been used that depends on time triggered paradigm. The time triggered mechanism is summarized in setting a time out for the waited event by setting the deadline of receiving instance. If the time is spent without receiving a message an indication is made to the diagnosis unit. This type of error detection mechanism helps in detecting the omission failure where a permanent fault could occurs. In order to schedule an event with time out in the future, an event list data structure has been created. The event list which is also called an event calendar handle the events of different messages with maintaining all of the events in time order so that the next event may be readily determined when the current one has completed execution. During the execution of each event, new event may be scheduled. A single event handler has been used to handle all the events of incoming messages. This step has been adopted to avoid the conflict between periodic and sporadic events in case they share the same schedule point of time. Additionally, it is not reasonable to generate an event handler for each message. The event calendar consists mainly of the message ID and the expected next arrival time of the related message as shown in Figure 4. the expected next inter-arrival time for periodic message may be computed by Time schedule periodic = current time + message period For Sporadic Time schedule sporadic = previous arrival time + max interarriaval time It could be seen that the next scheduling point of time for sporadic message depend directly on the previous receiving point of time. Therefore a static list data structure has been created inside the sporadic check function to maintain the previous time stamping of different sporadic messages from lost. In same manner if previous data are required within the current test function, a static list may be generated, each element of the list contain the message ID field and the previous data that are required. The calendar re-arranges itself dynamically after each message arrival in such a way that the earliest schedule time occupies the head position of the list.

Semantic fault detection mechanism

If the ensuring of the deterministic behavior for both the middleware and the communication infrastructure will help a lot in classifying faults regarding time, this will not be the case when a sensor or actuator deviates from its normal operation. It is more complicated to catch an error from the message semantic. However a wide vary of methods are already introduced to detect anomalies of a certain process. These methods may be classified as already done by Isserman in ^[5].

Signal based fault detection: by taking the measured signal several criteria may be applied directly e.g. limit checking or trend checking or may be by analyzing the measured signal a certain specification can be estimated and then tested.
Model based fault detection: this method is more complicated, it takes the measured signals for both input and output for a certain process and apply them to the mathematical model of the process. Then several features can be estimated e.g. parameters, state variables or residuals. By comparing these observed features with their nominal values analytical symptoms are generated.

Three fault detection methods based on directly measured signal have been realized in the framework:

Limit checking: each measured signal Y(t) is normally bounded by one or two thresholds Y_min and Y_max. If the signal exceeds one of its thresholds then anomaly may be detected. Of course, normal fluctuation could occur, then a false alarm should be avoided, however the fault on the other hand should be detected early. Therefore a trade off between too narrow and too wide threshold exists. To use this check function in the framework, a special C data structure has been identified to save the message ID and the related max. and min. threshold. To apply the limit check function for a certain message, an instance of the related structure should be initiated and the limit checking process should be inserted to the List of check process.
Trend checking: the same principle for limit checking may be applied for first derivative of the measured value Y'(t) by setting the minimum and the maximum limits for the trend. Trend checking could detect the fault earlier. To compute Y'(t) the previous value Y(t)_old and its time stamp are required. The issue of handling the previous values that are related to a certain message has been automatically treated and the user only has to inserts its threshold in a specific data structure.
Application specific Plausibility check: when multiple measurements are available for the same process, a relation between these measurements may be establish to be a base for further checking. As an example for plausibility check suppose a process with two measured variables X(t) and Y(t). Under the normal condition the following relationship should apply: Since this type of checking process is application specific, it was very difficult and invaluable to present it as a general function in the framework, however a use case example has been verified within the framework.
If (Y_min< Y(t)< Y_max) Then (X_min

The fault types that have been covered by the EDU are classified as general faults that are not related to a specific application. Many other application specific fault detection methods can be inserted to the EDU by the developer of uAAL components by utilizing the data structure of the EDU.

References

^ J.C. Laprie (ed.). Dependability: Basic Concepts and Terminology, Dependable computing and fault tolerant systems services, Vol 5. Springer-Verlag 1992
^ A. Avizienis, J.-C. Laprie, B. Randall. Fundamental Concepts of Dependability", 2001
^ A. Avizienis. Fault Tolerant Systems, IIEE Trans. Computers Vol C-25 No 12. 1976
^ Kopetz, H.; , "Fault containment and error detection in the time-triggered architecture," Autonomous Decentralized Systems, 2003. ISADS 2003. The Sixth International Symposium on , vol., no., pp. 139- 146, 9-11 April 2003 doi: 10.1109/ISADS.2003.1193942
^ Isermann, Rolf. Fault Diagnosis System. Heidelberg : Springer, 2006.

Reliability - universAAL/middleware GitHub Wiki

Table of Contents

Blackbox Description

Design decisions

Failure Diagnosis Module in universAAL

Blackbox Description

Bundles

Features

Design Decisions

Fault Containment Region in universAAL

Use case and failure modes of the Fault Containment Regions

Hardware Faults

Software Faults

Implementation of The Diagnosis Framework

Error Detection Unit

Backbox Description

Bundles

Features

Design Decisions

Message classification concept

Conceptual model of Error detection unit

Implementation

Fault detection mechanism in time domain

Semantic fault detection mechanism

References

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️