Introduction
This article explains how to detect alarm-related problems and how to resolve them.
Alarmflow Worker
To detect and resolve issues, we first need to understand how the service works.
When logs enter Logsign, the parser service parses them. The parsed logs are then sent to the Redis service so that they can be checked against the alarm rules. The alarmflow service checks the alarm rules using the subscription structure established with Redis. If a log matches a rule that should trigger an alarm, the alarmflow worker service triggers the alarm.
After the alarm is triggered, its log is sent to Elasticsearch so that it can be displayed in the user interface. If the alarm has an action configured, the data that the action needs is written to a list in the Redis database.
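The flow described above can be pictured with a small in-memory model. This is only a sketch for orientation, not Logsign's code: the ToyBroker class stands in for the Redis subscription step, the two containers stand in for Elasticsearch and the Redis action list, and the rule, channel name, and field names are all invented for illustration.

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for the Redis subscription step (not a Redis client)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)


alarm_index = []        # stands in for the Elasticsearch data shown in the UI
action_queue = deque()  # stands in for the Redis list of pending actions

# Hypothetical rule: the name, fields, and action are assumptions.
RULES = [
    {"name": "failed-login", "match": {"event": "login_failed"}, "action": "send_mail"},
]


def parse(raw_line):
    """Toy parser: turns 'key=value key=value' text into a dict."""
    return dict(pair.split("=", 1) for pair in raw_line.split())


def trigger_alarm(rule, log):
    """Worker step: store the alarm log and queue the action, if one is configured."""
    alarm_index.append({"alarm": rule["name"], "log": log})
    if rule.get("action"):
        action_queue.append({"alarm": rule["name"], "action": rule["action"]})


def check_rules(log):
    """Alarmflow step: hand the log to the worker for every rule it matches."""
    for rule in RULES:
        if all(log.get(field) == value for field, value in rule["match"].items()):
            trigger_alarm(rule, log)


broker = ToyBroker()
broker.subscribe("parsed-logs", check_rules)
broker.publish("parsed-logs", parse("event=login_failed user=alice"))
broker.publish("parsed-logs", parse("event=login_ok user=bob"))

print(len(alarm_index), "alarm(s) stored,", len(action_queue), "action(s) queued")
```

Only the first log matches the rule, so only one alarm and one action are recorded. Each stage in this model corresponds to a place where a real alarm can be lost, which is why the checks in this article follow the same order.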
Let's explain the topic with an example.
Suppose you expect a log to trigger an alarm in the Logsign user interface, but the alarm has not been triggered even though the log was created. There can be more than one reason for this.
First, the column selected in action.column for the alarm must be present in the log that is expected to trigger it. If the log does not contain that column, the alarm is not triggered.
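The column check can be illustrated with a minimal sketch. It assumes the parsed log is available as a dictionary of column names and values; the function name and the column names in the example are hypothetical and do not come from Logsign.

```python
def log_has_column(log, column):
    """Return True when the parsed log contains the given column."""
    return column in log


# Hypothetical parsed log and column names, for illustration only.
sample_log = {"source_ip": "10.0.0.5", "event": "login_failed"}

print(log_has_column(sample_log, "source_ip"))  # True: the column is present
print(log_has_column(sample_log, "username"))   # False: the alarm would not trigger
```

When an alarm stays silent, comparing the column chosen in action.column with the columns of an actual log in this way is a quick first test.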
The second check is the silence time, a mechanism that prevents the same alarm from triggering again when the event repeats. Check whether the untriggered alarm is still within its silence time. While the silence time is active, repeated events do not trigger the alarm again.
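The effect of a silence time can be sketched as follows. This is a simplified model under stated assumptions, not Logsign's implementation: the class name, the per-alarm key, and the choice that suppressed events do not restart the window are all assumptions.

```python
import time


class SilenceWindow:
    """Suppress repeats of the same alarm for a fixed number of seconds."""

    def __init__(self, silence_seconds):
        self.silence_seconds = silence_seconds
        self.last_triggered = {}  # alarm key -> time of the last trigger

    def should_trigger(self, alarm_key, now=None):
        """Return True if the alarm may fire now, and record the trigger time."""
        now = time.monotonic() if now is None else now
        last = self.last_triggered.get(alarm_key)
        if last is not None and now - last < self.silence_seconds:
            return False  # still inside the silence time: suppressed
        self.last_triggered[alarm_key] = now
        return True


window = SilenceWindow(silence_seconds=60)
print(window.should_trigger("failed-login", now=0))   # True: first occurrence
print(window.should_trigger("failed-login", now=30))  # False: silence time still active
print(window.should_trigger("failed-login", now=60))  # True: silence time has passed
```

The second call shows the situation described above: the log exists and matches, but the alarm does not fire because the previous trigger is still inside its silence time.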
Another check is the Logsign-parser service. When the EPS (events per second) rate increases, the Logsign-parser service can slow down your alarms. To check this, review the stats graphs for the Logsign-parser service.
If the Socket Drop graph shows any value for the parser, you need to increase the capacity of the parser service.
The last check is the Logsign-alarmflow-worker service. Connect to the Logsign server's terminal and follow the service logs with the following command:
journalctl -u logsign-alarmflow-worker -f
If the output of the Logsign-alarmflow-worker service contains errors or similar messages, examine them.
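When the output is long, filtering it for error-like lines can help. The sketch below is one possible approach, not a Logsign tool: the keyword list is an assumption about what an error line looks like, the helper names are hypothetical, and the one-hour default is an arbitrary choice. It uses only the standard journalctl options -u, --since, and --no-pager.

```python
import re
import subprocess

# Assumed keywords; adjust them to the messages you actually see.
ERROR_PATTERN = re.compile(r"error|exception|traceback|critical|fail", re.IGNORECASE)


def find_error_lines(lines):
    """Return the lines that contain one of the assumed error keywords."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


def read_journal(unit="logsign-alarmflow-worker", since="1 hour ago"):
    """Read recent journal lines for a unit; run this on the Logsign server."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "--since", since, "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

On the server, find_error_lines(read_journal()) returns the matching lines from the last hour, which can then be examined one by one.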