netdata performance and health monitoring - downgoon/hello-world GitHub Wiki

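When it comes to operations monitoring, systems like Zabbix are the first that come to mind, but they really are showing their age. This article introduces netdata, a project with nearly 20,000 stars on GitHub.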

netdata is a system for distributed real-time performance and health monitoring. It provides unparalleled insights, in real-time, of everything happening on the system it runs (including applications such as web and database servers), using modern interactive web dashboards.

netdata is fast and efficient, designed to permanently run on all systems (physical & virtual servers, containers, IoT devices), without disrupting their core function.

netdata runs on Linux, FreeBSD, and MacOS.

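Key highlights

  • It monitors not only the system level but also the application level (web and database servers);
  • It monitors not only physical and virtual machines but even IoT devices (low power consumption);
  • Platforms: Linux, FreeBSD, and MacOS (IoT devices also mostly run a Linux kernel rather than Android).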

Features

  • Stunning interactive bootstrap dashboards: mouse and touch friendly, in 2 themes (dark, light)

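Bootstrap, which originated at Twitter, is a very popular front-end framework built on HTML, CSS, and JavaScript. It is concise and flexible, and it speeds up web development. It was developed jointly by Mark Otto and Jacob Thornton at Twitter as a CSS/HTML framework. Bootstrap is built on HTML5 and CSS3 and adds its own refinements on top of jQuery, which gives it a distinctive site style; it is compatible with most jQuery plugins.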

  • Amazingly fast: responds to all queries in less than 0.5 ms per metric, even on low-end hardware

  • Highly efficient: collects thousands of metrics per server per second, with just 1% CPU utilization of a single core, a few MB of RAM and no disk I/O at all

  • Sophisticated alarming: hundreds of alarms, out of the box! Supports dynamic thresholds, hysteresis, alarm templates, multiple role-based notification methods (such as email, slack.com, pushover.net, pushbullet.com, telegram.org, twilio.com, messagebird.com)

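The alarming is fairly smart: it supports dynamic thresholds, and besides sending email it can push notifications to team messaging tools such as Slack.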
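To make the hysteresis idea concrete, here is a minimal sketch in Python. It is not netdata's alarm engine or its configuration syntax; the class name and the 80/60 thresholds are made up for illustration. The alarm is raised when a value goes above a high threshold and is cleared only after the value falls below a lower one, so a metric hovering around a single limit does not flap.

```python
class HysteresisAlarm:
    """Toy alarm with separate raise/clear thresholds (hypothetical, not netdata code)."""

    def __init__(self, raise_above, clear_below):
        if clear_below > raise_above:
            raise ValueError("clear_below must not exceed raise_above")
        self.raise_above = raise_above
        self.clear_below = clear_below
        self.active = False

    def update(self, value):
        """Feed one sample and return the alarm state after it."""
        if not self.active and value > self.raise_above:
            self.active = True
        elif self.active and value < self.clear_below:
            self.active = False
        return self.active


if __name__ == "__main__":
    alarm = HysteresisAlarm(raise_above=80, clear_below=60)
    for cpu in [50, 85, 75, 79, 59, 70]:
        print(cpu, alarm.update(cpu))
```

With these thresholds, the samples 75 and 79 keep the alarm active even though they are below 80, because the value has not yet dropped under 60.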

  • Extensible: you can monitor anything you can get a metric for, using its Plugin API (anything can be a netdata plugin: BASH, python, perl, node.js, java, Go, ruby, etc.)

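Extensibility: this is almost a must-have for modern software. It is comparable to module development for nginx; Caddy, billed as a next-generation web server, has a similar mechanism.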
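As a rough illustration of what such a plugin can look like, here is a sketch in Python of an external plugin that writes plain-text commands to standard output. The command names (CHART, DIMENSION, BEGIN, SET, END) and the field order follow my understanding of netdata's external plugin protocol and should be checked against the plugin documentation; the chart id example.random, its title and the priority 90000 are made-up values.

```python
import random
import sys
import time


def chart_definition(update_every):
    """Lines that declare one chart with a single dimension."""
    return [
        # CHART type.id name title units family context charttype priority update_every
        f'CHART example.random "" "A random number" "value" example example.random line 90000 {update_every}',
        # DIMENSION id name algorithm multiplier divisor
        "DIMENSION random '' absolute 1 1",
    ]


def chart_update(value):
    """Lines that push one new value for the chart declared above."""
    return ["BEGIN example.random", f"SET random = {int(value)}", "END"]


def run(update_every, iterations=None):
    """Declare the chart once, then emit a value every `update_every` seconds."""
    print("\n".join(chart_definition(update_every)), flush=True)
    count = 0
    while iterations is None or count < iterations:
        print("\n".join(chart_update(random.randint(0, 100))), flush=True)
        time.sleep(update_every)
        count += 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: netdata passes the update interval in seconds as the first argument.
    run(int(sys.argv[1]))
```

Running the script by hand with an argument such as 1 prints the chart definition once and then one BEGIN/SET/END group per second, which is an easy way to inspect the output before handing the plugin to netdata.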

  • Embeddable: it can run anywhere a Linux kernel runs (even IoT) and its charts can be embedded on your web pages too

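Page-level reuse: this is impressive. One of the most annoying things about many traditional monitoring tools is that their pages cannot be reused or embedded into an existing system. netdata offers integration not only at the API level but also at the page level.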

  • Customizable: custom dashboards can be built using simple HTML (no javascript necessary)

  • Zero configuration: auto-detects everything; it can collect up to 5000 metrics per server out of the box

  • Zero dependencies: it is even its own web server, for its static web files and its web API

  • Zero maintenance: you just run it, it does the rest

Getting started is very easy: zero configuration, zero dependencies, and highly autonomous (no maintenance needed).

  • scales to infinity: requiring minimal central resources

  • several operating modes: autonomous host monitoring, headless data collector, forwarding proxy, store and forward proxy, central multi-host monitoring, in all possible configurations. Each node may have a different metrics retention policy and run with or without health monitoring.

  • time-series back-ends supported: can archive its metrics on graphite, opentsdb, prometheus, json document DBs, in the same or lower detail (lower: to prevent it from congesting these servers due to the amount of data collected)

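It supports the mainstream time-series stores: graphite, opentsdb, prometheus. Why is InfluxDB not mentioned? Perhaps because it lacks cluster scalability? InfluxDB is well suited to developers, since it is very easy to install.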
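For a feel of what archiving to such a back-end involves, here is a small sketch of Graphite's plaintext format, where each sample is one line of the form "path value timestamp". The function name and the netdata.host.chart.dimension path layout are assumptions chosen for illustration, not necessarily the names netdata itself sends.

```python
import time


def graphite_line(host, chart, dimension, value, timestamp=None, prefix="netdata"):
    """Format one sample as a Graphite plaintext line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = time.time()
    # Dots separate path levels in Graphite, so replace them inside the host name.
    safe_host = host.replace(".", "_")
    path = f"{prefix}.{safe_host}.{chart}.{dimension}"
    return f"{path} {value} {int(timestamp)}\n"


if __name__ == "__main__":
    print(graphite_line("web1.example.com", "system.cpu", "user", 12.5), end="")
```

Lines like this are normally written to the Graphite carbon listener over TCP; sending fewer of them (for example, one averaged value every 10 seconds) is what archiving "in lower detail" amounts to.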


Hello netdata

Installation

Official installation guide: https://github.com/firehol/netdata/wiki/Installation

Installing netdata is not as convenient as yum install netdata; it takes two steps:

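  • Install the dependencies: there is a basic set and a plugin set (the latter supports extension development). You can either use an online script that installs them automatically or install them by hand with something like yum install. How easy or tedious this is varies by operating system.
  • Build from source: this is simple. Download the source code and run the netdata-installer.sh inside it. It installs and starts netdata automatically, after which you can open http://localhost:19999 in a local browser (the data is stored in a local round-robin database).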

On a Mac, upgrade bash to version 4: http://www.tuicool.com/articles/EjIrmmN

Installing dependencies

  • Mac:
# install required packages
brew install ossp-uuid autoconf automake pkg-config

  • Ubuntu:
# Debian / Ubuntu
apt-get install zlib1g-dev uuid-dev libmnl-dev gcc make git autoconf autoconf-archive autogen automake pkg-config curl
  • CentOS:
# CentOS / Red Hat Enterprise Linux
yum install autoconf automake curl gcc git libmnl-devel libuuid-devel lm_sensors make MySQL-python nc pkgconfig python python-psycopg2 PyYAML zlib-devel

Building and installing from source

# download it - the directory 'netdata' will be created
git clone https://github.com/firehol/netdata.git --depth=1
cd netdata

# run script with root privileges to build, install, start netdata
./netdata-installer.sh

For example, I ran:

$ sudo ./netdata-installer.sh --install ~/opt

Notes:

  • netdata-installer.sh must be run as root;
  • after installing, netdata-installer.sh also starts netdata by default.

Options

  • If you don't want to run it straight away, add the --dont-start-it option.

  • If you don't want to install it in the default directories, you can run the installer like this: ./netdata-installer.sh --install /opt. This one will install netdata in /opt/netdata.

After a successful installation, the installer prints:

netdata by default listens on all IPs on port 19999,
so you can access it with:

  http://this.machine.ip:19999/

To stop netdata, just kill it, with:

  killall netdata

To start it, just run it:

~/opt/netdata/usr/sbin/netdata


Uninstall script generated: ./netdata-uninstaller.sh
Update script generated   : ./netdata-updater.sh
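Once netdata is listening on port 19999, the same data shown in the dashboard can be read over HTTP. The sketch below assumes the /api/v1/data endpoint of netdata's web API with its chart, after and format parameters, and the chart id system.cpu; the helper names are hypothetical, and the base URL should be adjusted to your machine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def data_url(chart, seconds=60, base="http://localhost:19999"):
    """Build the URL for the last `seconds` seconds of one chart, as JSON."""
    query = urlencode({"chart": chart, "after": -seconds, "format": "json"})
    return f"{base}/api/v1/data?{query}"


def fetch_chart(chart, seconds=60, base="http://localhost:19999", timeout=5):
    """Fetch and decode the chart data; needs a running netdata at `base`."""
    with urlopen(data_url(chart, seconds, base), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_chart("system.cpu") when netdata is running.
    print(data_url("system.cpu"))
```

Opening the printed URL in a browser is a quick way to confirm that the installation is serving data before writing anything that depends on it.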

Possible problems during installation

  • Installing on a Mac: the Xcode version is too old

During compilation, this error appears:

macos_sysctl.c:117:10: fatal error: 'netinet6/scope6_var.h' file not found
#include <netinet6/scope6_var.h>
         ^
1 error generated.
make[2]: *** [macos_sysctl.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
 FAILED

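Line 117 of macos_sysctl.c includes an IPv6-related header. This code is not used on Linux. The official guide notes that building on a Mac requires installing the command line tools with xcode-select --install. Running it showed that they were already installed, just in an older version: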

$ xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" to install updates

Xcode updates are managed with softwareupdate (the Mac tool for checking for system software updates). First list what can be updated:

$ softwareupdate --list
Software Update Tool
Copyright 2002-2012 Apple Inc.

Finding available software
Software Update found the following new or updated software:
   * OSXUpd10.10.5-10.10.5
	OS X 更新 (10.10.5), 795798K [recommended] [restart]
   * Command Line Tools (OS X 10.10) for Xcode-7.2
	Command Line Tools (OS X 10.10) for Xcode (7.2), 160646K [recommended]

Then we choose to update only the Xcode command line tools:

$ sudo softwareupdate --install 'Command Line Tools (OS X 10.10) for Xcode-7.2'

Software Update Tool
Copyright 2002-2012 Apple Inc.
Finding available software
Downloaded Command Line Tools (OS X 10.10) for Xcode
Installing Command Line Tools (OS X 10.10) for Xcode
Done with Command Line Tools (OS X 10.10) for Xcode
Done.

To install all available updates instead, run:

$ sudo softwareupdate --install -a

After updating the whole system to 10.12.3, the build worked.

osquery> select * from os_version;
         name = Mac OS X
      version = 10.12.3
        major = 10
        minor = 12
        patch = 3
        build = 16D32
     platform = darwin
platform_like = darwin
     codename =

What can netdata monitor?

netdata collects several thousand metrics per device. All these metrics are collected and visualized in real-time.

Almost all metrics are auto-detected, without any configuration.

  • CPU: utilization, interrupts, per-core statistics

  • Memory: RAM, swap, etc.

  • Disks (per disk): I/O, operations, backlog, utilization, space, software RAID (md)

  • Network: bandwidth (received and sent), packets, errors, drops.

  • Network latency: is fping the mainstream tool for collecting latency nowadays? smokeping used to be quite common.

  • Connections and processes

  • Nginx: based on the data exposed by nginx stub_status.

  • MySQL: multiple servers, each showing: bandwidth, queries/s, handlers, locks, issues, tmp operations, connections, binlog metrics, threads, innodb metrics, and more

  • Redis: multiple servers, each showing: operations, hit rate, memory, keys, clients, slaves

  • memcached: multiple servers, each showing: bandwidth, connections, items

  • elasticsearch: search and index performance, latency, timings, cluster statistics, threads statistics, etc

  • bind (DNS server): multiple servers, each showing: clients, requests, queries, updates, failures and several per view metrics

  • Squid and varnish

  • Hardware sensors (lm_sensors and IPMI): temperature, voltage, fans, power, humidity
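To show what application-level collection boils down to, here is a sketch that parses the plain-text page served by nginx's stub_status module into numbers that could be charted. It is not netdata's own collector; the function name is hypothetical and the sample text follows the format shown in the nginx documentation.

```python
import re

SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text):
    """Turn the plain-text stub_status page into a dict of integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The three totals sit alone on one line: accepts, handled, requests.
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(n) for n in totals.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


if __name__ == "__main__":
    print(parse_stub_status(SAMPLE))
```

The accepts, handled and requests values are counters that only grow, so a collector would chart their per-second difference, while the remaining values are gauges that can be charted as they are.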