Network telemetry is the process and instrumentation for acquiring and utilizing network data remotely for network monitoring and operation. It is a general term for a large set of network visibility techniques and protocols, covering aspects such as data generation, collection, correlation, and consumption. Network telemetry addresses current network operation issues and enables a smooth evolution toward future intent-driven autonomous networks (Song et al., 2021).
Network visibility is the ability of management tools to see the state and behavior of a network, which is essential for successful network operation. It is beneficial to clarify the concept and provide a clear architectural framework for network telemetry, so that we can articulate the technical field and better align the related techniques and standards work (Song et al., 2021).
NETWORK DATA ANALYTICS
Thanks to advances in computing and storage technologies, network big data analytics gives network operators an opportunity to gain network insights and move toward network autonomy. Some operators have started to explore the application of Artificial Intelligence (AI) to make sense of network data. Software tools can use the network data to detect and react to network faults, anomalies, and policy violations, as well as to predict future events. In turn, network policy updates for planning, intrusion prevention, optimization, and self-healing may be applied.
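As a minimal illustration of using network data to flag anomalies, the sketch below applies a plain z-score test to a series of samples. The z-score is a stand-in chosen for brevity, not a method named by the source; the function name, the threshold, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(samples, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    if len(samples) < 2:
        return []
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        # All samples are identical, so nothing stands out.
        return []
    return [(i, x) for i, x in enumerate(samples)
            if abs(x - mu) / sigma > threshold]

# Hypothetical link-utilization samples (percent), one per collection interval.
utilization = [41, 40, 43, 39, 42, 41, 40, 97, 42, 41, 40, 43]
anomalies = find_anomalies(utilization, threshold=2.5)
```

Here only the 97 percent sample at index 7 is flagged. A production system would more likely use a rolling baseline or a trained model, since a single global mean is distorted by the very outliers it is meant to find.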
It is conceivable that an autonomic network is the logical next step for network evolution following Software Defined Networking (SDN), aiming to reduce (or even eliminate) human labor, make more efficient use of network resources, and provide better services more aligned with customer requirements.
NETCONF: Network Configuration Protocol. The NETCONF protocol defines a simple mechanism through which a network device can be managed, configuration data can be retrieved, and new configuration data can be uploaded and manipulated. The protocol allows the device to expose a full, formal application programming interface (API). Applications can use this straightforward API to send and receive full and partial configuration data sets. A key aspect of NETCONF is that it allows the functionality of the management protocol to closely mirror the native functionality of the device. This reduces implementation costs and allows timely access to new features. In addition, applications can access both the syntactic and semantic content of the device’s native user interface.
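To make the retrieval of configuration data concrete, the following sketch builds a NETCONF `<get-config>` request for a datastore using only the Python standard library. The helper name `build_get_config` is hypothetical, and the sketch stops at producing the XML: opening the session, exchanging capabilities, and message framing are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Base namespace for NETCONF protocol messages.
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_get_config(message_id, datastore="running"):
    """Return a <get-config> RPC for the given datastore as an XML string."""
    # Serialize the NETCONF namespace as the default (unprefixed) namespace.
    ET.register_namespace("", NETCONF_NS)
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": str(message_id)})
    get_config = ET.SubElement(rpc, f"{{{NETCONF_NS}}}get-config")
    source = ET.SubElement(get_config, f"{{{NETCONF_NS}}}source")
    # The datastore is named by an empty child element, e.g. <running/>.
    ET.SubElement(source, f"{{{NETCONF_NS}}}{datastore}")
    return ET.tostring(rpc, encoding="unicode")

request_xml = build_get_config(101)
```

The device answers with an `<rpc-reply>` carrying the same `message-id`, which is how a client matches replies to requests.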
gNMI: gRPC Network Management Interface, a network management protocol from the OpenConfig Operator Working Group, mainly contributed by Google. gNMI is a service that defines an interface for a network management system to interact with a network element.
gRPC: gRPC Remote Procedure Call, the framework that gNMI is based on. gRPC is a modern, open-source, high-performance Remote Procedure Call (RPC) framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking, and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications, and browsers to backend services (gRPC, 2021).
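A gNMI client addresses data on a network element with a path made of named elements, each optionally carrying key-value pairs that select a list entry. The sketch below parses the common string form of such a path into plain dictionaries with `name` and `key` fields. It is an illustration only: the function names are hypothetical, real clients use the generated protobuf classes, and escaped brackets or slashes are not handled.

```python
import re

# One path element: a name followed by zero or more [key=value] selectors.
_ELEM_RE = re.compile(r"^(?P<name>[^\[\]]+)(?P<keys>(\[[^=\[\]]+=[^\[\]]*\])*)$")
_KEY_RE = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")

def split_segments(path):
    """Split on '/' but not inside [...] so key values may contain slashes."""
    segments, current, depth = [], [], 0
    for ch in path.strip("/"):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments

def parse_path(path):
    """Turn a string path into a list of {'name': ..., 'key': {...}} dicts."""
    elems = []
    for segment in split_segments(path):
        match = _ELEM_RE.match(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        keys = dict(_KEY_RE.findall(match.group("keys")))
        elems.append({"name": match.group("name"), "key": keys})
    return elems

elems = parse_path("/interfaces/interface[name=Ethernet1/1]/state/counters")
```

The second element of `elems` carries the key `{"name": "Ethernet1/1"}`, which shows why the splitter has to track bracket depth rather than split on every slash.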
RESTCONF: An HTTP-based protocol that provides a programmatic interface for accessing data defined in YANG, using the datastore concepts defined in NETCONF. RESTCONF uses HTTP methods to provide CRUD operations on a conceptual datastore containing YANG-defined data, which is compatible with a server that implements NETCONF datastores.
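The mapping from CRUD operations to HTTP methods, and from a YANG data node to a URL, can be sketched as follows. This is a simplified view under stated assumptions: the `/restconf` root is assumed rather than discovered, the host name and helper name are hypothetical, and PUT and PATCH are reduced to "replace" and "merge". No request is actually sent.

```python
from urllib.parse import quote

# Simplified CRUD-to-HTTP mapping for RESTCONF data resources.
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "replace": "PUT",
    "merge": "PATCH",
    "delete": "DELETE",
}

def build_request(operation, host, path_steps):
    """Return (method, url, headers) for an operation on a data resource.

    path_steps is a list of (node_name, key_values) pairs; key_values is a
    tuple of list keys, empty for containers and leaves.
    """
    method = CRUD_TO_HTTP[operation]
    segments = []
    for name, keys in path_steps:
        segment = name
        if keys:
            # List keys follow '=' and are percent-encoded, comma-separated.
            segment += "=" + ",".join(quote(str(k), safe="") for k in keys)
        segments.append(segment)
    # Assumption: the API root is /restconf on this (hypothetical) server.
    url = f"https://{host}/restconf/data/" + "/".join(segments)
    headers = {"Accept": "application/yang-data+json"}
    return method, url, headers

method, url, headers = build_request(
    "read",
    "device.example.net",
    [("ietf-interfaces:interfaces", ()), ("interface", ("Ethernet1/1",))],
)
```

The slash inside the interface name is percent-encoded as `%2F`, so it is not mistaken for a path separator.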
YANG: A data modeling language for defining data sent over network management protocols such as NETCONF and RESTCONF. It is used to model configuration data, state data, Remote Procedure Calls, and notifications.
TOP LEVEL MODULES
Telemetry can be applied on the forwarding plane, the control plane, and the management plane in a network, as well as on other sources outside the network. Therefore, we categorize network telemetry into four distinct modules (management plane, control plane, forwarding plane, and external data and event telemetry), each having its own interface to Network Operation Applications.
We summarize the major differences of the four modules in the following table. They are compared from six angles:
- Data Export Location
- Telemetry Application Protocol
- Data Transport Method
USE CASES FOR NETWORK OPERATIONS
The following set of use cases is essential for network operations. While the list is by no means exhaustive, it is enough to highlight the requirements for data velocity, variety, volume, and veracity, the attributes of big data, in networks.
- Security: Network intrusion detection and prevention systems need to monitor network traffic and activities and act upon anomalies. Given increasingly sophisticated attack vectors coupled with increasingly severe consequences of security breaches, new tools and techniques need to be developed, relying on wider and deeper visibility into networks. The ultimate goal is to achieve security with no, or only minimal, human intervention, and without disrupting legitimate traffic flows.
- Policy and Intent Compliance: Network policies are the rules that constrain the services for network access, provide service differentiation, or enforce specific treatment on the traffic. For example, a service function chain is a policy that requires the selected flows to pass through a set of ordered network functions. Intent, as defined in [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational goals that a network should meet and outcomes that a network is supposed to deliver, defined in a declarative manner without specifying how to achieve or implement them. Any violation must be reported immediately, potentially resulting in updates to how the policy or intent is applied in the network to ensure that it remains in force, or otherwise alerting the network administrator to the policy or intent violation.
- SLA Compliance: A Service-Level Agreement (SLA) is a service contract between a service provider and a client, which includes the metrics for the service measurement and the remedy/penalty procedures that apply when the service level misses the agreement. Users need to check whether they get the service as promised, and network operators need to evaluate how they can deliver services that meet the SLA based on real-time network telemetry data, including data from network measurements (Song et al., 2021).
- Root Cause Analysis: Many network failures can be the effect of a sequence of chained events. Troubleshooting and recovery require quick identification of the root cause of any observable issues. However, the root cause is not always straightforward to identify, especially when the failure is sporadic and the number of event messages, both related and unrelated to the same cause, is overwhelming. While technologies such as machine learning can be used for root cause analysis, it is up to the network to sense and provide the relevant diagnostic data, which are either actively fed into, or passively retrieved by, the root cause analysis applications.
- Network Optimization: This covers all short-term and long-term network optimization techniques, including load balancing, Traffic Engineering (TE), and network planning. Network operators are motivated to optimize their network utilization and differentiate services for better Return On Investment (ROI) or lower Capital Expenditures (CAPEX). In some cases, micro-bursts need to be detected in a very short time-frame so that fine-grained traffic control can be applied to avoid network congestion. Long-term planning of network capacity and topology requires analysis of real-world network telemetry data that is obtained over long periods of time.
- Event Tracking and Prediction: Visibility into traffic paths and performance is critical for services and applications that rely on healthy network operation. Numerous related network events are of interest to network operators. For example, network operators want to learn where and why packets are dropped for an application flow. They also want to be warned of issues in advance, so that proactive actions can be taken to avoid catastrophic consequences.
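The SLA compliance use case above amounts to comparing measured metrics against agreed limits. A minimal sketch, assuming hypothetical metric names, limits, and measurements, might look like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlaTarget:
    metric: str
    limit: float
    higher_is_better: bool = False  # e.g. availability, as opposed to latency

def check_sla(targets, measured):
    """Return (metric, measured_value, limit) for every target that is missed.

    A metric with no measurement is reported with a value of None, since
    compliance cannot be shown without data.
    """
    violations = []
    for target in targets:
        value = measured.get(target.metric)
        if value is None:
            violations.append((target.metric, None, target.limit))
            continue
        if target.higher_is_better:
            ok = value >= target.limit
        else:
            ok = value <= target.limit
        if not ok:
            violations.append((target.metric, value, target.limit))
    return violations

# Hypothetical contract terms and one interval of telemetry-derived metrics.
targets = [
    SlaTarget("latency_ms", 20.0),
    SlaTarget("loss_pct", 0.1),
    SlaTarget("availability_pct", 99.9, higher_is_better=True),
]
measured = {"latency_ms": 18.4, "loss_pct": 0.25, "availability_pct": 99.95}
violations = check_sla(targets, measured)
```

In this example only packet loss misses its limit. Treating missing data as a violation is a design choice made here for illustration; an operator might instead raise a separate "no data" alarm.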
CHALLENGES AND LIMITATIONS
- Most use cases need to continuously monitor the network and dynamically refine the data collection in real time. Poll-based low-frequency data collection is ill-suited for these applications. Subscription-based streaming data directly pushed from the data source (e.g., the forwarding chip) is preferred to provide sufficient data quantity and precision at scale.
- Comprehensive data is needed, ranging from packet processing engines to traffic managers, from line cards to main control boards, from user flows to control protocol packets, from device configurations to operations, and from the physical layer to the application layer. Conventional OAM only covers a narrow range of data (e.g., SNMP only handles data from the Management Information Base (MIB)). Classical network devices cannot provide all the necessary probes. More open and programmable network devices are therefore needed.
- The conventional passive measurement techniques can either consume excessive network resources and produce excessive redundant data, or lead to inaccurate results; on the other hand, the conventional active measurement techniques can interfere with the user traffic, and their results are indirect. Techniques that can collect direct and on-demand data from user traffic are more favorable.
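The contrast between poll-based collection and subscription-based streaming can be illustrated with a small simulation. Everything in it is hypothetical: the `QueueDepthSource` class, the sample values, the polling interval, and the threshold are chosen only to show a short burst that a slow poller never observes.

```python
def poll(samples, interval):
    """Low-frequency polling: observe only every `interval`-th sample."""
    return [(tick, value) for tick, value in enumerate(samples)
            if tick % interval == 0]

class QueueDepthSource:
    """A data source that pushes samples to subscribers as they occur."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, threshold, callback):
        # Deliver only samples at or above the subscriber's threshold.
        self._subscribers.append((threshold, callback))

    def record(self, tick, depth):
        for threshold, callback in self._subscribers:
            if depth >= threshold:
                callback(tick, depth)

# Hypothetical queue depths, one per tick, with a short burst at ticks 5-7.
depths = [3, 4, 2, 5, 3, 88, 95, 91, 4, 3, 2, 4, 3, 5, 2, 3, 4, 2, 3, 4]

source = QueueDepthSource()
pushed = []
source.subscribe(80, lambda tick, depth: pushed.append((tick, depth)))
for tick, depth in enumerate(depths):
    source.record(tick, depth)

polled = poll(depths, interval=10)
```

The poller sees only ticks 0 and 10 and reports a quiet queue, while the subscriber receives all three burst samples. Filtering at the source also keeps the exported volume small, which matters when the data is pushed at scale.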
gRPC. (2021). Introduction to gRPC. http://www.grpc.io/docs/what-is-grpc/introduction/
Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., & Wang, A. (2021). Network Telemetry Framework. IETF. http://www.ietf.org/id/draft-IETF-opsawg-ntf-13.html
wenovus. (2021). gNMI – gRPC Network Management Interface. GitHub. http://www.github.com/openconfig/reference/tree/master/rpc/gnmi