Adaptive Real-time Monitoring for Large-scale Networked Systems
KTH, School of Electrical Engineering (EES), Communication Networks.
2008 (English). Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

Large-scale networked systems, such as the Internet and server clusters, are omnipresent today. They increasingly deliver services that are critical to both businesses and society at large, and therefore their continuous and correct operation must be guaranteed. Achieving this requires the realization of adaptive management systems, which continuously reconfigure such large-scale dynamic systems in order to maintain their state near a desired operating point, despite changes in the networking conditions.
The focus of this thesis is continuous real-time monitoring, which is essential for the realization of adaptive management systems in large-scale dynamic environments. Real-time monitoring provides the necessary input to the decision-making process of network management, enabling management systems to perform self-configuration and self-healing tasks.
We have developed, implemented, and evaluated a design for real-time continuous monitoring of global metrics with performance objectives, such as monitoring overhead and estimation accuracy. Global metrics describe the state of the system as a whole, in contrast to local metrics, such as device counters or local protocol states, which capture the state of a local entity. Global metrics are computed from local metrics using aggregation functions, such as SUM, AVERAGE and MAX.
Our approach is based on in-network aggregation, where global metrics are incrementally computed using spanning trees. Performance objectives are achieved through filtering updates to local metrics that are sent along that tree. A key part of the design is a model for the distributed monitoring process that relates performance metrics to parameters that tune the behavior of a monitoring protocol. The model allows us to describe the behavior of individual nodes in the spanning tree in their steady state. The model has been instrumental in designing a monitoring protocol that is controllable and achieves given performance objectives.
We have evaluated our protocol, called A-GAP, experimentally, through simulation and testbed implementation. It has proved to be effective in meeting performance objectives, efficient, adaptive to changes in the networking conditions, controllable along different performance dimensions, and scalable. We have implemented a prototype on a testbed of commercial routers. The testbed measurements are consistent with simulation studies we performed for different topologies and network sizes. This proves the feasibility of the design, and, more generally, the feasibility of effective and efficient real-time monitoring in large network environments.
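As an illustration of the in-network aggregation idea described in the abstract, the following minimal sketch computes a global metric bottom-up over a spanning tree. It is not the A-GAP protocol: the tree layout, the node names and the `aggregate` helper are assumptions made for this example, and the computation is a one-shot recursion rather than an incremental, asynchronous protocol.

```python
from typing import Callable, Dict, List

def aggregate(tree: Dict[str, List[str]],
              local: Dict[str, float],
              node: str,
              combine: Callable[[List[float]], float] = sum) -> float:
    """Return the partial aggregate of the subtree rooted at `node`.

    `tree` maps a node to its children (hypothetical layout),
    `local` holds each node's local metric.
    """
    partials = [local[node]]
    for child in tree.get(node, []):
        # Each child reports the aggregate of its own subtree.
        partials.append(aggregate(tree, local, child, combine))
    return combine(partials)

# Hypothetical five-node spanning tree with made-up local metrics.
tree = {"root": ["a", "b"], "a": ["c", "d"]}
local = {"root": 1.0, "a": 2.0, "b": 3.0, "c": 4.0, "d": 5.0}

total = aggregate(tree, local, "root")           # SUM over all nodes
maximum = aggregate(tree, local, "root", max)    # MAX over all nodes
```

SUM and MAX can be combined directly from partial results in this way. AVERAGE cannot; in a sketch like this one it would be derived at the root from a (sum, count) pair.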

Place, publisher, year, edition, pages
Stockholm: KTH, 2008. 46 p.
Series
Trita-EE, ISSN 1653-5146 ; 2008:051
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-9459
ISBN: 978-91-7415-168-8 (print)
OAI: oai:DiVA.org:kth-9459
DiVA: diva2:114069
Public defence
2008-11-21, Salongen, KTHB, Osquars backe 31, KTH, Stockholm, 10:00 (English)
Note
QC 20100727
Available from: 2008-11-05. Created: 2008-11-05. Last updated: 2010-07-27. Bibliographically approved
List of papers
1. A-GAP: An Adaptive Protocol for Continuous Network Monitoring with Accuracy Objectives
2007 (English). In: IEEE Transactions on Network and Service Management, ISSN 1932-4537, Vol. 4, no 1, 2-12 p. Article in journal (Refereed). Published
Abstract [en]

We present A-GAP, a novel protocol for continuous monitoring of network state variables, which aims at achieving a given monitoring accuracy with minimal overhead. Network state variables are computed from device counters using aggregation functions, such as SUM, AVERAGE and MAX. The accuracy objective is expressed as the average estimation error. A-GAP is decentralized and asynchronous to achieve robustness and scalability. It executes on an overlay that interconnects management processes on the devices. On this overlay, the protocol maintains a spanning tree and updates the network state variables through incremental aggregation. Based on a stochastic model, it dynamically configures local filters that control whether an update is sent towards the root of the tree. We evaluate A-GAP through simulation using real traces and two different types of topologies of up to 650 nodes. The results show that we can effectively control the trade-off between accuracy and protocol overhead, and that the overhead can be reduced by almost two orders of magnitude when allowing for small errors. The protocol quickly adapts to a node failure and exhibits short spikes in the estimation error. Lastly, it can provide an accurate estimate of the error distribution in real-time.
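The role of a local filter can be sketched as follows. This is a simplified illustration under stated assumptions, not the published protocol: the `FilteredNode` class and its fixed `width` parameter are hypothetical, whereas A-GAP configures its filters dynamically from a stochastic model. A node forwards an update to its parent only when its value has drifted from the last reported value by more than the filter width.

```python
from typing import Optional

class FilteredNode:
    """Hypothetical node that suppresses small changes to its metric."""

    def __init__(self, width: float) -> None:
        self.width = width                      # assumed fixed filter width
        self.last_sent: Optional[float] = None  # value the parent currently knows
        self.messages = 0                       # updates sent towards the root

    def observe(self, value: float) -> bool:
        """Record a new local value; return True if an update is sent."""
        if self.last_sent is None or abs(value - self.last_sent) > self.width:
            self.last_sent = value
            self.messages += 1
            return True
        return False  # change stays inside the filter, nothing is sent

# Made-up sequence of local readings.
readings = [10.0, 10.5, 11.0, 13.0, 12.5, 9.0]
exact = FilteredNode(width=0.0)   # reports every change
loose = FilteredNode(width=2.0)   # tolerates an error of up to 2.0
for r in readings:
    exact.observe(r)
    loose.observe(r)
```

With these readings the unfiltered node sends six updates and the filtered node three, while the value known to the parent never deviates from the true value by more than the filter width. This is the accuracy versus overhead trade-off in its simplest form.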

Keyword
Distributed management, real-time monitoring, large-scale distributed systems, adaptive systems
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-9464 (URN)
10.1109/TNSM.2007.030101 (DOI)
2-s2.0-34547880763 (Scopus ID)
Note
QC 20100727
Available from: 2008-11-05. Created: 2008-11-05. Last updated: 2010-07-27. Bibliographically approved
2. Monitoring Flow Aggregates with Controllable Accuracy
2007 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 4787, 64-75 p. Article in journal (Refereed). Published
Abstract [en]

In this paper, we show the feasibility of real-time flow monitoring with controllable accuracy in today’s IP networks. Our approach is based on Netflow and A-GAP. A-GAP is a protocol for continuous monitoring of network state variables, which are computed from device metrics using aggregation functions, such as SUM, AVERAGE and MAX. A-GAP is designed to achieve a given monitoring accuracy with minimal overhead. A-GAP is decentralized and asynchronous to achieve robustness and scalability. The protocol incrementally computes aggregation functions inside the network and, based on a stochastic model, it dynamically configures local filters that control the overhead and accuracy. We evaluate a prototype in a testbed of 16 commercial routers and provide measurements from a scenario where the protocol continuously estimates the total number of FTP flows in the network. Local flow metrics are read out from Netflow buffers and aggregated in real time. We evaluate the prototype for the following criteria. First, the ability to effectively control the trade-off between monitoring accuracy and processing overhead; second, the ability to accurately predict the distribution of the estimation error; third, the impact of a sudden change in topology on the performance of the protocol. The testbed measurements are consistent with simulation studies we performed for different topologies and network sizes, which proves the feasibility of the protocol design, and, more generally, the feasibility of effective and efficient real-time flow monitoring in large network environments.

Keyword
Computer simulation; Error analysis; Network protocols; Robustness (control systems); Routers; Scalability; Controllable accuracy; Estimation error; Local flow metrics
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-9465 (URN)
10.1007/978-3-540-75869-3_6 (DOI)
000251162800006 ()
2-s2.0-38149116163 (Scopus ID)
Note
Conference: 10th IFIP/IEEE International Conference on Management of Multimedia and Mobile Networks and Services. San Jose, CA. Oct 31-Nov 02, 2007
Available from: 2008-11-05. Created: 2008-11-05. Last updated: 2011-09-12. Bibliographically approved
3. Real-time Network Monitoring Supporting Percentile Error Objectives
2007 (English). In: 14th HP Software University Association (HP-SUA) Workshop, 8-11 July 2007, Munich, Germany, 2007. Conference paper, Published paper (Refereed)
Abstract [en]

We report on the versatility of A-GAP for supporting different types of accuracy objectives. Previously, we considered accuracy objectives expressed in terms of the average error. In this paper, we focus on percentile error objectives. A-GAP is a protocol for continuous monitoring of network state variables. Network state variables are computed from device counters using aggregation functions, such as SUM, AVERAGE and MAX. A-GAP is designed to achieve a given monitoring accuracy with minimal overhead. A-GAP is decentralized and asynchronous to achieve robustness and scalability. It executes on an overlay that interconnects management processes on the devices. On this overlay, the protocol maintains a spanning tree and updates the network state variables through incremental aggregation. Based on a stochastic model, it dynamically configures local filters that control whether an update is sent towards the root of the tree. We evaluate A-GAP through simulation using real traces for an ISP topology (Abovenet). The results prove the versatility of A-GAP for supporting different types of accuracy objectives. The results also show that we can effectively control the trade-off between accuracy and protocol overhead, and that the overhead can be reduced significantly by allowing small errors.
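The difference between an average error objective and a percentile error objective can be made concrete with a small sketch. The helper names, the nearest-rank percentile definition and the error samples below are assumptions chosen for illustration; the abstract does not state how the paper defines or estimates the percentile.

```python
import math
from typing import List

def average_error(errors: List[float]) -> float:
    """Mean absolute estimation error."""
    return sum(abs(e) for e in errors) / len(errors)

def percentile_error(errors: List[float], p: int) -> float:
    """Smallest absolute error bound that at least p percent of the
    samples satisfy (nearest-rank definition, assumed here)."""
    ranked = sorted(abs(e) for e in errors)
    rank = max(1, math.ceil(p * len(ranked) / 100))
    return ranked[rank - 1]

# Made-up estimation errors with a single large outlier.
errors = [0.0, 1.0, -1.0, 2.0, 0.0, -3.0, 1.0, 0.0, 10.0, -2.0]

avg = average_error(errors)          # 2.0
p90 = percentile_error(errors, 90)   # 3.0
```

In this sample the single outlier of 10.0 raises the average error to 2.0, while the 90th percentile of 3.0 bounds the error for nine out of ten samples. The two objectives therefore constrain different aspects of the error distribution.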

Keyword
Distributed management, real-time monitoring, large-scale distributed systems, adaptive systems
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-9466 (URN)
Note
QC 20100727
Available from: 2008-11-05. Created: 2008-11-05. Last updated: 2010-07-27. Bibliographically approved
4. Controlling Performance Trade-offs in Adaptive Network Monitoring
2009 (English). In: 11th IFIP/IEEE International Symposium on Integrated Network Management (IM 2009), IEEE, 2009, 359-366 p. Conference paper, Published paper (Refereed)
Abstract [en]

A key requirement for autonomic (i.e., self-*) management systems is a short adaptation time to changes in the networking conditions. In this paper, we show that the adaptation time of a distributed monitoring protocol can be controlled. We show this for A-GAP, a protocol for continuous monitoring of global metrics with controllable accuracy. We demonstrate through simulations that, for the case of A-GAP, the choice of the topology of the aggregation tree controls the trade-off between adaptation time and protocol overhead in steady state. Generally, allowing a larger adaptation time permits reducing the protocol overhead. Our results suggest that the adaptation time primarily depends on the height of the aggregation tree and that the protocol overhead is strongly influenced by the number of internal nodes. We outline how A-GAP can be extended to dynamically self-configure and to continuously adapt its configuration to changing conditions, in order to meet a set of performance objectives, including adaptation time, protocol overhead, and estimation accuracy.
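The two structural quantities named in the abstract, tree height and number of internal nodes, can be computed for candidate aggregation trees as in the sketch below. The tree representation and both example topologies are assumptions for illustration. The sketch only measures the trees; it does not model adaptation time or protocol overhead.

```python
from typing import Dict, List

Tree = Dict[str, List[str]]  # node -> list of children (hypothetical layout)

def height(tree: Tree, node: str) -> int:
    """Number of edges on the longest path from `node` down to a leaf."""
    children = tree.get(node, [])
    if not children:
        return 0
    return 1 + max(height(tree, child) for child in children)

def internal_nodes(tree: Tree) -> int:
    """Number of nodes that have at least one child."""
    return sum(1 for children in tree.values() if children)

# Two made-up aggregation trees over the same five nodes.
star = {"root": ["a", "b", "c", "d"]}
chain = {"root": ["a"], "a": ["b"], "b": ["c"], "c": ["d"]}

star_shape = (height(star, "root"), internal_nodes(star))     # (1, 1)
chain_shape = (height(chain, "root"), internal_nodes(chain))  # (4, 4)
```

The star and the chain are the two extremes for five nodes: the star has height 1 and one internal node, the chain height 4 and four internal nodes. Topologies between these extremes are the design space in which the paper studies its trade-off.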

Place, publisher, year, edition, pages
IEEE, 2009
Keyword
Adaptive management, real-time monitoring, large-scale distributed systems
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-9467 (URN)
10.1109/INM.2009.5188836 (DOI)
000274304300057 ()
2-s2.0-70449386894 (Scopus ID)
978-1-4244-3486-2 (ISBN)
Conference
IFIP/IEEE International Symposium on Integrated Network Management (IM 2009) New York, NY, JUN 01-05, 2009
Note
“© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.”
QC 20100617
Available from: 2012-02-15. Created: 2008-11-05. Last updated: 2012-02-15. Bibliographically approved

By author/editor
Gonzalez Prieto, Alberto