Xen and the Art of Virtualization
Paul Barham∗, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
Alex Ho, Rolf Neugebauer†, Ian Pratt, Andrew Warfield
University of Cambridge Computer Laboratory
15 JJ Thomson Avenue, Cambridge, UK, CB3 0FD
{firstname.lastname}@cl.cam.ac.uk
ABSTRACT
Numerous systems have been designed which use virtualization to
subdivide the ample resources of a modern computer. Some require
specialized hardware, or cannot support commodity operating sys-
tems. Some target 100% binary compatibility at the expense of
performance. Others sacrifice security or functionality for speed.
Few offer resource isolation or performance guarantees; most pro-
vide only best-effort provisioning, risking denial of service.
This paper presents Xen, an x86 virtual machine monitor which
allows multiple commodity operating systems to share conventional
hardware in a safe and resource managed fashion, but without sac-
rificing either performance or functionality. This is achieved by
providing an idealized virtual machine abstraction to which oper-
ating systems such as Linux, BSD and Windows XP, can be ported
with minimal effort.
Our design is targeted at hosting up to 100 virtual machine in-
stances simultaneously on a modern server. The virtualization ap-
proach taken by Xen is extremely efficient: we allow operating sys-
tems such as Linux and Windows XP to be hosted simultaneously
for a negligible performance overhead — at most a few percent
compared with the unvirtualized case. We considerably outperform
competing commercial and freely available solutions in a range of
microbenchmarks and system-wide tests.
Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management; D.4.2 [Opera-
ting Systems]: Storage Management; D.4.8 [Operating Systems]:
Performance
General Terms
Design, Measurement, Performance
Keywords
Virtual Machine Monitors, Hypervisors, Paravirtualization
∗Microsoft Research Cambridge, UK
†Intel Research Cambridge, UK
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.
1. INTRODUCTION
Modern computers are sufficiently powerful to use virtualization
to present the illusion of many smaller virtual machines (VMs),
each running a separate operating system instance. This has led to
a resurgence of interest in VM technology. In this paper we present
Xen, a high performance resource-managed virtual machine mon-
itor (VMM) which enables applications such as server consolida-
tion [42, 8], co-located hosting facilities [14], distributed web ser-
vices [43], secure computing platforms [12, 16] and application
mobility [26, 37].
Successful partitioning of a machine to support the concurrent
execution of multiple operating systems poses several challenges.
Firstly, virtual machines must be isolated from one another: it is not
acceptable for the execution of one to adversely affect the perfor-
mance of another. This is particularly true when virtual machines
are owned by mutually untrusting users. Secondly, it is necessary
to support a variety of different operating systems to accommodate
the heterogeneity of popular applications. Thirdly, the performance
overhead introduced by virtualization should be small.
Xen hosts commodity operating systems, albeit with some source
modifications. The prototype described and evaluated in this paper
can support multiple concurrent instances of our XenoLinux guest
operating system; each instance exports an application binary inter-
face identical to a non-virtualized Linux 2.4. Our port of Windows
XP to Xen is not yet complete but is capable of running simple
user-space processes. Work is also progressing in porting NetBSD.
Xen enables users to dynamically instantiate an operating sys-
tem to execute whatever they desire. In the XenoServer project [15,
35] we are deploying Xen on standard server hardware at econom-
ically strategic locations within ISPs or at Internet exchanges. We
perform admission control when starting new virtual machines and
expect each VM to pay in some fashion for the resources it requires.
We discuss our ideas and approach in this direction elsewhere [21];
this paper focuses on the VMM.
There are a number of ways to build a system to host multiple
applications and servers on a shared machine. Perhaps the simplest
is to deploy one or more hosts running a standard operating sys-
tem such as Linux or Windows, and then to allow users to install
files and start processes — protection between applications being
provided by conventional OS techniques. Experience shows that
system administration can quickly become a time-consuming task
due to complex configuration interactions between supposedly dis-
joint applications.
More importantly, such systems do not adequately support per-
formance isolation; the scheduling priority, memory demand, net-
work traffic and disk accesses of one process impact the perfor-
mance of others. This may be acceptable when there is adequate
provisioning and a closed user group (such as in the case of com-
putational grids, or the experimental PlanetLab platform [33]), but
not when resources are oversubscribed, or users uncooperative.
One way to address this problem is to retrofit support for per-
formance isolation to the operating system. This has been demon-
strated to a greater or lesser degree with resource containers [3],
Linux/RK [32], QLinux [40] and SILK [4]. One difficulty with
such approaches is ensuring that all resource usage is accounted to
the correct process — consider, for example, the complex interac-
tions between applications due to buffer cache or page replacement
algorithms. This is effectively the problem of “QoS crosstalk” [41]
within the operating system. Performing multiplexing at a low level
can mitigate this problem, as demonstrated by the Exokernel [23]
and Nemesis [27] operating systems. Unintentional or undesired
interactions between tasks are minimized.
We use this same basic approach to build Xen, which multiplexes
physical resources at the granularity of an entire operating system
and is able to provide performance isolation between them. In con-
trast to process-level multiplexing this also allows a range of guest
operating systems to gracefully coexist rather than mandating a
specific application binary interface. There is a price to pay for this
flexibility — running a full OS is more heavyweight than running
a process, both in terms of initialization (e.g. booting or resuming
versus fork and exec), and in terms of resource consumption.
For our target of up to 100 hosted OS instances, we believe this
price is worth paying; it allows individual users to run unmodified
binaries, or collections of binaries, in a resource controlled fashion
(for instance an Apache server along with a PostgreSQL backend).
Furthermore it provides an extremely high level of flexibility since
the user can dynamically create the precise execution environment
their software requires. Unfortunate configuration interactions be-
tween various services and applications are avoided (for example,
each Windows instance maintains its own registry).
The remainder of this paper is structured as follows: in Section 2
we explain our approach towards virtualization and outline how
Xen works. Section 3 describes key aspects of our design and im-
plementation. Section 4 uses industry standard benchmarks to eval-
uate the performance of XenoLinux running above Xen in compar-
ison with stand-alone Linux, VMware Workstation and User-mode
Linux (UML). Section 5 reviews related work, and finally Section 6
discusses future work and concludes.
2. XEN: APPROACH & OVERVIEW
In a traditional VMM the virtual hardware exposed is function-
ally identical to the underlying machine [38]. Although full virtu-
alization has the obvious benefit of allowing unmodified operating
systems to be hosted, it also has a number of drawbacks. This is
particularly true for the prevalent IA-32, or x86, architecture.
Support for full virtualization was never part of the x86 archi-
tectural design. Certain supervisor instructions must be handled by
the VMM for correct virtualization, but executing these with in-
sufficient privilege fails silently rather than causing a convenient
trap [36]. Efficiently virtualizing the x86 MMU is also difficult.
These problems can be solved, but only at the cost of increased
complexity and reduced performance. VMware’s ESX Server [10]
dynamically rewrites portions of the hosted machine code to insert
traps wherever VMM intervention might be required. This transla-
tion is applied to the entire guest OS kernel (with associated trans-
lation, execution, and caching costs) since all non-trapping privi-
leged instructions must be caught and handled. ESX Server imple-
ments shadow versions of system structures such as page tables and
maintains consistency with the virtual tables by trapping every up-
date attempt — this approach has a high cost for update-intensive
operations such as creating a new application process.
Notwithstanding the intricacies of the x86, there are other argu-
ments against full virtualization. In particular, there are situations
in which it is desirable for the hosted operating systems to see real
as well as virtual resources: providing both real and virtual time
allows a guest OS to better support time-sensitive tasks, and to cor-
rectly handle TCP timeouts and RTT estimates, while exposing real
machine addresses allows a guest OS to improve performance by
using superpages [30] or page coloring [24].
We avoid the drawbacks of full virtualization by presenting a vir-
tual machine abstraction that is similar but not identical to the un-
derlying hardware — an approach which has been dubbed paravir-
tualization [43]. This promises improved performance, although
it does require modifications to the guest operating system. It is
important to note, however, that we do not require changes to the
application binary interface (ABI), and hence no modifications are
required to guest applications.
We distill the discussion so far into a set of design principles:
1. Support for unmodified application binaries is essential, or
users will not transition to Xen. Hence we must virtualize all
architectural features required by existing standard ABIs.
2. Supporting full multi-application operating systems is im-
portant, as this allows complex server configurations to be
virtualized within a single guest OS instance.
3. Paravirtualization is necessary to obtain high performance
and strong resource isolation on uncooperative machine ar-
chitectures such as x86.
4. Even on cooperative machine architectures, completely hid-
ing the effects of resource virtualization from guest OSes
risks both correctness and performance.
Note that our paravirtualized x86 abstraction is quite different
from that proposed by the recent Denali project [44]. Denali is de-
signed to support thousands of virtual machines running network
services, the vast majority of which are small-scale and unpopu-
lar. In contrast, Xen is intended to scale to approximately 100 vir-
tual machines running industry standard applications and services.
Given these very different goals, it is instructive to contrast Denali’s
design choices with our own principles.
Firstly, Denali does not target existing ABIs, and so can elide
certain architectural features from their VM interface. For exam-
ple, Denali does not fully support x86 segmentation although it is
exported (and widely used¹) in the ABIs of NetBSD, Linux, and
Windows XP.
Secondly, the Denali implementation does not address the prob-
lem of supporting application multiplexing, nor multiple address
spaces, within a single guest OS. Rather, applications are linked
explicitly against an instance of the Ilwaco guest OS in a manner
rather reminiscent of a libOS in the Exokernel [23]. Hence each vir-
tual machine essentially hosts a single-user single-application un-
protected “operating system”. In Xen, by contrast, a single virtual
machine hosts a real operating system which may itself securely
multiplex thousands of unmodified user-level processes. Although
a prototype virtual MMU has been developed which may help De-
nali in this area [44], we are unaware of any published technical
details or evaluation.
Thirdly, in the Denali architecture the VMM performs all paging to and from disk. This is perhaps related to the lack of memory-management support at the virtualization layer. Paging within the VMM is contrary to our goal of performance isolation: malicious virtual machines can encourage thrashing behaviour, unfairly depriving others of CPU time and disk bandwidth. In Xen we expect each guest OS to perform its own paging using its own guaranteed memory reservation and disk allocation (an idea previously exploited by self-paging [20]).

Memory Management
  Segmentation:  Cannot install fully-privileged segment descriptors and cannot overlap with the top end of the linear address space.
  Paging:        Guest OS has direct read access to hardware page tables, but updates are batched and validated by the hypervisor. A domain may be allocated discontiguous machine pages.
CPU
  Protection:    Guest OS must run at a lower privilege level than Xen.
  Exceptions:    Guest OS must register a descriptor table for exception handlers with Xen. Aside from page faults, the handlers remain the same.
  System Calls:  Guest OS may install a 'fast' handler for system calls, allowing direct calls from an application into its guest OS and avoiding indirecting through Xen on every call.
  Interrupts:    Hardware interrupts are replaced with a lightweight event system.
  Time:          Each guest OS has a timer interface and is aware of both 'real' and 'virtual' time.
Device I/O
  Network, Disk, etc.:  Virtual devices are elegant and simple to access. Data is transferred using asynchronous I/O rings. An event mechanism replaces hardware interrupts for notifications.

Table 1: The paravirtualized x86 interface.

¹For example, segments are frequently used by thread libraries to address thread-local data.
Finally, Denali virtualizes the ‘namespaces’ of all machine re-
sources, taking the view that no VM can access the resource alloca-
tions of another VM if it cannot name them (for example, VMs have
no knowledge of hardware addresses, only the virtual addresses
created for them by Denali). In contrast, we believe that secure ac-
cess control within the hypervisor is sufficient to ensure protection;
furthermore, as discussed previously, there are strong correctness
and performance arguments for making physical resources directly
visible to guest OSes.
In the following section we describe the virtual machine abstrac-
tion exported by Xen and discuss how a guest OS must be modified
to conform to this. Note that in this paper we reserve the term guest
operating system to refer to one of the OSes that Xen can host and
we use the term domain to refer to a running virtual machine within
which a guest OS executes; the distinction is analogous to that be-
tween a program and a process in a conventional system. We call
Xen itself the hypervisor since it operates at a higher privilege level
than the supervisor code of the guest operating systems that it hosts.
2.1 The Virtual Machine Interface
Table 1 presents an overview of the paravirtualized x86 interface,
factored into three broad aspects of the system: memory manage-
ment, the CPU, and device I/O. In the following we address each
machine subsystem in turn, and discuss how each is presented in
our paravirtualized architecture. Note that although certain parts
of our implementation, such as memory management, are specific
to the x86, many aspects (such as our virtual CPU and I/O devices)
can be readily applied to other machine architectures. Furthermore,
x86 represents a worst case in the areas where it differs significantly
from RISC-style processors — for example, efficiently virtualizing
hardware page tables is more difficult than virtualizing a software-
managed TLB.
2.1.1 Memory management
Virtualizing memory is undoubtedly the most difficult part of
paravirtualizing an architecture, both in terms of the mechanisms
required in the hypervisor and modifications required to port each
guest OS. The task is easier if the architecture provides a software-
managed TLB as these can be efficiently virtualized in a simple
manner [13]. A tagged TLB is another useful feature supported
by most server-class RISC architectures, including Alpha, MIPS
and SPARC. Associating an address-space identifier tag with each
TLB entry allows the hypervisor and each guest OS to efficiently
coexist in separate address spaces because there is no need to flush
the entire TLB when transferring execution.
Unfortunately, x86 does not have a software-managed TLB; in-
stead TLB misses are serviced automatically by the processor by
walking the page table structure in hardware. Thus to achieve the
best possible performance, all valid page translations for the current
address space should be present in the hardware-accessible page
table. Moreover, because the TLB is not tagged, address space
switches typically require a complete TLB flush. Given these limi-
tations, we made two decisions: (i) guest OSes are responsible for
allocating and managing the hardware page tables, with minimal
involvement from Xen to ensure safety and isolation; and (ii) Xen
exists in a 64MB section at the top of every address space, thus
avoiding a TLB flush when entering and leaving the hypervisor.
Each time a guest OS requires a new page table, perhaps be-
cause a new process is being created, it allocates and initializes a
page from its own memory reservation and registers it with Xen.
At this point the OS must relinquish direct write privileges to the
page-table memory: all subsequent updates must be validated by
Xen. This restricts updates in a number of ways, including only
allowing an OS to map pages that it owns, and disallowing writable
mappings of page tables. Guest OSes may batch update requests to
amortize the overhead of entering the hypervisor. The top 64MB
region of each address space, which is reserved for Xen, is not ac-
cessible or remappable by guest OSes. This address region is not
used by any of the common x86 ABIs however, so this restriction
does not break application compatibility.
Segmentation is virtualized in a similar way, by validating up-
dates to hardware segment descriptor tables. The only restrictions
on x86 segment descriptors are: (i) they must have lower privi-
lege than Xen, and (ii) they may not allow any access to the Xen-
reserved portion of the address space.
2.1.2 CPU
Virtualizing the CPU has several implications for guest OSes.
Principally, the insertion of a hypervisor below the operating sys-
tem violates the usual assumption that the OS is the most privileged
entity in the system. In order to protect the hypervisor from OS
misbehavior (and domains from one another) guest OSes must be
modified to run at a lower privilege level.
Many processor architectures only provide two privilege levels.
In these cases the guest OS would share the lower privilege level
with applications. The guest OS would then protect itself by run-
ning in a separate address space from its applications, and indirectly
pass control to and from applications via the hypervisor to set the
virtual privilege level and change the current address space. Again,
if the processor’s TLB supports address-space tags then expensive
TLB flushes can be avoided.