LustreFS
File System
Lustre is a parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau of Linux and cluster. Lustre file system software is available under the GNU General Public License (version 2 only) and provides high-performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site systems. Since June 2005, Lustre has consistently been used by at least half of the top ten, and more than 60 of the top 100, fastest supercomputers in the world, including the world's No. 1 ranked TOP500 supercomputer in June 2020, Fugaku, as well as previous top supercomputers such as Titan and Sequoia.
Architecture
A Lustre file system has three major functional units:
- One or more metadata server (MDS) nodes, each with one or more metadata target (MDT) devices per Lustre filesystem, that store namespace metadata such as filenames, directories, access permissions, and file layout. The MDT data is stored in a local disk filesystem. However, unlike block-based distributed filesystems such as GPFS and PanFS, where the metadata server controls all the block allocation, the Lustre metadata server is involved only in pathname and permission checks, not in file I/O operations, which avoids I/O scalability bottlenecks on the metadata server. Support for multiple MDTs in a single filesystem was introduced in Lustre 2.4, allowing directory subtrees to reside on secondary MDTs; Lustre 2.7 and later also allow large single directories to be distributed across multiple MDTs.
- One or more object storage server (OSS) nodes that store file data on one or more object storage target (OST) devices. Depending on the server's hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
- Client(s) that access and use the data. Lustre presents all clients with a unified namespace for all the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.
The MDT, OST, and client may be on the same node (usually for testing purposes), but in typical production installations these devices are on separate nodes communicating over a network. Each MDT and OST may be part of only a single filesystem, though it is possible to have multiple MDTs or OSTs on a single node that are part of different filesystems. The Lustre Network (LNet) layer can use several types of network interconnects, including native InfiniBand verbs, Omni-Path, RoCE, and iWARP via OFED, TCP/IP on Ethernet, and other proprietary network technologies such as the Cray Gemini interconnect.
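File striping can be made concrete with a short sketch. The fragment below is purely illustrative (it is not a Lustre API; the stripe size and OST indices are invented values): it maps a byte offset within a round-robin striped file to the OST that holds it.

    # Illustrative model of RAID-0-style striping across OSTs.
    # stripe_size and ost_indices are hypothetical values, not
    # parameters of any Lustre interface.
    def ost_for_offset(offset, stripe_size, ost_indices):
        """Return the OST index holding the byte at `offset` for a
        file striped round-robin across `ost_indices`."""
        stripe_number = offset // stripe_size
        return ost_indices[stripe_number % len(ost_indices)]

    osts = [3, 7, 12, 15]            # hypothetical OST indices
    mib = 1024 * 1024                # 1 MiB stripe size
    for off in (0, mib, 4 * mib, 5 * mib):
        print(off, "-> OST", ost_for_offset(off, mib, osts))
    # 0 and 4 MiB land on OST 3; 1 MiB and 5 MiB land on OST 7.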
Building Blocks
The major components of a Lustre file system cluster are:
- MGS + MGT: Management service, provides a registry of all active Lustre servers and clients, and stores Lustre configuration information. MGT is the management service storage target used to store configuration data.
- MDS + MDTs: Metadata service, provides file system namespace (the file system index), storing the inodes for a file system. MDT is the metadata storage target, the storage device used to hold metadata information persistently. Multiple MDS and MDTs can be added to provide metadata scaling.
- OSS + OSTs: Object Storage service, provides bulk storage of data. Files can be written in stripes across multiple object storage targets (OSTs); striping delivers scalable performance and capacity for files. The OSS is the primary scalable service unit and determines the overall aggregate throughput and capacity of the file system (see the sketch after this list).
- Clients: Lustre clients mount each Lustre file system instance using the Lustre Network protocol (LNet). The client presents a POSIX-compliant file system to the operating system; applications use standard POSIX system calls for Lustre IO and do not need to be written specifically for Lustre.
- Network: Lustre is a network-based file system; all IO transactions are sent using network RPCs. Clients have no local persistent storage and are often diskless. Lustre supports many different network technologies, including Omni-Path (OPA), InfiniBand (IB), and Ethernet.
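As a rough sketch of the claim that OSS building blocks determine aggregate throughput and capacity, the fragment below sums both quantities over a set of identical blocks. The per-block figures are assumptions chosen for illustration; real values come from benchmarking each building block.

    from dataclasses import dataclass

    @dataclass
    class OssBuildingBlock:
        osts: int            # OSTs served by this block
        tb_per_ost: float    # usable capacity per OST, in TB (assumed)
        gbps: float          # measured block throughput, GB/s (assumed)

    # Twelve identical object storage building blocks:
    blocks = [OssBuildingBlock(osts=8, tb_per_ost=50.0, gbps=10.0)] * 12

    capacity_tb = sum(b.osts * b.tb_per_ost for b in blocks)
    throughput = sum(b.gbps for b in blocks)
    print(f"{capacity_tb:.0f} TB usable, ~{throughput:.0f} GB/s aggregate")
    # -> 4800 TB usable, ~120 GB/s aggregate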
Networking
[Figure: Lustre building blocks. Lustre clients (1 to 100,000+) connect over a high-performance data network (InfiniBand, 10GbE) to metadata servers with metadata and management targets (MDT, MGT) and to object storage servers with object storage targets (OSTs); storage servers are grouped into failover pairs. Adding OSS building blocks increases capacity and throughput, and uniformity in building block configuration promotes consistency in performance and behaviour as the file system grows; server IO should be balanced against storage IO capability for best utilisation. Metadata is stored separately from file object data, and with DNE multiple metadata servers can be added to increase namespace capacity and performance.]

Applications do not run directly on storage servers: all application I/O is transacted over a Lustre network. Lustre network I/O is transmitted using a protocol called LNet, derived from the Portals network programming interface. LNet has native support for TCP/IP networks as well as RDMA networks such as Intel Omni-Path Architecture (OPA) and InfiniBand, and it supports heterogeneous network environments. LNet can aggregate IO across independent interfaces, enabling network multipathing. Servers and clients can be multi-homed, and traffic can be routed between networks using dedicated machines called LNet routers, which provide a gateway between different LNet networks. Multiple routers can be grouped into pools to provide performance scalability and multiple routes for availability.
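LNet identifies end-points by NID ("network identifier") strings of the form address@network, for example 192.168.1.10@tcp or 10.0.0.5@o2ib. The helper below is a hypothetical illustration of this naming scheme, not part of any Lustre API.

    def parse_nid(nid):
        """Split a Lustre NID of the form address@network."""
        address, network = nid.rsplit("@", 1)
        return address, network

    for nid in ("192.168.1.10@tcp", "10.0.0.5@o2ib", "172.16.0.9@tcp1"):
        addr, net = parse_nid(nid)
        print(addr, "on LNet network", net)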
The Lustre network protocol is connection-based: end-points maintain shared, coordinated state. Servers maintain exports for each active connection from a client to a server storage target, and clients maintain imports as the inverse of server exports. Connections are per-target and per-client: if a server exports N targets to P clients, there will be (N * P) exports on the server, and each client will have N imports from that server. Clients hold imports for every Lustre storage target on every server that makes up the file system.
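A quick worked example of this bookkeeping, with invented numbers:

    n_targets = 4     # storage targets exported by one server
    p_clients = 100   # connected clients

    server_exports = n_targets * p_clients   # exports held on the server
    client_imports = n_targets               # imports held by each client
    print(server_exports, client_imports)    # 400 exports, 4 imports per client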
Most Lustre protocol actions are initiated by clients. The most common activity in the Lustre protocol is for a client to initiate an RPC to a specific target. A server may also initiate an RPC to a target on another server, e.g. an MDS RPC to the MGS for configuration data, or an RPC from an MDS to an OST to update the MDS's state with available space data. Object storage servers never communicate with other object storage servers: all coordination is managed via the MDS or MGS. An OSS does not initiate connections to clients or to an MDS; it is relatively passive, waiting for incoming requests from either an MDS or Lustre clients.
Lustre and Linux
The core of Lustre runs in the Linux kernel on both servers and clients. Lustre servers have a choice of backend storage target formats: either LDISKFS (derived from EXT4) or ZFS (ported from OpenSolaris to Linux). Lustre servers using LDISKFS storage require patches to the Linux kernel. These patches improve performance or enable instrumentation that is useful during the automated test processes in Lustre's software development lifecycle. The list of patches continues to shrink as kernel development advances, and there are initiatives underway to remove customized patching of the Linux kernel for Lustre servers entirely.
Lustre servers using ZFS OSD storage, and all Lustre clients, do not require patched kernels. The Lustre client software is being merged into the mainstream Linux kernel and is available in the kernel staging tree.
High Availability
Service availability / continuity is sustained using a High Availability failover resource management model, where multiple servers are connected to shared storage subsystems and services are distributed across the server nodes. Individual storage targets are managed as active-passive failover resources, and multiple resources can run in the same HA configuration for optimal utilisation. If a server develops a fault, then any Lustre storage target managed by the failed server can be transferred to a surviving server that is connected to the same storage array. Failover is completely application-transparent: system calls are guaranteed to complete across failover events.
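As a minimal sketch of this active-passive model (purely for exposition: node and target names are hypothetical, and real deployments delegate failover to a cluster resource manager such as Pacemaker), the fragment below migrates the storage targets of a failed node to its HA partner.

    class HAPair:
        """Two servers sharing storage; each target is active on one node."""
        def __init__(self, node_a, node_b, targets):
            self.healthy = {node_a: True, node_b: True}
            self.owner = {t: node_a for t in targets}

        def fail(self, node):
            """Mark a node failed and move its targets to the survivor."""
            self.healthy[node] = False
            survivor = next(n for n, ok in self.healthy.items() if ok)
            for target, owner in self.owner.items():
                if owner == node:
                    self.owner[target] = survivor

    pair = HAPair("oss01", "oss02", ["OST0000", "OST0001"])
    pair.fail("oss01")
    print(pair.owner)   # both OSTs are now served by oss02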
To ensure that failover is handled seamlessly, data modifications in Lustre are asynchronous and transactional. The client software maintains a transaction log. If there is a server failure, the client will automatically re-connect to the failover server and replay the transactions that were not committed prior to the failure. Transaction log entries are removed once the client receives confirmation that the IO has been committed to disk.
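A minimal sketch of this replay mechanism, assuming a log keyed by transaction number and a last-committed watermark carried in server replies (the names are illustrative, not the Lustre client's actual internals):

    class ReplayLog:
        def __init__(self):
            self.pending = {}   # transaction number -> request

        def record(self, transno, request):
            """Log a request until the server confirms it is on disk."""
            self.pending[transno] = request

        def committed(self, last_committed):
            """Drop entries at or below the server's committed watermark."""
            self.pending = {t: r for t, r in self.pending.items()
                            if t > last_committed}

        def replay(self, resend):
            """After reconnecting to the failover server, resend
            uncommitted requests in transaction order."""
            for transno in sorted(self.pending):
                resend(self.pending[transno])

    log = ReplayLog()
    log.record(101, "setattr fileA")
    log.record(102, "write fileB")
    log.committed(101)                                  # 101 is on disk
    log.replay(lambda req: print("replaying:", req))    # replays only 102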
All Lustre server types (MGS, MDS and OSS) support failover. A single Lustre file system installation will usually be composed of several HA clusters, each providing a discrete set of metadata or object services that is a subset of the whole file system. These discrete HA clusters are the building blocks for a high-availability Lustre parallel distributed file system that can scale to tens of petabytes in capacity and to more than one terabyte per second in aggregate throughput.
Building block patterns can vary, a reflection of the flexibility that Lustre affords integrators and administrators when designing high-performance storage infrastructure. The most common blueprint employs two servers joined to shared storage in an HA clustered-pair topology. While HA clusters can vary in the number of servers, a two-node configuration provides the greatest overall flexibility, as it is the smallest storage building block that also provides high availability. Each building block has a well-defined capacity and measured throughput, so Lustre file systems can be designed in terms of the number of building blocks required to meet capacity and performance objectives.
A single Lustre file system can scale linearly based on the number of building blocks. The minimum HA configuration for Lustre is a metadata and management building block that provides the MDS and MGS services, plus a single object storage building block for the OSS services. Using these basic units, one can create file systems with hundreds of OSSs as well as several MDSs, using HA building blocks to provide a reliable, high-performance platform.
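The sizing exercise can be expressed as a short calculation. In the sketch below, the per-block capacity and throughput figures are assumptions for illustration; the function returns the number of identical OSS building blocks needed to meet both a capacity target and a throughput target.

    import math

    def oss_blocks_needed(target_tb, target_gbps,
                          block_tb=400.0, block_gbps=10.0):
        """Blocks required to satisfy both the capacity (TB) and
        throughput (GB/s) targets, whichever binds harder."""
        by_capacity = math.ceil(target_tb / block_tb)
        by_throughput = math.ceil(target_gbps / block_gbps)
        return max(by_capacity, by_throughput)

    # A 10 PB, 200 GB/s file system under these assumptions:
    print(oss_blocks_needed(10_000, 200))   # -> 25, capacity-bound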