Understanding File System Architectures.
File System Taxonomy
I’ve noticed that Clustered File Systems, Global file systems, parallel file systems and distributed file systems are commonly confused and conflated. To explain VMware vSAN™ Virtual Distributed File System™ (VDFS) I wanted to highlight some things that it is not. I’ll be largely pulling my definitions from Wikipedia but I look forward to hearing your disagreements on twitter. It is work noting some file systems can have elements that cross the taxonomy of file system layers for various reasons. In some cases, some of these definitions are subcategories of others. In other cases, some file systems (GPFS as an example) can operate in different modes (providing RAID and data protection, or simply inherent it from a backing disk array).
Clustered File System
A clustered file system is a file system that is shared by being simultaneously mounted on multiple servers. Note, there are other methods of clustering applications and data that do not involve using a clustered file system.
Parallel file systems
Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance. While the vSAN layer mirrors some characteristics (Distributed RAID and striping) it does not 100% match with being a parallel file system.
Examples would include OneFS and GlusterFS.
shared-disk file system
shared disk file systems are a clustered file system but are not a parallel file system. VMFS is a shared disk file system. The most common form of a clustered file system that leverages a storage area network (SAN) for shared access of the underlying LBAs. Clients are forced to handle the translation of file calls, and access control as the underlying shared disk array has no awareness of the actual file system itself. Concurrency control prevents corruption. Ever mounted NTFS to 2 different windows boxes and wondered why it corrupted the file system? NTFS is not a shared disk file system and the different operating systems instances do not independently by default know how to cleanly share the partition when they both try to mount it. In the case of VMFS, each host can mount a given volume as read and write, while cleanly making sure that access to specific subgroups of LBA’s used for different VMDKs (or even shared VMDKs) is properly handled with no data corruption. This is commonly done over a storage area network (SAN) presenting LUNs (SCSI) or namespaces (NVMe over fabrics). protocol to share this is block-based and can range from Fibre Channel, iSCSI, FCoE, FCoTR, SAS, Infiniband etc.
Examples would include: GFS2, VMFS, Apple xSAN (storenext).
Distributed file systems
Distributed file systems do not share block-level access to the same storage but use a network protocol to redirect access to the backing file server exposing the share within the namespace used. In this way, the client does not need to know the specific IP address of the backing file server, as it will request it when it makes the initial request and within the protocol (NFSv4 or SMB) be redirected. This is not exactly a new thing (DFS in Windows is a common example, but similar systems were layered on top of Novell based filers, proprietary filers etc). These redirects are important as they prevent the need to proxy IO from a single namespace server and allow the data path to flow directly from the client to the protocol endpoint that has active access to the file share. This is a bit “same same but different” to how iSCSI redirects allow connection to a target that was not specified in the client pathing, or ALUA pathing handles non-optimized paths in the block storage world. For how vSAN exposes this externally using NFS, Check out this blog, or take a look at this video:
The benefits of a distributed file system?
- Access Transparency. This allows back end physical data migrations/rebuilds to happen without the client needing to be aware and re-pointing at the new physical location. clients are unaware that files are distributed and can access them in the same way as local files are accessed.
- Transparent Scalability. Previously you would be limited to the networking throughput and resources of a single physical file server or host that hosted a file server virtual machine. With a distributed file system each new share can be distributed out onto a different physical server and cleanly allow you to scale throughput for front end access. In the case of VDFS, this scaling is done with containers that the shares are distributed across.
- Capacity and IO path efficiency – Layering a scale-out storage system on top of an existing scale-out storage system can create unwanted copies of data. VDFS uses vSAN SPBM policies on each share and integrates with vSAN to have it handle the data placement and resiliency. In addition layering, a scale-out parallel file system on top of a scale-out storage system leads to unnecessary network hops for the IO path.
- Concurrency transparency: all clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. This is distinctly different from how some global file systems operate.
It is worth noting that VDFS is a distributed file system that exists below the protocol supporting containers. A VDFS volume is mounted and presented to the container host using a secure direct hypervisor interface that bypasses TCP/IP and the vSCSI/VMDK IO paths you would traditionally use to mount a file system to virtual machine or container. I will explore more in the future. For now, Duncan Explains it a bit on this blog.
Examples include: VDFS, Mirosoft DFS, BlueArc Global Namespace
Global File System
Global File Systems are a form of a distributed file system where a distributed namespace provides transparent access to different systems that are potentially highly distributed (IE in completely different parts of the world). This is often accomplished using a blend of caching and the use of weak affinity. There are trade-offs in this approach as if the application layer is not understood by the client accessing the data you have to deal with manually resolving conflicting save attempts of the same file, or forcing one site to be “authoritative” slowing down non-primary site access. While various products in this space have existed they tend to be an intermediate step for an application-aware distributed collaboration platform (or centralizing data access using something like VDI). While async replication can be a part of a global file system, file replication systems like DFS-R would not technically qualify. Solutions like Dropbox/OneDrive have reduced the demand for this kind of solution.
Examples include: Hitachi HDI
Where do various VMware storage technologies fall within this?
VMFS – a Clustered file system, that specifically falls within the shared-disk file system. While powerful and one of the most deployed file systems in the enterprise datacenter, it was designed for use with larger files that are (With some exceptions) only accessed by a single host at a time. While support for higher numbers of files and smaller files has improved significantly over the years, general-purpose file shares are currently not a core design requirement for it.
vVols – Not a clustered file system. An abstraction layer for SAN volumes, or NFS shares. For block volumes (SAN) it leverages SUB-LUN units and directly mounts them to the hosts that need them.
VMFS-L – A non-clustered variant Used in vSAN prior to the 6.0 release. Also used for the ESXi installed volume. File system format is optimized for DAS. Optimization include aggressive caching with for the DAS use case, a stripped lockdown lock manager, and faster formats. You commonly see this used on boot devices today.
VDFS – vSAN Virtual Distributed File System. A Distributed file system that sits inside the hypervisor directly onto of vSAN objects providing the block back end. As a result, it can easily consume SPBM policies on a per-share basis. For anyone paying attention to the back end, you will notice that objects are automatically added and concatenated onto volumes when the maximum object size is reached (256GB). components behind these objects can be striped, or as a result of various reasons be automatically spanned and created across the cluster. It is currently exposed through protocol containers that export NFSv3 or NFSv4.1 as a part of vSAN file services. While VDFS does offer a namespace for NFSv4.1 one connections and handles redirection of share access, it does not currently globally redirect between disparate clusters, so it would not be considered a global file system.