.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability.  No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre.  Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.  File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughput.  When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures.  The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads.  In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into the metadata server's cache with a single I/O operation.
The contents of extremely large directories can be fragmented and
managed by independent metadata servers, allowing scalable concurrent
access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system.  Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

Snapshot names have two limitations:

* They cannot start with an underscore ('_'), as these names are reserved
  for internal usage by the MDS.
* They cannot exceed 240 characters in size.  This is because the MDS makes
  use of long snapshot names internally, which follow the format:
  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`.  Since filenames in general can't have
  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
  snapshot name itself can take at most 255 - 1 - 1 - 13 = 240 characters.
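
The two naming rules above can be checked before calling 'mkdir'.  The
following is a minimal sketch, not something shipped with Ceph: the helper
name valid_snap_name is hypothetical, and it assumes that the 240-character
limit is counted in bytes (which equals the character count only for ASCII
names).

```shell
# Hypothetical helper: succeeds if NAME satisfies the two snapshot-name
# rules described above, otherwise prints the reason and fails.
valid_snap_name() (
    name=$1
    # 255 (filename limit) - 2 underscores - 13 (inode number) = 240
    max_len=$((255 - 1 - 1 - 13))

    case $name in
        _*) echo "snapshot name must not start with '_'" >&2; exit 1 ;;
    esac

    # Assumption: the limit is measured in bytes; wc -c counts bytes.
    len=$(( $(printf '%s' "$name" | wc -c) ))
    if [ "$len" -gt "$max_len" ]; then
        echo "snapshot name is $len bytes, limit is $max_len" >&2
        exit 1
    fi
)

# Only report the name as usable when both rules hold.
if valid_snap_name "nightly"; then
    echo "nightly: ok"
fi
```

A caller would typically run 'mkdir .snap/<name>' only after the check
succeeds; the check itself never touches the file system.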

Ceph also provides some recursive accounting on directories for nested
files and bytes.  That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes.  This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.
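
The recursive statistics can also be requested one attribute at a time.
The sketch below is an illustration under stated assumptions: the attribute
names ceph.dir.rfiles, ceph.dir.rsubdirs and ceph.dir.rbytes are not listed
in this document and are assumed to be the virtual attributes exposed by the
client, 'getfattr' is assumed to be installed, and the helper name
ceph_dir_usage is hypothetical.

```shell
# Hypothetical helper: print recursive statistics for a CephFS directory,
# one "attribute<TAB>value" line per attribute.
ceph_dir_usage() (
    dir=$1
    # Assumed attribute names; adjust if the client exposes different ones.
    for attr in ceph.dir.rfiles ceph.dir.rsubdirs ceph.dir.rbytes; do
        # --only-values prints the raw value without the "name=" prefix.
        if value=$(getfattr --only-values -n "$attr" "$dir" 2>/dev/null); then
            printf '%s\t%s\n' "$attr" "$value"
        else
            printf '%s\t%s\n' "$attr" "unavailable"
        fi
    done
)
```

Calling 'ceph_dir_usage /mnt/ceph/some/dir' on a mounted file system would
print three lines; a directory where the attributes cannot be read reports
"unavailable" for each of them.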

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy.  Quotas can be set using the
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', e.g.::

 setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
 getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quota implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached.  A modified or adversarial client cannot be prevented
from writing as much data as it needs.
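
Quota values are plain byte and file counts, so larger limits are usually
computed rather than typed.  The sketch below only prints the 'setfattr'
commands for both quota attributes and executes nothing; the helper name
quota_cmds, the directory and the limits are illustrative assumptions.

```shell
# Hypothetical helper: print the setfattr commands that would set both
# quota attributes on DIR.  Nothing is executed.
quota_cmds() (
    dir=$1 max_bytes=$2 max_files=$3
    printf 'setfattr -n ceph.quota.max_bytes -v %s %s\n' "$max_bytes" "$dir"
    printf 'setfattr -n ceph.quota.max_files -v %s %s\n' "$max_files" "$dir"
)

# 10 GiB expressed in bytes, and at most 10000 files.
quota_cmds /some/dir $((10 * 1024 * 1024 * 1024)) 10000
```

Printing the commands first makes it easy to review the computed byte count
before applying it to a live directory.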

Mount Syntax
============

The basic mount syntax is::

 # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects.  (However, if the monitor you specify
happens to be down, the mount won't succeed.)  The port can be left
off if the monitor is using the default.  So if the monitor is at
1.2.3.4::

 # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient.  If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with a
slash (`/`)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the Ceph
configuration file if available.  Note that the cluster FSID (passed as part
of the device string) is validated by checking it against the FSID reported
by the monitor.
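
The pieces of the basic syntax can be assembled mechanically.  The sketch
below prints (but does not run) a mount command built from its parts; the
helper name ceph_mount_cmd is hypothetical, and the values are the example
values used above.  It always includes the FSID, i.e. it follows the basic
syntax rather than the shortened form accepted by the mount helper.

```shell
# Hypothetical helper: print a mount command in the basic syntax.
# Usage: ceph_mount_cmd USER FSID FS_NAME SUBDIR MOUNTPOINT MON [MON...]
ceph_mount_cmd() (
    if [ "$#" -lt 6 ]; then
        echo "usage: ceph_mount_cmd USER FSID FS_NAME SUBDIR MOUNTPOINT MON [MON...]" >&2
        exit 1
    fi
    user=$1 fsid=$2 fs_name=$3 subdir=$4 mountpoint=$5
    shift 5

    # Join the monitor addresses with '/'; a comma would end the option.
    mons=
    for mon in "$@"; do
        mons=${mons:+$mons/}$mon
    done

    printf 'mount -t ceph %s@%s.%s=%s %s -o mon_addr=%s\n' \
        "$user" "$fsid" "$fs_name" "$subdir" "$mountpoint" "$mons"
)

ceph_mount_cmd cephuser 07fe3187-00d9-42a3-814b-72a4d5e7d5be cephfs / \
    /mnt/ceph 192.168.1.100 192.168.1.101
```

The printed line has the same shape as the examples above, with both monitor
addresses joined into a single mon_addr value.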

Mount Options
=============

  mon_addr=ip_address[:port][/ip_address[:port]]
        Monitor address of the cluster.  This is used to bootstrap the
        connection to the cluster.  Once the connection is established, the
        monitor addresses in the monitor map are followed.

  fsid=cluster-id
        FSID of the cluster (from the `ceph fsid` command).

  ip=A.B.C.D[:N]
        Specify the IP and/or port the client should bind to locally.
        There is normally not much reason to do this.  If the IP is not
        specified, the client's IP address is determined by looking at the
        address its connection to the monitor originates from.

  wsize=X
        Specify the maximum write size in bytes.  Default: 64 MB.

  rsize=X
        Specify the maximum read size in bytes.  Default: 64 MB.

  rasize=X
        Specify the maximum readahead size in bytes.  Default: 8 MB.

  mount_timeout=X
        Specify the timeout value for mount (in seconds), in the case
        of a non-responsive Ceph file system.  The default is 60
        seconds.

  caps_max=X
        Specify the maximum number of caps to hold.  Unused caps are released
        when the number of caps exceeds the limit.  The default is 0 (no
        limit).

  rbytes
        When stat() is called on a directory, set st_size to 'rbytes',
        the summation of file sizes over all files nested beneath that
        directory.  This is the default.

  norbytes
        When stat() is called on a directory, set st_size to the
        number of entries in that directory.

  nocrc
        Disable CRC32C calculation for data writes.  If set, the storage node
        must rely on TCP's checksum to detect data corruption
        in the data payload.

  dcache
        Use the dcache contents to perform negative lookups and
        readdir when the client has the entire directory contents in
        its cache.  (This does not change correctness; the client uses
        cached metadata only when a lease or capability ensures it is
        valid.)

  nodcache
        Do not use the dcache as above.  This avoids a significant amount of
        complex code, sacrificing performance without affecting correctness,
        and is useful for tracking down bugs.

  noasyncreaddir
        Do not use the dcache as above for readdir.

  noquotadf
        Report overall filesystem usage in statfs instead of using the root
        directory quota.

  nocopyfrom
        Don't use the RADOS 'copy-from' operation to perform remote object
        copies.  Currently, it's only used in copy_file_range, which will revert
        to the default VFS implementation if this option is used.

  recover_session=<no|clean>
        Set auto reconnect mode in the case where the client is blocklisted.
        The available modes are "no" and "clean".  The default is "no".

        * no: never attempt to reconnect when the client detects that it has
          been blocklisted.  Operations will generally fail after being
          blocklisted.

        * clean: the client reconnects to the Ceph cluster automatically when
          it detects that it has been blocklisted.  During reconnect, the
          client drops dirty data/metadata and invalidates page caches and
          writable file handles.  After reconnect, file locks become stale
          because the MDS loses track of them.  If an inode contains any stale
          file locks, read/write on the inode is not allowed until
          applications release all stale file locks.
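
Several of the options above are usually combined into a single '-o'
argument.  The sketch below joins options with commas and rejects a
recover_session value other than the two documented modes; the helper name
join_mount_opts is hypothetical and the option values are illustrative, not
recommendations.

```shell
# Hypothetical helper: join mount options into one comma-separated string
# suitable for "mount -o".
join_mount_opts() (
    opts=
    for opt in "$@"; do
        case $opt in
            recover_session=no|recover_session=clean) ;;
            recover_session=*)
                echo "recover_session must be 'no' or 'clean': $opt" >&2
                exit 1 ;;
            *,*)
                echo "an option must not contain a comma: $opt" >&2
                exit 1 ;;
        esac
        opts=${opts:+$opts,}$opt
    done
    printf '%s\n' "$opts"
)

# 16 MiB read/write size, automatic reconnect after blocklisting.
size=$((16 * 1024 * 1024))
join_mount_opts mon_addr=1.2.3.4 "wsize=$size" "rsize=$size" recover_session=clean
```

The resulting string is passed as-is after '-o'.  Because commas separate
options, multiple monitor addresses inside mon_addr are joined with a slash
instead.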

More Information
================

For more information on Ceph, see the home page at
        https://ceph.com/

The Linux kernel client source tree is available at
        https://github.com/ceph/ceph-client.git

and the source for the full system is at
        https://github.com/ceph/ceph.git