.. SPDX-License-Identifier: GPL-2.0

=====================
Fake NUMA For CPUSets
=====================

:Author: David Rientjes <rientjes@cs.washington.edu>

Using numa=fake and CPUSets for Resource Management

This document describes how the numa=fake x86_64 command-line option can be used
in conjunction with cpusets for coarse memory management.  Using this feature,
you can create fake NUMA nodes that represent contiguous chunks of memory and
assign them to cpusets and their attached tasks.  This is a way of limiting the
amount of system memory that is available to a certain class of tasks.

For more information on the features of cpusets, see
Documentation/admin-guide/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs.  For
more information on the numa=fake command-line option and its various ways of
configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst.

For the purposes of this introduction, we'll assume a very primitive NUMA
emulation setup of "numa=fake=4*512,".  This will split our system memory into
four equal chunks of 512M each that we can now assign to cpusets.  As
you become more familiar with using this combination for resource control,
you'll determine a better setup to minimize the number of nodes you have to deal
with.

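Before relying on the fake nodes, it can help to confirm which numa=fake
setting the kernel was actually booted with.  The following is a minimal
sketch, not part of the original procedure: the helper name
``fake_numa_spec`` and the sample command line are made up for
illustration, and the helper simply extracts the option value from a
command-line string.

```shell
#!/bin/sh
# Print the value of the numa=fake= option found in a kernel command line.
# $1: the command line as a single string.  Prints nothing if the option
# is absent.  (Hypothetical helper, named here only for illustration.)
fake_numa_spec() {
	printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^numa=fake=//p'
}

# Hypothetical command line, used here only for illustration.
sample_cmdline='ro root=/dev/sda1 numa=fake=4*512, quiet'
fake_numa_spec "$sample_cmdline"	# prints 4*512,
```

On a running system you would pass the real command line instead, for
example ``fake_numa_spec "$(cat /proc/cmdline)"``.
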
A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::

	Faking node 0 at 0000000000000000-0000000020000000 (512MB)
	Faking node 1 at 0000000020000000-0000000040000000 (512MB)
	Faking node 2 at 0000000040000000-0000000060000000 (512MB)
	Faking node 3 at 0000000060000000-0000000080000000 (512MB)
	...
	On node 0 totalpages: 130975
	On node 1 totalpages: 131072
	On node 2 totalpages: 131072
	On node 3 totalpages: 131072

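The sizes in the dmesg output above can be cross-checked with a little
shell arithmetic.  This is only a sketch: the helper names ``range_mb``
and ``pages_kb`` are invented for this example, and the page-count
conversion assumes 4 kB pages, the usual base page size on x86_64.

```shell
#!/bin/sh
# Size in MB of a physical range printed by dmesg as START-END (hex digits
# without a 0x prefix, as in the "Faking node" lines above).
range_mb() {
	echo $(( (0x$2 - 0x$1) / 1048576 ))
}

# Size in kB of a node given its totalpages count.
# Assumption: 4 kB pages, the usual base page size on x86_64.
pages_kb() {
	echo $(( $1 * 4 ))
}

range_mb 0000000020000000 0000000040000000	# node 1: prints 512
pages_kb 131072					# nodes 1-3: prints 524288
pages_kb 130975					# node 0: prints 523900
```

Nodes 1 to 3 come out at exactly 512M (524288 kB), while node 0 reports
97 fewer pages than the others.
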
Now, following the instructions for mounting the cpusets filesystem from
Documentation/admin-guide/cgroup-v1/cpusets.rst, you can assign fake nodes
(i.e. contiguous memory address spaces) to individual cpusets::

	[root@xroads /]# mkdir exampleset
	[root@xroads /]# mount -t cpuset none exampleset
	[root@xroads /]# mkdir exampleset/ddset
	[root@xroads /]# cd exampleset/ddset
	[root@xroads /exampleset/ddset]# echo 0-1 > cpus
	[root@xroads /exampleset/ddset]# echo 0-1 > mems

Now this cpuset, 'ddset', will only be allowed access to fake nodes 0 and 1 for
memory allocations (1G).

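The 1G figure follows from the node list written to ``mems``: two nodes
of 512M each.  A minimal sketch of that calculation is shown here.  The
helper name ``count_nodes`` is hypothetical, and the per-node size is the
one from this document's example setup, not something read from the
system.

```shell
#!/bin/sh
# Count the nodes named by a cpuset list such as "0-1" or "0-1,3".
# The body runs in a subshell so the IFS change does not leak to the caller.
count_nodes() (
	IFS=,
	total=0
	for range in $1; do
		case "$range" in
			*-*) first=${range%-*}; last=${range#*-} ;;
			*)   first=$range; last=$range ;;
		esac
		total=$(( total + last - first + 1 ))
	done
	echo "$total"
)

node_mb=512	# per-node size in this document's numa=fake=4*512 example
echo "$(( $(count_nodes 0-1) * node_mb )) MB"	# prints 1024 MB
```

From inside the cpuset directory, the same helper could be fed the live
setting with ``count_nodes "$(cat mems)"``.
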
You can now assign tasks to these cpusets to limit the memory resources
available to them according to the fake nodes assigned as mems::

	[root@xroads /exampleset/ddset]# echo $$ > tasks
	[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G &
	[1] 13425

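One way to confirm that a task really is confined to the intended nodes
is to look at the ``Mems_allowed_list`` field of its
``/proc/<pid>/status`` file.  The sketch below assumes that field is
present, as it is on current kernels; the helper name ``mems_allowed``
and the sample file contents are invented for illustration.

```shell
#!/bin/sh
# Print the Mems_allowed_list value from a /proc/<pid>/status style file.
# $1: path to the status file (defaults to the calling process's own).
mems_allowed() {
	awk '$1 == "Mems_allowed_list:" { print $2 }' "${1:-/proc/self/status}"
}

# Sample status excerpt, used here only for illustration.
sample=$(mktemp)
printf 'Name:\tbash\nMems_allowed_list:\t0-1\n' > "$sample"
mems_allowed "$sample"	# prints 0-1
rm -f "$sample"
```

For the backgrounded dd above, the equivalent check would be
``mems_allowed /proc/13425/status``.
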
Notice the difference between the system memory usage as reported by
/proc/meminfo between the restricted cpuset case above and the unrestricted
case (i.e. running the same 'dd' command without assigning it to a fake NUMA
cpuset):

	========	============	==========
	Name		Unrestricted	Restricted
	========	============	==========
	MemTotal	3091900 kB	3091900 kB
	MemFree		42113 kB	1513236 kB
	========	============	==========

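To make the comparison concrete, the values in the table can be pulled
out of a meminfo-style file and subtracted.  This is a sketch only: the
helper name ``meminfo_kb`` is hypothetical, and the two MemFree numbers
are copied from the table above rather than measured.

```shell
#!/bin/sh
# Print the value in kB of one field of a /proc/meminfo style file.
# $1: field name without the colon, e.g. MemFree
# $2: path to the file (defaults to /proc/meminfo)
meminfo_kb() {
	awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

unrestricted_free=42113		# kB, from the table above
restricted_free=1513236		# kB, from the table above
echo "$(( restricted_free - unrestricted_free )) kB"	# prints 1471123 kB
```

In the restricted case roughly 1.4G more memory stays free while the same
dd command runs.  On a live system, ``meminfo_kb MemFree`` run before and
during the dd would give the numbers to compare.
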
This allows for coarse memory management for the tasks you assign to particular
cpusets.  Since cpusets can form a hierarchy, you can create some pretty
interesting combinations of use-cases for various classes of tasks for your
memory management needs.