Authentication Mechanism

From wiki.hpc.mk
Revision as of 09:24, 31 August 2021 by Boris
Contents
  1. Setting up a user authentication mechanism
  2. Setting up a local database for storing users and policies
  3. Database configuration for storing user information
  4. Examples with GPU memory selection


Setting up a user authentication mechanism

To secure communication between the master server and the clients, two-way authentication must be set up using a shared key. Following standard practice, Slurm uses the MUNGE service to validate communication between the components of the system. MUNGE is designed for large HPC environments, in which the UID and GID of users and groups are validated with a shared cryptographic key. It typically uses AES-128 encryption together with a SHA-256 message authentication code. The generated key is copied, with the appropriate ownership and permissions, to every node that will be part of the cluster. Users who will use Slurm should also have the same UID and GID on every node, both for better security and for authentication when submitting tasks.
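
The requirement that every user has the same UID and GID on all nodes can be checked before the key is distributed. The following is a minimal sketch, assuming the contents of /etc/passwd have already been collected from each node. The function names, node names and sample users are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def parse_passwd(text):
    """Map user name -> (uid, gid) from /etc/passwd-formatted text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed entries
        users[fields[0]] = (int(fields[2]), int(fields[3]))
    return users


def find_id_mismatches(passwd_by_node):
    """Return {user: {node: (uid, gid)}} for users whose IDs differ between nodes."""
    seen = defaultdict(dict)
    for node, text in passwd_by_node.items():
        for user, ids in parse_passwd(text).items():
            seen[user][node] = ids
    return {
        user: dict(per_node)
        for user, per_node in seen.items()
        if len(set(per_node.values())) > 1
    }


# Hypothetical sample data standing in for files gathered from two nodes.
SAMPLE = {
    "master": "alice:x:1001:1001::/home/alice:/bin/bash\n"
              "bob:x:1002:1002::/home/bob:/bin/bash\n",
    "node01": "alice:x:1001:1001::/home/alice:/bin/bash\n"
              "bob:x:1005:1002::/home/bob:/bin/bash\n",
}

if __name__ == "__main__":
    for user, per_node in find_id_mismatches(SAMPLE).items():
        print(user, per_node)
```

In this sample only bob is reported, because his UID differs between the two nodes. The sketch does not flag users that are missing from a node entirely; that check would have to be added separately.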


Setting up a local database for storing users and policies

MariaDB is used as the database for storing user data. During the initial installation, tables should be created for storing data about the users, their properties, the groups they belong to, and the permissions or policies that apply to them. Then the service that manages all of this, slurmdbd (the Slurm database daemon), is installed. Its main task is to collect data on all user activities from multiple clusters in one location.

Each user's UID is used as the main identifier, which makes it possible to keep detailed statistics for each task that a user runs. Authentication is done through the munge service, which checks the list of users on each node (it reads the contents of the Linux file /etc/passwd), and the information about every user is stored on that basis. How this data is written is governed by several Slurm configuration parameters, the most important of which are:

  • AccountingStorageType - Selects the accounting storage mechanism, which records the steps performed by each task as well as the resources it required,
  • JobCompType - Selects the job completion logging mechanism, which records basic information about each task, such as its name, the user who started it, the allocated nodes and resources, the start time, the completion time and the final status. This record can be extended with additional information when a database (MySQL or MariaDB) is used as the backend.
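
In Slurm, AccountingStorageType and JobCompType are set as key=value lines in slurm.conf. The sketch below embeds a hypothetical excerpt of such a file and reads the two settings back. The values shown (accounting_storage/slurmdbd and jobcomp/mysql) are valid Slurm plugin names, but they are assumptions here and not necessarily the values used on this cluster. The helper name `read_slurm_options` is also hypothetical.

```python
# Hypothetical excerpt of a slurm.conf file; the values are assumptions.
SLURM_CONF = """
# Accounting records are sent to the slurmdbd service
AccountingStorageType=accounting_storage/slurmdbd
# Job completion records are written to a MySQL/MariaDB database
JobCompType=jobcomp/mysql
"""


def read_slurm_options(text):
    """Parse key=value lines, ignoring blank lines and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


if __name__ == "__main__":
    opts = read_slurm_options(SLURM_CONF)
    for name in ("AccountingStorageType", "JobCompType"):
        print(name, "->", opts.get(name, "(not set)"))
```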

The recording of this data is controlled by the slurmctld service (the main control process of the Slurm cluster). Potentially sensitive data that is visible to all users in the process should therefore be properly protected, and access to it authenticated. The same applies when data is sent over a network protocol (TCP, UDP), where protection should be provided along the entire communication channel. The data stored directly in the database is encrypted with an appropriate protocol, which provides protection in case of system abuse.


Database configuration for storing user information

In addition to the standard way of storing data in a text file (the basic way of storing all Slurm data), in our case we use the MariaDB database system. It provides high availability and fault tolerance: in case of hardware or physical problems, the database backups are activated. Newer versions of MySQL and MariaDB use InnoDB as the default storage engine. The most important parameters when configuring InnoDB are the following:

  • innodb_buffer_pool_size - The size in bytes of the buffer pool, the memory area in which InnoDB caches table and index data. The maximum value depends on the processor architecture (32-bit or 64-bit). A higher value reduces the number of disk I/O operations needed when the same table is accessed multiple times. In our case the value is set to 1024MB,
  • innodb_lock_wait_timeout - The time, in seconds, that an InnoDB transaction waits for access to a row locked by another transaction. If a statement (for example an INSERT or UPDATE) waits longer than this, it is canceled with an error and rolled back. In our case the value is set to 900 seconds (i.e. 15 minutes),
  • innodb_page_size - The size of the pages in the InnoDB data files, with a default of 16KB. The pages are organized into segments, and if a row in a table does not fit within a single page, its data is spread over multiple pages. Smaller page sizes are generally recommended for SSD drives for higher performance.
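
The two values stated above can be written to the [mysqld] section of the database server's option file. The following is a minimal sketch that renders such a section using Python's configparser. The function name `render_innodb_config` is hypothetical, and the page size is deliberately left out, on the assumption that it stays at its 16KB default.

```python
import configparser
import io


def render_innodb_config(buffer_pool_size="1024M", lock_wait_timeout=900):
    """Render a [mysqld] option-file section with the InnoDB values from the text."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["mysqld"] = {
        "innodb_buffer_pool_size": buffer_pool_size,        # 1024MB buffer pool
        "innodb_lock_wait_timeout": str(lock_wait_timeout),  # 900 s = 15 minutes
    }
    out = io.StringIO()
    config.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    print(render_innodb_config())
```

The rendered text is only a fragment; where the option file lives and how it is included depends on the distribution, so the path is intentionally not assumed here.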