README.md: 18 additions & 18 deletions
@@ -32,7 +32,7 @@ or:
## Supported OS:
The stack allows various combinations of OS. Here is a list of what has been tested. We can't guarantee any of the other combinations.
-| Bastion | Compute |
+| Controller | Compute |
|---------------|--------------|
| OL7 | OL7 |
| OL7 | OL8 |
@@ -41,7 +41,7 @@ The stack allowa various combination of OS. Here is a list of what has been test
| OL8 | OL7 |
| Ubuntu 20.04 | Ubuntu 20.04 |
-When switching to Ubuntu, make sure the username is changed from opc to Ubuntu in the ORM for both the bastion and compute nodes.
+When switching to Ubuntu, make sure the username is changed from opc to ubuntu in the ORM for both the controller and compute nodes.
## How is resizing different from autoscaling?
Autoscaling is the idea of launching new clusters for jobs in the queue.
Resizing a cluster is changing the size of a cluster. In some cases growing your cluster may be a better idea, but be aware that this may lead to capacity errors. Because Oracle Cloud RDMA is non-virtualized, you get much better performance, but it also means that we had to build HPC islands and split our capacity across different network blocks.
@@ -62,7 +62,7 @@ Resizing of HPC cluster with Cluster Network consist of 2 major sub-steps:
## resize.sh usage
-The resize.sh is deployed on the bastion node as part of the HPC cluster Stack deployment. Unreachable nodes have been causing issues. If nodes in the inventory are unreachable, we will not do cluster modification to the cluster unless --remove_unreachable is also specified. That will terminate the unreachable nodes before running the action that was requested (Example Adding a node)
+The resize.sh script is deployed on the controller node as part of the HPC cluster stack deployment. Unreachable nodes have been causing issues: if nodes in the inventory are unreachable, no modification will be made to the cluster unless --remove_unreachable is also specified, which terminates the unreachable nodes before running the requested action (for example, adding a node).
```
/opt/oci-hpc/bin/resize.sh -h
@@ -92,7 +92,7 @@ optional arguments:
OCID of the localhost
--cluster_name CLUSTER_NAME
Name of the cluster to resize. Defaults to the name
-included in the bastion
+included in the controller
--nodes NODES [NODES ...]
List of nodes to delete
--no_reconfigure If present. Does not rerun the playbooks
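# --- Hedged usage sketches (not part of the help output above) ---
# The flag names come from the help text; the add/remove action syntax is an
# assumption, so confirm it with /opt/oci-hpc/bin/resize.sh -h on the controller.

# Grow the default cluster by one node, terminating unreachable nodes first:
/opt/oci-hpc/bin/resize.sh add 1 --remove_unreachable

# Remove a specific node from a named cluster without rerunning the playbooks:
/opt/oci-hpc/bin/resize.sh remove --nodes <NODE_NAME> --cluster_name <CLUSTER_NAME> --no_reconfigure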
@@ -284,14 +284,14 @@ When the cluster is already being destroyed, it will have a file `/opt/oci-hpc/a
## Autoscaling Monitoring
If you selected autoscaling monitoring, you can see which nodes are spinning up and down, as well as running and queued jobs. Everything runs automatically except the import of the dashboard into Grafana, due to a problem in the Grafana API.
-To do it manually, in your browser of choice, navigate to bastionIP:3000. Username and password are admin/admin, you can change those during your first login. Go to Configuration -> Data Sources. Select autoscaling. Enter Password as Monitor1234! and click on 'Save & test'. Now click on the + sign on the left menu bar and select import. Click on Upload JSON file and upload the file the is located at `/opt/oci-hpc/playbooks/roles/autoscaling_mon/files/dashboard.json`. Select autoscaling (MySQL) as your datasource.
+To do it manually, in your browser of choice, navigate to controllerIP:3000. The username and password are admin/admin; you can change those during your first login. Go to Configuration -> Data Sources, select autoscaling, enter Monitor1234! as the password and click on 'Save & test'. Now click on the + sign on the left menu bar and select Import. Click on Upload JSON file and upload the file located at `/opt/oci-hpc/playbooks/roles/autoscaling_mon/files/dashboard.json`. Select autoscaling (MySQL) as your datasource.
You will now see the dashboard.
# LDAP
-If selected bastion host will act as an LDAP server for the cluster. It's strongly recommended to leave default, shared home directory.
-User management can be performed from the bastion using ``` cluster ``` command.
+If selected, the controller host will act as an LDAP server for the cluster. It is strongly recommended to keep the default, shared home directory.
+User management can be performed from the controller using the ``` cluster ``` command.
Example of cluster command to add a new user:
```cluster user add name```
By default, a `privilege` group is created that has access to the NFS and can have sudo access on all nodes (defined at stack creation; this group has ID 9876). The group name can be modified.
@@ -301,21 +301,21 @@ To avoid generating a user-specific key for passwordless ssh between nodes, use
# Shared home folder
-By default, the home folder is NFS shared directory between all nodes from the bastion. You have the possibility to use a FSS to share it as well to keep working if the bastion goes down. You can either create the FSS from the GUI. Be aware that it will get destroyed when you destroy the stack. Or you can pass an existing FSS IP and path. If you share an existing FSS, do not use /home as mountpoint. The stack will take care of creating a $nfsshare/home directory and mounting it at /home after copying all the appropriate files.
+By default, the home folder is an NFS directory shared from the controller between all nodes. You also have the possibility to use an FSS to share it, so that it keeps working if the controller goes down. You can either create the FSS from the GUI (be aware that it will get destroyed when you destroy the stack) or pass an existing FSS IP and path. If you share an existing FSS, do not use /home as the mountpoint. The stack will take care of creating a $nfsshare/home directory and mounting it at /home after copying all the appropriate files.
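As a quick sanity check (a minimal sketch, assuming a standard deployment), you can confirm from any node that /home is indeed served over NFS:
```
# On any compute node: verify that /home is an NFS mount, served either by the
# controller or by the FSS mount target chosen at deployment time.
df -hT /home
mount -t nfs,nfs4 | grep ' /home '
```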
# Deploy within a private subnet
-If "true", this will create a private endpoint in order for Oracle Resource Manager to configure the bastion VM and the future nodes in private subnet(s).
-* If "Use Existing Subnet" is false, Terraform will create 2 private subnets, one for the bastion and one for the compute nodes.
-* If "Use Existing Subnet" is also true, the user must indicate a private subnet for the bastion VM. For the compute nodes, they can reside in another private subnet or the same private subent as the bastion VM.
+If "true", this will create a private endpoint so that Oracle Resource Manager can configure the controller VM and the future nodes in private subnet(s).
+* If "Use Existing Subnet" is false, Terraform will create 2 private subnets, one for the controller and one for the compute nodes.
+* If "Use Existing Subnet" is also true, the user must indicate a private subnet for the controller VM. The compute nodes can reside in another private subnet or in the same private subnet as the controller VM.
-The bastion VM will reside in a private subnet. Therefore, the creation of a "bastion service" (https://docs.oracle.com/en-us/iaas/Content/Bastion/Concepts/bastionoverview.htm), a VPN or FastConnect connection is required. If a public subnet exists in the VCN, adapting the security lists and creating a jump host can also work. Finally, a Peering can also be established betwen the private subnet and another VCN reachable by the user.
+The controller VM will reside in a private subnet. Therefore, the creation of a "Bastion service" (https://docs.oracle.com/en-us/iaas/Content/Bastion/Concepts/bastionoverview.htm), a VPN or a FastConnect connection is required. If a public subnet exists in the VCN, adapting the security lists and creating a jump host can also work. Finally, a peering can also be established between the private subnet and another VCN reachable by the user.
## max_nodes_partition.py usage
-Use the alias "max_nodes" to run the python script max_nodes_partition.py. You can run this script only from bastion.
+Use the alias "max_nodes" to run the python script max_nodes_partition.py. You can run this script only from the controller.
$ max_nodes --> Information about all the partitions and their respective clusters, and maximum number of nodes distributed evenly per partition
@@ -324,13 +324,13 @@ $ max_nodes --include_cluster_names xxx yyy zzz --> where xxx, yyy, zzz are clus
## validation.py usage
-Use the alias "validate" to run the python script validation.py. You can run this script only from bastion.
+Use the alias "validate" to run the python script validation.py. You can run this script only from the controller.
The script performs these checks:
-> Check the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files.
-> PCIe bandwidth check
-> GPU Throttle check
--> Check whether md5 sum of /etc/hosts file on nodes matches that on bastion
+-> Check whether the md5 sum of the /etc/hosts file on the nodes matches that on the controller
Provide at least one argument: [-n NUM_NODES][-p PCIE][-g GPU_THROTTLE][-e ETC_HOSTS]
@@ -343,7 +343,7 @@ Below are some examples for running this script.
validate -n y --> This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. The clusters considered will be the default cluster if any and cluster(s) found in /opt/oci-hpc/autoscaling/clusters directory. The number of nodes considered will be from the resize script using the clusters we got before.
-validate -n y -cn <clusternamefile> --> This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. It will also check whether md5 sum of /etc/hosts file on all nodes matches that on bastion. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
+validate -n y -cn <clusternamefile> --> This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. It will also check whether the md5 sum of the /etc/hosts file on all nodes matches that on the controller. The clusters considered will be from the file specified by the -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
validate -p y -cn <clusternamefile> --> This will run the pcie bandwidth check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
@@ -364,12 +364,12 @@ validate -n y -p y -g y -e y -cn <cluster name file>
## /opt/oci-hpc/scripts/collect_logs.py
This is a script to collect the nvidia bug report, sosreport, and console history logs.
-The script needs to be run from the bastion. In the case where the host is not ssh-able, it will get only console history logs for the same.
+The script needs to be run from the controller. If the host is not ssh-able, it will only collect the console history logs for that host.
It requires the below argument.
--hostname <HOSTNAME>
-And --compartment-id <COMPARTMENT_ID> is optional (i.e. assumption is the host is in the same compartment as the bastion).
+And --compartment-id <COMPARTMENT_ID> is optional (i.e. the assumption is that the host is in the same compartment as the controller).
Where HOSTNAME is the node name for which you need the above logs and COMPARTMENT_ID is the OCID of the compartment where the node is.
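For reference, a minimal invocation could look like the sketch below (same placeholders as above; depending on the deployment you may need to prefix the script with `python3`):
```
# Run from the controller; collects the logs for a single node.
/opt/oci-hpc/scripts/collect_logs.py --hostname <HOSTNAME>
# If the node is in a different compartment than the controller:
/opt/oci-hpc/scripts/collect_logs.py --hostname <HOSTNAME> --compartment-id <COMPARTMENT_ID>
```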