Installing and configuring the ray module on a server

ray official website: Welcome to the Ray documentation (Ray 2.3.1)

Some python modules depend on ray, and they are painful to use if ray is not installed and configured properly. Using ray on a server also differs from using it on a Windows system.

1. Creating a virtual environment with conda

If the python path inside the newly created virtual environment does not match the python that the system finds first (for example, when several virtual environments exist), always call pip and ray by their absolute paths in all later commands.

# Create and activate a virtual environment named ray
conda create -c conda-forge python=3.8 -n ray
conda activate ray
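
To check whether the paths match, a small script along the following lines may help. It is only a sketch that uses the standard library; the function name `describe_environment` is made up for this example.

```python
import shutil
import sys


def describe_environment():
    """Report which interpreter is running and where pip and ray resolve on PATH."""
    return {
        "python": sys.executable,    # absolute path of the running interpreter
        "pip": shutil.which("pip"),  # None if pip is not on PATH
        "ray": shutil.which("ray"),  # None if the ray CLI is not on PATH
    }


if __name__ == "__main__":
    for name, path in describe_environment().items():
        print(f"{name}: {path}")
```

If the printed paths do not point into the directory of the ray environment, call the tools by absolute path. Running pip as `<absolute path to python> -m pip install ...` also ensures that packages are installed for that interpreter.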

2. Installation of ray

Besides its core features, ray has optional dependency sets that act as extensions, and these need to be installed as well. For details, refer to Ray Default in the official documentation:

# There are other extras besides these five, but these are enough
pip install -U "ray[default]"
pip install -U "ray[air]"
pip install -U "ray[tune]"
pip install -U "ray[rllib]"
pip install -U "ray[serve]"
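
After installing, it can be useful to confirm from python which version actually ended up in the environment. The following is a minimal sketch based on the standard library module `importlib.metadata` (available from python 3.8); the helper name `installed_version` is an assumption of this example, not part of ray.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("ray version:", installed_version("ray"))
```

A result of `None` for `ray` usually means that pip installed into a different environment than the one python is running from.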

3. ray start

Unlike on Windows, running ray on a server requires you to create a ray cluster first and then initialize (connect to) it from python. If you only need to run commands on a single server, with no communication between servers or between a remote server and a local computer, the simplest command below is enough:

# Create a ray cluster with 30 CPUs
ray start --head --num-cpus=30

# Logs
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run .

Local node IP: 10.11.11.179

2023-04-16 01:49:35,305 ERROR :1169 -- Failed to start the dashboard
2023-04-16 01:49:35,306 ERROR :1194 -- Error should be written to '' or ''. We are printin.
2023-04-16 01:49:35,307 ERROR :1238 --
The last 20 lines of /tmp/ray/session_2023-04-16_01-48-54_707151_109593/logs/ (it contains the error message from
2023-04-16 01:49:32,933 INFO :239 -- Starting dashboard metrics server on port 44227
2023-04-16 01:49:32,941 INFO :112 -- Get all modules by type: DashboardHeadModule

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.11.11.179:6379'

  Alternatively, use the following Python code:
    import ray
    (address='auto')

  To see the status of the cluster, use
    ray status

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop
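
If the cluster is started from a script instead of being typed by hand, the same command can be assembled in python. This is a sketch only: the helper names `head_start_command` and `start_head` are invented for this example, and falling back to `os.cpu_count()` when no CPU count is given is my own assumption about a sensible default.

```python
import os
import subprocess


def head_start_command(num_cpus=None):
    """Build the argument list for `ray start --head`.

    If num_cpus is None, fall back to the number of CPUs the OS reports.
    """
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    return ["ray", "start", "--head", f"--num-cpus={num_cpus}"]


def start_head(num_cpus=None):
    """Run the command and raise CalledProcessError if ray exits with an error."""
    subprocess.run(head_start_command(num_cpus), check=True)
```

Calling `start_head(30)` runs the same command as the one shown in this section.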

4. Head nodes and worker nodes

If servers need to communicate with each other, or a remote server needs to communicate with a local computer, things get a bit more complicated. Configuring a ray cluster then involves two concepts: head nodes and worker nodes.

The head node is the central node of the Ray cluster. It coordinates task execution and resource management, and is responsible for:

Assigning tasks: the head node assigns tasks to worker nodes so that they can execute them.
Managing resources: the head node manages the resources in the cluster, such as CPUs, memory, and GPUs, so that tasks run correctly and use the appropriate resources.
Tracking task status: the head node tracks the execution status of each task and returns the result to the caller when the task is completed.

The worker nodes are the compute nodes of the Ray cluster. They execute tasks and return the results to the head node, and are responsible for:

Receiving tasks: a worker node receives the tasks assigned to it by the head node.
Executing tasks: the worker node runs the code of each task and returns the result to the head node.
Releasing resources: the worker node releases the resources it used after completing a task so that other tasks can use them.

In a nutshell, head nodes and worker nodes differ in their responsibilities. The head node is the central node of the cluster and coordinates and manages task execution, while the worker nodes are the compute nodes that actually execute the tasks. In a Ray cluster, the head node and the worker nodes communicate with each other to make sure that tasks are executed correctly and results are returned.

PS: Personally, I think of it as similar to the relationship between a parent process and its child processes, or between the base environment and a virtual environment.
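
From the python side, this division of labour looks roughly as follows: the script submits tasks, the cluster runs them on whichever nodes have free resources, and the results come back to the caller. This is a minimal sketch that assumes ray is installed and a cluster is already running; `square` and `square_on_cluster` are names made up for this example.

```python
def square(x):
    """A plain Python function that will be executed as a Ray task."""
    return x * x


def square_on_cluster(values, address="auto"):
    """Submit one task per value to a running cluster and collect the results."""
    import ray  # imported here so that square() stays usable without ray

    ray.init(address=address)  # connect to the cluster started with `ray start`
    try:
        remote_square = ray.remote(square)  # wrap the function as a remote task
        # Scheduling returns immediately with one object reference per task.
        futures = [remote_square.remote(v) for v in values]
        # Block until every result is available, in submission order.
        return ray.get(futures)
    finally:
        ray.shutdown()  # disconnect so that a later ray.init() does not fail
```

For example, `square_on_cluster([1, 2, 3])` submits three tasks and gathers their results into one list.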

Specific Steps:

Look up the server IP address

ifconfig

Create the head node

ray start --head --port=6379 --num-cpus=<number_of_cpus> --redis-password=<password>
# --port=0: random port
# --port=6379: default port

Connect a worker node to the head node

ray start --address=<address_of_head_node>:<port_of_head_node> --num-cpus=<number_of_cpus>
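
If a worker node cannot join, it is worth checking first whether the port of the head node is reachable at all, since ray itself suggests checking firewall and network settings when a connection fails. The sketch below only uses the standard library; the function name `can_reach` is hypothetical, and it tests plain TCP reachability, not whether a ray process is actually answering.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

On a worker node, `can_reach("<address_of_head_node>", 6379)` should return `True` before `ray start --address=...` has a chance of succeeding.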

5. Running commands on the ray cluster from python

import ray

# Initialize ray (connect to the running cluster)
ray.init(address='auto')

# Some specified functions

# Disconnect from the cluster when finished
ray.shutdown()

Note: In python, always finish with ray.shutdown(); otherwise the next time you run ray.init(), you will get an error!
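
One way to make sure the connection is always closed, even when the code in between raises an exception, is to wrap the two calls in a context manager. This is a sketch under the assumption that ray is installed and a cluster is running; `ray_session` is a helper invented for this example, not something ray provides.

```python
from contextlib import contextmanager


@contextmanager
def ray_session(address="auto"):
    """Connect to a running cluster and always disconnect on exit."""
    import ray  # imported here so that the helper can be defined without ray

    ray.init(address=address)
    try:
        yield ray
    finally:
        ray.shutdown()  # runs even if the body of the `with` block raises
```

It would be used as `with ray_session() as ray: ...`, with the ray calls placed inside the block.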

6. Other commands

ray dashboard

# Running this command as shown reports the error below (the CLUSTER_CONFIG_FILE argument is missing); the server also has no GUI for a visualization window, but this does not affect use
'''
Usage: ray dashboard [OPTIONS] CLUSTER_CONFIG_FILE
Try 'ray dashboard --help' for help.

Error: Missing argument 'CLUSTER_CONFIG_FILE'.
'''

ray status
# Logs
'''
======== Autoscaler status: 2023-04-16 04:04:48.643492 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_36f47b4427ed06ce849863e323f684649dce7aa5c1ad7d3be38416aa
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/30.0 CPU
 0.00/595.632 GiB memory
 0.00/186.265 GiB object_store_memory

Demands:
 (no resource demands)
'''

# Shut down the ray cluster
ray stop
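
The output of `ray status` can also be read from a script, for example to wait until the CPUs are free. The following sketch parses the CPU line in the format shown in the log above. The names `parse_cpu_usage` and `current_cpu_usage` are made up for this example, and the output format may differ in other ray versions, so treat the pattern as an assumption.

```python
import re
import subprocess

# Matches usage lines such as " 0.0/30.0 CPU" in the output of `ray status`.
_CPU_LINE = re.compile(r"^\s*([\d.]+)/([\d.]+) CPU\s*$", re.MULTILINE)


def parse_cpu_usage(status_text):
    """Return (used, total) CPUs from `ray status` output, or None if not found."""
    match = _CPU_LINE.search(status_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def current_cpu_usage():
    """Run `ray status` and parse its output; raises if the command fails."""
    result = subprocess.run(
        ["ray", "status"], capture_output=True, text=True, check=True
    )
    return parse_cpu_usage(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to check the pattern against a saved log without a running cluster.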