:orphan:

.. _multiprocessing-doc:

Multiprocessing package - torch.multiprocessing
===============================================

.. automodule:: torch.multiprocessing
.. currentmodule:: torch.multiprocessing

.. warning::

    If the main process exits abruptly (e.g. because of an incoming signal),
    Python's ``multiprocessing`` sometimes fails to clean up its children.
    It's a known caveat, so if you're seeing any resource leaks after
    interrupting the interpreter, it probably means that this has just happened
    to you.

Strategy management
-------------------

.. autofunction:: get_all_sharing_strategies
.. autofunction:: get_sharing_strategy
.. autofunction:: set_sharing_strategy


.. _multiprocessing-cuda-sharing-details:

Sharing CUDA tensors
--------------------

Sharing CUDA tensors between processes is supported only in Python 3, using
a ``spawn`` or ``forkserver`` start method.

Unlike CPU tensors, the sending process is required to keep the original tensor
for as long as the receiving process retains a copy of it. The reference
counting is implemented under the hood, but it requires users to follow the
best practices below; a combined sketch follows the list.

.. warning::

    If the consumer process dies abnormally due to a fatal signal, the shared
    tensor could be kept in memory forever for as long as the sending process
    is running.

1. Release memory as soon as possible in the consumer.

::

    ## Good
    x = queue.get()
    # do something with x
    del x

::

    ## Bad
    x = queue.get()
    # do something with x
    # do everything else (the producer has to keep x in memory)

2. Keep the producer process running until all consumers exit. This prevents
the producer from releasing memory that is still in use by a consumer.

::

    ## producer
    # send tensors, do something
    event.wait()

::

    ## consumer
    # receive tensors and use them
    event.set()

3. Don't pass received tensors on to other processes.

::

    # not going to work
    x = queue.get()
    queue_2.put(x)

::

    # you need to create a process-local copy
    x = queue.get()
    x_clone = x.clone()
    queue_2.put(x_clone)

::

    # putting and getting from the same queue in the same process
    # will likely end up with a segfault
    queue.put(tensor)
    x = queue.get()
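
Putting the practices above together, a minimal producer/consumer sketch might
look as follows. This is only an illustration, not a canonical recipe: the
``consumer`` and ``main`` helpers, the tensor shape, and the use of an
``Event`` for shutdown are assumptions, and a CUDA-capable machine is required.

::

    import torch
    import torch.multiprocessing as mp


    def consumer(queue, event):
        # Receive the shared CUDA tensor and use it.
        x = queue.get()
        result = (x * 2).sum().item()
        print("consumer result:", result)
        # 1. Release the shared tensor as soon as it's no longer needed.
        del x
        # Tell the producer that it's now safe to exit.
        event.set()


    def main():
        queue = mp.Queue()
        event = mp.Event()
        process = mp.Process(target=consumer, args=(queue, event))
        process.start()

        x = torch.ones(4, device="cuda")
        queue.put(x)
        # 2. Keep the producer alive (and x referenced) until the consumer
        # signals that it is done with the tensor.
        event.wait()
        process.join()


    if __name__ == "__main__":
        # CUDA tensors can only be shared with the "spawn" or "forkserver"
        # start methods.
        mp.set_start_method("spawn")
        main()

The ``event.wait()`` / ``event.set()`` pair mirrors practice 2: the producer
keeps running (and keeps ``x`` alive) until the consumer has finished using it.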


Sharing strategies
------------------

This section provides a brief overview of how different sharing strategies
work. Note that it applies only to CPU tensors - CUDA tensors will always use
the CUDA API, as that's the only way they can be shared.

File descriptor - ``file_descriptor``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

    This is the default strategy (except for macOS, where it's not supported).

This strategy uses file descriptors as shared memory handles. Whenever a
storage is moved to shared memory, a file descriptor obtained from ``shm_open``
is cached with the object, and when it's going to be sent to other processes,
the file descriptor will be transferred (e.g. via UNIX sockets) to them. The
receiver will also cache the file descriptor and ``mmap`` it, to obtain a
shared view onto the storage data.

Note that if a lot of tensors are shared, this strategy will keep a large
number of file descriptors open most of the time. If your system has low
limits for the number of open file descriptors and you can't raise them, you
should use the ``file_system`` strategy.

File system - ``file_system``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This strategy uses file names given to ``shm_open`` to identify the shared
memory regions. This has the benefit of not requiring the implementation to
cache the file descriptors obtained from it, but at the same time it is prone
to shared memory leaks. The file can't be deleted right after its creation,
because other processes need to access it to open their views. If the
processes fatally crash, or are killed, and don't call the storage
destructors, the files will remain in the system. This is very serious,
because they keep using up memory until the system is restarted or they're
freed manually.

To counter the problem of shared memory file leaks, :mod:`torch.multiprocessing`
will spawn a daemon named ``torch_shm_manager`` that will isolate itself from
the current process group, and will keep track of all shared memory allocations.
Once all processes connected to it exit, it will wait a moment to ensure there
will be no new connections, and will iterate over all shared memory files
allocated by the group. If it finds that any of them still exist, they will be
deallocated. We've tested this method and it proved to be robust to various
failures. Still, if your system has high enough limits, and ``file_descriptor``
is a supported strategy, we do not recommend switching to this one.

Spawning subprocesses
---------------------

.. note::

    Available for Python >= 3.4.

    This depends on the ``spawn`` start method in Python's
    ``multiprocessing`` package.

Spawning a number of subprocesses to perform some function can be done
by creating ``Process`` instances and calling ``join`` to wait for
their completion. This approach works fine when dealing with a single
subprocess but presents potential issues when dealing with multiple
processes.

Namely, joining processes sequentially implies they will terminate
sequentially. If they don't, and the first process does not terminate,
the termination of the remaining processes will go unnoticed. Also, there
are no native facilities for error propagation.

The ``spawn`` function below addresses these concerns: it takes care of
error propagation and out-of-order termination, and will actively
terminate processes upon detecting an error in one of them.

.. automodule:: torch.multiprocessing.spawn
.. currentmodule:: torch.multiprocessing.spawn

.. autofunction:: spawn

.. currentmodule:: torch.multiprocessing


.. class:: SpawnContext

    Returned by :func:`~spawn` when called with ``join=False``.

    .. automethod:: join


.. This module needs to be documented. Adding here in the meantime
.. for tracking purposes
.. py:module:: torch.multiprocessing.pool
.. py:module:: torch.multiprocessing.queue
.. py:module:: torch.multiprocessing.reductions
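
For reference, here is a minimal usage sketch of ``torch.multiprocessing.spawn``.
The ``train`` function and the process count are illustrative placeholders, not
part of the API.

::

    import torch.multiprocessing as mp


    def train(rank, world_size):
        # `rank` is supplied by spawn as the first argument; the remaining
        # arguments come from `args`.
        print("running process", rank, "of", world_size)


    if __name__ == "__main__":
        # Start 4 processes running `train`; an error raised in any of them is
        # propagated, and the remaining processes are terminated.
        mp.spawn(train, args=(4,), nprocs=4, join=True)

With ``join=False`` the call would instead return a ``SpawnContext`` whose
``join`` method can be called later.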