# Testing Distributed Runtime in TensorFlow
This folder contains tools and test suites for the GRPC-based distributed
runtime in TensorFlow.

There are three general modes of testing:

**1) Launch a docker container and run parameter servers and workers as
separate processes therein.**

For example:

    ./local_test.sh

By default, local_test.sh runs the MNIST-with-replicas model as a test.
However, you can use the --model_name flag to run the tf-learn wide & deep
census model:

    ./local_test.sh --model_name CENSUS_WIDENDEEP

**2) Launch a remote k8s cluster on Google Kubernetes Engine (GKE) and run the
test suite on it**

For example:

    export TF_DIST_GCLOUD_PROJECT="tensorflow-testing"
    export TF_DIST_GCLOUD_COMPUTE_ZONE="us-central1-f"
    export TF_DIST_CONTAINER_CLUSTER="test-cluster-1"
    export TF_DIST_GCLOUD_KEY_FILE="/var/gcloud-secrets/my-gcloud-key.json"
    ./remote_test.sh

Here you specify the Google Compute Engine (GCE) project, compute zone and
container cluster with the first three environment variables, in that order.
The fourth environment variable, "TF_DIST_GCLOUD_KEY_FILE", points to the JSON
service account key file used to authenticate with the project.
You can use the flag "--setup_cluster_only" to perform only the cluster setup
step and skip the testing step:

    ./remote_test.sh --setup_cluster_only

**3) Run the test suite on an existing k8s TensorFlow cluster**

For example:

    export TF_DIST_GRPC_SERVER_URL="grpc://11.22.33.44:2222"
    ./remote_test.sh

The IP address above is a dummy example. Such a cluster may have been set up
using the command described at the end of the previous section.
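
For reference, the URL in TF_DIST_GRPC_SERVER_URL is an ordinary gRPC target
that a TensorFlow client session can connect to directly. Below is a minimal
sketch, assuming the TF 1.x Python API and the dummy address above:

    import tensorflow as tf

    # Dummy address from the example above; substitute your cluster's
    # load balancer or worker address.
    server_url = "grpc://11.22.33.44:2222"

    # Build a trivial graph and run it on the remote worker rather than
    # in-process.
    c = tf.constant("Hello from the distributed runtime")
    with tf.Session(server_url) as sess:
        print(sess.run(c))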


**Asynchronous and synchronous parameter updates**

There are two modes for coordinating the parameter updates from multiple
workers: asynchronous and synchronous.

In the asynchronous mode, the parameter updates (gradients) from the workers
are applied to the parameters without any explicit coordination. This is the
default mode in the tests.

In the synchronous mode, a specified number of parameter updates are aggregated
across the model replicas before a single combined update is applied to the
model parameters. To use this mode, do:

    # For remote testing
    ./remote_test.sh --sync_replicas

    # For local testing
    ./local_test.sh --sync_replicas
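
In the TF 1.x Python API, this kind of aggregation is typically expressed by
wrapping the worker optimizer in tf.train.SyncReplicasOptimizer. The following
is only a minimal sketch of that pattern with a hypothetical toy model; the
actual test models also handle chief-worker setup and initialization:

    import tensorflow as tf

    # Hypothetical toy model for illustration only.
    num_workers = 4
    x = tf.placeholder(tf.float32, shape=[None, 1])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    w = tf.Variable(tf.zeros([1, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    global_step = tf.Variable(0, trainable=False, name="global_step")

    base_opt = tf.train.GradientDescentOptimizer(0.01)

    # Aggregate gradients from all replicas before applying a single
    # combined update to the parameters.
    sync_opt = tf.train.SyncReplicasOptimizer(
        base_opt,
        replicas_to_aggregate=num_workers,
        total_num_replicas=num_workers)

    train_op = sync_opt.minimize(loss, global_step=global_step)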


**Specifying the number of workers**

You can specify the number of workers by using the --num_workers flag,
e.g.,

    # For remote testing
    ./remote_test.sh --num_workers 4

    # For local testing
    ./local_test.sh --num_workers 4


**Building the GRPC server Docker image**

To build the Docker image for a test server of the TensorFlow distributed
runtime, run:

    ./build_server.sh <docker_image_name>

**Using the GRPC server Docker image**

To launch a container as a TensorFlow GRPC server, run a command such as the
following example:

    docker run tensorflow/tf_grpc_server --cluster_spec="worker|localhost:2222;foo:2222,ps|bar:2222;qux:2222" --job_name=worker --task_id=0

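In the --cluster_spec string above, "," separates jobs, "|" separates a job
name from its task addresses, and ";" separates the addresses within a job.
This is the same information a tf.train.ClusterSpec carries in the Python API;
here is a minimal sketch of the equivalent setup (the exact parsing in this
repo's server/ code may differ):

    import tensorflow as tf

    # The cluster spec string in the example above describes two jobs:
    #   worker -> localhost:2222, foo:2222
    #   ps     -> bar:2222, qux:2222
    cluster = tf.train.ClusterSpec({
        "worker": ["localhost:2222", "foo:2222"],
        "ps": ["bar:2222", "qux:2222"],
    })

    # Start the gRPC server for task 0 of the "worker" job, mirroring
    # --job_name=worker --task_id=0 in the docker command above.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)
    server.join()  # Block and serve until the process is terminated.
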
**Generating configuration files for TensorFlow k8s clusters**

The script at "scripts/k8s_tensorflow.py" can be used to generate yaml
configuration files for a TensorFlow k8s cluster consisting of a number of
workers and parameter servers. For example:

    scripts/k8s_tensorflow.py \
        --num_workers 2 \
        --num_parameter_servers 2 \
        --grpc_port 2222 \
        --request_load_balancer true \
        --docker_image "tensorflow/tf_grpc_server" \
        > tf-k8s-with-lb.yaml

The yaml configuration file generated in the previous step can be used to
create a k8s cluster running the specified numbers of worker and parameter
servers. For example:

    kubectl create -f tf-k8s-with-lb.yaml

See [Kubernetes kubectl documentation](http://kubernetes.io/docs/user-guide/kubectl-overview/)
for more details.