#!/usr/bin/env python3
"""Flight Recorder Trace Analyzer

This script primarily merges the flight recorder buffers of the individual ranks in a
PyTorch Distributed program into a flattened database format that can be used for further analysis.

However, as part of the merging process, it is necessary to perform some analysis in order to match operators
on one rank with the corresponding operators on other ranks and register them as one 'collective' entry.  During this
process, a significant amount of useful information can already be extracted, such as where the first mismatch occurs
in cases of desync (when not all ranks issue a compatible collective in a particular process group).


Not Yet Implemented
- TODO: tracebacks aren't implemented

Known Issues
- Flight Recorder buffer sequence_id information is not sufficient to match collectives and coalesced collectives
  unless we have the trace data from the beginning of the program.  To enable confident analysis of trace buffers that
  do not start from zero (and to simplify the script's matching logic), we need to add more information to the recorder.
- Currently, the script does not check the 'status' of collectives.  It would be straightforward to look for the first
  non-completed collective and report it.

Usage
python fr_trace.py -d <dump dir containing trace files> [-o <output file>]

- Omitting the optional output file will still yield analysis information to stdout.
- The output file is a pickle of the flat DB, which may change in format in the future.
- This script is versioned so that we can ensure our future changes to flight recorder are backwards compatible.
"""

import pickle
from typing import Optional, Sequence

from tools.flight_recorder.components.builder import build_db
from tools.flight_recorder.components.config_manager import JobConfig
from tools.flight_recorder.components.loader import read_dir
from tools.flight_recorder.components.types import types


def main(args: Optional[Sequence[str]] = None) -> None:
    # Parse command-line arguments (sys.argv is used when args is None).
    config = JobConfig()
    args = config.parse_args(args)
    # Load the dumped trace data along with its version.
    details, version = read_dir(args.prefix, args.trace_dir)
    # Merge the loaded traces into the flat DB.
    db = build_db(details, args, version)
    # Optionally persist the type information and the DB as a single pickled tuple.
    if args.output:
        with open(args.output, "wb") as f:
            pickle.dump((types, db), f)

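
# A minimal sketch (not part of the original script) of reading back the file
# written by main(). The name load_db is hypothetical. It assumes the file
# holds the (types, db) tuple that main() pickles, and that any classes the
# pickle references are importable in the reading process.
# Only unpickle files you trust: pickle can execute arbitrary code on load.
import pickle  # repeated here so the sketch stands alone


def load_db(path):
    """Return the (types, db) tuple stored in a pickle written by main()."""
    with open(path, "rb") as f:
        types_info, db = pickle.load(f)
    return types_info, db
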

if __name__ == "__main__":
    main()