Profile Data Collection

Principles

This section describes the Profiling APIs. Three Profiling methods are provided:

  • Collecting and flushing profile data

    Write the collected profile data to a file, use the profiling tool to parse the file (see "Offline Parsing" in Profiling Instructions), and display the profile data.

    The following two API call modes are available:
    • Call the following APIs: acl.prof.init, acl.prof.start, acl.prof.stop, and acl.prof.finalize. You can obtain the time taken to execute AI Core operators, AI Core metrics, and other information. Currently, the preceding APIs perform process-level control. That is, if the APIs are called in any thread in the process, the calls also take effect in other threads in the same process.

      These APIs can be called repeatedly in a process, allowing for varied Profiling configurations with each call.

    • Call acl.init. During initialization, the Profiling configuration is passed as a JSON configuration file. You can obtain the time taken to execute AI Core operators, AI Core metrics, and other information.

      acl.init can be called only once per process. To modify the Profiling configuration, modify the JSON configuration file. For details, see the description of the acl.init API.

  • Using msproftx extension APIs to collect and flush profile data

    When you need to locate the performance bottleneck of your app or the upper-layer framework program, call msproftx extension APIs during the profiling process (between the acl.prof.start and acl.prof.stop calls). msproftx is used to record the time span of specific events during app running and write data to a profile data file. You can use the profiling tool to parse the file and export the profile data.

    For details about how to parse and export data using the profiling tool, see "Offline Parsing" in Profiling Instructions.

    API calling: acl.prof.create_stamp, acl.prof.push, acl.prof.pop, acl.prof.range_start, acl.prof.range_stop, and acl.prof.destroy_stamp are called between acl.prof.start and acl.prof.stop. These calls mark events that occur at specific times during app running and record the time span of each event.

    In a process, these APIs can be called multiple times as required.

  • Subscription to operator information

    The collected profile data is parsed and written to the pipeline. You then read the data from the pipeline into memory and call the APIs to obtain the profile data.

    API calling: acl.prof.model_subscribe, acl.prof.get*, and acl.prof.model_unsubscribe. The profile data of operators in the model can be obtained, including the operator name, operator type name, and operator execution time.

Collecting and Flushing Profile Data

Add an exception handling branch following the API calls. The following is a code snippet of key steps only, which is not ready to use.

For details about the allocation and deallocation of runtime resources, see Runtime Resource Allocation and Runtime Resource Deallocation. For details about the API call sequence for model loading, see API Call Sequence. For details about the API call sequence for model inference and input/output data preparation, see Preparing Input/Output Data Structure for Model Execution.

import acl
import numpy as np
# ......

# 1. Allocate runtime resources.
# ......

# 2. Load a model. After the model is successfully loaded, model_id that identifies the model is returned.
# ......

# 3. Create data of type aclmdlDataset to describe the inputs and outputs of the model.
# ......

# 4. Initialize Profiling.
# Set the data flush path.
PROF_INIT_PATH='...'
ret = acl.prof.init(PROF_INIT_PATH)

# 5. Set Profiling configurations.
device_list = [0]
ACL_PROF_ACL_API = 0x0001
ACL_PROF_TASK_TIME = 0x0002
ACL_PROF_AICORE_METRICS = 0x0004
ACL_PROF_AICPU_TRACE = 0x0008
ACL_PROF_SYS_HARDWARE_MEM_FREQ = 3

# Create the profiling configuration. Only the flags defined above are combined here;
# OR in further data type flags as described for the acl.prof.create_config API.
prof_config = acl.prof.create_config(device_list, 0, 0, ACL_PROF_ACL_API | ACL_PROF_TASK_TIME | ACL_PROF_AICORE_METRICS | ACL_PROF_AICPU_TRACE)
mem_freq = "15"
ret = acl.prof.set_config(ACL_PROF_SYS_HARDWARE_MEM_FREQ, mem_freq)
ret = acl.prof.start(prof_config)

# 6. Execute the model.
ret = acl.mdl.execute(model_id, input, output)

# 7. Process the model inference result.
# ......

# 8. Destroy allocations such as the model inputs and outputs, free memory, and unload the model.
# ......

# 9. Stop Profiling and destroy the configuration and related resources.
ret = acl.prof.stop(prof_config)
ret = acl.prof.destroy_config(prof_config)
ret = acl.prof.finalize()

# 10. Deallocate runtime resources.
# ......
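
The exception handling branch recommended above can be factored into a small helper. The sketch below assumes that a return value of 0 means success, as the assert ret == 0 checks in the msproftx examples in this section suggest. AclError and check_ret are hypothetical names and are not part of pyACL.

```python
class AclError(RuntimeError):
    """Raised when a call returns a non-zero status code (hypothetical helper)."""

    def __init__(self, api_name, ret):
        super().__init__(f"{api_name} failed with ret={ret}")
        self.api_name = api_name
        self.ret = ret


def check_ret(api_name, ret):
    """Raise AclError unless ret is 0 (assumed here to be the success code)."""
    if ret != 0:
        raise AclError(api_name, ret)


# Usage with literal values standing in for real return codes.
check_ret("acl.prof.init", 0)            # success: no exception is raised
try:
    check_ret("acl.prof.start", 1)       # any non-zero code is treated as a failure
except AclError as err:
    message = str(err)
```

In real code, each call would be wrapped in the same way, for example check_ret("acl.prof.init", acl.prof.init(PROF_INIT_PATH)), so that a failure stops the sequence before later steps run.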

Using msproftx Extension APIs to Collect and Flush Profile Data

Add an exception handling branch following the API calls. The following is a code snippet of key steps only, which is not ready to use.

For details about the allocation and deallocation of runtime resources, see Runtime Resource Allocation and Runtime Resource Deallocation. For details about the API call sequence for model loading, see API Call Sequence. For details about the API call sequence for model inference and input/output data preparation, see Preparing Input/Output Data Structure for Model Execution.

Example 1 (acl.prof.mark):

# Collection items: mac_fp16_ratio, mac_int8_ratio, vec_fp32_ratio, vec_fp16_ratio,
# vec_int32_ratio, and vec_misc_ratio.
ACL_AICORE_ARITHMETIC_UTILIZATION = 0
# profiling data type config
ACL_PROF_ACL_API = 0x00000001
ACL_PROF_TASK_TIME = 0x00000002
ACL_PROF_MSPROFTX = 0x00000080
# profiling config type
ACL_PROF_SYS_HARDWARE_MEM_FREQ = 3

# Perform initialization.

# Allocate runtime resources.

# Initialize profiling and set the data flush path.
prof_path = "..."
ret = acl.prof.init(prof_path)
assert ret == 0
device_list = [0]
prof_config = acl.prof.create_config(device_list, ACL_AICORE_ARITHMETIC_UTILIZATION, 0, 
    ACL_PROF_ACL_API | ACL_PROF_TASK_TIME | ACL_PROF_MSPROFTX)
assert prof_config != 0
mem_freq = "15"
ret = acl.prof.set_config(ACL_PROF_SYS_HARDWARE_MEM_FREQ, mem_freq)
assert ret == 0
ret = acl.prof.start(prof_config)
assert ret == 0

# Load a model. After the model is successfully loaded, model_id that identifies the model is returned.
stamp = acl.prof.create_stamp()
assert stamp != 0
load_msg = "model_load_mark"
ret = acl.prof.set_stamp_trace_message(stamp, load_msg, len(load_msg))
assert ret == 0
ret = acl.prof.mark(stamp)  # Mark the model loading event.
assert ret == 0
acl.prof.destroy_stamp(stamp)

# Create data of type aclmdlDataset to describe the inputs and outputs of the model.

# Execute the model.
stamp = acl.prof.create_stamp()
assert stamp != 0
exec_msg = "model_exec_mark"
ret = acl.prof.set_stamp_trace_message(stamp, exec_msg, len(exec_msg))
assert ret == 0
ret = acl.prof.mark(stamp)  # Mark the model execution event.
assert ret == 0
acl.prof.destroy_stamp(stamp)
ret = acl.mdl.execute(model_id, dataset_input, dataset_output)
assert ret == 0

ret = acl.prof.stop(prof_config)
assert ret == 0
ret = acl.prof.finalize()
assert ret == 0
ret = acl.prof.destroy_config(prof_config)
assert ret == 0

# Deallocate runtime resources.

# Perform deinitialization.
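
Example 1 repeats the same four calls for every mark. One way to avoid leaking a stamp when a step fails is to wrap them in a helper that always destroys the stamp. The following is a minimal sketch under stated assumptions: mark_event and FakeProf are hypothetical names, prof stands for the acl.prof module, and FakeProf only records calls so that the sketch runs without a device.

```python
def mark_event(prof, message):
    """Record one instantaneous event and always destroy the stamp afterwards."""
    stamp = prof.create_stamp()
    if stamp == 0:
        raise RuntimeError("create_stamp returned a null stamp")
    try:
        ret = prof.set_stamp_trace_message(stamp, message, len(message))
        if ret != 0:
            raise RuntimeError(f"set_stamp_trace_message failed, ret={ret}")
        ret = prof.mark(stamp)
        if ret != 0:
            raise RuntimeError(f"mark failed, ret={ret}")
    finally:
        # Runs on success and on failure, so the stamp is never leaked.
        prof.destroy_stamp(stamp)


class FakeProf:
    """Stand-in for acl.prof that records calls (not part of pyACL)."""

    def __init__(self):
        self.calls = []

    def create_stamp(self):
        self.calls.append("create_stamp")
        return 1

    def set_stamp_trace_message(self, stamp, msg, msg_len):
        self.calls.append(("set_stamp_trace_message", msg, msg_len))
        return 0

    def mark(self, stamp):
        self.calls.append("mark")
        return 0

    def destroy_stamp(self, stamp):
        self.calls.append("destroy_stamp")


fake = FakeProf()
mark_event(fake, "model_load_mark")
```

With the real module, the call would be mark_event(acl.prof, "model_load_mark"), placed between acl.prof.start and acl.prof.stop.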

Example 2 (acl.prof.mark_ex, with marks placed before and after model execution):

# Collection items: mac_fp16_ratio, mac_int8_ratio, vec_fp32_ratio, vec_fp16_ratio,
# vec_int32_ratio, and vec_misc_ratio.
ACL_AICORE_ARITHMETIC_UTILIZATION = 0
# Collect the profile data output by the user and upper-layer framework program.
ACL_PROF_MSPROFTX = 0x00000080

# Perform initialization.

# Allocate runtime resources.

prof_path = "..."
ret = acl.prof.init(prof_path)
assert ret == 0
device_list = [0]
prof_config = acl.prof.create_config(device_list, 
    ACL_AICORE_ARITHMETIC_UTILIZATION, 0, ACL_PROF_MSPROFTX)
assert prof_config != 0
ret = acl.prof.start(prof_config)
assert ret == 0
ret = acl.prof.mark_ex("model execute start", stream)
assert ret == 0

# Execute the model.

ret = acl.prof.mark_ex("model execute stop", stream)
assert ret == 0

ret = acl.prof.stop(prof_config)
assert ret == 0
ret = acl.prof.finalize()
assert ret == 0
ret = acl.prof.destroy_config(prof_config)
assert ret == 0

# Deallocate runtime resources.

# Perform deinitialization.
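
The examples in this section all bracket the profiled work with the same setup and teardown calls. A context manager is one way to make sure acl.prof.stop, acl.prof.finalize, and acl.prof.destroy_config still run when the profiled code raises. The following is a minimal sketch under stated assumptions: profiling_session and FakeProf are hypothetical names, prof stands for the acl.prof module, and the output directory name is a placeholder. FakeProf only records calls so that the sketch runs without a device.

```python
from contextlib import contextmanager


@contextmanager
def profiling_session(prof, prof_path, device_list, aicore_metrics, data_type_config):
    """Run init/create_config/start on entry and stop/finalize/destroy_config on exit."""
    if prof.init(prof_path) != 0:
        raise RuntimeError("acl.prof.init failed")
    config = 0
    started = False
    try:
        # The third argument is passed as 0, as in the examples in this section.
        config = prof.create_config(device_list, aicore_metrics, 0, data_type_config)
        if config == 0:
            raise RuntimeError("acl.prof.create_config failed")
        if prof.start(config) != 0:
            raise RuntimeError("acl.prof.start failed")
        started = True
        yield config
    finally:
        # Teardown order follows the msproftx examples: stop, finalize, destroy_config.
        if started:
            prof.stop(config)
        prof.finalize()
        if config != 0:
            prof.destroy_config(config)


class FakeProf:
    """Stand-in for acl.prof that records call names (not part of pyACL)."""

    def __init__(self):
        self.calls = []

    def _record(self, name, result=0):
        self.calls.append(name)
        return result

    def init(self, path):
        return self._record("init")

    def create_config(self, devices, metrics, third, data_types):
        return self._record("create_config", result=1)

    def start(self, config):
        return self._record("start")

    def stop(self, config):
        return self._record("stop")

    def finalize(self):
        return self._record("finalize")

    def destroy_config(self, config):
        return self._record("destroy_config")


ACL_PROF_MSPROFTX = 0x00000080
fake = FakeProf()
try:
    with profiling_session(fake, "prof_output_dir", [0], 0, ACL_PROF_MSPROFTX):
        fake.calls.append("body")
        raise ValueError("simulated failure in the profiled code")
except ValueError:
    pass  # teardown has already run by the time the error reaches this handler
```

The recorded call order shows that the teardown runs even though the body raised, which is the property the plain sequential examples do not give on their own.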

Example 3 (acl.prof.push/acl.prof.pop, applicable to single-thread scenarios):

# Collection items: mac_fp16_ratio, mac_int8_ratio, vec_fp32_ratio, vec_fp16_ratio,
# vec_int32_ratio, and vec_misc_ratio.
ACL_AICORE_ARITHMETIC_UTILIZATION = 0
# profiling data type config
ACL_PROF_ACL_API = 0x00000001
ACL_PROF_TASK_TIME = 0x00000002
# profiling config type
ACL_PROF_SYS_HARDWARE_MEM_FREQ = 3

# Perform initialization.

# Allocate runtime resources.

# Initialize profiling and set the data flush path.
prof_path = "..."
ret = acl.prof.init(prof_path)
assert ret == 0
device_list = [0]
prof_config = acl.prof.create_config(device_list, ACL_AICORE_ARITHMETIC_UTILIZATION, 0, 
    ACL_PROF_ACL_API | ACL_PROF_TASK_TIME)
assert prof_config != 0
mem_freq = "15"
ret = acl.prof.set_config(ACL_PROF_SYS_HARDWARE_MEM_FREQ, mem_freq)
assert ret == 0
ret = acl.prof.start(prof_config)
assert ret == 0

# Load a model. After the model is successfully loaded, model_id that identifies the model is returned.

# Create data of type aclmdlDataset to describe the inputs and outputs of the model.

# Execute the model. (The model is executed only in a single thread.)
stamp = acl.prof.create_stamp()
assert stamp != 0
exec_msg = "acl.mdl.execute_duration"
ret = acl.prof.set_stamp_trace_message(stamp, exec_msg, len(exec_msg))
assert ret == 0
ret = acl.prof.push(stamp)
assert ret == 0
ret = acl.mdl.execute(model_id, dataset_input, dataset_output)
assert ret == 0
ret = acl.prof.pop(stamp)
assert ret == 0
acl.prof.destroy_stamp(stamp)

# Process the model inference result.

ret = acl.prof.stop(prof_config)
assert ret == 0
ret = acl.prof.finalize()
assert ret == 0
ret = acl.prof.destroy_config(prof_config)
assert ret == 0

# Deallocate runtime resources.

# Perform deinitialization.

Example 4 (acl.prof.range_start/acl.prof.range_stop, applicable to single-thread or cross-thread scenarios):

# Collection items: mac_fp16_ratio, mac_int8_ratio, vec_fp32_ratio, vec_fp16_ratio,
# vec_int32_ratio, and vec_misc_ratio.
ACL_AICORE_ARITHMETIC_UTILIZATION = 0
# profiling data type config
ACL_PROF_ACL_API = 0x00000001
ACL_PROF_TASK_TIME = 0x00000002
# profiling config type
ACL_PROF_SYS_HARDWARE_MEM_FREQ = 3

# Perform initialization.

# Allocate runtime resources.

# Initialize profiling and set the data flush path.
prof_path = "..."
ret = acl.prof.init(prof_path)
assert ret == 0
device_list = [0]
prof_config = acl.prof.create_config(device_list, ACL_AICORE_ARITHMETIC_UTILIZATION, 0, 
    ACL_PROF_ACL_API | ACL_PROF_TASK_TIME)
assert prof_config != 0
mem_freq = "15"
ret = acl.prof.set_config(ACL_PROF_SYS_HARDWARE_MEM_FREQ, mem_freq)
assert ret == 0
ret = acl.prof.start(prof_config)
assert ret == 0

# Load a model. After the model is successfully loaded, model_id that identifies the model is returned.

# Create data of type aclmdlDataset to describe the inputs and outputs of the model.

# Execute the model (the model is executed across threads).
stamp = acl.prof.create_stamp()
assert stamp != 0
exec_msg = "acl.mdl.execute_duration"
ret = acl.prof.set_stamp_trace_message(stamp, exec_msg, len(exec_msg))
assert ret == 0
range_id, ret = acl.prof.range_start(stamp)
assert ret == 0
ret = acl.mdl.execute(model_id, dataset_input, dataset_output)
assert ret == 0
ret = acl.prof.range_stop(range_id)
assert ret == 0
acl.prof.destroy_stamp(stamp)

# Process the model inference result.

ret = acl.prof.stop(prof_config)
assert ret == 0
ret = acl.prof.finalize()
assert ret == 0
ret = acl.prof.destroy_config(prof_config)
assert ret == 0

# Deallocate runtime resources.

# Perform deinitialization.
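
Example 3 is described as single-thread only, while Example 4 also covers cross-thread use and closes the range with the range_id returned by acl.prof.range_start. If push and pop are read as stack operations, as their names suggest, the toy recorder below is one way to picture the difference: a stack belongs to the thread that pushed, whereas an ID can be handed to another thread. This is a conceptual model only. RangeRecorder is a hypothetical class written in plain Python, and it does not describe how msproftx is implemented.

```python
import itertools
import threading
import time


class RangeRecorder:
    """Toy recorder contrasting stack-based and ID-based ranges (illustration only)."""

    def __init__(self):
        self._local = threading.local()   # holds one push/pop stack per thread
        self._open = {}                   # range_id -> (message, start_ns), shared
        self._ids = itertools.count(1)
        self._lock = threading.Lock()
        self.finished = []                # (message, duration_ns) in closing order

    def push(self, message):
        stack = getattr(self._local, "stack", None)
        if stack is None:
            stack = self._local.stack = []
        stack.append((message, time.perf_counter_ns()))

    def pop(self):
        # Closes the most recent push made by the calling thread.
        message, start = self._local.stack.pop()
        self._finish(message, start)

    def range_start(self, message):
        with self._lock:
            range_id = next(self._ids)
            self._open[range_id] = (message, time.perf_counter_ns())
        return range_id

    def range_stop(self, range_id):
        # Any thread that holds the ID can close the range.
        with self._lock:
            message, start = self._open.pop(range_id)
        self._finish(message, start)

    def _finish(self, message, start):
        with self._lock:
            self.finished.append((message, time.perf_counter_ns() - start))


recorder = RangeRecorder()
recorder.push("outer")
recorder.push("inner")
recorder.pop()    # closes "inner"
recorder.pop()    # closes "outer"

range_id = recorder.range_start("cross_thread")
worker = threading.Thread(target=recorder.range_stop, args=(range_id,))
worker.start()
worker.join()
names = [name for name, _ in recorder.finished]
```

In this model, a pop issued from the worker thread would fail because that thread has no stack of its own, while range_stop succeeds because the ID travels with the call.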

Subscription to Operator Information

Add an exception handling branch following the API calls. The following is a code snippet of key steps only, which is not ready to use.

For details about the allocation and deallocation of runtime resources, see Runtime Resource Allocation and Runtime Resource Deallocation. For details about the API call sequence for model loading, see API Call Sequence. For details about the API call sequence for model inference and input/output data preparation, see Preparing Input/Output Data Structure for Model Execution.

import os

import acl
import numpy as np
# ......

# 1. Allocate runtime resources.
# ......

# 2. Load a model. After the model is successfully loaded, model_id that identifies the model is returned.
# ......

# 3. Create data of type aclmdlDataset to describe the inputs and outputs of the model.
# ......

# 4. Create a pipeline to read and write the model subscription data.
r, w = os.pipe()

# 5. Create a model subscription configuration and subscribe to the model.
ACL_AICORE_NONE = 0xFF
subscribe_config = acl.prof.create_subscribe_config(1, ACL_AICORE_NONE, w)
# Pass model_id of the model for subscription.
ret = acl.prof.model_subscribe(model_id, subscribe_config)

# 6. Enable the pipeline to read subscription data.
# 6.1 Customize a function to read subscription data from the user memory.
def get_model_info(data, data_len):
    # Obtain the number of operators.
    op_number, ret = acl.prof.get_op_num(data, data_len)
    # Iterate over the operator information in the user memory.
    for i in range(op_number):
        # Obtain the model ID of the operator.
        model_id = acl.prof.get_model_id(data, data_len, i)
        # Obtain the operator type.
        op_type, ret = acl.prof.get_op_type(data, data_len, i, 65)
        # Obtain the operator name.
        op_name, ret = acl.prof.get_op_name(data, data_len, i, 275)
        # Obtain the execution start time of the operator.
        op_start = acl.prof.get_op_start(data, data_len, i)
        # Obtain the execution end time of the operator.
        op_end = acl.prof.get_op_end(data, data_len, i)
        # Obtain the time required for executing the operator.
        op_duration = acl.prof.get_op_duration(data, data_len, i)

# 6.2 Customize a function to read data from the pipeline to the user memory.
def prof_data_read(args):
    fd, ctx = args
    ret = acl.rt.set_context(ctx)
    # Obtain the operator information buffer size (in bytes) per operator.
    buffer_size, ret = acl.prof.get_op_desc_size()
    # Set the number of operators read from the pipe each time.
    N = 10
    # Calculate the total operator information buffer size.
    data_len = buffer_size * N
    # Read data from the pipeline to the allocated memory. The actual size of the
    # read data may be less than buffer_size * N. If there is no data in the
    # pipeline, the read blocks until data is available.
    while True:
        data = os.read(fd, data_len)
        if len(data) == 0:
            break
        np_data = np.array(data)
        
        bytes_data = np_data.tobytes()
        np_data_ptr = acl.util.bytes_to_ptr(bytes_data)
        size = np_data.itemsize * np_data.size
        # Call the function implemented in 6.1 to parse data in the memory.
        get_model_info(np_data_ptr, size)

# 7. Start the thread to read and parse the pipeline data.
thr_id, ret = acl.util.start_thread(prof_data_read, [r, context])

# 8. Execute the model.
ret = acl.mdl.execute(model_id, input, output)

# 9. Process the model inference result.
# ......

# 10. Destroy allocations such as the model inputs and outputs, free memory, and unload the model.
# ......

# 11. Unsubscribe from the model and destroy the subscription-related resources.
ret = acl.prof.model_unsubscribe(model_id)
ret = acl.util.stop_thread(thr_id)
os.close(r)
ret = acl.prof.destroy_subscribe_config(subscribe_config)

# 12. Deallocate runtime resources.
# ......
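
The reader in step 6.2 notes that a read may return less than the requested size. The sketch below shows one defensive way to turn such reads into whole fixed-size records by buffering leftover bytes. It uses only os.pipe, os.read, os.write, and os.close from the Python standard library. iter_records and RECORD_SIZE are hypothetical, and whether the subscription pipeline ever delivers a partial record is not stated in this section.

```python
import os


def iter_records(fd, record_size, records_per_read=10):
    """Yield whole fixed-size records from a pipe until the write end is closed."""
    pending = b""
    while True:
        chunk = os.read(fd, record_size * records_per_read)
        if not chunk:
            # An empty read means every write end of the pipe has been closed.
            break
        pending += chunk
        whole = len(pending) - len(pending) % record_size
        for offset in range(0, whole, record_size):
            yield pending[offset:offset + record_size]
        # Keep the incomplete tail until the rest of the record arrives.
        pending = pending[whole:]


# Hypothetical record size; a real reader would use acl.prof.get_op_desc_size().
RECORD_SIZE = 8

r, w = os.pipe()
# Two whole records followed by the first 3 bytes of a third record.
os.write(w, b"A" * RECORD_SIZE + b"B" * RECORD_SIZE + b"C" * 3)
# The remaining 5 bytes of the third record.
os.write(w, b"C" * 5)
os.close(w)

records = list(iter_records(r, RECORD_SIZE))
os.close(r)
```

Each yielded record has exactly RECORD_SIZE bytes, so it could be passed to a parser such as get_model_info one buffer at a time. The loop ends only when the write end of the pipe is closed.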