diff --git a/doc/design_ros_2.md b/doc/design_ros_2.md new file mode 100644 index 0000000..668aea2 --- /dev/null +++ b/doc/design_ros_2.md @@ -0,0 +1,511 @@ +# ROS 2 tracing + +Design document for ROS 2 tracing, instrumentation, and analysis effort. + +## Goals and requirements + +### Goals + +1. Provide low-overhead tools and resources for robotics software development based on ROS 2. +2. Make tracing easier to use with ROS. + +### Requirements: instrumentation + +Instrumentation should be built around the main uses of ROS 2, and should include relevant information: + +1. Overall + 1. When creating a publisher/subscriber/service/client/etc., appropriate references should be kept in order to correlate with other tracepoints related to the same instance. +1. Publishers & subscriptions + 1. When creating a publisher/subscription, the effective topic name should be included (i.e. including namespace and after remapping). + 2. When publishing a message, some sort of message identifier should be included in the tracepoint so it can be tracked through DDS up to the subscriber's side. +3. Callbacks (subscription, service, client, timer) + 1. Callback function symbol should be included, whenever possible. + 2. Information about callback execution (e.g. start & end) should be available. +4. Timers + 1. Information about the period should be available. +5. Executors + 1. Information about spin cycles & periods should be available. +6. Others + 1. Provide generic tracepoints for user code. + +### Requirements: analysis & visualization + +Analyses process trace data. They should be general enough to be useful for different use-cases, e.g.: + +* Callback duration +* Time between callbacks (between two callback starts and/or a callback end and a start) +* Message age (as the difference between processing time and message timestamp) +* Message size +* Memory usage +* Execution time/proportion accross a process' nodes/components +* Interruptions (noting that these may be more useful as time-based metrics instead of overall statistics): + * scheduling events during a callback + * delay between the moment a thread becomes ready and when it's actually scheduled + * CPU cycles + +with mean, stdev, etc. when applicable. + +Generic tracepoints for ROS 2 user code could be applied to a user-provided model for higher-level behaviour statistics and visualization. + +### Tools/accessibility + +To make tracing ROS 2 more accessible and easier to adopt, we can put effort into integrating LTTng session setup & recording into the ROS 2 launch system. + +This might include converting existing `tracetools` scripts to more flexible Python scripts, and then plugging that into the launch system. + +## Instrumentation design + +This section includes information about ROS 2's design & architecture through descriptions of the main execution flows. The instrumentation can then be built around that. + +### Flow description + +#### Process creation + +In the call to `rclcpp::init()`, a process-specific `rclcpp::Context` object is fetched and CLI arguments are parsed. Much of the work is actually done by `rcl` through a call to `rcl_init()`. This call processes the `rcl_context_t` handle, which is wrapped by the `Context` object. Also, inside this call, `rcl` calls `rmw_init()` to process the `rmw` context (`rmw_context_t`) as well. This `rmw` handle is itself part of the `rcl_context_t` handle. + +This has to be done once per process, and usually at the very beginning. The components that are then instanciated share this context. + +```mermaid +sequenceDiagram + participant process + participant rclcpp + participant Context + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + process->>rclcpp: rclcpp::init(argc, argv) + Note over rclcpp: fetches process-specific Context object + rclcpp->>Context: init(argc, argv) + Note over Context: allocates rcl_context_t handle + Context->>rcl: rcl_init(out rcl_context_t) + Note over rcl: validates & processes rcl_context_t handle + rcl->>rmw: rmw_init(out rmw_context_t) + Note over rmw: validates & processes rmw_context_t handle + + rcl-->>tracetools: TP(rcl_init, rcl_context_t *) +``` + +#### Node/component creation + +In ROS 2, a process can contain multiple nodes. These are sometimes referred to as "components." + +These components are instanciated by the containing process. They are usually classes that extend `rclcpp::Node`, so that the node initialization work is done by the parent constructor. + +This parent constructor will allocate its own `rcl_node_t` handle and call `rcl_node_init()`, which will validate the node name/namespace. `rcl` will also call `rmw_create_node()` to get the node's `rmw` handle (`rmw_node_t`). This will be used later by publishers and subscriptions. + +```mermaid +sequenceDiagram + participant process + participant Component + participant rclcpp + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + process->>Component: Component() + Component->>rclcpp: : Node(node_name, namespace) + Note over rclcpp: allocates rcl_node_t handle + rclcpp->>rcl: rcl_node_init(out rcl_node_t, node_name, namespace) + Note over rcl: validates node name/namespace + Note over rcl: populates rcl_note_t + rcl->>rmw: rmw_create_node(node_name, local_namespace) : rmw_node_t + Note over rmw: creates rmw_node_t handle + + rcl-->>tracetools: TP(rcl_node_init, rcl_node_t *, rmw_node_t *, node_name, namespace) +``` + +#### Publisher creation + +The component calls `create_publisher()`, a `rclcpp::Node` method for convenience. That ends up creating an `rclcpp::Publisher` object which extends `rclcpp::PublisherBase`. The latter allocates an `rcl_publisher_t` handle, fetches the corresponding `rcl_node_t` handle, and calls `rcl_publisher_init()` in its constructor. `rcl` does topic name expansion/remapping/validation. It creates an `rmw_publisher_t` handle by calling `rmw_create_publisher()` of the given `rmw` implementation and associates with the node's `rmw_node_t` handle and the publisher's `rcl_publisher_t` handle. + +If intra-process publishing/subscription is enabled, it will be set up after creating the publisher object, through a call to `PublisherBase::setup_intra_process()`, which calls `rcl_publisher_init()`. + +```mermaid +sequenceDiagram + participant Component + participant rclcpp + participant Publisher + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Component->>rclcpp: create_publisher(topic_name, options, use_intra_process) + Note over rclcpp: (...) + rclcpp->>Publisher: Publisher(topic_name, options) + Note over Publisher: allocates rcl_publisher_t handle + Publisher->>rcl: rcl_publisher_init(out rcl_publisher_t, rcl_node_t, topic_name, options) + Note over rcl: populates rcl_publisher_t + rcl->>rmw: rmw_create_publisher(rmw_node_t, topic_name, qos_options) : rmw_publisher_t + Note over rmw: creates rmw_publisher_t handle + + rcl-->>tracetools: TP(rcl_publisher_init, rcl_node_t *, rcl_publisher_t *, rmw_publisher_t *, topic_name, depth) + + opt use_intra_process + rclcpp->>Publisher: setup_intra_process() + Publisher->>rcl: rcl_publisher_init(...) + end +``` + +#### Subscription creation + +Subscription creation is done in a very similar manner. + +The componenent calls `create_publisher()`, which ends up creating an `rclcpp::Subscription` object which extends `rclcpp::SubscriptionBase`. The latter allocates an `rcl_subscription_t` handle, fetches its `rcl_node_t` handle, and calls `rcl_subscription_init()` in its constructor. `rcl` does topic name expansion/remapping/validation. It creates an `rmw_subscription_t` handle by calling `rmw_create_subscription()` of the given `rmw` implementation and associates it with the node's `rmw_node_t` handle and the subscription's `rcl_subscription_t` handle. + +If intra-process publishing/subscription is enabled, it will be set up after creating the subscription object, through a call to `Subscription::setup_intra_process()`, which calls `rcl_subscription_init()`. This is very similar to a normal (inter-process) subscription, but it sets some flags for later. + +```mermaid +sequenceDiagram + participant Component + participant rclcpp + participant Subscription + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Component->>rclcpp: create_subscription(topic_name, callback, options, use_intra_process) + Note over rclcpp: (...) + rclcpp->>Subscription: Subscription(topic_name, callback, options) + Note over Subscription: allocates rcl_subscription_t handle + Subscription->>rcl: rcl_subscription_init(out rcl_subscription_t, rcl_node_t, topic_name, options) + Note over rcl: populates rcl_subscription_t + rcl->>rmw: rmw_create_subscription(rmw_node_t, topic_name, qos_options) : rmw_subscription_t + Note over rmw: creates rmw_subscription_t handle + + rcl-->>tracetools: TP(rcl_subscription_init, rcl_node_t *, rcl_subscription_t *, rmw_subscription_t *, topic_name, depth) + + opt use_intra_process + rclcpp->>Subscription: setup_intra_process() + Subscription->>rcl: rcl_subscription_init(...) + end + + rclcpp-->>tracetools: TP(rclcpp_subscription_callback_added, rcl_subscription_t *, &any_callback) +``` + +#### Executors + +An `rclcpp::executor::Executor` object is created for a given process. It can be a `SingleThreadedExecutor` or a `MultiThreadedExecutor`. + +Components are instanciated, usually as a `shared_ptr` through `std::make_shared()`, then added to the executor with `Executor::add_node()`. + +After all the components have been added, `Executor::spin()` is called. `SingleThreadedExecutor::spin()` simply loops forever until the process' context isn't valid anymore. It fetches the next `rclcpp::AnyExecutable` (e.g. subscription, timer, service, client), and calls `Executor::execute_any_executable()` with it. This then calls the relevant `execute*()` method (e.g. `execute_timer()`, `execute_subscription()`, `execute_intra_process_subscription()`, `execute_service()`, `execute_client()`). + +```mermaid +sequenceDiagram + participant process + participant Executor + participant tracetools + + process->>Executor: Executor() + Note over process: instanciates components + process->>Executor: add_node(component) + process->>Executor: spin() + loop until shutdown + Executor-->>tracetools: TP(?) + + Note over Executor: get_next_executable() + Note over Executor: execute_any_executable() + Note over Executor: execute_*() + end +``` + +#### Subscription callbacks + +Subscriptions are handled in the `rclcpp` layer. Callbacks are wrapped by an `rclcpp::AnySubscriptionCallback` object, which is registered when creating the `rclcpp::Subscription` object. + +In `execute_*subscription()`, the `Executor` asks the `Subscription` to allocate a message though `Subscription::create_message()`. It then calls `rcl_take*()`, which calls `rmw_take_with_info()`. If that is successful, the `Executor` then passes that on to the subscription through `rclcpp::SubscriptionBase::handle_message()`. This checks if it's the right type of subscription (i.e. inter vs. intra process), then it calls `dispatch()` on the `rclcpp::AnySubscriptionCallback` object with the message (cast to the actual type). This calls the actual `std::function` with the right signature. + +Finally, it returns the message object through `Subscription::return_message()`. + +```mermaid +sequenceDiagram + participant Executor + participant Subscription + participant AnySubscriptionCallback + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Note over Executor: execute_subscription() + Executor->>Subscription: create_message(): std::shared_ptr + Executor->>rcl: rcl_take*(rcl_subscription_t, out msg) : ret + rcl->>rmw: rmw_take_with_info(rmw_subscription_t, out msg, out taken) + Note over rmw: copies available message to msg if there is one + opt RCL_RET_OK == ret + Executor->>Subscription: handle_message(msg) + Note over Subscription: casts msg to its actual type + Subscription->>AnySubscriptionCallback: dispatch(typed_msg) + AnySubscriptionCallback-->>tracetools: TP(callback_start, this, is_intra_process) + Note over AnySubscriptionCallback: std::function(...) + AnySubscriptionCallback-->>tracetools: TP(callback_end, this) + end + Executor->>Subscription: return_message(msg) +``` + +#### Message publishing + +To publish a message, an object is first allocated and then populated by the `Component` (or equivalent). Then, the message is sent to the `Publisher` through `publish()`. This then passes that on to `rcl`, which itself passes it to `rmw`. + +TODO add inter- vs. intra-process execution flow +TODO talk about IntraProcessManager stuff? + +```mermaid +sequenceDiagram + participant Component + participant Publisher + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Note over Component: creates a msg + Component->>Publisher: publish(msg) + Note over Publisher: ... + Publisher->>rcl: rcl_publish(rcl_publisher_t, msg) + rcl->>rmw: rmw_publish(rmw_publisher_t, msg) + rmw-->>tracetools: TP(?) +``` + +#### Service creation + +Service server creation is similar to subscription creation. The `Component` calls `create_service()` which ends up creating a `rclcpp::Service`. In its constructor, it allocates a `rcl_service_t` handle, then calls `rcl_service_init()`. This processes the handle and validates the service name. It calls `rmw_create_service()` to get the corresponding `rmw_service_t` handle. + +```mermaid +sequenceDiagram + participant Component + participant rclcpp + participant Service + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Component->>rclcpp: create_service(service_name, callback) + Note over rclcpp: (...) + rclcpp->>Service: Service(rcl_node_t, service_name, callback, options) + Note over Service: allocates a rcl_service_t handle + Service->>rcl: rcl_service_init(out rcl_service_t, rcl_node_t, service_name, options) + Note over rcl: validates & processes service handle + rcl->>rmw: rmw_create_service(rmw_node_t, service_name, qos_options) : rmw_service_t + Note over rmw: creates rmw_service_t handle + rcl-->>tracetools: TP(rcl_service_init, rcl_node_t *, rcl_service_t *, rmw_service_t *, service_name) + Service-->>tracetools: TP(rclcpp_service_callback_added, rcl_service_t *, &any_callback) +``` + +#### Service callbacks + +Service callbacks are similar to subscription callbacks. In `execute_service()`, the `Executor` allocates request header and request objects. It then calls `rcl_take_request()` and passes them along with the service handle. + +`rcl` calls `rmw_take_request()`. If those are successful, then the `Executor` calls `handle_request()` on the `Service`. This casts the request to its actual type, allocates a response object, and calls `dispatch()` on its `AnyServiceCallback` object, which calls the actual `std::function` with the right signature. + +For the service response, `Service` calls `rcl_send_response()` which calls `rmw_send_response()`. + +```mermaid +sequenceDiagram + participant Executor + participant Service + participant AnyServiceCallback + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Note over Executor: execute_service() + Note over Executor: allocates request header and request + Executor->>rcl: rcl_take_request(rcl_service, out request_header, out request) : ret + rcl->>rmw: rmw_take_request(rmw_service_t, out request_header, out request, out taken) + opt RCL_RET_OK == ret + Executor->>Service: handle_request(request_header, request) + Note over Service: casts request to its actual type + Note over Service: allocates a response object + Service->>AnyServiceCallback: dispatch(request_header, typed_request, response) + AnyServiceCallback-->>tracetools: TP(callback_start, this) + Note over AnyServiceCallback: std::function(...) + AnyServiceCallback-->>tracetools: TP(callback_end, this) + Service->>rcl: rcl_send_response(rcl_service_t, request_header, response) + rcl->>rmw: rmw_send_response(rmw_service_t, request_header, response) + end +``` + +#### Client creation + +Client creation is similar to publisher creation. The `Component` calls `create_client()` which ends up creating a `rclcpp::Client`. In its constructor, it allocates a `rcl_client_t` handle, then calls `rcl_client_init()`. This validates and processes the handle. It also calls `rmw_create_client()` which creates the `rmw_client_t` handle. + +```mermaid +sequenceDiagram + participant Component + participant Node + participant Client + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Component->>Node: create_client(service_name, options) + Node->>Client: Client(service_name, options) + Note over Client: allocates a rcl_client_t handle + Client->>rcl: rcl_client_init(out rcl_client_t, rcl_node_t, service_name, options) + Note over rcl: validates and processes rcl_client_t handle + rcl->>rmw: rmw_create_client(rmw_node_t, service_name, qos_options) : rmw_client_t + Note over rmw: creates rmw_client_t handle + rcl-->>tracetools: TP(rcl_client_init, rcl_node_t *, rcl_client_t *, rmw_client_t *, service_name) +``` + +#### Client request + +A client request has multiple steps. The `Component` (or the owner of the `Client`) first creates a request object. It then calls `Client::async_send_request()` with the request. It can also provide a callback, but it's optional. The `Client` passes that on to `rcl` by calling `rcl_send_request()`. `rcl` generates a sequence number and assigns it to the request, then calls `rmw_send_request()`. Once this is done, the `Client` puts this sequence number in an internal map along with the created promise and future objects, and the callback (which might simply be empty). + +At this point, the `Client` could simply let its callback be called. It can also use the future object returned by `async_send_request()`, and call `rclcpp::spin_until_future_complete()`. This waits until the future object is ready, or until timeout, and returns. + +If this last call was successful, then the `Component` can get the result and do something with it. + +```mermaid +sequenceDiagram + participant Component + participant Executor + participant Client + participant rclcpp + participant rcl + participant rmw + participant tracetools + + Note over rmw: (implementation) + + Note over Component: creates request + Component->>Client: async_send_request(request[, callback]) : result_future + Client->>rcl: rcl_send_request(rcl_client_t, request, out sequence_number) + Note over rcl: assigns sequence_number + rcl-->>tracetools: TP?(rcl_send_request, rcl_client_t *, sequence_number) + rcl->>rmw: rmw_send_request(rmw_client_t, request, sequence_number) + Note over Client: puts sequence_number in a map with promise+callback+future + + Component->>rclcpp: spin_until_future_complete(result_future) : result_status + + Note over Executor: execute_client() + Note over Executor: creates request_header and response objects + Executor->>rcl: rcl_take_response(rcl_client_t, out request_header, out response) : ret + rcl-->>tracetools: TP?() + rcl->>rmw: rmw_take_response(rmw_client_t, out request_header, out response, out taken) + opt RCL_RET_OK == ret + Executor->>Client: handle_response(request_header, response) + Note over Client: gets sequence_number from request_header + Client-->>tracetools: TP?() + Note over Client: gets promise+callback+future from its map + Note over Client: callback(future) + end + + rclcpp->>Component: ready or timeout + opt SUCCESS == result_status + Note over Component: result_future.get() : result + Note over Component: do something with result + end +``` + +#### Timer creation + +Timer creation is similar to subscription creation. The `Component` calls `create_service()` which ends up creating a `rclcpp::WallTimer`. In its constructor, it creates a `rclcpp::Clock` object, which (for a `WallTimer`) is simply a nanosecond clock. It then allocates a `rcl_timer_t` handle, then calls `rcl_timer_init()`. This processes the handle and validates the period. + +Note that `rcl_timer_init()` can take a callback as a parameter, but right now that feature is not used anywhere (`nullptr` is given), and callbacks are instead handled in the `rclcpp` layer. + +```mermaid +sequenceDiagram + participant Component + participant Node + participant WallTimer + participant rcl + participant tracetools + + Component->>Node: create_wall_timer(period, callback) + Node->>WallTimer: WallTimer(period, callback, Context) + Note over WallTimer: creates a Clock object + Note over WallTimer: allocates a rcl_timer_t handle + WallTimer->>rcl: rcl_timer_init(out rcl_timer_t, Clock, rcl_context_t, period) + Note over rcl: validates and processes rcl_timer_t handle + rcl-->>tracetools: TP(rcl_timer_init, rcl_timer_t *, period) + WallTimer-->>tracetools: TP(rclcpp_timer_callback_added, rcl_timer_t *, &callback) +``` + +#### Timer callbacks + +Timer callbacks are similar to susbcription callbacks. In `execute_timer()`, the `Executor` calls `execute_callback()` on the `WallTimer`. The timer then calls `rcl_timer_call()` with its `rcl_timer_t` handle and checks if the callback should be called. + +If it that is the case, then the timer will call the actual `std::function`. Depending on the `std::function` that was given when creating the timer, it will either call the callback without any parameters or it will pass a reference of itself. + +```mermaid +sequenceDiagram + participant Executor + participant WallTimer + participant rcl + participant tracetools + + Note over Executor: execute_timer() + Executor->>WallTimer: execute_callback() + WallTimer->>rcl: rcl_timer_call(rcl_timer_t) : ret + Note over rcl: validates and updates timer + opt RCL_RET_TIMER_CANCELED != ret && RCL_RET_OK == ret + WallTimer-->>tracetools: TP(callback_start, &callback) + Note over WallTimer: std::function(...) + WallTimer-->>tracetools: TP(callback_end, &callback) + end +``` + +## Design & implementation notes + +### Targeted tools/dependencies + +The targeted tools or dependencies are: + +* LTTng for tracing +* pandas and Jupyter for analysis & visualization + +### Design + +The plan is to use LTTng with a ROS wrapper package like `tracetools` for ROS 1. The suggested setup is: + +* a tracing package (e.g. `tracetools`) wraps calls to LTTng +* ROS 2 is instrumented with calls to the tracing package, therefore it becomes a dependency and ships with the core stack +* by default, the tracing package's functions are empty -- they do not do anything +* if users wants to enable tracing, they need to + * install LTTng + * compile the tracing package from source, setting the right compile flag(s) + * overlay it on top of their ROS 2 installation +* use other package(s) for analysis and visualization + +## Architecture + +![](img/tracing_architecture.png) + +### Timeline + +The first goal is to statically instrument ROS 2, aiming for it to be in the ROS 2 E-turtle release (Nov 2019). + +This includes transposing the existing ROS 1 instrumentation to ROS 2, wherever applicable. This step may not include instrumenting DDS implementations, and thus may be limited to the layer(s) right before `rmw`. + +### Notes on client libraries + +ROS offer a client library (`rcl`) written in C as the base for any language-specific implementation, such as `rclcpp` and `rclpy`. + +However, `rcl` is obviously fairly basic, and still does leave a fair amount of implementation work up to the client libraries. For example, callbacks are not handled in `rcl`, and are left to the client library implementations. + +This means that some instrumentation work will have to be re-done for every client library that we want to trace. We cannot simply instrument `rcl`, nor can we only instrument the base `rmw` interface if we want to dig into that. + +This effort should first focus on `rcl` and `rclcpp` , but `rclpy` should eventually be added and supported. + +### ROS 1/2 compatibility + +We could look into making analyses work on both ROS 1 and ROS 2, through a common instrumentation interface (or other abstraction). diff --git a/doc/img/tracing_architecture.png b/doc/img/tracing_architecture.png new file mode 100644 index 0000000..5396b13 Binary files /dev/null and b/doc/img/tracing_architecture.png differ diff --git a/doc/public_post_plan.md b/doc/public_post_plan.md new file mode 100644 index 0000000..67a0c17 --- /dev/null +++ b/doc/public_post_plan.md @@ -0,0 +1,23 @@ + +Plan for a first public post/announcement. +This week. + +Simple example: +* Jupyter notebook to plot: + * subscription callback duration + * CPU cycles/callback +* and resolve/demangle symbols +* Include all files/scripts needed to get the result, but keep it simple +* To be hosted somewhere +* With some documentation/guide on how to try it out + +Post: +* Mention that we're working on this +* Talk about goals in general +* Target = next release +* Show off example +* Invite people to ask questions/discuss + +Architecture considerations (right now but also later): +* Try to use a dataframe as a ROS model abstraction, instead of first going through a layer with python data structures +* \ No newline at end of file