} shared_data_t;

shm = odp_shm_reserve(BLKNAME, sizeof(shared_data_t), ALIGNMENT, shm_flags);

=== Getting the shared memory block address
The returned odp_shm_t handle can then be used to retrieve the actual
address (in the caller's ODP thread virtual address space) of the created
shared memory block.

.getting the address of a shared memory block
[source,c]
-----
shared_data_t *shared_data;

shared_data = odp_shm_addr(shm);
-----

The address returned by `odp_shm_addr()` is normally valid only in the calling
ODP thread's address space: `odp_shm_t` handles can be shared between ODP
threads and remain valid in any of them, whereas the address returned by
`odp_shm_addr(shm)` may differ from one ODP thread to another (for the same
'shm' block), and should therefore not be shared between ODP threads. For
instance, it would be correct to send a shm handle using IPC between two ODP
threads and let each of these threads call `odp_shm_addr()` to get the
block address. Directly sending the address returned by `odp_shm_addr()` from
one ODP thread to another could, however, fail (the address may make no
sense in the receiver's address space).

The address returned by `odp_shm_addr()` is nevertheless guaranteed to be
aligned according to the alignment requirements provided at block creation
time, even if the call to `odp_shm_addr()` is performed by a different ODP
thread than the one which originally called `odp_shm_reserve()`.

Each shared memory block is contiguous in any ODP thread's address space:
the range from 'address' to 'address'\+'size' (where 'size' is the shared
memory block size, as provided in the `odp_shm_reserve()` call) is readable,
writeable, and maps the whole shared memory block. There is no fragmentation.

The exception to this rule is when the `odp_shm_t` is created with the
`ODP_SHM_SINGLE_VA` flag. This flag requests that `odp_shm_addr()` return the
same virtual address for all ODP threads in this instance. Note that providing
this feature may involve a performance cost or a shm size limit in some
implementations.

=== Memory behavior
By default, ODP threads are assumed to behave as a cache coherent system:
any change performed on a shared memory block is guaranteed to eventually
become visible to other ODP threads sharing this memory block.
Nevertheless, there is no implicit memory barrier associated with any action
on shared memory: *when* a change performed by one ODP thread becomes visible
to another ODP thread is not defined. An application using shared memory
blocks therefore has to use the memory barriers provided by ODP to guarantee
shared data validity between ODP threads.
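
As an illustration, the following minimal sketch (assuming a producer and a
consumer ODP thread, and hypothetical `counter` and `ready` fields in the
shared block) uses the ODP memory barrier APIs to make the producer's writes
reliably visible to the consumer:

[source,c]
-----
/* Producer ODP thread: fill the block, then publish it */
shared_data->counter = 42;
odp_mb_release();		/* order the stores above before the flag below */
shared_data->ready = 1;

/* Consumer ODP thread: wait for the flag, then read the block */
while (shared_data->ready == 0)
	odp_cpu_pause();
odp_mb_acquire();		/* order the flag load before the loads below */
consume(shared_data->counter);	/* hypothetical application function */
-----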

The virtual address at which a given memory block is mapped in different ODP
threads may differ from ODP thread to ODP thread, if ODP threads have separate
virtual spaces (for instance if ODP threads are implemented as processes).
However, the ODP_SHM_SINGLE_VA flag can be used at `odp_shm_reserve()` time
to guarantee address uniqueness in all ODP threads, regardless of their
implementation or creation time.

=== Lookup by name
As mentioned, shared memory handles can be sent from one ODP thread to another
using any IPC mechanism, and the block address then retrieved locally.
A simpler approach to getting the handle of an already created block is to use
the `odp_shm_lookup()` API function call.
This nevertheless requires the calling ODP thread to provide the name of the
shared memory block:
`odp_shm_lookup()` will return `ODP_SHM_INVALID` if no shared memory block
with the provided name is known by ODP. When multiple blocks were reserved
using the same name, the lookup function will return the handle of any one
of these blocks.

.retrieving a block handle and address from another ODP task
[source,c]
-----
#define BLKNAME "shared_items"

odp_shm_t shm;
shared_data_t *shared_data;

shm = odp_shm_lookup(BLKNAME);
if (shm != ODP_SHM_INVALID) {
	shared_data = odp_shm_addr(shm);
	...
}
-----

=== Freeing memory
Freeing shared memory is performed using the `odp_shm_free()` API call.
`odp_shm_free()` takes one single argument, the shared memory block handle.
Any ODP thread is allowed to perform an `odp_shm_free()` on a shared memory
block (i.e. the thread performing the `odp_shm_free()` may be different
from the thread which performed the `odp_shm_reserve()`). A shared memory block
should be freed only once, and once freed, it should no longer
be referenced by any ODP thread.

.freeing a shared memory block
[source,c]
-----
if (odp_shm_free(shm) != 0) {
	... /* handle error */
}
-----

=== Sharing memory with the external world
ODP provides ways of sharing memory with entities located outside
ODP instances:

Sharing a block of memory with an external (non ODP) thread is achieved
by setting the ODP_SHM_PROC flag at `odp_shm_reserve()` time.
How the memory block is retrieved on the Operating System side is
implementation and Operating System dependent.

Sharing a block of memory with an external ODP instance (running
on the same Operating System) is achieved
by setting the ODP_SHM_EXPORT flag at `odp_shm_reserve()` time.
A block of memory created with this flag in an ODP instance A can be "mapped"
into a remote ODP instance B (on the same OS) by calling
`odp_shm_import()` in ODP instance B:

.sharing memory between ODP instances: instance A
[source,c]
-----
odp_shm_t shmA;

shmA = odp_shm_reserve("memoryA", size, 0, ODP_SHM_EXPORT);
-----

.sharing memory between ODP instances: instance B
[source,c]
-----
odp_shm_t shmB;
odp_instance_t odpA;

/* get ODP A instance handle by some OS method */
odpA = ...

/* get the shared memory exported by A: */
shmB = odp_shm_import("memoryA", odpA, "memoryB", 0, 0);
-----

Note that the handles shmA and shmB are scoped by each ODP instance
(you can not use them outside the ODP instance they belong to).
Also note that both ODP instances have to call `odp_shm_free()` when done.

=== Memory creation flags
The last argument to `odp_shm_reserve()` is a set of ORed flags.
The following flags are supported:

==== ODP_SHM_PROC
When this flag is given, the allocated shared memory will become visible
outside ODP. Non-ODP threads (e.g. normal Linux processes or Linux threads)
will be able to access the memory using native (non ODP) OS calls such as
'shm_open()' and 'mmap()' (for Linux).
Each ODP implementation should provide a description of exactly how
this mapping is done on that specific platform.

==== ODP_SHM_EXPORT
When this flag is given, the allocated shared memory will become visible
to other ODP instances running on the same OS.
Other ODP instances willing to see this exported memory should use the
`odp_shm_import()` ODP function.

==== ODP_SHM_SW_ONLY
This flag tells ODP that the shared memory will be used by the ODP application
software only: no HW (such as DMA or another accelerator) will ever
try to access the memory. No other ODP call will be involved with this memory
(as ODP calls could implicitly involve HW, depending on the ODP
implementation), except for `odp_shm_lookup()` and `odp_shm_free()`.
ODP implementations may use this flag as a hint for performance optimization,
or may simply ignore it.

==== ODP_SHM_SINGLE_VA
This flag is used to guarantee the uniqueness of the address at which
the shared memory is mapped: without this flag, a given memory block may be
mapped at different virtual addresses (assuming the target has virtual
addresses) by different ODP threads. This means that the value returned by
`odp_shm_addr()` could differ between threads. Setting this flag guarantees
that all ODP threads sharing this memory block will see it at the same address
(`odp_shm_addr()` returns the same value in all ODP threads for a given memory
block). Note that ODP implementations may have restrictions on the amount of
memory which can be allocated with this flag.
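
For example, a block whose address must be identical in every ODP thread could
be reserved as follows (a minimal sketch; the block name and size are
illustrative):

[source,c]
-----
odp_shm_t shm;

/* all ODP threads will see this block at the same virtual address */
shm = odp_shm_reserve("single_va_blk", sizeof(shared_data_t),
		      ODP_CACHE_LINE_SIZE, ODP_SHM_SINGLE_VA);
if (shm == ODP_SHM_INVALID)
	/* handle error */;
-----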

== Queues and the Scheduler
Queues are the fundamental event sequencing mechanism provided by ODP and all
ODP applications make use of them either explicitly or implicitly. Queues are
created via the `odp_queue_create()` API, which returns a handle of type
`odp_queue_t` that is used to refer to this queue in all subsequent APIs that
reference it. Queues have one of two ODP-defined _types_, PLAIN and SCHED, that
determine how they are used. PLAIN queues are directly managed by the ODP
application while SCHED queues make use of the *ODP scheduler* to provide
automatic scalable dispatching and synchronization services.

.Operations on PLAIN queues
[source,c]
-----
odp_queue_t plain_q1 = odp_queue_create("poll queue 1", ODP_QUEUE_TYPE_PLAIN, NULL);
odp_queue_t plain_q2 = odp_queue_create("poll queue 2", ODP_QUEUE_TYPE_PLAIN, NULL);
...
odp_event_t ev = odp_queue_deq(plain_q1);
...do something
int rc = odp_queue_enq(plain_q2, ev);
-----

The key distinction is that dequeueing events from PLAIN queues is an
application responsibility while dequeueing events from SCHED queues is the
responsibility of the ODP scheduler.

.Operations on SCHED queues
[source,c]
-----
odp_queue_param_t qp;
odp_queue_param_init(&qp);

odp_schedule_prio_t prio = ...;
odp_schedule_group_t sched_group = ...;

qp.sched.prio = prio;
qp.sched.sync = ODP_SCHED_SYNC_[PARALLEL|ATOMIC|ORDERED];
qp.sched.group = sched_group;
qp.lock_count = n; /* Only relevant for ordered queues */

odp_queue_t sched_q1 = odp_queue_create("sched queue 1", ODP_QUEUE_TYPE_SCHED, &qp);

...thread init processing

while (1) {
	odp_event_t ev;
	odp_queue_t which_q;

	ev = odp_schedule(&which_q, <wait option>);
	...process the event
}
-----

With scheduled queues, events are sent to a queue, and the sender chooses
a queue based on the service it needs. The sender does not need to know
which ODP thread (on which core) or hardware accelerator will process
the event, but all the events on a queue are eventually scheduled and processed.

As can be seen, SCHED queues have additional attributes that are specified at
queue create time and control how the scheduler is to process events contained
on them. These include group, priority, and synchronization class.

=== Scheduler Groups
The scheduler's dispatching job is to return the next event from the highest
priority SCHED queue that the caller is eligible to receive events from.
This latter consideration is determined by the queue's _scheduler group_, which
is set at queue create time, and by the caller's _scheduler group mask_ that
indicates which scheduler group(s) it belongs to. Scheduler groups are
represented by handles of type `odp_schedule_group_t` and are created by
the *odp_schedule_group_create()* API. A number of scheduler groups are
_predefined_ by ODP. These include `ODP_SCHED_GROUP_ALL` (all threads),
`ODP_SCHED_GROUP_WORKER` (all worker threads), and `ODP_SCHED_GROUP_CONTROL`
(all control threads). The application is free to create additional scheduler
groups for its own purposes, and threads can join or leave scheduler groups
using the *odp_schedule_group_join()* and *odp_schedule_group_leave()* APIs,
as shown in the sketch below.
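
The following sketch illustrates the general pattern, assuming a thread that
creates a group containing (initially) only itself and then prepares a SCHED
queue served exclusively by that group (the group name is illustrative):

[source,c]
-----
odp_thrmask_t mask;
odp_schedule_group_t my_group;
odp_queue_param_t qp;

/* create a group whose initial membership is just the calling thread */
odp_thrmask_zero(&mask);
odp_thrmask_set(&mask, odp_thread_id());
my_group = odp_schedule_group_create("my group", &mask);

/* a thread joins a group by passing a mask containing its own thread id */
odp_schedule_group_join(my_group, &mask);

/* queues created with this group are dispatched only to its members */
odp_queue_param_init(&qp);
qp.type = ODP_QUEUE_TYPE_SCHED;
qp.sched.group = my_group;
-----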

=== Scheduler Priority
The `prio` field of the `odp_queue_param_t` specifies the queue's scheduling
priority, which is how queues within eligible scheduler groups are selected
for dispatch. Queues have a default scheduling priority of NORMAL but can be
set to HIGHEST or LOWEST according to application needs.
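
For example, assuming the implementation provides the `odp_schedule_max_prio()`,
`odp_schedule_default_prio()`, and `odp_schedule_min_prio()` helpers for the
HIGHEST, NORMAL, and LOWEST levels, a queue could be given the highest priority
as in this sketch:

[source,c]
-----
odp_queue_param_t qp;

odp_queue_param_init(&qp);
qp.type = ODP_QUEUE_TYPE_SCHED;
qp.sched.prio = odp_schedule_max_prio();	/* HIGHEST */
/* qp.sched.prio = odp_schedule_default_prio();	   NORMAL (default) */
/* qp.sched.prio = odp_schedule_min_prio();	   LOWEST */
-----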

=== Scheduler Synchronization
In addition to its dispatching function, which provides automatic scalability
to ODP applications in many-core environments, the other main function of the
scheduler is to provide event synchronization services that greatly simplify
application programming in a parallel processing environment. A queue's
SYNC mode determines how the scheduler handles the synchronization processing
of multiple events originating from the same queue.

Three types of queue scheduler synchronization are supported: Parallel,
Atomic, and Ordered.

==== Parallel Queues
SCHED queues that specify a sync mode of ODP_SCHED_SYNC_PARALLEL are
unrestricted in how events are processed.

.Parallel Queue Scheduling
image::parallel_queue.svg[align="center"]

All events held on parallel queues are eligible to be scheduled simultaneously
and any required synchronization between them is the responsibility of the
application. Events originating from parallel queues thus have the highest
throughput rate, however they also potentially involve the most work on the
part of the application. In the Figure above, four threads are calling
*odp_schedule()* to obtain events to process. The scheduler has assigned
three events from the first queue to three threads in parallel. The fourth
thread is processing a single event from the third queue. The second queue
might either be empty, of lower priority, or not in a scheduler group matching
any of the threads being serviced by the scheduler.

==== Atomic Queues
Atomic queues simplify event synchronization because only a single thread may
process event(s) from  a given atomic queue at a time. Events scheduled from
atomic queues thus can be processed lock free because the locking is being
done implicitly by the scheduler. Note that the caller may receive one or
more events from the same atomic queue if *odp_schedule_multi()* is used. In
this case these multiple events all share the same atomic scheduling context.

.Atomic Queue Scheduling
image::atomic_queue.svg[align="center"]

In this example, no matter how many events may be held in an atomic queue,
only one calling thread can receive scheduled events from it at a time. Here
two threads process events from two different atomic queues. Note that there
is no synchronization between different atomic queues, only between events
originating from the same atomic queue. The queue context associated with the
atomic queue is held until the next call to the scheduler or until the
application explicitly releases it via a call to
*odp_schedule_release_atomic()*.

Note that while atomic queues simplify programming, the serial nature of
atomic queues may impair scaling.
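
A worker that completes its serialized work early can hint this to the
scheduler by releasing the atomic context before the next scheduler call, as in
this sketch (the processing functions are hypothetical):

[source,c]
-----
odp_event_t ev;
odp_queue_t src_q;

ev = odp_schedule(&src_q, ODP_SCHED_WAIT);

/* work that must be serialized per atomic queue */
update_per_queue_state(ev);

/* no further per-queue exclusion needed: release the atomic context */
odp_schedule_release_atomic();

/* the remaining work may now run in parallel with other events
 * scheduled from the same atomic queue */
finish_processing(ev);
-----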

==== Ordered Queues
Ordered queues provide the best of both worlds by combining the inherent
scalability of parallel queues with the easy synchronization of atomic
queues.

.Ordered Queue Scheduling
image::ordered_queue.svg[align="center"]

When scheduling events from an ordered queue, the scheduler dispatches multiple
events from the queue in parallel to different threads, however the scheduler
also ensures that the relative sequence of these events on output queues
is identical to their sequence from their originating ordered queue.

As with atomic queues, the ordering guarantees associated with ordered queues
refer to events originating from the same queue, not to those originating on
different queues. Thus in this figure three threads are processing events 5, 3,
and 4, respectively, from the first ordered queue. Regardless of how these
threads complete processing, these events will appear in their original
relative order on their output queue.

==== Order Preservation
Relative order is preserved independent of whether events are being sent to
different output queues.  For example, if some events are sent to output queue
A while others are sent to output queue B then the events on these output
queues will still be in the same relative order as they were on their
originating queue.  Similarly, if the processing consumes events so that no
output is issued for some of them (_e.g.,_ as part of IP fragment reassembly
processing) then other events will still be correctly ordered with respect to
these sequence gaps. Finally, if multiple events are enqueued for a given
order (_e.g.,_ as part of packet segmentation processing for MTU
considerations), then each of these events will occupy the originator's
sequence in the target output queue(s). In this case the relative order of these
events will be in the order that the thread issued *odp_queue_enq()* calls for
them.

The ordered context associated with the dispatch of an event from an ordered
queue lasts until the next scheduler call or until explicitly released by
the thread calling *odp_schedule_release_ordered()*. This call may be used
as a performance advisory that the thread no longer requires ordering
guarantees for the current context. As a result, any subsequent enqueues
within the current scheduler context will be treated as if the thread was
operating in a parallel queue context.

==== Ordered Locking
Another powerful feature of the scheduler's handling of ordered queues is
*ordered locks*. Each ordered queue has associated with it a number of ordered
locks as specified by the _lock_count_ parameter at queue create time.

Ordered locks provide an efficient means to perform in-order sequential
processing within an ordered context. For example, suppose events with relative
order 5, 6, and 7 are being processed in parallel by three different threads. An
ordered lock enables these threads to synchronize such that they can
perform some critical section in their originating queue order. The number of
ordered locks supported for each ordered queue is implementation dependent (and
queryable via the *odp_config_max_ordered_locks_per_queue()* API). If the
implementation supports multiple ordered locks then these may be used to
protect different ordered critical sections within a given ordered context.

==== Summary: Ordered Queues
To see how these considerations fit together, consider the following code:

.Processing with Ordered Queues
[source,c]
-----
void worker_thread()
{
	odp_init_local();
	...other initialization processing

	while (1) {
		ev = odp_schedule(&which_q, ODP_SCHED_WAIT);
		...process events in parallel
		odp_schedule_order_lock(0);
		...critical section processed in order
		odp_schedule_order_unlock(0);
		...continue processing in parallel
		odp_queue_enq(dest_q, ev);
	}
}
-----

This represents a simplified structure for a typical worker thread operating
on ordered queues. Multiple events are processed in parallel and the use of
ordered queues ensures that they will be placed on `dest_q` in the same order
as they originated.  While processing in parallel, the use of ordered locks
enables critical sections to be processed in order within the overall parallel
flow. When a thread arrives at the *odp_schedule_order_lock()* call, it waits
until the locking order for this lock for all prior events has been resolved
and then enters the critical section. The *odp_schedule_order_unlock()* call
releases the critical section and allows the next order to enter it.

=== Scheduler Capabilities and Configuration
As with other ODP components, the ODP scheduler offers a range of capabilities
and configuration options that are used by applications to control its
behavior.

The sequence of API calls used by applications that make use of the scheduler
is as follows:

.ODP API Scheduler Usage
[source,c]
-----
odp_schedule_capability()
odp_schedule_config_init()
odp_schedule_config()
odp_schedule()
-----
The `odp_schedule_capability()` API returns an `odp_schedule_capability_t`
struct that defines various limits and capabilities offered by this
implementation of the ODP scheduler:

.ODP Scheduler Capabilities
[source,c]
-----
/**
 * Scheduler capabilities
 */
typedef struct odp_schedule_capability_t {
	/** Maximum number of ordered locks per queue */
	uint32_t max_ordered_locks;

	/** Maximum number of scheduling groups */
	uint32_t max_groups;

	/** Number of scheduling priorities */
	uint32_t max_prios;

	/** Maximum number of scheduled (ODP_BLOCKING) queues of the default
	 * size. */
	uint32_t max_queues;

	/** Maximum number of events a scheduled (ODP_BLOCKING) queue can store
	 * simultaneously. The value of zero means that scheduled queues do not
	 * have a size limit, but a single queue can store all available
	 * events. */
	uint32_t max_queue_size;

	/** Maximum flow ID per queue
	 *
	 *  Valid flow ID range in flow aware mode of scheduling is from 0 to
	 *  this maximum value. So, maximum number of flows per queue is this
	 *  value plus one. A value of 0 indicates that flow aware mode is not
	 *  supported. */
	uint32_t max_flow_id;

	/** Lock-free (ODP_NONBLOCKING_LF) queues support.
	 * The specification is the same as for the blocking implementation. */
	odp_support_t lockfree_queues;

	/** Wait-free (ODP_NONBLOCKING_WF) queues support.
	 * The specification is the same as for the blocking implementation. */
	odp_support_t waitfree_queues;

} odp_schedule_capability_t;
-----
This struct indicates the various scheduling limits supported by this ODP
implementation. Of note is the `max_flow_id` capability, which indicates
whether this implementation is able to operate in _flow aware mode_.

==== Flow Aware Scheduling
A _flow_ is a sequence of events that share some application-specific meaning
and context. A good example of a flow might be a TCP connection. Various
events associated with that connection, such as packets containing
connection data, as well as associated timeout events used for transmission
control, are logically connected and meaningful to the application processing
that TCP connection.

Normally a single flow is associated with an ODP queue. That is, all events
on a given queue belong to the same flow. So the queue id is synonymous with
the flow id for those events. However, this approach is not without drawbacks.
Queues are relatively heavyweight objects that provide both synchronization and
user contexts. The number of queues supported by a given implementation
(`max_queues`) may be less than the number of flows an application needs to
be able to process.

To address these needs, ODP allows schedulers to operate in flow aware mode
in which flow id is maintained separately as part of each event. Two new
APIs:

* `odp_event_flow_id()`
* `odp_event_flow_id_set()`

are used to query and set a 32-bit flow id associated with individual events.
The assignment and interpretation of individual flow ids is under application
control.
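
A sketch of how a sender might assign flows and how a receiver might query
them (the hash helper, queue, and `max_flow_id` value are illustrative):

[source,c]
-----
odp_event_t ev = odp_packet_to_event(pkt);
uint32_t flow;

/* flow id assignment is under application control, e.g. a connection hash */
flow = connection_hash(pkt) % (max_flow_id + 1);

odp_event_flow_id_set(ev, flow);
odp_queue_enq(flow_aware_q, ev);

/* ...a receiving thread can later query the flow id of a scheduled event */
flow = odp_event_flow_id(ev);
-----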

When operating in flow aware mode, it is the combination of flow id and
queue id that is used by the scheduler in making scheduling decisions. So,
for example, an atomic queue would normally be able to dispatch events to only
a single thread at a time. When operating in flow aware mode, however, the
scheduler will provide this exclusion only when two events on the same atomic
queue have the same flow id. If they have different flow ids, then they can be
scheduled concurrently to different threads.

Note that when operating in this mode, any sharing of queue context must be
done with application-provided synchronization controls (similar to how
parallel queues behave).

==== Scheduler Configuration
After determining the scheduler's capabilities, but before starting to use
the scheduler to process events, applications must configure the scheduler
by calling `odp_schedule_config()`.

The argument to this call is the `odp_schedule_config_t` struct:

.ODP Scheduler Configuration
[source,c]
-----
/**
 * Schedule configuration
 */
typedef struct odp_schedule_config_t {
	/** Maximum number of scheduled queues to be supported.
	 *
	 * @see odp_schedule_capability_t
	 */
	uint32_t num_queues;

	/** Maximum number of events required to be stored simultaneously in
	 * scheduled queue. This number must not exceed 'max_queue_size'
	 * capability.  A value of 0 configures default queue size supported by
	 * the implementation.
	 */
	uint32_t queue_size;

	/** Maximum flow ID per queue
	 *
	 *  This value must not exceed 'max_flow_id' capability. Flow aware
	 *  mode of scheduling is enabled when the value is greater than 0.
	 *  The default value is 0.
	 *
	 *  Application can assign events to specific flows by calling
	 *  odp_event_flow_id_set() before enqueuing events into a scheduled
	 *  queue. When in flow aware mode, the event flow id value affects
	 *  scheduling of the event and synchronization is maintained per flow
	 *  within each queue.
	 *
	 *  Depending on implementation, there may be many more flows supported
	 *  than queues, as flows are lightweight entities.
	 *
	 *  @see odp_schedule_capability_t, odp_event_flow_id()
	 */
	uint32_t max_flow_id;

} odp_schedule_config_t;
-----
The `odp_schedule_config_init()` API should be used to initialize this
struct to its default values. The application then sets whatever
overrides it needs prior to calling `odp_schedule_config()` to activate
them. Note that `NULL` may be passed as the argument to `odp_schedule_config()`
if the application simply wants to use the implementation-defined default
configuration. In the default configuration, the scheduler does not operate in
flow aware mode.
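
A typical configuration sequence might therefore look like this sketch, which
enables flow aware mode only when the implementation supports it:

[source,c]
-----
odp_schedule_capability_t capa;
odp_schedule_config_t config;

if (odp_schedule_capability(&capa) < 0)
	/* handle error */;

odp_schedule_config_init(&config);

/* enable flow aware mode if available */
if (capa.max_flow_id)
	config.max_flow_id = capa.max_flow_id;

if (odp_schedule_config(&config) < 0)
	/* handle error */;
-----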

Once configured, `odp_schedule()` calls can be made to get events. It is
a programming error to attempt to use the scheduler before it has been
configured.

=== Queue Scheduling Summary

NOTE: Both ordered and parallel queues improve throughput over atomic queues
due to parallel event processing, but require that the application take
steps to ensure context data synchronization if needed. The same is true for
atomic queues when the scheduler is operating in flow aware mode.

== Packet Processing
ODP applications are designed to process packets, which are the basic unit of
data of interest in the data plane. To assist in processing packets, ODP
provides a set of APIs that enable applications to examine and manipulate
packet data and metadata. Packets are referenced by an abstract *odp_packet_t*
handle defined by each implementation.

Packet objects are normally created at ingress when they arrive at a source
*odp_pktio_t* and are received by an application either directly or (more
typically) via a scheduled receive queue. They MAY be implicitly freed when
they are transmitted to an output *odp_pktio_t* via an associated transmit
queue, or freed directly via the `odp_packet_free()` API.

Occasionally an application may originate a packet itself, either directly or
by deriving it from an existing packet, and APIs are provided to assist in
these cases as well. Application-created packets can be recycled back through
a _loopback interface_ to reparse and reclassify them, or the application can
do its own parsing as desired.

Various attributes associated with a packet, such as parse results, are
stored as metadata and APIs are provided to permit applications to examine
and/or modify this information.

=== Packet Structure and Concepts
A _packet_ consists of a sequence of octets conforming to an architected
format, such as Ethernet, that can be received and transmitted via the ODP
*pktio* abstraction. Packets have a _length_, which is the number of bytes in
the packet. Packet data in ODP is referenced via _offsets_ since these reflect
the logical contents and structure of a packet independent of how particular
ODP implementations store that data.

These concepts are shown in the following diagram:

.ODP Packet Structure
image::packet.svg[align="center"]

Packet data consists of zero or more _headers_ followed by zero or more bytes of
_payload_, followed by zero or more _trailers_. Shown here are various APIs
that permit applications to examine and navigate various parts of a packet and
to manipulate its structure.

To support packet manipulation, predefined _headroom_ and _tailroom_
areas are logically associated with a packet. Packets can be adjusted by
_pulling_ and _pushing_ these areas. Typical packet processing might consist
of stripping headers from a packet via `odp_packet_pull_head()` calls as part
of receive processing and then replacing them with new headers via
`odp_packet_push_head()` calls as the packet is being prepared for transmit.
Note that while headroom and tailroom represent reserved areas of memory, these
areas are not addressable or directly usable by ODP applications until they are
made part of the packet via associated push operations. Similarly, bytes
removed via pull operations become part of a packet's headroom or tailroom
and are again no longer accessible to the application.
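
A sketch of this pattern, assuming an Ethernet header is stripped on receive
and a new one is prefixed before transmit (`ethhdr_t` is an illustrative
application type):

[source,c]
-----
/* receive side: strip the Ethernet header, leaving the IP header first */
odp_packet_pull_head(pkt, sizeof(ethhdr_t));

/* ...application processing on the remaining packet data... */

/* transmit side: reclaim headroom for a new header and fill it in */
ethhdr_t *eth = odp_packet_push_head(pkt, sizeof(ethhdr_t));

if (eth == NULL)
	/* not enough headroom: handle error */;
-----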

=== Packet Segments and Addressing
ODP platforms use various methods and techniques to store and process packets
efficiently. These vary considerably from platform to platform, so to ensure
portability across them ODP adopts certain conventions for referencing
packets.

ODP APIs use a handle of type *odp_packet_t* to refer to packet objects.
Associated with packets are various bits of system metadata that describe the
packet. By referring to the metadata, ODP applications accelerate packet
processing by minimizing the need to examine packet data. This is because the
metadata is populated by parsing and classification functions that are coupled
to ingress processing that occur prior to a packet being presented to the
application via the ODP scheduler.

When an ODP application needs to examine the contents of a packet, it requests
addressability to it via an API call that makes the packet (or a contiguously
addressable _segment_ of it) available for coherent access by the application.
To ensure portability, ODP applications assume that the underlying
implementation stores packets in _segments_ of implementation-defined
and managed size. These represent the contiguously addressable portions of a
packet that the application may refer to via normal memory accesses. ODP
provides APIs that allow applications to operate on packet segments in an
efficient and portable manner as needed. By combining these with the metadata
provided by packets, ODP applications can operate in a fully
platform-independent manner while still achieving optimal performance across
the range of platforms that support ODP.

The use of segments for packet addressing and their relationship to metadata
is shown in this diagram:

.ODP Packet Segmentation
image::segment.svg[align="center"]

The packet metadata is set during parsing and identifies the starting offsets
of the various headers in the packet. The packet itself is physically stored
as a sequence of segments that are managed by the ODP implementation.
Segment 0 is the first segment of the packet and is where the packet's headroom
and headers typically reside. Depending on the length of the packet,
additional segments may be part of the packet and contain the remaining packet
payload and tailroom. The application need not concern itself with segments
except that when the application requires addressability to a packet it
understands that addressability is provided on a per-segment basis. So, for
example, if the application makes a call like `odp_packet_l4_ptr()` to obtain
addressability to the packet's Layer 4 header, the returned length from that
call is the number of bytes from the start of the Layer 4 header that are
contiguously addressable to the application from the returned pointer address.
If this length is less than the remaining packet length, that is because the
following byte occupies a different segment and may be stored elsewhere. To
obtain access to those bytes, the application simply
requests addressability to that offset and it will be able to address the
packet bytes that occupy the next segment, etc. Note that the returned
length for any packet addressability call is always the lesser of the remaining
packet length or size of its containing segment.  So a mapping for segment 2
in the above figure, for example, would return a length that extends only to
the end of the packet since the remaining bytes are part of the tailroom
reserved for the packet and are not usable by the application until made
available to it by an appropriate API call.
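
For example, a sketch that walks a packet's data segment by segment using
`odp_packet_offset()`, which returns the number of contiguously addressable
bytes at the requested offset (`process_bytes()` is a hypothetical application
function):

[source,c]
-----
uint32_t offset = 0;
uint32_t pkt_len = odp_packet_len(pkt);

while (offset < pkt_len) {
	uint32_t seg_len;
	uint8_t *data = odp_packet_offset(pkt, offset, &seg_len, NULL);

	process_bytes(data, seg_len);	/* at most one segment at a time */
	offset += seg_len;
}
-----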

While the push/pull APIs permit applications to perform efficient manipulation
of packets within the current segment structure, ODP also provides APIs that
permit segments to be added or removed. The `odp_packet_extend_head()` and
`odp_packet_trunc_head()` APIs permit segments to be added or removed from
the beginning of a packet, while `odp_packet_extend_tail()` and
`odp_packet_trunc_tail()` permit segments to be added or removed from the end
of a packet. Extending a packet adds one or more segments to permit packets to
grow up to implementation-defined limits. Truncating a packet removes one or
more segments to shrink the size of a packet beyond its initial or final
segment.

=== Metadata Processing
As noted, packet metadata is normally set by the parser as part of
classification that occurs during packet receive processing. It is important
to note that this metadata may be changed by the application to reflect
changes in the packet contents and/or structure as part of its processing of
the packet. While changing this metadata may affect some ODP APIs, changing
metadata is designed to _document_ application changes to the packet but
does not in itself _cause_ those changes to be made. For example, if an
application changes the Layer 3 offset by using the `odp_packet_l3_offset_set()`
API, subsequent calls to `odp_packet_l3_ptr()` will return an address
starting from that changed offset. However, changing an attribute like
`odp_packet_has_udp_set()` will not, by itself, turn a non-UDP packet into
a valid UDP packet. Applications are expected to exercise appropriate care
when changing packet metadata to ensure that the resulting metadata changes
reflect the actual changed packet structure that the application has made.

=== Packet Manipulation
ODP Packet manipulation APIs can be divided into two categories: Those
that do not change a packet's segment structure, and those that potentially do
change this structure. We've already seen one example of this. The push/pull
APIs permit manipulation of packet headroom/tailroom that does not result in
changes to packet segmentation, while the corresponding extend/trunc APIs
provide the same functionality but with the potential that segments may be
added to or removed from the packet as part of the operation.

The reason for having two different types of APIs that perform similar
functions is that it is expected that on most implementations operations that
do not change packet segment structure will be more efficient than those that
do. To account for this, APIs that potentially involve a change in packet
segmentation always take an output *odp_packet_t* parameter or return
value. Applications are expected to use this new handle for the resulting
packet instead of the old (input) handle as the implementation may have
returned a new handle that now represents the transformed packet.

To enable applications that manipulate packets this way to operate most
efficiently the return codes from these APIs follow a standard convention. As
usual, return codes less than zero indicate error and result in no change to
the input packet. A return code of zero indicates success, but also indicates
that any cached addressability to the packet is still valid. Return codes
greater than zero also indicate success but with a potential change to packet
addressability. For example, if an application had previously obtained
addressability to a packet's Layer 3 header via the `odp_packet_l3_ptr()` API,
a return code of zero would mean that the application may continue to use that
pointer for access to the L3 header, while a return code greater than zero
would mean that the application should reissue that call to re-obtain
addressability as the packet segmentation may have changed and hence the old
pointer may no longer be valid.
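
A sketch of head extension that honors this convention (`tunnel_hdr_len` is
illustrative):

[source,c]
-----
void *l3 = odp_packet_l3_ptr(pkt, NULL);	/* cached addressability */
int rc = odp_packet_extend_head(&pkt, tunnel_hdr_len, NULL, NULL);

if (rc < 0) {
	/* error: the packet is unchanged */
} else if (rc > 0) {
	/* success, but segmentation may have changed: refresh pointers */
	l3 = odp_packet_l3_ptr(pkt, NULL);
}
/* rc == 0: success, cached addressability such as 'l3' remains valid */
-----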

==== Packet Copying
One of the simplest manipulations that can be done is to make a copy of all or
part of a packet. The `odp_packet_copy()` and `odp_packet_copy_part()` APIs
are used to return a new packet that contains either the entirety or a
selected part of an existing packet. Note that these operations also specify
the packet pool from which the new packet is to be drawn.
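
For example (a sketch; `pkt_pool` is whatever packet pool the application wants
the copies drawn from):

[source,c]
-----
/* full copy of the packet */
odp_packet_t dup = odp_packet_copy(pkt, pkt_pool);

/* copy of the first 64 bytes only, e.g. to keep a header snapshot */
odp_packet_t hdr_copy = odp_packet_copy_part(pkt, 0, 64, pkt_pool);

if (dup == ODP_PACKET_INVALID || hdr_copy == ODP_PACKET_INVALID)
	/* handle allocation failure */;
-----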

==== Packet Data Copying and Moving
ODP provides several APIs to enable portions of a packet to be copied
either to or from a memory area, another packet, or within a single packet, as
illustrated below:

.ODP Packet Data Copying and Moving Operations
image::packet-copyops.svg[align="center"]

These APIs provide bounds checking when the source or destination is an ODP
packet. This means that data must be in the offset range
`0`..`odp_packet_len()-1`. For operations involving memory areas,
the caller takes responsibility for ensuring that memory areas
referenced by `odp_packet_copy_to/from_mem()` are valid.

When manipulating data within a single packet, two similar APIs are provided:
`odp_packet_copy_data()` and `odp_packet_move_data()`. Of these, the move
operation is more general and may be used even when the source and destination
data areas overlap. The copy operation must only be used if the caller knows
that the two areas do not overlap, and may result in more efficient operation.
When dealing with overlapping memory areas, `odp_packet_move_data()` operates
as if the source area was first copied to a non-overlapping separate memory
area and then copied from that area to the destination area.
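
A brief sketch of these calls (the offsets and lengths are illustrative and
assume the packet is long enough to contain them):

[source,c]
-----
uint8_t scratch[64];

/* packet <-> memory area */
odp_packet_copy_to_mem(pkt, 0, sizeof(scratch), scratch);
odp_packet_copy_from_mem(pkt, 0, sizeof(scratch), scratch);

/* within one packet: the two areas must not overlap */
odp_packet_copy_data(pkt, 128, 0, 64);

/* within one packet: overlapping areas are allowed */
odp_packet_move_data(pkt, 32, 0, 64);
-----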

==== Adding and Removing Packet Data
The various copy/move operations discussed so far only affect the data
contained in a packet and do not change its length. Data can also be added to
or removed from a packet via the `odp_packet_add_data()` and
`odp_packet_rem_data()` APIs as shown below:

.Adding Data to a Packet
image::packet-adddata.svg[align="center"]

Adding data simply creates the requested amount of "space" within the packet
at the specified offset. The length of the packet is increased by the number
of added bytes. The contents of this space upon successful completion
of the operation are unspecified. It is the application's responsibility to then
fill this space with meaningful data, _e.g.,_ via a subsequent
`odp_packet_copy_from_mem()` or `odp_packet_copy_from_pkt()` call.
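
Put together, inserting and then filling new space might look like this sketch
(the offset, length, and source buffer are illustrative, and the handle passed
by reference must be used for all subsequent operations):

[source,c]
-----
int rc = odp_packet_add_data(&pkt, insert_offset, insert_len);

if (rc >= 0)
	/* the new bytes are unspecified: fill them explicitly */
	odp_packet_copy_from_mem(pkt, insert_offset, insert_len, new_data);
-----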

.Removing Data from a Packet
image::packet-remdata.svg[align="center"]

Removing data from a packet has the opposite effect. The specified number of
bytes at the designated offset are removed from the packet and the resulting
"hole" is collapsed so that the remainder of the packet immediately follows
the removal point. The resulting packet length is decreased by the number of
removed bytes.

Note that adding or removing data from a packet may affect packet segmentation,
so the application must use the returned packet handle and abide by the
return code results of the operation.  Whether or not segmentation is
changed by these operations, the amount of available packet headroom and/or
tailroom may also be changed by these operations, so again applications should
not attempt to cache the results of prior `odp_packet_headroom()` or
`odp_packet_tailroom()` calls across these APIs.

==== Packet Splitting and Concatenation
Another type of manipulation is to split a packet into two packets as shown
below:

.Splitting a Packet
image::packet-split.svg[align="center"]

The `odp_packet_split()` API indicates the split point by specifying the
resulting desired length of the original packet.  Upon return, the original
packet ends at the specified split point and the new "tail" is returned as
its own separate packet. Note that this new packet will always be from the same
packet pool as the original packet.

The opposite operation is performed by the `odp_packet_concat()` API. This API
takes a destination and source packet as arguments and the result is that
the source packet is concatenated to the destination packet and ceases to
have any separate identity. Note that it is legal to concatenate a packet to
itself, in which case the result is a packet with double the length of the
original packet.
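
A sketch of splitting and then rejoining a packet (`hdr_len` is illustrative):

[source,c]
-----
odp_packet_t tail;

/* keep the first 'hdr_len' bytes in 'pkt'; the rest becomes 'tail' */
if (odp_packet_split(&pkt, hdr_len, &tail) < 0)
	/* handle error */;

/* ...process the two parts independently... */

/* rejoin them: 'tail' ceases to exist as a separate packet */
if (odp_packet_concat(&pkt, tail) < 0)
	/* handle error */;
-----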

==== Packet Realignment
As previously discussed, packets are divided into implementation-defined
segments that normally don't concern applications since contiguous
addressability extents are returned as part of APIs such as
`odp_packet_offset()`. However, if the application has performed a lot of
manipulation or processing on a packet, this can sometimes result in segment
boundaries appearing at inconvenient locations, such as in the middle of
headers or individual fields, or for headers to become misaligned with respect
to their addresses in memory. This can make subsequent processing of the
packet inefficient.

To address these issues, ODP provides a means of realigning a packet to allow
for more efficient processing as shown below:

.Packet Realignment
image::packet-align.svg[align="center"]

Input to `odp_packet_align()` specifies the number of contiguous bytes that
are needed at a given packet offset as well as the memory alignment required
for that offset. A value of zero may be specified for either as a "don't care"
value. If these criteria are already satisfied then the call is an effective
no-op and will result in a return code of zero to tell the caller that all is
well. Otherwise, the packet will be logically "shifted" within its containing
segment(s) to achieve the requested addressability and alignment constraints,
if possible, and a return code greater than zero will result.

The requested operation may fail for a number of reasons. For example, if the
caller is requesting contiguous addressability to a portion of the packet
larger than the underlying segment size. The call may also fail if the
requested alignment is too high. Alignment limits will vary among different ODP
implementations, however ODP requires that all implementations support
requested alignments of at least 32 bytes.
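
For example, a sketch that requests eight contiguous bytes at the packet's
Layer 4 offset, aligned on a four byte boundary:

[source,c]
-----
int rc = odp_packet_align(&pkt, odp_packet_l4_offset(pkt), 8, 4);

if (rc < 0)
	/* the request cannot be satisfied, e.g. it exceeds the segment size */;
else if (rc > 0)
	/* the packet was shifted: re-obtain any cached pointers */;
-----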

=== Packet References
To support efficient multicast, retransmit, and related processing, ODP
supports two additional types of packet manipulation: static and dynamic
_references_. A reference is a lightweight mechanism for
creating aliases to packets as well as to create packets that share data bytes
with other packets to avoid unnecessary data copying.

==== Static References
The simplest type of reference is the _static reference_. A static reference is
created by the call:

[source,c]
-----
ref_pkt = odp_packet_ref_static(pkt);
-----

If the reference fails, `ODP_PACKET_INVALID` is returned and `pkt`
remains unchanged.

The effect of this call is shown below:

.Static Packet Reference
image::refstatic.svg[align="center"]

A static reference provides a simple and efficient means of creating an alias
for a packet handle that prevents the packet itself from being freed until all
references to it have been released via `odp_packet_free()` calls. This is
useful, for example, to support retransmission processing, since as part of
packet TX processing, `odp_pktout_send()` or `odp_tm_enq()` will free
the packet after it has been transmitted.

`odp_packet_ref_static()` might be used in a transmit routine wrapper
function like:

[source,c]
-----
int xmit_pkt(odp_pktout_queue_t queue, odp_packet_t pkt)
{
	odp_packet_t ref = odp_packet_ref_static(pkt);
	return ref == ODP_PACKET_INVALID ? -1 : odp_pktout_send(queue, ref, 1);
}
-----

This transmits a reference to `pkt` so that `pkt` is retained by the caller,
which means that the caller is free to retransmit it if needed at a later
time. When a higher level protocol (_e.g.,_ receipt of a TCP ACK packet)
confirms that the transmission was successful, `pkt` can then be discarded via
an `odp_packet_free()` call.

The key characteristic of a static reference is that because there are
multiple independent handles that refer to the same packet, the caller should
treat the packet as read only following the creation of a static reference
until all other references to it are freed. This is because all static
references are simply aliases of the same packet, so if multiple threads were
independently manipulating the packet this would lead to unpredictable race
conditions.

To assist in determining whether there are other references to a packet, ODP
provides the API:

[source,c]
-----
int odp_packet_has_ref(odp_packet_t pkt);
-----

that indicates whether other packets exist that share bytes with this
packet. If this routine returns 0 then the caller can be assured that it is
safe to modify it as this handle is the only reference to the packet.

==== Dynamic References
While static references are convenient and efficient, they are limited by the
need to be treated as read only. For example, consider an application that
needs to _multicast_ a packet. Here the same packet needs to be sent to two or
more different destinations. While the packet payload may be the same, each
sent copy of the packet requires its own unique header to specify the
destination that is to receive the packet.

To address this need, ODP provides _dynamic references_. These are created
by the call:

[source,c]
-----
ref_pkt = odp_packet_ref(pkt, offset);
-----

The `offset` parameter specifies the byte offset into `pkt` at which the
reference is to begin. This must be in the range
0..`odp_packet_len(pkt)`-1. As before, if the reference is unable to be
created `ODP_PACKET_INVALID` is returned and `pkt` is unchanged, otherwise the
result is as shown below:

.Dynamic Packet Reference
image::ref.svg[align="center"]

Following a successful reference creation, the bytes of `pkt` beginning at
offset `offset` are shared with the created reference. These bytes should be
treated as read only since multiple references point to them. Each reference,
however still retains its own individual headroom and metadata that is not
shared with any other reference. This allows unique headers to be created by
calling `odp_packet_push_head()` or `odp_packet_extend_head()` on either
handle. This allows multiple references to the same packet to prefix unique
headers onto common shared data so that they can be properly multicast
using code such as:

[source,c]
-----
int pkt_fanout(odp_packet_t payload, odp_queue_t fanout_queue[], int num_queues)
{
	int i;

	for (i = 0; i < num_queues; i++)
		odp_queue_enq(fanout_queue[i], odp_packet_ref(payload, 0));
}
-----

Receiver worker threads can then operate on each reference to the packet in
parallel to prefix a unique transmit header onto it and send it out.

==== Dynamic References with Headers
The dynamic references discussed so far have one drawback in that the headers
needed to make each reference unique must be constructed individually after
the reference is created. To address this problem, ODP allows these headers
to be created in advance and then simply prefixed to a base packet as part
of reference creation:

[source,c]
-----
ref_pkt = odp_packet_ref_pkt(pkt, offset, hdr_pkt);
-----

Here rather than creating a reference with a null header, a _header packet_
is supplied that is prefixed onto the reference. The result looks like this:

.Packet Reference using a Header Packet
image::refpktsingle.svg[align="center"]

So now multicasting can be more efficient using code such as:

[source,c]
-----
int pkt_fanout_hdr(odp_packet_t payload, odp_queue_q fanout_queue[],
		   odp_packet_t hdr[], int num_queues)
{
	int i;

	for (i = 0; i < num_queues; i++)
		odp_queue_enq(fanout_queue[i],
			      odp_packet_ref_pkt(payload, 0, hdr[i]));
}
-----

Now each individual reference has its own header already prefixed to
it ready for transmission.

Note that when multiple references like this are made they can each have
their own offset. So if the following code is executed:

[source,c]
-----
ref_pkt1 = odp_packet_ref_pkt(pkt, offset1, hdr_pkt1);
ref_pkt2 = odp_packet_ref_pkt(pkt, offset2, hdr_pkt2);
-----

the result will look like:

image::refpkt1.svg[align="center"]
image::refpktmulti.svg[align="center"]
.Multiple Packet References with Different Offsets
image::refpkt2.svg[align="center"]

Here two separate header packets are prefixed onto the same shared packet, each
at their own specified offset, which may or may not be the same. The result is
three packets visible to the application:

* The original `pkt`, which can still be accessed and manipulated directly.
* The first reference, which consists of `hdr_pkt1` followed by bytes
contained in `pkt` starting at `offset1`.
* The second reference, which consists of `hdr_pkt2` followed by bytes
contained in `pkt` starting at `offset2`.

Only a single copy of the bytes in `pkt` that are common to the
references exist.

===== Data Sharing with References
Because a `pkt` is a shared object when referenced, applications must observe
certain disciplines when working with them. For best portability and
reliability, the shared data contained in any packet referred to by references
should be treated as read only once it has been successfully referenced until
it is known that all references to it have been freed.

To assist applications in working with references, ODP provides the API:

[source,c]
-----
int odp_packet_has_ref(odp_packet_t pkt);
-----
The `odp_packet_has_ref()` API says whether any other packets
exist that share any bytes with this packet.

===== Compound References
Note that architecturally ODP does not limit referencing and so it is possible
that a reference may be used as a basis for creating another reference. The
result is a _compound reference_ that should still behave as any other
reference.

As noted earlier, the intent behind references is that they are lightweight
objects that can be implemented without requiring data copies. The existence
of compound references may complicate this goal for some implementations. As a
result, implementations are always free to perform partial or full copies of
packets as part of any reference creation call.

Note also that a packet may not reference itself, nor may circular reference
relationships be formed, _e.g.,_ packet A is used as a header for a reference
to packet B and B is used as a header for a reference to packet A.  Results
are undefined if such circular references are attempted.

=== Packet Parsing, Checksum Processing, and Overrides
Packet parsing is normally triggered automatically as part of packet RX
processing. However, the application can trigger parsing explicitly via the
API:
[source,c]
-----
int odp_packet_parse(odp_packet_t pkt, uint32_t offset,
		     const odp_packet_parse_param_t *param);
-----
This is typically done following packet decapsulation or other preprocessing
that would prevent RX parsing from "seeing" the relevant portion of the
packet. The `odp_packet_parse_param_t` struct that is passed to control the
depth of the desired parse, as well as whether checksum validation should be
performed as part of the parse, and if so which checksums require this
processing.
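
For example, re-parsing a decapsulated packet from its inner Ethernet header
through Layer 4 might look like the following sketch (the field names are
taken from the `odp_packet_parse_param_t` definition as an assumption, and
`inner_offset` is illustrative):

[source,c]
-----
odp_packet_parse_param_t param = { 0 };

param.proto = ODP_PROTO_ETH;		/* parse starting from an Ethernet header */
param.last_layer = ODP_PROTO_LAYER_L4;	/* stop after the Layer 4 header */
param.chksums.chksum.ipv4 = 1;		/* validate the IPv4 header checksum */

if (odp_packet_parse(pkt, inner_offset, &param))
	/* handle parse failure */;
-----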

Packets containing Layer 3 (IPv4) and Layer 4 (TCP, UDP, SCTP) checksums
can have these validated (on RX) and generated (on TX) automatically.
This is normally controlled by the settings on the PktIOs that
receive/transmit them, however they can also be controlled on an
individual packet basis.

Packets have associated `odp_packet_chksum_status_t` metadata that indicates
the state of any checksums contained in that packet. These can be queried via
the APIs `odp_packet_l3_chksum_status()` and `odp_packet_l4_chksum_status()`,
respectively. Checksums can either be known good, known bad, or unknown, where
unknown means that checksum validation processing has not occurred or the
attempt to validate the checksum failed.

Similarly, the `odp_packet_l3_chksum_insert()` and
`odp_packet_l4_chksum_insert()` APIs may be used to override default checksum
processing for individual packets prior to transmission. If no explicit
checksum processing is specified for a packet, then any checksum generation
is controlled by the PktIO configuration of the interface used to transmit it.
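
For example, the following sketch forces checksum insertion for one particular
packet regardless of the interface defaults:

[source,c]
-----
/* request L3 (IPv4) and L4 checksum generation for this packet only */
odp_packet_l3_chksum_insert(pkt, 1);
odp_packet_l4_chksum_insert(pkt, 1);

/* or, conversely, suppress checksum generation for this packet */
odp_packet_l3_chksum_insert(pkt, 0);
odp_packet_l4_chksum_insert(pkt, 0);
-----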

== PktIO Processing
Before packets can be manipulated they typically need to be _received_ and
after they are manipulated they need to be _transmitted_. The ODP abstraction
that captures these operations is the *Packet I/O (PktIO)*.
PktIOs are represented by handles of type *odp_pktio_t* and
represent a logical I/O interface that is mapped in an implementation-defined
manner to an underlying integrated I/O adapter or NIC.

PktIO objects are manipulated through various state transitions via
`odp_pktio_xxx()` API calls as shown below:

.ODP PktIO Finite State Machine
image::pktio_fsm.svg[align="center"]

PktIOs begin in the *Unallocated* state. From here a call to `odp_pktio_open()`
is used to create an *odp_pktio_t* handle that is used in all subsequent calls
to manipulate the object. This call puts the PktIO into the *Unconfigured*
state. To become operational, a PktIO must first be
*configured* for Input, Output, or both Input and Output via the
`odp_pktin_queue_config()` and/or `odp_pktout_queue_config()` APIs, and then
*started* via `odp_pktio_start()` to make it *Ready*.

Following the completion of I/O processing, the `odp_pktio_stop()` API returns
the PktIO to the *Configured* state. From here it may be *Reconfigured* via
additional `odp_pktin_queue_config()` and/or `odp_pktout_queue_config()` calls,
or *Closed* via the `odp_pktio_close()` API to return the PktIO to the
*Unallocated* state.

=== PktIO Allocation
PktIO objects begin life by being _opened_ via the call:
[source,c]
-----
/**
 * Open a packet IO interface
 *
 * An ODP program can open a single packet IO interface per device, attempts
 * to open an already open device will fail, returning ODP_PKTIO_INVALID with
 * errno set. Use odp_pktio_lookup() to obtain a handle to an already open
 * device. Packet IO parameters provide interface level configuration options.
 *
 * Use odp_pktio_param_init() to initialize packet IO parameters into their
 * default values. Default values are also used when 'param' pointer is NULL.
 *
 * Packet input queue configuration must be setup with
 * odp_pktin_queue_config() before odp_pktio_start() is called. When packet
 * input mode is ODP_PKTIN_MODE_DISABLED, odp_pktin_queue_config() call is
 * optional and will ignore all parameters.
 *
 * Packet output queue configuration must be setup with
 * odp_pktout_queue_config() before odp_pktio_start() is called. When packet
 * output mode is ODP_PKTOUT_MODE_DISABLED or ODP_PKTOUT_MODE_TM,
 * odp_pktout_queue_config() call is optional and will ignore all parameters.
 *
 * Packet receive and transmit on the interface is enabled with a call to
 * odp_pktio_start(). If not specified otherwise, any interface level
 * configuration must not be changed when the interface is active (between start
 * and stop calls).
 *
 * In summary, a typical pktio interface setup sequence is ...
 *   * odp_pktio_open()
 *   * odp_pktin_queue_config()
 *   * odp_pktout_queue_config()
 *   * odp_pktio_start()
 *
 * ... and tear down sequence is:
 *   * odp_pktio_stop()
 *   * odp_pktio_close()
 *
 * @param name   Packet IO device name
 * @param pool   Default pool from which to allocate storage for packets
 *               received over this interface, must be of type ODP_POOL_PACKET
 * @param param  Packet IO parameters. Uses defaults when NULL.
 *
 * @return Packet IO handle
 * @retval ODP_PKTIO_INVALID on failure
 *
 * @note The device name "loop" is a reserved name for a loopback device used
 *	 for testing purposes.
 *
 * @note Packets arriving via this interface assigned to a CoS by the
 *	 classifier are received into the pool associated with that CoS. This
 *	 will occur either because this pktio is assigned a default CoS via
 *	 the odp_pktio_default_cos_set() routine, or because a matching PMR
 *	 assigned the packet to a specific CoS. The default pool specified
 *	 here is applicable only for those packets that are not assigned to a
 *	 more specific CoS.
 *
 * @see odp_pktio_start(), odp_pktio_stop(), odp_pktio_close()
 */
odp_pktio_t odp_pktio_open(const char *name, odp_pool_t pool,
			   const odp_pktio_param_t *param);
-----
`odp_pktio_open()` takes three arguments: a *name*, which is an
implementation-defined string that identifies the logical interface to be
opened, a *pool* that identifies the ODP pool that storage for received
packets should be allocated from, and a *param* structure that specifies
I/O options to be associated with this PktIO instance.
[source,c]
-----
/**
 * Packet IO parameters
 *
 * Packet IO interface level parameters. Use odp_pktio_param_init() to
 * initialize the structure with default values.
 */
typedef struct odp_pktio_param_t {
	/** Packet input mode
	  *
	  * The default value is ODP_PKTIN_MODE_DIRECT. */
	odp_pktin_mode_t in_mode;

	/** Packet output mode
	  *
	  * The default value is ODP_PKTOUT_MODE_DIRECT. */
	odp_pktout_mode_t out_mode;

} odp_pktio_param_t;
-----
ODP defines *"loop"* as a reserved name to indicate that this PktIO represents
a loopback interface. Loopback interfaces are useful as a means of recycling
packets back for reclassification after decryption or decapsulation, as well as
for diagnostic or testing purposes. For example, when receiving IPsec traffic,
the classifier is able to recognize that the traffic is IPsec, however until
the traffic is decrypted it is unable to say what that traffic contains.
So following decryption, sending the decrypted packet back to a loopback
interface allows the classifier to take a "second look" at the packet and
properly classify the decrypted payload. Similar considerations apply to
tunneled packets that must first be decapsulated to reveal the true payload.

The *pool* specifies the default pool to
use for packet allocation if not overridden by the classifier due to a
specific or default Class-of-Service (CoS) match on the packet. The *param*
struct, in turn, specifies the input and output *modes* of the PktIO.
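
For illustration, the following sketch opens an interface in the default
DIRECT modes. The interface name `"eth0"` and the pool handle `pkt_pool` are
placeholders that a real application would replace with its own values.
[source,c]
-----
odp_pktio_param_t pktio_param;
odp_pktio_t pktio;

/* Initialize parameters to their defaults, then select the I/O modes */
odp_pktio_param_init(&pktio_param);
pktio_param.in_mode  = ODP_PKTIN_MODE_DIRECT;
pktio_param.out_mode = ODP_PKTOUT_MODE_DIRECT;

/* "eth0" and pkt_pool are illustrative placeholders */
pktio = odp_pktio_open("eth0", pkt_pool, &pktio_param);
if (pktio == ODP_PKTIO_INVALID)
	/* ...handle open failure */;
-----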

=== PktIO Capabilities and PktIn/PktOut Configuration
Associated with each PktIO is a set of _capabilities_ that provide information
such as the maximum number of input/output queues it supports, its configuration
options, and the operations it supports. These are aggregated into
the struct:
[source,c]
-----
/**
 * Packet IO capabilities
 */
typedef struct odp_pktio_capability_t {
	/** Maximum number of input queues */
	unsigned max_input_queues;

	/** Maximum number of output queues */
	unsigned max_output_queues;

	/** Supported pktio configuration options */
	odp_pktio_config_t config;

	/** Supported set operations
	 *
	 * A bit set to one indicates a supported operation. All other bits are
	 * set to zero. */
	odp_pktio_set_op_t set_op;

	/** @deprecated Use enable_loop inside odp_pktin_config_t */
	odp_bool_t ODP_DEPRECATE(loop_supported);
} odp_pktio_capability_t;
-----
This struct is filled in by the `odp_pktio_capability()` API, which reports
the limits and default values for these capabilities. Supported options can
in turn be set via the `odp_pktio_config()` API, which takes as input the
struct:
[source,c]
-----
/**
 * Packet IO configuration options
 *
 * Packet IO interface level configuration options. Use odp_pktio_capability()
 * to see which options are supported by the implementation.
 * Use odp_pktio_config_init() to initialize the structure with default values.
 */
typedef struct odp_pktio_config_t {
	/** Packet input configuration options bit field
	 *
	 *  Default value for all bits is zero. */
	odp_pktin_config_opt_t pktin;

	/** Packet output configuration options bit field
	 *
	 *  Default value for all bits is zero. */
	odp_pktout_config_opt_t pktout;

	/** Packet input parser configuration */
	odp_pktio_parser_config_t parser;

	/** Interface loopback mode
	 *
	 * In this mode the packets sent out through the interface is
	 * looped back to input of the same interface. Supporting loopback mode
	 * is an optional feature per interface and should be queried in the
	 * interface capability before enabling the same. */
	odp_bool_t enable_loop;

	/** Inbound IPSEC inlined with packet input
	 *
	 *  Enable/disable inline inbound IPSEC operation. When enabled packet
	 *  input directs all IPSEC packets automatically to IPSEC inbound
	 *  processing. IPSEC configuration is done through the IPSEC API.
	 *  Packets that are not (recognized as) IPSEC are processed
	 *  according to the packet input configuration.
	 *
	 *  0: Disable inbound IPSEC inline operation (default)
	 *  1: Enable inbound IPSEC inline operation
	 *
	 *  @see odp_ipsec_config(), odp_ipsec_sa_create()
	 */
	odp_bool_t inbound_ipsec;

	/** Outbound IPSEC inlined with packet output
	 *
	 *  Enable/disable inline outbound IPSEC operation. When enabled IPSEC
	 *  outbound processing can send outgoing IPSEC packets directly
	 *  to the pktio interface for output. IPSEC configuration is done
	 *  through the IPSEC API.
	 *
	 *  Outbound IPSEC inline operation cannot be combined with traffic
	 *  manager (ODP_PKTOUT_MODE_TM).
	 *
	 *  0: Disable outbound IPSEC inline operation (default)
	 *  1: Enable outbound IPSEC inline operation
	 *
	 *  @see odp_ipsec_config(), odp_ipsec_sa_create()
	 */
	odp_bool_t outbound_ipsec;

} odp_pktio_config_t;
-----
The IPsec related configurations will be discussed later in the IPsec chapter,
but for now we'll focus on the PktIn/PktOut configuration and the
parser configuration.
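
As a brief sketch, an application would normally query these capabilities
before configuring the interface. Here `pktio` is assumed to be a handle
returned by `odp_pktio_open()` and `num_workers` is an illustrative
application variable.
[source,c]
-----
odp_pktio_capability_t capa;

if (odp_pktio_capability(pktio, &capa) != 0)
	/* ...handle error */;

/* Respect the interface limits when configuring input queues later on */
if (num_workers > capa.max_input_queues)
	num_workers = capa.max_input_queues;
-----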

==== PktIn Configuration
For PktIOs that will receive packets, the `odp_pktin_config_opt_t` struct
controls RX processing to be performed on these packets as they are received:
[source,c]
-----
/**
 * Packet input configuration options bit field
 *
 * Packet input configuration options listed in a bit field structure. Packet
 * input timestamping may be enabled for all packets or at least for those that
 * belong to time synchronization protocol (PTP).
 *
 * Packet input checksum checking may be enabled or disabled. When it is
 * enabled, implementation will attempt to verify checksum correctness on
 * incoming packets and depending on drop configuration either deliver erroneous
 * packets with appropriate flags set (e.g. odp_packet_has_l3_error(),
 * odp_packet_l3_chksum_status()) or drop those. When packet dropping is
 * enabled, application will never receive a packet with the specified error
 * and may avoid to check the error flag.
 *
 * If checksum checking is enabled, IPv4 header checksum checking is always
 * done for packets that do not have IP options and L4 checksum checking
 * is done for unfragmented packets that do not have IPv4 options or IPv6
 * extension headers. In other cases checksum checking may or may not
 * be done. For example, L4 checksum of fragmented packets is typically
 * not checked.
 *
 * IPv4 checksum checking may be enabled only when parsing level is
 * ODP_PROTO_LAYER_L3 or higher. Similarly, L4 level checksum checking
 * may be enabled only with parsing level ODP_PROTO_LAYER_L4 or higher.
 *
 * Whether checksum checking was done and whether a checksum was correct
 * can be queried for each received packet with odp_packet_l3_chksum_status()
 * and odp_packet_l4_chksum_status().
 */
typedef union odp_pktin_config_opt_t {
	/** Option flags */
	struct {
		/** Timestamp all packets on packet input */
		uint64_t ts_all        : 1;

		/** Timestamp (at least) IEEE1588 / PTP packets
		  * on packet input */
		uint64_t ts_ptp        : 1;

		/** Check IPv4 header checksum on packet input */
		uint64_t ipv4_chksum   : 1;

		/** Check UDP checksum on packet input */
		uint64_t udp_chksum    : 1;

		/** Check TCP checksum on packet input */
		uint64_t tcp_chksum    : 1;

		/** Check SCTP checksum on packet input */
		uint64_t sctp_chksum   : 1;

		/** Drop packets with an IPv4 error on packet input */
		uint64_t drop_ipv4_err : 1;

		/** Drop packets with an IPv6 error on packet input */
		uint64_t drop_ipv6_err : 1;

		/** Drop packets with a UDP error on packet input */
		uint64_t drop_udp_err  : 1;

		/** Drop packets with a TCP error on packet input */
		uint64_t drop_tcp_err  : 1;

		/** Drop packets with a SCTP error on packet input */
		uint64_t drop_sctp_err : 1;

	} bit;

	/** All bits of the bit field structure
	  *
	  * This field can be used to set/clear all flags, or bitwise
	  * operations over the entire structure. */
	uint64_t all_bits;
} odp_pktin_config_opt_t;
-----
These are used to control packet timestamping as well as default packet
checksum verification processing.
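
For example, a sketch that enables timestamping and IPv4 header checksum
verification on input might look as follows, assuming the capability query
showed these options to be supported and that the interface has not yet been
started.
[source,c]
-----
odp_pktio_config_t config;

odp_pktio_config_init(&config);
config.pktin.bit.ts_all      = 1;  /* timestamp every received packet */
config.pktin.bit.ipv4_chksum = 1;  /* verify IPv4 header checksums */

if (odp_pktio_config(pktio, &config) != 0)
	/* ...handle error */;
-----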

==== PktIO Parsing Configuration
For RX processing, packets may also be parsed automatically as part of
receipt as controlled by the `odp_pktio_parser_config_t` struct:
[source,c]
-----
/**
 * Parser configuration
 */
typedef struct odp_pktio_parser_config_t {
	/** Protocol parsing level in packet input
	  *
	  * Application requires that protocol headers in a packet are checked
	  * up to this layer during packet input. Use ODP_PROTO_LAYER_ALL for
	  * all layers. Packet metadata for this and all preceding layers are
	  * set. In addition, offset (and pointer) to the next layer is set.
	  * Other layer/protocol specific metadata have undefined values.
	  *
	  * The default value is ODP_PROTO_LAYER_ALL. */
	odp_proto_layer_t layer;

} odp_pktio_parser_config_t;
-----
Note that parsing is automatically done whenever classification is enabled
for an RX interface (see below).

==== PktOut Configuration
For PktIOs that will transmit packets, the `odp_pktout_config_opt_t` struct
controls TX processing to be performed on these packets as they are
transmitted:
[source,c]
-----
/**
 * Packet output configuration options bit field
 *
 * Packet output configuration options listed in a bit field structure. Packet
 * output checksum insertion may be enabled or disabled (e.g. ipv4_chksum_ena):
 *
 *  0: Disable checksum insertion. Application will not request checksum
 *     insertion for any packet. This is the default value for xxx_chksum_ena
 *     bits.
 *  1: Enable checksum insertion. Application will request checksum insertion
 *     for some packets.
 *
 * When checksum insertion is enabled, application may use configuration options
 * to set the default behaviour on packet output (e.g. ipv4_chksum):
 *
 *  0: Do not insert checksum by default. This is the default value for
 *     xxx_chksum bits.
 *  1: Calculate and insert checksum by default.
 *
 * These defaults may be overridden on per packet basis using e.g.
 * odp_packet_l4_chksum_insert().
 *
 * For correct operation, packet metadata must provide valid offsets for the
 * appropriate protocols. For example, UDP checksum calculation needs both L3
 * and L4 offsets (to access IP and UDP headers). When application
 * (e.g. a switch) does not modify L3/L4 data and thus checksum does not need
 * to be updated, checksum insertion should be disabled for optimal performance.
 *
 * Packet flags (odp_packet_has_*()) are ignored for the purpose of checksum
 * insertion in packet output.
 *
 * UDP, TCP and SCTP checksum insertion must not be requested for IP fragments.
 * Use checksum override function (odp_packet_l4_chksum_insert()) to disable
 * checksumming when sending a fragment through a packet IO interface that has
 * the relevant L4 checksum insertion enabled.
 *
 * Result of checksum insertion at packet output is undefined if the protocol
 * headers required for checksum calculation are not well formed. Packet must
 * contain at least as many data bytes after L3/L4 offsets as the headers
 * indicate. Other data bytes of the packet are ignored for the checksum
 * insertion.
 */
typedef union odp_pktout_config_opt_t {
	/** Option flags for packet output */
	struct {
		/** Enable IPv4 header checksum insertion. */
		uint64_t ipv4_chksum_ena : 1;

		/** Enable UDP checksum insertion */
		uint64_t udp_chksum_ena  : 1;

		/** Enable TCP checksum insertion */
		uint64_t tcp_chksum_ena  : 1;

		/** Enable SCTP checksum insertion */
		uint64_t sctp_chksum_ena : 1;

		/** Insert IPv4 header checksum by default */
		uint64_t ipv4_chksum     : 1;

		/** Insert UDP checksum on packet by default */
		uint64_t udp_chksum      : 1;

		/** Insert TCP checksum on packet by default */
		uint64_t tcp_chksum      : 1;

		/** Insert SCTP checksum on packet by default */
		uint64_t sctp_chksum     : 1;

	} bit;

	/** All bits of the bit field structure
	  *
	  * This field can be used to set/clear all flags, or bitwise
	  * operations over the entire structure. */
	uint64_t all_bits;
} odp_pktout_config_opt_t;
-----
These are used to control default checksum generation processing for
transmitted packets.
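
A corresponding hedged sketch for output enables UDP checksum insertion and
makes it the per-packet default. Again `pktio` is assumed to be an opened, not
yet started, interface and support for these bits should first be verified via
`odp_pktio_capability()`.
[source,c]
-----
odp_pktio_config_t config;

odp_pktio_config_init(&config);
config.pktout.bit.udp_chksum_ena = 1;  /* application may request UDP checksums */
config.pktout.bit.udp_chksum     = 1;  /* ...and wants them inserted by default */

if (odp_pktio_config(pktio, &config) != 0)
	/* ...handle error */;
-----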

=== PktIO Input and Output Modes
PktIO objects support four different Input and Output modes that may be
specified independently at *open* time.

.PktIO Input Modes
* `ODP_PKTIN_MODE_DIRECT`
* `ODP_PKTIN_MODE_QUEUE`
* `ODP_PKTIN_MODE_SCHED`
* `ODP_PKTIN_MODE_DISABLED`

.PktIO Output Modes
* `ODP_PKTOUT_MODE_DIRECT`
* `ODP_PKTOUT_MODE_QUEUE`
* `ODP_PKTOUT_MODE_TM`
* `ODP_PKTOUT_MODE_DISABLED`

The DISABLED modes indicate that either input or output is prohibited on this
PktIO. Attempts to receive packets on a PktIO whose `in_mode` is DISABLED
return no packets while packets sent to a PktIO whose `out_mode` is DISABLED
are discarded.

==== Direct I/O Modes
DIRECT I/O is the default mode for PktIO objects. It is designed to support
poll-based packet processing, which is often found in legacy applications
being ported to ODP, and can also be a preferred mode for some types of
packet processing. By supporting poll-based I/O processing, ODP provides
maximum flexibility to the data plane application writer.

===== Direct RX Processing
The processing of DIRECT input is shown below:

.PktIO DIRECT Mode Receive Processing
image::pktin_direct_recv.svg[align="center"]

In DIRECT mode, received packets are stored in one or more special PktIO queues
of type *odp_pktin_queue_t* and are retrieved by threads calling the
`odp_pktin_recv()` API.

Once the PktIO has been opened, its DIRECT mode input queues are set up with
the `odp_pktin_queue_config()` API.
[source,c]
-----
/**
 * Configure packet input queues
 *
 * Setup a number of packet input queues and configure those. The maximum number
 * of queues is platform dependent and can be queried with
 * odp_pktio_capability(). Use odp_pktin_queue_param_init() to initialize
 * parameters into their default values. Default values are also used when
 * 'param' pointer is NULL.
 *
 * Queue handles for input queues can be requested with odp_pktin_queue() or
 * odp_pktin_event_queue() after this call. All requested queues are setup on
 * success, no queues are setup on failure. Each call reconfigures input queues
 * and may invalidate all previous queue handles.
 *
 * @param pktio    Packet IO handle
 * @param param    Packet input queue configuration parameters. Uses defaults
 *                 when NULL.
 *
 * @retval 0 on success
 * @retval <0 on failure
 *
 * @see odp_pktio_capability(), odp_pktin_queue(), odp_pktin_event_queue()
 */
int odp_pktin_queue_config(odp_pktio_t pktio,
			   const odp_pktin_queue_param_t *param);
-----
The second argument to this call is the *odp_pktin_queue_param_t*
[source,c]
-----
/**
 * Packet input queue parameters
 */
typedef struct odp_pktin_queue_param_t {
	/** Operation mode
	  *
	  * The default value is ODP_PKTIO_OP_MT. Application may enable
	  * performance optimization by defining ODP_PKTIO_OP_MT_UNSAFE when
	  * applicable. */
	odp_pktio_op_mode_t op_mode;

	/** Enable classifier
	  *
	  * * 0: Classifier is disabled (default)
	  * * 1: Classifier is enabled. Use classifier to direct incoming
	  *      packets into pktin event queues. Classifier can be enabled
	  *      only in ODP_PKTIN_MODE_SCHED and ODP_PKTIN_MODE_QUEUE modes.
	  *      Both classifier and hashing cannot be enabled simultaneously
	  *      ('hash_enable' must be 0). */
	odp_bool_t classifier_enable;

	/** Enable flow hashing
	  *
	  * * 0: Do not hash flows (default)
	  * * 1: Enable flow hashing. Use flow hashing to spread incoming
	  *      packets into input queues. Hashing can be enabled in all
	  *      modes. Both classifier and hashing cannot be enabled
	  *      simultaneously ('classifier_enable' must be 0). */
	odp_bool_t hash_enable;

	/** Protocol field selection for hashing
	  *
	  * Multiple protocols can be selected. Ignored when 'hash_enable' is
	  * zero. The default value is all bits zero. */
	odp_pktin_hash_proto_t hash_proto;

	/** Number of input queues to be created
	  *
	  * When classifier is enabled in odp_pktin_queue_config() this
	  * value is ignored, otherwise at least one queue is required.
	  * More than one input queues require flow hashing configured.
	  * The maximum value is defined by pktio capability 'max_input_queues'.
	  * Queue type is defined by the input mode. The default value is 1. */
	unsigned num_queues;

	/** Queue parameters
	  *
	  * These are used for input queue creation in ODP_PKTIN_MODE_QUEUE
	  * or ODP_PKTIN_MODE_SCHED modes. Scheduler parameters are considered
	  * only in ODP_PKTIN_MODE_SCHED mode. Default values are defined in
	  * odp_queue_param_t documentation.
	  * When classifier is enabled in odp_pktin_queue_config() this
	  * value is ignored. */
	odp_queue_param_t queue_param;

} odp_pktin_queue_param_t;
-----
Note that the *queue_param* field of this struct is ignored in DIRECT mode.
The purpose of `odp_pktin_queue_config()` is to specify the number of PktIn
queues to be created and to set their attributes.

It is important to note that while `odp_pktin_queue_config()` creates a
requested number of RX queues that are associated with the PktIO and accepts
optimization advice as to how the application intends to use them, _i.e._,
whether the queues need to be safe for concurrent use by multiple threads
(OP_MT) or only one thread at a time (OP_MT_UNSAFE), these queues are *not*
associated with any specific thread. Applications use a discipline
appropriate to their design, which may involve restricting PktIn queue use
to separate threads, but that is an aspect of the application design. ODP
simply provides a set of tools here, but it is the application that determines
how those tools are used.
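
A minimal sketch of DIRECT mode input queue configuration, assuming a single
RX queue that will be polled by only one thread, might look as follows:
[source,c]
-----
odp_pktin_queue_param_t pktin_param;

odp_pktin_queue_param_init(&pktin_param);
pktin_param.op_mode    = ODP_PKTIO_OP_MT_UNSAFE; /* one thread per queue */
pktin_param.num_queues = 1;                      /* a single polled RX queue */

if (odp_pktin_queue_config(pktio, &pktin_param) != 0)
	/* ...handle error */;
-----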

===== Hash Processing
Another feature of DIRECT mode input is the provision of a *hash* function used
to distribute incoming packets among the PktIO's PktIn queues. If the
`hash_enable` field of the *odp_pktin_queue_param_t* is 1,
then the `hash_proto` field is used to specify which field(s) of incoming
packets should be used as input to an implementation-defined packet
distribution hash function.
[source,c]
-----
/**
 * Packet input hash protocols
 *
 * The list of protocol header field combinations, which are included into
 * packet input hash calculation.
 */
typedef union odp_pktin_hash_proto_t {
	/** Protocol header fields for hashing */
	struct {
		/** IPv4 addresses and UDP port numbers */
		uint32_t ipv4_udp : 1;
		/** IPv4 addresses and TCP port numbers */
		uint32_t ipv4_tcp : 1;
		/** IPv4 addresses */
		uint32_t ipv4     : 1;
		/** IPv6 addresses and UDP port numbers */
		uint32_t ipv6_udp : 1;
		/** IPv6 addresses and TCP port numbers */
		uint32_t ipv6_tcp : 1;
		/** IPv6 addresses */
		uint32_t ipv6     : 1;
	} proto;

	/** All bits of the bit field structure */
	uint32_t all_bits;
} odp_pktin_hash_proto_t;
-----
Note that the hash function used in PktIO poll mode operation is intended to
provide simple packet distribution among multiple PktIn queues associated with
the PktIO. It does not have the sophistication of the *ODP Classifier*, however
it also does not incur the setup requirements of pattern matching rules,
making it a simpler choice for less sophisticated applications. Note that
ODP does not specify how the hash is to be performed. That is left to each
implementation. The hash only specifies which input packet fields are of
interest to the application and should be considered by the hash function in
deciding how to distribute packets among PktIn queues. The only expectation
is that packets that have the same hash values should all be mapped to the
same PktIn queue.
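
For example, spreading incoming IPv4 traffic across several PktIn queues by
hashing on addresses and UDP/TCP port numbers might be sketched as below; the
queue count is illustrative and must not exceed the interface's
`max_input_queues` capability.
[source,c]
-----
odp_pktin_queue_param_t pktin_param;

odp_pktin_queue_param_init(&pktin_param);
pktin_param.hash_enable               = 1;
pktin_param.hash_proto.proto.ipv4_udp = 1;
pktin_param.hash_proto.proto.ipv4_tcp = 1;
pktin_param.num_queues                = 4;  /* illustrative value */

if (odp_pktin_queue_config(pktio, &pktin_param) != 0)
	/* ...handle error */;
-----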

===== PktIn Queues
A *PktIn Queue* is a special type of queue that is used internally by PktIOs
operating in DIRECT mode. Applications cannot perform enqueues to these queues,
however they may obtain references to them via the `odp_pktin_queue()` API
[source,c]
-----
/**
 * Direct packet input queues
 *
 * Returns the number of input queues configured for the interface in
 * ODP_PKTIN_MODE_DIRECT mode. Outputs up to 'num' queue handles when the
 * 'queues' array pointer is not NULL. If return value is larger than 'num',
 * there are more queues than the function was allowed to output. If return
 * value (N) is less than 'num', only queues[0 ... N-1] have been written.
 *
 * Packets from these queues are received with odp_pktin_recv().
 *
 * @param      pktio    Packet IO handle
 * @param[out] queues   Points to an array of queue handles for output
 * @param      num      Maximum number of queue handles to output
 *
 * @return Number of packet input queues
 * @retval <0 on failure
 */
int odp_pktin_queue(odp_pktio_t pktio, odp_pktin_queue_t queues[], int num);
-----
Once configured, prior to receiving packets the PktIO must be placed into the
*Ready* state via a call to `odp_pktio_start()`
[source,c]
-----
/**
 * Start packet receive and transmit
 *
 * Activate packet receive and transmit on a previously opened or stopped
 * interface. The interface can be stopped with a call to odp_pktio_stop().
 *
 * @param pktio  Packet IO handle
 *
 * @retval 0 on success
 * @retval <0 on failure
 *
 * @see odp_pktio_open(), odp_pktio_stop()
 */
int odp_pktio_start(odp_pktio_t pktio);
-----
Once started, the PktIn queue handles are used as arguments to
`odp_pktin_recv()` to receive packets from the PktIO.
[source,c]
-----
/**
 * Receive packets directly from an interface input queue
 *
 * Receives up to 'num' packets from the pktio interface input queue. Returns
 * the number of packets received.
 *
 * When input queue parameter 'op_mode' has been set to ODP_PKTIO_OP_MT_UNSAFE,
 * the operation is optimized for single thread operation per queue and the same
 * queue must not be accessed simultaneously from multiple threads.
 *
 * @param      queue      Packet input queue handle for receiving packets
 * @param[out] packets[]  Packet handle array for output of received packets
 * @param      num        Maximum number of packets to receive
 *
 * @return Number of packets received
 * @retval <0 on failure
 *
 * @see odp_pktin_queue()
 */
int odp_pktin_recv(odp_pktin_queue_t queue, odp_packet_t packets[], int num);
-----
Note that it is the caller's responsibility to ensure that PktIn queues
are used correctly. For example, it is an error for multiple threads to
attempt to perform concurrent receive processing on the same PktIn queue
if that queue has been marked MT_UNSAFE. Performance MAY be improved if
the application observes the discipline of associating each PktIn queue
with a single RX thread (in which case the PktIn queue can be marked
MT_UNSAFE); however, it is up to the application to determine how best to
structure itself.
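
Putting these pieces together, a hedged sketch of a DIRECT mode receive path
(assuming a single input queue was configured) might be:
[source,c]
-----
#define MAX_PKT_BURST 32  /* illustrative burst size */

odp_pktin_queue_t inq;
odp_packet_t pkts[MAX_PKT_BURST];
int num;

/* This example assumes a single input queue was configured */
if (odp_pktin_queue(pktio, &inq, 1) != 1)
	/* ...handle error */;

if (odp_pktio_start(pktio) != 0)
	/* ...handle error */;

while (1) {
	num = odp_pktin_recv(inq, pkts, MAX_PKT_BURST);

	for (int i = 0; i < num; i++)
		/* ...process pkts[i] */;
}
-----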

===== Direct TX Processing
A PktIO operating in DIRECT mode performs TX processing as shown here:

.PktIO DIRECT Mode Transmit Processing
image::pktout_direct_send.svg[align="center"]

Direct TX processing operates similarly to Direct RX processing. Following
open, the `odp_pktout_queue_config()` API is used to create and configure
one or more *PktOut queues* to be used to support packet transmission by
this PktIO
[source,c]
-----
/**
 * Configure packet output queues
 *
 * Setup a number of packet output queues and configure those. The maximum
 * number of queues is platform dependent and can be queried with
 * odp_pktio_capability(). Use odp_pktout_queue_param_init() to initialize
 * parameters into their default values. Default values are also used when
 * 'param' pointer is NULL.
 *
 * Queue handles for output queues can be requested with odp_pktout_queue() or
 * odp_pktout_event_queue() after this call. All requested queues are setup on
 * success, no queues are setup on failure. Each call reconfigures output queues
 * and may invalidate all previous queue handles.
 *
 * @param pktio    Packet IO handle
 * @param param    Packet output queue configuration parameters. Uses defaults
 *                 when NULL.
 *
 * @retval 0 on success
 * @retval <0 on failure
 *
 * @see odp_pktio_capability(), odp_pktout_queue(), odp_pktout_event_queue()
 */
int odp_pktout_queue_config(odp_pktio_t pktio,
			    const odp_pktout_queue_param_t *param);
-----
As with `odp_pktin_queue_config()`, the configuration of PktOut queues
involves the use of a parameter struct:
[source,c]
-----
/**
 * Packet output queue parameters
 *
 * These parameters are used in ODP_PKTOUT_MODE_DIRECT and
 * ODP_PKTOUT_MODE_QUEUE modes.
 */
typedef struct odp_pktout_queue_param_t {
	/** Operation mode
	  *
	  * The default value is ODP_PKTIO_OP_MT. Application may enable
	  * performance optimization by defining ODP_PKTIO_OP_MT_UNSAFE when
	  * applicable. */
	odp_pktio_op_mode_t op_mode;

	/** Number of output queues to be created. The value must be between
	  * 1 and interface capability. The default value is 1. */
	unsigned num_queues;

} odp_pktout_queue_param_t;
-----
As with direct input, direct output uses one or more special output queues
of type *odp_pktout_queue_t* that are created and configured by this call.

As with PktIn queues, the handles for these created PktOut queues may be
retrieved by the `odp_pktout_queue()` API:
[source,c]
-----
/**
 * Direct packet output queues
 *
 * Returns the number of output queues configured for the interface in
 * ODP_PKTOUT_MODE_DIRECT mode. Outputs up to 'num' queue handles when the
 * 'queues' array pointer is not NULL. If return value is larger than 'num',
 * there are more queues than the function was allowed to output. If return
 * value (N) is less than 'num', only queues[0 ... N-1] have been written.
 *
 * Packets are sent to these queues with odp_pktout_send().
 *
 * @param      pktio    Packet IO handle
 * @param[out] queues   Points to an array of queue handles for output
 * @param      num      Maximum number of queue handles to output
 *
 * @return Number of packet output queues
 * @retval <0 on failure
 */
int odp_pktout_queue(odp_pktio_t pktio, odp_pktout_queue_t queues[], int num);
-----
Once the PktIO has been configured for output and started via
`odp_pktio_start()`, packets may be transmitted to the PktIO by calling
`odp_pktout_send()`:
[source,c]
-----
/**
 * Send packets directly to an interface output queue
 *
 * Sends out a number of packets to the interface output queue. When
 * output queue parameter 'op_mode' has been set to ODP_PKTIO_OP_MT_UNSAFE,
 * the operation is optimized for single thread operation per queue and the same
 * queue must not be accessed simultaneously from multiple threads.
 *
 * A successful call returns the actual number of packets sent. If return value
 * is less than 'num', the remaining packets at the end of packets[] array
 * are not consumed, and the caller has to take care of them.
 *
 * Entire packet data is sent out (odp_packet_len() bytes of data, starting from
 * odp_packet_data()). All other packet metadata is ignored unless otherwise
 * specified e.g. for protocol offload purposes. Link protocol specific frame
 * checksum and padding are added to frames before transmission.
 *
 * @param queue        Packet output queue handle for sending packets
 * @param packets[]    Array of packets to send
 * @param num          Number of packets to send
 *
 * @return Number of packets sent
 * @retval <0 on failure
 */
int odp_pktout_send(odp_pktout_queue_t queue, const odp_packet_t packets[],
		    int num);
-----
Note that the argument to this call specifies the PktOut queue that the
packets are to be added to rather than the PktIO itself. This gives multiple
threads (presumably operating on different cores) a more efficient means of
separating I/O processing destined for the same interface.
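
A corresponding hedged sketch of the DIRECT mode transmit path, assuming a
single output queue and that `pkts[]` / `num` hold packets ready for
transmission, might be:
[source,c]
-----
odp_pktout_queue_param_t pktout_param;
odp_pktout_queue_t outq;
int sent;

odp_pktout_queue_param_init(&pktout_param);
pktout_param.num_queues = 1;

if (odp_pktout_queue_config(pktio, &pktout_param) != 0)
	/* ...handle error */;

if (odp_pktout_queue(pktio, &outq, 1) != 1)
	/* ...handle error */;

/* pkts[] and num are assumed to hold packets ready for transmission */
sent = odp_pktout_send(outq, pkts, num);

/* Packets not accepted for transmission remain owned by the caller */
for (int i = (sent < 0) ? 0 : sent; i < num; i++)
	odp_packet_free(pkts[i]);
-----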

==== Queued I/O Modes
To provide additional flexibility when operating in poll mode, PktIOs may also
be opened in QUEUE Mode. The difference between DIRECT and QUEUE mode is that
QUEUE mode uses standard ODP event queues to service packets.

===== Queue RX Processing
The processing for QUEUE input processing is shown below:

.PktIO QUEUE Mode Receive Processing
image::pktin_queue_recv.svg[align="center"]

In QUEUE mode, received packets are stored in one or more standard ODP queues.
The difference is that these queues are not created directly by the
application. Instead, they are created in response to an
`odp_pktin_queue_config()` call.

As with DIRECT mode, the `odp_pktin_queue_param_t` specified to this call
indicates whether an input hash should be used and if so which field(s) of
the packet should be considered as input to the hash function.

The main difference between DIRECT and QUEUE RX processing is that because
the PktIO uses standard ODP event queues, other parts of the application can
use `odp_queue_enq()` API calls to enqueue packets to these queues for
"RX" processing in addition to those originating from the PktIO interface
itself. To obtain the handles of these input queues, the
`odp_pktin_event_queue()` API is used:
[source,c]
-----
/**
 * Event queues for packet input
 *
 * Returns the number of input queues configured for the interface in
 * ODP_PKTIN_MODE_QUEUE and ODP_PKTIN_MODE_SCHED modes. Outputs up to 'num'
 * queue handles when the 'queues' array pointer is not NULL. If return value is
 * larger than 'num', there are more queues than the function was allowed to
 * output. If return value (N) is less than 'num', only queues[0 ... N-1] have
 * been written.
 *
 * Packets (and other events) from these queues are received with
 * odp_queue_deq(), odp_schedule(), etc calls.
 *
 * @param      pktio    Packet IO handle
 * @param[out] queues   Points to an array of queue handles for output
 * @param      num      Maximum number of queue handles to output
 *
 * @return Number of packet input queues
 * @retval <0 on failure
 */
int odp_pktin_event_queue(odp_pktio_t pktio, odp_queue_t queues[], int num);
-----
Similarly, threads receive packets from PktIOs operating in QUEUE mode by
making standard `odp_queue_deq()` calls to one of the event queues associated
with the PktIO.
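
As a sketch, a thread polling the first event queue of a QUEUE mode PktIO
might look like this:
[source,c]
-----
odp_queue_t inq;
odp_event_t ev;
odp_packet_t pkt;

/* This example assumes a single input event queue was configured */
if (odp_pktin_event_queue(pktio, &inq, 1) != 1)
	/* ...handle error */;

ev = odp_queue_deq(inq);
if (ev != ODP_EVENT_INVALID && odp_event_type(ev) == ODP_EVENT_PACKET) {
	pkt = odp_packet_from_event(ev);
	/* ...process the packet */
}
-----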

===== Queue TX Processing
Transmit processing for PktIOs operating in QUEUE mode is shown below:

.PktIO QUEUE Mode Transmit Processing
image::pktout_queue_send.svg[align="center"]

For TX processing QUEUE mode behaves similar to DIRECT mode except that
output queues are regular ODP event queues that receive packets via
`odp_queue_enq()` calls rather than special PktOut queues that use
`odp_pktout_send()`. Again, these queues are created via a call to
`odp_pktout_queue_config()` following `odp_pktio_open()`.

The main reason for selecting QUEUE mode for output is flexibility. If an
application is designed to use a _pipeline model_ where packets flow through
a series of processing stages via queues, then having the PktIO in QUEUE
mode means that the application can always use the same enq APIs to pass packets
from one stage to the next, including the final transmit output stage.
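
For example, a processing stage that forwards a packet to a QUEUE mode output
might simply enqueue it as an event. Here `pktio` and `pkt` are assumed to
already exist and a single output event queue is assumed to have been
configured.
[source,c]
-----
odp_queue_t outq;

/* This example assumes a single output event queue was configured */
if (odp_pktout_event_queue(pktio, &outq, 1) != 1)
	/* ...handle error */;

if (odp_queue_enq(outq, odp_packet_to_event(pkt)) != 0)
	odp_packet_free(pkt);  /* enqueue failed; packet still owned by caller */
-----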

==== Scheduled I/O Modes
The final PktIO mode supported integrates RX and TX processing with the ODP
_event model_.  For RX processing this involves the use of the *Scheduler*
while for TX processing this involves the use of the *Traffic Manager*.

Scheduled RX Processing is further divided based on whether or not the
Classifier is used.

===== Scheduled RX Processing
When a PktIO is opened with `ODP_PKTIN_MODE_SCHED`, it indicates that the
input queues created by a subsequent `odp_pktin_queue_config()` call are to
be used as input to the *ODP Scheduler*.

.PktIO SCHED Mode Receive Processing
image::pktin_sched_recv.svg[align="center"]

For basic use, SCHED mode simply associates the PktIO input event queues
created by `odp_pktin_queue_config()` with the scheduler. Hashing may still be
employed to distribute input packets among multiple input queues. However
instead of these being plain queues they are scheduled queues and have
associated scheduling attributes like priority, scheduler group, and
synchronization mode (parallel, atomic, ordered). SCHED mode thus provides
both packet distribution (via the optional hash) as well as scalability via
the ODP event model.

In its fullest form, PktIOs operating in SCHED mode use the *ODP Classifier*
to permit fine-grained flow separation on *Class of Service (CoS)* boundaries.

.PktIO SCHED Mode Receive Processing with Classification
image::pktin_sched_cls.svg[align="center"]

In this mode of operation, the hash function of `odp_pktin_queue_config()` is
typically not used. Instead, the event queues created by this call,
as well as any additional event queues created via separate
`odp_queue_create()` calls are associated with classes of service via
`odp_cls_cos_create()` calls. Classification is enabled for the PktIO as a
whole by assigning a _default_ CoS via the `odp_pktio_default_cos_set()`
API.

When operating in SCHED mode, applications do not call PktIn receive functions.
Instead the PktIn queues are scanned by the scheduler and, if classification
is enabled on the PktIO, inbound packets are classified and put on queues
associated with their target class of service which are themselves scheduled
to threads. Note that on platforms that support hardware classification
and/or scheduling these operations will typically be performed in parallel as
packets are arriving, so this description refers to the _logical_ sequence
of classification and scheduling, and does not imply that this is a serial
process.
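
A hedged sketch of a SCHED mode worker loop that receives packets via the
scheduler might look as follows:
[source,c]
-----
while (1) {
	odp_event_t ev = odp_schedule(NULL, ODP_SCHED_WAIT);

	if (odp_event_type(ev) == ODP_EVENT_PACKET) {
		odp_packet_t pkt = odp_packet_from_event(ev);

		/* ...process the packet, then transmit or free it */
	} else {
		/* ...handle other event types */
	}
}
-----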

===== Scheduled TX Processing
Scheduled transmit processing is performed via the *ODP Traffic Manager* and
is requested when a PktIO is opened with an `out_mode` of `ODP_PKTOUT_MODE_TM`.

For TX processing via the Traffic Manager, applications use the `odp_tm_enq()`
API:
[source,c]
-----
/** The odp_tm_enq() function is used to add packets to a given TM system.
 * Note that the System Metadata associated with the pkt needed by the TM
 * system is (a) a drop_eligible bit, (b) a two bit "pkt_color", (c) a 16-bit
 * pkt_len, and MAYBE? (d) a signed 8-bit shaper_len_adjust.
 *
 * If there is a non-zero shaper_len_adjust, then it is added to the pkt_len
 * after any non-zero shaper_len_adjust that is part of the shaper profile.
 *
 * The pkt_color bits are a result of some earlier Metering/Marking/Policing
 * processing (typically ingress based), and should not be confused with the
 * shaper_color produced from the TM shaper entities within the tm_inputs and
 * tm_nodes.
 *
 * @param[in] tm_queue  Specifies the tm_queue (and indirectly the TM system).
 * @param[in] pkt       Handle to a packet.
 * @return              Returns 0 upon success, < 0 upon failure. One of the
 *                      more common failure reasons is WRED dropage.
 */
int odp_tm_enq(odp_tm_queue_t tm_queue, odp_packet_t pkt);
-----
See the *Traffic Manager* section of this document for full information about
Traffic Manager configuration and operation.

== Timers and Timeout Events
The ODP Timer APIs offer a set of functions that permit applications to react
to the passage of time, and are designed to reflect the underlying hardware
timing features found in various platforms that support ODP implementations.

Timers are drawn from specialized pools called _timer pools_ that have their
own abstract type (`odp_timer_pool_t`). Each timer pool is a logically
independent time source with its own _resolution_ measured in nanoseconds (ns)
and a maximum number of timers that it can support. The maximum _resolution_
can be obtained from `odp_timer_capability()`. Applications can have many
timers active at the same time and can set them to use either relative or
absolute time. Associated with each timer is a queue that is to receive events
when this timer expires. This queue is created by a separate
`odp_queue_create()` call that is passed as a parameter to `odp_timer_alloc()`.

Timeouts are specialized events of type `odp_timeout_t` that are used to
represent the expiration of timers. Timeouts are drawn from pools of type
`ODP_POOL_TIMEOUT` that are created by the standard `odp_pool_create()` API.
Timeout events are associated with timers when those timers are _set_ and are
enqueued to their timer's associated queue whenever a set timer expires. So the
effect of timer expiration is a timeout event being added to a queue and
delivered via normal ODP event scheduling.

The following diagrams show the life cycle of timers and timeout events.
Transitions in these finite state machines are marked by the event
triggering them. Events marked in green are common to both state machines,
_i.e.,_ trigger both state machines.

.ODP Timers lifecycle State Diagram
image::timer_fsm.svg[align="center"]

.ODP Timeout event lifecyle State Diagram
image::timeout_fsm.svg[align="center"]

Reminder:
On a `timer expire` event, the related timeout event is enqueued to the
queue associated with the timer.

Timers measure time in _ticks_ rather than nanoseconds because each timer pool
may have its own time source and associated conversion ratios. It is thus more
efficient to manipulate time in these native tick values. As a result time
measured in nanoseconds must be converted between timer-pool specific tick
values via the conversion functions `odp_timer_ns_to_tick()` and
`odp_timer_tick_to_ns()` as needed.  Both of these functions take a timer pool
as an input parameter to enable the pool-specific conversion ratios to be
used.

Associated with each timer pool is a free running tick counter that can be
sampled at any time via the `odp_timer_current_tick()` API. Timers can be set
to an absolute future tick value via `odp_timer_set_abs()` or to a future tick
value relative to the current tick via `odp_timer_set_rel()`.  Implementations
may impose minimum and maximum future values supported by a given timer pool
and timer set operations will fail if the requested value is outside of the
supported range.

Before a set timer expires, it can be canceled via the `odp_timer_cancel()`
API. A successful cancel has the same effect as if the timer were never set.
An attempted cancel will fail if the timer is not set or if it has already
expired.
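
As an illustrative sketch, a one-millisecond relative timer might be set as
follows. The handles `timer_pool`, `timer`, and `timeout_pool` are assumed to
have been created as described below, and the return-code constant
`ODP_TIMER_SUCCESS` reflects the timer API generation described in this guide.
[source,c]
-----
odp_timeout_t tmo = odp_timeout_alloc(timeout_pool);
odp_event_t tmo_ev = odp_timeout_to_event(tmo);
uint64_t ticks = odp_timer_ns_to_tick(timer_pool, 1 * ODP_TIME_MSEC_IN_NS);

if (odp_timer_set_rel(timer, ticks, &tmo_ev) != ODP_TIMER_SUCCESS)
	/* ...timer could not be set (e.g. delay outside supported range) */;
-----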

=== Timer Pool Management
To facilitate implementation of the ODP timer APIs, an additional timer API is
provided. During initialization, applications are expected to create the timer
pools they need and then call `odp_timer_pool_start()`. ODP implementations
may or may not fail further attempts to create timer pools after this API is
called. For best portability, applications should not attempt to create
further timer pools after calling `odp_timer_pool_start()`. Note that no such
restrictions exist on timeout pools, as these are just ordinary ODP pools.

Following start, applications may allocate, set, cancel, and free timers
from their associated timer pools. During termination processing, after all
timers allocated from a timer pool have been freed, the pool itself should be
released via a call to `odp_timer_pool_destroy()`.

=== Timeout Event Management
The purpose of ODP timers is to schedule their associated timeout events, which
are how applications actually react to the passage of time. To help with this,
several additional APIs and conventions are provided.

Timer allocation is performed via the `odp_timer_alloc()` API:
[source,c]
-----
/**
 * Allocate a timer
 *
 * Create a timer (allocating all necessary resources e.g. timeout event) from
 * the timer pool. The user_ptr is copied to timeouts and can be retrieved
 * using the odp_timeout_user_ptr() call.
 *
 * @param tpid     Timer pool identifier
 * @param queue    Destination queue for timeout notifications
 * @param user_ptr User defined pointer or NULL to be copied to timeouts
 *
 * @return Timer handle on success
 * @retval ODP_TIMER_INVALID on failure and errno set.
 */
odp_timer_t odp_timer_alloc(odp_timer_pool_t tpid,
			    odp_queue_t queue,
			    void *user_ptr);
-----
Note that in addition to the timer pool id and queue, a user pointer is
provided. This is to allow context associated with the timeout to be
communicated. Upon receiving a timeout event, the application can use
the `odp_timeout_user_ptr()` API to retrieve the user pointer associated
with the timer that triggered this event.

A worker thread receiving events that may include timeouts might be structured
as follows:
[source,c]
-----
while (1) {
	ev = odp_schedule(&from, ODP_SCHED_WAIT);

	switch (odp_event_type(ev)) {
	case ODP_EVENT_TIMEOUT: {
		odp_timeout_t timeout = odp_timeout_from_event(ev);
		odp_timer_t timer = odp_timeout_timer(timeout);
		void *userptr = odp_timeout_user_ptr(timeout);
		uint64_t expiration = odp_timeout_tick(timeout);

		if (!odp_timeout_fresh(timeout)) {
			odp_timeout_free(timeout);
			continue;
		}

		/* ...process the timeout event */
		break;
	}
	default:
		/* ...process other event types */
		break;
	}
}
-----
When a worker thread receives a timeout event via `odp_schedule()`, it needs
to determine whether the event is still relevant. A timeout event that is still
relevant is said to be _fresh_ while one that is no longer relevant is said to
be _stale_. Timeouts may be stale for any number of reasons, most of which are
known only to the application itself. However, there are a few cases where the
ODP implementation may be able to assist in this determination and for those
cases the `odp_timeout_fresh()` API is provided.

ODP defines a fresh timeout simply as one that has not been reset or
canceled since it expired. So if `odp_timeout_fresh()` returns 0 then it is
likely that the application should ignore this event, however if it returns 1
then it remains an application responsibility to handle the event as
appropriate to its needs.

== Cryptographic services

ODP provides APIs to perform cryptographic operations required by
applications. ODP cryptographic APIs are session based and provide
cryptographic algorithm offload services. ODP also offers cryptographic
protocol offload services for protocols such as IPsec using a different set
of APIs. This section covers the main crypto APIs.

ODP provides APIs for following cryptographic services:

* Ciphering
* Authentication/data integrity via Keyed-Hashing (HMAC)
* Random number generation
* Crypto capability inquiries

Ciphering and authentication services are accessible via two complementary
sets of related APIs: the original ODP crypto APIs, and a newer
_packet-oriented_ set of crypto APIs that are designed to be consistent with
the protocol-aware cryptographic services offered by the IPsec API set.

=== Crypto Sessions

To apply a cryptographic operation to a packet a session must be created. All
packets processed by a session share the parameters that define the session.

ODP supports synchronous and asynchronous crypto sessions. For asynchronous
sessions, the output of a crypto operation is posted to a queue defined as
the completion queue in its session parameters.

ODP crypto APIs support chained operation sessions in which hashing and
ciphering both can be achieved using a single session and operation call. The
order of cipher and hashing can be controlled by the `auth_cipher_text`
session parameter.

Other Session parameters include algorithms, keys, initialization vector
(optional), encode or decode, output queue for async mode and output packet
pool for allocation of an output packet if required.

The parameters that describe the characteristics of a crypto session are
encoded in the `odp_crypto_session_param_t` struct that is passed to the
`odp_crypto_session_create()` API. A successful call returns an
`odp_crypto_session_t` object that in turn is passed as an input parameter to
crypto operation calls.

When an application is finished with a crypto session the
`odp_crypto_session_destroy()` API is used to release the resources associated
with an `odp_crypto_session_t`.
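
A hedged sketch of session creation is shown below, assuming AES-CBC ciphering
with key material in `key_data` and previously created `compl_queue` and
`out_pool` handles; the field and enum names reflect the API generation
described here and should be checked against the target ODP release.
[source,c]
-----
odp_crypto_session_param_t param;
odp_crypto_session_t session;
odp_crypto_ses_create_err_t status;

odp_crypto_session_param_init(&param);
param.op                = ODP_CRYPTO_OP_ENCODE;
param.cipher_alg        = ODP_CIPHER_ALG_AES_CBC;
param.cipher_key.data   = key_data;          /* illustrative key material */
param.cipher_key.length = 16;
param.auth_alg          = ODP_AUTH_ALG_NULL; /* cipher-only session */
param.pref_mode         = ODP_CRYPTO_SYNC;
param.compl_queue       = compl_queue;       /* used in async mode */
param.output_pool       = out_pool;

if (odp_crypto_session_create(&param, &session, &status) != 0)
	/* ...inspect 'status' for the failure reason */;
-----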

=== Crypto operations

After session creation, a cryptographic operation can be applied to a packet
in one of two ways.

==== Parameter-based Crypto Operations
This is the original ODP support for cryptographic operations. The
`odp_crypto_operation()` API takes an input `odp_crypto_op_param_t` struct
that describes the cryptographic operation to be performed. This struct
contains the session to use as well as the input packet the operation is to be
performed on. The caller may either specify an output packet to receive the
operation results or may request that the ODP implementation allocate a new
packet to receive these results from the output pool associated with the
`odp_crypto_session_t`. If the input packet is also used as the output packet,
then an "in place" operation is requested.

When using the `odp_crypto_operation()` API, applications may indicate a
preference for synchronous or asynchronous processing in the session's
`pref_mode` parameter.  However crypto operations may complete synchronously
even if an asynchronous preference is indicated, and applications must examine
the `posted` output parameter from `odp_crypto_operation()` to determine
whether the operation has completed or if an `ODP_EVENT_CRYPTO_COMPL`
notification is expected. In the case of an async operation, the `posted`
output parameter will be set to true.

The operation arguments specify for each packet the areas that are to be
encrypted or decrypted and authenticated. Also, there is an option of overriding
the initialization vector specified in session parameters.

An operation can be executed in in-place, out-of-place, or new buffer mode.
In in-place mode the output packet is the same as the input packet. In
out-of-place mode the output packet is a different packet specified by the
application, while in new buffer mode the implementation allocates a new
output buffer from the session’s output pool.

The application can also specify a context associated with a given operation
that will be retained during async operation and can be retrieved via the
completion event.

Results of an asynchronous session will be posted as completion events to the
session’s completion queue, which can be accessed directly or via the ODP
scheduler. The completion event contains the status of the operation and the
result. The application has the responsibility to free the completion event.

Upon receipt of an `ODP_EVENT_CRYPTO_COMPL` event, the
`odp_crypto_compl_result()` API is used to retrieve the
`odp_crypto_op_result_t` associated with the event. This result struct in turn
contains:

* An indication of the success or failure of the crypto operation
* The user context associated with the event
* The output `odp_packet_t`.
* The `odp_crypto_op_status_t` for the requested cipher operation
* The `odp_crypto_op_status_t` for the requested authentication operation

==== Packet-based Crypto Operations
To simplify the original cryptographic operation request API, as well as to
be more flexible and consistent with the protocol-aware APIs introduced for
IPsec support, a newer packet-oriented set of cryptographic operation
APIs is also provided. Applications may use either API set, but going forward
it is expected that these newer APIs will be the focus of continued
development.

Instead of a single `odp_crypto_operation()` API, the packet-based form
provides two APIs: `odp_crypto_op()` is the synchronous form while
`odp_crypto_op_enq()` is the asynchronous form. To check which of these are
supported by the ODP implementation, examine the `sync_mode` and `async_mode`
fields in the `odp_crypto_capability_t` struct returned by the
`odp_crypto_capability()` API.

Both forms take an input array of packets, an optional output array of packets
to receive the results, and an array of `odp_crypto_packet_op_param_t` structs
that describe the operation to be performed on each input packet. As with the
original APIs, the output array may be the same packets to request in-place
operation, or may be specified as `ODP_PACKET_INVALID` to request that ODP
allocate output packets from the pool associated with the
`odp_crypto_session_t` being used.

The key differences between the `odp_crypto_op_param_t` used by the original
APIs and the `odp_crypto_packet_op_param_t` used by the new APIs are:

* The original API takes a single `odp_crypto_op_param_t` since it operates on
a single packet whereas the new forms take an array of
`odp_crypto_packet_op_param_t` structs, one for each input packet.

* The `odp_crypto_packet_op_param_t` does not contain any packet information
since the input and output packets are supplied as API parameters rather than
being encoded in this struct.

* The `odp_crypto_packet_op_param_t` does not contain a user context field.

In addition, the `odp_crypto_session_t` field `op_mode` is used instead of
the `pref_mode` field when the packet-oriented APIs are used. If the
`op_mode` is set to `ODP_CRYPTO_SYNC` then the synchronous form of the API
must be used and if `op_mode` is set to `ODP_CRYPTO_ASYNC` then the
asynchronous form of the API must be used. It is an error to attempt to use
a form of the API not properly matched to the mode of the crypto session.

The output of a packet-based crypto operation is an `odp_packet_t` (one for
each input packet) that is returned either synchronously or
asynchronously. Asynchronous return is in the form of `ODP_EVENT_PACKET`
events that have event subtype `ODP_EVENT_PACKET_CRYPTO`. The packet
associated with such events is obtained via the
`odp_crypto_packet_from_event()` API. The `odp_crypto_result()` API, in turn,
retrieves the `odp_crypto_packet_result_t` from this `odp_packet_t` that
contains:

* An indication of whether the crypto packet operation was successful or not
* The `odp_crypto_op_status_t` for the requested cipher operation
* The `odp_crypto_op_status_t` for the requested authentication operation
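
The following is a hedged sketch of a synchronous, in-place operation on a
single packet, assuming `session` was created with `op_mode` set to
`ODP_CRYPTO_SYNC` and that the whole packet is to be ciphered; the parameter
and result field names shown are assumptions to be verified against the
target release.
[source,c]
-----
odp_packet_t out_pkt = pkt;  /* same handle requests in-place operation */
odp_crypto_packet_op_param_t op_param;
odp_crypto_packet_result_t result;

memset(&op_param, 0, sizeof(op_param));
op_param.session             = session;
op_param.cipher_range.offset = 0;
op_param.cipher_range.length = odp_packet_len(pkt);

if (odp_crypto_op(&pkt, &out_pkt, &op_param, 1) != 1)
	/* ...handle error */;

if (odp_crypto_result(&result, out_pkt) != 0 || !result.ok)
	/* ...operation failed */;
-----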

=== Random number Generation

ODP provides two APIs to generate various kinds of random data bytes. Random
data is characterized by _kind_, which specifies the "quality" of the
randomness required. ODP supports three kinds of random data:

ODP_RANDOM_BASIC:: No specific requirement other than the data appear to be
uniformly distributed. Suitable for load-balancing or other non-cryptographic
use.

ODP_RANDOM_CRYPTO:: Data suitable for cryptographic use. This is a more
stringent requirement that the data pass tests for statistical randomness.

ODP_RANDOM_TRUE:: Data generated from a hardware entropy source rather than
any software generated pseudo-random data. May not be available on all
platforms.

These form a hierarchy with BASIC being the lowest kind of random and TRUE
being the highest. The main API for accessing random data is:

[source,c]
-----
int32_t odp_random_data(uint8_t *buf, uint32_t len, odp_random_kind_t kind);
-----

The expectation is that lesser-quality random is easier and faster to generate
while higher-quality random may take more time. Implementations are always free
to substitute a higher kind of random than the one requested if they are able
to do so more efficiently, however calls must return a failure indicator
(rc < 0) if a higher kind of data is requested than the implementation can
provide. This is most likely the case for ODP_RANDOM_TRUE since not all
platforms have access to a true hardware random number generator.

The `odp_random_max_kind()` API returns the highest kind of random data
available on this implementation.
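
For example, generating keying material might be sketched as follows; a real
application that requires a specific kind would first check
`odp_random_max_kind()`:
[source,c]
-----
uint8_t key[16];
int32_t len;

/* Request cryptographic-quality random bytes. Fewer bytes than requested
 * may be returned, so a real application would loop until 'key' is full. */
len = odp_random_data(key, sizeof(key), ODP_RANDOM_CRYPTO);
if (len < 0)
	/* ...this kind of random data is not available */;
-----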

For testing purposes it is often desirable to generate repeatable sequences
of "random" data. To address this need ODP provides the additional API:

[source,c]
-----
int32_t odp_random_test_data(uint8_t *buf, uint32_t len, uint64_t *seed);
-----

This operates the same as `odp_random_data()` except that it always returns
data of kind `ODP_RANDOM_BASIC` and an additional thread-local `seed`
parameter is provided that specifies a seed value to use in generating the
data. This value is updated on each call, so repeated calls with the same
variable will generate a sequence of random data starting from the initial
specified seed. If another sequence of calls is made starting with the same
initial seed value, then `odp_random_test_data()` will return the same
sequence of data bytes.
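
For example, two sequences generated from the same initial seed will be
identical:
[source,c]
-----
uint8_t buf1[32], buf2[32];
uint64_t seed1 = 42, seed2 = 42;  /* same starting seed value */

odp_random_test_data(buf1, sizeof(buf1), &seed1);
odp_random_test_data(buf2, sizeof(buf2), &seed2);

/* buf1 and buf2 now contain the same byte sequence, and seed1 == seed2 */
-----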

=== Capability inquiries

ODP provides the API `odp_crypto_capability()` to inquire the implementation’s
crypto capabilities. This interface returns the maximum number of crypto
sessions supported as well as bitmasks for supported algorithms and hardware
backed algorithms.

== IPsec services

In addition to general cryptographic services, ODP offers offload support for
the IPsec protocol. IPsec is a general term referencing a suite of protocols
and packet formats and as such a full discussion of IPsec is beyond the scope
of this document. See https://tools.ietf.org/html/rfc4301[RFC 4301] and
related RFCs for more detail. This section assumes the reader is already
familiar with IPsec and focuses on explaining the ODP APIs that support it.

ODP provides APIs for the following IPsec services:

* General IPsec configuration
* Security Association (SA) configuration and lifecycle management
* Synchronous and Asynchronous IPsec lookaside processing
* Inline processing for full IPsec RX and/or TX offload
* Pipelining for RX traffic
* Fragmentation support for TX traffic
* IPsec event management

=== IPsec Capabilities and Configuration
As with other features, ODP provides APIs that permit applications to query
platform-specific IPsec capabilities. The `odp_ipsec_capability()` API queries
the general IPsec features available while the `odp_ipsec_cipher_capability()`
and `odp_ipsec_auth_capability()` APIs provide detail on the range of
cipher and authentication algorithms supported by IPsec on this platform.

General IPsec capabilities that are reported include:

* The IPsec operation modes supported by this implementation. Different
operation modes may be _not supported_, _supported_, or _preferred_. A
preferred form means that this mode takes advantage of hardware
acceleration features to achieve best performance.
* Whether IPsec AH processing is supported. All ODP platforms must provide
support for IPsec ESP processing, however since AH is relatively rare, it
may not be supported, or supported only via software emulation (_e.g.,_ be
non-preferred).
* Whether IPsec headers can be retained on decrypt for inbound inline
operations.
* Whether classification pipelining is supported (to be discussed below).

In addition, capabilities also inform the application of the maximum number
of destination queues and classification CoS targets supported. These
will be discussed further later.

==== IPsec Operation Modes
IPsec operates in one of three modes: Synchronous, Asynchronous, and Inline.

==== Lookaside Processing
Synchronous and Asynchronous are types of _lookaside_ processing. Which of
these forms may be used depends on the IPsec operation mode. So synchronous
APIs may only be used when operating in synchronous mode, and asynchronous
APIs may only be used when operating in asynchronous mode.

In lookaside mode, the application receives (or creates) an IPsec packet and
then uses ODP to perform one of two functions:

* To decrypt an IPsec packet into a "normal" packet
* To take a "normal" packet and encrypt it into an IPsec packet.

This process may be performed _synchronously_ with the APIs `odp_ipsec_in()`
(to decrypt) and `odp_ipsec_out()` (to encrypt). Upon return from these calls
the requested packet transformation is complete, or an error return code
indicates that it could not be performed (_e.g.,_ packet decryption failed).

Synchronous processing may be preferred if the application has a large number
of worker threads so that blocking any individual worker while IPsec processing
is performed represents a reasonable design. The alternative is to use
_asynchronous_ forms of these APIs:

* `odp_ipsec_in_enq()` for decrypt
* `odp_ipsec_out_enq()` for encrypt

These simply pass packets to IPsec for processing. When this processing is
complete, the resulting packets are sent to the completion queue associated
with the SA used by the operation, serving as IPsec completion events as
shown here:

image::ipsec-lookaside.svg[align="center"]

If the operation fails because SA lookup failed for inbound processing, then
these result packets are sent to the default queue specified as part of the
`odp_ipsec_inbound_config_t` used in the `odp_ipsec_config()` call.

Following an asynchronous IPsec call, the worker thread moves on to process
other events until the IPsec completion shows up. At that point the worker
thread sees whether the operation was successful or not and continues
processing for that packet. These events may be direct-polled with
`odp_queue_deq()` if the completion queue was created as a plain queue, or
processed via the ODP scheduler if the completion queue was created as a
scheduled queue.

==== Inline Processing
While lookaside processing offers flexibility, it still requires extra
processing steps not required by modern hardware. To avoid this overhead
ODP also offers _inline_ processing support for IPsec. In this mode the
processing of IPsec packets on the RX and TX paths is fully offloaded as
shown here:

image::ipsec-inline.svg[align="center"]

It is worth noting that, depending on the implementation and application
needs, inline processing may be enabled only for one direction (inbound or
outbound) or for both directions.

On the receive side, once configured for inline processing, arriving IPsec
packets that are recognized at the PktIO interface are decrypted automatically
before the application ever sees them. On the transmit side, the application
calls `odp_ipsec_out_inline()` and the packet is encrypted and queued for
transmission as a single operation without further application involvement.
Note that if an inbound IPsec packet is not recognized (_e.g.,_ it belongs to
an unknown SA) then it will be presented to the application as-is without
further processing. The application may then use a lookaside call to process
the packet if it is able to supply a matching SA by other means.

On the receive side, after an IPsec packet is decrypted, it may be
_pipelined_ to the ODP classifier or added to a poll queue, as the
application wishes. The advantage of classification pipelining is that inbound
IPsec traffic is automatically decrypted and classified into appropriate
flow-based queues for ease of processing.

On the transmit side, since IPsec encryption and tunneling may exceed an
output MTU, ODP also offers support for MTU configuration and automatic IPsec
TX fragmentation.

Both classification pipelining and TX fragmentation are optional features
whose availability is indicated by `odp_ipsec_capability()`.

Note that at present inline IPsec output support sends resulting packets
directly to an output PktIO. If it's desired to send them to the ODP
Traffic Manager for shaping prior to transmission, use the lookaside APIs
to perform the IPsec encrypt and then call `odp_tm_enq()` on the resulting
packet.

=== IPsec Configuration
Prior to making use of IPsec services, the `odp_ipsec_config()` API is used to
configure IPsec processing options. This API takes a pointer to an
`odp_ipsec_config_t` struct as its argument.

The `odp_ipsec_config_t` struct specifies the inbound and outbound processing
modes (SYNC, ASYNC, or INLINE) that the application plans to use, the maximum
number of Security Associations it will use, and sets inbound and outbound
processing options.
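
As an illustration, a configuration selecting asynchronous inbound and inline
outbound processing might look as in the following sketch. The queue
`default_ipsec_queue` is assumed to have been created earlier, and the field
names shown reflect one version of the API:

[source,c]
-----
odp_ipsec_config_t ipsec_config;

odp_ipsec_config_init(&ipsec_config);
ipsec_config.inbound_mode  = ODP_IPSEC_OP_MODE_ASYNC;
ipsec_config.outbound_mode = ODP_IPSEC_OP_MODE_INLINE;
ipsec_config.max_num_sa    = 64;
ipsec_config.inbound.default_queue = default_ipsec_queue;

if (odp_ipsec_config(&ipsec_config) < 0) {
	/* ...IPsec not available in the requested modes... */
}
-----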

==== IPsec Inbound Configuration
Inbound configuration options for IPsec specify the default `odp_queue_t` to
be used for processing global events like SA lookup failures, how Security
Parameter Index (SPI) lookup is to be performed, and whether the application
requires ODP to retain outer headers for decrypted IPsec packets.

Parsing options specify how "deep" decrypted packets are to be parsed
after IPsec processing by specifying the packet layers of interest to the
application (None, L2, L3, L4, or All), as well as which checksums should be
verified on decrypted packets.

==== IPsec Outbound Configuration
Outbound configuration options for IPsec specify checksum insertion processing
that should be performed prior to encryption.

=== IPsec Events
IPsec introduces one new event type and one new event subtype. These are:

* IPsec packet events. These are events of type `ODP_EVENT_PACKET` that have
subtype `ODP_EVENT_PACKET_IPSEC`. These are packets that carry additional
IPsec-related metadata in the form of an `odp_ipsec_packet_result_t` struct
that can be retrieved from the packet via the `odp_ipsec_result()` API.

* IPsec status notifications. These are events of type `ODP_EVENT_IPSEC_STATUS`
that indicate status events not associated with any particular IPsec
packet. Such events carry status in the form of an `odp_ipsec_status_t`
struct that is retrieved from the event via the `odp_ipsec_status()` API.

IPsec-related events are thus part of normal and exception processing when
working with IPsec.

=== Security Associations (SAs)
The fundamental "building block" for IPsec processing is the _Security
Association (SA)_. Similar to a crypto session, the SA encapsulates the keying
material and context needed to perform IPsec protocol processing for inbound
or outbound packets on a given flow, as well as additional processing options
that control how IPsec is to be used for packets processed under this
SA. Security Associations are unidirectional (RX or TX) so a flow that
requires both inbound (decrypt) and outbound (encrypt) IPsec functions will
have two SAs associated with it. SAs in ODP are represented by the
abstract type `odp_ipsec_sa_t`.

After ODP initialization, IPsec support is dormant until it is configured
by a call to `odp_ipsec_config()` as described earlier. Once configured,
SAs may be created by calling `odp_ipsec_sa_create()`.

==== SA Creation and Configuration
The `odp_ipsec_sa_create()` API takes an `odp_ipsec_sa_param_t` argument that
describes the SA to be created. Use the `odp_ipsec_sa_param_init()` API to
initialize this to its default state and then override selected fields within
the param struct as needed.

Items specified in the `odp_ipsec_sa_param_t` struct include:

* The direction of the SA (inbound or outbound).

* The IPsec protocol being used (ESP or AH).

* The IPsec protocol mode (Transport or Tunnel).

* The parameters needed for the crypto and authentication algorithms to be
used by this SA.

* Miscellaneous SA options that control behavior such as use of Extended
Sequence Numbers (ESNs), the use of UDP encapsulation, various copy
options for header fields, and whether the TTL (Hop Limit) field should be
decremented when operating in tunnel mode.

* Parameters controlling the SA lifetime.

* The Security Parameter Index (SPI) that packets will use to indicate that
they belong to this SA.

* The pipeline mode used by this SA.

* The destination `odp_queue_t` to be used for events associated with this SA.

* The user context pointer (and length) associated with this SA for
application use.

In addition, there are direction-specific parameters that vary
based on whether the SA is for inbound or outbound use. For inbound SAs:

* Controls for how this SA is to be looked up.

* The minimum size of the anti-replay window to be used.

* The default CoS to use when classification pipelining packets matching this
SA.

For outbound SAs:

* Tunnel parameters to use when doing outbound processing in tunnel mode.

* The fragmentation mode to be used.

* The MTU to be used to control the maximum length IP packets that outbound
IPsec operations may produce. This can be changed dynamically by the
`odp_ipsec_sa_mtu_update()` API.

As can be seen, SAs have a large degree of configurability.
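
The following sketch creates a simple inbound tunnel-mode ESP SA. It is
illustrative only: the key material (`cipher_key`, `auth_key`), the completion
queue `sa_compl_queue`, and the chosen algorithms are assumptions, and the
exact layout of some `odp_ipsec_sa_param_t` fields may vary between ODP API
versions:

[source,c]
-----
odp_ipsec_sa_param_t sa_param;
odp_ipsec_sa_t sa;

odp_ipsec_sa_param_init(&sa_param);
sa_param.dir   = ODP_IPSEC_DIR_INBOUND;
sa_param.proto = ODP_IPSEC_ESP;
sa_param.mode  = ODP_IPSEC_MODE_TUNNEL;

sa_param.crypto.cipher_alg        = ODP_CIPHER_ALG_AES_CBC;
sa_param.crypto.cipher_key.data   = cipher_key;
sa_param.crypto.cipher_key.length = sizeof(cipher_key);
sa_param.crypto.auth_alg          = ODP_AUTH_ALG_SHA256_HMAC;
sa_param.crypto.auth_key.data     = auth_key;
sa_param.crypto.auth_key.length   = sizeof(auth_key);

sa_param.spi        = 0x1234;          /* SPI carried by packets on this SA */
sa_param.dest_queue = sa_compl_queue;  /* completion/status events for this SA */

sa = odp_ipsec_sa_create(&sa_param);
if (sa == ODP_IPSEC_SA_INVALID) {
	/* ...handle SA creation failure... */
}
-----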

==== SA Lifecycle Management
In discussing the lifecycle of an SA and the operations it supports, it is
useful to refer to the following sequence diagram for IPsec configuration, SA
management, and IPsec operations:

image::ipsec_sa_states.svg[align="center"]

After creation, IPsec services are active for this Security Association. The
specific APIs that can be used on this SA depend on the IPsec operating mode
that has been configured.

===== IPsec Lookaside Processing
If IPsec is operating in lookaside mode for the SA's direction (the
`odp_ipsec_op_mode_t` is `ODP_IPSEC_OP_MODE_SYNC` or `ODP_IPSEC_OP_MODE_ASYNC`),
then inbound or outbound lookaside operations may be performed. Asynchronous
lookaside operations are also permitted if the SA is operating in inline
mode, as described in the next section.

The synchronous forms of these APIs are:

* `odp_ipsec_in()`
* `odp_ipsec_out()`

Upon return from these calls, the return code tells the application the number
of input packets that were consumed by the operation. The result of
the operation is determined by calling the `odp_ipsec_result()` API for each
output packet to retrieve its associated `odp_ipsec_packet_result_t`.

The asynchronous forms of these APIs are:

* `odp_ipsec_in_enq()`
* `odp_ipsec_out_enq()`

Here again, the return code indicates how many input packets were
processed. The success or failure is determined by inspecting the
`odp_ipsec_packet_result_t` associated with each packet completion event. These are
presented as events of type `ODP_EVENT_PACKET` with subtype
`ODP_EVENT_PACKET_IPSEC`.

For both synchronous and asynchronous IPsec operations an input packet array
is transformed into an output packet array as specified by a controlling
parameter struct. For inbound operations, the `odp_ipsec_in_param_t` is
used to specify how SA processing is to be performed for the requested
operation. The caller may say that SA lookup processing should be performed
for each input packet, a single (specified) SA should be used for all packets,
or that each packet has a specified individual SA.

For outbound lookaside operations, a corresponding `odp_ipsec_out_param_t`
serves a similar role, but here the SA must be specified since the input
packet(s) are non-IPsec packets. Again the option is to use a single SA for
all input packets or one per input packet.
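
A sketch of an asynchronous outbound submission using a single SA for all
packets follows; it assumes `out_sa` is an outbound SA and `pkts`/`num_pkts`
describe the non-IPsec packets to be encrypted:

[source,c]
-----
odp_ipsec_out_param_t param;

memset(&param, 0, sizeof(param));
param.num_sa  = 1;       /* one SA covering all packets in this call */
param.sa      = &out_sa;
param.num_opt = 0;       /* no per-packet odp_ipsec_out_opt_t overrides */

/* Consumed packets return later as IPsec completion events on the SA queue */
if (odp_ipsec_out_enq(pkts, num_pkts, &param) < 0) {
	/* ...handle submission error... */
}
-----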

For outbound operations, an associated array of `odp_ipsec_out_opt_t` structs
is also used to control the fragmentation mode to be used as part of the
outbound processing. Options here are to not fragment, to fragment before
IPsec processing, after IPsec processing, or to only check whether IP
fragmentation is needed but not to perform it. For check processing, the `mtu`
status error bit in the `odp_ipsec_packet_result_t` is set if check processing
detects that the resulting packet will not fit into the configured MTU. Note
that the MTU associated with a given SA is set at SA creation and can be
changed at any time via the `odp_ipsec_sa_mtu_update()` API.

Once an asynchronous lookaside operation has been initiated, the worker thread
that issued the asynchronous call can handle other events while waiting for
the operation to complete. Completion of an asynchronous operation is
indicated by the worker receiving an `ODP_EVENT_PACKET` that has subtype
`ODP_EVENT_PACKET_IPSEC`. These events can be retrieved directly by polling
the completion queue associated with the SA, or (more typically) via the ODP
scheduler. Typical code for such completion processing would look as follows:

[source,c]
-----
while (1) {
	ev = odp_schedule(&queue, ODP_SCHED_WAIT);
	ev_type = odp_event_types(ev, &ev_subtype);

	switch (ev_type) {
	case ODP_EVENT_PACKET:

		switch (ev_subtype) {
		case ODP_EVENT_PACKET_IPSEC:
			pkt = odp_packet_from_event(ev);

			if (odp_unlikely(odp_ipsec_result(&result, pkt) != 0)) {
				/* Stale event, discard */
				odp_event_free(ev);
				continue;
			}

			if (odp_unlikely(result.status.all != ODP_IPSEC_OK)) {
				 if (result.status.error.all != 0) {
					 ...process error result
					 odp_event_free(ev);
					 continue;
				 } else {
					 ...process packet warnings
				 }
			}

			my_context = odp_ipsec_sa_context(result.sa);

			if (result.flag.inline_mode) {
				...process inline inbound packet
			} else {
				...process the async completion event
			}

			...
			break;

		case ...
		}
		break;

	case ODP_EVENT_IPSEC_STATUS:
		...process IPsec status event
		break;

	}
}
-----

===== IPsec Inline Processing
When IPsec is configured to operate in `ODP_IPSEC_OP_MODE_INLINE` mode,
inbound processing is implicit. The application never sees these packets until
after IPsec has already decrypted them. As shown in the code sketch above,
such packets appear as events of subtype `ODP_EVENT_PACKET_IPSEC` and the
`flag` field in the associated `odp_ipsec_packet_result_t` indicates
`inline_mode`.

For outbound IPsec processing, the `odp_ipsec_out_inline()` API operates as
a "fire and forget" API. A success return code from this call indicates that
the packet will be encrypted and transmitted to the `odp_pktio_t` indicated
in the `odp_ipsec_out_inline_param_t` specified at the time of the call without
any further application involvement. Only if a problem arises will the packet
be returned to the application with an `odp_ipsec_packet_result_t` indicating
the nature of the problem.
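
A sketch of such a call follows; it assumes `out_sa` is an outbound SA
configured for inline use, `tx_pktio` is the transmitting interface, and
`l2_hdr`/`l2_hdr_len` describe a prebuilt outer L2 header (the exact semantics
of the outer header parameter depend on the implementation and use case):

[source,c]
-----
odp_ipsec_out_param_t param;
odp_ipsec_out_inline_param_t inline_param;

memset(&param, 0, sizeof(param));
param.num_sa = 1;
param.sa     = &out_sa;

memset(&inline_param, 0, sizeof(inline_param));
inline_param.pktio         = tx_pktio;  /* interface to transmit on */
inline_param.outer_hdr.ptr = l2_hdr;    /* header bytes to prepend on TX */
inline_param.outer_hdr.len = l2_hdr_len;

/* "Fire and forget": on success the packet is encrypted and transmitted
 * without further application involvement */
if (odp_ipsec_out_inline(&pkt, 1, &param, &inline_param) < 0) {
	/* ...handle submission error... */
}
-----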

Note that while operating in inline mode, asynchronous lookaside operations are
also permitted. This provides the application with additional flexibility if,
for example, some packets need additional handling that cannot be supported
directly with inline IPsec processing.

==== SA Lifetimes
A fundamental principle of good security is that the keying material
associated with sessions has a limited lifetime. In effect, keys grow "stale"
over time or due to being used to encrypt too much data. The metrics used
to limit effective SA lifetimes are:

* Duration (time)
* Usage (volume of traffic using the keys)

Associated with each of these metrics are "soft" and "hard" limits. When a
hard limit is reached, the SA is expired and cannot be used further. To support
graceful transition to a replacement SA, soft limits are used. A soft limit is
similar to a "low fuel" warning light on a car. It alerts the application that
the SA is nearing the end of its useful life and should be renegotiated even
as the SA continues to work normally.

ODP support for SA limits is based on packet/byte counts. Applications that
wish to use time-based SA limits may do so on their own using the timing
facilities that ODP provides. However, the application (especially with inline
IPsec processing) may not have explicit knowledge of the traffic
volumes associated with a given SA, so support for usage-based limits is
integrated into ODP IPsec support.

At `odp_ipsec_sa_create()` time, one of the fields in the
`odp_ipsec_sa_param_t` struct is the `odp_ipsec_lifetime_t` sub-structure.
This struct allows hard and/or soft limits to be specified in terms of total
bytes encrypted/decrypted, total packet count, or both. A limit specification
of 0 indicates no limit for that metric. If both are specified, the limit
is triggered by whichever is reached first. Given the defined behavior of hard vs.
soft limits, the soft limits, if used, should always be specified as lower
than the hard limits. These should be sufficiently lower to enable adequate
time to switch over to a replacement SA before the hard limit is reached.
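
For example, the lifetime portion of an SA creation request might be filled in
as in the following sketch (the limit values shown are arbitrary):

[source,c]
-----
/* Within the odp_ipsec_sa_param_t used at SA creation time */
sa_param.lifetime.soft_limit.bytes   = 900 * 1024 * 1024;  /* warn near 900 MB */
sa_param.lifetime.hard_limit.bytes   = 1024 * 1024 * 1024; /* expire at 1 GB */
sa_param.lifetime.soft_limit.packets = 0;                  /* 0 = no limit */
sa_param.lifetime.hard_limit.packets = 0;
-----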

As noted, when an SA hard limit is reached the SA immediately enters the
expired state and further attempts to use it fail with an
`odp_ipsec_packet_result_t` that indicates a hard expiration limit. When a soft
limit is reached for packets sent via `odp_ipsec_out_inline()`, this results
in an `ODP_EVENT_IPSEC_STATUS` event being sent to the application on the
queue associated with the SA that has reached the soft limit. This status
event has an `odp_ipsec_status_id_t` of `ODP_IPSEC_STATUS_WARN` with
`odp_ipsec_warn_t` bits set to indicate the type of soft expiration reached.
Receipt of this event alerts the application that the SA is nearing the end of
its useful life and that it should be replaced. It is the application's
responsibility to heed this warning. It is implementation-defined how many
such warnings are issued when a soft limit is exceeded (once, first N packets,
or all packets beyond the limit), so applications should be written to
allow for possible repeated warnings.

When operating in lookaside mode, expiration limits are carried as a warning
in the `odp_ipsec_op_status_t` section of the `odp_ipsec_packet_result_t` struct. The same
is true for inline inbound packets. When the soft limit is reached, these
packets will carry a warning flag indicating this condition.

==== SA Disablement and Destruction
When it is time to retire an SA, the application does so by first issuing a
call to the `odp_ipsec_sa_disable()` API. This call initiates termination
processing for an SA by stopping use of the SA for new operations while still
allowing those that are "in flight" to complete processing. Following this call
the application continues to receive and process IPsec events as normal.

Disable completion is indicated by the application seeing an event of type
`ODP_EVENT_IPSEC_STATUS` for this SA that contains an `odp_ipsec_status_id_t`
of `ODP_IPSEC_STATUS_SA_DISABLE`. For inbound SAs, receipt of this event means
that the application has seen all IPsec packets associated with this SA that
were pending at the time of the disable call. For outbound SAs, receipt of
this event means that the application has seen all result events associated
with packets sent via this SA.

Note that once a packet has been "seen" by the application, it becomes the
application's responsibility to ensure that it is fully processed before
attempting to destroy its associated SA. The disable call exists to give
the application assurance that there are no pending IPsec events for this
SA associated with packets that it has not seen before.

So after disabling the SA, the application can process pending packets
normally until it sees the disable status event. At that point it knows that
all pending packets that arrived before the disable have been seen and it is
safe for the application to destroy it via `odp_ipsec_sa_destroy()`, thus
completing the SA lifecycle.
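
A sketch of this sequence, extending the event loop shown earlier (variable
names are illustrative):

[source,c]
-----
odp_ipsec_status_t status;

odp_ipsec_sa_disable(sa);

/* ...keep processing events as usual; within the event loop: */

case ODP_EVENT_IPSEC_STATUS:
	if (odp_ipsec_status(&status, ev) == 0 &&
	    status.id == ODP_IPSEC_STATUS_SA_DISABLE &&
	    status.sa == sa) {
		odp_event_free(ev);
		odp_ipsec_sa_destroy(sa); /* no more pending events for this SA */
	}
	break;
-----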

include::users-guide-comp.adoc[]

== Traffic Manager \(TM)

The TM subsystem is a general packet scheduling system that accepts
packets from input queues and applies strict priority scheduling, weighted fair
queueing scheduling and/or bandwidth controls to decide which input packet
should be chosen as the next output packet and when this output packet can be
sent onwards.

A given platform supporting this TM API could support one or more pure hardware
based packet scheduling systems, one or more pure software based systems or one
or more hybrid systems - where because of hardware constraints some of the
packet scheduling is done in hardware and some is done in software.  In
addition, there may be further APIs beyond those described here for:

- controlling advanced capabilities supported by specific hardware, software
or hybrid subsystems
- dealing with constraints and limitations of
specific implementations.

The intention here is to be the simplest API that covers the vast majority of
packet scheduling requirements.

Often a TM subsystem's output(s) will be directly connected to a device's
physical (or virtual) output interfaces/links, in which case sometimes such a
system will be called an Egress Packet Scheduler or an Output Link Shaper,
etc.  While the TM subsystems configured by this API can be used in such a
way, this API equally well supports the ability to have the TM subsystem's
outputs connect to other TM subsystem input queues or general software queues
or even some combination of these three cases.

=== TM Algorithms

The packet scheduling/dropping techniques that can be applied to input
traffic include any mixture of the following:

- Strict Priority scheduling.
- Weighted Fair Queueing scheduling (WFQ).
- Bandwidth Shaping.
- Weighted Random Early Discard (WRED).

Note that Bandwidth Shaping is the only feature that can cause packets to be
"delayed", and Weighted Random Early Discard is the only feature (other than
input queues becoming full) that can cause packets to be dropped.

==== Strict Priority Scheduling

Strict Priority Scheduling (or just priority for short) is a technique where
input queues, and the packets from them, are assigned a priority value in the range 0
.. ODP_TM_MAX_PRIORITIES - 1.  At all times packets with the smallest priority
value will be chosen ahead of packets with a numerically larger priority value.
This is called strict priority scheduling because the algorithm strictly
enforces the scheduling of higher priority packets over lower priority
packets.

==== Bandwidth Shaping

Bandwidth Shaping (or often just Shaping) is the term used here for the idea of
controlling packet rates using single rate and/or dual rate token bucket
algorithms.  For single rate shaping a rate (the commit rate) and a "burst
size" (the maximum commit count) are configured.  Then an internal signed
integer counter called the _commitCnt_ is maintained such that if the _commitCnt_
is positive then packets are eligible to be sent. When such a packet is
actually sent the _commitCnt_ is decremented (usually by the packet's length, but one
could decrement by 1 for each packet instead).  The _commitCnt_ is then
incremented periodically based upon the configured rate, so that this technique
causes the traffic to be limited to the commit rate over the long term, while
allowing some ability to exceed this rate for a very short time (based on the
burst size) in order to catch up if the traffic input temporarily drops below
the commit rate.
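
The following pseudocode-style sketch (not part of the TM API, and illustrative
only) summarizes the single rate mechanism described above:

[source,c]
-----
/* Illustrative only: simplified single rate token bucket */
commitCnt += commit_rate * elapsed_time;  /* periodic credit */
if (commitCnt > max_commit_burst)
	commitCnt = max_commit_burst;     /* cap at the burst size */

if (commitCnt > 0 && next_packet_ready) {
	send_packet(pkt);
	commitCnt -= pkt_length;          /* or -= 1 for packet-based shaping */
}
-----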

Dual Rate Shaping is designed to allow certain traffic flows to fairly send
more than their assigned commit rate when the scheduler has excess capacity.
The idea is that it may be better to allow some types of traffic to send
more than their committed bandwidth rather than letting the TM outputs be idle.
The configuration of Dual Rate Shaping requires additionally a peak rate and a
peak burst size.  The peak rate must be greater than the related commit
rate, but the burst sizes have no similar constraint.  Also for every input
priority that has Dual Rate shaping enabled, there needs to be an additional
equal or lower priority (equal or higher numeric priority value) assigned.
Then if the traffic exceeds its commit rate but not its peak rate, the
"excess" traffic will be sent at the lower priority level - which by the
strict priority algorithm should cause no degradation of the higher priority
traffic, while allowing for less idle outputs.

==== Weighted Fair Queuing

Weighted Fair Queuing (WFQ) is used to arbitrate among multiple input
packets with the same priority.  Each input can be assigned a weight in the
range MIN_WFQ_WEIGHT..MAX_WFQ_WEIGHT (nominally 1..255) that affects the way
the algorithm chooses the next packet.  If all of the weights are equal AND all
of the input packets are the same length then the algorithm is equivalent to
round robin scheduling.  If all of the weights are equal but the packets have
different lengths then the WFQ algorithm will attempt to choose the packet such
that inputs each get a fair share of the bandwidth - in other words it
implements a weighted round robin algorithm where the weighting is based on
frame length.

When the input weights are not all equal and the input packet lengths vary then
the WFQ algorithm will schedule packets such that the packet with the lowest
"Virtual Finish Time" is chosen first.  An input packet's Virtual Finish Time
is roughly calculated based on the WFQ object's base Virtual Finish Time when
the packet becomes the first packet in its queue plus its frame length divided
by its weight.

virtualFinishTime = wfqVirtualTimeBase + (pktLength / wfqWeight)

In a system running at full capacity with no bandwidth limits - over the long
term - each input fan-in's average transmit rate will be the same fraction of
the output bandwidth as the fraction of its weight divided by the sum of all of
the WFQ fan-in weights.  Hence larger WFQ weights result in better "service"
for a given fan-in.

[source,c]
-----
totalWfqWeight = 0;
for (each fan-in entity - fanIn - feeding this WFQ scheduler)
	totalWfqWeight += fanIn->wfqWeight;

fanIn->avgTransmitRate = avgOutputRate * fanIn->wfqWeight / totalWfqWeight;
-----

==== Weighted Random Early Discard

The Weighted Random Early Discard (WRED) algorithm deals with the situation
where an input packet rate exceeds some output rate (including the case where
Bandwidth Shaping limits some output rates).  Without WRED enabled and
configured, the TM system will just implement a tail dropping scheme whereby
whichever packet is unlucky enough to arrive when a TM input queue is full
will be discarded regardless of priority or any other consideration. WRED
allows one to configure the system to use a better/fairer algorithm than simple
tail dropping.  It works by measuring the "fullness" of various packet queues
and converting this percentage into a probability of random packet dropping
with the help of some configurable parameters. Then a random number is picked
and together with the drop probability, a decision is made to accept the packet
or drop it. A basic parameterization of WRED requires three parameters:

- the maximum queue level (which could be either a maximum number of
     packets or a maximum amount of memory (i.e. bytes/buffers) used),
- a starting threshold - which is a number in the range 0..100
     representing a percentage of the maximum queue level at which the
     drop probability becomes non-zero,
- a drop probability - which is a number in the range 0..100
     representing a probability (0 means no drop and 100 means
     certain drop) - which is used when the queue is near 100% full.
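
An illustrative sketch of how these parameters might combine follows; the
actual drop probability curve used by an implementation is not specified by
the API:

[source,c]
-----
/* Illustrative only; not part of the TM API */
fullness = (100 * current_queue_level) / max_queue_level;

if (fullness <= start_threshold)
	drop_prob = 0;
else
	drop_prob = (drop_probability * (fullness - start_threshold)) /
		    (100 - start_threshold);

drop_packet = (rand() % 100) < drop_prob;
-----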

Note that all packet drops for a TM system only occur when a new packet arrives
at a given TM system input queue.  At that time either the WRED algorithm, if
enabled for this input queue, or the "input queue full" tail drop algorithm
will make a drop/no drop decision.  After this point, any packets not dropped
will at some point be sent out a TM output - assuming that the topology is
fully connected and enabled.

=== Hierarchical Scheduling and tm_nodes

This API supports the ability to do Hierarchical Scheduling whereby the
final scheduling decision is controlled by equal priority schedulers,
strict priority multiplexers, bandwidth shapers - at multiple levels - all
forming a tree rooted at a single egress object.  In other words, all
tm_queues and tm_nodes have the property that their logical "output" feeds
into one fan-in of a subsequent tm_node or egress object - forming a proper
tree.

.Hierarchical Scheduling
image::tm_hierarchy.svg[align="center"]

Multi-level/hierarchical scheduling adds both great control and significant
complexity.  Logically, despite the implication of the tm_node tree diagrams,
there are no queues between the levels of hierarchy.  Instead all packets are
held in their input queue, until such time that the totality of all of the
tm_nodes in the single path from input queue to output object agrees that this
packet should be the next to be chosen to leave the TM system through the
output object "portal".  Hence what flows from level to level is the "local
choice" of what packet/tm_queue should next be serviced.

==== tm_nodes

Tm_nodes are the main "entity"/object that a TM system is composed of. Each
tm_node is a mini-TM subsystem of its own, but the interconnection and
interplay of a multi-level "tree" of tm_nodes can allow the user to specify
some very sophisticated behaviors. Each tm_node can contain a set of schedulers
(one per strict priority level), a strict priority multiplexer, a bandwidth
shaper and a WRED component - or a subset of these.

.Traffic Manager Node
image::tm_node.svg[align="center"]

In its full generality a tm_node consists of a set of "fan-in" connections to
preceding tm_queues or tm_nodes.  The fan-in for a single tm_node can range
from one to many thousands.  This fan-in is divided first into a WFQ
scheduler per priority level. So if 4 priority levels are implemented by this
tm_node, there would be 4 WFQ schedulers - each with its own unique fan-in.
After the WFQ schedulers a priority chooser comes next - where it will always
choose the highest priority WFQ output available.  The output of the priority
chooser then feeds a bandwidth shaper function which then finally uses the
shaper's propagation table to determine its output packet and its priority.
This output could then be remapped via a priority map profile and then becomes
one of the input fan-in to perhaps another level of tm_nodes, and so on.

During this process it is important to remember that the bandwidth shaping
function never causes packets to be dropped.  Instead all packet drops occur
because of tm_queue fullness or from running the WRED algorithm at the time a new
packet attempts to be appended to the end of some input queue.

The WRED profile associated with a tm_node considers the entire set of
tm_queues feeding directly or indirectly into it as its measure of queue
fullness.

==== tm_queues

tm_queues are the second major type of "entity"/object that a TM system is
composed of.  All packets MUST first enter the TM system via some tm_queue.
Then logically, the head packets of all of the tm_queues are examined
simultaneously by the entire TM system, and ONE tm_queue is chosen to send its
head packet out of the TM system's egress.  Abstractly packets stay in the
tm_queue until they are chosen at which time they are instantly transferred
from tm_queue to/through the corresponding TM egress. It is also important to
note that packets in the same tm_queue MUST always stay in order.  In other
words, the second packet in a tm_queue must never leave the TM system through
a TM egress spigot before the first packet has left the system.  So tm_queue
packet order must always be maintained.

==== TM egress

Note that TM egress objects are NOT referred to as queues, because in many/most
cases they don't have multi-packet structure but instead are viewed as a
port/spigot through which the TM system schedules and finally transfers input
packets.

=== Ideal versus Actual Behavior

It is important to recognize the difference between the "abstract" mathematical
model of the prescribed behavior and real implementations. The model describes
the Ideal, theoretically desired behavior, but such an Ideal is generally
not practical to implement.  Instead, one understands that virtually all Real
TM systems attempt to approximate the Ideal behavior as given by the TM
configuration as best as they can - while still attaining high packet
processing performance.  The idea is that instead of trying too hard to be
"perfect" at the granularity of say microseconds, it may be better to instead
try to match the long term Ideal behavior over a much more reasonable period of
time like a millisecond.  It is generally better to have a stable
implementation that when averaged over a period of several milliseconds matches
the Ideal behavior very closely than to have an implementation that is perhaps
more accurate over a period of microseconds, but whose millisecond averaged
behavior drifts away from the Ideal case.

=== Other TM Concepts

==== Profiles

This specification often packages related TM system parameters into
records/objects called profiles.  These profiles can then be associated with
various entities like tm_nodes and tm_queues.  This reduces the amount of
storage needed to hold related parameters, makes it easy to re-use the same
parameter set over and over again, and allows a parameter set to be changed
once and have that change affect all of the entities with which it is
associated.
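
As an example, a shaper profile might be created once and applied to several
tm_queues, as in the following sketch. The field names used here (such as
`commit_bps` and `commit_burst`) follow one version of the TM API; names and
units may differ slightly between versions:

[source,c]
-----
odp_tm_shaper_params_t shaper_params;
odp_tm_shaper_t shaper_profile;

odp_tm_shaper_params_init(&shaper_params);
shaper_params.commit_bps   = 100 * 1000 * 1000; /* 100 Mbps commit rate */
shaper_params.commit_burst = 16 * 1024 * 8;     /* burst size */
shaper_params.dual_rate    = 0;                 /* single rate shaping */

shaper_profile = odp_tm_shaper_create("Shaper100M", &shaper_params);

/* The same profile may now be applied to many tm_queues or tm_nodes */
odp_tm_queue_shaper_config(tmq_A1, shaper_profile);
-----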

==== Absolute Limits versus `odp_tm_capability_t`

The TM API defines some constants representing the absolute maximum
settings for any TM system, though in most cases a TM system can (and should)
be created/instantiated with smaller values, since lower values will often
result in faster operation and/or less memory used.

==== Packet Marking

The Packet Marking API is used to mark the packet based upon the final packet
color assigned to it when it reaches the egress node.
This is an optional feature and, if available on the platform, is used to reflect
the packet color in the IPv4/IPv6 DiffServ field in accordance with https://www.ietf.org/rfc/rfc2474.txt[RFC 2474].
There are three different packet marking fields supported:

1. Assured Forwarding in accordance with https://www.ietf.org/rfc/rfc2597.txt[RFC 2597]: the DSCP is marked to
set the packet Drop Precedence according to the color, _i.e.,_ High Drop
Precedence for RED, Medium Drop Precedence for YELLOW, and the DSCP is left
unchanged if the color is GREEN.

2. Explicit Congestion Notification per https://www.ietf.org/rfc/rfc3168.txt[RFC 3168]: a router
encountering congestion can signal it by setting the lower 2 bits of the
DiffServ field to the "11" Congestion Experienced code, which will ultimately
reduce the transmission rate of the packet sender.

3. VLAN Drop Eligibility: the IEEE 802.1q VLAN tag header contains a DE (Drop
Eligibility) bit used to mark a packet for downstream switches; it is valid
only for Ethernet packets containing a VLAN tag.

RFC 3168 is only valid for TCP packets whereas RFC 2597 is valid for IPv4/IPv6
traffic.

The values are set per color and hence the implementation may support these
parameters only for specific colors. The `marking_colors_supported` field in the
capabilities structure can be used to check which colors are supported for
marking.

==== VLAN Marking

VLAN marking is used to enable setting the drop eligibility of a packet
based on the packet color. If drop eligibility is enabled then the
implementation will set the one bit VLAN Drop Eligibility Indicator (DEI)
field (but only for packets that already carry a VLAN tag) of a packet based
upon the final packet color assigned to the packet when it reaches the egress
node.  When drop_eligible_enabled is false, then the given color has
no effect on the VLAN fields.  See IEEE 802.1q for more details.
`vlan_marking_supported` boolean in capability structure indicates whether this
feature is supported by the implementation.

==== Explicit Congestion Notification Marking

The `odp_tm_ecn_marking()` function allows one to configure the TM
egress so that the two bit ECN subfield of the eight bit TOS field of an
IPv4 packet OR the eight bit Traffic Class (TC) field of an IPv6 packet can be
selectively modified based upon the final color assigned to the packet when it
reaches the egress.  Note that the IPv4 header checksum will be updated -
but only if the IPv4 TOS field actually changes as a result of this
setting or the `odp_tm_drop_prec_marking()` setting.  For IPv6, since there is
no header checksum, nothing needs to be done. If ECN is enabled for a
particular color then the ECN subfield will be set to _ECN_CE_, _i.e.,_ congestion
experienced.
`ecn_marking_supported` boolean in capability structure indicates whether this
feature is supported by the implementation.

==== Drop Precedence Marking

The Drop precedence marking allows one to configure the TM
egress to support Assured forwarding in accordance with https://www.ietf.org/rfc/rfc2597.txt[RFC 2597].
The Drop Precedence bits are contained within the six bit Differentiated
Services Code Point subfield of the IPv4 TOS field or the IPv6 Traffic
Class (TC) field.  Specifically the Drop Precedence sub-subfield can be
accessed with a DSCP bit mask of 0x06.  When enabled for a given color,
these two bits will be set to Medium Drop Precedence (value 0x4) if the
color is ODP_PACKET_YELLOW, set to High Drop Precedence (value 0x6) if
the color is ODP_PACKET_RED.

Note that the IPv4 header checksum will be updated - but only if the
IPv4 TOS field actually changes as a result of this setting or the
`odp_tm_ecn_marking()` setting.  For IPv6, since there is no header checksum,
nothing else needs to be done.
`drop_prec_marking_supported` boolean in capability structure indicates whether
this feature is supported by the implementation.
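
Putting the three marking features together, an application might enable them
as in the following sketch, guarded by the corresponding capability flags.
Here `tm` is an existing TM system handle and `tm_capa` is assumed to have
been filled in by an earlier capability query:

[source,c]
-----
if (tm_capa.ecn_marking_supported)
	odp_tm_ecn_marking(tm, ODP_PACKET_YELLOW, true);

if (tm_capa.drop_prec_marking_supported) {
	odp_tm_drop_prec_marking(tm, ODP_PACKET_YELLOW, true);
	odp_tm_drop_prec_marking(tm, ODP_PACKET_RED, true);
}

if (tm_capa.vlan_marking_supported)
	odp_tm_vlan_marking(tm, ODP_PACKET_RED, true);
-----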

=== Examples

.Create a tm_node chain for two nodes and associate the scheduler
[source,c]
-----
odp_tm_params_init(&tm_params); // (1)
tm_params.pktio = egress_pktio;
tm = odp_tm_create("Example TM", &tm_params);

/* create 5 input queues here - two at priority 1 and three at priority 2. */

odp_tm_queue_params_init(&queue_params);
queue_params.priority = 1;
tmq_A1 = odp_tm_queue_create(tm, &queue_params);
tmq_B1 = odp_tm_queue_create(tm, &queue_params);
queue_params.priority = 2;
tmq_A2 = odp_tm_queue_create(tm, &queue_params);
tmq_B2 = odp_tm_queue_create(tm, &queue_params);
tmq_C2 = odp_tm_queue_create(tm, &queue_params);
odp_tm_node_params_init(&node_params); // (2)
node_params.level = 1;
tm_node_1 = odp_tm_node_create(tm, "TmNode1", &node_params);
odp_tm_queue_connect(tmq_A1, tm_node_1); // (3)
odp_tm_queue_connect(tmq_B1, tm_node_1);
odp_tm_queue_connect(tmq_A2, tm_node_1);
odp_tm_queue_connect(tmq_B2, tm_node_1);
odp_tm_queue_connect(tmq_C2, tm_node_1);

/* It is IMPORTANT to understand that the following code does NOT create any
 * schedulers! In fact there is NO call to create a tm scheduler that exists
 * inside of a tm_node. Such an entity comes into existence as needed. What
 * this code does is create a scheduler PROFILE, which is effectively a
 * registered set of common scheduler parameters. NOTE that this uses some
 * pseudocode below instead of real C code so as to be more concise. */

odp_tm_sched_params_init(&sched_params); // (4)
sched_params.sched_modes = { ODP_TM_FRAME_BASED_WEIGHTS, ... };
sched_params.sched_weights = { 8, 8, 8,  ... };
sched_profile_RR = odp_tm_sched_create("SchedProfileRR", &sched_params);
sched_params.sched_modes = { ODP_TM_BYTE_BASED_WEIGHTS, ... };
sched_params.sched_weights = { 8, 8, 8, ... };
sched_profile_FQ = odp_tm_sched_create("SchedProfileFQ", &sched_params);
odp_tm_queue_sched_config(tm_node_1, tmq_A1, sched_profile_RR); // (5)
odp_tm_queue_sched_config(tm_node_1, tmq_B1, sched_profile_RR);
odp_tm_queue_sched_config(tm_node_1, tmq_A2, sched_profile_FQ);
odp_tm_queue_sched_config(tm_node_1, tmq_B2, sched_profile_FQ);
odp_tm_queue_sched_config(tm_node_1, tmq_C2, sched_profile_FQ);
odp_tm_node_params_init(&node_params); // (6)
node_params.level = 2;
tm_node_2 = odp_tm_node_create(tm, "TmNode2", &node_params);
odp_tm_node_connect(tm_node_1, tm_node_2); // (7)
odp_tm_sched_params_init(&sched_params); // (8)
sched_params.sched_modes = { ODP_TM_BYTE_BASED_WEIGHTS, ... };
sched_params.sched_weights = { 8, 16, 24,  ... };
sched_profile_WFQ = odp_tm_sched_create("SchedProfileWFQ", &sched_params);
odp_tm_node_sched_config(tm_node_2, tm_node_1, sched_profile_WFQ); // (9)
-----
<1> Create a tm system, since that is a precursor to creating tm_queues.
<2> Create a Node #1
<3> Connect the Queue(s) to the Node -> odp_tm_queue_connect()
<4> Create two sets of scheduler params - one implementing Round Robin (since
all weights are the same - namely 8) and the second implementing Fair Queuing.

<5> Associate the Scheduler to the Node and the Queue(s) -> odp_tm_queue_sched_config()
Use the Round Robin profile for the priority 1 fan-ins and Fair Queuing
for the priority 2 fan-ins.

<6> Create a second Node #2
<7> Connect the first Node #1 to the second Node #2 -> odp_tm_node_connect()
<8> Create a Scheduler Profile
<9> Associate the Scheduler to the Node #1 and #2 -> odp_tm_node_sched_config()

== Classification (CLS)

ODP is a framework for software-based packet forwarding/filtering applications,
and the purpose of the Packet Classification API is to enable applications to
program the platform hardware or software implementation to assist in
prioritization, classification and scheduling of each packet, so that the
software application can run faster, scale better and adhere to QoS
requirements.

The following API abstraction is not modeled after any existing product
implementation, but is instead defined in terms of what a typical data-plane
application may require from such a platform, without sacrificing simplicity and
while avoiding ambiguity. Certain terms that are used within the context of
existing products in relation to packet parsing and classification, such as
access lists, are avoided so as not to suggest any relationship
between the abstraction used within this API and any particular manner in which
they may be implemented in hardware.

=== Functional Description

Following is the functionality that is required of the classification API, and
its underlying implementation. The details and order of the following steps
are informative, and are only intended to help convey the functional scope of a
classifier and provide context for the API. In reality, implementations may
execute many of these steps concurrently, or in different order while
maintaining the evident dependencies:

1. Apply a set of classification rules to the header of an incoming packet,
identify the header fields, e.g. ethertype, IP version, IP protocol, transport
layer port numbers, IP DiffServ, VLAN id, 802.1p priority.

2. Store these fields as packet meta data for application use, and for the
remainder of parser operations. The odp_pktio is also stored as one of the meta
data fields for subsequent use.

3. Compute an odp_cos (Class of Service) value from a subset of supported fields
from 1) above.

4. Based on the odp_cos from 3) above, select the odp_queue through which the
packet is delivered to the application.

5. Validate the packet data integrity (checksums, FCS)  and correctness (e.g.,
length fields) and store the validation result, along with optional error layer
and type indicator, in packet meta data. Optionally, if a packet fails
validation, override the odp_cos selection in step 3 to a class of service
designated for errored packets.

6. Based on the odp_cos from 3) above, select the odp_buffer_pool that should be
used to acquire a buffer to store the packet data and meta data.

7. Allocate a buffer from odp_buffer_pool selected in 6) above and logically[1]
store the packet data and meta data to the allocated buffer, or in accordance
with class-of-service drop policy and subject to pool buffer availability,
optionally discard the packet.

8. Enqueue the buffer into the odp_queue selected in 4) above.

The above is an abstract description of the classifier functionality, and may be
applied to a variety of applications in many different ways. The ultimate
meaning of how this functionality applies to an application also depends on
other ODP modules, so the above may not complete a full depiction. For instance,
the exact meaning of priority, which is a per-queue attribute is influenced by
the ODP scheduler semantics, and the system behavior under stress depends on the
ODP buffer pool module behavior.

For the sole purpose of illustrating the above abstract functionality, here is
an example of a Layer-2 (IEEE 802.1D)  bridge application: Such a forwarding
application that also adheres to IEEE 802.1p/q priority, which has 8 traffic
priority levels, might create 8 odp_buffer_pool instances, one for each PCP
priority level, and 8 odp_queue instances one per priority level. Incoming
packets will be inspected for a VLAN header; the PCP field will be extracted,
and used to select both the pool and the queue. Because each queue will be
assigned a priority value, the packets with highest PCP values will be scheduled
before any packet with a lower PCP value. Also, in a case of congestion, buffer
pools for lower priority packets will be depleted earlier than the pools
containing packets of the high priority, and hence the lower priority packets
will be dropped (assuming that is the only flow control method that is supported
in the platform) while higher priority packets will continue to be received into
buffers and processed.

=== Class of Service Creation and Binding

To program the classifier, a class-of-service instance must be created, which
will contain the packet filtering resources that it may require. All subsequent
calls refer to one or more of these resources.

Each class of service instance must be associated with a single queue or queue
group, which will be the destination of all packets matching that particular
filter. The queue assignment is implemented as a separate function call such
that the queue may be modified at any time, without tearing down the filters
that define the class of service. In other words, it is possible to change the
destination queue for a class of service defined by its filters quickly and
dynamically.

Optionally, on platforms that support multiple packet buffer pools, each class
of service may be assigned a different pool such that when buffers are exhausted
for one class of service, other classes are not negatively impacted and continue
to be processed.

=== Default packet handling

There is an `odp_cos_t` assigned to each port with the
`odp_pktio_default_cos_set()` function, which serves as the default
class-of-service for all packets received from an ingress port,
that do not match any of the filters defined subsequently.
At minimum this default class-of-service must have a queue and a
buffer pool assigned to it on platforms that support multiple packet buffer
pools. Multiple odp_pktio instances (i.e., multiple ports) may each have their
own default odp_cos, or may share a odp_cos with other ports, based on
application requirements.
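
A sketch of creating such a default class-of-service and attaching it to a
pktio follows; it assumes `default_queue`, `default_pool`, and `pktio` have
already been created:

[source,c]
-----
odp_cls_cos_param_t cos_param;
odp_cos_t default_cos;

odp_cls_cos_param_init(&cos_param);
cos_param.queue = default_queue;  /* destination queue for matching packets */
cos_param.pool  = default_pool;   /* pool used to store matching packets */

default_cos = odp_cls_cos_create("DefaultCoS", &cos_param);

/* All packets arriving on pktio that match no other filter land here */
odp_pktio_default_cos_set(pktio, default_cos);
-----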

=== Error packet handling

Error class of service is assigned to an ingress port using the function
`odp_pktio_error_cos_set()`. All the packets received with error from this
specific ingress port are assigned to this error class-of-service.
At minimum this error class-of-service must have a queue and a buffer pool
assigned to it. Multiple pktio instances (_i.e.,_ multiple ports) may each have
their own error class of service, or may share an error CoS with other ports,
based on application requirements.

=== Packet dropping

Each class of service has a `drop_policy` configured during creation. The
valid values are ODP_COS_DROP_POOL and ODP_COS_DROP_NEVER. If the `drop_policy`
is set to ODP_COS_DROP_POOL then the packets assigned to the CoS follow the
drop policy of the associated pool, _i.e.,_ depending on the Random Early Discard
or any other configuration of the pool the packet might get dropped. If the
`drop_policy` is set to ODP_COS_DROP_NEVER then the Random Early Discard of the
pool is ignored.

If, during creation of the class of service, the pool or queue is set as
invalid using ODP_POOL_INVALID or ODP_QUEUE_INVALID, then any packets assigned
to the specific CoS are dropped.

=== Packet Classification

For each odp_pktio port, the API allows the assignment of a class-of-service to
a packet using one of  three methods:

1. The packet may be assigned a specific class-of-service based on its Layer-2
(802.1P/802.1Q VLAN tag) priority field. Since the standard field defines 8
discrete priority levels, the API allows assigning an odp_cos to each of these
priority levels with the `odp_cos_with_l2_priority()` function.

2. Similarly, a class-of-service may be assigned using the Layer-3 (IP DiffServ)
header field. The application supplies an array of odp_cos values that covers
the entire range of the standard protocol header field, where array elements do
not need to contain unique values. There is also a need to specify if Layer-3
priority takes precedence over Layer-2 priority in a packet with both headers
present.

3. Additionally, the application may also program a number of pattern matching
rules that assign a class-of-service for packets with header fields matching
specified values. The field-matching rules take precedence over the previously
described priority-based assignment of a class-of-service. Using these matching
rules the application should be able, for example, to identify all packets
containing VoIP traffic based on the protocol being UDP and specific
destination or source port numbers, and appropriately assign these packets a
class-of-service that maps to a higher priority queue, assuring voice packets a
low and bounded latency.

=== Packet meta data Elements

Here are the specific information elements that are stored within the
packet meta data structure:

* Protocol fields that are decoded and extracted by the parsing phase

* The pool identifier that is selected for the packet

* The ingress port identifier

* The result of packet validation, including an indication of the type of error
detected, if any

The ODP packet API module provides accessors for retrieving the above meta
data fields from the container buffer in an implementation-independent manner.

===  Example configuration

CoS configuration can be best illustrated by drawing a tree, where each CoS is
the vertex, and each link between any two vertices is a PMR. The root node for
the tree is the default CoS which is attached to the pktio interface.  All of
the CoS vertices can be final for some packets, if these packets do not match
any of the link PMRs.

.Let us consider the below configuration
odp_pktio_default_cos_set(odp_pktio_t pktio, odp_cos_t default_cos); +

pmr1 = odp_cls_pmr_create(pmr_match1, default_cos,  cos1); +
pmr2 = odp_cls_pmr_create(pmr_match2, default_cos,  cos2); +
pmr3 = odp_cls_pmr_create(pmr_match3, default_cos,  cos3); +

pmr11 = odp_cls_pmr_create(pmr_match11, cos1,  cos11); +
pmr12 = odp_cls_pmr_create(pmr_match12, cos1,  cos12); +

pmr21 = odp_cls_pmr_create(pmr_match11, cos2,  cos21); +
pmr31 = odp_cls_pmr_create(pmr_match11, cos3,  cos31); +

The above configuration DOES imply order - a packet that matches pmr_match1 will
then be applied to pmr_match11 and pmr_match12, and as a result could terminate
with either cos1, cos11, or cos12. In this case the packet was subjected to two
match attempts in total.

The remaining two lines illustrate how a packet that matches pmr_match11 could
end up with either cos11, cos21 or cos31, depending on whether it matches
pmr_match1, pmr_match2 or pmr_match3.

=== Practical example

Let's look at DNS packets: these are identified by using UDP port 53, but each
UDP packet may run atop IPv4 or IPv6, and in turn an IP packet might be
received as either multicast or unicast.

.Very simply, we can create these PMRs
PMR-L2 = match all multicast/broadcast packets based on DMAC address +
PMR_L3_IP4 = match all IPv4 packets +
PMR_L3_IP6 = match all IPv6 packets +
PMR_L4_UDP = match all UDP packets +
PMR_L4_53 = match all packets with dest port = 53 +

[source,c]
-----
odp_cls_pmr_create(PMR_L2, default_cos, default_cos_mc);
odp_cls_pmr_create(PMR_L3_IP4, default_cos, default_cos_ip4_uc);
odp_cls_pmr_create(PMR_L3_IP6, default_cos, default_cos_ip6_uc);

odp_cls_pmr_create(PMR_L3_IP4, default_cos_mc, default_cos_ip4_mc);
odp_cls_pmr_create(PMR_L3_IP6, default_cos_mc, default_cos_ip6_mc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip4_uc, cos_udp4_uc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip4_mc, cos_udp4_mc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip6_uc, cos_udp6_uc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip6_mc, cos_udp6_mc);

odp_cls_pmr_create(PMR_L4_53, cos_udp4_uc, dns4_uc);
odp_cls_pmr_create(PMR_L4_53, cos_udp4_mc, dns4_mc);
odp_cls_pmr_create(PMR_L4_53, cos_udp6_uc, dns6_uc);
odp_cls_pmr_create(PMR_L4_53, cos_udp6_mc, dns6_mc);
-----

In this case, a packet may change CoS between 0 and 5 times, meaning that up to
5 PMRs may be applied in series, in the order implied by the CoS tree, starting
from the default CoS of the pktio interface.

Another interesting point is that an implementation will probably impose a
limit on how many PMRs can be applied to a packet in series, so in the above
example, if an implementation limit on the number of consecutive classification
steps is 4, then all the DNS packets may only reach cos_udp?_?c set of vertices.
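
For illustration, one of the PMRs above (matching UDP destination port 53)
might be constructed as in the following sketch; details such as the byte
ordering of the match value are omitted here and should be checked against the
classification API specification:

[source,c]
-----
odp_pmr_param_t pmr_param;
odp_pmr_t pmr;
uint16_t dport = 53;     /* DNS */
uint16_t mask  = 0xffff;

odp_cls_pmr_param_init(&pmr_param);
pmr_param.term        = ODP_PMR_UDP_DPORT;
pmr_param.match.value = &dport;
pmr_param.match.mask  = &mask;
pmr_param.val_sz      = sizeof(dport);

/* Packets already assigned to cos_udp4_uc that also match UDP dest port 53
 * are reassigned to dns4_uc */
pmr = odp_cls_pmr_create(&pmr_param, 1, cos_udp4_uc, dns4_uc);
-----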

== Utilities and examples

=== PcapNg capture
If compiled using `--enable-pcapng-support` ODP will offer packet capturing
functionality in PcapNg format. If the /var/run/odp directory exists prior to
launching the application ODP will create a fifo for each NIC queue.
Queue naming will be of the following format: *<odp global pid>-<NIC
name>-flow-<queue number>*. The Linux `dd` utility can be used for capturing a
sample of the live stream from the fifo. Killing either the application or dd
will stop the capturing process.

. `./configure --enable-pcapng-support`
. `sudo mkdir /var/run/odp`
. `sudo ./example/generator/odp_generator -I enp2s0 -mu --dstmac
A0:F6:FD:AE:62:6C --dstip 192.168.49.20 --srcmac 2c:56:dc:9a:8f:06 --srcip
192.168.49.4 -i0 -w1`
. `sudo dd if=/var/run/odp/26737-enp2s0-flow-0 of=~/test.pcap`
. `ctrl^c`
. `wireshark ~/test.pcap`

== Glossary
[glossary]
worker thread::
    A worker is a type of ODP thread. It will usually be isolated from
    the scheduling of any host operating system and is intended for fast-path
    processing with a low and predictable latency. Worker threads will not
    generally receive interrupts and will run to completion.
control thread::
    A control thread is a type of ODP thread. It will be isolated from the host
    operating system housekeeping tasks but will be scheduled by it and may
    receive interrupts.
ODP instantiation process::
    The process calling `odp_init_global()`, which is probably the
    first process which is started when an ODP application is started.
    There is one single such process per ODP instantiation.
thread::
    The word thread (without any further specification) refers to an ODP
    thread.
ODP thread::
    An ODP thread is a flow of execution that belongs to ODP:
    Any "flow of execution" (i.e. OS process or OS thread) calling
    `odp_init_global()`, or `odp_init_local()` becomes an ODP thread.
    This definition currently limits the number of ODP instances on a given
    machine to one. In the future `odp_init_global()` will return something
    like an ODP instance reference and `odp_init_local()` will take such
    a reference in parameter, allowing threads to join any running ODP instance.
    Note that, in a Linux environment an ODP thread can be either a Linux
    process or a Linux thread (i.e. a Linux process calling `odp_init_local()`
    will be referred to as an ODP thread, not an ODP process).
event::
    An event is a notification that can be placed in a queue.
queue::
    A communication channel that holds events.