L2: Session & Lifecycle Layer

Purpose

L2 manages the lifecycle of a task session — from creation through execution to termination. It defines the session state machine, standard event types for execution visibility, and mechanisms for checkpoint and recovery.

Session

A session represents a single task execution lifecycle. It is created when a task passes L3 safety validation and destroyed when the task reaches a terminal state.

Session Properties

Property       Type              Description
session_id     string (UUID v4)  Unique session identifier, generated by the callee
state          enum              Current lifecycle state
created_at     timestamp         When the session was created
updated_at     timestamp         Last state transition time
session_token  string            L3-issued token authorizing this session
risk_level     enum (R1–R5)      Risk level assessed by L3
metadata       object            Caller-provided and callee-assigned metadata

Session State Machine

                     TaskSubmit received
                           │
                           ▼
                       PENDING
                       │     │
               L3 Approve   L3 Reject
                       │     │
                       ▼     ▼
                   RUNNING  REJECTED ─── (terminal)
                   │  │  │
          ┌────────┘  │  └────────┐
          ▼           │           ▼
       PAUSED     RUNNING     ABORTING
          │                       │
          └───► RUNNING           ▼
                              ABORTED ──── (terminal)

               RUNNING
               │     │
               ▼     ▼
         COMPLETED  FAILED ──── (terminal)
         (terminal)

State Definitions

State      Description                                            Trigger
PENDING    Task received, awaiting L3 safety check                task_submit received
REJECTED   Task failed L3 safety check (terminal)                 L3 rejects task
RUNNING    Task is actively being executed by the callee harness  L3 approves task
PAUSED     Execution temporarily suspended, session preserved     Callee decision or checkpoint
ABORTING   Abort requested, callee is cleaning up                 Caller sends abort
ABORTED    Execution aborted, cleanup complete (terminal)         Callee finishes abort cleanup
COMPLETED  Task finished successfully (terminal)                  Callee reports success
FAILED     Task terminated due to unrecoverable error (terminal)  Callee reports failure

Valid State Transitions

From      To         Trigger
PENDING   RUNNING    L3 safety check passed
PENDING   REJECTED   L3 safety check failed
RUNNING   PAUSED     Callee pauses execution
RUNNING   ABORTING   Caller sends abort
RUNNING   COMPLETED  Task finishes successfully
RUNNING   FAILED     Unrecoverable error
PAUSED    RUNNING    Callee resumes execution
PAUSED    ABORTING   Caller sends abort while paused
ABORTING  ABORTED    Callee finishes cleanup
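
The transition table above can be encoded directly as a lookup. The following is a minimal sketch, not a reference implementation: the names State, VALID_TRANSITIONS, TERMINAL_STATES, and transition are hypothetical and chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    REJECTED = "REJECTED"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    ABORTING = "ABORTING"
    ABORTED = "ABORTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Allowed moves, copied row by row from the Valid State Transitions table.
VALID_TRANSITIONS = {
    State.PENDING: {State.RUNNING, State.REJECTED},
    State.RUNNING: {State.PAUSED, State.ABORTING, State.COMPLETED, State.FAILED},
    State.PAUSED: {State.RUNNING, State.ABORTING},
    State.ABORTING: {State.ABORTED},
}

# Terminal states have no outgoing transitions.
TERMINAL_STATES = {State.REJECTED, State.ABORTED, State.COMPLETED, State.FAILED}


def transition(current: State, target: State) -> State:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current.name} -> {target.name}")
    return target
```

Keeping the table as data means any attempt to leave a terminal state is rejected without extra checks, because terminal states simply have no entry in the lookup.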

Event Stream

During execution, the callee emits a stream of events to provide visibility into progress. All events are delivered through the L1 event channel.

Event Envelope

Events are carried in the payload of an L1 message with type: "event":

{
  "hcp_version": "1.0",
  "message_id": "...",
  "timestamp": "...",
  "session_id": "...",
  "type": "event",
  "payload": {
    "event_type": "progress",
    "sequence": 42,
    "data": { }
  }
}

Field       Type           Description
event_type  string (enum)  Type of event
sequence    integer        Monotonically increasing sequence number within the session
data        object         Event-type-specific content
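
To illustrate how a consumer might read this envelope, here is a hedged sketch that parses an L1 message and extracts the payload fields listed above. The Event dataclass and parse_event function are hypothetical names; the validation shown is an assumption about reasonable checks, not something the protocol mandates.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Event:
    session_id: str
    event_type: str
    sequence: int
    data: Dict[str, Any]


def parse_event(raw: str) -> Event:
    """Parse a JSON-encoded L1 message whose type is "event"."""
    msg = json.loads(raw)
    if msg.get("type") != "event":
        raise ValueError("not an event message")
    payload = msg["payload"]
    sequence = payload["sequence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(sequence, int) or isinstance(sequence, bool):
        raise ValueError("sequence must be an integer")
    data = payload.get("data", {})
    if not isinstance(data, dict):
        raise ValueError("data must be an object")
    return Event(msg["session_id"], str(payload["event_type"]), sequence, data)
```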

Standard Event Types

Event Type           Description                                         Data Fields
session_created      Session has been created and execution is starting  state, risk_level, session_token
state_changed        Session state has transitioned                      from_state, to_state, reason
progress             Execution progress update                           stage (string), percent (number, optional), message (string)
intermediate_result  Partial or interim result available                 result_type, data, is_partial
log                  Execution log entry                                 level (info/warn/error), message, details
warning              Non-fatal warning                                   code, message, details
error                Error occurred but execution continues              code, message, recoverable
checkpoint_created   A checkpoint was saved                              checkpoint_id, description, resumable
session_closed       Session has reached a terminal state                final_state, reason

Event Ordering

  • Events MUST be emitted with strictly increasing sequence numbers within a session.
  • Consumers MUST process events in sequence order.
  • If events arrive out of order (due to transport characteristics), the consumer SHOULD buffer and reorder.
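
One way a consumer could buffer and reorder is sketched below. It assumes sequence numbers within a session are contiguous (each event is exactly one greater than the previous), which is stronger than "strictly increasing", and it takes the first expected sequence number as a parameter because the starting value is not stated here. ReorderBuffer is a hypothetical name.

```python
from typing import Dict, List


class ReorderBuffer:
    """Holds out-of-order events and releases them in sequence order."""

    def __init__(self, next_sequence: int = 1) -> None:
        # Assumption: the first sequence number is supplied by the caller.
        self.next_sequence = next_sequence
        self._pending: Dict[int, dict] = {}

    def push(self, sequence: int, event: dict) -> List[dict]:
        """Add one event; return every event that is now ready, in order."""
        if sequence < self.next_sequence:
            return []  # already released earlier, nothing new to hand out
        self._pending[sequence] = event
        ready = []
        # Drain the buffer for as long as the next expected event is present.
        while self.next_sequence in self._pending:
            ready.append(self._pending.pop(self.next_sequence))
            self.next_sequence += 1
        return ready
```

A production consumer would likely also bound the buffer size or add a wait timeout, so that a permanently missing event cannot stall the session indefinitely.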

Integration with L1 Stream Continuity

The sequence number is the L2 mechanism that works with L1’s AMQP ACK-based stream continuity (see L1-transport-encoding.md — Stream Continuity) to provide lossless, deduplicated, ordered event delivery.

Caller-side per-session tracking:

The caller MUST maintain a last_processed_sequence value for each active session. This value is used for:

  1. Deduplication on redelivery: When L1 redelivers messages after a caller crash (AMQP requeue), the caller checks event.sequence <= last_processed_sequence — if true, the event has already been processed and is skipped (but still ACKed to advance the queue).

  2. Gap detection: If the caller receives sequence N+2 without having processed N+1, it has detected a gap. Gaps should not occur under normal AMQP delivery; the check is a safeguard. On gap detection, the caller SHOULD log a warning and continue processing (the missing event may arrive via redelivery).

  3. Recovery after restart: On restart, the caller loads last_processed_sequence from persistent storage (if available) to resume deduplication. If not persisted, idempotent event processing (as recommended by L1) handles redeliveries.
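
The deduplication and gap-detection rules above can be sketched as follows. This is an illustrative in-memory version: SequenceTracker, its handle method, and the logger name are hypothetical, and persistence of last_processed_sequence is left out.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("l2.caller")  # hypothetical logger name


class SequenceTracker:
    """Tracks last_processed_sequence per session for dedup and gap checks."""

    def __init__(self) -> None:
        self.last_processed: Dict[str, int] = {}

    def handle(self, session_id: str, sequence: int,
               process: Callable[[], None]) -> bool:
        """Process the event unless it is a duplicate.

        Returns True if the event was processed, False if it was skipped.
        In both cases the caller should ACK the delivery afterwards.
        """
        last = self.last_processed.get(session_id)
        if last is not None:
            if sequence <= last:
                return False  # redelivered duplicate: skip, but still ACK
            if sequence > last + 1:
                # Gap: log a warning and continue, as recommended above.
                logger.warning("session %s: gap between %d and %d",
                               session_id, last, sequence)
        process()
        self.last_processed[session_id] = sequence
        return True
```

Note that in this sketch an event that arrives after a later one has already been processed is treated as a duplicate and skipped. Buffering out-of-order events before this check avoids dropping such late arrivals.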

Interaction model:

L1 (AMQP)                    L2 (Session)                  Caller Application
    │                              │                              │
    │  deliver event               │                              │
    │  (delivery_tag=7)            │                              │
    │─────────────────────────────►│                              │
    │                              │  parse session_id, sequence  │
    │                              │  check: seq > last_processed?│
    │                              │                              │
    │                              │  ├─ Yes: forward to app ────►│ process event
    │                              │  │  update last_processed    │
    │                              │  │                           │
    │  basic.ack(delivery_tag=7) ◄─┤  │  signal ACK to L1        │
    │                              │  │                           │
    │                              │  └─ No (duplicate): skip     │
    │  basic.ack(delivery_tag=7) ◄─┤     ACK without processing  │
    │                              │                              │

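Key principle: L1 ensures that delivered messages are not lost (AMQP durable delivery, manual ACK, and requeue). L2 ensures that events are not processed twice (sequence-based deduplication) and are handled in order (sequence-based ordering). Delivery itself is at-least-once; combined with L2 deduplication, the application sees effectively exactly-once processing, provided last_processed_sequence survives restarts or event handling is idempotent.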

Checkpoint & Recovery

For long-running tasks, checkpoints allow execution to be resumed after interruption.

Checkpoint

A checkpoint is a snapshot of the callee’s execution state at a given point. When a checkpoint is created, the callee emits a checkpoint_created event.

{
  "event_type": "checkpoint_created",
  "sequence": 100,
  "data": {
    "checkpoint_id": "ckpt-001",
    "description": "Completed phase 1: material preparation",
    "resumable": true,
    "created_at": "2025-01-15T10:00:00.000Z"
  }
}

Recovery

If a callee harness fails and restarts, it MAY resume from the latest checkpoint. Recovery is an internal concern of the callee — the protocol does not define how checkpoints are stored or how state is reconstructed. From the caller’s perspective:

  1. The event stream may have a gap (events between failure and recovery are lost).
  2. The callee SHOULD emit a state_changed event with reason: "recovered_from_checkpoint" upon recovery.
  3. Execution continues from the checkpoint, and new events are appended to the same session.
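
Because checkpoint storage is left to the callee, the following is only a sketch of one possible selection rule: given the checkpoint_created payloads recorded for a session, pick the most recent one marked resumable. The function name and the idea of keeping payloads in a list are assumptions made for illustration.

```python
from typing import Iterable, Optional


def latest_resumable_checkpoint(payloads: Iterable[dict]) -> Optional[str]:
    """Return the checkpoint_id of the newest resumable checkpoint, if any."""
    best_seq = None
    best_id = None
    for payload in payloads:
        if payload.get("event_type") != "checkpoint_created":
            continue
        data = payload.get("data", {})
        if not data.get("resumable"):
            continue  # non-resumable checkpoints cannot be used for recovery
        seq = payload["sequence"]
        if best_seq is None or seq > best_seq:
            best_seq, best_id = seq, data["checkpoint_id"]
    return best_id
```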

Session Timeout

  • The callee SHOULD enforce a maximum session duration, derived from the task’s max_duration constraint (L4) or a system default.
  • If execution exceeds the timeout, the callee transitions the session to FAILED with reason "timeout".
  • Idle sessions (no events emitted for a configurable period) MAY be cleaned up by the callee.
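
A minimal sketch of the duration check follows. It assumes max_duration is expressed in seconds and invents a DEFAULT_MAX_DURATION_S value to stand in for the system default; neither is specified here. The returned dictionary mirrors the data fields of session_closed.

```python
from typing import Optional

DEFAULT_MAX_DURATION_S = 3600.0  # hypothetical system default, in seconds


def timeout_close_data(created_at: float, now: float,
                       max_duration_s: Optional[float] = None) -> Optional[dict]:
    """Return session_closed data if the session has timed out, else None.

    created_at and now are timestamps in seconds on the same clock.
    """
    limit = max_duration_s if max_duration_s is not None else DEFAULT_MAX_DURATION_S
    if now - created_at > limit:
        return {"final_state": "FAILED", "reason": "timeout"}
    return None
```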

Abort Protocol

When the caller sends an abort message:

  1. The session transitions to ABORTING.
  2. The callee begins cleanup: stops LLM calls, terminates running tools, releases resources.
  3. Cleanup SHOULD be bounded by a callee-defined abort timeout.
  4. Upon completion, the session transitions to ABORTED.
  5. The callee emits session_closed with final_state: "ABORTED".

The callee MUST make a best-effort attempt at cleanup but is not required to guarantee resource release within any specific timeframe. The caller SHOULD treat ABORTING as a transient state that will resolve to ABORTED.
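
The abort sequence above might be sketched on the callee side as follows. This is an assumption-laden illustration: run_abort, the emit callback, and the reason strings "abort_requested" and "cleanup_finished" are hypothetical, and sequence numbers are left to whatever emit does. The timeout is checked between cleanup steps only, so a step that is already running is not interrupted.

```python
import time
from typing import Callable, Iterable


def run_abort(from_state: str,
              cleanup_steps: Iterable[Callable[[], None]],
              abort_timeout_s: float,
              emit: Callable[[dict], None]) -> str:
    """Run best-effort cleanup within a time budget and return the final state."""
    emit({"event_type": "state_changed",
          "data": {"from_state": from_state, "to_state": "ABORTING",
                   "reason": "abort_requested"}})
    deadline = time.monotonic() + abort_timeout_s
    for step in cleanup_steps:
        if time.monotonic() >= deadline:
            break  # budget spent: stop cleaning up and finish the abort
        try:
            step()
        except Exception:
            continue  # best effort: one failed step does not block the rest
    emit({"event_type": "state_changed",
          "data": {"from_state": "ABORTING", "to_state": "ABORTED",
                   "reason": "cleanup_finished"}})
    emit({"event_type": "session_closed",
          "data": {"final_state": "ABORTED", "reason": "abort_requested"}})
    return "ABORTED"
```

Passing from_state explicitly covers both entry points into ABORTING, since an abort may arrive while the session is RUNNING or PAUSED.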