Nico 3d71c651fc v0.10.0: test framework with markdown testcases and web UI

- testcases/*.md: declarative test definitions (send, expect_response,
  expect_state, expect_actions, action)
- runtime_test.py: standalone runner + pytest integration via conftest.py
- /tests route: web UI showing last run results from results.json
- /api/tests: serves results JSON
- Two initial testcases: counter_state (UI actions) and pub_conversation
  (multi-turn, language switch, tool use, memorizer state)
- pub_conversation: 19/20 passed on first run
- Fix nm-text vertical overflow in node metrics bar

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 15:36:19 +01:00

1.2 KiB


Pub Conversation

Tests a multi-turn conversation in a social scenario, covering context tracking, language switching, and memorizer state updates.

Setup

clear history

Steps

1. Set the scene

send: Hey, Tina and I are heading to the pub tonight
expect_response: length > 10
expect_state: situation contains "pub" or "Tina"

2. Language switch to German

send: Wir sind jetzt im Biergarten angekommen
expect_response: length > 10
expect_state: language is "de" or "mixed"

3. Context awareness

send: Was sollen wir bestellen?
expect_response: length > 10
expect_state: topic contains "bestell" or "order" or "pub" or "Biergarten"

4. Tina speaks

send: Tina says: I'll have a Hefeweizen please
expect_response: length > 10
expect_state: facts any contains "Tina" or "Hefeweizen"
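
The `expect_state` lines in steps 1 to 4 use three operators: `is`, `contains`, and `any contains`, each followed by quoted alternatives joined with `or`. Below is a minimal sketch of how such an expression might be evaluated against a state dictionary. The function name `check_state`, the dictionary shape, and the case-insensitive substring matching are all assumptions made for illustration; the actual logic in `runtime_test.py` may differ.

```python
import re

def check_state(expr: str, state: dict) -> bool:
    """Evaluate one expect_state expression (hypothetical helper).

    Example expression: 'situation contains "pub" or "Tina"'.
    """
    m = re.match(r'(\w+)\s+(any contains|contains|is)\s+(.+)$', expr.strip())
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    field, op, rest = m.groups()
    # Quoted alternatives joined by 'or' become a list of candidate strings.
    options = re.findall(r'"([^"]*)"', rest)
    value = state.get(field)
    if op == "is":
        # Assumption: 'is' means exact, case-sensitive equality with one option.
        return value in options
    if op == "contains":
        # Assumption: substring match, ignoring case.
        text = str(value or "").lower()
        return any(opt.lower() in text for opt in options)
    # 'any contains': the field is a list, and at least one item
    # must contain at least one of the options.
    items = [str(item).lower() for item in (value or [])]
    return any(opt.lower() in item for item in items for opt in options)
```

A missing field evaluates to `False` here rather than raising, which keeps a failed expectation distinct from a malformed one.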

5. Ask for time (tool use)

send: wie spaet ist es eigentlich?
expect_response: matches \d{1,2}:\d{2}
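
The `expect_response` forms used in this testcase (`length > N`, `matches <regex>`, and `contains "a" or "b"`) could be checked roughly as sketched below. `check_response` is a hypothetical name, and using `re.search` (a match anywhere in the reply) rather than a full match is an assumption about the intended semantics.

```python
import re

def check_response(expr: str, response: str) -> bool:
    """Evaluate one expect_response expression (hypothetical helper)."""
    expr = expr.strip()
    if expr.startswith("matches "):
        # Assumption: the pattern may match anywhere in the response.
        pattern = expr[len("matches "):]
        return re.search(pattern, response) is not None
    m = re.fullmatch(r'length\s*>\s*(\d+)', expr)
    if m:
        return len(response) > int(m.group(1))
    if expr.startswith("contains "):
        # Quoted alternatives joined by 'or'; any one of them suffices.
        options = re.findall(r'"([^"]*)"', expr)
        return any(opt in response for opt in options)
    raise ValueError(f"unsupported expression: {expr!r}")
```

With search semantics, the step 5 pattern `\d{1,2}:\d{2}` accepts any reply that includes a clock-like token such as `21:45`. It does not verify that the time is correct or even valid.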

6. Back to English

send: Let's switch to English, what was the last thing Tina said?
expect_state: language is "en" or "mixed"
expect_response: contains "Tina" or "Hefeweizen"

7. Mood check

send: This is really fun!
expect_state: user_mood is "happy" or "playful" or "excited"
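
The steps above follow a regular shape: a numbered title followed by `key: value` directive lines. A minimal sketch of how a runner might split a testcase into steps is shown below. `parse_steps`, the `DIRECTIVES` set (taken from the directive names listed in the commit message), and the returned dictionary layout are assumptions for illustration, not the actual parser in `runtime_test.py`. The sketch ignores the Setup section.

```python
import re

# Directive names as listed in the commit message.
DIRECTIVES = {"send", "expect_response", "expect_state", "expect_actions", "action"}

def parse_steps(text: str) -> list[dict]:
    """Group directive lines under their numbered step titles (hypothetical)."""
    steps = []
    for raw in text.splitlines():
        line = raw.strip()
        # Accept 'N. Title' with or without leading markdown '#' markers.
        heading = re.fullmatch(r'(\d+)\.\s+(.+)', line.lstrip("# "))
        if heading:
            steps.append({
                "number": int(heading.group(1)),
                "title": heading.group(2),
                "commands": [],
            })
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key in DIRECTIVES and steps:
            steps[-1]["commands"].append((key, value.strip()))
    return steps
```

Splitting on the first colon matters for step 4, where the message itself contains one (`Tina says: ...`).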
