I decided to make a personal AI assistant using the M5StickC Plus2.


I decided to build a client for my AI project. Before this, I trained a model to create my personal JARVIS: I took mistral-7b-finetuned from Hugging Face as a base model, then fine-tuned it for my own needs. I trained it on Google Colab, and after 1–2 hours my model was ready. (I also want to write about how I trained it, my code, the API, etc.; that's another blog topic.)

neural_network.svg


Now I had a working AI API, and I needed a client: a pocket client. I first wanted to use Telegram to access it from my phone, but then I decided: why not build a project with my M5StickC Plus2 (ESP32)? The device has a microphone, so I can record voice, convert it to text (STT), send it to my API, get a response, and show it on the screen. It's a simple project and it works great. I named it Pocket AI.

Below I'll share the plan, the AI architecture, and where I got my inspiration.

Inspiration

I got inspiration from the Marvel universe (J.A.R.V.I.S / F.R.I.D.A.Y.), and from the DC universe (Mother Box / Father Box). I also love Batman's Batcomputer because I'm a nerd.

J.A.R.V.I.S

Mother Box

So I decided to make my own AI, my own JARVIS, because why settle for Siri when you can have a sarcastic genius in your pocket?


Tech stack & system design

I hate choosing the tech stack, but I have to.

System Design Zero.svg

Hardware

  • M5StickC Plus2 (ESP32)
  • Built-in microphone for voice input
  • Built-in screen for showing the assistant’s responses

On-device software

  • Record audio on the ESP32
  • Send audio or text to the backend over Wi‑Fi
  • Render the response text on the device display

Backend / API

  • FastAPI as the web server for the AI API
  • Endpoints typically look like:
    • /stt (speech-to-text)
    • /chat (LLM inference)
    • /tts (text-to-speech) (optional, if you add audio output later)

AI layer

  • Base model: mistral-7b-finetuned (from Hugging Face)
  • Hosted behind the API, so the device stays lightweight

Data flow (high level)

  1. ESP32 records audio
  2. Audio is sent to the backend
  3. Backend runs STT and sends text to the LLM
  4. LLM response is returned to the device
  5. Device displays the response

Implementation (M5StickC Plus2)

Next I started building the M5StickC Plus2 client firmware. The goal is simple (and honestly a little magical): record audio, ship it to my backend, get a text reply, and show it on this tiny screen like it's a pocket-sized JARVIS.

My platformio.ini looks like this:

[env:m5stick-c]
platform = espressif32
board = m5stick-c
framework = arduino
board_build.f_cpu = 240000000L
board_build.f_flash = 80000000L
board_build.flash_mode = dio
board_upload.flash_size = 8MB
board_build.partitions = default_8MB.csv
board_build.filesystem = spiffs
build_flags = 
    -DCORE_DEBUG_LEVEL=3
    -DBOARD_HAS_PSRAM
    -mfix-esp32-psram-cache-issue
lib_deps = 
    m5stack/M5Unified@^0.1.14
    tzapu/WiFiManager@^2.0.16-rc.2
    bblanchon/ArduinoJson@^7.0.4
    earlephilhower/ESP8266Audio@^1.9.9
    bitbank2/AnimatedGIF@^2.0.1

I wrote most of the firmware with help from Claude (Sonnet). It was my first time building a full microcontroller client like this, so having an AI “rubber duck” that also writes code was… extremely convenient.

For the core request flow I only need networking, JSON, and M5Unified:

#include <ArduinoJson.h>
#include <HTTPClient.h>
#include <M5Unified.h>
#include <WiFi.h>
#include <WiFiManager.h>

Note: I experimented with additional audio output libraries earlier, but the main concept of this post is the record → STT → chat → display loop, so I am keeping the includes minimal here.

Configuration

I decided to protect my backend with a simple API key. The device uses a single API_END_POINT base URL.

// === Configuration ===
const char* API_KEY = "JWT_API_KEY"; // placeholder
const char* API_END_POINT = "http://localhost:8000"; // replace with your backend's reachable address; localhost points at the device itself

// === Audio buffer (PSRAM) ===
int16_t* record_buffer = nullptr;
size_t record_size = 0;
const size_t MAX_RECORD_SIZE = 1024 * 500; // 500 KB max
const int SAMPLE_RATE = 16000;
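
The post doesn't show my setup(), so here is a minimal sketch of how it wires together. The "PocketAI" portal name is just a placeholder, error handling is omitted, and heap_caps_malloc with MALLOC_CAP_SPIRAM is the ESP-IDF call that forces the big buffer into PSRAM instead of the internal heap:

#include <esp_heap_caps.h> // heap_caps_malloc / MALLOC_CAP_SPIRAM

void setup() {
  auto cfg = M5.config();
  M5.begin(cfg);
  M5.Mic.begin(); // built-in PDM microphone

  // Put the large recording buffer in PSRAM so the internal heap stays free
  record_buffer = (int16_t*)heap_caps_malloc(MAX_RECORD_SIZE, MALLOC_CAP_SPIRAM);
  if (record_buffer == nullptr) {
    M5.Display.println("PSRAM allocation failed");
  }

  // WiFiManager opens a captive portal on first boot; afterwards it reconnects
  WiFiManager wm;
  wm.autoConnect("PocketAI"); // placeholder AP name
}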

Dialog history (on-device)

Then I added a small fixed-size buffer for local chat history (user prompt + AI response); when it fills up, the oldest entry is shifted out.

// Dialog History
struct Dialog {
  String userText;
  String aiResponse;
  unsigned long timestamp;
};

const int MAX_DIALOGS = 10;
Dialog dialogHistory[MAX_DIALOGS];
int dialogCount = 0;
int currentDialogView = 0;

Add dialog entries (user input + AI response) to the local history buffer, and shift the array when we reach the max limit:

void addDialog(String userText, String aiResponse) {
  if (dialogCount < MAX_DIALOGS) {
    dialogHistory[dialogCount].userText = userText;
    dialogHistory[dialogCount].aiResponse = aiResponse;
    dialogHistory[dialogCount].timestamp = millis();
    dialogCount++;
  } else {
    // Remove the oldest entry by shifting the array left
    for (int i = 0; i < MAX_DIALOGS - 1; i++) {
      dialogHistory[i] = dialogHistory[i + 1];
    }
    dialogHistory[MAX_DIALOGS - 1].userText = userText;
    dialogHistory[MAX_DIALOGS - 1].aiResponse = aiResponse;
    dialogHistory[MAX_DIALOGS - 1].timestamp = millis();
  }

  // Always point the viewer to the most recent dialog
  currentDialogView = dialogCount - 1;
}
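
That currentDialogView index is what lets me scroll back through old exchanges. Here's a minimal sketch of the viewer; binding it to BtnB (the side button) is just my choice:

void showDialog(int index) {
  if (index < 0 || index >= dialogCount) return;
  M5.Display.fillScreen(TFT_BLACK);
  M5.Display.setCursor(0, 0);
  M5.Display.printf("You: %s\n", dialogHistory[index].userText.c_str());
  M5.Display.printf("AI: %s\n", dialogHistory[index].aiResponse.c_str());
}

// Somewhere in loop():
//   if (M5.BtnB.wasPressed() && dialogCount > 0) {
//     currentDialogView = (currentDialogView + 1) % dialogCount;
//     showDialog(currentDialogView);
//   }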

Sending audio to the backend, getting a response, and rendering it

This is the part where the project stops being “a board with a screen” and starts feeling like a tiny assistant:

  1. Hold the button to record audio from the microphone.
  2. When recording ends, send the recorded audio to your backend’s STT endpoint.
  3. Take the transcribed text and send it to your backend’s chat endpoint.
  4. Render the response on the screen.
  5. Store the pair in dialogHistory so you can browse past prompts and answers.

Sending the recording to /stt

My backend expects a small audio payload and returns JSON.

The important idea is that the device sends the recorded buffer, then reads back a text field:

String sttRequest() {
  HTTPClient http;

  String url = String(API_END_POINT) + "/stt";
  http.begin(url);
  http.addHeader("Authorization", String("Bearer ") + API_KEY);

  // Pick the content type your backend expects.
  // If you send raw PCM, use application/octet-stream.
  // If you send WAV, use audio/wav.
  http.addHeader("Content-Type", "application/octet-stream");

  int httpCode = http.POST((uint8_t*)record_buffer, record_size);
  if (httpCode == HTTP_CODE_OK) {
    String response = http.getString();
    JsonDocument doc;
    deserializeJson(doc, response);
    http.end();

    if (!doc["text"].isNull()) {
      return doc["text"].as<String>();
    }
  }

  http.end();
  return "";
}
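
If your backend wants audio/wav instead of raw PCM, you have to prepend the standard 44-byte WAV header before POSTing. A minimal sketch for 16-bit mono PCM (the ESP32 is little-endian, which is exactly the byte order WAV expects):

void writeWavHeader(uint8_t* header, uint32_t pcmBytes, uint32_t sampleRate) {
  uint32_t byteRate  = sampleRate * 2;      // mono, 16-bit = 2 bytes per sample
  uint32_t riffSize  = 36 + pcmBytes;       // total file size minus 8 bytes
  memcpy(header,      "RIFF", 4);
  memcpy(header + 4,  &riffSize, 4);
  memcpy(header + 8,  "WAVEfmt ", 8);
  uint32_t fmtSize = 16;        memcpy(header + 16, &fmtSize, 4);
  uint16_t audioFormat = 1;     memcpy(header + 20, &audioFormat, 2); // PCM
  uint16_t channels = 1;        memcpy(header + 22, &channels, 2);
  memcpy(header + 24, &sampleRate, 4);
  memcpy(header + 28, &byteRate, 4);
  uint16_t blockAlign = 2;      memcpy(header + 32, &blockAlign, 2);
  uint16_t bitsPerSample = 16;  memcpy(header + 34, &bitsPerSample, 2);
  memcpy(header + 36, "data", 4);
  memcpy(header + 40, &pcmBytes, 4);
}

You would then POST one buffer of header + samples (with Content-Type: audio/wav) instead of record_buffer alone.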

Sending the text to /chat

Once I have transcription text, I send it to /chat and read back a single string response:

String chatRequest(String input) {
  HTTPClient http;

  String url = String(API_END_POINT) + "/chat";
  http.begin(url);
  http.addHeader("Authorization", String("Bearer ") + API_KEY);
  http.addHeader("Content-Type", "application/json");

  JsonDocument req;
  req["text"] = input;

  String payload;
  serializeJson(req, payload);

  int httpCode = http.POST(payload);
  if (httpCode == HTTP_CODE_OK) {
    String response = http.getString();
    JsonDocument doc;
    deserializeJson(doc, response);
    http.end();

    if (!doc["response"].isNull()) {
      return doc["response"].as<String>();
    }
  }

  http.end();
  return "Error getting response.";
}

The main loop: record → STT → chat → render

The loop stays simple. I treat the device like a “push to talk” client:

  • While the button is held: record to the buffer.
  • When the button is released: call STT, then chat, then render the result.

I keep the UI code separate so the network logic stays readable.
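
A condensed sketch of that loop. The 256-sample chunk size and the busy-wait are my simplifications; M5Unified's M5.Mic.record() queues chunks asynchronously, so I let the mic drain before calling the backend:

void loop() {
  M5.update(); // refresh button states

  if (M5.BtnA.wasPressed()) {
    record_size = 0; // start a fresh recording
    M5.Display.fillScreen(TFT_BLACK);
    M5.Display.setCursor(0, 0);
    M5.Display.println("Listening...");
  }

  if (M5.BtnA.isPressed()) {
    // Append one small chunk per pass while the button is held
    const size_t CHUNK_SAMPLES = 256; // arbitrary chunk size
    if (record_size + CHUNK_SAMPLES * sizeof(int16_t) <= MAX_RECORD_SIZE) {
      if (M5.Mic.record(record_buffer + record_size / sizeof(int16_t),
                        CHUNK_SAMPLES, SAMPLE_RATE)) {
        record_size += CHUNK_SAMPLES * sizeof(int16_t);
      }
    }
  }

  if (M5.BtnA.wasReleased() && record_size > 0) {
    while (M5.Mic.isRecording()) { delay(1); } // let the last chunk finish

    String userText = sttRequest();                // audio -> text
    if (userText.length() > 0) {
      String aiResponse = chatRequest(userText);   // text -> LLM reply
      addDialog(userText, aiResponse);
      M5.Display.fillScreen(TFT_BLACK);
      M5.Display.setCursor(0, 0);
      M5.Display.println(aiResponse);
    }
  }
}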

Result

With these pieces in place, the device can record audio, send it to the backend, receive a text response, and display it on the screen. It is basically push-to-talk, but for your own AI.
