Tapas is a universal energy-efficiency middleware layer. It sits between your users and any AI engine, serving cached answers when available and forwarding to the AI only when necessary.
Cache hits return in <50ms at 0.001 Wh. Cache misses are forwarded to your AI with only the overhead of a single cache lookup added.
Wrap Tapas around the OpenAI API as a middleware layer
Tapas intercepts every request to the OpenAI Chat Completions API. Before forwarding to GPT-4/GPT-3.5, it checks the semantic cache. On a hit, it returns a bullet-point answer in <50ms — no OpenAI token cost, no GPU spin-up.
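In practice the interception is a base-URL swap: the client builds a standard Chat Completions request but aims it at Tapas instead of api.openai.com. A stdlib-only sketch, where `tapas.example.com` is a hypothetical placeholder for your actual Tapas deployment URL:

```python
import json
import urllib.request

# Hypothetical proxy address; substitute your real Tapas deployment URL.
TAPAS_BASE_URL = "https://tapas.example.com/v1"

def chat_request(question: str, model: str = "gpt-4") -> urllib.request.Request:
    """Build an OpenAI-compatible Chat Completions request aimed at Tapas.
    Tapas answers from cache on a hit and forwards to OpenAI on a miss."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{TAPAS_BASE_URL}/chat/completions",
        data=body,  # a Request with data defaults to POST
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_OPENAI_KEY",  # forwarded on misses
        },
    )
```

Sending the request with `urllib.request.urlopen()` returns either the cached answer or the forwarded OpenAI response, in the same format.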
Tapas is built on Claude — native Anthropic integration
Tapas uses Anthropic Claude as its primary inference engine for cache misses. For Claude-native apps, Tapas acts as a semantic pre-filter: it checks the cache before invoking the Messages API, and stores the response for future reuse. This is the tightest integration available.
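The pre-filter pattern looks roughly like this. `call_claude` is a placeholder for your actual Messages API call, and the exact-match dict stands in for the semantic cache:

```python
from typing import Callable

def answer(query: str,
           cache: dict[str, str],
           call_claude: Callable[[str], str]) -> str:
    """Check the cache before invoking Claude; store misses for reuse.
    `call_claude` is a placeholder for your Messages API wrapper."""
    if query in cache:
        return cache[query]          # hit: Claude is never invoked
    response = call_claude(query)    # miss: one Messages API round trip
    cache[query] = response          # stored for future reuse
    return response
```

Repeated queries cost one Claude call total; everything after the first is served from the cache.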
Deploy Tapas on Google Cloud Run alongside Vertex AI
For Google Cloud deployments, Tapas runs as a Cloud Run service in the same VPC as your Vertex AI endpoints. It intercepts Gemini API calls via a sidecar proxy, checks the semantic cache (stored in Cloud Spanner or Firestore), and only forwards to Gemini on cache misses.
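With the sidecar layout, the application only needs to know where to send Gemini calls. A sketch of that resolution, where the `TAPAS_SIDECAR_URL` variable name and the localhost port are assumptions, not an official configuration:

```python
import os

def gemini_endpoint() -> str:
    """Resolve where the app sends Gemini calls: the Tapas sidecar on
    localhost when configured, otherwise the regional Vertex AI host.
    The env var name TAPAS_SIDECAR_URL is illustrative, not official."""
    return os.environ.get(
        "TAPAS_SIDECAR_URL",  # e.g. "http://localhost:9000" inside the pod
        "https://us-central1-aiplatform.googleapis.com",
    )
```

Unsetting the variable degrades gracefully to calling Vertex AI directly, which is useful when rolling the sidecar out incrementally.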
Integrate Tapas with Azure OpenAI Service and Copilot Studio
Microsoft Copilot and Azure OpenAI Service use the same OpenAI-compatible API. Tapas deploys as an Azure Container App or API Management policy, sitting between your Copilot Studio bot and the Azure OpenAI endpoint. Cache hits bypass the AOAI deployment entirely.
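Because Azure OpenAI URLs are deployment-scoped, routing through Tapas amounts to a host swap. A sketch, where `tapas.internal.example` is a hypothetical Container App hostname:

```python
def aoai_url(host: str, deployment: str,
             api_version: str = "2024-02-01") -> str:
    """Build an Azure OpenAI Chat Completions URL against `host`, which
    can be the real resource (myres.openai.azure.com) or the Tapas
    Container App sitting in front of it."""
    return (f"https://{host}/openai/deployments/{deployment}"
            f"/chat/completions?api-version={api_version}")

# Same request shape, different host:
direct = aoai_url("myres.openai.azure.com", "gpt-4o")
proxied = aoai_url("tapas.internal.example", "gpt-4o")  # hypothetical host
```

Everything else — the `api-key` header, the JSON body — stays identical, so existing Copilot Studio or AOAI client code is unchanged apart from the hostname.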
Add Tapas as a caching layer in front of your Ollama or vLLM server
For self-hosted Llama deployments (via Ollama, vLLM, LM Studio, or NVIDIA NIM), Tapas runs as a local reverse proxy on the same machine or in the same Docker network. It intercepts requests to port 11434 (Ollama) or your vLLM endpoint and serves cache hits without touching the GPU.
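With the proxy in place, the only client-side change is the port: send Ollama-style requests to Tapas rather than directly to 11434. In the sketch below, port 8080 for the Tapas proxy is an assumption:

```python
import json
import urllib.request

TAPAS_PORT = 8080  # hypothetical; wherever your Tapas proxy listens

def generate_request(prompt: str, model: str = "llama3",
                     port: int = TAPAS_PORT) -> urllib.request.Request:
    """Build an Ollama /api/generate request aimed at the Tapas proxy,
    which forwards cache misses to Ollama on port 11434; on a hit the
    GPU is never touched."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"http://localhost:{port}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
```

Pointing `port` at 11434 instead bypasses Tapas entirely, which makes A/B-testing the cache a one-line change.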
One REST endpoint. Works with any LLM or AI service.
Tapas exposes a single REST endpoint that accepts a query string and returns a structured response. Any AI platform that accepts HTTP calls can route queries through Tapas first. If the cache returns a hit, the AI platform never needs to be called. This works with custom models, fine-tuned LLMs, RAG pipelines, or any proprietary AI service.
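A minimal client sketch for that endpoint. The `/query` path and the `cache_hit`/`answer` response fields are assumptions about the schema, not a documented Tapas API; match them to your deployment:

```python
import json
import urllib.parse

# Hypothetical endpoint path; substitute your deployment's actual URL.
TAPAS_ENDPOINT = "https://tapas.example.com/query"

def query_url(question: str) -> str:
    """URL-encode the query string for Tapas's single REST endpoint."""
    return TAPAS_ENDPOINT + "?" + urllib.parse.urlencode({"q": question})

def parse_response(raw: str) -> tuple[bool, str]:
    """Split a Tapas reply into (cache_hit, answer). The field names are
    illustrative; adjust them to the schema your deployment returns."""
    data = json.loads(raw)
    return bool(data["cache_hit"]), data["answer"]
```

On a miss (`cache_hit` false), the caller forwards the question to its own AI backend as usual; on a hit, the AI platform is never contacted.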