Pre-process JSON data for AI apps

Sensitive data obfuscation and token-saving transformation from a single library with just a few lines of code.

In the world of AI, data preparation is key. When working with JSON data that needs to be fed into Large Language Models (LLMs), you often face two challenges: the data may contain sensitive information and its structure might not be optimized for token efficiency. In this post, we’ll explore how to address these challenges using @tsmx/json-tools, a lightweight JavaScript library for JSON manipulation.

The initial data

Imagine you have a JSON object containing customer data, including sensitive information like credit card numbers and IP addresses. You want to use this data to train or prompt an AI model, but you need to protect the sensitive information and minimize the number of tokens to save costs.

Let’s say you have the following JSON object that should be consumed by an LLM:

{
  "accounts": [
    { "id": 1, "name": "Joe", "creditCard": "4111-1111-1111-1111" },
    { "id": 2, "name": "Sue", "creditCard": "5555-5555-5555-4444" }
  ],
  "visits": [
    { "visitorId": 1, "timestamp": "2025-01-01T12:00:00Z", "ip": "192.168.1.1", "site": "index.html"},
    { "visitorId": 1, "timestamp": "2025-01-01T13:05:00Z", "ip": "192.168.1.2", "site": "shop.html"},
    { "visitorId": 2, "timestamp": "2025-01-01T14:00:00Z", "ip": "192.168.1.2", "site": "login.html"},
    { "visitorId": 2, "timestamp": "2025-01-01T14:10:00Z", "ip": "192.168.1.1", "site": "index.html"}
  ]
}

Pre-processing for AI

With @tsmx/json-tools, you can easily obfuscate the sensitive data and transform the JSON object into a token-saving format.

First, let’s install the library:

npm install @tsmx/json-tools

Now, let’s use the library to process our JSON data:

const jt = require('@tsmx/json-tools');

const data = {
  "accounts": [
    { "id": 1, "name": "Joe", "creditCard": "4111-1111-1111-1111" },
    { "id": 2, "name": "Sue", "creditCard": "5555-5555-5555-4444" }
  ],
  "visits": [
    { "visitorId": 1, "timestamp": "2025-01-01T12:00:00Z", "ip": "192.168.1.1", "site": "index.html"},
    { "visitorId": 1, "timestamp": "2025-01-01T13:05:00Z", "ip": "192.168.1.2", "site": "shop.html"},
    { "visitorId": 2, "timestamp": "2025-01-01T14:00:00Z", "ip": "192.168.1.2", "site": "login.html"},
    { "visitorId": 2, "timestamp": "2025-01-01T14:10:00Z", "ip": "192.168.1.1", "site": "index.html"}
  ]
};

// Obfuscate sensitive data
jt.obfuscate.ipAddresses(data);
jt.obfuscate.creditCards(data);

// Transform to LLM-friendly format
const result = jt.transform.toLLM(data, true);

console.log(result);

That’s it. Your data is now ready to be safely and efficiently processed further in your AI app.
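
From here, the `result` string can be passed to an LLM like any other prompt content. The following is a minimal sketch, assuming the official openai Node.js SDK and an API key in the OPENAI_API_KEY environment variable (both are assumptions of this example, not part of @tsmx/json-tools):

const OpenAI = require('openai');

// The client picks up OPENAI_API_KEY from the environment by default
const client = new OpenAI();

async function analyzeVisits(preprocessed) {
  const completion = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a data analyst. Answer questions about the provided dataset.' },
      { role: 'user', content: `Dataset:\n${preprocessed}\n\nWhich site was visited most often?` }
    ]
  });
  return completion.choices[0].message.content;
}

// 'result' is the pre-processed string from the example above
analyzeVisits(result).then(console.log);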

Note that the library offers a lot more functionality than shown in this use case. For a complete overview and more details on the functions used above, please consult the GitHub repo.

The Result

The code above will produce the following output:

accounts[2](id,name,creditCard)
 -1
  Joe
  ***
 -2
  Sue
  ***
visits[4](visitorId,timestamp,ip,site)
 -1
  2025-01-01T12:00:00Z
  ***
  index.html
 -1
  2025-01-01T13:05:00Z
  ***
  shop.html
 -2
  2025-01-01T14:00:00Z
  ***
  login.html
 -2
  2025-01-01T14:10:00Z
  ***
  index.html

As you can see, the credit card numbers and IP addresses have been replaced with `***`, and the JSON structure has been transformed into a compact, LLM-friendly format: each array is introduced by a header line stating its name, length and field names, and every record follows as a dash-prefixed block with its values in field order.

In this example, the transformation reduces the token count from 265 to 139 (according to OpenAI’s tokenizer for GPT-4o), a significant saving of roughly 48%.
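
If you want to verify such numbers yourself, a small sketch like the following could be used, assuming a recent version of the tiktoken npm package (which exposes OpenAI’s tokenizers to JavaScript); the exact "before" count depends on how the original JSON is serialized:

const { encoding_for_model } = require('tiktoken');

// Count tokens the way OpenAI's GPT-4o tokenizer does
function countTokens(text) {
  const enc = encoding_for_model('gpt-4o');
  const count = enc.encode(text).length;
  enc.free(); // release the WASM-backed encoder instance
  return count;
}

// Note: take this snapshot before the obfuscate calls to measure the original data
const originalJson = JSON.stringify(data, null, 2);

const before = countTokens(originalJson);
const after = countTokens(result);
console.log(`Tokens before: ${before}, after: ${after}, saved: ${Math.round((1 - after / before) * 100)}%`);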

Note: If you are after the absolute maximum of token savings, more specialized formats (like TOON or CTON) might be better suited. The goal of @tsmx/json-tools is rather to deliver a comprehensive toolset for JSON data processing so that you need fewer dependencies.

Conclusion

@tsmx/json-tools provides a simple and effective way to pre-process JSON data for AI applications without the need for any additional dependencies. With its obfuscation and transformation functions, you can easily secure sensitive data and optimize it for token efficiency, leading to significant cost savings.

Useful links