
Staying compliant and in control of your data when using ChatGPT and other Large Language Models

Author: Joren Verspeurt

 

Large Language Models and services based on them, such as ChatGPT, have recently enjoyed quite the spotlight moment. We've previously posted about the transformative power they could have for many companies, and some academics have even started suggesting that GPT-4 could be a genuine first step towards Artificial General Intelligence. The flip side is the scrutiny these models and the companies that provide them have attracted from regulators, the ethical questions opinion leaders have raised, and the uncertainty they inspire in the minds of the general public. At the time of writing, ChatGPT is still unavailable in Italy because of these issues.

Specifically, we've been hearing concerns from several clients about whether the terms under which these services are offered are compatible with the constraints on their businesses. Many of them have heard about the case where Samsung employees effectively leaked internal company information by pasting it into ChatGPT prompts, after which it could be used as training data for the next version of the model. Some are concerned about the processing of personal data, especially by a US-based company like OpenAI.

This post aims to answer some of the questions you might have about using LLM-based solutions in your company. We'll also close with some quick recommendations on ways you can start using some of these services right now.

The basics

Which aspects of using LLMs could be an issue for my company?

It may not be immediately obvious what the potential risks are when using LLMs and services based on them. Here are a few that you should consider:

  • Confidentiality: LLMs process and generate text based on the input they receive, which may include sensitive or confidential information. Ensuring that confidential data remains secure and does not get inadvertently exposed during interactions with LLMs is a significant concern. This is important both for employees who may use ChatGPT independently and when integrating LLMs into other processes.
  • Data Privacy: While LLMs do not explicitly store or share the data used in their training dataset or future interactions, there is still a risk of unintended information leakage or privacy violations, especially when processing personal or sensitive data. If you use a service which is regularly retrained on user interactions (like ChatGPT), other users may be able to deduce that sensitive data about your company was sent to the service at some point and, in the worst case, even get the service to generate some version of it.
  • Regulatory Compliance: Compliance with data protection regulations like the GDPR and other EU legislation is critical when using LLMs in business applications. Failing to comply with these regulations can result in fines, legal consequences, and reputational damage.
  • Ethical Considerations: LLMs have the potential to generate biased, offensive, or inappropriate content due to the nature of the data they were trained on. Ensuring that your company's use of LLMs aligns with ethical guidelines and does not contribute to harmful content generation is vital.

What are the first steps I should take?

Whatever service you'd like to use, checking the terms and conditions should be your first reflex. For example, the terms of (the official frontend to) ChatGPT allow OpenAI to use anything you send it as training data, while the terms for their API indicate that data sent to it won't be used for training purposes unless you specifically opt in.

Whether the terms of your chosen service are suitable for you depends on the use case. As discussed in the next section, if anything you'd like to send to an LLM service includes personal data, you need to comply with the GDPR. That could mean you need a service that guarantees that data from the EU stays in the EU. In the case of the OpenAI API, the only way to get that guarantee is to use Microsoft Azure's OpenAI service, which currently offers only a limited version of the API, with older models than OpenAI's own implementation.

Apart from that, awareness in your organization is an important factor. If the terms of use for your chosen LLM service limit which data can be sent to it, everyone who might come into contact with the service should be aware of that.

Data Protection and the GDPR

Which parts of the GDPR are relevant here?

When using LLM-based services, understanding how these services process your company's data, and the GDPR compliance issues that processing may raise, is essential for responsible and compliant usage. Here's an overview of how LLM services use data and where GDPR compliance concerns may arise.

Data Storage and Retention

LLM providers often temporarily store input and output data, such as message logs, to improve the quality of their services, for troubleshooting, or for other operational needs. This storage and retention must comply with GDPR requirements, which dictate that personal data should not be kept longer than necessary for the purpose for which it was collected.

Data Processing

LLM services process data by analyzing the input text provided by users and generating human-like responses based on patterns learned during the training process. The processing of your company's data, especially when it includes personal information, must comply with GDPR regulations. This can include obtaining consent for data processing, building data minimization into the solution design, and respecting purpose limitation.

Data Security

When using LLM services, data is transmitted over the internet, and it's crucial to ensure that this transmission is secure. LLM providers targeting the EU should implement appropriate technical and organizational measures to protect personal data, as required by the GDPR. Still, it's good to check whether they document their compliance in their terms or other documentation.

Data Subject Rights

One challenge when using LLM services is ensuring compliance with data subject rights under the GDPR, such as the right to access, rectify, or erase personal data. To enable this, it's important to know whether data sent to a service contains data about a specific person and, if so, which data subject(s) are involved. Especially with free text data, this can be hard to keep track of, or in some cases even know in the first place, depending on the source of the data. In these cases, the number of compliant ways of using this data may be limited.
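As a minimal sketch of what such tracking could look like in practice, the snippet below records which data subjects are involved in each outbound LLM request, so access or erasure requests can later be mapped back to the data that was sent. All names, paths, and fields here are illustrative assumptions, not part of any real LLM service's API.

```python
import json
import time
import uuid

# Hypothetical audit log: one JSON line per outbound LLM request,
# recording which data subjects' data the request contained.
AUDIT_LOG_PATH = "llm_request_audit.jsonl"

def log_llm_request(prompt: str, data_subject_ids: list[str]) -> str:
    """Record an outbound LLM request and the data subjects involved.

    Returns a request ID that can be stored alongside the provider's
    response, so access and erasure requests can be traced end to end.
    """
    request_id = str(uuid.uuid4())
    entry = {
        "request_id": request_id,
        "timestamp": time.time(),
        "data_subject_ids": data_subject_ids,
        # Store only the length, to avoid duplicating personal data
        # in the audit log itself.
        "prompt_length": len(prompt),
    }
    with open(AUDIT_LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return request_id

def requests_for_subject(subject_id: str) -> list[str]:
    """Find all request IDs that involved a given data subject."""
    matches = []
    with open(AUDIT_LOG_PATH, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if subject_id in entry["data_subject_ids"]:
                matches.append(entry["request_id"])
    return matches
```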

International Data Transfers

Using LLM services based in the United States while operating within the EU can pose challenges regarding the transfer of personal data. The GDPR requires that personal data of EU residents only be transferred outside the EU if there are adequate safeguards in place, such as the EU Standard Contractual Clauses. Note that initiatives such as the EU-US Privacy Shield Framework are no longer relevant after the Schrems II verdict, so special care must be taken when transferring data into the control of US companies for processing outside the EU.

How can I improve compliance through Data Processing Agreements?

Having a Data Processing Agreement (DPA) in place with your LLM provider is crucial to ensuring compliance with data protection regulations. DPAs clarify the roles and responsibilities of both parties concerning data protection and set the guidelines for processing, storing, and handling data. Make sure your DPA aligns with relevant regulations and addresses the unique risks and concerns related to using LLMs.

Technical and Organisational Measures

What Data Privacy best practices should I implement when using LLMs?

To maintain data privacy and confidentiality while using LLMs, start by considering the implementation of the following best practices:

  • Data minimization: Limit the data processed by LLMs to only what's necessary for a particular task. Anonymize, pseudonymize or remove sensitive information before feeding it into the model.
  • Content filtering: This can apply both to the input data sent to an LLM service and to the output that comes back. Output data should be checked, regularly or automatically, for format, correctness (where possible), and potentially damaging or offensive content (when generating free text); see the sketch after this list. Input data can be filtered with techniques like those mentioned in the next section.
  • Local deployment: There are some LLMs which can be deployed locally or in a private cloud. This reduces the risk of data leaks and ensures greater control over data storage and access.
  • Keep up to date: Read the updates about new features and changes to the terms of a hosted service, or check out the changelog of the project behind your locally deployed implementation. This will ensure you don't miss any changes that could impact the privacy of your data.
  • Data access and retention policies: Establish clear policies on LLM access and data retention to prevent unauthorized access and data breaches.
  • Fine-tuning for privacy: Some LLMs (even hosted services) allow you to fine-tune your own custom model. If you do so, you can fine-tune the LLM on custom datasets designed to minimize data privacy risks, such as synthetic data, automatically anonymized data, or carefully curated datasets.
  • Logging and monitoring: Implement logging and monitoring systems to track LLM usage and detect potential data breaches, privacy violations, or misuse.
  • Evaluate third-party integrations: Assess the data privacy practices of third-party applications or services integrated with the LLM (like ChatGPT's new plugin functionality) and ensure they follow data protection best practices and comply with relevant regulations.
  • Stimulate a privacy-first culture: This begins with raising awareness and fostering a shared commitment to data privacy. Provide training and resources for employees to understand the implications of using ChatGPT and other LLMs, and make data privacy an integral part of your company's values. Encourage open communication about privacy concerns and involve employees in the development and implementation of data protection policies and procedures.
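To make the content-filtering item above concrete, here is a minimal sketch of an output filter in Python. It assumes, purely for illustration, that the model was asked to answer in JSON and that a simple pattern blocklist is acceptable; a production setup would more likely rely on a dedicated moderation model or your provider's moderation endpoint.

```python
import json
import re

# Illustrative blocklist; real deployments would use a moderation
# model or service instead of hand-written patterns.
BLOCKED_PATTERNS = [r"\b(some_offensive_term|another_term)\b"]

def check_llm_output(raw_output: str) -> dict:
    """Validate an LLM response before passing it downstream."""
    # 1. Format check: this sketch expects well-formed JSON.
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc

    # 2. Content check: crude pattern-based screening.
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, raw_output, flags=re.IGNORECASE):
            raise ValueError("Output flagged by content filter")

    return parsed
```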

Which other technical measures can I use to ensure I don't send sensitive data to an LLM service?

When using LLMs in a business setting, it is crucial to ensure that personal and sensitive data remains secure from potential attackers, whether they gain access to the LLM service itself or intercept the communication. Implementing technical measures can help protect sensitive information while interacting with LLMs. This section explores several techniques to safeguard your data:

Anonymization: Anonymization is the process of removing or altering personally identifiable information (PII) from a dataset, making it impossible to link the data back to an individual. By anonymizing data before sending it to an LLM service, you can reduce the risk of privacy breaches and improve compliance with data protection regulations like GDPR.
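As a minimal sketch of what this can look like in code, the snippet below strips a few common PII patterns from free text before it is sent to an LLM service. The patterns are deliberately simplistic illustrations; robust anonymization of free text usually requires named-entity recognition or a dedicated PII-detection tool rather than regular expressions alone.

```python
import re

# A few common PII patterns; deliberately simplistic for illustration.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d \-()]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
}

def anonymize(text: str) -> str:
    """Replace recognizable PII with placeholder tags."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Contact jane.doe@example.com or +32 470 12 34 56 about the order."
print(anonymize(prompt))
# -> "Contact [EMAIL] or [PHONE] about the order."
```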

Pseudonymization: Pseudonymization is a technique that replaces identifiable information with pseudonyms, which are randomly generated or artificial identifiers. While not as secure as anonymization, pseudonymization allows for data analysis while protecting personal information. Pseudonymized data can be re-identified if combined with the correct key, which should be stored securely and separately from the LLM service.
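Here is a minimal pseudonymization sketch: known identifiers are swapped for random pseudonyms before the text leaves your systems, and the mapping table, which acts as the re-identification key, stays local. The class and naming scheme are illustrative; in practice the mapping should live in secured storage, separate from anything the LLM service can access.

```python
import secrets

class Pseudonymizer:
    """Replace known identifiers with random pseudonyms, reversibly."""

    def __init__(self):
        self._forward: dict[str, str] = {}  # real value -> pseudonym
        self._reverse: dict[str, str] = {}  # pseudonym -> real value

    def pseudonymize(self, text: str, identifiers: list[str]) -> str:
        for value in identifiers:
            if value not in self._forward:
                pseudonym = f"PERSON_{secrets.token_hex(4)}"
                self._forward[value] = pseudonym
                self._reverse[pseudonym] = value
            text = text.replace(value, self._forward[value])
        return text

    def reidentify(self, text: str) -> str:
        """Map pseudonyms in (e.g.) an LLM response back to real values."""
        for pseudonym, value in self._reverse.items():
            text = text.replace(pseudonym, value)
        return text

p = Pseudonymizer()
safe = p.pseudonymize("Invoice for Jan Janssens is overdue.", ["Jan Janssens"])
# safe looks like "Invoice for PERSON_3fa1b2c4 is overdue."
# ...send `safe` to the LLM, then map the answer back:
restored = p.reidentify(safe)
```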

Differential Privacy: Differential privacy is a technique used to protect the privacy of individuals in a dataset by adding statistical noise to the data. This approach ensures that the overall characteristics of the dataset remain accurate while preventing the identification of individual records. Implementing differential privacy can help protect sensitive data when using LLM services, especially if the model itself is trained on such data.
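Differential privacy applies to aggregate statistics rather than to individual prompts, so a sketch of the core idea looks like this: the classic Laplace mechanism adds calibrated noise to a count query, bounding how much any single individual's presence can affect the output. The epsilon value and the example data are illustrative assumptions.

```python
import numpy as np

def dp_count(records, predicate, epsilon: float = 1.0) -> float:
    """Epsilon-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon is enough for epsilon-differential privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: a private count of employees in a given department.
employees = [{"dept": "sales"}, {"dept": "sales"}, {"dept": "legal"}]
noisy = dp_count(employees, lambda r: r["dept"] == "sales", epsilon=0.5)
```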

Database methods: When working with a relatively stable dataset, typically kept in a database of some kind, there are other methods that can be applied to anonymize or desensitise data, such as k-anonymity. K-anonymity is a method of ensuring privacy in data by generalizing or suppressing certain attributes to ensure that each record in the dataset is indistinguishable from at least k-1 other records. This technique can be useful in preventing attackers from linking records to specific individuals based on unique combinations of attributes.
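As a sketch, the function below checks whether a table satisfies k-anonymity for a chosen set of quasi-identifier columns: every combination of those attributes must occur in at least k records. Achieving k-anonymity (by generalizing or suppressing values, as in the age bands and masked postcodes below) is the harder part; this snippet only verifies the property.

```python
from collections import Counter

def is_k_anonymous(rows: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    """Check that every quasi-identifier combination occurs >= k times."""
    combos = Counter(
        tuple(row[attr] for attr in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in combos.values())

# Example: exact ages and postcodes generalized to reach k = 2.
rows = [
    {"age_band": "30-39", "postcode": "30**"},
    {"age_band": "30-39", "postcode": "30**"},
    {"age_band": "40-49", "postcode": "31**"},
    {"age_band": "40-49", "postcode": "31**"},
]
assert is_k_anonymous(rows, ["age_band", "postcode"], k=2)
```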

To effectively implement these technical measures, consider the following steps:

  • Assess the type of data your company processes with LLM services, identifying any personal or sensitive information that requires protection.
  • Choose appropriate techniques based on the nature of the data and the desired level of privacy protection. For example, use anonymization or pseudonymization for text data, and consider differential privacy or k-anonymity when working with large datasets.
  • Train your staff on the importance of data privacy and the techniques used to protect personal and sensitive data when interacting with LLM services.
  • Regularly review and update your data protection measures to stay current with technological advancements and changes in data protection regulations.
By implementing these technical measures, your company can ensure that sensitive and personal data remains secure, even if attackers gain access to the LLM service or intercept the communication. This proactive approach to data protection will help you maintain compliance with data protection regulations and uphold the trust of your customers and clients.

How will the future of LLMs and relevant laws and regulations affect my company's use of these technologies?

As data privacy regulations continue to evolve and technology advances, the use of ChatGPT and other LLMs in business applications may be subject to new rules and privacy safeguards. For example, the upcoming AI Act will regulate, and in some cases limit, certain use cases for LLMs. It will require greater transparency about the use of generative AI and enforce best practices around monitoring and quality assurance. Right now it may be fair to say that lawmakers weren't prepared for the sudden advances made in this technology, but they're certainly watching the situation closely, and it would be prudent to expect more specific legislation soon.

We at Radix follow these developments closely and, as an AI partner, advise our clients about the best way to anticipate future requirements and implement necessary changes.

What can I start using right now?

Even though the landscape of LLM-based tools and services is shifting so fast that this section might need to be updated by the time you read this, we wanted to leave you with some concrete, practical advice you can use to try things out for yourself. Contact us if you'd like to stay on the bleeding edge of the latest developments.

Use of ChatGPT and other readily available tools

Currently, the ChatGPT terms make it a tool with limited applicability. It can be useful as a writing aid or as a technical copilot in less sensitive projects, but employees should be cautious when using it in any context that could potentially be sensitive. If an open-source solution is preferred, a project like Open Assistant could be deployed to fill this role in a more secure way; something like ChatORG is an option if you want to take advantage of the power of GPT-4.

Integrating LLM services into your business

If you're looking for a service to provide completions or high-quality text embeddings via an API, the OpenAI API could work well for data that is not personal or too sensitive. As mentioned before, when processing personal data, including data about EU residents, it takes a lot of work to have it processed by a US company outside of the EEA. If you want to take advantage of something like the OpenAI API anyway, it is currently possible to use a very limited version of it as implemented by Microsoft on Azure. The Azure terms and Data Processing Agreement that apply to this service indicate that data it processes will not leave the EEA, or be accessed by support staff outside the EEA, if you, as a customer, use a European location in your subscription.
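For illustration, this is roughly what a call through the Azure OpenAI service looks like with the openai Python package (shown in the pre-1.0 style; newer package versions use a different client class). The endpoint, deployment name, and API version are placeholders to replace with your own, and it is the European region of your Azure resource that anchors the EEA residency guarantee discussed above.

```python
import os

import openai

# Azure OpenAI configuration; all values below are placeholders.
# The resource behind this endpoint should live in a European Azure
# region (e.g. "West Europe") for the EEA data residency guarantee.
openai.api_type = "azure"
openai.api_base = "https://your-resource-name.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    engine="your-deployment-name",  # the model deployment you created
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize our Q2 sales notes."},
    ],
)
print(response["choices"][0]["message"]["content"])
```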

Rolling your own services

If you can't use LLMs as cloud services at all, you're not entirely out of luck! There are some alternatives that you can host yourself. The downside, for now, is that hardly any of the options currently available match the performance of even GPT-3.5, let alone GPT-4. If you do have to go that route, you could try to roll your own ChatGPT-like service with Microsoft's DeepSpeed. Its example setup uses OPT models from Meta, which could act as an alternative to GPT-3. The default performance doesn't yet parallel OpenAI's offerings, but if you have the hardware for it, you can fine-tune these models locally to be better suited for your use cases.
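If you'd like a quick feel for self-hosting before committing to a full DeepSpeed setup, a minimal sketch using Hugging Face's transformers library can run one of Meta's OPT models locally. The model size chosen here is an assumption that fits modest hardware; larger variants improve quality at a steep hardware cost.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# facebook/opt-1.3b is small enough for a single consumer GPU or,
# slowly, a CPU; larger OPT variants trade hardware for quality.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a short, friendly reminder email about an overdue invoice."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```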

Conclusion

As you integrate ChatGPT and other LLMs into your business operations, it's essential to prioritize data ownership, privacy, and compliance. This involves understanding the risks and challenges associated with LLM usage, implementing data protection best practices, and fostering a privacy-first culture within your organization. By proactively addressing data privacy concerns and maintaining compliance with relevant regulations, you can stay in control of your data.

The constantly evolving landscape of available models, tools, and services in this space can make it seem daunting to get started with this technology, especially given the variety of (often confusing) terms and conditions under which the services are made available. This is where Radix can make a difference as an AI partner. We can not only help you stay on top of the latest developments but also help you set up the right framework around the innovative projects you end up doing with them. This way, you can get ahead while maintaining full control over your data in a secure and compliant manner.

Interested in having a first conversation with us? Book a call with our Account Manager Tomas Vanhaeren.

 
About The Author

Joren Verspeurt

Joren is a Machine Learning Engineer and Security Officer at Radix. He combines experience as a Data Scientist and Software Engineer, gained at companies working in domains as varied as video streaming, life sciences, and cloud telephony. What drives him is finding ways to improve people's interactions with technology using AI and Machine Learning techniques.