We tested 7 mainstream large models: exposed privacy has become a common problem
Article source: Science and Technology New Knowledge
Image source: AI-generated
In the AI era, the information users enter is no longer merely personal privacy; it has become a “stepping stone” for the progress of large models.
“Help me make a PPT”, “Help me design a New Year poster”, “Help me summarize this document”: since large models took off, using AI tools to boost efficiency has become part of white-collar workers’ daily routine, and many people have even begun using AI to order takeout and book hotels.
However, this way of collecting and using data also poses significant privacy risks. Many users overlook a major problem of the digital era: the lack of transparency in digital technologies and tools. They do not know how these AI tools collect, process, and store their data, nor whether that data has been misused or leaked.
In March this year, OpenAI admitted that ChatGPT had a vulnerability that exposed some users’ chat histories. The incident heightened public concern about data security and personal privacy protection in large models. Beyond the ChatGPT data breach, Meta’s AI model has also drawn controversy over copyright infringement: in April this year, American writers, artists, and other groups accused Meta of misappropriating their works for training and infringing their copyrights.
Similar incidents have occurred in China as well. Recently, iQiyi and MiniMax, one of the “six big model tigers”, drew attention over a copyright dispute: iQiyi accused Conch AI of using its copyrighted material to train models without permission. The case is the first infringement lawsuit filed by a Chinese video platform against a large AI video model.
These incidents have drawn outside attention to the sources and copyright status of large model training data, and show that the development of AI technology must be built on the protection of user privacy.
To understand how transparent domestic large models currently are about information disclosure, “Science and Technology New Knowledge” selected seven mainstream products on the market as samples: Doubao, Wenxin Yiyan, Kimi, Tencent Hunyuan, iFlytek Spark, Tongyi Qianwen, and Kuaishou Keling. We evaluated their privacy policies and user agreements and tried out their product features, and found that many of them do poorly in this regard. We also saw clearly how sensitive the relationship between user data and AI products has become.
01. The right to withdraw exists in name only
First, it is clear from the login pages that all seven domestic large model products follow the “standard” Internet app pattern of a user agreement plus a privacy policy, and each devotes sections of its privacy policy to explaining how personal information is collected and used.
The wording of these products is largely the same: “In order to optimize and improve the service experience, we may use user feedback on output content and problems encountered during use to improve the service. On the premise that the data is processed with secure encryption technology and strictly de-identified, the content users input to the AI, the instructions they issue, the corresponding replies generated by the AI, and users’ access to and usage of the product may be analyzed and used for model training.”
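What “strictly de-identified” means in practice is never spelled out. As a rough illustration only, the sketch below shows one minimal way a chat record could be scrubbed of obvious identifiers before entering a training corpus; the regex patterns and field names are our own assumptions, not any vendor’s disclosed pipeline.

```python
import re

# Illustrative de-identification pass over a chat record before it is added to
# a training corpus. Patterns and field names are assumptions for this sketch.
PATTERNS = {
    "PHONE": re.compile(r"\b1[3-9]\d{9}\b"),            # mainland China mobile numbers
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ID_CARD": re.compile(r"\b\d{17}[\dXx]\b"),          # 18-digit resident ID numbers
}

def deidentify(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def prepare_training_record(user_prompt: str, model_reply: str) -> dict:
    """Scrub both sides of the exchange and keep no link to the user account."""
    return {
        "prompt": deidentify(user_prompt),
        "reply": deidentify(model_reply),
        # Deliberately no user ID, device ID, or timestamp is retained here.
    }

if __name__ == "__main__":
    print(prepare_training_record(
        "Remind 13812345678 about the meeting and cc test@example.com",
        "Done. I have drafted the reminder for you.",
    ))
```

Even a pass like this only catches formatted identifiers; names, addresses, and context that identify a person indirectly would survive it, which is exactly why users ask for the right to withdraw their data altogether.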
Using user data to train products and then iterating better products back to users may look like a virtuous cycle, but what users care about is whether they have the right to refuse, or to withdraw data already used to “feed” AI training.
After reading the policies and testing these seven AI products, “Science and Technology New Knowledge” found that only Doubao, iFlytek Spark, Tongyi Qianwen, and Keling mention in their privacy terms that users can “change the scope of authorization for the product to continue collecting personal information, or withdraw authorization.”
Among them, Doubao mainly addresses withdrawing authorization for voice messages. Its policy states: “If you do not want the voice information you enter or provide to be used for model training and optimization, you can withdraw your authorization by turning off ‘Settings’ - ‘Account Settings’ - ‘Improve Voice Services’”; for other information, however, users must contact the company through its published contact details before they can request that their data be withdrawn from model training and optimization.
Image source: Doubao
In practice, turning off the voice-service authorization is not difficult, but as for withdrawing other information, “Science and Technology New Knowledge” received no reply after reaching out to Doubao through its official channels.
Image source: Doubao
Tongyi Qianwen is similar to Doubao: the only thing users can do on their own is withdraw authorization for voice services. For other information, they likewise need to contact the company through its published contact details to change or withdraw the scope of authorization to collect and process personal information.
Image source: Tongyi Qianwen
As a video and image generation platform, Keling focuses on the use of faces, stating that it will not use your facial pixel information for any other purpose or share it with third parties. To withdraw that authorization, however, you must contact the company by email.
Image source: Keling
Compared with Doubao, Tongyi Qianwen, and Keling, iFlytek Spark is stricter: according to its terms, users who want to change or withdraw the scope of personal information collected can only do so by deleting their accounts.
Image source: iFlytek Spark
It is worth mentioning that although Tencent Yuanbao’s terms do not say how to change information authorization, a “Voice Function Improvement Plan” switch can be found in the app.
Image source: Tencent Yuanbao
Kimi’s privacy terms state that authorization to share voiceprint information with third parties can be revoked and that the corresponding setting is available in the app, but “Science and Technology New Knowledge” could not find it after searching for a long time. For other text information, no corresponding terms were found.
Image source: Kimi privacy policy
In fact, it is clear from these mainstream large model applications that vendors pay the most attention to managing user voiceprints: Doubao, Tongyi Qianwen, and others all allow that authorization to be withdrawn through self-service operations. Basic permissions tied to specific interactions, such as geographic location, camera, and microphone, can also be turned off by users themselves. Withdrawing the data used to “feed” training, however, is nowhere near as smooth.
Overseas models take a similar approach to letting user data exit AI training. Google’s Gemini terms state: “If you don’t want us to review future conversations or use relevant conversations to improve Google’s machine learning technology, please turn off the Gemini app activity record.”
Gemini also notes that even when users delete their app activity records, the system will not delete conversations (and related data such as language, device type, location information, or feedback) that have already been reviewed or annotated by human reviewers, because that content is stored separately and is not associated with a Google account. Such content can be retained for up to three years.
Image source: Gemini terms
ChatGPT’s rules are somewhat ambiguous, saying that users may have the right to restrict its processing of their personal data. In actual use, Plus users can proactively turn off the setting that allows their data to be used for training; for free users, however, data is collected by default and used for training, and those who want to opt out need to submit a request to the company.
Image source: ChatGPT terms
From the terms of these large model products, it is clear that collecting user input has effectively become an industry consensus, while for more private biometric information such as voiceprints and faces, only some multimodal platforms make even modest provisions.
This is not for lack of precedent, especially among the major Internet companies. WeChat’s privacy terms, for example, detail the specific scenarios, purposes, and scope of each data collection, and even explicitly promise that “users’ chat records will not be collected.” The same goes for Douyin: almost all the information users upload to Douyin is described in detail in its privacy terms, including how it is used and for what purpose.
Image source: Douyin privacy policy
Data collection that was strictly controlled in the era of social Internet apps has become the norm in the AI era. Information entered by users is freely harvested by large model vendors under the banner of “training material.” User data is no longer treated as personal privacy to be handled with care, but as a “stepping stone” to model progress.
Beyond user data, the transparency of the training corpus is also crucial for large models. Whether that corpus is legitimate and legal, whether it constitutes infringement, and whether it poses potential risks to users are all open questions. With these questions in mind, we took a deeper look at the seven large model products, and the results surprised us.
02. Hidden dangers in the training corpora “fed” to models
Beyond computing power, high-quality corpora are even more important for training large models. These corpora, however, often contain copyrighted text, images, videos, and other works, and using them without authorization clearly constitutes infringement.
In its tests, “Science and Technology New Knowledge” found that none of the seven large model products mentions the specific sources of its training data in its agreements, nor do any of them disclose which copyrighted data they use.
The reason everyone tacitly declines to disclose training materials is simple. On the one hand, improper use of data easily leads to copyright disputes, and there is still no clear rule on whether it is legal and compliant for AI companies to use copyrighted works as training material. On the other hand, it is a matter of competition: disclosing training materials to peers is like a food company revealing its ingredient list, which competitors could quickly copy to improve their own products.
It is worth mentioning that most models’ policy agreements state that information obtained from interactions between users and the large model will be used for model and service optimization, related research, brand promotion, marketing, user research, and the like.
Frankly, because of uneven quality, limited scenario depth, diminishing marginal returns, and other factors, user data does little to improve model capability and may even add data-cleaning costs. Even so, user data still has value; it is just no longer the key to improving model capability but a new way for companies to gain commercial benefit. By analyzing user conversations, companies can gain insight into user behavior, discover monetization scenarios, customize commercial features, and even share information with advertisers, all of which happens to be permitted by the usage rules of these large model products.
It should also be noted, however, that data generated during real-time processing is uploaded to and stored in the cloud. Although most large models’ privacy agreements mention using encryption no weaker than industry peers’, anonymization, and other feasible means to protect personal information, doubts remain about how effective these measures are in practice.
For example, if content entered by a user becomes part of a training set, there is a risk of leakage when someone else later asks the large model about related topics; and if the cloud service or the product is attacked, correlation or analysis techniques may still be able to recover the original information, which is another hidden danger.
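To make the first of these risks concrete, here is a small, purely hypothetical sketch of the kind of check an auditor might run against a model: it flags outputs that reproduce long verbatim spans of text that originally came from user inputs. The snippets and the eight-word threshold are invented for illustration.

```python
# Illustrative regurgitation check: does a model output reuse a long verbatim
# span from user-contributed training text? Data and threshold are invented.

def ngrams(text: str, n: int) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def leaks_training_text(output: str, training_snippets: list[str], n: int = 8) -> bool:
    """Return True if the output shares any n-word span with a training snippet."""
    output_ngrams = ngrams(output, n)
    return any(output_ngrams & ngrams(snippet, n) for snippet in training_snippets)

if __name__ == "__main__":
    user_inputs = [
        "our supplier contract with Acme lists unit price 4.75 and a penalty clause of 12 percent",
    ]
    model_output = ("One record states that the supplier contract with Acme lists "
                    "unit price 4.75 and a penalty clause of 12 percent.")
    print(leaks_training_text(model_output, user_inputs))  # True: an 8-word span is reused
```

Real memorization audits are far more sophisticated, but the underlying worry is the same: once user text sits inside the corpus, the model can hand fragments of it to strangers.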
The European Data Protection Board (EDPB) recently issued an opinion on data protection for AI models that process personal data. The opinion makes clear that the anonymity of an AI model cannot be established by declaration; it must be ensured through rigorous technical verification and continuous monitoring. It also stresses that companies must not only establish that a data-processing activity is necessary, but also demonstrate that they used the means least intrusive to personal privacy.
So when large model companies collect data “in order to improve model performance,” we need to be more vigilant and ask whether this is truly a necessary condition for model progress, or whether the company is misusing users’ data for commercial gain.
03. The gray area of data security
Beyond conventional large model applications, the privacy risks posed by AI agents and on-device AI are even more complex.
Compared with AI tools such as chatbots, the personal information that agents and on-device AI need in order to work is more detailed and more valuable. In the past, the information a phone could obtain mainly consisted of device and application information, logs, and low-level permission data. In on-device AI scenarios, where the dominant technical approach is reading and recording the screen, an on-device agent can obtain not only all of the above but also the screen recordings themselves, and, through model analysis, extract sensitive information displayed on screen, such as identity, location, and payment details.
For example, in the takeout scenario Honor demonstrated at its launch event, location, payment, preference, and other information is quietly read and recorded by the AI application, increasing the risk of personal privacy leakage.
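To see why screen reading grants such sweeping access, consider the hypothetical sketch below: once an agent can capture and OCR the screen, anything any app displays, such as a phone number, a payment amount, or an address, becomes readable to it. The libraries (Pillow, pytesseract) and the patterns are our own choices for illustration, not any vendor’s actual agent.

```python
import re

from PIL import ImageGrab   # pip install pillow
import pytesseract          # pip install pytesseract (also needs the tesseract binary)

# Illustrative patterns for sensitive details an on-device agent might see on screen.
SENSITIVE = {
    "phone": re.compile(r"\b1[3-9]\d{9}\b"),
    "amount": re.compile(r"[¥￥]\s?\d+(?:\.\d{1,2})?"),
    "address_hint": re.compile(r"[省市区路号]"),
}

def scan_screen_once() -> dict:
    """Capture the screen, OCR it, and report which sensitive patterns appear."""
    screenshot = ImageGrab.grab()   # full-screen capture of whatever is currently displayed
    text = pytesseract.image_to_string(screenshot, lang="chi_sim+eng")
    return {name: bool(pattern.search(text)) for name, pattern in SENSITIVE.items()}

if __name__ == "__main__":
    print(scan_screen_once())   # e.g. {'phone': True, 'amount': True, 'address_hint': True}
```

The point is not that vendors ship exactly this code, but that the screen-recording route bypasses the per-field permission model the mobile era relied on, which is why responsibility boundaries matter so much.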
Tencent Research Institute has analyzed, for example, that in the mobile Internet ecosystem, apps that provide services directly to consumers are generally regarded as data controllers and bear the corresponding privacy protection and data security responsibilities in service scenarios such as e-commerce, social networking, and travel. But when an on-device AI agent completes a specific task on top of an app’s service capabilities, the boundary of responsibility for data security between the device maker and the app provider becomes blurred.
Manufacturers often fall back on the excuse of “providing better service,” but for the industry as a whole that is not a legitimate justification. Apple Intelligence, by contrast, has made clear that its cloud will not store user data and uses multiple technical means to prevent any organization, including Apple itself, from accessing it, which is how it wins user trust.
There is no doubt that today’s mainstream large models have many transparency problems that urgently need solving. The difficulty of withdrawing user data, the opacity of training corpus sources, and the complex privacy risks brought by agents and on-device AI are all steadily eroding the foundation of user trust in large models.
As a key force driving digitalization, large models urgently need greater transparency. This concerns not only the security and privacy of users’ personal information, but is also a core factor in whether the entire large model industry can develop healthily and sustainably.
Looking ahead, we hope major model vendors will respond actively, proactively improve product design and privacy policies, and explain the ins and outs of data to users in a more open and transparent way, so that users can use large model technology with confidence. At the same time, regulators should speed up the improvement of relevant laws and regulations, clarify data usage norms and boundaries of responsibility, and create an innovative, safe, and orderly development environment for the large model industry, so that large models truly become powerful tools that benefit humanity.