$1 Million ARR in 45 Days: The Startup That Found PMF in AI 3D Generation
Article source: Founder Park
Image source: Generated by AI
In 45 days, the 3D generation product Rodin reached US$1 million in ARR. That is a significant milestone: for comparison, it took HeyGen, one of the most successful startups in the GenAI space, seven months to reach the same number.
Rodin comes from ShadowEye Technology, which has just closed a Series A round worth tens of millions of dollars, with investors including ByteDance and Meituan Dragon Ball.
The four co-founders average just 25 years of age, yet they have already been building the business for four years. Four years ago they were all classmates. The more confident they grew in their technology, the bumpier the business proved to be.
We sat down with CEO Wu Di and CTO Zhang Qixuan for a long conversation and heard many of the questions they had asked themselves, questions that only gradually found answers over four years of exploration.
“Our technology is so good, why don’t customers use it?” The first question is the archetypal question of the technically gifted.
ShadowEye has spent four years answering it.
01. 3D representation is “fragmented”
Rodin 1.0 crossed US$1 million in ARR within 45 days, but that story is already half a year old. Rodin has since gone through several iterations and is now at version 1.5, with a leap in model performance.
The most important new capability in version 1.5 is generating right angles. It sounds “simple”, but it means more accurately producing straight lines, right angles, and smooth surfaces, with sharper edges.
When the outside world has come to expect that 3D generation means conjuring a corner of the real world from a few words of natural language, what is the value of a more accurate “right angle”?
Film and television works created using Rodin
“3D generation, what exactly is being generated?” This is the most basic but also the most critical question.
Some people think it is video; or rather, most people’s understanding of 3D is largely equivalent to video content full of 3D elements. “Toy Story” in the 1990s, Ang Lee’s digital de-aging of Will Smith, early polygon games, and last year’s hit “Black Myth: Wukong”: everyone can feel the charm of 3D as a presentation format through a flat surface, whether a cinema screen or a gaming monitor.
As a result, approximating 3D from 2D video became an important technical route.
Sora launched in early 2024, and the high consistency in its demo videos triggered discussion about whether it would simply subsume 3D generation. But soon Sora’s release was delayed, its followers delivered middling results, and video models remained a long way from being “film-grade” or joining game pipelines.
There are many reasons; for one, the power of generative AI is still overrated. As film concept artist and illustrator Reid Southen commented earlier, “These videos are a little sloppy and have too many problems, especially artifacts such as temporal inconsistency and extra limbs.”
But one neglected question remains: is footage that displays 3D imagery really “3D”, or is it still just “video”?
Video works face their consumers directly, but the concept of “3D” in game and film production is itself part of a complete industrial pipeline. A virtual model of Huaguoshan, for example, must remain usable throughout the rest of the creative process.
“3D generation, what exactly is generated?”
“Unlike video, 3D is an industry with downstream links. Once a video is produced, users can share it and watch it directly on their phones. But to actually use a 3D output, you have to adapt it to renderers and game engines, and in embodied-intelligence scenarios, to simulation software. That requires the model’s output to meet a set of industry standards.”
“In our understanding, 3D is an asset,” Qi Xuan said. “Text, images, and video are all consumer-grade and serve C-end users directly, but 3D is not.”
Users use Rodin to generate 3D assets in batches
That text, images, and video have become consumer-grade content means they all reach C-end users directly. At the technical level, it means the representations of these three modalities have reached broad industry consensus.
“Video has its mainstream codecs; the mainstream image format is essentially a two-dimensional matrix recording a color at each position; text is an encoding over characters,” Qi Xuan said. “But 3D is not like that. To this day, its representation remains highly fragmented.”
Concretely, the facial model of a 3D digital human may use a dedicated format to support complex facial expressions and body animation, which usually requires high-precision meshes and skeletal rigging. Modeling in a battle-royale game prioritizes performance and efficiency, so a gun lying on the ground is typically built in a low-polygon style. The 3D model of a car at the design stage focuses on precise geometry and functional expression, requiring detailed representation of interior and exterior structure, mechanical components, and aerodynamics; such modeling usually relies on professional CAD software and strict engineering and design standards to guarantee the model’s accuracy and practicality.
Almost every industry that needs 3D data currently has its own standards and representation methods applicable only to its own scenarios, and the data cannot be reused across them.
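To make the fragmentation concrete, here is a minimal sketch (illustrative only, not ShadowEye’s code; all names are ours) of the same unit cube in three common but mutually incompatible representations:

```python
import numpy as np

# One unit cube, three common (mutually incompatible) 3D representations.

# (a) Mesh style: explicit vertices (plus faces, UVs, rigs in practice) --
#     what a game engine or renderer consumes.
vertices = np.array(
    [[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float
)

# (b) Voxel occupancy grid: what some learning pipelines consume.
res = 8
grid = np.ones((res, res, res), dtype=bool)  # cube fills the grid at this resolution

# (c) Point cloud: what a 3D scanner emits -- no faces, no topology at all.
rng = np.random.default_rng(0)
points = rng.random((1000, 3))

# Going (a) -> (c) is easy (sample the surface); going back requires surface
# reconstruction and loses authored structure such as UVs or skeletal rigs.
```

Converting between these forms is lossy in exactly the way the article describes: the data survives, but the scenario-specific structure does not.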
The ShadowEye Technology team has long wanted to unify the representation of 3D data and turn it into a standardized asset, an effort that began with Rodin 1.0. The team proposed a remesh strategy that re-topologizes every model, making each one slightly “thicker” to achieve a consistent representation. The “thickening” has little impact on the look of the generated 3D data or the information it contains, but the whole model ends up looking slightly rounded.
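One way to picture the “thicker but rounder” trade-off (a sketch under our own assumptions; the article does not disclose Rodin’s actual remesh algorithm) is offsetting a signed distance field: pushing the surface outward by a small amount keeps every model watertight and consistent, but the offset surface of a box has rounded corners.

```python
import numpy as np

def box_sdf(p, half=0.5):
    # Signed distance to an axis-aligned box centered at the origin.
    q = np.abs(p) - half
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(q.max(axis=-1), 0.0)
    return outside + inside

# Sample the SDF of a sharp box on a regular grid.
n = 32
ax = np.linspace(-1, 1, n)
pts = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1)
d = box_sdf(pts.reshape(-1, 3)).reshape(n, n, n)

thin = d <= 0.0    # the original model
thick = d <= 0.05  # "thickened" by a small outward offset

# The offset model strictly contains the original: it is watertight and
# uniform, at the cost of slightly rounding sharp corners and edges.
```

The same idea explains why a unified representation can coexist with a slightly “round” look on models that were authored with hard edges.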
But once Rodin 1.0 actually landed in industry, a unified representation alone did not make the generated 3D data usable as an asset. In much of real product design and the games industry, the bulk of demand for 3D assets is not cute pets or a letter “A” made of cloud texture, but things that lean toward inorganic shapes (mathematically constructed surfaces formed from straight lines, curves, or combinations of the two) and sharp edges.
The ability to generate inorganic shapes, sharp edges, and very clean topology is the most prominent improvement in Rodin 1.5’s 3D generation. This emphasis on the consistency and “usability” of generated 3D data is something Wu Di and Qi Xuan learned pit by pit over the past few years.
02. Be Production-Ready
A few years ago, a major client handed the fledgling Wu Di, Qi Xuan, and their teammates their first wall to run into: “The Wandering Earth 2”.
“The Wandering Earth 2” includes scenes of a digitally de-aged Andy Lau and Wu Jing, which the production team hoped to realize with visual effects in post. At the beginning of 2021, the ShadowEye team had built a black spherical frame three meters in diameter in Zhangjiang, Shanghai, with light sources and cameras distributed across the sphere; the device filled an entire room. This was ShadowEye Technology’s first-generation dome light field, built for high-precision facial capture. Once it was finished, film and television teams came asking one after another, including “The Wandering Earth 2”.
dome light field
Wu Di and Qi Xuan were very confident in the face-scanning equipment they had built, but reality was sobering. As Wu Di recalls, “After the Wandering Earth team came to see the results, their first question was: how do we actually use this?”
It could not be used because the original dome light field was really a pure lighting system. When a person stands at the center of the sphere, light from every direction can be captured via the 360-degree light sources; different lighting environments can then be synthesized afterwards, enabling face replacement. Logically, this is closer to what we would now call video generation, which makes it hard to enter the film industry’s CG pipeline.
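The principle behind synthesizing new lighting environments from such captures (a minimal sketch with made-up data, not ShadowEye’s pipeline) is the linearity of light transport: a photo of the subject under any environment is a weighted sum of “one light at a time” (OLAT) photos.

```python
import numpy as np

# Image-based relighting: light transport is linear, so relighting is a
# weighted sum of OLAT captures. Sizes here are toy placeholders.
rng = np.random.default_rng(1)
n_lights, h, w = 16, 4, 4
olat = rng.random((n_lights, h, w))   # one image per light direction

env_a = rng.random(n_lights)          # two target environments, expressed
env_b = rng.random(n_lights)          # as per-light intensity weights

relit_a = np.tensordot(env_a, olat, axes=1)   # subject relit under env_a
relit_b = np.tensordot(env_b, olat, axes=1)
relit_sum = np.tensordot(env_a + env_b, olat, axes=1)

# Linearity: relighting under (a + b) equals relit(a) + relit(b).
```

Note that the output of this process is always an image, never a mesh, which is exactly why it resembles video generation more than a 3D asset pipeline.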
“For a 3D face to really be used in a CG pipeline, it must first be a complete 3D model, with excellent topology, materials that respond to varied lighting, and controllable expressions. Only then can it connect smoothly to everything downstream.”
Shortly afterwards, ShadowEye Technology made a major decision: cut off all investment in its 2D technology base and go all in on 3D. Behind the shift in generation route from 2D to 3D was an internal consensus on being “Production-Ready”.
The term “Production-Ready” comes from the CG industry, which speaks of Post-Production; “Production-Ready” means ready for use in that downstream production.
User works, 70% of models come from Rodin
From the first-generation dome light field, which captured planar data, the system evolved through constant collision with customers into a second generation that captured 3D facial data, until the captured data could be used directly to build digital characters for film and games. Along the way, “Production-Ready” gradually became part of ShadowEye Technology’s identity, inside and out.
“Production-Ready is not an easy indicator to quantify. To be more concrete: in designing technical routes and setting priorities, we treat the usability of the generated result as a key consideration. If a technology improves visual quality but does not bring us closer to Production-Ready, we may well not pursue it,” Qi Xuan said.
The “Production-Ready” mindset also directly determined the counterintuitive path ShadowEye Technology chose for 3D generation once the generative AI wave arrived.
In the most mainstream view at the time, 3D generation was essentially a process of lifting 2D into higher dimensions. After Stable Diffusion appeared, 3D reconstruction was achieved by combining a 2D diffusion model with methods such as NeRF. Because they can be trained on vast amounts of 2D image data, such models often generate diverse results.
As multi-view reconstruction work added multi-view 2D renderings of 3D assets to the training data of 2D diffusion models, those models’ limited understanding of the 3D world was alleviated to a degree. The limitation remains that the starting point of this approach is the 2D image: 2D data records only one side, one projection, of the real world, and no number of viewpoints can completely describe a piece of 3D content. Much information is therefore missing from what the model learns, and the generated results still need heavy revision, making industry standards hard to meet.
The path from 2D to 3D is more like proving that an image model can understand 3D after seeing enough images, but that understanding is still far from 3D data the industry can use. Seen another way, 2D-to-3D amounts to a compression of 3D information, just as a regular 200-gon is still far from an ideal circle.
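The information loss from projections can be shown in a few lines (an illustrative sketch of ours, not from the article): silhouette-based reconstruction of a hollow cube can never recover the internal cavity, because no projection from outside ever sees it.

```python
import numpy as np

# A hollow cube: an 8x8x8 solid with an empty 4x4x4 core.
n = 8
solid = np.ones((n, n, n), dtype=bool)
solid[2:6, 2:6, 2:6] = False

# Silhouettes: orthographic projections along the three axes.
sil = [solid.any(axis=k) for k in range(3)]

# Visual hull: intersect the back-projections of all three silhouettes.
hull = sil[0][None, :, :] & sil[1][:, None, :] & sil[2][:, :, None]

# Every projection hides the cavity, so the reconstruction strictly
# overestimates the object: the cavity is filled in.
```

Adding more viewpoints shrinks the hull toward the convex-visible surface, but hidden structure like this cavity is lost in any finite set of projections.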
After extensive work on digital humans and 3D facial scanning, the ShadowEye team “couldn’t convince themselves” of this seemingly consensus technical route for 3D generation.
“We know where the ceiling of 3D scanning is: even at its most perfect, the output is hard to put directly into production. The best case for lifting 2D Stable Diffusion into 3D is to approach the quality of 3D scanning asymptotically. How could that route get there in one step?” Wu Di said.
To align with industry requirements, 3D generation could only take the 3D-native path: abandon the idea of lifting from 2D and build 3D models directly.
At ACM SIGGRAPH 2024, the top computer-graphics conference, two papers from the ShadowEye Technology team, CLAY (a controllable 3D-native DiT generation framework) and DressCode (a 3D clothing generation framework), were nominated for Best Paper. CLAY proposes a 3D-native diffusion transformer architecture: the generative model is trained entirely on 3D datasets, extracting rich 3D priors from a wide variety of 3D geometry.
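The full CLAY architecture is beyond this article, but the core idea of running diffusion directly on 3D data can be sketched. The toy below (our own construction, using a standard DDPM-style noise schedule) applies the forward noising process to a 3D point latent rather than to a 2D image:

```python
import numpy as np

# Forward diffusion applied directly to a 3D latent (here: a point set),
# rather than to 2D images -- the "3D native" idea in miniature.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 3))   # toy 3D latent: 256 points

T = 100
betas = np.linspace(1e-4, 0.02, T)           # DDPM-style schedule
alphas_bar = np.cumprod(1.0 - betas)         # cumulative signal fraction

def q_sample(x0, t):
    # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

xT = q_sample(x0, T - 1)
# By t = T the latent is close to pure Gaussian noise; a 3D-native model
# (e.g. a diffusion transformer) learns to reverse this on 3D data alone.
```

The contrast with 2D-lifting methods is what gets trained: here the denoiser only ever sees 3D geometry, so no projection step compresses the information away.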
The exploratory work in these two papers helped shift the technical route of the 3D generation industry: 3D-native approaches began to replace 2D-to-3D lifting, and have since become the mainstream path worldwide.
The ShadowEye team at SIGGRAPH
03. From the laboratory to a startup
As early as its founding year, ShadowEye had already made a star product.
In 2021, they launched an anime-style character generation product called “WAND”. A well-known Japanese blogger spotted it the day after launch, and it then quickly caught fire in China, gaining 1.6 million users in a short period.
WAND’s App Store page
Traffic and attention followed, but “I couldn’t keep it,” Wu Di said.
Rather than giving Wu Di and Qi Xuan the chance to choose what kind of company to become, the traffic deprived them of that choice.
“Everyone felt we should turn ourselves into a ‘WAND’ company, including the people around us, and some who wanted to invest in us,” Wu Di said.
But in the end, the “WAND” company never appeared. Soon afterwards, Wu Di and Qi Xuan shut the product down on their own initiative. The names now more familiar to the outside world are ShadowEye Technology and Rodin.
“We didn’t take the path everyone thought we should, because our technical strengths and our ambitions were still in 3D.”
The resolve to completely abandon the image generation route was backed by Dr. Lu Qi.
“Now that you have made this decision, you must be ruthless about it and do only what you believe is right,” Dr. Lu Qi told the ShadowEye team after the MiraclePlus 2021 Autumn Demo Day.
At that Demo Day at the end of 2021, Dr. Lu Qi acted like a “coach”, collecting microphones and high-fiving founders as they finished their pitches. Of the 4,226 startups in that cohort, 53 projects were ultimately accepted, an acceptance rate of 1.25%; ShadowEye Technology was among them.
WAND eventually became a stepping stone for Wu Di and Qi Xuan to move from the laboratory to the business world.
Wu Di later asked Dr. Lu Qi why he had invested in the team. WAND’s popularity that year was what first brought the young ShanghaiTech team to MiraclePlus’s attention, but the more fundamental reason lay behind WAND: MiraclePlus saw something rare, a pure R&D team with commercial thinking at an early stage.
That is no small feat for a founding team whose average age in 2021 was only 21, but the two distinctly corporate dimensions of productization and commercialization had been present ever since the name ShadowEye Technology was coined in the MARS laboratory at ShanghaiTech University.
Wu Di entered ShanghaiTech University in 2015 and, together with Qi Xuan, joined its MARS laboratory, whose main research direction is artificial intelligence combined with computational photography. At the time the laboratory had only three students, the earliest three members of ShadowEye Technology; the fourth co-founder joined the MARS lab in 2020. By then the first-generation dome light field was being built, and the outside world was abuzz with metaverse and digital-human concepts. Wu Di and Qi Xuan saw the commercial prospects behind the capture equipment and decided, in the laboratory, to found ShadowEye Technology.
ShanghaiTech is a very, very young school, founded in 2013; Wu Di was in its second cohort of students. At the time, ShanghaiTech was not yet a “double first-class” university: there was only one dormitory building on campus, and classrooms had to be borrowed from other schools.
But the interesting part is that at ShanghaiTech, whether it was the laboratory, the student union, or the initial curriculum, everything had to be built from scratch. Wu Di loved that feeling: “Studying there felt like starting a business.”
Or in Qi Xuan’s words, “(The first two years of ShanghaiTech) shaped the character of the students of that era. It was their daring, in other words their entrepreneurial spirit.”
The ShadowEye team demonstrating Rodin’s 3D generation at SIGGRAPH Real-Time Live!
The company was founded in June 2020. For more than a year afterwards, Wu Di and Qi Xuan kept running into the huge gap between generated content and the industry’s real needs. The direction that put “Production-Ready” at the core of R&D was calibrated through those countless setbacks.
In the autumn of 2021, ShadowEye received its first financing, from MiraclePlus. After MiraclePlus’s Demo Day, they quickly secured their second.
The second came from Sequoia. Wu Di remembers that the round was finalized at Christmas 2021; they met wave after wave of investors that afternoon, late into the evening. “That day happened to be our Christmas party, but in the end Wu Di and I only made it there to settle the bill,” Qi Xuan said.
The road since then has not been smooth. Starting in 2022, ShadowEye Technology went nearly two years without new financing. One fundraising process consumed an enormous amount of Wu Di’s energy but ultimately failed to close.
That failure brought two results:
First, it shaped ShadowEye’s character: in AI entrepreneurship, commercialization must be considered from day one; survive first, and protect cash flow;
Second, it cemented the choice of the 3D-native route.
“Before that, our plan for 3D generation was to recruit someone who had already tried the field to help us build it, but that might not have escaped the inertia of the prevailing technical path,” Wu Di said. “It was precisely that failed financing that made the entire core R&D team resolve to build truly usable 3D generation.”
A few months later, the original Rodin 1.0 arrived.
04. 3D is the missing puzzle piece
Does ShadowEye want Rodin to become a popular to-C product like WAND?
The answer is clear.
“3D generation will eventually move toward the C-end, but not yet,” Qi Xuan said. “Today a photo or a video can be shared directly on social platforms, but 3D is not yet a shareable format.”
New hardware may bring opportunities, but that will take time. Until then, “when you don’t know where the end of something lies, it is better to just start doing it; there are always plenty of problems in front of you worth solving.” Wu Di is convinced that the current opportunity for 3D generation lies in the existing market.
Film and entertainment aside, demand for 3D generation in industrial fields keeps growing. In architectural design, for example, most renderings used to rely on two-dimensional textures, and computing power constrained the choices for visualization. The limitations were considerable: lighting never looked quite right, the camera had to sit at a certain height, and animation was off-limits. 3D-native technology lets an entire virtual space operate under any lighting and any camera, opening far more room for architectural visualization.
ShadowEye now works with leading companies across games, film and television, manufacturing, and other industries. Rodin’s SaaS product has also accumulated a large base of professional users: graphic designers, AR/VR developers, 3D-printing enthusiasts, and more.
Rodin user reviews on X
“Our target now is the existing market. The existing market has real demand; it can tell us what kind of 3D generation model is actually needed,” Wu Di said.
What about the future?
When Sora made its earth-shaking debut a year ago, it briefly made people wonder whether the industry still needed 3D at all.
Qi Xuan remembers it vividly. “When video generation first appeared, everyone in traditional graphics, ourselves included, thought we were about to be disrupted.” For 3D CG, he explained, video generation promised rendering results directly, with no need for a three-dimensional space. “That has a huge impact on traditional CG technology. People working on 3D generation worry that one day 3D will simply no longer be needed.”
Especially since, although Sora was still “futures” (announced but unshipped) at the time, “OpenAI’s track record on delivering futures is quite good.”
ShadowEye’s R&D team began studying and testing video models intensively. They quickly realized that video generation was merely “fitting” and “simulating”, then “approximating” the final desired result.
“It is a frame-consistency generator. It is not built on a World Model and cannot achieve world consistency,” Qi Xuan said. “Those are two different levels. Relying on video generation alone, you can only stay here.”
“But the interesting thing is that what the CG industry has always done with 3D models is precisely world consistency.”
Take a CG shot in a movie, say a person in a room. First you need a model of every object in the room; each model needs materials to express how it responds to light; the character needs animation; and a photographer in the virtual world must ray-trace every frame of the character’s motion. That ray tracing is the renderer’s job. Movie-grade CG is usually rendered offline, and achieving realistic results often takes cluster-scale rendering.
With that realization, looking back at video generation within this pipeline, it seems to “replace only the work of the offline renderer, not the entire CG industry.”
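To see what “only the offline renderer” means, here is a toy orthographic ray tracer (entirely our own sketch, with made-up scene parameters): each frame of an animation is produced by intersecting one ray per pixel with the scene and shading the hit, which is the per-frame step a video model outputs directly while leaving models, materials, and animation untouched.

```python
import numpy as np

def render_sphere(center, radius=1.0, res=16):
    """Render one frame: orthographic rays along +z, headlight shading."""
    ys, xs = np.mgrid[-2:2:res * 1j, -2:2:res * 1j]
    origins = np.stack([xs, ys, np.full_like(xs, -5.0)], axis=-1)
    d = np.array([0.0, 0.0, 1.0])                # ray direction
    oc = origins - center
    b = oc @ d                                   # ray-sphere quadratic terms
    c = (oc * oc).sum(axis=-1) - radius ** 2
    disc = b * b - c
    hit = disc >= 0
    t = -b - np.sqrt(np.where(hit, disc, 0.0))   # nearest intersection
    p = origins + t[..., None] * d               # hit points
    normal = (p - center) / radius
    headlight = np.array([0.0, 0.0, -1.0])       # light co-located with camera
    shade = np.clip(normal @ headlight, 0.0, 1.0)
    return np.where(hit, shade, 0.0)

# "Animation": the sphere moves, so every frame is ray-traced from scratch.
frames = [render_sphere(np.array([x, 0.0, 0.0])) for x in (-1.0, 0.0, 1.0)]
```

Everything upstream of this function, the geometry, the material, the motion, is exactly the 3D asset work the article argues video generation does not replace.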
“Video is not a world model,” Wu Di said. “It may be one form in which a world model outputs and presents itself to the public.”
“Consistency, especially world-level consistency, is a question of information volume,” Qi Xuan explained. “If a description of how information changes in the world cannot be fed to the AI, it will never achieve that kind of consistency.”
Reaching a world model requires at least world consistency, and that calls for a new module to provide the control.
The missing piece of the puzzle happens to be 3D.
“We have our own World Model in mind.” There is plenty being done and plenty worth doing, and it is exciting just to think about.