
From prompt to 3D model: an AI workflow for text-to-mesh and image-to-mesh generation

Original language: English
May 19, 2025

AI models that automatically generate 3D models based on text or images sound like science fiction, but they are a reality today. Thanks to techniques such as text-to-mesh and image-to-mesh, digital 3D objects can be created faster than ever. This is ideal for applications in games, AR/VR, or product visualization. In this blog post, I will dive deeper into how these models work, how well they perform, and which tools are currently available. I will showcase a complete workflow: from input to mesh, with concrete examples and technical insights. This will give you a clear picture of what is already possible today and where the challenges still lie.

What are these models, and how can I get started with them?

Why text-to-mesh and image-to-mesh are interesting

The ability to automatically generate 3D models based on text or images opens new doors in various sectors. What used to require hours of modeling in Blender or Maya can now be accomplished in just a few minutes with the help of AI. This makes text-to-mesh and image-to-mesh technologies particularly valuable in workflows where speed, iteration, and creativity are central.

For game developers, this means they can build prototypes more quickly or generate placeholder assets based on visual references or concept prompts. In AR/VR development, interactive 3D content can be created in an accessible way and even generated dynamically based on the user's context.

Designers, both graphic and industrial, can visualize ideas with these tools without needing extensive 3D modeling experience. Think of product design, interior visualization, or virtual clothing in a digital fitting room. For them, this technology significantly lowers the barrier to 3D visualization.

There are clear advantages even within R&D environments. For example, in simulations, robotics, or machine learning pipelines where synthetic data is required: AI-generated meshes can serve as training data, or can even be adjusted in real-time based on input.

Finally, there is the aspect of automation. In workflows where speed and scalability are essential — such as e-commerce (product previews), digital heritage conservation, or creative content generation — text/image-to-mesh pipelines can be linked to other AI modules. This creates an almost fully automated 3D production chain, from prompt to optimized model.

Workflow: From prompt to mesh

The power of these AI tools lies not only in what they generate but also in how easily they can be integrated into existing 3D workflows. Below, I will demonstrate a practical pipeline.

Text-To-Mesh Workflow

  • Input: Descriptive text prompt
  • Tool: Tripo.ai (or other suitable options)
  • Output: 3D mesh (.obj / .glb)

Short technical explanation: How do these models work?

Text-to-mesh and image-to-mesh models utilize different neural representations and optimization techniques to generate a 3D mesh from a text prompt or image. The methods employed vary significantly in structure, scalability, and output quality. Below, we will discuss some key concepts that will help you better understand these models.

Volumetric Representations (NeRF-based)

Many models such as DreamFusion or Magic3D are based on NeRFs (Neural Radiance Fields). In this approach, a 3D object is not represented as a mesh or point cloud, but as a volumetric field. Each point in 3D space has a color and a density, learned through a neural network function.

  • Advantage: Continuous and detailed, realistically renderable via differentiable rendering.
  • Disadvantage: The output is initially not a mesh. You need to use a marching cubes algorithm to create a mesh from it, which can sometimes result in a rough or hollow output.
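
To make that conversion step concrete, the sketch below samples a density field on a regular grid and extracts a surface with marching cubes via scikit-image. The density function is a stand-in sphere rather than a trained NeRF, so the snippet runs on its own; with a real model you would query its density network instead.

# Sketch: turning a learned density field into a mesh via marching cubes.
# `density_fn` stands in for the trained NeRF's density network.
import numpy as np
from skimage import measure
import trimesh

def density_fn(pts):
    # Placeholder: high density inside a unit sphere, zero outside.
    return (1.0 - np.linalg.norm(pts, axis=-1)).clip(min=0.0)

# Sample the field on a regular 3D grid.
res = 128
lin = np.linspace(-1.0, 1.0, res)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)
densities = density_fn(grid.reshape(-1, 3)).reshape(res, res, res)

# Extract the isosurface; the threshold (level) usually needs tuning per model.
verts, faces, normals, _ = measure.marching_cubes(densities, level=0.5)
mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
mesh.export("nerf_extracted.obj")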

Point-Based & Explicit Mesh Representations

Models like Shap-E and Point-E work directly with point or mesh structures. They predict, for example, a point cloud or mesh vertices that can be immediately used as geometry output.

  • Advantage: Faster and more directly deployable for game engines or 3D tools. No additional conversion step required.
  • Disadvantage: More difficult to represent fine details or textures, especially in smaller objects.
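
As an illustration of how direct this output is, the sketch below writes a stand-in point cloud straight to disk with trimesh and derives a crude surface from it. A real pipeline would use the points predicted by a model such as Point-E and a proper reconstruction method (Poisson or ball-pivoting) instead of a convex hull.

# Sketch: a point-based model's output is already geometry and can be saved as-is.
# The random array below is a placeholder for the model's predicted N x 3 points.
import numpy as np
import trimesh

points = np.random.rand(4096, 3)                      # placeholder point cloud
colors = (np.random.rand(4096, 3) * 255).astype(np.uint8)

cloud = trimesh.PointCloud(vertices=points, colors=colors)
cloud.export("predicted_points.ply")                  # opens directly in Blender / MeshLab

# A crude way to get a closed surface from the points; real pipelines use
# Poisson or ball-pivoting reconstruction for better results.
hull = cloud.convex_hull
hull.export("predicted_hull.obj")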

Diffusion Models for 3D (Latent or Rendered)

An increasing number of models utilize diffusion processes, as DreamFusion does with "Score Distillation Sampling." The idea: a text prompt is used in combination with a 2D text-to-image diffusion model (such as Imagen or Stable Diffusion), which generates images from various angles. These renders are then used as targets to optimize an underlying 3D representation (e.g., NeRF).

  • Render-based supervision: The generated 3D representation is rendered from multiple angles, and those renders must resemble images generated by the 2D diffusion model. Based on this, the 3D parameters are updated.
  • Latent optimization: Some new methods (such as Meshy.ai) operate entirely in the latent space of the diffusion model and avoid explicit render loops, making them faster and more scalable.
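
The sketch below shows the structure of such a Score Distillation Sampling loop in PyTorch. The renderer and the diffusion prior are tiny placeholders so the example stays self-contained; in a real system the renderer is a differentiable NeRF and the prior is a pretrained text-to-image diffusion model conditioned on the prompt.

# Sketch of a DreamFusion-style SDS loop with placeholder components.
import torch

image_size = 64
theta = torch.randn(3, image_size, image_size, requires_grad=True)  # stand-in for 3D parameters

def render(params):
    # Placeholder "differentiable renderer": identity on the learnable image.
    return torch.sigmoid(params)

def predict_noise(noisy_image, t):
    # Placeholder for a frozen, text-conditioned diffusion prior.
    return 0.9 * noisy_image

optimizer = torch.optim.Adam([theta], lr=1e-2)

for step in range(200):
    rendered = render(theta)
    t = torch.randint(1, 1000, (1,)).item()
    noise = torch.randn_like(rendered)
    noisy = rendered + noise                      # simplified forward diffusion
    with torch.no_grad():
        pred = predict_noise(noisy, t)
    # SDS: nudge the 3D parameters so the prior's noise estimate matches the added noise.
    grad = pred - noise
    optimizer.zero_grad()
    rendered.backward(gradient=grad)
    optimizer.step()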

CLIP Guidance & Text Embedding Matching

Some older or lighter systems use CLIP-based loss, where rendered images from the 3D model are compared with the original text prompt using a vision-language encoder like CLIP.

  • The render is encoded by CLIP, just like the prompt, and the cosine similarity is maximized.
  • Less accurate than pure diffusion supervision, but computationally lighter.
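
As a concrete example, the sketch below scores a saved render against a prompt with the Hugging Face transformers implementation of CLIP. The image path and prompt are placeholders; in an actual optimization loop the image would come from a differentiable renderer.

# Sketch: measuring render-to-prompt agreement with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a futuristic handheld plasma blaster"
image = Image.open("render.png")                      # hypothetical rendered view

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity: higher means the render better matches the prompt.
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print("CLIP similarity:", similarity.item())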

Comparison of Models

In this section, we test various AI tools and models that are currently available for text-to-mesh and image-to-mesh generation. We compare the results based on the same input (text prompt or image) to see how well each model performs in terms of detail, shape consistency, texture quality, and ease of use.

Text-To-Mesh

The following 3 models were used for the tests:

📝 Prompt 1: Futuristic Plasma Blaster

"A futuristic handheld plasma blaster designed for elite space marines. The weapon features a sleek, matte-black carbon fiber body with glowing blue energy conduits running along its barrel. The front end houses a rotating tri-nozzle mechanism surrounded by copper cooling fins. Its ergonomic grip is textured with dark rubber padding and contains a small digital ammo counter screen with green LED lights. Several small warning decals and engraved serial numbers are etched into the metal near the trigger housing."

📝 Prompt 2: Organic Tree

“A large, gnarled forest tree stump with thick, moss-covered bark and multiple twisted roots sprawling outward. The top surface is uneven and cracked, with a shallow pool of rainwater reflecting light. Small mushrooms with red caps and white spots grow along the side, and a tiny hollowed-out squirrel den is visible near the base. The bark shows fine details like vertical grain lines and peeling textures, with vines hanging over one edge. The lighting emphasizes dampness and the organic complexity of the wood.”

Image-To-Mesh

The following 3 models were used for the tests:

These are the 3 images we used as prompts:

Results

After testing multiple prompts, it appears that Tripo3D delivers the most consistent and impressive results, both in text-to-mesh and image-to-mesh. The generated models exhibit high geometric complexity combined with sharp, well-applied textures. The models are generally ready for immediate use in visualization or prototyping workflows.

Hyper3D and Meshy.ai perform similarly, but differ slightly depending on the prompt. Meshy.ai sometimes excels in texture usage and recognizability, while Hyper3D is strong in mesh structure. However, both often require additional post-processing when the model is used in a production environment.

InstantMesh clearly scores the lowest. The model struggles with complex shapes and often produces generic or inconsistent results from detailed input. Simple objects such as furniture, pots, or stands are manageable, but its usefulness beyond that is limited.

Applications in Industry

Although text-to-mesh and image-to-mesh models are often associated with creative sectors such as gaming and AR/VR, their potential extends far beyond that. These technologies can also provide significant added value in industrial environments, education, and healthcare.

An important example is digital twinning — the virtual reconstruction of physical objects or systems. Using an image or text description, an AI model can quickly generate a digital representation of, for instance, a machine part or mechanical system. This accelerates the design process and makes maintenance and simulation models more accessible.

Interactive 3D representation of a gear system:

In healthcare and medical education, AI-generated 3D content can play an important role. Think of visualizing complex anatomical structures based on descriptions or medical images. This makes abstract concepts more tangible for students or patients.

Anatomical 3D model of the human heart for educational use

By linking these technologies to interactive platforms or AR headsets, users can learn, train, or plan more intuitively — without costly manual modeling.

Limitations & Realism: What challenges do we still face today?

Although AI tools can generate impressive 3D models, there are still clear limitations when you want to implement them in professional workflows.

Mesh Quality

The generated meshes are often rough, contain too many polygons, or have artifacts such as holes or floating vertices. Retopology is usually necessary.
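
A first, automated cleanup pass can be scripted, for example with trimesh as in the sketch below, but it is no replacement for proper retopology in Blender or Instant Meshes. The filename is a placeholder.

# Sketch: quick automated cleanup of a generated mesh.
import trimesh

mesh = trimesh.load("generated_model.obj", force="mesh")   # hypothetical filename

mesh.merge_vertices()                                 # weld duplicated vertices
mesh.update_faces(mesh.nondegenerate_faces())         # drop zero-area faces
mesh.remove_unreferenced_vertices()
trimesh.repair.fill_holes(mesh)                       # patch small holes where possible

print("watertight after cleanup:", mesh.is_watertight)
mesh.export("generated_model_clean.obj")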

Inconsistent renders

In NeRF-based models, the front and side views are usually well-rendered, but the back often contains noise or is not well-defined. This is due to a lack of visible training views.

UV mapping and textures

Many models do not provide proper UV unwrapping or high-quality textures. You get vertex colors or generic diffuse maps, which require further post-processing.
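
Before sending a generated mesh into a production pipeline, it is worth checking programmatically whether UVs and a material are actually present. The sketch below does this with trimesh; the filename is again a placeholder.

# Sketch: does the generated mesh ship real UVs and a texture, or only vertex colors?
import trimesh

mesh = trimesh.load("generated_model.glb", force="mesh")   # hypothetical filename

visual = mesh.visual
if isinstance(visual, trimesh.visual.texture.TextureVisuals) and visual.uv is not None:
    print("UV coordinates present:", visual.uv.shape)
    print("material:", visual.material)
else:
    # Many generators only bake colors into the vertices.
    print("no UV map found; vertex colors present:",
          getattr(visual, "vertex_colors", None) is not None)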

Fantasy and Interpretative Errors

AI sometimes invents details where there is insufficient input. As a result, you get objects that look good from one angle, but are structurally incorrect.

No scale or metric

Models deliver objects without consistent scale or dimensions. A pair of glasses can be as large as a car.
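
A pragmatic workaround is to rescale every generated object to a known real-world size as a post-processing step, as in the sketch below. The target size and filename are placeholders.

# Sketch: normalizing a generated object to a chosen real-world size.
import trimesh

mesh = trimesh.load("glasses.obj", force="mesh")      # hypothetical filename
target_size = 0.15                                    # e.g. 15 cm for a pair of glasses

current_size = mesh.extents.max()                     # longest edge of the bounding box
mesh.apply_scale(target_size / current_size)
mesh.apply_translation(-mesh.bounds.mean(axis=0))     # recenter at the origin

mesh.export("glasses_scaled.obj")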

Future and Automation: AI as a Building Block in 3D Pipelines

The greatest value of text-to-mesh and image-to-mesh models lies in their potential for automation. Instead of manually designing each 3D model, they can serve as a starting point within a larger content pipeline.

Imagine an AR/VR environment where objects are automatically generated based on context or user input. For example, a user describes "a medieval shield," and within seconds it appears as an interactive object in a virtual world. In game engines, these models could automatically generate placeholder assets, environmental objects, or even NPCs during level design.

In e-commerce or digital twin applications, AI can create 3D models based on existing photos or descriptions of products, significantly increasing the scalability of 3D catalogs.

By linking these tools to existing software (such as Unity, Blender, or Houdini) and combining them with AI agents or prompt generators, largely automated pipelines can be created, from text or image to usable, optimized 3D content.
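
As a small example of such a link, the sketch below imports a generated .glb into Blender headlessly and applies a first optimization pass. The file paths and decimation ratio are placeholders; run it with: blender --background --python this_script.py.

# Sketch: last step of an automated pipeline, executed inside headless Blender.
import bpy

# Start from an empty scene so only the generated asset remains.
bpy.ops.wm.read_factory_settings(use_empty=True)

bpy.ops.import_scene.gltf(filepath="/tmp/generated_model.glb")

# Apply a simple decimate modifier to every imported mesh as a first
# optimization pass before hand-off to a game engine.
for obj in bpy.context.scene.objects:
    if obj.type == "MESH":
        mod = obj.modifiers.new(name="Decimate", type="DECIMATE")
        mod.ratio = 0.5
        bpy.context.view_layer.objects.active = obj
        bpy.ops.object.modifier_apply(modifier=mod.name)

bpy.ops.export_scene.fbx(filepath="/tmp/generated_model_optimized.fbx")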

In short: these models will not replace the creative process in the future, but they will drastically accelerate it and make it more accessible.

Closure

AI-driven 3D generation is still in its infancy, but it already shows immense potential for faster, more flexible workflows. Whether you're working on a game, AR application, or product visualization, these tools can make a significant difference in speed and creativity.

Authors

  • Hube Knaepkens, intern
  • Jens Krijgsman, Automation & AI researcher, Teamlead

Want to know more about our team?

Visit the team page