Friday, January 31, 2025

DeepSeek’s Janus Pro 7B vs OpenAI’s DALL-E 3: Which is better?


In the last two weeks, if anything has made headlines besides the crashing stock markets and the new US President, it is DeepSeek. The Chinese AI firm has been launching one strong LLM after another, and its latest release is Janus-Pro-7B, a multimodal model that also generates images. According to the reported results, it scores higher than OpenAI’s DALL-E 3 and Stable Diffusion on benchmarks such as GenEval. The best part? It is open source! So in this blog, we will test DeepSeek’s Janus Pro 7B against DALL-E 3 to find out which model is better.

What is DeepSeek Janus Pro?

Developed by DeepSeek AI, Janus-Pro is an advanced multimodal large language model (LLM). It is an enhanced version of the previous Janus model.

  • Janus Pro uses a decoupled architecture: multimodal understanding and text-to-image generation are handled by separate encoders, so each path can be optimized for its own task.
  • The model has been trained on a mix of multimodal, textual, and synthetic aesthetic data via a three-stage process to ensure superior performance. 
  • This open-source model excels in following dense and complex prompts.
  • Currently, the model is available in two versions: Janus-Pro-1B and Janus-Pro-7B, showcasing its scalability for various use cases.

Janus Pro: Performance Evaluation

The Janus Pro model has been tested on over 20 different benchmarks. Some of the key results are as follows:

Text-to-Image Generation

  • GenEval: Scored 0.80, surpassing DALL-E 3 (0.67) and Stable Diffusion 3 Medium (0.74).
  • DPG-Bench: Achieved 84.19 overall accuracy, highlighting its ability to handle dense and nuanced prompts.

Multimodal Understanding

  • MMMU (Massive Multi-discipline Multimodal Understanding): Achieved 41.0 accuracy, outperforming TokenFlow-XL (38.7).
  • MME (Multimodal Evaluation): Showed significant gains in reasoning and contextual understanding.
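To put the reported scores side by side, the short Python sketch below computes Janus-Pro-7B's margin over each baseline quoted above. The numbers are copied from this article; the dictionary layout and the function name `margins` are illustrative choices, not part of any benchmark tooling.

```python
# Scores as quoted in this article (higher is better on both benchmarks).
REPORTED = {
    "GenEval": {"Janus-Pro-7B": 0.80, "DALL-E 3": 0.67, "Stable Diffusion 3 Medium": 0.74},
    "MMMU": {"Janus-Pro-7B": 41.0, "TokenFlow-XL": 38.7},
}


def margins(benchmark: str, model: str = "Janus-Pro-7B") -> dict[str, float]:
    """Return the model's score minus each other model's score on one benchmark."""
    scores = REPORTED[benchmark]
    return {
        other: round(scores[model] - value, 2)
        for other, value in scores.items()
        if other != model
    }


if __name__ == "__main__":
    for name in REPORTED:
        print(name, margins(name))
```

GenEval is scored between 0 and 1 while MMMU is reported as an accuracy figure, so the margins are only comparable within a benchmark, not across them.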

Janus-Pro: Training and Architecture

Janus-Pro involves a three-stage training process using an architecture that integrates decoupled components for multimodal and text-to-image tasks:

Training Stages

  1. Stage 1 – Pretraining the Adaptors:
    • Focuses on pretraining image adaptors and heads using datasets like ImageNet.
    • Extended training steps ensure better modeling of pixel dependencies.
  2. Stage 2 – Unified Pretraining:
    • Integrates multimodal data to prepare the model for diverse tasks, eliminating dependencies on single-purpose datasets.
  3. Stage 3 – Supervised Fine-Tuning:
    • Refines the model with a calibrated data ratio of 5:1:4 (multimodal, text, and text-to-image data).
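As a rough illustration of what the 5:1:4 ratio in Stage 3 means in practice, the sketch below splits a fine-tuning sample budget across the three data types. Only the ratio comes from the text above; the function name `mix_counts` and the rounding rule are assumptions made for this example, and it does not describe how DeepSeek actually samples its data.

```python
# Stage 3 data ratio quoted above: multimodal : text : text-to-image = 5 : 1 : 4.
RATIO = {"multimodal": 5, "text": 1, "text_to_image": 4}


def mix_counts(total_samples: int, ratio: dict[str, int] = RATIO) -> dict[str, int]:
    """Split total_samples across data types in proportion to ratio.

    Integer division is used, and any remainder goes to the largest share,
    so the counts always add up to total_samples.
    """
    parts = sum(ratio.values())
    counts = {name: total_samples * weight // parts for name, weight in ratio.items()}
    largest = max(ratio, key=ratio.get)
    counts[largest] += total_samples - sum(counts.values())
    return counts


if __name__ == "__main__":
    print(mix_counts(1000))
```

With a budget of 1,000 samples, half go to multimodal data, a tenth to plain text, and the remaining four tenths to text-to-image data.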

Architecture Overview

  • Dual Encoders: Janus-Pro uses separate encoders for multimodal understanding and text-to-image generation, minimizing interference and optimizing task-specific performance.
  • Centralised Decoding Module: A shared decoder combines insights from both encoders to deliver precise outputs across domains.
  • Parameter Efficiency: The scalable architecture, with versions like 1B and 7B parameters, ensures adaptability to various computational requirements.
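The routing idea described above can be pictured with a toy Python sketch: each task goes through its own encoder, and both paths feed one shared decoder. Everything here is a placeholder invented for illustration (the function names, the `Routed` record, and the string "features"); it shows only the control flow, not Janus-Pro's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Routed:
    task: str          # "understanding" or "generation"
    encoder_used: str  # name of the task-specific encoder that handled the input
    output: str        # what the shared decoder produced


def understanding_encoder(image_name: str) -> str:
    # Placeholder for the encoder used when the model reads an image.
    return f"understanding-features({image_name})"


def generation_encoder(prompt: str) -> str:
    # Placeholder for the encoder used on the text-to-image path.
    return f"generation-tokens({prompt})"


def shared_decoder(encoded: str) -> str:
    # Placeholder for the single decoder that both paths feed into.
    return f"decoded[{encoded}]"


def run(task: str, payload: str) -> Routed:
    """Route the input to its task-specific encoder, then to the shared decoder."""
    encoders = {
        "understanding": understanding_encoder,
        "generation": generation_encoder,
    }
    if task not in encoders:
        raise ValueError(f"unknown task: {task}")
    encoder = encoders[task]
    return Routed(task=task, encoder_used=encoder.__name__, output=shared_decoder(encoder(payload)))


if __name__ == "__main__":
    print(run("understanding", "scorecard.png"))
    print(run("generation", "a girl looking into a mirror"))
```

Keeping the two encoders separate means a change to one path cannot alter the features produced by the other, which is the "minimizing interference" point made in the list above.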

Janus Pro 7B vs Dall-E 3

I’ll be testing DeepSeek’s Janus Pro 7B against OpenAI’s DALL-E 3 on a set of image generation and image analysis tasks.

  • For now, Janus Pro 7B is available on Hugging Face, where you can access it.
  • ChatGPT uses DALL-E 3 for image generation, while its image analysis is handled by the underlying GPT model. I’ll run every task through ChatGPT and label its responses “DALL-E 3” throughout.
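For readers who want to try the model locally, here is a minimal Python sketch for fetching the weights from the Hugging Face Hub. The repository id `deepseek-ai/Janus-Pro-7B` and the local folder name are assumptions that this article does not state, so check the model page before relying on them. The sketch uses `snapshot_download` from the `huggingface_hub` package and covers only the download, not inference.

```python
# Assumed repository id on the Hugging Face Hub; verify it before use.
REPO_ID = "deepseek-ai/Janus-Pro-7B"


def hub_url(repo_id: str = REPO_ID) -> str:
    """Return the model page URL for a Hub repository id."""
    return f"https://huggingface.co/{repo_id}"


def download_weights(repo_id: str = REPO_ID, local_dir: str = "janus-pro-7b") -> str:
    """Download the repository snapshot and return the local path.

    Requires `pip install huggingface_hub`; expect a multi-gigabyte download.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    print(hub_url())
    # Uncomment to fetch the weights:
    # print(download_weights())
```

The library import sits inside `download_weights` so that the helper for building the URL works even when `huggingface_hub` is not installed.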

Let’s get started!

Task 1: Guess the Winner

Prompt: “Based on the score in the image, which team has a higher probability of winning this game?”


Response:

Janus Pro-7B

Based on the score in the image, the team IND (India) has a higher probability of winning this game. IND has 157/4, which means they have a lead of 157 runs with 4 wickets remaining. The team from the other side, the one with the batting score, is currently on 123/6, meaning they are trailing by 34 runs with 6 wickets remaining

DALL-E 3

The score displayed in this image shows that India is batting at 137/1 in 12.5 overs against South Africa. This indicates a strong position for India due to:

  1. Run Rate: India is scoring at a high rate, approximately 10.7 runs per over, which is excellent for a T20 match.
  2. Wickets in Hand: Losing only 1 wicket at this stage means India has a good platform to accelerate further in the remaining overs.

Unless South Africa has already posted a very high target, India’s position is dominant. However, without knowing the target or the context of the game (e.g., first innings or chase), a definitive probability cannot be calculated. Based on this information alone, India has a higher probability of winning.
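The run rate quoted in the response above can be checked with a few lines of Python. In cricket notation, "12.5 overs" means 12 overs and 5 balls (77 balls), not twelve and a half overs. The helper names below are made up for this check.

```python
def overs_to_balls(overs: str) -> int:
    """Convert cricket overs notation ('12.5' = 12 overs and 5 balls) to balls."""
    whole, _, extra = overs.partition(".")
    balls = int(extra) if extra else 0
    if not 0 <= balls <= 5:
        raise ValueError("the part after the dot counts balls and must be 0-5")
    return int(whole) * 6 + balls


def run_rate(runs: int, overs: str) -> float:
    """Runs scored per six-ball over."""
    return runs * 6 / overs_to_balls(overs)


print(round(run_rate(137, "12.5"), 1))  # 137 runs off 77 balls -> 10.7
```

Treating 12.5 as a decimal would give 137 / 12.5 ≈ 11.0 instead, so the figure of roughly 10.7 runs per over in the response matches the ball-by-ball calculation.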

Observation:

Model Comparison: Janus Pro-7B vs DALL-E 3

  • Janus Pro-7B: The model identified the teams accurately and picked the right likely winner, but it misread the scores shown in the image, so its overall analysis was flawed.
  • DALL-E 3: The model correctly identified the teams and the score, and it gave the correct winning chances based on the information provided in the image.

Unlike Janus Pro-7B, DALL-E 3 read the score correctly and also noted that, without complete information, it could not calculate an exact win probability. Thus, DALL-E 3 hit the ball out of the park on this one!

Winner: DALL-E 3

Task 2: Tell me the “back” Story

Prompt: “Explain the backstory behind this image”

Response:

Janus Pro-7B

“This image depicts a character who is wearing an advanced suit that includes a high-tech armored glove with a glowing element on the hand, reminiscent of technology often associated with superheroes in popular culture. The character is also wearing a chest piece with intricate and complex designs, suggesting advanced technology and possibly serving as a power source or a protective device.

The visual style and elements of the image are typical of science fiction and superhero films, where characters often possess superhuman abilities and wear specialized suits or armor to enhance their powers and protect themselves.

This particular image is likely a scene from a movie where the character is engaging in a battle or using their powers, given the intensity of the action and the dramatic lighting. The glowing element in the hand suggests the use of some form of energy manipulation, which is a common theme in superhero narratives.”

DALL-E 3

Observation:

Model Comparison

  • Janus Pro-7B: The model gives a detailed description of the image yet is not able to give the backstory behind the image.
  • DALL-E 3: The model correctly identifies the image as a snippet from a Marvel movie and, based on that, gives a brief and accurate backstory. It correctly identifies the main character in the image and states the significance of the scene too.

Janus Pro-7B gave a detailed description of the image and listed its intricacies, which indicates strong image analysis capabilities, but it failed to get the backstory behind it. DALL-E 3 quickly identified the image, gave its backstory, and even explained the significance of the scene. Thus, for this task, DALL-E 3 is a clear winner.

Winner: DALL-E 3 

Task 3: Generate an Image

Prompt: “Generate an image of a girl with deep blue eyes and blonde hair looking straight into a mirror, with her one hand under her face and the other one resting on her side, standing under the light of a flickering blub”

Response:

Janus Pro-7B

[Image: output generated by Janus Pro-7B]

DALL-E 3

[Image: output generated by DALL-E 3]

Observation:

Model Comparison

  • Janus Pro-7B: The model generates the image as per the prompt. It captures all the details of the prompt and, in the mirror, it shows a reflection of the bulb too. The image appeared to be a bit animated.
  • DALL-E 3: The model generates a detailed, flawless image that captures all the nuances of the prompt. The image gave the feel of a sci-fi movie.

Both models generated impressive images in this task. At first glance, I couldn’t spot any error in either image. On closer inspection, though, there is slight blurriness around the bulb in the image generated by Janus Pro-7B, while the image generated by DALL-E 3 is immersive and intricately detailed.

Winner: DALL-E 3 

Task 4: Meme Talk

Prompt: “Explain this meme”

Response:

Janus Pro-7B:

1. Left Side (Phones in 1990s):

  • Depicts a muscular Doge (a popular internet meme character) with the word “NOKIA” written on its body.
  • Captions:
    • “I’m the strongest material known to man.”
    • “My battery power is unlimited.”
  • These references to Nokia phones from the 1990s, known for their durability, long battery life, and robust build quality.

2. Right Side (Phones in 2020):

  • Depicts a smaller, sad Doge with an Apple logo on its body.
  • Captions:
    • “Please don’t touch me or I’ll crack.”
    • “I haven’t been charged for 3 hours, I’ll commit die.”
  • These references to modern smartphones, particularly iPhones, which are often criticized for their fragility, frequent battery drain, and frequent need for charging.

Meaning:

The meme humorously contrasts the durability and robustness of phones from the 1990s with the fragility and frequent battery drain of modern smartphones. It highlights the significant changes in technology and user expectations over the years, with the 1990s phones being seen as more reliable and less demanding.

Conclusion:

The meme is a playful critique of modern smartphones, poking fun at their perceived fragility and frequent need for charging, in contrast to the robust and long-lasting phones of the past.

Let me know if you’d like further clarification!

DALL-E 3

Observation:

Model Comparison

  • Janus Pro-7B: The model correctly explains the meme, explaining both sides. It further provides context to the image and then briefly concludes it.
  • DALL-E 3: The model gives a slightly detailed explanation of the meme with a quite verbose message at the end.

Both models explained the meme accurately, but the response from Janus Pro-7B is crisper and clearer.

Winner: Janus Pro-7B

Janus Pro 7B vs Dall-E 3: Final Verdict

Task Comparison

  • Guess the Winner: won by DALL-E 3
  • Tell me the “back” Story: won by DALL-E 3
  • Generate an Image: won by DALL-E 3
  • Meme Talk: won by Janus Pro-7B
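The scoreboard can also be tallied in a few lines of Python. The task names and winners are taken from this article; the `tally` helper is just an illustrative name.

```python
from collections import Counter

# Task winners as recorded in this article.
RESULTS = {
    "Guess the Winner": "DALL-E 3",
    'Tell me the "back" Story': "DALL-E 3",
    "Generate an Image": "DALL-E 3",
    "Meme Talk": "Janus Pro-7B",
}


def tally(results: dict[str, str]) -> Counter:
    """Count how many tasks each model won."""
    return Counter(results.values())


if __name__ == "__main__":
    for model, wins in tally(RESULTS).most_common():
        print(f"{model}: {wins} of {len(RESULTS)} tasks")
```

Four tasks is a small sample, so the 3 to 1 split is best read as a snapshot of these specific prompts rather than a general ranking.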

Conclusion

DeepSeek’s Janus Pro-7B is a great addition to the world of image generation and multimodal LLMs. However, DALL-E 3, with its advanced integration and richer dataset, maintains a lead in versatility and real-world application. As DeepSeek continues to improve Janus Pro-7B and integrate it into its broader ecosystem, it is set to become a challenger to industry leaders like DALL-E 3, Stable Diffusion, and Midjourney. The open-source nature of Janus Pro-7B makes it a go-to tool for researchers and developers who want to harness advanced image generation capabilities.

Stay tuned to Analytics Vidhya Blog for more such awesome content!

Frequently Asked Questions

Q1. What is DeepSeek Janus Pro-7B?

A. Janus Pro-7B is a multimodal large language model (LLM) developed by DeepSeek AI. It specializes in text-to-image generation and multimodal understanding.

Q2. How is Janus Pro-7B different from previous Janus models?

A. Janus Pro-7B uses a decoupled architecture, making it highly efficient in handling both multimodal tasks and text-to-image generation, with enhanced accuracy and scalability.

Q3. How does Janus Pro-7B fare against DALL-E 3?

A. Janus Pro-7B excels in image analysis and provides detailed responses but falls short in some real-world applications, such as reading scores accurately or interpreting backstories compared to DALL-E 3.

Q4. Is Janus Pro-7B open source?

A. Yes, Janus Pro-7B is open-source and available on platforms like Hugging Face, making it accessible for developers and researchers.

Q5. List some of the benchmarks that were used to evaluate Janus Pro-7B.

A. Janus Pro-7B has been tested on benchmarks like GenEval, DPG-Bench, MMMU, and MME. On GenEval, for example, it scored 0.80 against 0.67 for DALL-E 3.

Q6. Can Janus Pro-7B generate good-quality images?

A. Yes. In the image generation task above, Janus Pro-7B followed a detailed prompt closely, although its output was slightly blurrier than the image from DALL-E 3.

Anu Madan has 5+ years of experience in content creation and management. Having worked as a content creator, reviewer, and manager, she has created several courses and blogs. Currently, she is working on creating and strategizing content curation and design around Generative AI and other upcoming technologies.
