
What is Dataiku?

Dataiku DSS (Data Science Studio) is a collaborative data science software platform for data professionals: data scientists, data engineers, data analysts, data architects, CRM and marketing teams. It is a centralised working environment that makes it easy to manipulate data, quickly explore and share analyses, make predictions and create Artificial Intelligence (AI) models with a few clicks.

The platform is also designed to simplify the automation and industrialisation of processing chains, i.e. data collection, data preparation, training, testing and monitoring of AI models and the production deployment phase.

The platform is used for a wide range of applications such as customer segmentation, fraud detection, customer scoring (churn prediction, propensity scores, risk scores, etc.), deep learning and natural language processing (NLP).

What is the history of Dataiku?

Dataiku DSS is the AI platform developed by Dataiku, a startup founded in Paris in 2013 by Florian Douetteau (current CEO), Clément Stenac, Thomas Cabrol and Marc Batty, and now based in the US. The company has grown rapidly since its inception and opened an office in New York in 2015.

After raising $101 million in 2018, Dataiku raised another $400 million in 2021 at a valuation of approximately $4.6 billion. The startup became a unicorn and now has more than 1,000 employees and more than 300 customers among the world's largest groups. These include the French companies Accor, BNP Paribas, Engie and the LVMH group, as well as Morgan Stanley, UBS and Walmart. The company's investors include CapitalG (Google), Snowflake Ventures and Battery Ventures.

The platform currently has more than 45,000 active users and more than 450 clients worldwide.

What are the main features of the Dataiku platform?

Dataiku DSS has more than 90 features that can be classified according to the following main themes:

Integration & Connectivity of Dataiku DSS within other infrastructures

The platform integrates with Hadoop, Spark, SQL, Teradata, and is available on the AWS, Azure and Google Cloud platform marketplaces.

The detection of data schemas and formats is automatic. Thus, Dataiku is able to natively recognise a numerical variable, a character string, an age, a date, or even a geographical location.
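As an illustration of what automatic type detection involves, here is a minimal sketch that guesses a column type from a few sample values. It is not Dataiku's detection logic: the function name `infer_column_type`, the type labels and the single supported date format are assumptions made for this example.

```python
from datetime import datetime


def _parses(value, parser):
    """Return True if `parser` accepts `value` without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def _parse_iso_date(value):
    # Assumption: only the YYYY-MM-DD format is recognised in this sketch.
    return datetime.strptime(value, "%Y-%m-%d")


def infer_column_type(samples):
    """Guess a column type from sample strings (illustrative heuristic only)."""
    values = [s.strip() for s in samples if s.strip()]
    if not values:
        return "string"
    # Try the most specific type first, then fall back to broader ones.
    if all(_parses(v, int) for v in values):
        return "integer"
    if all(_parses(v, float) for v in values):
        return "decimal"
    if all(_parses(v, _parse_iso_date) for v in values):
        return "date"
    return "string"


column_types = {
    "age": infer_column_type(["34", "51", "27"]),
    "signup": infer_column_type(["2021-08-05", "2013-01-01"]),
    "city": infer_column_type(["Paris", "New York"]),
}
```

A real detector would also handle missing values, locale-specific formats and richer meanings such as geographical locations.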

Moreover, data storage and data processing are decoupled: the data stays where it is. Data can therefore be accessed quickly, without first being transferred for processing.
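One common way to keep data where it is stored is to push the computation down to the storage engine and bring back only the result. The sketch below illustrates that general idea with Python's built-in `sqlite3` module standing in for a SQL data store. The `orders` table and its contents are invented for the example, and this is not a description of how Dataiku executes its recipes.

```python
import sqlite3

# An in-memory database stands in for a remote SQL data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("alice", 80.0)],
)


def total_per_customer(connection):
    """Aggregate inside the database; only the small summary leaves it."""
    query = (
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return dict(connection.execute(query).fetchall())


totals = total_per_customer(conn)
```

Only one row per customer is returned to the caller, however many order rows the table holds.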

Plugins

Dataiku DSS comes with standard visual components to connect to data, process it and train models. Dataiku also offers the flexibility to implement custom components, package them and share them with others. These custom components are available as plugins. Each plugin consists of a graphical user interface and a backend written by the developer in R or Python.

There is a gallery of more than 100 plugins in the Dataiku Plugin Store, providing data applications in many areas such as language translation, weather, recommendation systems, data import/export and ready-to-use graphical interfaces.

Optimised data preparation

The graphical interface of Dataiku DSS accelerates data wrangling with interactive data cleansing and enrichment. Dataiku automatically suggests contextual transformations according to the type of data. For example, from a date, Dataiku proposes to calculate an age. From an address, it can extract the street number and name, the postal code or the city. There are more than 80 visual processors that can be activated with a few clicks and without code. The same interface also lets users filter, transform or summarise the data statistically with simple clicks.
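For readers who prefer code, the two transformations mentioned above (an age from a date, address parts from an address) can be sketched in plain Python as follows. This is not how Dataiku's visual processors are implemented. The function names and the address pattern are assumptions, and the pattern only handles addresses shaped like "number street, five-digit postal code city".

```python
import re
from datetime import date


def age_on(birth_date, today):
    """Return the age in full years on the day `today`."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical pattern: "<number> <street>, <5-digit postal code> <city>".
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>.+?),\s*"
    r"(?P<postal_code>\d{5})\s+(?P<city>.+?)\s*$"
)


def split_address(address):
    """Split an address into its parts, or return None if it does not match."""
    match = ADDRESS_PATTERN.match(address)
    return match.groupdict() if match else None


parts = split_address("12 rue de la Paix, 75002 Paris")
age = age_on(date(1990, 6, 15), date(2024, 6, 14))
```

Real addresses vary far more than this pattern allows, which is one reason a ready-made processor is convenient.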

Integrated development

Many languages are supported by Dataiku DSS: Python, R, Scala, PySpark, SparkR and SparkSQL, SQL, Hive, Pig and Impala. Dataiku is therefore aimed at all types of users whatever their technical background and at all levels of expertise.

Machine learning & AI

The platform includes a complete graphical interface (called Datalab) dedicated to the development of machine learning models. This interface allows the configuration of models, the visualisation of model performance and a simplified reading of the results produced by the algorithms.

There is also a module for automated machine learning (AutoML). Other AI plugins are available for deep learning and natural language processing.

To find out more about AutoML, please check out this article: What is AutoML?

Collaboration & Governance

Dataiku DSS incorporates features to optimise sharing and exchange within data teams and business teams. These include project management, chat, wiki and versioning tools. 

For data governance, the platform provides a centralised catalogue of data, comments, elements and models. In addition, all user activities are shown on a dedicated dashboard, and security is supported by further features such as permissions management, log management and monitoring of data size and instance activity. Together, these features are designed to help meet data governance and auditing requirements.

MLOps

Dataiku DSS manages the deployment of models, both within its own ecosystem and in other environments such as AWS, Azure, Google Cloud or Kubernetes.

Data Analysis & Data Visualisation

The Datalab provides an interface for building dashboards with simple drag-and-drop actions. Data visualisation can thus be done without code. If you are a coder, you can of course create custom charts or more elaborate web applications, because Dataiku lets you integrate JavaScript libraries such as d3.js, Leaflet or plotly into its ecosystem.

Dataflow and intelligent recomputing

Dataflow is the term used to describe all the data and visual recipes. A dataflow can be visualised and re-run easily. Dataiku DSS also supports intelligent recomputing of data: a rebuild engine limits calculations to the datasets that actually need them.

Intelligent recomputing is a first step in dataflow automation and in the orchestration of task automation scenarios.
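The idea of limiting recomputation to the necessary datasets can be sketched with a small dependency graph. The sketch below is a generic illustration and not Dataiku's engine: the function `plan_rebuild`, the dataset names and the integer timestamps are all assumptions. A dataset is rebuilt only if it has never been built or if one of its inputs is newer than it.

```python
def plan_rebuild(target, inputs_of, built_at):
    """Return, in build order, the datasets to recompute so `target` is fresh.

    inputs_of: dataset name -> list of upstream dataset names
    built_at:  dataset name -> time of its last build (any comparable number)
    Assumes the dependency graph has no cycles.
    """
    plan = []
    effective = {}            # build time of each dataset once the plan has run
    rebuilt = float("inf")    # a dataset rebuilt by the plan is newer than all

    def visit(name):
        if name in effective:
            return effective[name]
        parents = inputs_of.get(name, [])
        upstream_times = [visit(parent) for parent in parents]
        current = built_at.get(name)
        if not parents:
            # Source dataset: never rebuilt, only read.
            effective[name] = current if current is not None else 0
        elif current is None or any(t > current for t in upstream_times):
            plan.append(name)
            effective[name] = rebuilt
        else:
            effective[name] = current
        return effective[name]

    visit(target)
    return plan


flow = {
    "cleaned": ["raw"],
    "features": ["cleaned"],
    "report": ["features", "reference"],
}
# Only the reference table changed since the last run.
last_builds = {"raw": 1, "cleaned": 2, "features": 3, "reference": 9, "report": 4}
to_rebuild = plan_rebuild("report", flow, last_builds)
```

In this example only `report` is recomputed. If `raw` had changed instead, `cleaned`, `features` and `report` would all be rebuilt, in that order.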

The overall orchestration of the dataflow can be provided:

  • Either by Dataiku within its interface or by using APIs (this is the Dataiku DSS Python scenario API).
  • Or using an external orchestrator, with Dataiku scenarios being triggered by the Dataiku REST API.
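The second option can be sketched as follows: an external orchestrator prepares an HTTP request that asks Dataiku to run a scenario. The endpoint path and the Basic authentication scheme (API key as user name, empty password) are assumptions to verify against the Dataiku REST API documentation. The host, project key, scenario id and API key are hypothetical. The request is only built here, not sent.

```python
import base64
import urllib.request


def build_run_scenario_request(base_url, project_key, scenario_id, api_key):
    """Build (without sending) a POST request that triggers a scenario run."""
    # Assumed endpoint path; check it against the REST API reference.
    url = (
        f"{base_url.rstrip('/')}/public/api/projects/"
        f"{project_key}/scenarios/{scenario_id}/run"
    )
    # Assumed auth scheme: HTTP Basic with the API key as the user name.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


request = build_run_scenario_request(
    "https://dss.example.com",   # hypothetical host
    "CHURN",                     # hypothetical project key
    "nightly_rebuild",           # hypothetical scenario id
    "hypothetical-api-key",
)
# An orchestrator would then send it with urllib.request.urlopen(request).
```

Keeping the request construction separate from sending makes it easy to test in an orchestrator without touching a live instance.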

In both cases, the workflow is launched automatically when a trigger fires (for example a data change, or a recalculation requested every 5 minutes). The workflow can be monitored closely thanks to the variety of triggers, to probes that check metrics, and to user alerts.
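To make the idea of a probe that checks a metric and raises an alert more concrete, here is a minimal sketch. The metric names, the thresholds and the OK/WARNING/ERROR labels are chosen for illustration and are not taken from a Dataiku configuration.

```python
def check_metric(value, warn_below, error_below):
    """Classify a metric value against two lower thresholds."""
    if value < error_below:
        return "ERROR"
    if value < warn_below:
        return "WARNING"
    return "OK"


def alerts_for(metrics, thresholds):
    """Return one alert message per metric that is not OK."""
    alerts = []
    for name, (warn_below, error_below) in thresholds.items():
        status = check_metric(metrics[name], warn_below, error_below)
        if status != "OK":
            alerts.append(f"{status}: {name} = {metrics[name]}")
    return alerts


# Hypothetical metrics collected after a workflow run.
metrics = {"row_count": 950, "auc": 0.62, "fill_rate": 0.99}
thresholds = {
    "row_count": (1000, 100),   # warn below 1000 rows, error below 100
    "auc": (0.75, 0.65),
    "fill_rate": (0.95, 0.80),
}
alerts = alerts_for(metrics, thresholds)
```

In a scenario, messages like these would typically be routed to users by e-mail or chat.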

Deployment & industrialisation of workflows

The platform allows the workflow to be packaged together with both its data and its models (i.e. the whole workflow).

There are 2 types of instances for deployment: the design node (instance designed for development) and the automation node (workflow automation instance).

A single interface brings together the deployment models: from development to testing and from pre-production to production.

Going into production with Dataiku DSS is made easier by the ability to manage model versions, perform rollbacks and monitor workflows. Deployments are thus automated as part of a broader production strategy in which all data scenarios can be launched from within Dataiku or from outside the platform using the REST API.
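Version management with rollback can be pictured as a history of activated versions. The class below is a minimal sketch of that idea, not Dataiku's deployment mechanism. The class name `ModelVersions` and the version labels are hypothetical.

```python
class ModelVersions:
    """Track which model version is active and allow stepping back."""

    def __init__(self):
        self._history = []  # activated versions, most recent last

    def activate(self, version):
        """Make `version` the active one."""
        self._history.append(version)

    @property
    def active(self):
        """Return the active version, or None if nothing was activated."""
        return self._history[-1] if self._history else None

    def rollback(self):
        """Reactivate the previously active version and return it."""
        if len(self._history) < 2:
            raise ValueError("no earlier version to roll back to")
        self._history.pop()
        return self.active


deployed = ModelVersions()
deployed.activate("churn-v1")
deployed.activate("churn-v2")
restored = deployed.rollback()  # churn-v2 misbehaves, go back one version
```

Keeping the full activation history, rather than only the current version, is what makes the rollback a single step.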

What are the benefits of the Dataiku DSS platform?

Highlights, by functionality:

Data integration
  • Connectivity to other ecosystems and cloud infrastructures
  • Automatic detection of data schemas and formats
  • Fast data access

Data preparation
  • Simple and quick access graphical interface
  • 80 visual processors to simplify data preparation operations (without code)
  • Data prep in code or no-code mode, depending on the user's technical expertise

Putting the workflow and templates into production
  • Simple to implement and monitor. Possibility of setting up user alert systems
  • Intelligent recomputing of dataflow according to the age of the data

Collaborative environment
  • Project management, chat, wiki and versioning tools

Data governance
  • Centralized catalogue of data and metadata
  • Fast auditing of data, logs and user activities with a dedicated dashboard
  • Security with user permissions management and monitoring

Machine learning & AI
  • Dedicated interface (Datalab) for configuration, development and monitoring of machine learning models
  • Wide variety of AI plugins

Technical Support & Documentation
  • Good responsiveness of technical support

How to learn about Dataiku?

Dataiku has set up an e-learning platform called Dataiku Academy, which offers a set of online training courses. Its Quick Start programmes allow you to start using the solution in just a few hours. For more advanced learning, there are Learning Paths tailored to your job, designed to build the skills your role requires.

Each programme leads to a Dataiku certification: Core Designer Certificate, ML Practitioner Certificate, Advanced Designer Certificate, Developer Certificate and MLOps Practitioner Certificate. 

These certifications are free and open to all.

Dataiku and the Devoteam TechRadar

This article is part of a larger series centred on the technologies and themes found in the first edition of the Devoteam TechRadar. To read further on these topics, please download the TechRadar.