Dolly by Databricks

Making the power of ChatGPT available with open models

Overview of Dolly by Databricks

Databricks’ Dolly is a large language model trained on the Databricks Machine Learning Platform. It demonstrates that a two-year-old open source model (GPT-J), fine-tuned for just 30 minutes on a focused corpus of roughly 50k instruction-following records (Stanford Alpaca), can exhibit surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based.
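
As a rough illustration of the recipe described above, the sketch below fine-tunes GPT-J on the Alpaca data with the Hugging Face Trainer. It is a minimal, hedged example, not Databricks’ published training setup: the Hub identifiers EleutherAI/gpt-j-6B and tatsu-lab/alpaca, the Alpaca-style prompt template, and all hyperparameters are assumptions, and the multi-GPU sharding (e.g. DeepSpeed) that a 6B-parameter model realistically requires is omitted for brevity.

```python
# Minimal instruction-tuning sketch. Assumed, not Databricks' actual setup:
# base model "EleutherAI/gpt-j-6B", dataset "tatsu-lab/alpaca", the Alpaca
# prompt template, and all hyperparameters. Real 6B-scale runs also need
# multi-GPU sharding (e.g. DeepSpeed), omitted here.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "EleutherAI/gpt-j-6B"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J ships without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def to_features(example):
    # Alpaca-style template; the optional "input" field is ignored for brevity.
    text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return tokenizer(text, truncation=True, max_length=512)

train_data = load_dataset("tatsu-lab/alpaca", split="train").map(
    to_features, remove_columns=["instruction", "input", "output", "text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dolly-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=train_data,
    # Causal LM collator: pads batches and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Even in this compressed form, the core of the technique is visible: the "fine-tuning" is ordinary next-token training on instruction/response pairs, which is why it completes so quickly relative to pre-training.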

Databricks believes this finding is significant because it demonstrates that building powerful artificial intelligence technology is far more accessible than previously realized.

Data, Biases & Objectionable Content

Like all language models, Dolly-v1-6b reflects the content and limitations of its training corpora.

  • The Pile: GPT-J’s pre-training corpus was collected from the public internet and, like most web-scale datasets, contains content many users would find objectionable. The model may therefore reflect these shortcomings, overtly when it is explicitly asked to produce objectionable content, or subtly, as in biased or harmful implicit associations.

  • Stanford Alpaca: The instruction-tuning corpus for Dolly-6b can be assumed to share many of these limitations. It is also known to contain factual errors, semantic and syntactic irregularities, nonsensical responses, and incorrect mathematical calculations, among other data defects. The model’s outputs will reflect these imperfections.
