Extensive training datasets are crucial for developing advanced AI models, but they may also be the reason for these models’ failure.
Biases emerge from prejudicial patterns hidden in large datasets, such as an image classification dataset in which most pictures of CEOs show white men. Large datasets can also be messy, arriving in formats a model cannot parse and filled with noise and irrelevant information.
According to a recent Deloitte poll of organizations using AI, 40% cited data-related problems, including thoroughly preparing and cleaning data, among the main obstacles hampering their AI projects. A separate survey of data scientists found that roughly 45% of their time goes to data preparation tasks such as loading and cleaning data.
Ari Morcos, who has worked in AI for more than a decade, wants to simplify many of the data preparation steps involved in training AI models, and he has founded a startup to do just that.
Morcos’ startup, DatologyAI, builds tools to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini, and other GenAI models. According to Morcos, the platform can identify which data matters most for a model’s intended application, such as writing emails. It also looks at ways to augment the dataset with additional data and at how to batch it, or split it into smaller chunks, for more efficient model training.
“Models are a representation of the data they are trained on,” Morcos said in an email interview with Eltrys. He added that not all data is of equal value, that some training data is far more useful than the rest, and that training models on the right data in the right way can significantly affect the resulting model.
Morcos, who holds a PhD in neuroscience from Harvard, spent two years at DeepMind applying neuroscience-inspired techniques to improve AI models and five years at Meta’s AI lab studying the basic mechanisms underlying how models work. Together with co-founders Matthew Leavitt and Bogdan Gaza, a former engineering lead at Amazon and Twitter, he founded DatologyAI to streamline the curation of AI datasets.
Morcos points out that the makeup of a training dataset affects nearly every characteristic of a model trained on it, including its performance on tasks, its size, and the depth of its domain knowledge. Better-curated datasets can cut training time and yield a smaller model, lowering compute costs, while models trained on datasets with an especially diverse range of examples can handle specialized requests more capably.
The cost of implementing GenAI, which is notoriously expensive, is now a top concern for executives.
Many companies are choosing to fine-tune existing models, including open source models, for their specific needs, or to use managed vendor services via APIs. Others, often for governance and compliance reasons, are building models on bespoke data from scratch, spending anywhere from tens of thousands to millions of dollars on compute to train and run them.
“Businesses have amassed large amounts of data and aim to develop highly effective, high-performing, and specialized AI models to optimize their business outcomes,” Morcos said. He added that using these massive datasets effectively is very difficult and, if done poorly, results in less efficient models that take longer to train and are larger than necessary.
DatologyAI can handle large amounts of data, up to petabytes, in many formats: text, images, video, audio, tabular data, and more specialized types such as genomic and geospatial data. It can be deployed on a customer’s own infrastructure, either on-premises or in a virtual private cloud. According to Morcos, this sets it apart from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData, and Galileo, which he argues are more limited in the scope and types of data they can process.
DatologyAI can also identify which concepts within a dataset are more complex and therefore need higher-quality samples, such as those related to U.S. history in the training set for an educational chatbot. It can also pinpoint data that might cause a model to behave in unintended ways.
“To solve these problems, it is necessary to automatically identify concepts, determine their complexity, and assess the required level of redundancy,” said Morcos. “Data augmentation, frequently involving alternative models or synthetic data, is highly effective but should be executed meticulously and with specific objectives.”
How well does DatologyAI’s technology work? There is cause for skepticism. History has shown that automated data curation does not always work as intended, however sophisticated the method or diverse the data.
LAION, a German nonprofit behind a number of GenAI projects, had to take down an algorithmically curated AI training dataset after it was found to contain images of child sexual abuse. Meanwhile, models such as ChatGPT, which are trained on a mix of manually and automatically filtered datasets meant to remove toxicity, have been shown to produce toxic content when given certain prompts.
Some experts feel that human curation is essential for achieving good outcomes with an AI model. Major providers like AWS, Google, and OpenAI use teams of human specialists and sometimes underpaid annotators to develop and improve their training datasets.
Morcos insists that DatologyAI’s tools are not meant to replace human curation outright but to offer suggestions that might not occur to data scientists, particularly suggestions related to trimming the size of training datasets. In 2022, Morcos co-authored an academic paper with researchers from Stanford and the University of Tübingen on pruning datasets while preserving model performance. The paper received a best paper award at the NeurIPS machine learning conference that year.
“Identifying the appropriate data on a large scale is highly challenging and a cutting-edge research issue,” Morcos said. “Our method results in models that learn more quickly and improve performance on subsequent challenges.”
DatologyAI’s technology was evidently promising enough to persuade prominent figures in tech and AI to invest in the startup’s initial funding round, among them Google’s chief scientist Jeff Dean, Meta’s chief AI scientist Yann LeCun, Quora founder and OpenAI board member Adam D’Angelo, and Geoffrey Hinton, known for developing key techniques in modern AI.
DatologyAI’s $11.65 million seed round was backed by Amplify Partners, Radical Ventures, Conviction Capital, Outset Capital, and Quiet Capital, along with angel investors including Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, ex-Intel AI VP Naveen Rao, and Jascha Sohl-Dickstein, one of the creators of generative diffusion models. It is a remarkable roster of AI experts, and it suggests there may be something to Morcos’ claims.
“Models are only as effective as the data they are trained on. Identifying the appropriate training data from a vast number of examples is a highly complex issue,” LeCun said in an email to Eltrys. “Ari and his team at DatologyAI are renowned experts in this field, and I consider their product aimed at providing high-quality data curation accessible to all individuals interested in training a model to be crucial in ensuring the widespread success of AI.”
DatologyAI, headquartered in San Francisco, currently has 10 employees, including the co-founders. The company aims to grow to about 25 staff by the end of the year if it hits certain growth milestones.
I asked Morcos whether those milestones were tied to customer acquisition, but he declined to say. He also, somewhat cryptically, would not reveal the size of DatologyAI’s current client base.