Scaling AI Frontiers: New Projects in Foundation Models
Rapid developments in model training, distributed training feasibility, and more.
This day saw rapid developments in model training, distributed training feasibility, and broader project expansion. Key discussions focused on infrastructure scaling, dataset preparation, and refining execution strategies. Notably, the Hindi Wikipedia dataset was finalized for training, and deeper discussions emerged around high-performance distributed inference.
Hindi Wikipedia Model Training
Dataset Curation:
The team identified multiple sources for Hindi Wikipedia datasets, including Kaggle and Hugging Face.
Data transfer and server setup were initiated on Vast.ai.
Training Setup:
A basic transformer architecture was outlined for training.
Data loaders and preprocessing scripts were drafted.
Training will proceed incrementally, with results monitored for optimization.
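As a rough illustration of the preprocessing and batching step described above, here is a minimal sketch in plain Python. The function names (`clean_text`, `make_batches`) and the character-level "tokenization" are illustrative assumptions, not the team's actual scripts.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize Unicode (important for Devanagari) and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def make_batches(token_ids: list[int], seq_len: int) -> list[list[int]]:
    """Chop a flat token stream into fixed-length training sequences,
    dropping the trailing remainder."""
    n = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n)]

# Toy example: a character-level stand-in for a real tokenizer.
sample = clean_text("नमस्ते   दुनिया")
ids = [ord(ch) for ch in sample]
batches = make_batches(ids, seq_len=4)
```

A real pipeline would swap the `ord`-based step for a subword tokenizer trained on the corpus, but the incremental-training loop consumes batches of exactly this shape.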
Infrastructure & Compute Scaling
GPU Rentals & Cloud Resources:
A Vast.ai account was funded with initial credits, with the potential to scale as needed.
AWS, GCP, and DigitalOcean credits were discussed as additional resources.
Members are pooling funds to acquire access to H100 GPUs for more advanced training.
Technical Challenges in Distributed Training:
Discussions highlighted the limitations of heterogeneous GPU clusters and network bottlenecks.
vLLM and Modal presentations were referenced as benchmarks for distributed inference performance.
Private networking solutions were proposed to enhance efficiency and reduce latency.
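A back-of-envelope calculation makes the network-bottleneck concern concrete: in data-parallel training, every step pays a gradient all-reduce whose cost is dominated by link bandwidth. The model size and bandwidth figures below are illustrative assumptions, not measurements from the project.

```python
def allreduce_seconds(n_params: int, bandwidth_gbps: float, n_gpus: int) -> float:
    """Approximate time for a ring all-reduce of fp32 gradients.

    A ring all-reduce moves roughly 2*(n-1)/n of the gradient bytes
    over each GPU's link per step.
    """
    grad_bytes = n_params * 4  # fp32 gradients
    wire_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return wire_bytes / (bandwidth_gbps * 1e9 / 8)  # Gbit/s -> bytes/s

# A hypothetical 1B-parameter model, 4 GPUs:
slow = allreduce_seconds(1_000_000_000, bandwidth_gbps=1, n_gpus=4)    # public internet
fast = allreduce_seconds(1_000_000_000, bandwidth_gbps=100, n_gpus=4)  # private interconnect
```

Under these assumptions the sync step over a 1 Gbit/s public link is about 100x slower than over a 100 Gbit/s private interconnect, which is the efficiency argument behind the private-networking proposal.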
Expanding the Scope: New Research Areas
The group is expanding beyond foundation models to include:
Material Science & AI:
Predicting material properties for semiconductor applications.
Optimizing material composition using AI-driven insights.
Developing visualization tools for atomic/molecular structures.
Engineering Augmentation:
AI-assisted troubleshooting for industrial maintenance and operations.
No-UI conversational dashboards for factory-floor intelligence.
AI-driven automation for predictive equipment maintenance.
Hardware & Compute Optimization:
Exploring PTX programming for Nvidia GPUs.
Investigating AMD’s ROCm ecosystem for model deployment.
Optimizing memory management and compute efficiency for low-resource devices.
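For the low-resource-device goal above, a first-order memory budget for model weights shows why precision matters. The parameter count and byte widths below are illustrative assumptions, not figures from the project.

```python
def weight_memory_gb(n_params: int, bytes_per_param: float) -> float:
    """Memory for weights alone, ignoring activations and KV cache."""
    return n_params * bytes_per_param / 1e9

n = 7_000_000_000  # a hypothetical 7B-parameter model

fp16 = weight_memory_gb(n, 2.0)  # 14.0 GB: beyond most consumer GPUs
int4 = weight_memory_gb(n, 0.5)  # 3.5 GB: fits an 8 GB device
```

The 4x reduction from fp16 to 4-bit quantization is what typically moves a model from data-center hardware onto commodity devices, at some cost in accuracy.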
Project Nalanda: Next Steps
Phase 1: Hindi Language Model MVP
Continue training a small-scale LLM on Hindi Wikipedia.
Deploy an interactive chatbot for evaluating model performance.
Phase 2: Expanding Training Scope
Explore fine-tuning on Sanskrit and other Indic texts.
Evaluate domain-specific use cases in cultural and religious heritage.
Community & Execution Strategy
Task Assignment & Team Expansion:
Clear task delegation for efficient execution.
Identifying new contributors based on expertise in AI, hardware, and domain-specific research.
Scaling the Community:
More than 100 members have expressed interest in joining the initiative.
Structured core teams for focused execution while maintaining an open-source collaborative environment.
MVP & Rapid Prototyping:
Encouraging contributors to develop small-scale proof-of-concepts.
Prioritizing quick iterations to validate ideas and attract funding.
The day reinforced momentum toward building a robust foundation for AI model training and research. With growing interest and increasing execution speed, the group is well positioned to make a significant impact in open-source AI development and computational research.
Community link: Here