Who we are
EvolutionaryScale’s mission is to develop artificial intelligence to understand biology for the benefit of human health and society, through open, safe, and responsible research, and in partnership with the scientific community. Over the next ten years, AI will transform biological design, making molecules and entire cells programmable. We will develop the foundation models for biology that enable this.
The EvolutionaryScale team is based in San Francisco and New York. We believe in flexibility around work schedules and locations, but expect team members to work from one of our offices at least half of the days in most weeks.
What you’ll do
As a Data Infrastructure Engineer, you will work closely with our bioinformatics and research teams to ensure our data jobs are reliable, efficient, and scalable. You'll implement best practices for large-scale data processing, select and integrate the right technologies, and drive continuous improvements in the performance and quality of our datasets.
The role
- Design, develop, and maintain large-scale batch processing pipelines for acquiring biology datasets, using tools such as Spark and Ray.
- Manage data infrastructure components to ensure robust and fault-tolerant operations.
- Optimize data ingestion, storage, and retrieval for large and growing biology datasets, and for efficient data ingestion during pre-training and post-training.
- Build systems that make data evaluation and experimentation easy and reproducible.
- Integrate modern ML-based data curation techniques into our data processing pipelines.
- Work with researchers and other engineering teams to understand their data needs and build solutions that meet modeling requirements.
Preferred qualifications
Apply even if you don’t meet all of these!
- Proven experience with large-scale data processing systems using technologies such as Hadoop, Spark, or Ray.
- Knowledge of streaming data frameworks like Kafka Streams, Spark Streaming, or Flink.
- Understanding of data processing principles and best practices.
- Strong problem-solving skills, including the ability to research, debug, and resolve complex technical problems.
- Experience with major cloud providers (AWS, GCP, or Azure); familiarity with data warehousing tools is a plus.
- Knowledge of biology and biology datasets is a big plus but not required.
- Experience with large-scale distributed systems or machine learning is not required, but is a plus.
- 5+ years of experience with the systems described above.