We are Australia's first venture-backed foundation model lab, currently in stealth. We build AI forecasting systems. Our reasoning models beat human superforecasters at prediction tasks. We're backed by Blackbird Ventures and notable angels, including Balaji Srinivasan, the Synthesia founders, and the Supabase founders.
Total compensation for this role is $200,000 - $500,000 p/a. Traders with experience in mid-frequency strategies are a strong fit, but you don't need a trading background to apply; the role is also a great fit for philosophy majors.
The Role
If you've worked at a quant trading firm, you already know the core skill: decomposing a complex question into a probabilistic driver tree, identifying which nodes are near-certain and which carry real uncertainty, and concentrating your analytical effort where it matters.
At a place like SIG, the workflow is: a researcher builds a driver tree for Stock X → specs data requirements → a data engineer feeds the leaf nodes → the researcher compares the estimate to the market-implied price → trade.
With us, the object of work shifts up one level of abstraction:
* You iterate on our model → the model generates driver trees for all questions → trees are tested against live markets → a data scientist measures model quality within statistical limits → quantified answers flow back to you → you help fix the model.
* You co-design, with a data scientist, questions that are testable given small N and no cross-market correlation.
* You don't build a driver tree for a single question. You build and iterate the training corpus that teaches our AI model how to build driver trees for any question. When the model produces a bad decomposition, you read the trace, identify the structural failure, and fix the model.
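To make the driver-tree idea concrete, here is a minimal illustrative sketch (not company code; the structure, names, and independence-based combination rules are assumptions for exposition only) of how a forecast question might decompose into child drivers whose probabilities combine upward:

```python
# Hypothetical driver tree: a forecast question decomposed into child
# drivers. Combination assumes independence between children, which is
# a simplifying assumption for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    prob: Optional[float] = None            # set on leaf nodes only
    children: List["Node"] = field(default_factory=list)
    combine: str = "and"                    # "and": all children must occur

    def estimate(self) -> float:
        if not self.children:
            return self.prob
        child_ps = [c.estimate() for c in self.children]
        if self.combine == "and":
            p = 1.0
            for cp in child_ps:             # P(all occur) under independence
                p *= cp
            return p
        p = 1.0                             # "or": at least one child occurs
        for cp in child_ps:
            p *= (1.0 - cp)
        return 1.0 - p


tree = Node("event occurs", combine="and", children=[
    Node("precondition A holds", prob=0.8),
    Node("trigger B fires", prob=0.5),
])
# tree.estimate() -> 0.8 * 0.5 = 0.4 under the independence assumption
```

A real decomposition would of course model dependence between drivers; the point is only that a tree makes explicit which leaves carry the uncertainty.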
Three things change from the trading firm model:
* The data engineer becomes a data scientist, and the relationship inverts. At a trading firm, you spec data requirements and the engineer delivers. Here, the data scientist owns evaluation pipelines and decision-surface optimization. You work side-by-side to co-design what questions are even statistically answerable given our constraints: small sample sizes, no sample correlation structure, each forecasting question resolving exactly once. You bring probabilistic intuition; they bring statistical rigor.
* The system loops. Forecasts resolve. The data scientist measures what those resolutions mean for model quality. Those quantified answers flow back to you. You use them to fix the model. The system improves with every resolution cycle. At a trading firm, the tree doesn't get better because you traded on it — here it does.
* Your output is training data, not a spreadsheet. Training data is a written artifact that encodes the full reasoning procedure: how to explore decomposition strategies, when to stop splitting, how to genuinely argue with yourself, and what a valid tree looks like. Clear, precise writing is not optional.
What You'll Do
* Read reasoning traces and use LLMs to diagnose structural failure modes. Then build systems that catch these automatically.
* Iterate the model to attack the dominant failure mode exposed by each batch of traces, measured against mechanical validators and LLM-as-judge rubrics.
* Work with the data scientist to turn qualitative intuitions into statistically testable hypotheses given small sample sizes and non-correlated markets.
* Define and refine the mechanical and judgment-based evaluation rubrics that score trace quality across structural validity, critic sharpness, decomposition family diversity, and evidence discrimination.
* Contribute to the training pipeline: curate SFT datasets, define preference pairs for DPO, and specify what "better" means at each stage.
* Maintain a running postmortem synthesis: which failure modes have been closed, which remain open, and what the model cannot yet express.
Requirements
* Deep probabilistic reasoning ability: You should be able to look at a decomposition tree and immediately see where it is suboptimal.
* Experience with LLM prompting beyond surface-level usage: Understanding how prompt structure shapes model outputs, and which outputs you can trust.
* Experience building or evaluating probabilistic models in trading, forecasting, or quantitative research: We care that you've had to put numbers on uncertain things and been scored on the result.
* Comfort with structured decomposition of complex problems: scenario analysis, necessary-conditions framing, causal modelling, or decision trees. The vocabulary matters less than the habit.
* Ability to formulate precise analytical questions and work with a data scientist to determine statistical testability: You don't need to run Brier decompositions yourself, but you need to know what one means and what to ask for.
* Intellectual honesty about what you can and cannot conclude from small samples: We operate in a regime where most interesting questions are underpowered, and the right answer is often "we can't know that yet, but here's a weaker version we can test."
* Strong theory of mind: You are analyzing a reasoning system that fails in subtle ways. You need to anticipate how a model will interpret an ambiguous instruction, why a particular phrasing causes it to shortcut rather than search, and what it "sees" when it reads its own prior output during a critique step.
* High trust: You work on the most competitively sensitive part of the company. We prefer that candidates are within two degrees of the professional network of a founder, with strong referenceable relationships.
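As a concrete reference point for the Brier-decomposition requirement above, here is a small illustrative sketch (hypothetical example data, not company code) of the Brier score and its Murphy decomposition into reliability, resolution, and uncertainty:

```python
# Illustrative sketch of the Brier score and its Murphy decomposition:
#   Brier = reliability - resolution + uncertainty
# Exact when forecasts within each bin are constant; otherwise binning
# makes it approximate. Data below is made up for demonstration.
import numpy as np


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))


def murphy_decomposition(probs, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) via forecast binning."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1.0 - base_rate)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                           # share of forecasts in bin
        p_bar = probs[mask].mean()                # mean forecast in bin
        o_bar = outcomes[mask].mean()             # observed frequency in bin
        reliability += w * (p_bar - o_bar) ** 2   # calibration error
        resolution += w * (o_bar - base_rate) ** 2  # discrimination
    return reliability, resolution, uncertainty


probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 0, 0, 0]
bs = brier_score(probs, outcomes)
rel, res, unc = murphy_decomposition(probs, outcomes)
```

Small N makes exactly this decomposition noisy, which is why the role pairs you with a data scientist to decide what is actually testable.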
Nice To Have
* Experience at a quantitative trading firm (SIG, Jane Street, Citadel, Optiver, or similar) where probabilistic reasoning was a daily practice, not an occasional exercise.
* Familiarity with prediction markets, superforecasting, or calibration training.
* Background in any domain where decomposition quality directly determines outcome quality: intelligence analysis, decision analysis, actuarial science, or options pricing.
* Strong written communication: The data our model trains from are written artifacts. The traces are written artifacts. The postmortems are written artifacts. Clear, precise writing is preferred.
* Exposure to ML training pipelines (SFT, DPO, RLHF); not to build them, but to understand what makes good training data.
Why Us
* Real traction: our live system already outperforms human superforecasters.
* Frontier technical problem: you are building the reasoning capability that no existing training method can teach.
* Your work directly improves a system that is tested against real markets — feedback is fast and unambiguous.
* Small, technical founding team with high ownership and fast iteration.
* Backed by top-tier investors and operators.
* Remote-friendly with Sydney, Melbourne, and San Francisco presence.
How To Apply
Send your resume and a brief note covering:
* A time you identified a structural flaw in a model, analysis, or decision framework that others had missed. What was the flaw and how did you find it?
* How you think about decomposing a complex uncertain question into parts. Walk us through your approach on any question you find interesting.
* Create a tree of drivers for the following forecasting question: What is the probability of the 2026 oil crisis (triggered by a US/Israel-Iran war and Strait of Hormuz closure) matching or exceeding the severity of the 1970s energy crises? This diagram should be optimized for forecasting solvability, not breadth. What is the optimal way to break this problem apart so that if you answered all its child nodes, it would result in a highly calibrated forecast?