Book 3 of the Sequences Highlights

While beliefs are subjective, that doesn't mean that one gets to choose their beliefs willy-nilly. There are laws that theoretically determine the correct belief given the evidence, and it's towards such beliefs that we should aspire.

[Proposal] Can we develop a general steering technique for nonlinear representations? A case study on modular addition

Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature.

Can we extend this to nonlinear features? A simple example of a nonlinear feature is circular representations in modular arithmetic. Here, it's clear that a simple "steering vector" will not work. Nonetheless, as the authors show, it's possible to construct a nonlinear steering intervention that demonstrably influences the model to predict a different result.

Problem: The construction of a steering intervention in the modular addition paper relies heavily on the a-priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn't need to fully elucidate this geometry in order for steering to be effective.

Therefore, we want a procedure which learns a nonlinear steering intervention given only the model's activations and labels (e.g. the correct next token).

Such a procedure might look something like this (see the sketch below):

* Assume we have paired data $(x, y)$ for a given concept, where $x$ is the model's activations and $y$ is the label, e.g. the day of the week.
* Define a function $x' = f_\theta(x, y, y')$ that predicts the activations $x'$ for steering the model towards $y'$.
* Optimize $f_\theta(x, y, y')$ using a dataset of steering examples.
* Evaluate the model under this steering intervention, and check whether we've actually steered the model towards $y'$. Compare this to the ground-truth steering intervention.

If this works, it might be applicable to other examples of nonlinear feature geometries as well.

Thanks to David Chanin for useful discussions.
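A minimal sketch of what such a learned steering function could look like, using synthetic tensors in place of real model activations; the MLP architecture, loss, and all names here are illustrative assumptions rather than part of the proposal:

```python
# Hypothetical sketch: learn a nonlinear steering function f_theta(x, y, y')
# on synthetic "activations". Real usage would cache activations from the
# model and patch x' back into its forward pass.
import torch
import torch.nn as nn

d_model, n_classes = 64, 7  # e.g. 7 classes = days of the week (assumed)

class SteeringFn(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model + 2 * n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, d_model),
        )

    def forward(self, x, y, y_prime):
        # Condition on both the current label y and the steering target y'
        y_oh = nn.functional.one_hot(y, n_classes).float()
        yp_oh = nn.functional.one_hot(y_prime, n_classes).float()
        return self.net(torch.cat([x, y_oh, yp_oh], dim=-1))

# Synthetic stand-ins for (activation, label) pairs and target activations.
# In practice the training signal might instead come from the model's output
# under patching, so that no ground-truth steered activations are needed.
n = 1024
x = torch.randn(n, d_model)
y = torch.randint(0, n_classes, (n,))
y_prime = torch.randint(0, n_classes, (n,))
x_target = torch.randn(n, d_model)

f = SteeringFn()
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(f(x, y, y_prime), x_target)
    loss.backward()
    opt.step()
# Evaluation: patch f(x, y, y') into the model and check whether the
# predicted label actually moves to y'.
```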
Trying out my new journalist strategy.
Woah. OpenAI antiwhistleblowing news seems substantially more obviously-bad than the nondisparagement-concealed-by-nondisclosure stuff. If page 4 and the "threatened employees with criminal prosecutions if they reported violations of law to federal authorities" part aren't exaggerated, it crosses previously-uncrossed integrity lines. H/t Garrison Lovely. [Edit: probably exaggerated; see comments. But I haven't seen takes that the "OpenAI made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation" part is likely exaggerated, and that alone seems quite bad.]
Algon
When tracking an argument in a comment section, I like to skip to the end to see if either of the arguers winds up agreeing with the other. Which tells you something about how productive the argument is. But when using the "hide names" feature on LW, I can't do that, as there's nothing distinguishing a cluster of comments as all coming from the same author.  I'd like a solution to this problem. One idea that comes to mind is to hash all the usernames in a particular post and a particular session, so you can check if the author is debating someone in the comments without knowing the author's LW username. This is almost as good as full anonymity, as my status measures take a while to develop, and I'll still get the benefits of being able to track how beliefs develop in the comments. @habryka 
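A minimal sketch of the per-post pseudonym idea, assuming a salted hash; the function names and the choice of truncating a SHA-256 digest are my own illustrative assumptions, not an existing LessWrong feature:

```python
# Hypothetical sketch: derive a stable per-post pseudonym for each commenter,
# so repeated comments by the same author are linkable without revealing the
# username. Salting by session (and post) prevents cross-post linking and
# dictionary attacks against known usernames.
import hashlib
import secrets

def pseudonym(username: str, post_id: str, session_salt: str) -> str:
    digest = hashlib.sha256(f"{session_salt}:{post_id}:{username}".encode()).hexdigest()
    return f"Commenter-{digest[:8]}"

session_salt = secrets.token_hex(16)  # rotated each browsing session
print(pseudonym("alice", "post-123", session_salt))  # same tag for every "alice" comment on this post
print(pseudonym("bob", "post-123", session_salt))    # a different tag
```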
Leaving Dangling Questions in your Critique is Bad Faith

Note: I’m trying to explain an argumentative move that I find annoying and sometimes make myself; this explanation isn’t very good, unfortunately.

Example

> Them: This effective altruism thing seems really fraught. How can you even compare two interventions that are so different from one another?

Explanation of Example

I think the way the speaker poses the above question is not as a stepping stone for actually answering the question; it’s simply a way to cast doubt on effective altruists. My response is basically, “wait, you’re just going to ask that question and then move on?! The answer really fucking matters! Lives are at stake! You are clearly so deeply unserious about the project of doing lots of good, such that you can pose these massively important questions and then spend less than 30 seconds trying to figure out the answer.” I think I might take these critics more seriously if they took themselves more seriously.

Description of Dangling Questions

A common move I see people make when arguing or criticizing something is to pose a question that they think the original thing has answered incorrectly or is not trying sufficiently hard to answer. But then they kinda just stop there. The implicit argument is something like “The original thing didn’t answer this question sufficiently, and answering this question sufficiently is necessary for the original thing to be right.” But importantly, the criticisms usually don’t actually argue that — they don’t argue for some alternative answer to the original questions; if they do, they usually aren’t compelling; and they also don’t really try to argue that this question is so fundamental either.

One issue with Dangling Questions is that they focus the subsequent conversation on a subtopic that may not be a crux for either party, and this probably makes the subsequent conversation less useful.

Example

> Me: I think LLMs might scale to AGI.
>
> Friend: I don’t think LLMs are actually doing planning, and that seems like a major bottleneck to them scaling to AGI.
>
> Me: What do you mean by planning? How would you know if LLMs were doing it?
>
> Friend: Uh…idk

Explanation of Example

I think I’m basically shifting the argumentative burden onto my friend when it falls on both of us. I don’t have a good definition of planning or a way to falsify whether LLMs can do it — and that’s a hole in my beliefs just as it is a hole in theirs. And sure, I’m somewhat interested in what they say in response, but I don’t expect them to actually give a satisfying answer here. I’m posing a question I have no intention of answering myself and implying it’s important for the overall claim of LLMs scaling to AGI (my friend said it was important for their beliefs, but I’m not sure it’s actually important for mine). That seems like a pretty epistemically lame thing to do.

Traits of “Dangling Questions”

1. They are used in a way that implies the target thing is wrong vis-à-vis the original idea, but this argument is not made convincingly.
2. The author makes minimal effort to answer the question with an alternative. Usually they simply pose it. The author does not seem to care very much about having the correct answer to the question.
3. The author usually implies that this question is particularly important for the overall thing being criticized, but does not usually make this case.
4. These questions share a lot in common with the paradigm criticisms discussed in Criticism Of Criticism Of Criticism, but I think they are distinct in that they can be quite narrow.
5. One of the main things these questions seem to do is raise the reader’s uncertainty about the core thing being criticized, similar to the Just Asking Questions phenomenon. To me, Dangling Questions seem like a more intellectual version of Just Asking Questions — much more easily disguised as a good argument.

Here's another example, though it's imperfect.

Example

From an AI Snake Oil blog post:

> Research on scaling laws shows that as we increase model size, training compute, and dataset size, language models get “better”. … But this is a complete misinterpretation of scaling laws. What exactly is a “better” model? Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.

Explanation of Example

The argument being implied is something like “scaling laws are only about perplexity, but perplexity is different from the metric we actually care about — how much? who knows? — so you should ignore everything related to perplexity; also consider going on a philosophical side-quest to figure out what ‘better’ really means. We think ‘better’ is about emergent abilities, and because they’re emergent we can’t predict them, so who knows if they will continue to appear as we scale up.”

In this case, the authors have ventured an answer to their Dangling Question, “what is a ‘better’ model?”: they’ve said it’s one with more emergent capabilities than a previous model. This answer seems flat out wrong to me; acceptable answers include: downstream performance, self-reported usefulness to users, how much labor-time it could save when integrated in various people’s work, ability to automate 2022 job tasks, being more accurate on factual questions, and much more. I basically expect nobody to answer the question “what does it mean for one AI system to be better than another?” with “the second has more capabilities that were difficult to predict based on the performance of smaller models and seem to increase suddenly on a linear-performance, log-compute plot”. Even given the answer “emergent abilities”, the authors fail to actually argue that we don’t have a scaling precedent for these.

Again, I think the focus on emergent abilities is misdirected, so I’ll instead discuss the relationship between perplexity and downstream benchmark performance — I think this is fair game because this is a legitimate answer to the “what counts as ‘better’?” question and because of the original line “Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence”. The quoted claim is technically true but, in this context, highly misleading, because we can, in turn, draw clear relationships between perplexity and downstream benchmark performance; here are three recent papers which do so, and here are even more studies that relate compute directly to downstream performance on non-perplexity metrics. Note that some of these are cited in the blog post.
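For concreteness, here is a toy illustration (entirely synthetic data and invented numbers, not taken from the cited papers) of the kind of fit such studies do: a specific equation relating training compute to a downstream metric, which can then be extrapolated.

```python
# Toy illustration: a "scaling law" as a specific fitted equation, here
# downstream error as a power law in training compute, extrapolated to a
# larger run. All numbers are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, a, b, err_floor):
    return a * C ** (-b) + err_floor

compute = np.logspace(18, 24, 10)  # training FLOPs (synthetic)
rng = np.random.default_rng(0)
error = power_law(compute, a=700, b=0.17, err_floor=0.08) + rng.normal(0, 0.005, compute.size)

(a, b, err_floor), _ = curve_fit(power_law, compute, error, p0=[1e3, 0.2, 0.05])
print(f"fitted: error(C) = {a:.0f} * C^-{b:.2f} + {err_floor:.2f}")
print("extrapolated error at 1e25 FLOPs:", round(power_law(1e25, a, b, err_floor), 3))
```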
I will also note that this seems like one example of a failure I’ve seen a few times, where people conflate “scaling laws” with what I would call “scaling trends”. Scaling laws refer to specific equations that predict various metrics (such as perplexity) from model inputs like the number of parameters and the amount of data; scaling trends are the more general phenomenon we observe, that scaling up just seems to work, and in somewhat predictable ways. The scaling laws are useful for the predicting, but whether or not we have those specific equations has no effect on the trend we are observing; the equations just yield a bit more precision. Yes, scaling laws relating parameters and data to perplexity or training loss do not directly give you info about downstream performance, but we seem to be making decent progress on the (imo still not totally solved) problem of relating perplexity to downstream performance, and together these mean we have somewhat predictable scaling trends for metrics that do matter.

Example

Here’s another example from that blog post where the authors don’t literally pose a question, but they are still doing the Dangling Question thing in many ways (context: referring to these posts):

> Also, like many AI boosters, he conflates benchmark performance with real-world usefulness.

Explanation of Example

(Perhaps it would be better to respond to the linked AI Snake Oil piece, but that’s a year old and lacks lots of important evidence we have now.)

I view the move being made here as posing the question “but are benchmarks actually useful to real-world impact?”, assuming the answer is no — or poorly arguing so in the linked piece — and going on about your day. It’s obviously the case that benchmarks are not the exact same as real-world usefulness, but the question of how closely they’re related isn’t some magic black box of un-solvability! If the authors of this critique want to complain about the conflation between benchmark performance and real-world usefulness, they should actually bring the receipts showing that these are not related constructs and that relying on benchmarks would lead us astray.

I think when you actually try that, you get an answer like: benchmark scores seem worse than users’ reported experience and reported usefulness in real-world applications, but there is certainly a positive correlation here; we can explain some of the gap via techniques like few-shot prompting that are often used for benchmarks, a small amount via dataset contamination, and probably much of the gap comes from a validity gap where benchmarks are easy to assess but unrealistic; but thankfully we have user-based evaluations like LMSYS that show a solid correlation between benchmark scores and user experience… (If I actually wanted to make the argument the authors were, I would spend >5 paragraphs on it and elaborate on all of the evidence mentioned above, including talking more about real-world impacts; this is actually a difficult question, and the above answer is demonstrative rather than exemplary.)

Caveats and Potential Solutions

There is room for questions in critiques. Perfect need not be the enemy of good when making a critique. Dangling Questions are not always made in bad faith.

Many of the people who pose Dangling Questions like this are not trying to act in bad faith. Sometimes they are just unserious about the overall question, and they don’t care much about getting to the right answer.
Sometimes Dangling Questions are a response to being confused and not having tons of time to think through all the arguments; e.g., they’re a psychological response something like “a lot feels wrong about this, here are some questions that hint at what feels wrong to me, but I can’t clearly articulate it all because that’s hard and I’m not going to put in the effort”.

My guess at a mental move which could help here: when you find yourself posing a question in the context of an argument, ask whether you care about the answer, whether you should spend a few minutes trying to determine the answer, whether the answer to this question would shift your beliefs about the overall argument, and whether the question puts undue burden on your interlocutor.

If you’re thinking quickly and aren’t hoping to construct a super solid argument, it’s fine to have Dangling Questions, but if your goal is to convince others of your position, you should try to answer your key questions, and you should justify why they matter to the overall argument.

Another example of me posing a Dangling Question, in this:

> What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired.

Explanation of Example

(I’m not sure equating GPT-5 with a ~5b training run is right.) In the above quote, I’m arguing against The Scaling Picture by asking whether anybody will keep investing money if we see only marginal gains after the next (public) compute jump. I think I spent very little time trying to answer this question, and that was lame (though acceptable given this was a Quick Take and not trying to be a strong argument). I think for an argument around this to actually go through, I should argue: without much larger dollar investments, The Scaling Picture won’t hold; and those dollar investments are unlikely conditional on GPT-5 not being much better than GPT-4. I won’t try to argue these in depth, but I do think some compelling evidence is that OpenAI is rumored to be at ~$3.5 billion annualized revenue, and this plausibly justifies considerable investment even if the GPT-5 gain over this isn’t tremendous.

Recent Discussion

weightt an
TLDR: give pigs guns (preferably by enhancing individual baseline pigs, not by breeding a new type of smart, powerful pig; otherwise it will probably just be two different cases. More like gene therapy than producing modified fetuses.)

As of lately I hold an opinion that morals are a proxy for negotiated cooperation or something; I think it clarifies a lot about the dynamics that produce it. It's like: evolutionary selection -> human desire to care about family and see their kids prosper; implicit coordination problems between agents of varied power levels -> morals.

So, like, uplift could be the best way to ensure that animals are treated well. Just give them power to hurt you and benefit you, and they will be included in moral considerations, after some time for it to shake out. Same stuff with hypothetical p-zombies: they are as powerful as humans, so they will be included. Same with EMs.

Also, "super beneficiaries" are then just powerful beings; don't bother to research the depth of experience or strength of preferences. (E.g. gods, who can do whatever and don't abide by their own rules and are perceived to be moral, as an example of these dynamics.) Also, a pantheon of more human-like gods -> less perceived power + perceived possibility to play on disagreements -> lesser moral status. One powerful god -> more perceived power -> stronger moral status. Coincidence? I think not.

Modern morals could be driven by a lot stronger social mobility. People have a lot of power now, and can unexpectedly acquire a lot of power later, so you should be careful with them and visibly commit to treating them well (e.g. be a moral person, with the particular appropriate type of morals). And it's not surprising how (chattel) slaves were denied a claim on being provided with moral considerations (or a claim on being a person or whatever), in a strong equilibrium where they were powerless and expected to remain powerless.
Dagon

I think this is misguided.  It ignores the is-ought discrepancy by assuming that the way morals seem to have evolved is the "truth" of moral reasoning.   I also think it's tactically unsound - the most common human-group reaction to something that looks like a threat and isn't already powerful enough to hurt us is extermination.  

I DO think that uplift (of humans and pigs) is a good thing on its own - more intelligence means more of the universe experiencing and modeling itself.  

The current patent process has some problems. Here are some of them.

patenting is slow

The US Patent Office tracks patent pendency. Currently, on average, it takes 20 months from filing to the first response, and over 25 months before issuance.

That's a long time. Here's a paper on this issue. It notes:

The USPTO is aware that delays create significant costs to innovators seeking protection. The result? A limiting of the number of hours an examiner spends per patent in order to expedite the process. As such, the average patent gets about nineteen hours before an examiner in total, between researching prior art, drafting rejections and responses, and interfacing with prosecuting attorneys. Plainly, this allotment is insufficient.

A patent application backlog means it takes longer before work is published and other people...

Donald Hobson
Also, there are big problems with the idea of patents in general.

If Alice and Bob each invent and patent something, and you need both ideas to make a useful product, then if Alice and Bob can't cooperate, nothing gets made. This becomes worse the more ideas are involved.

It's quite possible for a single person to patent something, and to not have the resources to make it (at least not at scale) themselves, but also not trust anyone else with the idea. Patents (and copyright) ban a lot of productive innovation in the name of producing incentives to innovate. Arguably the situation where innovators have an incentive to keep their idea secret and profit off that is worse. But the incentives here are still bad.

How about:

1. When something is obviously important with hindsight, pay out the inventors (an innovation-prize-type structure: say, look at all the companies doing X, and split some fraction of their tax revenue between the inventors of X). This is done by tracing backwards from the widely used product, not tracing forwards from the first inventor. If you invent something, but write it up in obscure language and it gets generally ignored, and someone else reinvents and spreads the idea, that someone gets most of the credit.
2. Let inventors sell shares that are 1% of any prize they receive for some invention.
bhauth

That has all been considered extensively before and this post isn't a good place to discuss it. Prizes have been found to be generally worse than patents and research funding.

This post is written in a spirit of constructive criticism. It's phrased fairly abstractly, in part because it's a sensitive topic, but I welcome critiques and comments below. The post is structured in terms of three claims about the strategic dynamics of AI safety efforts; my main intention is to raise awareness of these dynamics, rather than advocate for any particular response to them.

Claim 1: The AI safety community is structurally power-seeking.

By “structurally power-seeking” I mean: tends to take actions which significantly increase its power. This does not imply that people in the AI safety community are selfish or power-hungry; or even that these strategies are misguided. Taking the right actions for the right reasons often involves accumulating some amount of power. However, from the perspective of an...

quetzal_rainbow
Backlash for environmentalism was largely inevitable. The whole point of environmentalism is to internalize externalities in some way, i.e., impose costs of pollution/ecological damage on polluters. Nobody likes to get new costs, so backlash ensues.

That's a good point. But not all of the imposed costs were strategically wise, so the backlash didn't need to be that large to get the important things done. It could be argued that the most hardline, strident environmentalists might've cost the overall movement immensely by pushing for minor environmental gains that come at large perceived costs.

I think that did happen, and that similarly pushing for AI safety measures should be carefully weighed in cost vs benefit. The opposite argument is that we should just get everyone used to paying costs for ai safe...

quetzal_rainbow
Are you saying that the AIS movement is more power-seeking than the environmentalist movement, which spent $30M+ on lobbying in 2023 alone and has political parties in 90 countries, in five of which they are part of the ruling coalition? For comparison, this piece in Politico, with a maximally negative attitude, mentions around $2M of AIS lobbying. It's like saying "NASA's default plan is to spread the light of consciousness across the stars", which is kinda technically true, but in reality NASA's actions are not as cool as this phrase implies. "MIRI's default plan" was "to do math in the hope that some of this math will turn out to be useful".
peterbarnett
Fairness, Accountability, Transparency, Ethics. I think this research community/area is often also called "AI ethics"

Summary/Introduction

Aschenbrenner’s ‘Situational Awareness’ (Aschenbrenner, 2024) promotes a dangerous narrative of national securitisation. Despite what Aschenbrenner suggests, this narrative is not descriptive but performative: it constructs a particular notion of security that makes the dangerous world Aschenbrenner describes more likely to happen.

This piece draws on the work of Nathan A. Sears (2023), who argues that the failure to sufficiently eliminate plausible existential threats throughout the 20th century emerges from a ‘national securitisation’ narrative winning out over a ‘humanity macrosecuritisation narrative’. National securitisation privileges extraordinary measures to defend the nation, often centred around military force and logics of deterrence/balance of power and defence. Humanity macrosecuritisation suggests the object of security is to defend all of humanity, not just the nation, and often invokes logics of collaboration, mutual restraint...

Phib
(Cross-comment from the EA Forum)

Thank you for making the effort to write this post. Reading Situational Awareness, I updated pretty hardcore into national security as the probable most successful future path, and now find myself a little chastened by your piece, haha [and I just went around looking at other responses too, but yours was first and I think it's the most lit/evidence-based]. I think I bought into the "Other" argument for China and authoritarianism, and the ideal scenario of being ahead in a short-timeline world so that you don't have to even concern yourself with difficult coordination, or even war, if it happens fast enough.

I appreciated learning about macrosecuritization and Sears' thesis; if I'm a good scholar I should also look into Sears' historical case studies of national securitization being inferior to macrosecuritization.

Other notes for me from your article included: Leopold's pretty bad handwaviness around pausing as simply "not the way", his unwillingness to engage with alternative paths, the danger (and his benefit) of his narrative dominating, and national security actually being more at risk in the scenario where someone is threatening to escape mutually assured destruction. I appreciated the note that safety researchers were pushed out of/disincentivized in the Manhattan Project early and later disempowered further, and that a national security program would probably perpetuate itself even with a lead.

FWIW I think Leopold also comes to the table with a different background and set of assumptions, and I'm confused about this, but charitably: I think he does genuinely believe China is the bigger threat versus the intelligence explosion; I don't think he intentionally frames the Other as China to diminish macrosecuritization in the face of AI risk. See next note for more, but yes, again, I agree his piece doesn't have good epistemics when it comes to exploring alternatives, like a pause, and he seems to be doing his darnedest narrativel
Chris_Leong
This seems like a poorly chosen definition that's simply going to confuse any discussion of the issue. If neither macrosecuritisation nor a pause is likely to occur, what's the alternative if not Aschenbrenner? (To clarify, I'm suggesting outreach to the national security folks, not necessarily an AI Manhattan Project, but I'm expecting the former to more or less inevitably lead to the latter.)
Ruby
I was only familiar with sic to mean "error in original" (I assume Rafe also), but this alternative use makes sense too.

FWIW I was also confused by this usage of sic, bc I've only ever seen it as indicating the error was in the original quote. Quotes seem sufficient to indicate you're quoting the original piece. I use single quotes when I'm not quoting a specific person, but introducing a hypothetical perspective.  


Matt Yglesias explains why history is not a steam roller with a steering wheel welded tight, unyielding and incorrigible.

Of course, only a very naive person would see history as the unfolding of random occurrences driven purely by individual choices. But I think sophisticated people tend to overcorrect. Once something happens — like, for example, Joe Biden getting himself renominated — smart people are often eager to explain how this was "always going to happen."

This is, of course, at its worst, the historicism criticized by Popper, the idea that history is a deterministic process following a predetermined plan. Marx famously refrained from suggesting how a future communist society should function because if progress is driven by historical necessity, then the system is going to be what it's going...

This is especially true with geography. The political implications of hastily drawn borders are one example, but so is the question of which states get established at all.

For instance, in 1606, the Dutch were the first Europeans to set foot in Australia, but they didn't do much with it. They thought the land was too arid and resource-deficient, so they mostly just created maps of the Australian coastlines and kept their focus on the East Indies. Australia wasn't colonized until the late 18th century, largely because the British were more optimistic about it.


 

This post was inspired by some talks at the recent LessOnline conference including one by LessWrong user “Gene Smith”.

Let’s say you want to have a “designer baby”. Genetically extraordinary in some way — super athletic, super beautiful, whatever.

6’5”, blue eyes, with a trust fund.

Ethics aside[1], what would be necessary to actually do this?

Fundamentally, any kind of “superbaby” or “designer baby” project depends on two steps:

1.) figure out what genes you ideally want;

2.) create an embryo with those genes.

It’s already standard to do a very simple version of this two-step process. In the typical course of in-vitro fertilization (IVF), embryos are usually screened for chromosomal abnormalities that would cause disabilities like Down Syndrome, and only the “healthy” embryos are implanted.
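For intuition, here is a toy sketch of the "score and select" version of this two-step process; the variant names and effect sizes are invented for illustration and bear no relation to real polygenic scores, which involve thousands of noisily estimated variants:

```python
# Hypothetical toy model of step 1 + step 2: score candidate embryos with a
# made-up polygenic score and pick the highest-scoring one.
import random

# Assumed effect sizes for a handful of illustrative variants (not real data)
effect_sizes = {"rs_A": 0.30, "rs_B": -0.10, "rs_C": 0.20, "rs_D": 0.05}

def polygenic_score(genotype: dict) -> float:
    # genotype maps each variant to its count of effect alleles (0, 1, or 2)
    return sum(effect_sizes[v] * genotype.get(v, 0) for v in effect_sizes)

random.seed(0)
embryos = [{v: random.randint(0, 2) for v in effect_sizes} for _ in range(5)]  # simulated genotypes
best = max(embryos, key=polygenic_score)
print("scores:", [round(polygenic_score(e), 2) for e in embryos])
print("selected embryo genotype:", best)
```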

But most (partially) heritable traits and disease risks are...

> it’s not even clear what it would mean to be a 300-IQ human

IQ is an ordinal score, not a cardinal one: it's defined to have a mean of 100 and a standard deviation of 15. So all it means is that this person would be smarter than all but about 1 in 10^40 natural-born humans. It seems likely that the range of intelligence for natural-born humans is limited by basic physiological factors like the space in our heads, the energy available to our brains, and the speed of our neurotransmitters. So a human with IQ 300 is probably about the same as IQ 250 or IQ 1000 or IQ 10,000, i.e. at the upper limit of that range.
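As a quick back-of-the-envelope check on that figure (my own arithmetic, assuming the normal model): $z = (300 - 100)/15 \approx 13.3$, and the Gaussian tail gives $P(Z > z) \approx \frac{e^{-z^2/2}}{z\sqrt{2\pi}} \approx 10^{-40}$, i.e. roughly 1 in $10^{40}$.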

1Elias711116
I believe you meant to say: [...] it only makes skin cells?
habryka
Promoted to curated: I think this topic is quite important, and there has been very little writing that helps people get an overview of what is happening in the space, especially with some of the recent developments seeming quite substantial and, I think, being surprising to many people who have been forecasting much longer genetic-engineering timelines when I've talked to them over the last few years.

I don't think this post is the perfect overview. It's more like a fine starting point and intro, and I think there is space for a more comprehensive overview, and I would curate that as well, but it's the best I know of right now. Thanks a lot for writing this!

(And for people who are interested in this topic, I also recommend Sarah's latest post on the broader topic of multiplex gene editing.)

I want to thank @Ryan Kidd, @eggsyntax and Jeremy Dolan for useful discussions and for pointing me to several of the relevant resources (mentioned in this post) that I have used for linking my own ideas with those of others.

Executive summary

Designing an AI that aligns with human goals presents significant challenges due to the unpredictable nature of complex systems and the limitations in specifying precise initial conditions. This post explores these challenges and connects them to existing AI alignment literature, emphasizing three main points:

  1. Finite predictability in complex systems: Complex systems exhibit a finite predictability horizon, meaning there is a limited timeframe within which accurate predictions can be made. This limitation arises from the system's sensitivity to initial conditions and from the complex interactions within it. Small inaccuracies in

...
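A small toy illustration of the finite-predictability point (my own example, not from the post): two trajectories of a chaotic map started from nearly identical initial conditions diverge after a bounded number of steps.

```python
# Hypothetical sketch: the logistic map x_{t+1} = r * x_t * (1 - x_t) in its
# chaotic regime. A tiny error in the initial condition grows roughly
# exponentially, so useful prediction is only possible within a finite horizon.
r = 3.9
x, x_perturbed = 0.2, 0.2 + 1e-9  # initial conditions differing by one part in a billion

for t in range(1, 61):
    x = r * x * (1 - x)
    x_perturbed = r * x_perturbed * (1 - x_perturbed)
    if t % 10 == 0:
        print(f"t={t:2d}  |error| = {abs(x - x_perturbed):.2e}")
# The error reaches order 1 within a few dozen steps: beyond that horizon the
# initial measurement tells us essentially nothing about the state.
```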

If the human doesn't know what they would want, it doesn't seem fair to blame the problem on alignment failure. In such a case, the problem would be a person's lack of clarity.

Hmm, I see what you mean. However, that person's lack of clarity would in fact also be called a "bad prediction", which is something I'm trying to point out in the post! These bad predictions can happen due to a number of different factors (missing relevant variables, misspecified initial conditions...). I believe the only reason we don't call it "misaligned behaviour" is because we...

Alejandro Tlaie
With outer alignment I was referring to "providing well-specified rewards" (https://arxiv.org/abs/2209.00626). Following this definition, I still think that if one is unable to disentangle what's relevant to predict the future, one cannot carefully tailor a reward function that teaches an agent how to predict the future. Thus, it cannot be consequentialist, or at least it will have to deal with a large amount of uncertainty when forecasting on timescales that are longer than the predictable horizon. I think this reasoning is based on the basic premise that you mentioned ("one can construct a desirability tree over various possible future states").

Oh, but it does matter! If your desirability tree consists of weak branches (i.e., wrong predictions), what's it good for?

I believe it may have been a mistake on my side; I had assumed that the definition I was using for outer alignment was standard/the default!

I think this would match goal misspecification, yes! (And my working definition, as stated above.)

Completely agreed! On a related note, you may find this interesting: https://arxiv.org/abs/1607.00913
Alejandro Tlaie
I think there was a gap in my reasoning; let me put it this way: as you said, only when you can cleanly describe the things you care about can you design a system that doesn't game your goals (the thermostat). However, my reasoning suggests that one way in which you may not be able to cleanly describe the things you care about (predictive variables) is the inaccuracy-attribution degeneracy that I mention in the post. In other words, you don't (and possibly can't) know whether the variable you're interested in predicting isn't being accurately forecasted because of a lack of relevant things being specified (the most common case) or because of misspecified initial conditions of all the relevant variables.

I partially agree: I'd say that, in that hypothetical case, you've solved one layer of complexity and this other one you're mentioning still remains! I don't claim that solving the issues raised by chaotic unpredictability solves goal gaming, but I do claim that without solving the former you cannot solve the latter (i.e., solving chaos is a necessary but not sufficient condition).