Toward AI Realism

Opening Notes on Machine Learning and Our Collective Future

Holly Lewis

June 7, 2024

How do we talk about generative AI productively? People tend to feel a way about it, conversations run on vibes, and no one seems to know very much. Critics point out that models run on non-consensually harvested data, reproduce biases, and threaten to corrode our collective history, that scanners routinely fail to identify Black faces, bullies use deepnudes to harass women and gender non-conforming people, artists are being put out of work, education has been upended. And for what? A technology that’s a drain on the planet amid climate change. Extreme critics, also known as AI doomers, argue that the technology could end biological life, citing concerns ranging from rogue lethal autonomous weapons to emergent power-seeking behavior within the models themselves.

AI optimists have a different set of intuitions. They note that AI research has developed treatments for sickle cell anemia, made the COVID-19 vaccine possible, globally improved rural health, advanced cancer treatment, reduced pesticides, sniffed out wildfires, and pinpointed the world’s worst polluters, while autonomous vehicles and brain-computer interfaces allow greater self-determination for disabled people. Extreme Prometheans, like the effective accelerationism movement, champion boundless AI advancement even in hypothetical scenarios where billions die.¹

Most people try to blend the two positions into a reasonable middle ground: let’s use what’s good about artificial intelligence and toss the rest. AI debunkers, on the other hand, are annoyed by the conversation itself. For them, AI is a nothingburger with extra cheese: pure hype, the latest crypto-style hustle. Some fall into debunking because they’re exhausted by all the chatter. Others argue that trained machines will never replicate what organic brains do and therefore never be anything very interesting. Some, like Noam Chomsky, are concerned that AI research is passing itself off as science rather than engineering.² Others just move the goalposts to make the problem disappear: If this unpaywalled AI model isn’t a robot overlord capable of changing diapers while correctly recalling nuanced minutiae about academic philosophy, is it really fair to call it intelligence?

I propose a fourth position: AI realism.³ Beyond vibes, AI realists would be committed to grasping how the technology works, contextualizing it, and examining our intuitions, whether they be to vilify or idealize, to mystify or oversimplify. They would understand that models are not just commodities or platforms, but the unfolding outcome of the systemic logic of embedded material social relations. Large models are created for the purpose of profit maximization and trained on the data that humans have generated, ideologically, as subjects making sense of their lives within capitalist social relations. AI realism would measure the impact of machine learning in terms of months and years, rather than speculate about decades, centuries, and millennia. AI realism would entail intellectual humility, admit its own errors, and forgo wild leaps of logic without denying that the world is growing increasingly strange even as it becomes more predictable. Just because something isn’t unfolding as it would in a movie doesn’t mean it isn’t happening. And just because something feels like fiction doesn’t mean it isn’t true.

Developing AI realism requires taking on three tasks: (1) building a framework for coherent conversations about artificial intelligence, (2) reviewing how the most advanced model architectures roughly function, and (3) detailing trends in machine learning so that we know what we’re swimming in, what’s on the immediate horizon, and what’s as unreachable as the stars. Nowhere will I imply that artificial neural networks are conscious, have subjective experiences, or think. I use the term learning with no depth—that is, machines don’t know how to do a generalizable task, then they are trained, and then they can do the general task. If there is a better word for that phenomenon, I’m open to it.

Why Generative AI Is Different

Simply put, an artificial intelligence is a machine capable of outputting autonomous predictions, inferences, and actions in response to a prompt. Smart appliances, Google Maps, Siri, Alexa, vehicle parking assist, social media moderation bots, facial recognition cameras, predictive policing algorithms, and autonomous drones are all AI projects that have been used by us or on us for at least a decade.⁴

But none involve the qualitative advance in artificial intelligence known as transformer architecture. Transformer architecture, first developed in 2017 by Google researchers, made the growth of large language models (LLMs) possible and fueled rapid advancements in computer vision, robotics, sound and brain wave analysis, code generation, and scientific research overall.⁵ Artificial intelligence is not just an appliance component or a type of consumer-facing app, but the imperfect category for machine learning, a subset of computer engineering projects. It’s important to note that the advances we’re witnessing caught almost the entirety of the capitalist world by surprise, including many industry insiders. AI is not the product of a grand conspiracy. The history of its development is open and traceable.

Before transformer architecture, artificial neural networks were primarily used for projects like superficial anomaly detection, image and speech recognition, machine translation, and sentiment analysis in social media posts and product reviews. Transformer architecture improved these capacities and added rich content generation and even steps toward automated problem solving. For example, instead of just analyzing known protein structures, DeepMind’s AlphaFold, which debuted in 2018, had by 2022 predicted the structures of almost all of the world’s catalogued proteins, a task that would have taken the global possible workforce of PhDs hundreds of thousands of years to complete.⁶ A further step toward medical breakthroughs, AlphaFold 3 now predicts the structure and interactions of proteins, DNA, RNA, and ligands with high accuracy.⁷ In 2023, AI models discovered seven hundred new materials. And, of course, models are generating increasingly sophisticated permutations of text, image, music, video, 3D digital objects, and computer code from the matrix of humanity’s collective productivity, leaving some users enthralled with their newfound access to cultural productivity and many creative people bereft over the market devaluation of skill sets that not only pay the rent, but constitute how they make meaning in and through the world.

Large machine learning models are not specifically programmed to solve a specific task. They are not hand-coded software. They’re not based on if-then logic. Modeled loosely after human brains, they’re trained inductively, built with parallelized hardware and advanced chips known as compute. Constructed from calculus, linear algebra, and statistics, these mathematical machines transform data into an internal knowledge infrastructure. They do not store the content they’re trained on. The models’ ability to reason, which is in part an emergent property, is still weak. However, the largest models, called foundation models, make inductive and abductive choices about patterns in training data, then apply what has been gleaned to new inputs. As of this writing, they cannot learn continually from experience, which would require extended memory and real-time retraining.⁸

Large artificial neural networks are generative, but at the cost of consistency and intelligibility. As Remo Pareschi notes in his work on AI reasoning, getting the best output from an AI means collaborating with it. It doesn’t just do the thing.⁹ Users have to explain what they want from the AI, talk to it like a coworker or employee. Outcomes are therefore syntheses of the mathematical model’s ability to assess patterns mixed with our own prompting abilities. This interactivity requirement may eventually downgrade the industrial value of programming languages. The goal, as stated by Nvidia CEO Jensen Huang, is that non-experts should be able to talk to computers using natural language prompts, not code.¹⁰

Engineers know how to build large neural networks, but knowing how to build things doesn’t translate to understanding them. The term black box is neither hype nor conspiracy. Any large neural nets’ generative process is beyond the collective human capacity to calculate in real time.¹¹ Such a lack of transparency, unnerving on its own, makes it difficult to trace not only causality in models, but also biases. Since the quantity of cleaned data affects quality, everything that can be scraped tends to go into foundation models: thoughtful analysis, public domain texts, news, recorded conversations, YouTube transcriptions, movie and television dialogue, literary fiction, debates within Marxist history, polemics against marginalization and oppression, love letters, free academic papers, screeds of all sorts, calls for genocide, and rationalizations for colonialism, racism, queer-bashing, transphobia, antisemitism, and islamophobia.

For their own economic interests, tech firms have strengthened guardrails to prevent foundation models—pretrained AI models that can be adapted and fine-tuned for individual use cases—from inadvertently spewing antisocial content. But we don’t know how to replicate the guardrails or what inferences lurk below the interface on these closed-source networks. Nor does open source AI development solve ethical problems. The democratization of AI doesn’t preclude the creation and deployment of bespoke, toxic models.¹²

While many on the broad left are scrambling to find ways to stop AI models from scraping their content, and while European centrists are passing weak legislative seawalls to ward against the tsunami of disinformation and misuse, the far-right is rushing to load the Internet with prospective training data in hopes of shifting what they call “woke corporate models.” Whether any of these gambits could work is anyone’s guess.

A Snapshot of AI History

AI was not dreamed into existence by celebrity tech moguls in underground mansion bunkers. It is born of human intelligence, the result of the collective labor power of millions of workers across a century and around the globe: computer scientists, roboticists, software engineers, mathematicians, geoscientists, ocean cable technicians, underground miners, landfill workers, equipment installers, teachers, data annotators, semiconductor processors, and the millions of paid and unpaid carers who maintain them.¹³ Making celebrity CEOs the protagonists of our story, even as villains, undermines labor’s contribution to computational engineering and obscures capitalist logics.

As industrial and imperial rivalries intensified during the early twentieth century, fantasies about a universal computing machine began to percolate. Human calculators gave way to mechanical calculators, which in turn were replaced with vacuum tube computers whose blinking on/off signals became mid-century Hollywood’s visual shorthand for techno-futuristic command and control. Integrated circuits and transistors replaced vacuum tubes. Computers, once the size of whole buildings, became portable and the transmissible data packets they processed became the stuff of the modern Internet. Computing, until then largely a military project, grounded the material infrastructure that enabled capital to push for anti-labor trade agreements, which globalized supply chains to exploit labor power at its cheapest, displacing local and regional networks of production, exchange, and consumption.

AI research emerged within this matrix.¹⁴ With consistent funding from the US Department of Defense, its early researchers fell into two camps: symbolic or connectionist. Adherents of the symbolic approach, also known as good old-fashioned AI (GOFAI), assumed humans ought to code deductive rationality into machines. They defined intelligence as the ability to manipulate logical symbols and form abstract theories about the world. Connectionists—the field’s eccentric fringe—argued instead that breakthroughs would emerge from building brain-like electronic networks. Machine learning should emerge from the mechanization of empirical sense-making. Machines would no more need to learn logic to become agentic than a child needs to learn topology to tie their shoes. Horrified, the symbolic camp expelled connectionists from funding streams. By the 1970s, they wrote them out of AI history.

But, despite achievements in chess and logic, symbolic AI’s success sharply plateaued. When GOFAI worked, it yielded precise, air-tight deductions. But its success was limited. How could all human knowledge be encoded into symbols? And then there was the problem of robotic movement—Moravec’s Paradox. While machines might outperform humans at logic, programming them to complete simple physical tasks like twirling a pencil or lifting different types of objects was nearly impossible.

By the millennium, multicore processors and advanced graphics processing units provided the computational power neural networks required. As capitalists reorganized the Internet into platforms, the incomprehensibly vast data, now harvestable, transformed connectionist research. It turned out that machines didn’t require humans to code the meaning of the world into them. They just needed power and data.

Lots and lots and lots of data.

Quantitative to Qualitative Change: Data Relations and Data Production

From cave paintings to magnetic tape, humans have always stored information according to their material social needs. Punch card machines, invented at the close of the nineteenth century, tabulated data for war and colonial conquest as well as for labor management, supply chain control, and inventory tracking. Digital technologies were likewise organized through the lens of capital’s concerns: the acceleration of profit-extracting production and exchange.

Computation changed the way information moved during the last two decades of the twentieth century. In rooms full of clacking keyboards, data entry workers turned ink on paper into bits on a glowing glass screen. Corporate offices, once cluttered with overstuffed file cabinets, became aesthetically airy, now able to move records worldwide at the speed of light. This new era of data transfer helped capital pivot operations globally to wherever labor could be most cheaply exploited and, as well, facilitated lean production. Rapid information exchange also served to heal time gaps in what Harry Braverman describes as the severing of concept (intellect) and execution (action) in the labor process.¹⁵ At home, Internet consumers could now relay messages to friends through data packets. On clunky cathode-ray tube monitors everywhere, pundits proclaimed we were entering a new age of openness and connectivity.

But the Internet, originally ARPANET, was not intended as a technology to reconnect with childhood friends. It was a US Department of Defense Cold War hub for maintaining communication during nuclear threats. In the 1970s, it expanded internationally and, by the 1980s, tech standardization allowed it to become a resource for thousands of academics and journalists. Soon, the World Wide Web burst onto the scene capturing nearly ten million users, producing an information explosion.

Our relationship to data was fundamentally different during the web’s early years: no endless scroll, no reason to capture and hold the user’s gaze, no web search ranking beyond keyword counts.¹⁶

The operative metaphor for data collection was noise-to-signal ratio. Data harvesting may have been a boon for the CIA after 9/11, but making it profitable was messy. You had to comb through heaps of metadata—details like login times, browsing duration, and clickstream paths—to find the important stuff. Users floated over the Internet, ghostlike, anonymous, even designing peer-to-peer networks to evade centralized servers where data might be collected. Early netizens might have worried about hackers revealing their deep dark secrets, but few worried about third-party entities selling their clicking patterns.

Then platform capital lured users with the promise of frictionless social connection. In exchange, the platforms logged interpersonal networks. Originally, Google hadn’t thought to mine individual user data. But after the dot-com bust, with no products to sell, they pivoted to monetizing metadata and search content. By 2003, they filed a patent for personal data harvesting files. In 2006, they bought the supposedly unmonetizable YouTube, scanned one-third of all published texts into Google Books, and began to photograph streets and roads to build their rudimentary world simulator Google Earth.¹⁷ The problem of noise-to-signal ratio was solved: all data counted as a type of signal.

Early machine learning techniques helped platform capital decode us. The rise of social media and the global market penetration of smartphones ushered in an era defined by data analytics rather than data transfer. Suddenly, data brokers could make a fortune selling intel about individual psychologies to marketers and states. Police began using social media metadata—connections between accounts, likes, shares, and even the timing of posts—to map networks of activists and monitor protest potential, a seed project that quickly bloomed into predictive policing.¹⁸

In the 2010s, the connectionist camp harnessed this quantitative data explosion to build something qualitatively new. In 2009, Fei-Fei Li presented a paper dubbed ImageNet, detailing how she hired thousands of low-paid data labelers from the gig platform Mechanical Turk to help her organize a constellation of images.¹⁹ AlexNet processed Li’s work through a small neural network in 2012.²⁰ In 2013, Word2Vec language mapping made word embeddings (mathematical calculations of word meaning) possible.²¹ Soon, artificial neural nets were assessing sentiment in online posts, moderating content, organizing crowd control, scanning faces, determining credit scores, pricing real estate, vetting job applications, and helping sway elections.²² Artificial neural networks had reshaped the world into what Frank Pasquale calls the black box society, where machines make inscrutable calculations, pass their judgements onto humans, the results are accepted, and no one knows exactly why.²³

Of course, data transfer and data analytics are still ongoing. However, those pivotal moments in our social history are now materially fueling new technological projects. Our data is not just fodder for machines to analyze, our data is a precondition for the machine’s existence. If the ethos of the data transfer era was seamless communication and the ethos of the data analytics era was surveillant control, how do we decipher the meaning of a world enthralled with the possibility of letting the steering wheel steer the wheel?

How Generative AI Works

As AI realists, we have to do some work to get a hairsbreadth grasp on these machines. Most importantly, it must be underscored that neural networks are not algorithms. Algorithms are used to train models, but the models themselves are not algorithms. So, then, what are they? Artificial neural models consist of nodes capable of holding numbers that are interconnected through a tunable network. This sounds complicated, but can be understood more intuitively through an example. For instance, if we wanted to calculate the likelihood of a strike, we could weigh two important facts: say, outcomes of informal polls on a strike authorization vote, which in the example below I’ve set as two separate votes called “Vote Type A & B.” The number represents whether the vote carried. Weights are simply how much weight we give the event on a scale from 0 to 1. The weights in this example have been set by us in terms of what we consider meaningful. The image also shows something called the bias. Bias here doesn’t mean discrimination. In artificial networks, every neuron outside the first layer has its own adjustable number that slightly strengthens or weakens its output regardless of the inputs it receives, and this is called the bias.
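A single neuron of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of the essay's figure: the vote values, the weights, the bias, and the use of a sigmoid function to turn the total into a number between 0 and 1 are all hypothetical choices made for the example.

```python
import math

# Hypothetical inputs: 1.0 means the informal vote carried, 0.0 means it failed.
votes = {"vote_type_a": 1.0, "vote_type_b": 0.0}

# Hand-set weights on a 0-to-1 scale. These values are illustrative assumptions.
weights = {"vote_type_a": 0.7, "vote_type_b": 0.4}

# Bias: a constant added to the weighted sum (also an assumed value).
bias = -0.2


def sigmoid(x):
    """Squash any number into the range 0..1 so it reads as a likelihood."""
    return 1.0 / (1.0 + math.exp(-x))


def strike_likelihood(votes, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    total = sum(votes[name] * weights[name] for name in votes) + bias
    return sigmoid(total)


print(round(strike_likelihood(votes, weights, bias), 3))
```

With these assumed numbers the neuron reports a likelihood of roughly 0.62. Changing a weight or the bias by hand changes the answer, which is the sense in which the network is "tunable."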

We might be able to calculate the above model in our heads, but with more data the math becomes less manageable. Let’s imagine an architecture where we have three polls for input data and four distinct outputs: strike, no strike, bosses settle before strike, and company dissolves before strike.

Now we have a small prediction engine, but it’s weak. We need more data and middle (hidden) layers in our architecture to handle the complexity.

Our model now has 55 parameters and seems pretty complex.²⁴ For comparison, Yann LeCun’s classic handwriting recognition model from 1998 has seven layers and around 60,000 parameters.²⁵ GPT-4, in turn, is reported to have roughly 1.76 trillion parameters, along with a complex self-attention transformer architecture and adaptive processing, meaning that the model changes the way it processes based on the characteristics of queries.
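Parameter counts like these come from simple arithmetic: every connection carries one weight and every neuron past the input layer carries one bias. The sketch below builds a small network with three poll inputs, one hidden layer, and four outcomes, then counts its parameters. The hidden layer size of five, the random starting values, and the ReLU and softmax functions are assumptions chosen for illustration, so the total here (44) is not meant to match the 55-parameter model described above.

```python
import math
import random

OUTCOMES = ["strike", "no strike", "bosses settle", "company dissolves"]


def make_layer(n_in, n_out, rng):
    """One fully connected layer: a weight per connection, a bias per neuron."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, inputs):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        # Hidden layers use ReLU; the final layer uses softmax.
        activations = softmax(sums) if is_last else [max(0.0, s) for s in sums]
    return activations


def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)
sizes = [3, 5, 4]  # hypothetical: 3 polls in, 5 hidden neurons, 4 outcomes out
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward(layers, [1.0, 0.0, 1.0])

print(count_parameters(layers))  # (3*5 + 5) + (5*4 + 4) = 44
for name, p in zip(OUTCOMES, probs):
    print(f"{name}: {p:.2f}")
```

Because the weights are random and untrained, the printed probabilities are meaningless. The point is only how quickly the count of tunable numbers grows as layers are added.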

Before transformers, neural networks processed data tokens equally. With transformers, data passes through attention blocks that share information laterally before passing information forward to the next layer. An attention block can adjust one vector to change the meaning of another vector, even if that information is found at the end of a long passage. Some layers might focus on grammar, others on semantics, others on tone. But as for how it works in practice, all we have are educated guesses. The model itself determines how attention blocks change content and on what grounds.
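The lateral sharing of information can be sketched with scaled dot-product attention, the core calculation inside an attention block. This is a deliberately stripped-down illustration: real transformers first pass each token through learned query, key, and value projections and run many attention heads in parallel, whereas here the toy token vectors are used directly for all three roles. The vectors themselves are made up.

```python
import math


def softmax(values):
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, blend all value vectors, weighted by query-key similarity."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this token to every token, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one dimension at a time.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
    return outputs


# Three hypothetical two-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = attention(tokens, tokens, tokens)
for before, after in zip(tokens, updated):
    print(before, "->", [round(x, 3) for x in after])
```

Each output vector is a weighted mix of every token in the passage, which is how information from the end of a long passage can alter the representation of a word near the beginning.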

Attention blocks are not limited to language. OpenAI’s Sora produces moving images using visual patches rather than generating sentences from language tokens.²⁶

But this only describes how transformer models are structured. To actually understand how they function, we need to examine the training process.

Let’s say that we want to build a foundation model from scratch, because we have tens of millions of dollars lying around to buy the required high-end chips and pay the water and electric bills.²⁷ What would be required? Our first step would be to access cleaned data sets or clean our own. The data-cleaning industry once fully relied on low-paid human annotators, but now many sets, like the Colossal Clean Crawled Corpus, are publicly available. One unique aspect of data sets as a commodity is that they can be consumed without being exhausted, meaning that clean data sets can be used in perpetuity. Though data annotation continues to be a gig for human workers, significant elements of the cleaning process have been outsourced to AI models. One company, Dynosaur, has leveraged LLMs to reduce the cost of producing 800,000 instruction-tuning samples to just $12.²⁸

The amount of available cleaned data illustrates why poisoning algorithms such as Nightshade, which hope to corrupt image training models that use scraped content, are a weaker tool than they seem at first glance. Nightshade adds a drop of distorted data into a sea of sufficiently clean information.

But even clean data sets, below the surface, are murky to humans. They appear as scrambled clusters of uncontested facts broken into tokens and eventually into numbers. Time-crunched low-paid gig workers, machine learning engineers, and pre-processing AIs can’t correct factual or contextual errors in the same way a domain expert would. So who resolves these contradictions? While high-stakes data sets are reviewed by humans, lower stakes data sets are often resolved by the machines themselves without explanatory notes.

Now back to our model. Since we’re imagining that we’re building a foundation model, which, to be robust and complex, requires as much data as possible, it would be reasonable to use any clean data we could get our hands on. We cannot easily evaluate billions of tokenized, scrambled, decontextualized inputs for inaccuracies or discriminatory content. Next, we will organize the model architecture and set the weights and biases at random.²⁹ We cannot preload our assumptions into the weights like in the earlier example where we tried to predict our strike. The machine itself determines how to weigh the value of its connections. The bitter lesson has been that giving models helpful human hints during training counterintuitively slows their ability to reach benchmarks.³⁰ (And, per usual, no one knows why.) OpenAI’s Sora model illustrates how raw quantity yields quality. Its puppy videos didn’t improve because developers explained puppies to it or even because more puppy videos were added during training. More accurate puppies emerged solely from increases in compute power.³¹

Okay, now that we’ve chosen our model architecture, we’re ready to load our data—in scrambled batches to ensure randomization. The model calculates the values for each layer and feeds them forward. As the model engages the data, it determines patterns and relationships at a calculation rate inherently unfathomable to human engineers while simultaneously adjusting its own settings using algorithms to reorganize its initially random parts into a working system.³² Once the forward pass of data is complete, a phase called backpropagation calculates error gradients moving backward from the final layer to the beginning. Then, using a technique called gradient descent, the model readjusts its weights based on this new information. When loss (error) descends to the lowest point in a curve, it means the model is likely approaching optimal parameters. This is what learning looks like for generative AI. We can visualize it as water droplets running downhill toward the deepest nearby basins.
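The loop of forward pass, error gradient, and weight adjustment can be shown at the smallest possible scale. The sketch below trains a single neuron with one weight and one bias to recover the rule y = 2x + 1 from five made-up data points. Everything here is an assumption for illustration: the data, the learning rate, and the number of passes. With only one neuron, backpropagation reduces to a single application of the chain rule; in a real network the same calculation is repeated layer by layer from the output back to the input.

```python
import random

rng = random.Random(0)

# Hypothetical training data generated from the rule y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# Start from random settings, as described above.
weight = rng.uniform(-1, 1)
bias = rng.uniform(-1, 1)
learning_rate = 0.05


def mean_squared_error(weight, bias):
    """Loss: the average squared gap between prediction and target."""
    return sum((weight * x + bias - y) ** 2 for x, y in data) / len(data)


initial_loss = mean_squared_error(weight, bias)

for _ in range(500):
    # Forward pass and error gradients (the chain rule applied to the loss).
    grad_w = sum(2 * (weight * x + bias - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (weight * x + bias - y) for x, y in data) / len(data)
    # Gradient descent: nudge each parameter downhill.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

final_loss = mean_squared_error(weight, bias)
print(round(weight, 3), round(bias, 3))
print(initial_loss, "->", final_loss)
```

No one told the neuron that the answer was 2 and 1. It began with random values and arrived there by repeatedly measuring its own error and stepping downhill, which is the water-droplet image in miniature.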

We stop training when our model hits a plateau or reaches a benchmark. We might even leave the model a little underbaked to avoid overfitting. Overfitting is when a model organizes itself to fit the training data so well that it stops being able to properly assess new data because it clings too tightly to the original patterns. Large language models aren’t designed to directly reproduce items in their training sets. Artificial neural nets do not collage information. AI models are not intelligent hard drives pulling stored content from themselves.
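One common way to decide when to stop is called early stopping: after each pass, the model is scored on held-out data it is not training on, and training halts once that score stops improving. The sketch below is a minimal version of that idea and is offered as one standard technique, not as the specific procedure any particular lab uses. The list of validation losses and the `patience` setting are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return the epoch with the best held-out loss, stopping the scan once
    `patience` epochs in a row have failed to improve on it."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Hypothetical held-out losses: improving at first, then getting worse
# as the model starts clinging to its training data.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
print(early_stopping_epoch(losses))  # 3
```

The rising numbers at the end of the list are the signature of overfitting: the model keeps getting better on the data it has seen while getting worse on data it has not.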

Now that our model is trained, it will go through an evaluation phase, an entirely separate testing process with new data sets. Here the model runs inference, meaning it generates predictions on inputs it never saw during training, so that we can test it for accuracy. After evaluation, we would subject it to safety tests (red teaming) and allow human domain experts to evaluate and perhaps attempt to tweak its responses.

Okay! Now we have…wait, what exactly do we have? What we have is a next-token predictor. If we build a language model, we have a text predictor. If it’s an image model, we have an image predictor. If it’s a moving image model, we have a visual motion predictor. With machine learning, generation is prediction. So, did we go through all that work just to build spicy autocorrect? Well, it depends on what we mean by just and spicy. For example, if I ask my cell phone to complete “my cat went…,” it quickly collapses into gibberish:

My cat went to the hospital and I was wondering if you could get a ride home from work and when you get home I can get you a ride home and…

While GPT-4 yields:

My cat went to explore the mysterious corners of the old attic, where she discovered a hidden world of forgotten toys and sunbeams dancing through dusty windows.

Why is there a difference? How can an LLM calculate the meaning of words like cat or sunbeam, let alone tune parameters to assess how cats relate to sunbeams?

The first difference is that large models have longer context windows than autocorrect. A context window is how much content a transformer can synthesize before making a prediction. GPT-3 has a context size of 2,048 tokens. GPT-4 Turbo can process 128,000 tokens or about 200 single-spaced pages. Anthropic’s customers can buy context windows of up to a million tokens, which means the model can analyze around 2,000 pages of text before predicting the next word. Google has just announced a new architecture called infini-attention that gives models a theoretically infinite context window by creating a working memory within it.³³

But context length doesn’t explain how a model knows that sunbeams dance through dusty windows. Is that just a quoted fragment pulled from its training data mashed together with another quote? It is easy, and comforting, to imagine that models simply replicate and reassemble snippets from their training data. But, as illustrated, that’s not how properly functioning models generate responses.
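Returning to the first difference, the effect of a tiny context window can be seen in a toy predictor that only ever looks at the last word or two. The sketch below counts which word follows which in a made-up scrap of text and then always picks the most frequent continuation. It is an assumption that this resembles phone autocomplete, whose actual design is not described here, and the corpus is invented for the example. Ties are broken in favor of the continuation seen first.

```python
from collections import Counter, defaultdict


def train(tokens, context_size):
    """Count which token follows each run of `context_size` tokens."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        table[context][tokens[i + context_size]] += 1
    return table


def generate(table, prompt, context_size, steps):
    """Repeatedly append the most frequent continuation of the recent context."""
    output = list(prompt)
    for _ in range(steps):
        context = tuple(output[-context_size:])
        if context not in table:
            break
        next_token, _count = table[context].most_common(1)[0]
        output.append(next_token)
    return output


# Hypothetical training text.
corpus = ("my cat went to the vet and my dog went to the park "
          "and my cat sat on the mat").split()

one_word_model = train(corpus, context_size=1)
print(" ".join(generate(one_word_model, ["my", "cat"], 1, 8)))
# my cat went to the vet and my cat went
```

With one word of context, the predictor circles back to "my cat went" almost immediately, much like the phone example above. It has no way of knowing it already said that.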

The second difference is embeddings.

Humans use symbols to represent phenomena: letters to represent a cat (C-A-T), for example. Machines use numbers. In language models, each word is a set of coordinates mapped to other coordinates in multidimensional concept space.³⁴ These tangles of vectors, called embeddings, allow the model to discern relationships between concepts as if they were spatial distances. The point for cat, for instance, would be closer to rat than fog. Formed during training, embeddings can be imagined as multidimensional sculptures that capture relationships and complexity.³⁵ The number of embeddings used by GPT-4 is secret, but we can assume the model is more advanced than GPT-3, which had a vocabulary of 50,000 embeddings each with 12,288 dimensions of meaning.
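The idea of closeness in concept space can be made concrete with cosine similarity, a standard way of comparing the direction of two vectors. The three-dimensional vectors below are invented by hand for illustration. Real embeddings are learned during training and have thousands of dimensions, none of which carry tidy human labels.

```python
import math

# Hypothetical hand-made embeddings; real ones are learned, not written by hand.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "rat": [0.8, 0.9, 0.2],
    "fog": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; 0.0 means they are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_rat = cosine_similarity(embeddings["cat"], embeddings["rat"])
cat_fog = cosine_similarity(embeddings["cat"], embeddings["fog"])
print(f"cat vs rat: {cat_rat:.2f}")
print(f"cat vs fog: {cat_fog:.2f}")
```

With these made-up numbers, cat scores about 0.99 against rat and about 0.30 against fog, so the model would treat cat and rat as near neighbors.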

Neural networks can be trained on anything patterned: language, images, videos, music, brainwaves, code, engineering blueprints, motion.³⁶ Due to successful training on the latter, roboticists are coming closer to solving the problem of fluid machine movement. The result: artificial intelligence is stepping out of the two-dimensional frame and into the world.

Physical AI Agents

The first AI robot, Shakey, resembled a wobbly filing cart on wheels. Built at the Stanford Research Institute in 1966, it served for half a century as the model for realistic robotic horizons. Machines might surpass humans at calculation, but movement, philosophers argued, required a world awareness, an embodied intelligence that machines could never have.³⁷ This physical self-knowledge is so natural to humans that capital has long degraded it as unskilled.³⁸

Though reinforcement learning (RL) is an older machine learning technique based on behaviorist theories about operant conditioning, transformer architecture has enhanced its potential. Known as Deep RL, it enables robots to learn in unstructured environments with unforeseen objects. Virtual reality is now part of its training gym. Within Nvidia’s Omniverse, hundreds of versions of the same robot can simultaneously build skill sets in challenging virtual landscapes using different trial-and-error methods.³⁹ The real world robot then acquires this information via transfer learning so it can continue its training in the physical world.
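Trial-and-error learning can be sketched with tabular Q-learning, one of the oldest and simplest reinforcement learning methods. It is a stand-in chosen for brevity and is far removed from the deep, simulation-trained systems described above: there is no neural network and no physics, only a simulated agent in a five-cell corridor that earns a reward for reaching the last cell. The corridor, the reward, and all settings are assumptions made for the example.

```python
import random

N_STATES = 5          # positions 0..4 along a corridor; 4 is the goal
GOAL = N_STATES - 1
ACTIONS = [-1, +1]    # index 0 = step left, index 1 = step right


def step(state, action_index):
    """Move one cell, staying inside the corridor; reward 1.0 only at the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def choose_action(q_row, rng, epsilon):
    """Explore at random some of the time (or on ties); otherwise act greedily."""
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return rng.randrange(2)
    return 0 if q_row[0] > q_row[1] else 1


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values purely from trial, error, and reward."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = choose_action(q[state], rng, epsilon)
            next_state, reward = step(state, action)
            # Move the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
policy = ["right" if row[1] > row[0] else "left" for row in q_table[:GOAL]]
print(policy)  # ['right', 'right', 'right', 'right']
```

The agent is never told that the goal lies to the right. It stumbles there by chance, and the reward gradually propagates backward through the table until stepping right looks best from every cell.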

This is not theoretical or merely on the horizon. Virtually trained robotic AI is in its industrial piloting phase. New, cloud-based robotics-as-a-service models are allowing companies to run robot fleets 24/7. Agility Robotics’ Digit is loading boxes at Amazon’s Seattle facility.⁴⁰ That’s in addition to Amazon’s existing non-humanoid swarms. Apptronik just signed a deal with Mercedes-Benz.⁴¹ BMW signed with Figure, which uses OpenAI’s GPT model to converse with users.⁴² In Canada, auto parts company Magna is using Sanctuary AI’s humanoid, Phoenix. Boston Dynamics is claiming that its new Atlas, now electric, will be ready for industrial deployment in a few years.⁴³ But robot timelines may have just gotten a boost: Nvidia has launched a robotics foundation model, GR00T, alongside its Blackwell chip, which is five times faster than the compute that current models use.⁴⁴

Digital transfer in the twentieth century involved moving printed content into digital space. Now digitization involves whole persons. Meta’s Ego-Exo4D project, for instance, uses augmented reality to sync participants’ first-person perspective to external third-person cameras, capturing their fine-motor skills from subjective and objective viewpoints: a bicycle mechanic fixes a wheel, a stylist snips hair between her fingers, someone plays piano, cooks, climbs a rock wall.⁴⁵ Medical prosthetics and remote-operated industrial robots similarly produce human behavioral data for machine learning. Humanoids aren’t the only robot style, of course. Four-legged models are already deployed for industrial purposes, as are autonomous vehicular machinery, drones, and AI-trained robot arms.⁴⁶

However, autonomous physical machines are still in their infancy. Unsurprisingly, caretaking and household reproductive labor are still weak points for industrial development.⁴⁷ So is agentic AI, defined as autonomous AI that can draw conclusions about how to best complete tasks for users, although the new GPT-4o model’s multimodal fluidity is a step in the direction of agentic deployment. Foundation models, as of this writing, cannot reliably generate actions on behalf of users without complex sets of permissions.⁴⁸ Transformer architecture is also a rough fit for building machines that could engage in autonomous exploration and a suboptimal fit for building self-driving cars, which may be better served by smaller, explainable liquid neural networks.⁴⁹

Nor are current generative models artificial general intelligences (AGIs). Existing models have been largely trained to do one task well: language generation or image generation or brain wave interpretation or movement. Until recently, multimodal models have been mixtures of experts—three AIs in a trench coat. However, OpenAI’s new GPT-4o and Google’s newest version of Gemini are natively multimodal, meaning that mixtures of data types such as text, images, video, and sound went into their training sets. This is why OpenAI and Google talk about steps toward general intelligence. Models are indeed becoming less narrow. But AGI is a slippery term with no fixed meaning. Some use it to refer to “AIs that would think like humans.” Others use it to insinuate sci-fi superintelligence. Others use it to suggest that machines can achieve self-consciousness.

But when capitalists say AGI, it means something different: “Open AI defines AGI as highly-autonomous systems that surpass humans in most economically valuable tasks.”⁵⁰ Capital is not aiming to produce self-aware machines. Firms that give little thought to the rich internal lives of their meat-bag workers probably aren’t overly concerned about developing machines with deep thoughts. The hope seems to be for obedient, cost-cutting, productive things. And also for destructive things.

Algorithmic Killers and Deskillers

An algorithm is a step-by-step solution. Capitalism has always been algorithmic. The objectification of African peoples into slaves and the colonization of whole lifeworlds into exploitable resources would have been impossible without algorithmic protocols. For hundreds of years, capitalists have used scientific management to analyze workers’ behavior to maximize relative surplus value. The three volumes of Karl Marx’s Capital are, in a way, the first text dedicated to cracking the system’s core algorithms to reveal its hidden layers.

Harry Braverman’s Labor and Monopoly Capital extends Marx’s deskilling thesis, even to those tasked with conceptual labor.⁵¹ Braverman’s antagonists argued that conceptual labor was immune to proletarianization. And yet here we are. The World Economic Forum has predicted that 42 percent of workplace tasks will be automated by 2027.⁵² Matt Welsh, former Harvard professor of computer science, is telling potential junior software engineers that they will soon be replaceable at the cost of around $31 per year.⁵³ Many who remain employed will likely be subject to labor speed-ups and reskilling. Of course, new job categories will emerge. But unlike inventions such as steam, electricity, and digital computing, the entire hope for AI is that it will “outperform humans at most economically valuable work.” Human ability is being reframed as what machines can’t do yet, even though the source of machine ability is our art, our stories, our conversations, our engineering innovations, our habits, even our motor skills.

Proof-positive that conscious machines with rich interior lives are not the goal, AI fighter pilots beat human fighter pilots because they’ll risk self-destruction to win aerial dogfights.⁵⁴ A new US Department of Defense initiative, Replicator, is transporting thousands of lethal autonomous drones to the Indo-Pacific to leverage against a potential invasion of Taiwan.⁵⁵ Taiwan Semiconductor Manufacturing Company fabricates the large majority of the world’s most advanced computer chips. The geopolitical calculations here are dizzying. If China and the United States were to engage over Taiwan, the military hardware itself would depend on Taiwan’s manufacturing infrastructure.

But it’s not just tension between the United States and China at play. War is increasingly waged like a video game. Ukraine is an ongoing testing ground for autonomous drone warfare. And when it comes to transformer models, AI’s black-box nature can even be part of its ideological charm: welcome to AI-washing. Israel’s The Gospel picks bombing targets in Gaza while a separate AI, Lavender, chooses which Palestinians to execute, cloaking genocide in technical rationality.⁵⁶ Back in the United States, the Texas Department of Public Safety, the Department of Homeland Security, and the Los Angeles Police Department subscribe to, among other home-grown companies, a monitoring system from Cobwebs Technologies, an Israeli firm, originally trained on surveilling Palestinians’ social media and physical activities.⁵⁷ The system continually collects online data from the public, monitoring discussions across communities, even following people onto the dark web. The technology enables police to geofence locations and create instant target cards for individual protesters, each of which includes a profile photograph, online browsing history, a brokered data profile, and real-time tracking location.

As mentioned earlier, the most popular position when it comes to AI is “up with the good stuff, down with the bad.” This common-sense intuition forgets that what counts as good and bad is a matter of ongoing contestation, that we are the ethical agents, not the machines, and that we live inside an organizational logic where commodities are launched for profit, not for good. The AI safety movement talks about making sure machines understand human values, but in a world where the economic operating system actively devalues human flourishing and climate health, that directive seems like a blurry set of prompts. It’s hard to imagine the AI embedding vectors for “human values” as pointing to anything other than a set of vague niceties or a multidimensional wreck of contradictions.

Getting Real About AI

The internet boom…[might] end up looking a lot like the CB radio: initially a cult among specialists; a sudden, skyrocketing surge in popularity, and then, well…not much, really.⁵⁸

Do our computer pundits lack all common sense? The truth is no online database will replace your daily newspaper.… [Negroponte] predicts we’ll buy books and newspapers straight over the Internet. Uh, sure.⁵⁹

Given the overwhelming hype, many in the 1990s and early 2000s argued that the Internet was little more than a passing fad for tech enthusiasts in wealthy nations. But excitement levels don’t change material logic.⁶⁰ While some consumer commodities can survive on hype, if productive commodities don’t solve individual capitalists’ quest toward profitability, they fail.⁶¹ There’s very little hype surrounding ocean Internet cables or industrial lubricants and quite a bit surrounding semiconductors at the moment, but capitalist infrastructure collapses if any of them disappear. As Søren Mau details in Mute Compulsion, economic power radiates through and is constitutive of capitalist society. It compels workers to sell their labor power. It compels capitalists to adopt productivity accelerants at the right moment or risk obsolescence.⁶² The market is an algorithm of punishment and reward. But its algorithmic quality also implies an element of predictability, meaning we can cut through illusions—including our wildest hopes and deepest fears—to calculate its logical dependencies. So let’s make a few predictions based on what we now know.

It’s reasonable to assume that AI advancement will continue its steady acceleration. Even if venture capitalists were to zip their wallets, global defense budgets would almost certainly include AI research as a line item. Though there’s no reason to believe capital has any intention to divest—quite the opposite. Public support for outlawing AI research seems unlikely given its proven ability to advance medicine. Since, as I’ve tried to establish here, the choices of capitalists, broadly speaking, aren’t inherently irrational or hype-dependent and their goals don’t require solving hard problems like “machine consciousness,” and since individual implementations of accelerated AI will produce profits for those who get to market first (and doom for those who fail to adopt AI wisely), we can assume that pressures on capital for AI development and adoption will continue. This will cause what Marx called the organic composition of capital to further rise and employment will decline in many sectors, but at no point in the foreseeable future would it be machines all the way down.⁶³

Infrastructure collapse, however, is a genuine threat. AIs have bodies. Data centers alone account for 4 percent of the world’s carbon footprint. In fact, training AI models is so resource intensive that Microsoft is already under contract to build a nuclear reactor to support its AI goals. The higher organic composition of capital from increased machinery could potentially lead to lower-priced AI components, but struggle over conflict minerals, shortages in scientific and engineering labor, and even carefully organized targeted strikes could disrupt AI development. Since neural networks are statistically driven, they inherently contain margins of error, so it’s not absurd to speculate that error-prone models, architected for agency, could be capable of both preventing and causing disasters. While explainable and interpretable models function for some tasks, it’s hard to imagine generative models, particularly large language models, without black box components in their architecture, because language itself is fundamentally complex.

Artificial intelligence is a contested term. The word intelligence itself has a dark history and all too often ends up meaning the capacity for knowledge production in service of those who dominate. The artificial part of AI, however, is less contentious. But should it be? Machine learning models are made of mathematics—the web of existence—and models are designed to organize themselves via information about humanity and the natural world through algorithms that human math scholars invented. We may not all be model-building engineers, but our lives and bodies are the content that constructs the machines. AI reflects us. Just because something doesn’t directly crawl from the earth doesn’t make it alien. If AI is a form of collective general intelligence, and part of this collective intelligence yields the means to sustain life, why wouldn’t there be collective ownership of our data and our machines in the same way we assert the right to our bodies, our natural resources, and our cities?⁶⁴

Capitalism is an operating system. Operating systems can be changed. All we have is each other, the earth, the sky, our animal co-inhabiters, and these machines we’ve created. Humanity’s legacy should not be exploitation, the reduction of life into ingestible data sets, and the rationalization of genocide. Our current operating system has brought the world to its present state, but we are creative and resourceful beings. We can do better. And it is up to us. There is no technological solution because we are the origin of all technology. The future is our responsibility. It’s time to get real.