Ever since GPT-3 entered the stage I’ve been interested in how the news and development of future GPT models will be received by the public. And I am a little miffed: GPT-3, a model deemed too dangerous to be released to the public, is still mostly known and discussed only within the AI community.
Not many people realize that today’s state-of-the-art technology is far more advanced than it is usually portrayed. We are constantly bombarded with claims that AI is coming, AI is here, AI is not as good as humans, AI is a threat, AI is not a threat, and so on. It is easy to get lost in the sensationalist headlines and clickbait for and against AI, so much so that we dismiss any truly important news as just another post about AI. The “beauty” of AI is that it has become prevalent yet invisible, since most of its activity happens in support of some service offered to customers. Advertisers use it to predict your preferences, social media companies use it to “steer” your interests, and the Chinese Communist Party uses it in its social credit system to pacify its population, to name just a few lesser-known examples alongside the ever-present one of self-driving cars.
Pace of progress
There have been so many developments in how we interface with computers over the last couple of decades that it borders on science fiction. For many people it is science fiction, and if anyone had asked me 10 years ago, I would have been hard-pressed to imagine such rapid development.
Two years ago, in 2019, researchers from Japan demonstrated a new model for reconstructing images from human brain activity. Such efforts have been going on for a long time but were usually dismissed as fiction: there is no way to get an image out of the human mind, it simply is not possible, there are too many connections, the activity in the mind is too chaotic.
Well, the improvement since the first proper attempts in 2008 has been drastic, to put it mildly.
If the complexity of the task is reduced, let’s say from decoding video or images from brain activity to something simpler like writing, the job gets much easier. A 2020 paper from the US details a procedure that can read alphabet characters from brain activity with an accuracy of over 94%, or “90 characters per minute at >99% accuracy with a general-purpose autocorrect”. The procedure is invasive and requires a microelectrode array to be implanted in your brain, but with Neuralink on the horizon, it should become less so.
The advances don’t end there. In 2019 another research paper described a multi-person brain-to-brain interface for direct collaboration between brains, with an average accuracy of 81%.
The interface combines electroencephalography (EEG) to record brain signals and transcranial magnetic stimulation (TMS) to deliver information noninvasively to the brain. The interface allows three human subjects to collaborate and solve a task using direct brain-to-brain communication. Two of the three subjects are "Senders" whose brain signals are decoded using real-time EEG data analysis to extract decisions about whether to rotate a block in a Tetris-like game before it is dropped to fill a line. The Senders' decisions are transmitted via the Internet to the brain of a third subject, the "Receiver," who cannot see the game screen. The decisions are delivered to the Receiver's brain via magnetic stimulation of the occipital cortex.
…
Our results raise the possibility of future brain-to-brain interfaces that enable cooperative problem solving by humans using a "social network" of connected brains.
Resistance is futile. Your biological and technological distinctiveness will be added to our own.
As mentioned before, not everyone is fine with having doctors cut open their head to put electrodes in, but there is another recent discovery pulling everything together: graphene. Graphene is biologically inert and interfaces with our body without rejection, making it a perfect material to replace today’s bulky platinum electrodes while also increasing the resolution 16 times.
It will be some time before we will be able to upload ourselves to the digital domain. That which was once in the realm of hard science fiction is more and more looking like a real possibility we could see in the next decade if this pace of progress continues.
The Digital You ChatBot
All AI models need labeled data to learn from. We can’t expect AI to recognize a car or a plane without having images labeled “car” or “plane” to learn from. Depending on what data we give it, it could even develop undesirable personality traits, as Microsoft learned in 2016 when its Twitter chatbot turned into a Nazi in just 24 hours. As with a human child, a digital one will learn what we teach it.
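As a minimal illustration of what “labeled data” means in practice, here is a sketch of a tiny labeled dataset in the shape most training libraries expect; the file names and labels are made up for the example.

```python
# A toy labeled dataset: every example pairs an input with the answer
# we want the model to learn. Without the "label" field there is
# nothing for a supervised model to imitate.
labeled_examples = [
    {"image": "photos/0001.jpg", "label": "car"},
    {"image": "photos/0002.jpg", "label": "plane"},
    {"image": "photos/0003.jpg", "label": "car"},
]

# The same idea applies to a chatbot: the "label" is simply the text
# we want it to produce next, so it will happily learn toxic replies
# if that is what the training data contains.
chat_examples = [
    {"prompt": "Hello!", "reply": "Hi there, how can I help?"},
]
```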
There is now a standardized collection of texts on which Generative Pre-trained Transformer (GPT) models like GPT-Neo are trained, and it is, without a hint of sarcasm, called The Pile.
Currently, it is an “825 GB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.” It contains texts from sources like Pile-CC, PubMed Central, Books3, OpenWebText2, ArXiv, Github, FreeLaw, StackExchange, USPTO Backgrounds, PubMed Abstracts, Gutenberg (PG-19), OpenSubtitles, Wikipedia (en), DM Mathematics, Ubuntu IRC, BookCorpus2, EuroParl, HackerNews, YoutubeSubtitles, PhilPapers, NIH ExPorter, and Enron Emails.
Not sure how I feel about the inclusion of the Enron emails in an AI that could be used for e-commerce purposes. It’s not a big dataset, just shy of 1 GB, and is “used to aid in understanding the modality of email communications”. So, it will be polite to you while it’s screwing you out of your savings.
The people at EleutherAI have a great sense of humor, as The Pile is hosted by The Eye.
The-Eye is a non-profit, community driven platform dedicated to the archiving and long-term preservation of any and all data including but by no means limited to... websites, books, games, software, video, audio, other digital-obscura and ideas.
140TB of data.
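Before committing to the full download, you can peek at what The Pile actually contains. Below is a hedged sketch using the Hugging Face datasets library; the dataset id "EleutherAI/pile" is an assumption about the Hub listing, and mirrors of the data come and go, so it may have moved.

```python
from datasets import load_dataset

# Stream The Pile instead of downloading all 825 GB up front.
# NOTE: the dataset id "EleutherAI/pile" is an assumption and may
# no longer be available under that name.
pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

# Print the source subset and the first 80 characters of a few documents.
for i, doc in enumerate(pile):
    print(doc["meta"]["pile_set_name"], "->", doc["text"][:80])
    if i >= 4:
        break
```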
What makes The Pile important is that it can be amended and edited, and most notably it is publicly available and can be downloaded. That makes it possible to create a fully custom AI that has The Pile as its base of knowledge and is then “fine-tuned” with your own data. If you have a large body of work, such as articles, books, social media posts, and comments, you could load it up, give it higher significance, and what you get is an approximation of your mind that fits into about 70 GB of computer memory. It could also end up being smarter than most of the people it mimics; after all, it has read the entire English Wikipedia.
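To make the “fine-tuned with your own data” step concrete, here is a minimal sketch of teaching a small GPT-Neo your own writing, assuming your articles and posts are collected in a file called my_writing.txt (one document per line). The model size, file name, and hyperparameters are illustrative, not a recipe from the project itself.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "EleutherAI/gpt-neo-125M"      # small enough for a single GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token    # GPT-Neo ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your body of work becomes the fine-tuning corpus (assumed file name).
corpus = load_dataset("text", data_files={"train": "my_writing.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="digital-you",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("digital-you")
```

The result is a model that still knows everything it learned from The Pile but leans toward your vocabulary and opinions; scaling the same procedure up to the larger GPT-Neo variants mostly costs memory and time.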
Strides are already being made to bring AI down from datacenter scale to something manageable on local computers. One such endeavor is project Replicant, which aims to create a self-hostable, open-source AI companion.
You can check out a GPT-Neo chatbot on Google Colab to get a feeling for where AI is going.
You are now caught up on the most important developments in general AI.
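If you want a quick taste without Colab, a bare-bones GPT-Neo chat loop with the transformers library might look like the sketch below; the prompt format is just one simple convention I made up, not what the linked notebook uses.

```python
from transformers import pipeline

# Load the smallest GPT-Neo; the 1.3B and 2.7B variants converse better
# but need correspondingly more memory.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

history = ""
while True:
    user = input("You: ")
    history += f"You: {user}\nBot:"
    out = generator(history, max_new_tokens=60, do_sample=True,
                    temperature=0.8)[0]["generated_text"]
    # Keep only the bot's new text, cutting it off at the next turn marker.
    reply = out[len(history):].split("You:")[0].strip()
    print("Bot:", reply)
    history += f" {reply}\n"
```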
Welcome to the Future, have a great day.