Python3 Deep Dive 101

Start the Python3 interpreter and type 29.

>>> 29
29

The >>> that you see is the prompt of the REPL. The REPL is one of the ways you can interact with Python3, the other obvious one being writing code in a file and then executing it.

REPL stands for Read – Evaluate – Print – Loop. It does what it says: it reads user input, evaluates it, prints the result if there is anything to print, then loops back and waits for more input.
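Here is a minimal sketch of that read, evaluate, print, loop cycle, written in Python itself. The name mini_repl and its lines parameter are made up for illustration, and it only handles single-line input, so treat it as a toy and not as how the real interpreter is built.

```python
import io
from contextlib import redirect_stdout

def mini_repl(lines):
    """Feed each source line through a read-evaluate-print cycle.

    Hypothetical helper: `lines` stands in for what a user would type.
    Returns everything that would have been printed, as one string.
    """
    namespace = {}
    out = io.StringIO()
    with redirect_stdout(out):
        for source in lines:                      # Read
            try:
                # Expressions compile in "eval" mode and produce a value.
                code = compile(source, "<mini-repl>", "eval")
            except SyntaxError:
                # Statements (e.g. assignments) are executed; nothing to print.
                exec(compile(source, "<mini-repl>", "exec"), namespace)
                continue
            value = eval(code, namespace)         # Evaluate
            if value is not None:
                print(repr(value))                # Print
    return out.getvalue()                         # Loop ends when input runs out

print(mini_repl(["29", "x = 5", "x + 2"]), end="")
```

Feeding it 29, x = 5 and x + 2 prints 29 and 7; the assignment prints nothing, just like in the real REPL.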

You can interact with many languages through a REPL, but the expressive power of each differs. For instance, GDB can be considered a REPL for C, but unlike the Python3 REPL, where you can create functions on the fly, you cannot do that in C with GDB.

…and now, most importantly, it is not called REPL Loop 😀

Oh, the irony!

So what happens when you type 29 in REPL?


The Python3 interpreter reads 29. Now it needs to evaluate it. It will realize that 29 is an Integer Literal (or is it?…) and that there is no need to evaluate a literal, because it is already in its simplest form, a value (not really, and we will discuss this later). Python3 simply prints the value and loops back to read user input.

Let us type something else,

>>> 3 + 5
8
>>> x = 5
>>> x + 2
7

Two of the lines above, 3 + 5 and x + 2, are called expressions. Evaluating an expression yields a value. So when the Python3 interpreter reads a line, it finds out whether the line is itself a value or an expression that needs to be evaluated to be reduced to a value. So basically, our end goal is a value.
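One way to see the difference for yourself is to ask Python's own parser. This is a small sketch using the standard ast module; the helper name classify is mine, not something from the language.

```python
import ast

def classify(source):
    """Label a single line of source as an expression or a statement."""
    node = ast.parse(source).body[0]
    # A bare expression is wrapped in an ast.Expr node; anything else
    # (assignment, import, def, ...) is some other kind of statement.
    return "expression" if isinstance(node, ast.Expr) else "statement"

for src in ["29", "3 + 5", "x + 2", "x = 5"]:
    print(f"{src!r:10} -> {classify(src)}")

# Only expressions can be handed to eval() to get a value back.
print(eval("3 + 5"))
```

The assignment x = 5 is the only line reported as a statement, which matches the REPL session above: it is also the only line that printed nothing.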

There is a nice answer on stackoverflow that explains what an expression is, more formally :

Let us now dive directly into Python3. Paste this code snippet ( found on reddit ) in the REPL and type 29 again :

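import ctypes

def deref(addr, typ):
    # Reinterpret a raw address as a C pointer to typ ( CPython-specific trick )
    return ctypes.cast(addr, ctypes.POINTER(typ))

# Overwrites memory inside the Integer Object for 29; this can crash the interpreter
deref(id(29), ctypes.c_int)[6] = 100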
>>> 29
100

Wait, what !?

If you are coming from a language like C, where the number 29 is an Integer literal, you might be a little surprised by what just happened. In C, integer literals are literally literals. That means that when the compiler encounters 29, it does semantic analysis (?) and converts 29 to its binary form.

The story of Python3 is however different. It is captured in the quote below :

>>> Everything in Python3 is an Object !

Integer literals, other types of literals, functions, classes, instances of classes *duh*, variables* and everything else are Objects. Every time Python3 encounters a literal, an object is created; when it encounters a function definition, an object is created; when it sees a class definition, an object is created; and so on and so forth.

An Object is something that has (?) attributes, also called variables or state, and methods, also called functions, and what not.
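A quick way to convince yourself is to poke at a few of these things from the REPL. The names greet and Point below are throwaway examples of mine.

```python
def greet():
    return "hi"

class Point:
    pass

things = [29, "text", 3.14, greet, Point, Point(), None]

# Every one of these is an instance of `object` and has a type.
for thing in things:
    print(type(thing).__name__, isinstance(thing, object))

# Even the literal 29 carries attributes and methods.
print((29).real, (29).bit_length())
```

Note the parentheses around 29 in the last line: without them the parser would read 29. as the start of a float.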

So 29 is actually an Integer literal that identifies ( refers to ) an Integer Object. This Integer Object has an attribute, amongst others, called ‘value’ that contains the value 29. Let me say that again… 29 is just an identifier in the Symbol table ( also exists in memory* ) that refers to an Object which has multiple attributes, one of them being ‘value’, which stores 29. Which means that if I can somehow change the contents stored in the attribute ‘value’ from ‘0001 1101’ to, let’s say, ‘0000 0101’, then the Integer Object referred to by 29 will contain the value 5 in its ‘value’ attribute!

In fact, that is what the reddit snippet above did. We will deep dive into that later. Let us look at the code and memory snapshot of C and Python :
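If you want to double check the two bit patterns mentioned above, Python can print them for you. This only looks at the numbers themselves, not at how any object is laid out in memory.

```python
print(format(29, "08b"))      # bit pattern of 29, padded to 8 bits
print(format(5, "08b"))       # bit pattern of 5, padded to 8 bits
print(int("00000101", 2))     # reading the second pattern back as a number
```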

C Language

int x;
x = 5;

This code allocates 4 bytes, or more precisely sizeof(int) bytes, in memory and stores the value 5 in it. After execution of the code above, this is what the memory snapshot looks like :

.
.
0x1000
0x1001  0000 0101 <--- 5
0x1002  0000 0000
0x1003  0000 0000
0x1004  0000 0000
0x1005
.
.

Let us forget about the Endianness of the architecture for now. If you see, 4 bytes are used after the execution of those 2 lines. ‘x‘ simply exists in the Symbol table and is mapped to address 0x1001 and its size is 4 bytes ( YMMV )

Now if you use a C pointer :

int* ptr;
ptr = &x;

The memory snapshot will change to something like the following :

.
.
0x0050  0000 0001 <--- points to 0x1001
0x0051  0001 0000 <---
0x0052  0000 0000
0x0053  0000 0000
.
.
.
0x1000
0x1001  0000 0101 <--- 5
0x1002  0000 0000
0x1003  0000 0000
0x1004  0000 0000
0x1005
.
.

Here the ‘ptr‘ variable is created at the address 0x0050. As you would expect, it stores the address of ‘x‘, i.e. 0x1001, in binary format.

So now we use 8 bytes ( let us assume the int* is 4 bytes ): 4 bytes for storing the actual value, i.e. 5, and 4 bytes to store the pointer or address or location or reference to that value.
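You can play with the same idea without leaving Python, because ctypes exposes C-style integers and pointers. This is only a rough mirror of the C snippet; the sizes and addresses printed depend on your machine ( on a 64-bit system the pointer is usually 8 bytes, not the 4 assumed above ).

```python
import ctypes

x = ctypes.c_int(5)          # like `int x; x = 5;`
ptr = ctypes.pointer(x)      # like `int* ptr; ptr = &x;`

print(ctypes.sizeof(x))                  # bytes used by the int itself
print(ctypes.sizeof(ptr))                # bytes used by the pointer
print(hex(ctypes.addressof(x)))          # where x lives in memory
print(ptr.contents.value)                # dereference: read x through ptr

ptr.contents.value = 7       # writing through the pointer changes x
print(x.value)
```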

Python3

In Python3 if you do the following :

>>> x = 5

This does something similar to the C code above. It actually creates ‘x‘ as a reference variable ( do not call it a pointer, I will explain why ), creates 5 as an Integer Object, and makes ‘x‘ refer ( not point ) to the Object.

This means, that with a single statement in Python3, we have allocated 2 chunks of memory, one is the Integer object that holds the value 5 and other being the reference variable referring to it.

We arrive at a new conclusion here about variables* :

>>> Variables in Python3 can only store the Object's ID.

How to check this ? Well, let us look at the code below :

>>> h = 5
>>> id (h)
4521951024
>>> id (5)
4521951024

As we have already discussed, 5 is an Integer Literal that refers to an Integer Object that contains 5. Its id, i.e. a unique identification ( revert back ? 😀 ), is 4521951024 in this case. Now when you say ‘h = 5‘, the id is copied into the variable h.
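The same thing is easier to see with a mutable object such as a list. In this small sketch, a and b are just two names I picked; assigning one to the other copies the reference and nothing else.

```python
a = [1, 2, 3]
b = a                  # copies the reference, not the list

print(a is b)          # both names refer to the same object
print(id(a) == id(b))  # so they report the same id

b.append(4)            # mutate through one name...
print(a)               # ...and the change is visible through the other
```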

This is a really good explanation on StackOverflow. Read before you proceed.

For CPython, id(x) is the memory address where x is stored.

Now, as I have warned before, do not call Python3 variables pointers ( call them references ). The reason is that those variables do not always point to the Object’s memory location. Only in the CPython implementation of Python3 is the Object ID equal to the memory address of the Object. In other implementations you would still get a unique ID per Object, but it might not be the memory address.
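On CPython you can check the "id is the address" claim by going the other way: take the id and turn it back into the object. This is a sketch that relies on a CPython implementation detail, so it is guarded by a check on the implementation name.

```python
import ctypes
import platform

obj = ["some", "object"]

if platform.python_implementation() == "CPython":
    # CPython detail: id(obj) is the object's address, so it can be
    # turned back into a reference to the very same object.
    same = ctypes.cast(id(obj), ctypes.py_object).value
    print(same is obj)
else:
    print("id() is not guaranteed to be a memory address here")
```

Only do this with the id of an object that is still alive; an id left over from a destroyed object is just a stale number.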

Now with these fundamental concepts in mind, let us decode the code :

import ctypes

def deref(addr, typ):
    return ctypes.cast(addr, ctypes.POINTER(typ))

deref(id(29), ctypes.c_int)[6] = 100
  • The code works on the Integer Object referred to by the Integer literal 29.
  • The deref function returns a ctypes pointer to the C Structure which is the actual CPython representation of a Python3 Object.
  • Index 6 of that pointer, counted in units of c_int ( 24 bytes in on a typical 64-bit CPython build ), is the field called value here ( ob_digit in the CPython source ), which originally contained 29.
  • When we say deref ( . . . )[6] = 100, we take the value from the Integer Object referred to by 100 and copy it into the value of the Integer Object referred to by 29.
  • Read the last point again.
  • And again… ( check this out where there is an in-depth explanation of the code )
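If you would rather look than touch, the same field can be read without overwriting anything. This is a sketch under the assumption that CPython keeps an int's digits right after its fixed-size header, whose size is reported by int.__basicsize__; the helper name first_digit is mine, and none of this applies to other Python implementations.

```python
import ctypes
import platform
import sys

def first_digit(n):
    """Read, without modifying, the first internal digit of a CPython int.

    Assumption: the digits start int.__basicsize__ bytes into the object.
    Only meaningful for small non-negative values that fit in one digit.
    """
    if platform.python_implementation() != "CPython":
        raise RuntimeError("this peeks at CPython's int layout")
    # Digits are 4 bytes wide on most builds and 2 bytes on some.
    if sys.int_info.sizeof_digit == 4:
        digit_type = ctypes.c_uint32
    else:
        digit_type = ctypes.c_uint16
    return digit_type.from_address(id(n) + int.__basicsize__).value

print(int.__basicsize__)      # byte offset where the digits start
print(first_digit(29))        # the stored value, read straight from memory
```

On a typical 64-bit build the offset divided by the size of a c_int is where the magic index 6 in the snippet comes from.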

So now if I do the following, it should make sense :

import ctypes

def deref(addr, typ):
    return ctypes.cast(addr, ctypes.POINTER(typ))

deref(id(100), ctypes.c_int)[6] = 29
deref(id(29), ctypes.c_int)[6] = 100
>>> 29
29
>>> 100
29

Think about how to revert this behavior ( solution is in the reddit reference ).

Now can you explain the sequence below ( Hint: Python Small Integer Caching ) :

>>> import ctypes

>>> def deref(addr, typ):
...     return ctypes.cast(addr, ctypes.POINTER(typ))

>>> deref(id(99999), ctypes.c_int)[6] = 29

>>> 99999
99999

>>> y = 99999
>>> y
99999

>>> deref(id(y), ctypes.c_int)[6] = 29
>>> y
29

>>> 99999
99999
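To expand on the hint a little: CPython keeps one shared object for each small integer, while larger integers get a fresh object whenever they are created. The sketch below shows that difference without touching any memory; it is CPython-specific behaviour, not a language guarantee, and the variable names are mine.

```python
# int("...") builds the numbers at run time, so the compiler cannot
# merge equal constants and hide what the cache is doing.
small_a, small_b = int("256"), int("256")
large_a, large_b = int("99999"), int("99999")

print(small_a is small_b)   # on CPython: the same cached object
print(large_a is large_b)   # on CPython: two separate objects
```

With that in mind, ask yourself which 99999 object each line of the sequence above is actually modifying.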

Leave any questions below and I will do my best to answer them.

Thank you for reading!

References
