Reading the Python Source: Built-in Objects (2)

Yaoxuan Wei

2018-03-08 (Updated: 2020-11-29)

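These are my reading notes on the book 《Python源码剖析: 深度探索动态语言核心技术》 (roughly, "Python Source Code Analysis: An In-Depth Look at the Core Technology of a Dynamic Language").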

Integer objects in Python - IntObject

The IntObject object

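As mentioned earlier, Python objects come in two kinds, fixed-length and variable-length, where a variable-length object is just an extension of a fixed-length one. Another way to split them is into mutable and immutable objects, and the IntObject covered in this section is immutable.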

In intobject.h we can see:

PyIntObject represents a (long) integer. This is an immutable object;
an integer cannot change its value after creation.

Once created, its value can never change.

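But we know that even a simple program creates, changes, and destroys integers in huge numbers. Given Python's reference-counting garbage collection, wouldn't all that allocating and freeing hammer the system heap? As you may already know, Python deals with this by using an integer object pool, which really just acts as a buffer. IntObject is not the only case: other immutable objects use this kind of buffer pool too.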

As usual, let's look at the code:

PyTypeObject PyInt_Type = {
	PyObject_HEAD_INIT(&PyType_Type)
	0,
	"int",
	sizeof(PyIntObject),
	0,
	(destructor)int_dealloc,		/* tp_dealloc */
	(printfunc)int_print,			/* tp_print */
	0,					/* tp_getattr */
	0,					/* tp_setattr */
	(cmpfunc)int_compare,			/* tp_compare */
	(reprfunc)int_repr,			/* tp_repr */
	&int_as_number,				/* tp_as_number */
	0,					/* tp_as_sequence */
	0,					/* tp_as_mapping */
	(hashfunc)int_hash,			/* tp_hash */
        0,					/* tp_call */
        (reprfunc)int_repr,			/* tp_str */
	PyObject_GenericGetAttr,		/* tp_getattro */
	0,					/* tp_setattro */
	0,					/* tp_as_buffer */
	Py_TPFLAGS_DEFAULT | Py_TPFLAGS_CHECKTYPES |
		Py_TPFLAGS_BASETYPE,		/* tp_flags */
	int_doc,				/* tp_doc */
	0,					/* tp_traverse */
	0,					/* tp_clear */
	0,					/* tp_richcompare */
	0,					/* tp_weaklistoffset */
	0,					/* tp_iter */
	0,					/* tp_iternext */
	int_methods,				/* tp_methods */
	0,					/* tp_members */
	0,					/* tp_getset */
	0,					/* tp_base */
	0,					/* tp_dict */
	0,					/* tp_descr_get */
	0,					/* tp_descr_set */
	0,					/* tp_dictoffset */
	0,					/* tp_init */
	0,					/* tp_alloc */
	int_new,				/* tp_new */
	(freefunc)int_free,           		/* tp_free */
};

After removing every slot that is not defined, what remains is:

PyTypeObject PyInt_Type = {
	...
	(destructor)int_dealloc,		/* tp_dealloc */
	(printfunc)int_print,			/* tp_print */
	...
	(cmpfunc)int_compare,			/* tp_compare */
	(reprfunc)int_repr,			/* tp_repr */
	&int_as_number,				/* tp_as_number */
	...
	(hashfunc)int_hash,			/* tp_hash */
	...
    (reprfunc)int_repr,			/* tp_str */
	PyObject_GenericGetAttr,		/* tp_getattro */
	...
	Py_TPFLAGS_DEFAULT | Py_TPFLAGS_CHECKTYPES |
		Py_TPFLAGS_BASETYPE,		/* tp_flags */
	int_doc,				/* tp_doc */
	...
	int_methods,				/* tp_methods */
	...
	int_new,				/* tp_new */
	(freefunc)int_free,           		/* tp_free */
};

Let's first look at how two IntObjects are compared:

static int
int_compare(PyIntObject *v, PyIntObject *w)
{
	register long i = v->ob_ival;
	register long j = w->ob_ival;
	return (i < j) ? -1 : (i > j) ? 1 : 0;
}

It simply compares two long values. See, not that hard after all.

What we really want to focus on, though, is this: &int_as_number.

static PyNumberMethods int_as_number = {
	(binaryfunc)int_add,	/*nb_add*/
	(binaryfunc)int_sub,	/*nb_subtract*/
	(binaryfunc)int_mul,	/*nb_multiply*/
	(binaryfunc)int_classic_div, /*nb_divide*/
	(binaryfunc)int_mod,	/*nb_remainder*/
	(binaryfunc)int_divmod,	/*nb_divmod*/
	(ternaryfunc)int_pow,	/*nb_power*/
    ....(omitted)

These function pointers are the operations an integer object supports. Not every slot is implemented, but most of them are. Let's pick one at random:

static PyObject *
int_add(PyIntObject *v, PyIntObject *w)
{
	register long a, b, x;
	CONVERT_TO_LONG(v, a);
	CONVERT_TO_LONG(w, b);
	x = a + b;
	if ((x^a) >= 0 || (x^b) >= 0)
		return PyInt_FromLong(x);
	return PyLong_Type.tp_as_number->nb_add((PyObject *)v, (PyObject *)w);
}

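First the operands are normalized to long, then comes the actual addition, followed by a check for overflow: the sum is fine as long as it has the same sign as at least one of the operands. If there is no overflow, a new IntObject is returned directly; if there is, the work is handed over to LongObject, which we will leave aside for now. The interesting part is CONVERT_TO_LONG.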
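Before turning to that macro, the overflow test `(x^a) >= 0 || (x^b) >= 0` is worth a closer look. The block below is only a minimal sketch of the same idea, not CPython code: `checked_add` is a hypothetical helper name, and it does the addition in unsigned arithmetic because signed overflow is undefined behavior in C. It assumes a two's-complement platform where `long` and `unsigned long` have the same width.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of CPython. Stores a + b in *out and
   returns 1 when the sum fits in a long; returns 0 on overflow. */
static int
checked_add(long a, long b, long *out)
{
    /* Mask with only the sign bit set. */
    const unsigned long sign = ~(ULONG_MAX >> 1);
    unsigned long ua = (unsigned long)a;
    unsigned long ub = (unsigned long)b;
    /* Unsigned addition wraps around instead of being undefined. */
    unsigned long ux = ua + ub;

    assert(out != NULL);
    /* Overflow means the sign of the result differs from the sign of BOTH
       operands, which is the negation of "(x^a) >= 0 || (x^b) >= 0". */
    if (((ux ^ ua) & (ux ^ ub) & sign) != 0)
        return 0;
    *out = a + b;   /* safe: we just showed that the sum fits */
    return 1;
}
```

When `checked_add` returns 0, a caller would fall back to a wider representation, which is the role the `PyLong_Type` call plays in `int_add`.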

#define CONVERT_TO_LONG(obj, lng)		\
	if (PyInt_Check(obj)) {			\
		lng = PyInt_AS_LONG(obj);	\
	}					\
	else {					\
		Py_INCREF(Py_NotImplemented);	\
		return Py_NotImplemented;	\
	}

This is a macro, and it in turn uses another macro:

/* Macro, trading safety for speed */
#define PyInt_AS_LONG(op) (((PyIntObject *)(op))->ob_ival)

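The comment above this macro says it trades safety for speed. Safety here means type safety. intobject.c also has a function version of the conversion, PyInt_AsLong, which checks a lot more and is therefore safer. The price, of course, is speed.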
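The trade-off can be illustrated with a toy type. This is a sketch under stated assumptions, not the CPython implementation: `ToyInt`, `TOY_INT_AS_LONG`, and `toy_int_as_long` are hypothetical names, and the `kind` field merely stands in for `ob_type`. The macro trusts its argument blindly, while the function checks before it reads.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an object header; all names here are hypothetical. */
enum toy_kind { TOY_INT, TOY_OTHER };

typedef struct {
    enum toy_kind kind;   /* plays the role of ob_type */
    long ival;            /* plays the role of ob_ival */
} ToyInt;

/* Unchecked, in the spirit of PyInt_AS_LONG: one cast, one field read. */
#define TOY_INT_AS_LONG(op) (((ToyInt *)(op))->ival)

/* Checked, in the spirit of PyInt_AsLong: validates the argument first.
   Returns -1 and sets *err to 1 when op is not a ToyInt of kind TOY_INT. */
static long
toy_int_as_long(const void *op, int *err)
{
    const ToyInt *v = (const ToyInt *)op;

    assert(err != NULL);
    if (v == NULL || v->kind != TOY_INT) {
        *err = 1;
        return -1;
    }
    *err = 0;
    return v->ival;
}
```

The macro is appropriate only where the caller has already done the type check, which is exactly how CONVERT_TO_LONG uses PyInt_AS_LONG: after PyInt_Check has passed.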

Now look a little above int_as_number:

PyDoc_STRVAR(int_doc,
"int(x[, base]) -> integer\n\
\n\
Convert a string or number to an integer, if possible.  A floating point\n\
argument will be truncated towards zero (this does not include a string\n\
representation of a floating point number!)  When converting a string, use\n\
the optional base.  It is an error to supply a base when converting a\n\
non-string. If the argument is outside the integer range a long object\n\
will be returned instead.");

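Hey, isn't this exactly the documentation we see in the Python shell? So now we know: Python builds its documentation straight into the implementation of the language. The macro itself is defined in Python.h, the master header of the whole Python implementation: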

/* Define macros for inline documentation. */
#define PyDoc_VAR(name) static char name[]
#define PyDoc_STRVAR(name,str) PyDoc_VAR(name) = PyDoc_STR(str)
#ifdef WITH_DOC_STRINGS
#define PyDoc_STR(str) str
#else
#define PyDoc_STR(str) ""
#endif

Creating an IntObject

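Now let's see what the buffer pool that Python designed for integer objects looks like. We already know that Python keeps an object pool for small integers. That raises three questions: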

Which integers count as small?
How is the object pool implemented?
What about large integers?

We will answer these next. Before we start, let's first see what ways there are to create an IntObject.

In the header file we can find these declarations:

PyAPI_FUNC(PyObject *) PyInt_FromString(char*, char**, int);
#ifdef Py_USING_UNICODE
PyAPI_FUNC(PyObject *) PyInt_FromUnicode(Py_UNICODE*, Py_ssize_t, int);
#endif
PyAPI_FUNC(PyObject *) PyInt_FromLong(long);
PyAPI_FUNC(PyObject *) PyInt_FromSize_t(size_t);
PyAPI_FUNC(PyObject *) PyInt_FromSsize_t(Py_ssize_t);

Their implementations can of course be found in the .c file, but they only make sense once we know how Python's integer objects are laid out in memory.

Now let's answer the three questions above, starting with the first.

Small integers

Which integers count as small? That depends on the use case. In Python's world the default small-integer range is [-5, 257). Since use cases differ, the range can be changed, right here (intobject.c):

#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS		257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS		5
#endif
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
/* References to small integers are saved in this array so that they
   can be shared.
   The integers that are saved are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyIntObject *small_ints[NSMALLNEGINTS + NSMALLPOSINTS];
#endif

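In other words, by changing the values of these two macros you can define your own small-integer range. As long as the #if condition holds, the object pool is enabled, and the pool is simply this static array of PyIntObject pointers.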
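To make the range and the array layout concrete, here is a small sketch of the index arithmetic. The two macros carry the default values quoted above; `small_int_slot` and `small_int_value` are hypothetical helper names that do not exist in CPython.

```c
#include <assert.h>

#define NSMALLPOSINTS 257
#define NSMALLNEGINTS 5
#define NSMALLINTS (NSMALLNEGINTS + NSMALLPOSINTS)   /* 262 slots by default */

/* Hypothetical helper: slot of ival in a small_ints-style array, or -1
   when ival lies outside [-NSMALLNEGINTS, NSMALLPOSINTS). */
static int
small_int_slot(long ival)
{
    if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS)
        return (int)(ival + NSMALLNEGINTS);
    return -1;
}

/* Hypothetical inverse: the integer value stored in a given slot. */
static long
small_int_value(int slot)
{
    assert(0 <= slot && slot < NSMALLINTS);
    return (long)slot - NSMALLNEGINTS;
}
```

Shifting by NSMALLNEGINTS is what lets negative values index a plain C array: -5 lands in slot 0 and 256 in slot 261.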

Large integers

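So much for the range of small integers, but that does not mean large integers are rarely used. What about them? After weighing space against time, the Python developers' answer is to provide a separate region of memory for the integers that fall outside the small-integer range defined above. That region is a new structure, _intblock: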

#define BLOCK_SIZE	1000	/* 1K less typical malloc overhead */
#define BHEAD_SIZE	8	/* Enough for a 64-bit pointer */
#define N_INTOBJECTS	((BLOCK_SIZE - BHEAD_SIZE) / sizeof(PyIntObject))

struct _intblock {
	struct _intblock *next;
	PyIntObject objects[N_INTOBJECTS];
};

typedef struct _intblock PyIntBlock;

static PyIntBlock *block_list = NULL;
static PyIntObject *free_list = NULL;

A small question: why isn't this written in one go? Wouldn't a direct typedef do? Probably just a matter of code style.

Of course, the size of this space can be tuned as well: change the macros and recompile.

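block_list is a singly linked list of the blocks that hold integer objects, while free_list points to the head of the chain of free slots and moves as the program runs. At this point the title of this subsection deserves a change: "general-purpose integers" would be more accurate.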
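How many integers fit in one block depends on the platform, since N_INTOBJECTS is derived from sizeof(PyIntObject). The sketch below repeats the arithmetic with a hypothetical `ToyIntObject` that has the same three fields (reference count, type pointer, value); all `toy_` names are assumptions made for illustration. On a typical 64-bit build where the struct takes 24 bytes this gives 41 objects per block, and with a 12-byte struct on a 32-bit build it gives 82.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 1000   /* same values as in intobject.c */
#define BHEAD_SIZE 8

/* Hypothetical stand-in with the same three fields as PyIntObject. */
typedef struct {
    ptrdiff_t refcnt;   /* ob_refcnt */
    void *type;         /* ob_type */
    long ival;          /* ob_ival */
} ToyIntObject;

#define N_TOY_INTOBJECTS ((BLOCK_SIZE - BHEAD_SIZE) / sizeof(ToyIntObject))

typedef struct toy_intblock {
    struct toy_intblock *next;
    ToyIntObject objects[N_TOY_INTOBJECTS];
} ToyIntBlock;

/* Number of integer slots in one block on the current platform. */
static size_t
toy_objects_per_block(void)
{
    /* The header budget only works if a pointer fits in BHEAD_SIZE bytes. */
    assert(sizeof(struct toy_intblock *) <= BHEAD_SIZE);
    return N_TOY_INTOBJECTS;
}

/* Bytes of the 1000-byte budget that one block leaves unused. */
static size_t
toy_block_slack(void)
{
    assert(sizeof(ToyIntBlock) <= BLOCK_SIZE);
    return BLOCK_SIZE - sizeof(ToyIntBlock);
}
```

Keeping a block just under 1000 bytes matches the source comment "1K less typical malloc overhead": the request plus the allocator's own bookkeeping is meant to stay within about 1 KB.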

The creation process

Right, now let's look at the creation function:

PyObject *
PyInt_FromLong(long ival)
{
	register PyIntObject *v;
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
	if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) {
		v = small_ints[ival + NSMALLNEGINTS];
		Py_INCREF(v);
#ifdef COUNT_ALLOCS
		if (ival >= 0)
			quick_int_allocs++;
		else
			quick_neg_int_allocs++;
#endif
		return (PyObject *) v;
	}
#endif
	/* ################# my own divider ################# */
	if (free_list == NULL) {
		if ((free_list = fill_free_list()) == NULL)
			return NULL;
	}
	/* Inline PyObject_New */
	v = free_list;
	free_list = (PyIntObject *)v->ob_type;
	PyObject_INIT(v, &PyInt_Type);
	v->ob_ival = ival;
	return (PyObject *) v;
}

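As we can see, if the small-integer pool is enabled and the requested integer lies inside it, the object is simply taken from the pool. That part is easy. If the value is not in the pool, or the pool is not enabled, execution continues with the rest of the function. Let's go through it: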

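If nothing has been allocated yet, an int block is created first. This happens not only on the very first fill_free_list call: whenever the free memory (for int blocks) runs out, free_list becomes NULL again, and the next call to fill_free_list allocates another block.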

The block is created like this:

static PyIntObject *
fill_free_list(void)
{
	PyIntObject *p, *q;
	/* Python's object allocator isn't appropriate for large blocks. */
	p = (PyIntObject *) PyMem_MALLOC(sizeof(PyIntBlock));
	if (p == NULL)
		return (PyIntObject *) PyErr_NoMemory();
	((PyIntBlock *)p)->next = block_list;
	block_list = (PyIntBlock *)p;
	/* Link the int objects together, from rear to front, then return
	   the address of the last int object in the block. */
	p = &((PyIntBlock *)p)->objects[0];
	q = p + N_INTOBJECTS;
	while (--q > p)
		q->ob_type = (struct _typeobject *)(q-1);
	q->ob_type = NULL;
	return p + N_INTOBJECTS - 1;
}

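First the memory for a whole general-purpose integer block is allocated; if malloc fails, an error is reported. If it succeeds we carry on. Skip the next two lines for now and look at what follows the comment. p is set to the first element of the PyIntObject array, objects[0], and the other pointer q is set one past the last element of the array. Then q walks from back to front, and at each step the ob_type of the object q points to is set to the address of the object just before it, until q meets p, that is, until both point to the first element. The ob_type of that first object is then set to NULL. That completes the setup, and the function returns the address of the last element in the block.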

At this point you may be a little puzzled, especially by the ob_type part. Why do it this way? Don't worry, read on and you will see how clever this piece of code is.

/* Inline PyObject_New */
v = free_list;
free_list = (PyIntObject *)v->ob_type;
PyObject_INIT(v, &PyInt_Type);
v->ob_ival = ival;
return (PyObject *) v;

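As described when the object pool was introduced, free_list acts as the pool's cursor, so every time an integer is handed out this pointer has to move. How? The code above shows it: just assign the object's ob_type to free_list.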

Clever, isn't it? Through the ob_type field, every element of the array is neatly strung together!
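The trick of reusing a field of a free slot as the link to the next free slot can be tried in isolation. What follows is a minimal sketch, not the CPython code: the array has only four slots, `ToyInt` and every `toy_` name are hypothetical, and a `void *type` field plays the part of `ob_type`.

```c
#include <assert.h>
#include <stddef.h>

#define N_TOY_INTS 4   /* tiny on purpose; CPython derives its count from BLOCK_SIZE */

/* Hypothetical stand-in for PyIntObject: `type` doubles as the free-list link. */
typedef struct {
    void *type;
    long ival;
} ToyInt;

static ToyInt toy_objects[N_TOY_INTS];
static ToyInt *toy_free_list = NULL;
static int toy_int_type;   /* dummy "type object"; only its address matters */

/* Link the slots from rear to front, the way fill_free_list does, and
   leave toy_free_list pointing at the last slot. */
static void
toy_thread_slots(void)
{
    ToyInt *p = &toy_objects[0];
    ToyInt *q = p + N_TOY_INTS;

    while (--q > p)
        q->type = (void *)(q - 1);   /* slot i links to slot i - 1 */
    assert(q == p);                  /* the two pointers have met */
    q->type = NULL;                  /* slot 0 ends the chain */
    toy_free_list = p + N_TOY_INTS - 1;
}

/* Pop the head of the free list and turn it into a live object. */
static ToyInt *
toy_take(long ival)
{
    ToyInt *v = toy_free_list;

    if (v == NULL)
        return NULL;                     /* exhausted; CPython would add a block here */
    toy_free_list = (ToyInt *)v->type;   /* follow the link stored in the slot */
    v->type = &toy_int_type;             /* the field returns to its real job */
    v->ival = ival;
    return v;
}
```

The design costs no extra memory: a free slot has no use for its type pointer, so the link lives there until the slot is handed out.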

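Now imagine the pool keeps being used until free_list points to the very front, the objects[0] from before. The ob_type there is NULL! So once that last slot is handed out, free_list becomes NULL and the pool is considered full. What then? A new general-purpose integer pool is set up, of course.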

And what happens to the old pool? That is what the two lines we skipped earlier are for:

	((PyIntBlock *)p)->next = block_list;
	block_list = (PyIntBlock *)p;

They link the old integer pool behind the new one.
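Those two lines are an ordinary prepend to a singly linked list. Here is a minimal sketch with hypothetical names (`ToyBlock`, `toy_add_block`, and so on) and a placeholder payload in place of the PyIntObject array; it uses plain malloc where CPython uses PyMem_MALLOC.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for PyIntBlock. */
typedef struct toy_block {
    struct toy_block *next;   /* the older block, or NULL for the first one */
    long slots[4];            /* placeholder payload */
} ToyBlock;

static ToyBlock *toy_block_list = NULL;

/* Allocate a block and make it the new head; the old head hangs behind it. */
static ToyBlock *
toy_add_block(void)
{
    ToyBlock *b = (ToyBlock *)malloc(sizeof(ToyBlock));

    if (b == NULL)
        return NULL;
    b->next = toy_block_list;
    toy_block_list = b;
    return b;
}

/* Walk the chain from newest to oldest and count the blocks. */
static size_t
toy_count_blocks(void)
{
    size_t n = 0;
    const ToyBlock *b;

    for (b = toy_block_list; b != NULL; b = b->next)
        n++;
    return n;
}

/* Release every block, newest first. */
static void
toy_free_blocks(void)
{
    while (toy_block_list != NULL) {
        ToyBlock *older = toy_block_list->next;
        free(toy_block_list);
        toy_block_list = older;
    }
    assert(toy_count_blocks() == 0);
}
```

Keeping every block reachable from one head pointer is what makes it possible to visit all blocks later, as `toy_free_blocks` does here.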

Using the object pool

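Good, now we know how integer objects are created. Next, consider this: suppose there are two pools, the first one full and the second with room to spare. Now an IntObject in the first pool is deleted, which leaves a free slot there. How do we know about it, and how does the next creation get to use that slot? Let's look at the destructor: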

static void
int_dealloc(PyIntObject *v)
{
	if (PyInt_CheckExact(v)) {
		v->ob_type = (struct _typeobject *)free_list;
		free_list = v;
	}
	else
		v->ob_type->tp_free((PyObject *)v);
}

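The check macro here simply tests whether the ob_type of the object passed in is PyInt_Type; never mind that for now. Look at what follows: Python links the deleted object onto the list that is chained together through ob_type, and then moves free_list, the pointer to the next free slot, onto it. Perfectly natural!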

If the object is not exactly an int (for example, an instance of an int subclass), the tp_free of its type is called instead.
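The effect of int_dealloc on later allocations can be seen with a three-slot pool. This is a sketch under assumptions, not the real code: `ToyInt`, `toy_pool`, `toy_take`, and `toy_recycle` are hypothetical names, and the subclass branch is left out. A recycled slot becomes the head of the free list, so it is the first one to be handed out again, no matter which block it lives in.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_SIZE 3

/* Hypothetical stand-in for PyIntObject; `type` doubles as the free link. */
typedef struct {
    void *type;
    long ival;
} ToyInt;

static ToyInt toy_pool[TOY_POOL_SIZE];
static ToyInt *toy_free_list = NULL;
static int toy_int_type;   /* dummy "type object"; only its address matters */

/* Push a slot onto the free list, as the PyInt_CheckExact branch of
   int_dealloc does. */
static void
toy_recycle(ToyInt *v)
{
    assert(v != NULL);
    v->type = (void *)toy_free_list;   /* old head goes behind the slot */
    toy_free_list = v;                 /* the slot becomes the new head */
}

/* Put every slot on the free list; the last slot ends up at the head. */
static void
toy_pool_init(void)
{
    size_t i;

    toy_free_list = NULL;
    for (i = 0; i < TOY_POOL_SIZE; i++)
        toy_recycle(&toy_pool[i]);
}

/* Pop the head of the free list, or return NULL when nothing is free. */
static ToyInt *
toy_take(long ival)
{
    ToyInt *v = toy_free_list;

    if (v == NULL)
        return NULL;
    toy_free_list = (ToyInt *)v->type;
    v->type = &toy_int_type;
    v->ival = ival;
    return v;
}
```

Because both operations touch only the head of the list, allocating and freeing an integer each cost a couple of pointer assignments and never go through malloc or free.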

Initializing the small-integer pool

How is the small-integer pool initialized? The answer is in the init function:

int
_PyInt_Init(void)
{
	PyIntObject *v;
	int ival;
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
	for (ival = -NSMALLNEGINTS; ival < NSMALLPOSINTS; ival++) {
              if (!free_list && (free_list = fill_free_list()) == NULL)
			return 0;
		/* PyObject_New is inlined */
		v = free_list;
		free_list = (PyIntObject *)v->ob_type;
		PyObject_INIT(v, &PyInt_Type);
		v->ob_ival = ival;
		small_ints[ival + NSMALLNEGINTS] = v;
	}
#endif
	return 1;
}

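As you can see, the process is much like taking an object from the general-purpose pool above, only this time with concrete values that are then recorded in small_ints.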
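The visible consequence is that every request for a small value returns the very same object, while larger values get a fresh one each time. The sketch below imitates that behavior under simplifying assumptions: all `toy_` names are hypothetical, small integers live in their own static array rather than in int blocks, and large values come from a tiny fixed-size array instead of a free list.

```c
#include <assert.h>
#include <stddef.h>

#define NSMALLPOSINTS 257
#define NSMALLNEGINTS 5
#define N_SMALL (NSMALLNEGINTS + NSMALLPOSINTS)
#define N_LARGE 8   /* arbitrary capacity for this sketch */

typedef struct {
    long refcnt;
    long ival;
} ToyInt;

static ToyInt toy_small_storage[N_SMALL];
static ToyInt *toy_small_ints[N_SMALL];
static ToyInt toy_large_storage[N_LARGE];
static size_t toy_large_used = 0;

/* Create every small integer once, up front, as _PyInt_Init does. */
static void
toy_int_init(void)
{
    long ival;

    for (ival = -NSMALLNEGINTS; ival < NSMALLPOSINTS; ival++) {
        ToyInt *v = &toy_small_storage[ival + NSMALLNEGINTS];
        v->refcnt = 1;   /* the cache itself keeps the object alive */
        v->ival = ival;
        toy_small_ints[ival + NSMALLNEGINTS] = v;
    }
}

/* Shared object for small values, a new object for everything else.
   Returns NULL when the toy storage for large values is used up. */
static ToyInt *
toy_from_long(long ival)
{
    ToyInt *v;

    if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) {
        v = toy_small_ints[ival + NSMALLNEGINTS];
        assert(v != NULL);   /* toy_int_init must have run first */
        v->refcnt++;
        return v;
    }
    if (toy_large_used == N_LARGE)
        return NULL;
    v = &toy_large_storage[toy_large_used++];
    v->refcnt = 1;
    v->ival = ival;
    return v;
}
```

Comparing the returned pointers in C corresponds to comparing object identity with `is` at the Python level.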

Testing our conclusions about Python integer objects

After all this, is it really how things work? Let's run an actual test: modify the source, recompile, and see.

The main target of the change is the print function, which now outputs more information so that we can see what happens underneath:

>>> a = int(-1234)
>>> a
Address of -1234: 0x14f1408
Address of free_list: 0x14f1438

>>> b = int(-123)
>>> b
Address of -123: 0x14f1438
Address of free_list: 0x14f1420

>>> del b
>>> a
Address of -1234: 0x14f1408
Address of free_list: 0x14f1438

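This is what we get after modifying and rebuilding the source, and it matches what we read: b is placed at the address free_list pointed to right after a was printed, and once b is deleted the free pointer points back at the slot b left behind.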