从源码角度看traces.txt是如何生成的

简介

traces.txt 位于安卓系统下/data/anr目录下,当系统中有应用出现ANR时,framework会在弹出Dialog的同时dump出当前各线程的堆栈情况,方便开发者分析出ANR的root cause。

ANR是Application Not Responsing 的简称,简而言之,就是安卓系统内置提示用户应用界面没有反应的机制,是用来避免应用界面一直卡顿,增加系统用户友好度的一种方式。

造成ANR的原因很多,都是因为在主线程执行任务太久阻塞了界面更新导致的,主要有以下几类:

  1. Broadcast Timeout: 前台广播执行超过10s, 后台广播执行超过60s (要注意的是,只有串行的广播才有超时机制,并行的广播并不会超时,也就是说,如果广播是动态注册的,直接调用sendBroadcast触发,如果主线程Looper中排在后面的Message不会触发超时机制,那么即时这个广播是前台广播,系统也永远不会弹出框提示用户超时了)
  2. Service Timeout: 前台服务之星超过20s, 后台服务之星超过200s
  3. Provider Timeout: 内容提供者,publish超过10s
  4. Input Timeout: 按键触摸事件分发超过5s

能够造成ANR的前提是任务是在主线程上执行的,执行什么样的任务主要有以下几点:

  1. 执行耗时任务过久,如文件读取或存储,网络访问获取文件太耗时
  2. 线程被阻塞过久,或者干脆出现了死锁
  3. 线程饥饿,如Binder线程总共16个,Binder主线程占有一个,剩余的15个工作线程都被占满
  4. CPU饥饿,负载值过大,虽然代码正常额但任务一直没有来得及执行

那么回到traces.txt文件,它到底包含了什么信息可以帮助开发者找到ANR问题的根源呢。

在这篇文章,我先从一份traces.txt的日志实例开始解析,然后再通过追踪源码来解释traces.txt是如何生成的。

我是基于Android 5.0原生代码进行分析的,最新的一些代码逻辑放到了AppErrors类中了,有些小差别,但是大体的逻辑和步骤没有变化,不会影响阅读。

日志实例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
// 触发ANR的应用pid及触发时间
----- pid 30307 at 2017-09-03 14:51:14 -----
Cmd line: com.example.androidyue.bitmapdemo
JNI: CheckJNI is off; workarounds are off; pins=0; globals=272
DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)
// JAVA 态线程状态
// 线程名称; 优先级; 线程id; 线程状态
"main" prio=5 tid=1 TIMED_WAIT
// 线程组; 线程suspend计数; 线程debug suspend计数; 同辈线程对象; 当前线程对象
| group="main" sCount=1 dsCount=0 obj=0x416eaf18 self=0x416d8650
// 线程id; 进程优先级; 调度者优先级; 调度者组名; 线程缓存句柄
| sysTid=30307 nice=0 sched=0/0 cgrp=apps handle=1074565528
// native线程状态; 调度者状态; 执行用户代码的总时间; 执行系统代码的总时间; 任务cpu使用数量
| state=S schedstat=( 0 0 0 ) utm=5 stm=4 core=3
// 线程调用栈
at java.lang.VMThread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:1044)
at java.lang.Thread.sleep(Thread.java:1026)
at com.example.androidyue.bitmapdemo.MainActivity$1.run(MainActivity.java:27)
at android.app.Activity.runOnUiThread(Activity.java:4794)
at com.example.androidyue.bitmapdemo.MainActivity.onResume(MainActivity.java:33)
at android.app.Instrumentation.callActivityOnResume(Instrumentation.java:1282)
at android.app.Activity.performResume(Activity.java:5405)
// Native态线程状态
"Binder_5" prio=5 tid=32 NATIVE
| group="main" sCount=1 dsCount=0 obj=0x41dd6570 self=0x678fa458
| sysTid=29473 nice=0 sched=0/0 cgrp=apps handle=1732666432
| state=S schedstat=( 578000 10600000 3 ) utm=0 stm=0 core=0
#00 pc 00016fe4 /system/lib/libc.so (__ioctl+8)
#01 pc 0002a97d /system/lib/libc.so (ioctl+16)
#02 pc 00016ba1 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
#03 pc 00017363 /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+154)
#04 pc 0001b15d /system/lib/libbinder.so
#05 pc 00011267 /system/lib/libutils.so (android::Thread::_threadLoop(void*)+114)
#06 pc 0004679f /system/lib/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+66)
#07 pc 00010dcd /system/lib/libutils.so
#08 pc 0000e3d8 /system/lib/libc.so (__thread_entry+72)
#09 pc 0000dac4 /system/lib/libc.so (pthread_create+160)
at dalvik.system.NativeStart.run(Native Method)

关键词对照

关键词 解释
prio 优先级
tid 线程id
group 组名
sCount 挂起次数
dsCount 调试下 挂起计数
obj 同辈线程对象
self 当前线程对象
sysTid 线程id
nice 进程优先级
sched 调度者优先级
cgrp 调度者组名
handle 线程缓存句柄
state native线程状态
schedstat 调度者状态
utm 用户态CPU时间
stm 内核态CPU时间
core 运行所在核

线程状态对照

JAVA状态 CPP状态 解释
TERMINATED ZOMBIE 线程死亡
RUNNABLE RUNNING/RUNNABLE 可运行或正在运行
TIMED_WAITING TIMED_WAIT 执行了带有超时参数的wait, sleep或join函数
BLOCKED MONITOR 线程阻塞,等待获取对象锁
WAITING WAIT 执行了无超时参数的wait
NEW INITIALIZING 新建,正在初始化中
NEW STARTING 新建,正在启动
RUNNABLE NATIVE 正在执行JNI本地函数
WAITING VMWAIT 正在等待VM资源
RUNNABLE SUSPENDED 线程暂停,通常由于GC或者DEBUG被挂起

源码分析 (虚拟机)

我将从代码的最底层开始分析,再慢慢到触发点的顶端,先从cpp开始

Thread::DumpState

thread.cc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
void Thread::DumpState(std::ostream& os, const Thread* thread, pid_t tid) {
// 组名
std::string group_name;
// 优先级
int priority;
// 是否是daemon线程
bool is_daemon = false;
// 拿到当前线程对象指针
Thread* self = Thread::Current();
if (thread != nullptr) {
...
}
if (gAborting == 0 && self != nullptr && thread != nullptr && thread->tlsPtr_.opeer != nullptr) {
ScopedObjectAccessUnchecked soa(self);
priority = soa.DecodeField(WellKnownClasses::java_lang_Thread_priority)
->GetInt(thread->tlsPtr_.opeer);
is_daemon = soa.DecodeField(WellKnownClasses::java_lang_Thread_daemon)
->GetBoolean(thread->tlsPtr_.opeer);
...
if (thread_group != nullptr) {
...
group_name = (group_name_string != nullptr) ? group_name_string->ToModifiedUtf8() : "<null>";
}
} else {
priority = GetNativePriority();
}
std::string scheduler_group_name(GetSchedulerGroupName(tid));
if (scheduler_group_name.empty()) {
scheduler_group_name = "default";
}
// 这里开始输出线程状态
if (thread != nullptr) {
// 线程名输出
os << '"' << *thread->tlsPtr_.name << '"';
if (is_daemon) {
os << " daemon";
}
// 优先级; ID; 状态
os << " prio=" << priority
<< " tid=" << thread->GetThreadId()
<< " " << thread->GetState();
if (thread->IsStillStarting()) {
os << " (still starting up)";
}
os << "\n";
} else {
os << '"' << ::art::GetThreadName(tid) << '"'
<< " prio=" << priority
<< " (not attached)\n";
}
// 线程组; 线程suspend计数; 线程debug suspend计数; 同辈线程对象; 当前线程对象
if (thread != nullptr) {
MutexLock mu(self, *Locks::thread_suspend_count_lock_);
os << " | group=\"" << group_name << "\""
<< " sCount=" << thread->tls32_.suspend_count
<< " dsCount=" << thread->tls32_.debug_suspend_count
<< " obj=" << reinterpret_cast<void*>(thread->tlsPtr_.opeer)
<< " self=" << reinterpret_cast<const void*>(thread) << "\n";
}
// 线程id; 进程优先级; 调度者组名
os << " | sysTid=" << tid
<< " nice=" << getpriority(PRIO_PROCESS, tid)
<< " cgrp=" << scheduler_group_name;
// 调度者优先级; 缓存句柄
if (thread != nullptr) {
int policy;
sched_param sp;
CHECK_PTHREAD_CALL(pthread_getschedparam, (thread->tlsPtr_.pthread_self, &policy, &sp),
__FUNCTION__);
os << " sched=" << policy << "/" << sp.sched_priority
<< " handle=" << reinterpret_cast<void*>(thread->tlsPtr_.pthread_self);
}
os << "\n";
// Grab the scheduler stats for this thread.
std::string scheduler_stats;
if (ReadFileToString(StringPrintf("/proc/self/task/%d/schedstat", tid), &scheduler_stats)) {
scheduler_stats.resize(scheduler_stats.size() - 1); // Lose the trailing '\n'.
} else {
scheduler_stats = "0 0 0";
}
char native_thread_state = '?';
int utime = 0;
int stime = 0;
int task_cpu = 0;
GetTaskStats(tid, &native_thread_state, &utime, &stime, &task_cpu);
// native线程状态; 调度者状态; 执行用户代码的总时间; 执行系统代码的总时间; 任务cpu使用数量
os << " | state=" << native_thread_state
<< " schedstat=( " << scheduler_stats << " )"
<< " utm=" << utime
<< " stm=" << stime
<< " core=" << task_cpu
<< " HZ=" << sysconf(_SC_CLK_TCK) << "\n";
if (thread != nullptr) {
// dump 线程当前栈
os << " | stack=" << reinterpret_cast<void*>(thread->tlsPtr_.stack_begin) << "-"
<< reinterpret_cast<void*>(thread->tlsPtr_.stack_end) << " stackSize="
<< PrettySize(thread->tlsPtr_.stack_size) << "\n";
// dump 线程锁信息
os << " | held mutexes=";
for (size_t i = 0; i < kLockLevelCount; ++i) {
if (i != kMonitorLock) {
BaseMutex* mutex = thread->GetHeldMutex(static_cast<LockLevel>(i));
if (mutex != nullptr) {
os << " \"" << mutex->GetName() << "\"";
if (mutex->IsReaderWriterMutex()) {
ReaderWriterMutex* rw_mutex = down_cast<ReaderWriterMutex*>(mutex);
if (rw_mutex->GetExclusiveOwnerTid() == static_cast<uint64_t>(tid)) {
os << "(exclusive held)";
} else {
os << "(shared held)";
}
}
}
}
}
os << "\n";
}
}

GetTaskStats

utils.cc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void GetTaskStats(pid_t tid, char* state, int* utime, int* stime, int* task_cpu) {
*utime = *stime = *task_cpu = 0;
std::string stats;
// 从stat节点获取数据
if (!ReadFileToString(StringPrintf("/proc/self/task/%d/stat", tid), &stats)) {
return;
}
// Skip the command, which may contain spaces.
stats = stats.substr(stats.find(')') + 2);
// Extract the three fields we care about.
std::vector<std::string> fields;
Split(stats, ' ', &fields);
*state = fields[0][0];
*utime = strtoull(fields[11].c_str(), nullptr, 10);
*stime = strtoull(fields[12].c_str(), nullptr, 10);
*task_cpu = strtoull(fields[36].c_str(), nullptr, 10);
}

DumpUnattachedThread

thread_list.cc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
// 读取该线程中的所有状态信息:JAVA态,Native态,Kernel态
static void DumpUnattachedThread(std::ostream& os, pid_t tid) NO_THREAD_SAFETY_ANALYSIS {
// [接上]
Thread::DumpState(os, nullptr, tid);
DumpKernelStack(os, tid, " kernel: ", false);
if (false) {
DumpNativeStack(os, tid, " native: ");
}
os << "\n";
}
void ThreadList::DumpUnattachedThreads(std::ostream& os) {
DIR* d = opendir("/proc/self/task");
if (!d) {
return;
}
Thread* self = Thread::Current();
dirent* e;
// 获取当前进程中所有线程,dump这些线程的信息
while ((e = readdir(d)) != nullptr) {
char* end;
pid_t tid = strtol(e->d_name, &end, 10);
if (!*end) {
bool contains;
{
MutexLock mu(self, *Locks::thread_list_lock_);
contains = Contains(tid);
}
if (!contains) {
// [接上]
DumpUnattachedThread(os, tid);
}
}
}
closedir(d);
}
bool ThreadList::Contains(Thread* thread) {
return find(list_.begin(), list_.end(), thread) != list_.end();
}

DumpForSigQuit

thread_list.cc

1
2
3
4
void ThreadList::DumpForSigQuit(std::ostream& os) {
...
DumpUnattachedThreads(os);
}

runtime.cc

1
2
3
4
5
6
7
8
9
10
11
12
13
void Runtime::DumpForSigQuit(std::ostream& os) {
GetClassLinker()->DumpForSigQuit(os);
GetInternTable()->DumpForSigQuit(os);
GetJavaVM()->DumpForSigQuit(os);
GetHeap()->DumpForSigQuit(os);
TrackedAllocators::Dump(os);
os << "\n";
// [接上]
// dump 线程列表
thread_list_->DumpForSigQuit(os);
BaseMutex::DumpAll(os);
}

HandleSigQuit

singal_catcher.cc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
void SignalCatcher::HandleSigQuit() {
Runtime* runtime = Runtime::Current();
std::ostringstream os;
// 输出进程id以及当前dump时间
os << "\n"
<< "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";
DumpCmdLine(os);
// 输出系统信息
std::string fingerprint = runtime->GetFingerprint();
os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";
os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";
// [接上]
runtime->DumpForSigQuit(os);
...
os << "----- end " << getpid() << " -----\n";
Output(os.str());
}

SignalCatcher::Run

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
void* SignalCatcher::Run(void* arg) {
...
// Set up mask with signals we want to handle.
SignalSet signals;
signals.Add(SIGQUIT);
signals.Add(SIGUSR1);
while (true) {
// 等待信号
int signal_number = signal_catcher->WaitForSignal(self, signals);
...
switch (signal_number) {
case SIGQUIT:
// [接上]
signal_catcher->HandleSigQuit();
break;
case SIGUSR1:
signal_catcher->HandleSigUsr1();
break;
default:
LOG(ERROR) << "Unexpected signal %d" << signal_number;
break;
}
}
}
int SignalCatcher::WaitForSignal(Thread* self, SignalSet& signals) {
...
int signal_number = signals.Wait();
...
return signal_number;
}

源码分析 (framework)

可见,android framework通过发送信号3来通知虚拟机来进行dump到traces.txt文件中的

dumpStackTraces

ActivityManagerService.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
private static void dumpStackTraces(String tracesPath, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
// 当trace写完后,监听并触发onEvent方法
FileObserver observer = new FileObserver(tracesPath, FileObserver.CLOSE_WRITE) {
@Override
public synchronized void onEvent(int event, String path) {
notify();
}
};
try {
// 开始监听
observer.startWatching();
// 首先收集firstPids链表里的进程信息
if (firstPids != null) {
try {
int num = firstPids.size();
for (int i = 0; i < num; i++) {
synchronized (observer) {
// 发送信号3来触发
Process.sendSignal(firstPids.get(i), Process.SIGNAL_QUIT);
observer.wait(200); // Wait for write-close, give up after 200msec
}
}
} catch (InterruptedException e) {
Slog.wtf(TAG, e);
}
}
// 收集native进程信息
...
// 最后收集CPU信息
...
} finally {
observer.stopWatching();
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
// 获取trace文件路径
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
...
File tracesFile = new File(tracesPath);
try {
...
FileUtils.setPermissions(tracesDir.getPath(), 0775, -1, -1); // drwxrwxr-x
if (clearTraces && tracesFile.exists()) tracesFile.delete();
tracesFile.createNewFile();
FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1); // -rw-rw-rw-
} catch (IOException e) {
Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesPath, e);
return null;
}
// [接上]
dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
return tracesFile;
}

appNotResponding

ActivityManagerService.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
final void appNotResponding(ProcessRecord app, ActivityRecord activity,
ActivityRecord parent, boolean aboveSystem, final String annotation) {
ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);
...
final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
// [接上]
File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker, lastPids,
NATIVE_STACKS_OF_INTEREST);
...
Slog.e(TAG, info.toString());
if (tracesFile == null) {
// There is no trace file, so dump (only) the alleged culprit's threads to the log
Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
}
addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
cpuInfo, tracesFile, null);
...
}

触发ANR的方式上面我已经列举过了,这里我以触发Broadcast Timeout时进行分析

AppNotResponding.run

BroadcastQueue.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
private final class AppNotResponding implements Runnable {
private final ProcessRecord mApp;
private final String mAnnotation;
public AppNotResponding(ProcessRecord app, String annotation) {
mApp = app;
mAnnotation = annotation;
}
@Override
public void run() {
// [接上]
mService.appNotResponding(mApp, null, null, false, mAnnotation);
}
}

broadcastTimeoutLocked

BroadcastQueue.java

1
2
3
4
5
6
7
8
9
10
11
12
13
final void broadcastTimeoutLocked(boolean fromMsg) {
...
String anrMessage = null;
...
if (app != null) {
anrMessage = "Broadcast of " + r.intent.toString();
}
...
if (anrMessage != null) {
// [接上]
mHandler.post(new AppNotResponding(app, anrMessage));
}
}

BroadcastHandler.handleMessage

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
private final class BroadcastHandler extends Handler {
public BroadcastHandler(Looper looper) {
super(looper, null, true);
}
@Override
public void handleMessage(Message msg) {
switch (msg.what) {
...
case BROADCAST_TIMEOUT_MSG: {
synchronized (mService) {
// [接上]
broadcastTimeoutLocked(true);
}
} break;
...
}
}
};

setBroadcastTimeoutLocked

BroadcastQueue.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
final void setBroadcastTimeoutLocked(long timeoutTime) {
if (! mPendingBroadcastTimeoutMessage) {
Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
mHandler.sendMessageAtTime(msg, timeoutTime);
mPendingBroadcastTimeoutMessage = true;
}
}
final void processNextBroadcast(boolean fromMsg) {
if (! mPendingBroadcastTimeoutMessage) {
long timeoutTime = r.receiverTime + mTimeoutPeriod;
// [接上]
setBroadcastTimeoutLocked(timeoutTime);
}
}

总结

下面总结一下触发广播ANR后输出traces.txt的过程:

  1. 有序广播在注册后发送
  2. AMS设置广播超时时间,通过Handler进行延迟设定
  3. 广播超时触发超时机制
  4. Framework层发送信号给虚拟机
  5. 虚拟机dump出特定进程所有线程信息并存储到traces.txt文件中
扫码支持0.99元,您的支持将鼓励我继续创作!