top
last pid:  6009;  load averages: 88.80, 123.85, 115.24    up 1+16:53:21  23:11:53
12284 processes:185 running, 10860 sleeping
CPU states:  1.1% user, 20.1% nice, 62.1% system, 15.3% interrupt,  1.4% idle
Mem: 1903M Active, 236M Inact, 1350M Wired, 128M Cache, 199M Buf, 5232K Free
Swap: 2560M Total, 440M Used, 2119M Free, 17% Inuse, 672K In, 468K Out

  PID USERNAME PRI NICE   SIZE    RES STATE  C    TIME   WCPU    CPU COMMAND
 5721 in2       64    0 16056K 15128K CPU0   0    0:02  3.22%  2.78% top
  652 bbs       10    0 60476K 41228K nanslp 0  103:18  1.76%  1.76% shmctl
 5186 bbs        3    5 61352K 22760K ttyin  1    0:01  0.59%  0.59% mbbsd.0621
97697 bbs        2    5 61116K 36928K sbwait 1    0:01  0.49%  0.49% mbbsd.0620
95092 bbs        2    5 61152K  8620K sbwait 0    0:00  0.39%  0.39% mbbsd.0620
 5794 bbs        2    5 61152K 15972K sbwait 1    0:00  0.43%  0.34% mbbsd.0620
88708 bbs        2    5 61088K 40412K RUN    1    0:02  0.24%  0.24% mbbsd.0620
netstat
$ netstat 1
input (Total) output
packets errs bytes packets errs bytes colls
5347 0 324688 6257 0 1687630 0
5460 0 331432 6504 0 1896940 0
5457 0 331445 6290 0 1729601 0
5359 0 325050 6316 0 1772674 0
5455 0 333756 6336 0 1794385 0
4799 0 290265 5377 0 1046608 0
4665 0 285941 5475 0 1690104 0
4897 0 299218 5785 0 1831958 0
5019 0 305061 5902 0 1769572 0
5136 0 315167 5965 0 1627654 0
5653 0 343014 6495 0 1786860 0
5169 0 312786 6075 0 1648568 0
5393 0 326060 6311 0 1874674 0
5158 0 314924 5999 0 1717284 0
vmstat
$ vmstat 1
 procs      memory           page                   disks        faults          cpu
  r b w     avm    fre   flt re pi po    fr    sr da0 da1     in     sy    cs us sy id
422 1 0 1715816 130076 13880  3  2  2  1009  4525   0 250   6694  13782  3140 15 41 44
417 1 0 1705988 174636 65476 33 32 32 12550 77169  36 120  84951 153087 45415 23 77  0
450 0 0 1702680 167468 19499  5 26  0  3345     0  49  47  22624  43676 11282 15 85  0
381 0 0 1696276 154644 23424 21 19  0  5460     0  62  47  33782  75912 17758 19 81  0
445 2 0 1731508 146712 32664  8  5  0  3931     0  58  86  26547  44593 13597 21 78  0
461 0 0 1726036 137428 42695  0  0  0  4511     0  45 123  29500  47849 15487 20 79  1
528 0 0 1704276 195272 36165  6 11 32  7961 76695 104  68  49004  78044 25485 20 76  3
480 0 0 1729964 146220 90734  7 39  0 21379     0  47  78 128277 238502 69274 20 80  0
416 0 0 1707028 188244 87826 32 39 32 14980 76364 137 197  99309 165981 53412 21 78  0
483 6 0 1718108 171592 32796  1  4  0  6151     0  15  79  39720  57699 21506 20  8  0
systat -vmstat
20 users Load 83.10116.68113.86 Jun 21 23:22
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 880244 4364 1776508 7228 128372 count 2
All 3761504 11944 3684528 61196 pages 6
271 zfod Interrupts
Proc:r p d s w Csw Trp Sys Int Sof Flt 262 cow 10072 total
440 *** 4771 59631272110071 1630 59521383844 wire 134 ahc0 irq2
1944504 act 331 ahc1 irq9
64.3%Sys 15.5%Intr 0.5%User 19.8%Nice 0.0%Idl 252620 inact 9166 pcn0 irq11
| | | | | | | | | | 123140 cache 213 ahc2 irq16
================================++++++++-------- 5232 free sio0 irq4
daefr 100 clk irq0
Namei Name-cache Dir-cache 279 prcfr 128 rtc irq8
Calls hits % hits % react
14868 14584 98 66 0 pdwake
pdpgs
Disks da0 da1 da2 da3 da4 da5 da6 intrn
KB/t 3.80 3.89 3.64 3.66 3.41 3.24 4.50 204096 buf
tps 56 98 50 131 46 62 90 1472 dirtybuf
MB/s 0.21 0.37 0.18 0.47 0.15 0.20 0.40 255421 desiredvnodes
% busy 36 27 20 52 30 35 52 230966 numvnodes
190064 freevnodes
iostat
$ iostat 1
tty da0 da1 da2 cpu
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id
3 702 3.50 55 0.19 2.40 23 0.05 3.55 142 0.49 2 13 32 9 44
8 976 2.67 54 0.14 3.63 121 0.43 3.31 56 0.18 3 21 57 18 1
11 1399 4.26 270 1.12 4.85 694 3.29 3.82 409 1.53 0 18 64 13 5
3 306 2.98 8 0.02 5.58 5 0.02 3.19 4 0.01 1 22 62 15 0
3 1309 4.33 43 0.18 2.88 16 0.04 3.61 56 0.20 7 16 63 13 2
4 44 5.31 39 0.20 6.62 38 0.25 5.27 48 0.25 0 18 63 14 4
3 230 3.21 32 0.10 3.91 148 0.56 3.64 25 0.09 3 14 65 13 6
6 45 3.00 32 0.09 3.88 240 0.91 2.83 23 0.06 3 15 61 11 10
8 2018 2.97 29 0.08 4.10 93 0.37 3.79 66 0.24 0 21 58 14 7
9 2857 3.40 54 0.18 3.09 22 0.07 2.57 14 0.03 7 20 58 12 3
8 4594 3.00 27 0.08 3.39 23 0.08 3.17 24 0.07 4 16 62 14 3
32 1451 3.71 119 0.43 3.33 41 0.13 2.53 17 0.04 2 20 52 15 11
6 2005 3.70 210 0.76 2.80 15 0.04 3.54 55 0.19 3 19 58 15 5
14 369 5.94 16 0.09 2.90 20 0.06 2.81 27 0.07 7 19 50 14 9
17 1236 3.66 40 0.14 3.22 18 0.06 3.52 66 0.23 0 15 60 16 9
diskstat
$ diskstat
da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10
busy busy busy busy busy busy busy busy busy busy busy
0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
16.3% 59.4% 13.8% 21.8% 27.2% 54.4% 40.6% 35.6% 28.7% 33.6% 24.2%
59.3% 21.9% 17.5% 39.4% 25.8% 26.7% 60.8% 57.9% 21.4% 14.6% 33.6%
21.5% 5.9% 30.4% 27.4% 24.5% 51.9% 53.4% 38.2% 26.4% 25.9% 30.8%
22.8% 4.0% 13.9% 36.7% 38.1% 16.8% 30.2% 40.6% 22.8% 36.2% 32.7%
26.1% 45.3% 18.2% 20.7% 45.8% 34.9% 48.2% 62.5% 19.7% 19.7% 27.6%
31.9% 28.4% 20.4% 33.4% 24.9% 24.9% 32.9% 50.3% 35.4% 41.3% 42.3%
34.8% 27.9% 49.7% 16.9% 32.8% 44.8% 33.8% 40.3% 32.8% 38.3% 28.9%
27.4% 9.0% 19.4% 21.9% 27.9% 42.3% 40.8% 25.9% 33.8% 22.4% 42.8%
^C
--- diskstat statistics (8 samples)---
da7: 43.906700%
da6: 42.581387%
da5: 37.099586%
da10: 32.857498%
da4: 30.870256%
da0: 30.015738%
da9: 29.005894%
da8: 27.627535%
da3: 27.259638%
da1: 25.193283%
da2: 22.919621%
mrtg



rrdtool

The Big Picture
SCSI vs. IDE
               SCSI                         IDE
capacity       (9G, 18G,) 36G, 72G, 144G    (40G,) 80G, 120G, 160G, 200G, 250G
buffer         8MB, 16MB                    2MB, 8MB
rpm            (7200,) 10000, 15000         7200 (, 10000)
MTBF           1,200,000 hours              500,000 hours
transfer rate  U160: 160MBps                Serial ATA: 150MBps
               U320: 320MBps                ATA-100: 100MBps
#devices       up to 15 devices             up to 2 devices on one cable
price          36G/15k/8MB     5,500
               73G/15k/8MB    11,000        80G/7200/8MB   < 3,000
               147G/15k/8MB   22,500        160G/7200/8MB  < 3,500
                                            250G/7200/8MB  < 8,000
But I want my program to run faster on this box!
DEVICE_POLLING
SYNOPSIS
options DEVICE_POLLING
options HZ=1000
ACCEPT_FILTER_HTTP
SYNOPSIS
options ACCEPT_FILTER_HTTP
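The kernel option only makes the filter available; a server still has to ask for it on each listening socket. A minimal sketch of that step, assuming FreeBSD's `SO_ACCEPTFILTER` socket option and the `httpready` filter name (the filter can also be loaded as the `accf_http` module). The helper name `enable_http_accept_filter` is hypothetical, and on systems without `SO_ACCEPTFILTER` it simply fails.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: attach the "httpready" accept filter to a socket
 * that is already in the listen(2) state, so accept(2) only returns once
 * a complete HTTP request has arrived.
 * Returns 0 on success, -1 on error (errno is set). */
int enable_http_accept_filter(int listen_fd)
{
#ifdef SO_ACCEPTFILTER
    struct accept_filter_arg afa;

    memset(&afa, 0, sizeof(afa));
    strcpy(afa.af_name, "httpready");
    return setsockopt(listen_fd, SOL_SOCKET, SO_ACCEPTFILTER,
                      &afa, sizeof(afa));
#else
    /* Assumption: no accept filters on this platform. */
    (void)listen_fd;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

Call it right after `listen(2)`; if it fails, the server can keep running without the filter.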
I'll use the following program as a demonstration
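typedef double TYPE;
#define SIZE 50000
#define TIMES 20000
/* multiply each element by its right-hand neighbour */
void f(TYPE *array)
{
    int i;
    for( i = 0 ; i < (SIZE - 1) ; ++i )
        array[i] *= array[i + 1];
}
int main(int argc, char **argv)
{
    TYPE array[SIZE];
    int i;
    /* give the array defined contents before f() reads it */
    for( i = 0 ; i < SIZE ; ++i )
        array[i] = 1.0;
    for( i = 0 ; i < TIMES ; ++i )
        f(array);
    return 0;
}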
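To compare compiler options on this kernel, it helps to time it from inside the program as well as with time(1). A minimal sketch using `clock()` from the C library; the names `f_n`, `time_kernel` and `sink` are hypothetical, and the array is heap-allocated here so that the size can be a parameter.

```c
#include <stdlib.h>
#include <time.h>

typedef double TYPE;

/* Keeps the compiler from discarding the computation as dead code. */
static volatile TYPE sink;

/* Same kernel as f() in the demonstration program, with the length passed in. */
static void f_n(TYPE *array, int size)
{
    int i;
    for (i = 0; i < size - 1; ++i)
        array[i] *= array[i + 1];
}

/* Hypothetical helper: run the kernel `times` times over `size` elements
 * and return the CPU seconds used, or -1.0 if the allocation fails. */
double time_kernel(int size, int times)
{
    TYPE *array = malloc((size_t)size * sizeof(TYPE));
    clock_t start, stop;
    int i;

    if (array == NULL)
        return -1.0;
    for (i = 0; i < size; ++i)
        array[i] = 1.0;            /* 1.0 keeps the products from overflowing */

    start = clock();
    for (i = 0; i < times; ++i)
        f_n(array, size);
    stop = clock();

    sink = array[0];
    free(array);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

`time_kernel(50000, 20000)` corresponds to the SIZE and TIMES values above.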
-fomit-frame-pointer
SYNOPSIS
$ gcc -fomit-frame-pointer
f: |f:
pushl %ebp |
movl %esp, %ebp |
pushl %esi | pushl %esi
pushl %ebx | pushl %ebx
subl $4, %esp | subl $4, %esp
movl $0, -12(%ebp) | movl $0, (%esp)
.L2: |.L2:
cmpl $8, -12(%ebp) | cmpl $8, (%esp)
jle .L5 | jle .L5
jmp .L1 | jmp .L1
.L5: |.L5:
movl -12(%ebp), %eax | movl (%esp), %eax
leal 0(,%eax,8), %ebx| leal 0(,%eax,8), %ebx
movl 8(%ebp), %esi | movl 16(%esp), %esi
movl -12(%ebp), %eax | movl (%esp), %eax
leal 0(,%eax,8), %ecx| leal 0(,%eax,8), %ecx
movl 8(%ebp), %edx | movl 16(%esp), %edx
movl -12(%ebp), %eax | movl (%esp), %eax
branch prediction http://www.mbi.ufl.edu/papkelab/pics/branch.jpg
SYNOPSIS (in gcc 3.3)
$ gcc -fprofile-arcs
$ ./a.out
$ gcc -fbranch-probabilities
SYNOPSIS (in gcc 3.4 or icc)
$ gcc -fprofile-generate
$ ./a.out
$ gcc -fprofile-use
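Profile feedback tells the compiler which way each branch usually goes. When no profile run is practical, gcc also accepts the same kind of information by hand through `__builtin_expect`. A minimal sketch, assuming a gcc-compatible compiler; the `likely`/`unlikely` macros and the function `sum_valid` are hypothetical names for illustration.

```c
#include <stddef.h>

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
/* Other compilers: the hints become no-ops. */
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical example: sum the values, skipping negative entries that
 * the caller expects to be rare. */
long sum_valid(const int *values, size_t n)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (unlikely(values[i] < 0))
            continue;              /* rare path: skip the bad entry */
        sum += values[i];
    }
    return sum;
}
```

A hand-written hint is only a guess, so a measured profile is usually the safer source of branch probabilities.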
Architecture
%xmm
Architecture(cont.)
f: |f:
movl 4(%esp), %edx | movl 4(%esp), %edx
movl $0, %eax | movl $0, %eax
.L5: |.L5:
fldl (%edx,%eax,8) | movsd (%edx,%eax,8), %xmm0
fmull 8(%edx,%eax,8) | mulsd 8(%edx,%eax,8), %xmm0
fstpl (%edx,%eax,8) | movsd %xmm0, (%edx,%eax,8)
incl %eax | addl $1, %eax
cmpl $8, %eax | cmpl $8, %eax
jle .L5 | jle .L5
ret | ret
1 + 1 = 0
f: |f:
movl 4(%esp), %edx | movl 4(%esp), %edx
movl $0, %eax | movl $0, %eax
.L5: |.L5:
flds (%edx,%eax,4) | movss (%edx,%eax,4), %xmm0
fmuls 4(%edx,%eax,4) | mulss 4(%edx,%eax,4), %xmm0
fstps (%edx,%eax,4) | movss %xmm0, (%edx,%eax,4)
incl %eax | addl $1, %eax
cmpl $49998, %eax | cmpl $49998, %eax
jle .L5 | jle .L5
ret | ret
1 + 1 = 0 (cont.)
optimistic options:
-fno-cprop-registers (1.741) -fno-delayed-branch (1.4)
-fno-crossjumping (1.205) -fexpensive-optimizations (1.741)
-frerun-cse-after-loop (1.01) -frerun-loop-opt (1.79)
-fforce-mem (2.668) -ffloat-store (1.01)
-fnew-ra (1.254) -mno-align-stringops (1.01)
-ffinite-math-only (2.034)
pessimistic options:
-fno-defer-pop (-1.428) -fno-guess-branch-probability (-1.136)
-fno-loop-optimize (-1.185) -fgcse (-1.087)
-fschedule-insns (-2.16) -falign-labels (-1.282)
-fprefetch-loop-arrays (-1.624) -freduce-all-givs (-1.38)
-mfpmath=sse (-1.672) -fomit-frame-pointer (-1.282)
If you have Intel processors, why not buy a compiler from Intel?
Here's an example
..B1.3: # Preds ..B1.3 ..B1.2
movdqa (%esp,%edx,2), %xmm0 #17.4
pmullw %xmm0, %xmm0 #17.4
movdqa 16(%esp,%edx,2), %xmm1 #17.4
movdqa 32(%esp,%edx,2), %xmm2 #17.4
movdqa 48(%esp,%edx,2), %xmm3 #17.4
movdqa %xmm0, (%esp,%edx,2) #17.4
pmullw %xmm1, %xmm1 #17.4
pmullw %xmm2, %xmm2 #17.4
pmullw %xmm3, %xmm3 #17.4
movdqa %xmm1, 16(%esp,%edx,2) #17.4
movdqa %xmm2, 32(%esp,%edx,2) #17.4
movdqa %xmm3, 48(%esp,%edx,2) #17.4
addl $32, %edx #17.2
cmpl $100000, %edx #17.2
jb ..B1.3 # Prob 99% #17.2
# LOE eax edx ebp esi edi
You could do it better!
..B1.3: # Preds ..B1.3 ..B1.2
movdqa (%esp,%edx,2), %xmm0 #17.4
movdqa 16(%esp,%edx,2), %xmm1 #17.4
movdqa 32(%esp,%edx,2), %xmm2 #17.4
movdqa 48(%esp,%edx,2), %xmm3 #17.4
pmullw %xmm0, %xmm0 #17.4
pmullw %xmm1, %xmm1 #17.4
pmullw %xmm2, %xmm2 #17.4
pmullw %xmm3, %xmm3 #17.4
movdqa %xmm0, (%esp,%edx,2) #17.4
movdqa %xmm1, 16(%esp,%edx,2) #17.4
movdqa %xmm2, 32(%esp,%edx,2) #17.4
movdqa %xmm3, 48(%esp,%edx,2) #17.4
addl $32, %edx #17.2
cmpl $100000, %edx #17.2
jb ..B1.3 # Prob 99% #17.2
# LOE eax edx ebp esi edi
Another example, from a real program (svm):
gcc version 3.3.3 [FreeBSD] 20031106
5710.08 real 5312.78 user 9.17 sys
icc Version 8.0
2099.64 real 1952.74 user 2.56 sys
Finally, optimize the f***ing code!
Use the profiler
$ gcc -o test test.c -pg
$ ./test
$ gprof test test.gmon
<skip>
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.8      13.75    13.75    10000     1.38     1.38  f [2]
  0.2      13.78     0.03                             _mcount [3]
  0.0      13.78     0.00                             .mcount (7)
  0.0      13.79     0.00                             main [1]
  0.0      13.79     0.00        1     0.00     0.00  ___sysctl [239]
  0.0      13.79     0.00        1     0.00     0.00  __cxa_finalize [240]
  0.0      13.79     0.00        1     0.00     0.00  _mcleanup (241)
  0.0      13.79     0.00        1     0.00     0.00  _profil [242]
  0.0      13.79     0.00        1     0.00     0.00  exit [4]
  0.0      13.79     0.00        1     0.00     0.00  moncontrol [5]
  0.0      13.79     0.00        1     0.00     0.00  sysctl [6]
memory leaking?
#ifdef TEST_MEMORYLEAK
#define malloc(p) MY_MALLOC( (p) )
#define free(p) MY_FREE( (p) )
#endif
p = NULL;
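The slides do not show what MY_MALLOC and MY_FREE do. One possible sketch is a pair of wrappers that count live allocations; the names `my_malloc_impl`, `my_free_impl`, `live_blocks` and `report_leaks` are hypothetical. The wrappers are defined before the `malloc`/`free` macros so that their own calls still reach the C library.

```c
#include <stdio.h>
#include <stdlib.h>

static long live_blocks = 0;   /* allocations that have not been freed yet */

static void *my_malloc_impl(size_t size, const char *file, int line)
{
    void *p = malloc(size);

    if (p == NULL)
        fprintf(stderr, "%s:%d: malloc(%lu) failed\n",
                file, line, (unsigned long)size);
    else
        ++live_blocks;
    return p;
}

static void my_free_impl(void *p)
{
    if (p != NULL)
        --live_blocks;
    free(p);
}

#define MY_MALLOC(n) my_malloc_impl((n), __FILE__, __LINE__)
#define MY_FREE(p)   my_free_impl((p))

/* Installed after the wrappers so the calls above are not redirected. */
#ifdef TEST_MEMORYLEAK
#define malloc(p) MY_MALLOC( (p) )
#define free(p) MY_FREE( (p) )
#endif

/* Report blocks that are still allocated; returns the count. */
static long report_leaks(void)
{
    if (live_blocks != 0)
        fprintf(stderr, "leak check: %ld block(s) still allocated\n",
                live_blocks);
    return live_blocks;
}
```

Build with `-DTEST_MEMORYLEAK` and call `report_leaks()` just before exit. Setting `p = NULL` after each free makes a second free of the same pointer harmless, so the counter stays accurate.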
memory layout (on the i386 architecture)
(high address)
%ebp ----------------------------------------------
stack ↓ (local variables, return address, push/pop)
%esp ----------------------------------------------
data segment top ----------------------------------------------
heap ↑(malloc, alloc, new)
bss top ----------------------------------------------
(low address)
how do malloc(3) and free(3) work?
// 0x8048000 is the data segment starting point on i386
printf("%ld\n", (long)((char *)sbrk(0) - (char *)0x8048000));
tuning malloc(3) and free(3)
int main(int argc, char **argv)
{
int *plarge, *psmall;
plarge = malloc(10485760); // 10MB
psmall = malloc(1024); // 1KB
free(plarge);
// now size of data segment ~ 10MB
return 0;
}
int main(int argc, char **argv)
{
int *plarge, *psmall;
psmall = malloc(1024); // 1KB
plarge = malloc(10485760); // 10MB
free(plarge);
// now size of data segment ~ 1KB
return 0;
}
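The two programs differ only in allocation order: a freed block can be handed back to the system only when it sits at the top of the heap. A minimal sketch for checking this on a given machine by watching the program break with `sbrk(0)`. It assumes the allocator grows the heap through brk; many current allocators serve large requests with mmap instead, in which case the break barely moves in either order. The helper name `break_growth` is hypothetical.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate a small and a large block in the given
 * order, free only the large one, and return how many bytes the program
 * break has moved since the function was entered. */
long break_growth(int large_first)
{
    char *before = sbrk(0);
    char *after;
    void *plarge, *psmall;

    if (large_first) {
        plarge = malloc(10485760);   /* 10MB */
        psmall = malloc(1024);       /* 1KB  */
    } else {
        psmall = malloc(1024);       /* 1KB  */
        plarge = malloc(10485760);   /* 10MB */
    }
    free(plarge);
    after = sbrk(0);                 /* break after the large block is freed */

    free(psmall);
    return (long)(after - before);
}
```

Printing `break_growth(1)` and `break_growth(0)` in separate runs shows whether the allocator at hand behaves as the comments in the two programs describe.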
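allocate memory directly from the OS
#include <stddef.h>
#include <sys/mman.h>
#define HDR 16 /* header size; keeps the returned pointer 16-byte aligned */
void *MALLOC(size_t size)
{
    /* map the payload plus a header straight from the kernel */
    char *p = mmap(NULL, size + HDR, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *(size_t *)p = size + HDR; /* remember the mapping length for FREE() */
    return p + HDR;
}
void FREE(void *ptr)
{
    char *base;
    if (ptr == NULL)
        return;
    base = (char *)ptr - HDR;
    munmap(base, *(size_t *)base); /* unmap header and payload together */
}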
Michael A. Jackson's "two rules of when to optimize" (Principles of Program Design, 1975):
1. Don't do it.
2. (For experts only) Don't do it yet.
^_^